linux-fsdevel.vger.kernel.org archive mirror
* [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support
@ 2016-06-17  1:17 Darrick J. Wong
  2016-06-17  1:17 ` [PATCH 001/119] vfs: fix return type of ioctl_file_dedupe_range Darrick J. Wong
                   ` (118 more replies)
  0 siblings, 119 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:17 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Hi all,

This is the sixth revision of a patchset that adds to XFS kernel
support for tracking reverse-mappings of physical blocks to file and
metadata (rmap); support for mapping multiple file logical blocks to
the same physical block (reflink); and implements the beginnings of
online metadata scrubbing.  Because block sharing changes a
significant number of design assumptions, rmap and reflink are
provided together.  There shouldn't be any incompatible on-disk format
changes, pending a thorough review of the patches within.

The reverse mapping implementation features a simple per-AG b+tree
containing tuples of (physical block, owner, offset, blockcount) with
the key being the first three fields.  The large record size will
enable us to reconstruct corrupt block mapping btrees (bmbt); the
large key size is necessary to identify uniquely each rmap record in
the presence of shared physical blocks.  In contrast to previous
iterations of this patchset, it is no longer a requirement that there
be a 1:1 correspondence between bmbt and rmapbt records; each rmapbt
record can cover multiple bmbt records.
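As a rough illustration of the record and key described above, here is a
hypothetical in-memory form of an rmap record with a comparison over the
three-field key.  The field names and layout are illustrative only and do
not match the actual on-disk xfs_rmap_rec structure:

```c
#include <stdint.h>

/*
 * Illustrative rmap record: the tuple (physical block, owner, offset,
 * blockcount).  The btree key is the first three fields, so two records
 * may start at the same physical block as long as the owner or offset
 * differs -- which is exactly what shared (reflinked) blocks require.
 */
struct rmap_rec {
	uint64_t rm_startblock;	/* AG block number of the extent */
	uint64_t rm_owner;	/* inode number or metadata owner code */
	uint64_t rm_offset;	/* logical offset in the owner's file */
	uint64_t rm_blockcount;	/* extent length in blocks */
};

/* Order two records by the three-field key (block, owner, offset). */
static int rmap_key_cmp(const struct rmap_rec *a, const struct rmap_rec *b)
{
	if (a->rm_startblock != b->rm_startblock)
		return a->rm_startblock < b->rm_startblock ? -1 : 1;
	if (a->rm_owner != b->rm_owner)
		return a->rm_owner < b->rm_owner ? -1 : 1;
	if (a->rm_offset != b->rm_offset)
		return a->rm_offset < b->rm_offset ? -1 : 1;
	return 0;
}
```

Note that blockcount deliberately plays no part in the comparison: records
with the same (block, owner, offset) but different lengths compare equal,
which is why one rmapbt record can cover several bmbt records.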

The reflink implementation features a simple per-AG b+tree containing
tuples of (physical block, blockcount, refcount) with the key being
the physical block.  Copy on Write (CoW) is implemented by creating a
separate CoW fork and using the existing delayed allocation mechanism
to try to allocate as large a replacement extent as possible before
committing the new data to media.  A CoW extent size hint allows
administrators to influence the size of the replacement extents, and
certain writes can be "promoted" to CoW when it would be advantageous
to reduce fragmentation.  The userspace interfaces to reflink and
dedupe are the VFS FICLONE, FICLONERANGE, and FIDEDUPERANGE ioctls,
which were previously private to btrfs.
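The refcountbt record and the basic CoW decision can be sketched as
follows.  This is a hypothetical simplification for illustration, not the
actual xfs_refcount_rec layout or the real write path:

```c
#include <stdint.h>
#include <stdbool.h>

/*
 * Illustrative refcount btree record: the tuple
 * (physical block, blockcount, refcount), keyed on the physical block.
 */
struct refc_rec {
	uint64_t rc_startblock;	/* AG block number of the extent */
	uint64_t rc_blockcount;	/* extent length in blocks */
	uint32_t rc_refcount;	/* number of files mapping this extent */
};

/*
 * A write to a shared extent (refcount > 1) must be redirected through
 * copy on write; an exclusively owned extent (refcount == 1) can be
 * overwritten in place.
 */
static bool write_needs_cow(const struct refc_rec *rec)
{
	return rec->rc_refcount > 1;
}
```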

Since the previous posting, I have made some major changes to the
underlying XFS common code.  First, I have extended the generic b+tree
implementation to support overlapping intervals, which is necessary
for the rmapbt on a reflink filesystem, where multiple rmapbt records
can represent the same physical block.  The new b+tree variant
introduces the notion of a "high key" for each record; it is the
highest key that can be used to identify a record.  On disk, an
overlapped-interval b+tree looks like a traditional b+tree except that
nodes store both the lowest key and the highest key accessible through
that subtree pointer.  There's a new interval query function that uses
both keys to iterate all records overlapping a given range of keys.
This change allows us to remove the old requirement that each bmbt
record correspond to a matching rmapbt record.
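The pruning test at the heart of the interval query can be sketched like
this.  The structure and function names are illustrative; the real code
operates on btree keys rather than raw integers:

```c
#include <stdint.h>
#include <stdbool.h>

/*
 * In the overlapped-interval b+tree, each node pointer carries both the
 * lowest key and the highest ("high") key reachable through that
 * subtree.  A range query descends into a subtree only when the query
 * interval overlaps [low, high]; otherwise the whole subtree is skipped.
 */
struct key_range {
	uint64_t low;	/* lowest key in the subtree */
	uint64_t high;	/* highest key in the subtree */
};

static bool keys_overlap(const struct key_range *node,
			 uint64_t query_low, uint64_t query_high)
{
	/* Two closed intervals overlap iff neither lies wholly past the other. */
	return node->low <= query_high && query_low <= node->high;
}
```

With only the lowest key (as in a traditional b+tree) the query could not
prune safely, because a record starting before the query range might still
extend into it; the high key is what makes skipping a subtree correct.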

The second big change is to the xfs_bmap_free functions.  The existing
code implements a mechanism to defer metadata (specifically, free
space b+tree) updates across a transaction commit by logging redo
items that can be replayed during recovery.  It is an elegant way to
avoid running afoul of AG locking order rules /and/ it can in theory
be used to get around running out of transaction reservation.  That
said, I have refactored it into a generic "deferred operations"
mechanism that can defer arbitrary types of work to a subsequent
rolled transaction.  The framework thus allows me to schedule rmapbt,
refcountbt, and bmbt updates while maintaining correct redo in case of
failure.  Remapping activities for reflink and CoW are now atomic.
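The shape of the deferred-operations pattern can be sketched as below.
All names here are hypothetical and the rolling/intent-logging is heavily
simplified; this is not the actual xfs_defer_* API:

```c
#include <stddef.h>

/*
 * Skeleton of "deferred operations": each pending work item is noted as
 * a logged intent, the transaction is rolled (committed and replaced
 * with a fresh reservation), and the item is finished in the new
 * transaction.  Log recovery can then redo any unfinished intents.
 */
struct txn { int committed; };

struct defer_item {
	int (*finish)(struct txn *tp, void *data);	/* deferred work */
	void *data;
	struct defer_item *next;
};

/* Commit the current reservation and start a new one. */
static int txn_roll(struct txn *tp)
{
	tp->committed++;
	return 0;
}

static int defer_finish(struct txn *tp, struct defer_item *list)
{
	struct defer_item *di;
	int error;

	for (di = list; di != NULL; di = di->next) {
		error = txn_roll(tp);	/* roll before doing more work */
		if (error)
			return error;
		error = di->finish(tp, di->data);	/* do one item */
		if (error)
			return error;
	}
	return 0;
}

/* Demo work item: count how many times we were finished. */
static int count_finish(struct txn *tp, void *data)
{
	(void)tp;
	(*(int *)data)++;
	return 0;
}
```

The point of rolling between items is that no single transaction has to
reserve space for every update up front, while the logged intents keep the
whole chain redo-able after a crash.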

The third big change is the establishment of a per-AG block
reservation mechanism.  This "hides" some blocks from the regular
block allocator; refcountbt and rmapbt expansions draw on these
blocks because we can no longer assume that file mapping operations
always involve block allocation.  Without the reservation we get into
trouble when a file allocates an entire AG, is reflinked by other
files, and subsequent CoWs cause record splits in the rmap and
refcount btrees.
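The effect of "hiding" blocks from the regular allocator can be sketched
with a pair of checks.  The structure and names are hypothetical, purely
to illustrate the two allocation policies:

```c
#include <stdint.h>
#include <stdbool.h>

/*
 * Illustrative per-AG reservation: some free blocks are held back from
 * ordinary allocations so that rmapbt/refcountbt record splits during
 * CoW can always find space, even in an AG that looks full.
 */
struct ag_resv {
	uint64_t freeblks;	/* free blocks in this AG */
	uint64_t resv;		/* blocks held back for btree expansion */
};

/* Regular allocations only see the free space minus the reservation. */
static bool ag_can_alloc(const struct ag_resv *ag, uint64_t len)
{
	return ag->freeblks >= ag->resv + len;
}

/* Btree expansion is allowed to dip into the reserved pool. */
static bool ag_can_alloc_meta(const struct ag_resv *ag, uint64_t len)
{
	return ag->freeblks >= len;
}
```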

At the very end of the patchset is an initial implementation of a
GETFSMAPX ioctl for userland to query the physical block mapping of a
filesystem; and metadata scrubbing for XFS.  The scrubber iterates
the per-AG btrees and does some simple cross-checking when possible;
I built it to check the functionality of the new b+tree code.

The first few patches fix various vfs/xfs bugs, add an enhancement to
the xfs_buf tracepoints so that we can analyze buffer deadlocks, and
merge differences between the kernel and userspace libxfs so that the
rest of the patches apply consistently.

There are still two functionality gaps: the extent swap ioctl isn't
functional when rmap is enabled; and rmap cannot (yet) coexist with
realtime devices.  These will be addressed in the next sprint.

If you're going to start using this mess, you probably ought to just
pull from my github trees for kernel[1], xfsprogs[2], and xfstests[3].
There are also updates for xfs-docs[4].  The kernel patches should
apply to dchinner's for-next; xfsprogs patches to for-next; and
xfstests patches to master.  NOTE however that the kernel git tree already has
the five for-next patches included.

The patches have been xfstested on x64, i386, and armv7l; arm64,
ppc64, and ppc64le no longer boot in qemu.  All three working
architectures pass all 'clone' group tests except xfs/128 (which is
the swapext test), and AFAICT they don't cause any new failures in
the 'auto' group.

This is an extraordinary way to eat your data.  Enjoy! 
Comments and questions are, as always, welcome.

--D

[1] https://github.com/djwong/linux/tree/djwong-devel
[2] https://github.com/djwong/xfsprogs/tree/djwong-devel
[3] https://github.com/djwong/xfstests/tree/djwong-devel
[4] https://github.com/djwong/xfs-documentation/tree/djwong-devel

^ permalink raw reply	[flat|nested] 236+ messages in thread

* [PATCH 001/119] vfs: fix return type of ioctl_file_dedupe_range
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
@ 2016-06-17  1:17 ` Darrick J. Wong
  2016-06-17 11:32   ` Christoph Hellwig
  2016-06-17  1:18 ` [PATCH 002/119] vfs: support FS_XFLAG_REFLINK and FS_XFLAG_COWEXTSIZE Darrick J. Wong
                   ` (117 subsequent siblings)
  118 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:17 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

All the VFS functions in the dedupe ioctl path return int status, so
the ioctl handler ought to as well.

Found by Coverity, CID 1350952.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/ioctl.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


diff --git a/fs/ioctl.c b/fs/ioctl.c
index 116a333..db3d033 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -568,7 +568,7 @@ static int ioctl_fsthaw(struct file *filp)
 	return thaw_super(sb);
 }
 
-static long ioctl_file_dedupe_range(struct file *file, void __user *arg)
+static int ioctl_file_dedupe_range(struct file *file, void __user *arg)
 {
 	struct file_dedupe_range __user *argp = arg;
 	struct file_dedupe_range *same = NULL;



* [PATCH 002/119] vfs: support FS_XFLAG_REFLINK and FS_XFLAG_COWEXTSIZE
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
  2016-06-17  1:17 ` [PATCH 001/119] vfs: fix return type of ioctl_file_dedupe_range Darrick J. Wong
@ 2016-06-17  1:18 ` Darrick J. Wong
  2016-06-17 11:41   ` Christoph Hellwig
  2016-06-17  1:18 ` [PATCH 003/119] xfs: check offsets of variable length structures Darrick J. Wong
                   ` (116 subsequent siblings)
  118 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:18 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Introduce XFLAGs for the new XFS reflink inode flag and the CoW extent
size hint, and actually plumb the CoW extent size hint into the fsxattr
structure.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 include/uapi/linux/fs.h |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)


diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 3b00f7c..fb371a5 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -157,7 +157,8 @@ struct fsxattr {
 	__u32		fsx_extsize;	/* extsize field value (get/set)*/
 	__u32		fsx_nextents;	/* nextents field value (get)	*/
 	__u32		fsx_projid;	/* project identifier (get/set) */
-	unsigned char	fsx_pad[12];
+	__u32		fsx_cowextsize;	/* CoW extsize field value (get/set)*/
+	unsigned char	fsx_pad[8];
 };
 
 /*
@@ -178,6 +179,8 @@ struct fsxattr {
 #define FS_XFLAG_NODEFRAG	0x00002000	/* do not defragment */
 #define FS_XFLAG_FILESTREAM	0x00004000	/* use filestream allocator */
 #define FS_XFLAG_DAX		0x00008000	/* use DAX for IO */
+#define FS_XFLAG_REFLINK	0x00010000	/* file is reflinked */
+#define FS_XFLAG_COWEXTSIZE	0x00020000	/* CoW extent size allocator hint */
 #define FS_XFLAG_HASATTR	0x80000000	/* no DIFLAG for this	*/
 
 /* the read-only stuff doesn't really belong here, but any other place is



* [PATCH 003/119] xfs: check offsets of variable length structures
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
  2016-06-17  1:17 ` [PATCH 001/119] vfs: fix return type of ioctl_file_dedupe_range Darrick J. Wong
  2016-06-17  1:18 ` [PATCH 002/119] vfs: support FS_XFLAG_REFLINK and FS_XFLAG_COWEXTSIZE Darrick J. Wong
@ 2016-06-17  1:18 ` Darrick J. Wong
  2016-06-17 11:33   ` Christoph Hellwig
  2016-06-17 17:34   ` Brian Foster
  2016-06-17  1:18 ` [PATCH 004/119] xfs: enable buffer deadlock postmortem diagnosis via ftrace Darrick J. Wong
                   ` (115 subsequent siblings)
  118 siblings, 2 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:18 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Some of the directory/attr structures contain variable-length objects,
so the enclosing structure doesn't have a meaningful fixed size at
compile time.  We can, however, check the offsets of the members that
precede the variable-length member, so do that instead.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_ondisk.h |   25 +++++++++++++++++++++++--
 1 file changed, 23 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/xfs_ondisk.h b/fs/xfs/xfs_ondisk.h
index 184c44e..0272301 100644
--- a/fs/xfs/xfs_ondisk.h
+++ b/fs/xfs/xfs_ondisk.h
@@ -22,6 +22,11 @@
 	BUILD_BUG_ON_MSG(sizeof(structname) != (size), "XFS: sizeof(" \
 		#structname ") is wrong, expected " #size)
 
+#define XFS_CHECK_OFFSET(structname, member, off) \
+	BUILD_BUG_ON_MSG(offsetof(structname, member) != (off), \
+		"XFS: offsetof(" #structname ", " #member ") is wrong, " \
+		"expected " #off)
+
 static inline void __init
 xfs_check_ondisk_structs(void)
 {
@@ -75,15 +80,28 @@ xfs_check_ondisk_structs(void)
 	XFS_CHECK_STRUCT_SIZE(xfs_attr_leaf_name_remote_t,	12);
 	 */
 
+	XFS_CHECK_OFFSET(xfs_attr_leaf_name_local_t, valuelen,	0);
+	XFS_CHECK_OFFSET(xfs_attr_leaf_name_local_t, namelen,	2);
+	XFS_CHECK_OFFSET(xfs_attr_leaf_name_local_t, nameval,	3);
+	XFS_CHECK_OFFSET(xfs_attr_leaf_name_remote_t, valueblk,	0);
+	XFS_CHECK_OFFSET(xfs_attr_leaf_name_remote_t, valuelen,	4);
+	XFS_CHECK_OFFSET(xfs_attr_leaf_name_remote_t, namelen,	8);
+	XFS_CHECK_OFFSET(xfs_attr_leaf_name_remote_t, name,	9);
 	XFS_CHECK_STRUCT_SIZE(xfs_attr_leafblock_t,		40);
-	XFS_CHECK_STRUCT_SIZE(xfs_attr_shortform_t,		8);
+	XFS_CHECK_OFFSET(xfs_attr_shortform_t, hdr.totsize,	0);
+	XFS_CHECK_OFFSET(xfs_attr_shortform_t, hdr.count,	2);
+	XFS_CHECK_OFFSET(xfs_attr_shortform_t, list[0].namelen,	4);
+	XFS_CHECK_OFFSET(xfs_attr_shortform_t, list[0].valuelen, 5);
+	XFS_CHECK_OFFSET(xfs_attr_shortform_t, list[0].flags,	6);
+	XFS_CHECK_OFFSET(xfs_attr_shortform_t, list[0].nameval,	7);
 	XFS_CHECK_STRUCT_SIZE(xfs_da_blkinfo_t,			12);
 	XFS_CHECK_STRUCT_SIZE(xfs_da_intnode_t,			16);
 	XFS_CHECK_STRUCT_SIZE(xfs_da_node_entry_t,		8);
 	XFS_CHECK_STRUCT_SIZE(xfs_da_node_hdr_t,		16);
 	XFS_CHECK_STRUCT_SIZE(xfs_dir2_data_free_t,		4);
 	XFS_CHECK_STRUCT_SIZE(xfs_dir2_data_hdr_t,		16);
-	XFS_CHECK_STRUCT_SIZE(xfs_dir2_data_unused_t,		6);
+	XFS_CHECK_OFFSET(xfs_dir2_data_unused_t, freetag,	0);
+	XFS_CHECK_OFFSET(xfs_dir2_data_unused_t, length,	2);
 	XFS_CHECK_STRUCT_SIZE(xfs_dir2_free_hdr_t,		16);
 	XFS_CHECK_STRUCT_SIZE(xfs_dir2_free_t,			16);
 	XFS_CHECK_STRUCT_SIZE(xfs_dir2_ino4_t,			4);
@@ -94,6 +112,9 @@ xfs_check_ondisk_structs(void)
 	XFS_CHECK_STRUCT_SIZE(xfs_dir2_leaf_t,			16);
 	XFS_CHECK_STRUCT_SIZE(xfs_dir2_leaf_tail_t,		4);
 	XFS_CHECK_STRUCT_SIZE(xfs_dir2_sf_entry_t,		3);
+	XFS_CHECK_OFFSET(xfs_dir2_sf_entry_t, namelen,		0);
+	XFS_CHECK_OFFSET(xfs_dir2_sf_entry_t, offset,		1);
+	XFS_CHECK_OFFSET(xfs_dir2_sf_entry_t, name,		3);
 	XFS_CHECK_STRUCT_SIZE(xfs_dir2_sf_hdr_t,		10);
 	XFS_CHECK_STRUCT_SIZE(xfs_dir2_sf_off_t,		2);
 



* [PATCH 004/119] xfs: enable buffer deadlock postmortem diagnosis via ftrace
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (2 preceding siblings ...)
  2016-06-17  1:18 ` [PATCH 003/119] xfs: check offsets of variable length structures Darrick J. Wong
@ 2016-06-17  1:18 ` Darrick J. Wong
  2016-06-17 11:34   ` Christoph Hellwig
  2016-06-17  1:18 ` [PATCH 005/119] xfs: check for a valid error_tag in errortag_add Darrick J. Wong
                   ` (114 subsequent siblings)
  118 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:18 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Create a second buf_trylock tracepoint so that we can distinguish
between a successful and a failed trylock.  With this piece, we can
use a script to look at the ftrace output to detect buffer deadlocks.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_buf.c   |    3 ++-
 fs/xfs/xfs_trace.h |    1 +
 2 files changed, 3 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index efa2a73..2333db7 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -947,7 +947,8 @@ xfs_buf_trylock(
 	if (locked)
 		XB_SET_OWNER(bp);
 
-	trace_xfs_buf_trylock(bp, _RET_IP_);
+	locked ? trace_xfs_buf_trylock(bp, _RET_IP_) :
+		 trace_xfs_buf_trylock_fail(bp, _RET_IP_);
 	return locked;
 }
 
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index ea94ee0..68f27f7 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -354,6 +354,7 @@ DEFINE_BUF_EVENT(xfs_buf_submit_wait);
 DEFINE_BUF_EVENT(xfs_buf_bawrite);
 DEFINE_BUF_EVENT(xfs_buf_lock);
 DEFINE_BUF_EVENT(xfs_buf_lock_done);
+DEFINE_BUF_EVENT(xfs_buf_trylock_fail);
 DEFINE_BUF_EVENT(xfs_buf_trylock);
 DEFINE_BUF_EVENT(xfs_buf_unlock);
 DEFINE_BUF_EVENT(xfs_buf_iowait);



* [PATCH 005/119] xfs: check for a valid error_tag in errortag_add
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (3 preceding siblings ...)
  2016-06-17  1:18 ` [PATCH 004/119] xfs: enable buffer deadlock postmortem diagnosis via ftrace Darrick J. Wong
@ 2016-06-17  1:18 ` Darrick J. Wong
  2016-06-17 11:34   ` Christoph Hellwig
  2016-06-17  1:18 ` [PATCH 006/119] xfs: port differences from xfsprogs libxfs Darrick J. Wong
                   ` (113 subsequent siblings)
  118 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:18 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Currently we don't check the error_tag when someone's trying to set up
error injection testing.  If userspace passes in a value we don't know
about, send back an error.  This will help xfstests to _notrun a test
that uses error injection to test things like log replay.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_error.c |    3 +++
 1 file changed, 3 insertions(+)


diff --git a/fs/xfs/xfs_error.c b/fs/xfs/xfs_error.c
index 88693a9..355619a 100644
--- a/fs/xfs/xfs_error.c
+++ b/fs/xfs/xfs_error.c
@@ -61,6 +61,9 @@ xfs_errortag_add(int error_tag, xfs_mount_t *mp)
 	int len;
 	int64_t fsid;
 
+	if (error_tag >= XFS_ERRTAG_MAX)
+		return -EINVAL;
+
 	memcpy(&fsid, mp->m_fixedfsid, sizeof(xfs_fsid_t));
 
 	for (i = 0; i < XFS_NUM_INJECT_ERROR; i++)  {



* [PATCH 006/119] xfs: port differences from xfsprogs libxfs
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (4 preceding siblings ...)
  2016-06-17  1:18 ` [PATCH 005/119] xfs: check for a valid error_tag in errortag_add Darrick J. Wong
@ 2016-06-17  1:18 ` Darrick J. Wong
  2016-06-17 15:06   ` Christoph Hellwig
  2016-06-20  0:21   ` Dave Chinner
  2016-06-17  1:18 ` [PATCH 007/119] xfs: rearrange xfs_bmap_add_free parameters Darrick J. Wong
                   ` (112 subsequent siblings)
  118 siblings, 2 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:18 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Port various differences between xfsprogs and the kernel.  This
cleans up both so that we can develop rmap and reflink on the
same libxfs code.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_alloc.c      |    2 ++
 fs/xfs/libxfs/xfs_attr_leaf.h  |    2 +-
 fs/xfs/libxfs/xfs_bmap.c       |    2 +-
 fs/xfs/libxfs/xfs_bmap.h       |    6 ++++++
 fs/xfs/libxfs/xfs_btree.c      |    4 ++++
 fs/xfs/libxfs/xfs_btree.h      |    4 ++--
 fs/xfs/libxfs/xfs_dir2.h       |    2 ++
 fs/xfs/libxfs/xfs_dir2_priv.h  |    1 -
 fs/xfs/libxfs/xfs_dquot_buf.c  |   10 ++++++++++
 fs/xfs/libxfs/xfs_format.h     |    1 -
 fs/xfs/libxfs/xfs_ialloc.c     |    4 ++--
 fs/xfs/libxfs/xfs_inode_buf.c  |   19 +++++++++++++++----
 fs/xfs/libxfs/xfs_inode_buf.h  |    6 ++++--
 fs/xfs/libxfs/xfs_log_format.h |    4 ++--
 fs/xfs/libxfs/xfs_rtbitmap.c   |    1 -
 fs/xfs/libxfs/xfs_sb.c         |    4 ++++
 fs/xfs/libxfs/xfs_types.h      |    3 +++
 17 files changed, 58 insertions(+), 17 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 99b077c..58bdca7 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2415,7 +2415,9 @@ xfs_alloc_read_agf(
 			be32_to_cpu(agf->agf_levels[XFS_BTNUM_CNTi]);
 		spin_lock_init(&pag->pagb_lock);
 		pag->pagb_count = 0;
+#ifdef __KERNEL__
 		pag->pagb_tree = RB_ROOT;
+#endif
 		pag->pagf_init = 1;
 	}
 #ifdef DEBUG
diff --git a/fs/xfs/libxfs/xfs_attr_leaf.h b/fs/xfs/libxfs/xfs_attr_leaf.h
index 4f2aed0..8ef420a 100644
--- a/fs/xfs/libxfs/xfs_attr_leaf.h
+++ b/fs/xfs/libxfs/xfs_attr_leaf.h
@@ -51,7 +51,7 @@ int	xfs_attr_shortform_getvalue(struct xfs_da_args *args);
 int	xfs_attr_shortform_to_leaf(struct xfs_da_args *args);
 int	xfs_attr_shortform_remove(struct xfs_da_args *args);
 int	xfs_attr_shortform_allfit(struct xfs_buf *bp, struct xfs_inode *dp);
-int	xfs_attr_shortform_bytesfit(xfs_inode_t *dp, int bytes);
+int	xfs_attr_shortform_bytesfit(struct xfs_inode *dp, int bytes);
 void	xfs_attr_fork_remove(struct xfs_inode *ip, struct xfs_trans *tp);
 
 /*
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 932381c..499e980 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -1425,7 +1425,7 @@ xfs_bmap_search_multi_extents(
  * Else, *lastxp will be set to the index of the found
  * entry; *gotp will contain the entry.
  */
-STATIC xfs_bmbt_rec_host_t *                 /* pointer to found extent entry */
+xfs_bmbt_rec_host_t *                 /* pointer to found extent entry */
 xfs_bmap_search_extents(
 	xfs_inode_t     *ip,            /* incore inode pointer */
 	xfs_fileoff_t   bno,            /* block number searched for */
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 423a34e..79e3ebe 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -231,4 +231,10 @@ int	xfs_bmap_shift_extents(struct xfs_trans *tp, struct xfs_inode *ip,
 		int num_exts);
 int	xfs_bmap_split_extent(struct xfs_inode *ip, xfs_fileoff_t split_offset);
 
+struct xfs_bmbt_rec_host *
+	xfs_bmap_search_extents(struct xfs_inode *ip, xfs_fileoff_t bno,
+				int fork, int *eofp, xfs_extnum_t *lastxp,
+				struct xfs_bmbt_irec *gotp,
+				struct xfs_bmbt_irec *prevp);
+
 #endif	/* __XFS_BMAP_H__ */
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 1f88e1c..105979d 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -2532,6 +2532,7 @@ error0:
 	return error;
 }
 
+#ifdef __KERNEL__
 struct xfs_btree_split_args {
 	struct xfs_btree_cur	*cur;
 	int			level;
@@ -2609,6 +2610,9 @@ xfs_btree_split(
 	destroy_work_on_stack(&args.work);
 	return args.result;
 }
+#else /* !KERNEL */
+#define xfs_btree_split	__xfs_btree_split
+#endif
 
 
 /*
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index 2e874be..9a88839 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -95,7 +95,7 @@ do {    \
 	case XFS_BTNUM_BMAP: __XFS_BTREE_STATS_INC(__mp, bmbt, stat); break; \
 	case XFS_BTNUM_INO: __XFS_BTREE_STATS_INC(__mp, ibt, stat); break; \
 	case XFS_BTNUM_FINO: __XFS_BTREE_STATS_INC(__mp, fibt, stat); break; \
-	case XFS_BTNUM_MAX: ASSERT(0); /* fucking gcc */ ; break;	\
+	case XFS_BTNUM_MAX: ASSERT(0); __mp = __mp /* fucking gcc */ ; break; \
 	}       \
 } while (0)
 
@@ -115,7 +115,7 @@ do {    \
 		__XFS_BTREE_STATS_ADD(__mp, ibt, stat, val); break; \
 	case XFS_BTNUM_FINO:	\
 		__XFS_BTREE_STATS_ADD(__mp, fibt, stat, val); break; \
-	case XFS_BTNUM_MAX: ASSERT(0); /* fucking gcc */ ; break; \
+	case XFS_BTNUM_MAX: ASSERT(0); __mp = __mp /* fucking gcc */ ; break; \
 	}       \
 } while (0)
 
diff --git a/fs/xfs/libxfs/xfs_dir2.h b/fs/xfs/libxfs/xfs_dir2.h
index e553536..0a62e73 100644
--- a/fs/xfs/libxfs/xfs_dir2.h
+++ b/fs/xfs/libxfs/xfs_dir2.h
@@ -177,6 +177,8 @@ extern struct xfs_dir2_data_free *xfs_dir2_data_freefind(
 		struct xfs_dir2_data_hdr *hdr, struct xfs_dir2_data_free *bf,
 		struct xfs_dir2_data_unused *dup);
 
+extern int xfs_dir_ino_validate(struct xfs_mount *mp, xfs_ino_t ino);
+
 extern const struct xfs_buf_ops xfs_dir3_block_buf_ops;
 extern const struct xfs_buf_ops xfs_dir3_leafn_buf_ops;
 extern const struct xfs_buf_ops xfs_dir3_leaf1_buf_ops;
diff --git a/fs/xfs/libxfs/xfs_dir2_priv.h b/fs/xfs/libxfs/xfs_dir2_priv.h
index ef9f6ea..d04547f 100644
--- a/fs/xfs/libxfs/xfs_dir2_priv.h
+++ b/fs/xfs/libxfs/xfs_dir2_priv.h
@@ -21,7 +21,6 @@
 struct dir_context;
 
 /* xfs_dir2.c */
-extern int xfs_dir_ino_validate(struct xfs_mount *mp, xfs_ino_t ino);
 extern int xfs_dir2_grow_inode(struct xfs_da_args *args, int space,
 				xfs_dir2_db_t *dbp);
 extern int xfs_dir_cilookup_result(struct xfs_da_args *args,
diff --git a/fs/xfs/libxfs/xfs_dquot_buf.c b/fs/xfs/libxfs/xfs_dquot_buf.c
index 3cc3cf7..06b574d 100644
--- a/fs/xfs/libxfs/xfs_dquot_buf.c
+++ b/fs/xfs/libxfs/xfs_dquot_buf.c
@@ -31,10 +31,16 @@
 #include "xfs_cksum.h"
 #include "xfs_trace.h"
 
+/*
+ * XXX: kernel implementation causes ndquots calc to go real
+ * bad. Just leaving the existing userspace calc here right now.
+ */
 int
 xfs_calc_dquots_per_chunk(
 	unsigned int		nbblks)	/* basic block units */
 {
+#ifdef __KERNEL__
+	/* kernel code that goes wrong in userspace! */
 	unsigned int	ndquots;
 
 	ASSERT(nbblks > 0);
@@ -42,6 +48,10 @@ xfs_calc_dquots_per_chunk(
 	do_div(ndquots, sizeof(xfs_dqblk_t));
 
 	return ndquots;
+#else
+	ASSERT(nbblks > 0);
+	return BBTOB(nbblks) / sizeof(xfs_dqblk_t);
+#endif
 }
 
 /*
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index dc97eb21..ba528b3 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -835,7 +835,6 @@ typedef struct xfs_timestamp {
  * padding field for v3 inodes.
  */
 #define	XFS_DINODE_MAGIC		0x494e	/* 'IN' */
-#define XFS_DINODE_GOOD_VERSION(v)	((v) >= 1 && (v) <= 3)
 typedef struct xfs_dinode {
 	__be16		di_magic;	/* inode magic # = XFS_DINODE_MAGIC */
 	__be16		di_mode;	/* mode and type of file */
diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 22297f9..77b5990 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -2340,7 +2340,7 @@ xfs_imap(
 
 		imap->im_blkno = XFS_AGB_TO_DADDR(mp, agno, agbno);
 		imap->im_len = XFS_FSB_TO_BB(mp, 1);
-		imap->im_boffset = (ushort)(offset << mp->m_sb.sb_inodelog);
+		imap->im_boffset = (unsigned short)(offset << mp->m_sb.sb_inodelog);
 		return 0;
 	}
 
@@ -2368,7 +2368,7 @@ out_map:
 
 	imap->im_blkno = XFS_AGB_TO_DADDR(mp, agno, cluster_agbno);
 	imap->im_len = XFS_FSB_TO_BB(mp, blks_per_cluster);
-	imap->im_boffset = (ushort)(offset << mp->m_sb.sb_inodelog);
+	imap->im_boffset = (unsigned short)(offset << mp->m_sb.sb_inodelog);
 
 	/*
 	 * If the inode number maps to a block outside the bounds
diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index 9d9559e..794fa66 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -56,6 +56,17 @@ xfs_inobp_check(
 }
 #endif
 
+bool
+xfs_dinode_good_version(
+	struct xfs_mount *mp,
+	__u8		version)
+{
+	if (xfs_sb_version_hascrc(&mp->m_sb))
+		return version == 3;
+
+	return version == 1 || version == 2;
+}
+
 /*
  * If we are doing readahead on an inode buffer, we might be in log recovery
  * reading an inode allocation buffer that hasn't yet been replayed, and hence
@@ -90,7 +101,7 @@ xfs_inode_buf_verify(
 
 		dip = xfs_buf_offset(bp, (i << mp->m_sb.sb_inodelog));
 		di_ok = dip->di_magic == cpu_to_be16(XFS_DINODE_MAGIC) &&
-			    XFS_DINODE_GOOD_VERSION(dip->di_version);
+			xfs_dinode_good_version(mp, dip->di_version);
 		if (unlikely(XFS_TEST_ERROR(!di_ok, mp,
 						XFS_ERRTAG_ITOBP_INOTOBP,
 						XFS_RANDOM_ITOBP_INOTOBP))) {
@@ -369,7 +380,7 @@ xfs_log_dinode_to_disk(
 static bool
 xfs_dinode_verify(
 	struct xfs_mount	*mp,
-	struct xfs_inode	*ip,
+	xfs_ino_t		ino,
 	struct xfs_dinode	*dip)
 {
 	if (dip->di_magic != cpu_to_be16(XFS_DINODE_MAGIC))
@@ -384,7 +395,7 @@ xfs_dinode_verify(
 	if (!xfs_verify_cksum((char *)dip, mp->m_sb.sb_inodesize,
 			      XFS_DINODE_CRC_OFF))
 		return false;
-	if (be64_to_cpu(dip->di_ino) != ip->i_ino)
+	if (be64_to_cpu(dip->di_ino) != ino)
 		return false;
 	if (!uuid_equal(&dip->di_uuid, &mp->m_sb.sb_meta_uuid))
 		return false;
@@ -459,7 +470,7 @@ xfs_iread(
 		return error;
 
 	/* even unallocated inodes are verified */
-	if (!xfs_dinode_verify(mp, ip, dip)) {
+	if (!xfs_dinode_verify(mp, ip->i_ino, dip)) {
 		xfs_alert(mp, "%s: validation failed for inode %lld failed",
 				__func__, ip->i_ino);
 
diff --git a/fs/xfs/libxfs/xfs_inode_buf.h b/fs/xfs/libxfs/xfs_inode_buf.h
index 7c4dd32..958c543 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.h
+++ b/fs/xfs/libxfs/xfs_inode_buf.h
@@ -57,8 +57,8 @@ struct xfs_icdinode {
  */
 struct xfs_imap {
 	xfs_daddr_t	im_blkno;	/* starting BB of inode chunk */
-	ushort		im_len;		/* length in BBs of inode chunk */
-	ushort		im_boffset;	/* inode offset in block in bytes */
+	unsigned short	im_len;		/* length in BBs of inode chunk */
+	unsigned short	im_boffset;	/* inode offset in block in bytes */
 };
 
 int	xfs_imap_to_bp(struct xfs_mount *, struct xfs_trans *,
@@ -73,6 +73,8 @@ void	xfs_inode_from_disk(struct xfs_inode *ip, struct xfs_dinode *from);
 void	xfs_log_dinode_to_disk(struct xfs_log_dinode *from,
 			       struct xfs_dinode *to);
 
+bool	xfs_dinode_good_version(struct xfs_mount *mp, __u8 version);
+
 #if defined(DEBUG)
 void	xfs_inobp_check(struct xfs_mount *, struct xfs_buf *);
 #else
diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
index e8f49c0..e5baba3 100644
--- a/fs/xfs/libxfs/xfs_log_format.h
+++ b/fs/xfs/libxfs/xfs_log_format.h
@@ -462,8 +462,8 @@ static inline uint xfs_log_dinode_size(int version)
 typedef struct xfs_buf_log_format {
 	unsigned short	blf_type;	/* buf log item type indicator */
 	unsigned short	blf_size;	/* size of this item */
-	ushort		blf_flags;	/* misc state */
-	ushort		blf_len;	/* number of blocks in this buf */
+	unsigned short	blf_flags;	/* misc state */
+	unsigned short	blf_len;	/* number of blocks in this buf */
 	__int64_t	blf_blkno;	/* starting blkno of this buf */
 	unsigned int	blf_map_size;	/* used size of data bitmap in words */
 	unsigned int	blf_data_map[XFS_BLF_DATAMAP_SIZE]; /* dirty bitmap */
diff --git a/fs/xfs/libxfs/xfs_rtbitmap.c b/fs/xfs/libxfs/xfs_rtbitmap.c
index e2e1106..ea45584 100644
--- a/fs/xfs/libxfs/xfs_rtbitmap.c
+++ b/fs/xfs/libxfs/xfs_rtbitmap.c
@@ -1016,4 +1016,3 @@ xfs_rtfree_extent(
 	}
 	return 0;
 }
-
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index 12ca867..09d6fd0 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -261,6 +261,7 @@ xfs_mount_validate_sb(
 	/*
 	 * Until this is fixed only page-sized or smaller data blocks work.
 	 */
+#ifdef __KERNEL__
 	if (unlikely(sbp->sb_blocksize > PAGE_SIZE)) {
 		xfs_warn(mp,
 		"File system with blocksize %d bytes. "
@@ -268,6 +269,7 @@ xfs_mount_validate_sb(
 				sbp->sb_blocksize, PAGE_SIZE);
 		return -ENOSYS;
 	}
+#endif
 
 	/*
 	 * Currently only very few inode sizes are supported.
@@ -291,10 +293,12 @@ xfs_mount_validate_sb(
 		return -EFBIG;
 	}
 
+#ifdef __KERNEL__
 	if (check_inprogress && sbp->sb_inprogress) {
 		xfs_warn(mp, "Offline file system operation in progress!");
 		return -EFSCORRUPTED;
 	}
+#endif
 	return 0;
 }
 
diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
index b79dc66..f0d145a 100644
--- a/fs/xfs/libxfs/xfs_types.h
+++ b/fs/xfs/libxfs/xfs_types.h
@@ -75,11 +75,14 @@ typedef __int64_t	xfs_sfiloff_t;	/* signed block number in a file */
  * Minimum and maximum blocksize and sectorsize.
  * The blocksize upper limit is pretty much arbitrary.
  * The sectorsize upper limit is due to sizeof(sb_sectsize).
+ * CRC enable filesystems use 512 byte inodes, meaning 512 byte block sizes
+ * cannot be used.
  */
 #define XFS_MIN_BLOCKSIZE_LOG	9	/* i.e. 512 bytes */
 #define XFS_MAX_BLOCKSIZE_LOG	16	/* i.e. 65536 bytes */
 #define XFS_MIN_BLOCKSIZE	(1 << XFS_MIN_BLOCKSIZE_LOG)
 #define XFS_MAX_BLOCKSIZE	(1 << XFS_MAX_BLOCKSIZE_LOG)
+#define XFS_MIN_CRC_BLOCKSIZE	(1 << (XFS_MIN_BLOCKSIZE_LOG + 1))
 #define XFS_MIN_SECTORSIZE_LOG	9	/* i.e. 512 bytes */
 #define XFS_MAX_SECTORSIZE_LOG	15	/* i.e. 32768 bytes */
 #define XFS_MIN_SECTORSIZE	(1 << XFS_MIN_SECTORSIZE_LOG)



* [PATCH 007/119] xfs: rearrange xfs_bmap_add_free parameters
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (5 preceding siblings ...)
  2016-06-17  1:18 ` [PATCH 006/119] xfs: port differences from xfsprogs libxfs Darrick J. Wong
@ 2016-06-17  1:18 ` Darrick J. Wong
  2016-06-17 11:39   ` Christoph Hellwig
  2016-06-17  1:18 ` [PATCH 008/119] xfs: separate freelist fixing into a separate helper Darrick J. Wong
                   ` (111 subsequent siblings)
  118 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:18 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: linux-fsdevel, vishal.l.verma, xfs, Christoph Hellwig, Dave Chinner

This is already in xfsprogs' libxfs, so port it to the kernel.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/libxfs/xfs_bmap.c       |   12 ++++++------
 fs/xfs/libxfs/xfs_bmap.h       |    4 ++--
 fs/xfs/libxfs/xfs_bmap_btree.c |    2 +-
 fs/xfs/libxfs/xfs_ialloc.c     |    9 ++++-----
 4 files changed, 13 insertions(+), 14 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 499e980..ea7b3df 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -570,10 +570,10 @@ xfs_bmap_validate_ret(
  */
 void
 xfs_bmap_add_free(
+	struct xfs_mount	*mp,		/* mount point structure */
+	struct xfs_bmap_free	*flist,		/* list of extents */
 	xfs_fsblock_t		bno,		/* fs block number of extent */
-	xfs_filblks_t		len,		/* length of extent */
-	xfs_bmap_free_t		*flist,		/* list of extents */
-	xfs_mount_t		*mp)		/* mount point structure */
+	xfs_filblks_t		len)		/* length of extent */
 {
 	xfs_bmap_free_item_t	*cur;		/* current (next) element */
 	xfs_bmap_free_item_t	*new;		/* new element */
@@ -699,7 +699,7 @@ xfs_bmap_btree_to_extents(
 	cblock = XFS_BUF_TO_BLOCK(cbp);
 	if ((error = xfs_btree_check_block(cur, cblock, 0, cbp)))
 		return error;
-	xfs_bmap_add_free(cbno, 1, cur->bc_private.b.flist, mp);
+	xfs_bmap_add_free(mp, cur->bc_private.b.flist, cbno, 1);
 	ip->i_d.di_nblocks--;
 	xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT, -1L);
 	xfs_trans_binval(tp, cbp);
@@ -5073,8 +5073,8 @@ xfs_bmap_del_extent(
 	 * If we need to, add to list of extents to delete.
 	 */
 	if (do_fx)
-		xfs_bmap_add_free(del->br_startblock, del->br_blockcount, flist,
-			mp);
+		xfs_bmap_add_free(mp, flist, del->br_startblock,
+			del->br_blockcount);
 	/*
 	 * Adjust inode # blocks in the file.
 	 */
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 79e3ebe..0b2f72c 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -191,8 +191,8 @@ void	xfs_bmap_trace_exlist(struct xfs_inode *ip, xfs_extnum_t cnt,
 
 int	xfs_bmap_add_attrfork(struct xfs_inode *ip, int size, int rsvd);
 void	xfs_bmap_local_to_extents_empty(struct xfs_inode *ip, int whichfork);
-void	xfs_bmap_add_free(xfs_fsblock_t bno, xfs_filblks_t len,
-		struct xfs_bmap_free *flist, struct xfs_mount *mp);
+void	xfs_bmap_add_free(struct xfs_mount *mp, struct xfs_bmap_free *flist,
+			  xfs_fsblock_t bno, xfs_filblks_t len);
 void	xfs_bmap_cancel(struct xfs_bmap_free *flist);
 int	xfs_bmap_finish(struct xfs_trans **tp, struct xfs_bmap_free *flist,
 			struct xfs_inode *ip);
diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
index 6282f6e..db0c71e 100644
--- a/fs/xfs/libxfs/xfs_bmap_btree.c
+++ b/fs/xfs/libxfs/xfs_bmap_btree.c
@@ -526,7 +526,7 @@ xfs_bmbt_free_block(
 	struct xfs_trans	*tp = cur->bc_tp;
 	xfs_fsblock_t		fsbno = XFS_DADDR_TO_FSB(mp, XFS_BUF_ADDR(bp));
 
-	xfs_bmap_add_free(fsbno, 1, cur->bc_private.b.flist, mp);
+	xfs_bmap_add_free(mp, cur->bc_private.b.flist, fsbno, 1);
 	ip->i_d.di_nblocks--;
 
 	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 77b5990..9d0003c 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -1828,9 +1828,8 @@ xfs_difree_inode_chunk(
 
 	if (!xfs_inobt_issparse(rec->ir_holemask)) {
 		/* not sparse, calculate extent info directly */
-		xfs_bmap_add_free(XFS_AGB_TO_FSB(mp, agno,
-				  XFS_AGINO_TO_AGBNO(mp, rec->ir_startino)),
-				  mp->m_ialloc_blks, flist, mp);
+		xfs_bmap_add_free(mp, flist, XFS_AGB_TO_FSB(mp, agno, sagbno),
+				  mp->m_ialloc_blks);
 		return;
 	}
 
@@ -1873,8 +1872,8 @@ xfs_difree_inode_chunk(
 
 		ASSERT(agbno % mp->m_sb.sb_spino_align == 0);
 		ASSERT(contigblk % mp->m_sb.sb_spino_align == 0);
-		xfs_bmap_add_free(XFS_AGB_TO_FSB(mp, agno, agbno), contigblk,
-				  flist, mp);
+		xfs_bmap_add_free(mp, flist, XFS_AGB_TO_FSB(mp, agno, agbno),
+				  contigblk);
 
 		/* reset range to current bit and carry on... */
 		startidx = endidx = nextbit;



* [PATCH 008/119] xfs: separate freelist fixing into a separate helper
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (6 preceding siblings ...)
  2016-06-17  1:18 ` [PATCH 007/119] xfs: rearrange xfs_bmap_add_free parameters Darrick J. Wong
@ 2016-06-17  1:18 ` Darrick J. Wong
  2016-06-17 11:52   ` Christoph Hellwig
  2016-06-21  1:40   ` Dave Chinner
  2016-06-17  1:18 ` [PATCH 009/119] xfs: convert list of extents to free into a regular list Darrick J. Wong
                   ` (110 subsequent siblings)
  118 siblings, 2 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:18 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

From: Dave Chinner <david@fromorbit.com>

Break the freelist-fixing step out of xfs_free_extent() into a
separate helper.  This helper will be used later to fix up the
freelist during deferred rmap processing.

Signed-off-by: Dave Chinner <david@fromorbit.com>
[darrick: refactor to put this at the head of the patchset]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_alloc.c |   82 +++++++++++++++++++++++++++++----------------
 fs/xfs/libxfs/xfs_alloc.h |    2 +
 2 files changed, 55 insertions(+), 29 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 58bdca7..1c76a0e 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2660,55 +2660,79 @@ error0:
 	return error;
 }
 
-/*
- * Free an extent.
- * Just break up the extent address and hand off to xfs_free_ag_extent
- * after fixing up the freelist.
- */
-int				/* error */
-xfs_free_extent(
-	xfs_trans_t	*tp,	/* transaction pointer */
-	xfs_fsblock_t	bno,	/* starting block number of extent */
-	xfs_extlen_t	len)	/* length of extent */
+/* Ensure that the freelist is at full capacity. */
+int
+xfs_free_extent_fix_freelist(
+	struct xfs_trans	*tp,
+	xfs_agnumber_t		agno,
+	struct xfs_buf		**agbp)
 {
-	xfs_alloc_arg_t	args;
-	int		error;
+	xfs_alloc_arg_t		args;
+	int			error;
 
-	ASSERT(len != 0);
 	memset(&args, 0, sizeof(xfs_alloc_arg_t));
 	args.tp = tp;
 	args.mp = tp->t_mountp;
+	args.agno = agno;
 
 	/*
 	 * validate that the block number is legal - the enables us to detect
 	 * and handle a silent filesystem corruption rather than crashing.
 	 */
-	args.agno = XFS_FSB_TO_AGNO(args.mp, bno);
 	if (args.agno >= args.mp->m_sb.sb_agcount)
 		return -EFSCORRUPTED;
 
-	args.agbno = XFS_FSB_TO_AGBNO(args.mp, bno);
-	if (args.agbno >= args.mp->m_sb.sb_agblocks)
-		return -EFSCORRUPTED;
-
 	args.pag = xfs_perag_get(args.mp, args.agno);
 	ASSERT(args.pag);
 
 	error = xfs_alloc_fix_freelist(&args, XFS_ALLOC_FLAG_FREEING);
 	if (error)
-		goto error0;
+		goto out;
+
+	*agbp = args.agbp;
+out:
+	xfs_perag_put(args.pag);
+	return error;
+}
+
+/*
+ * Free an extent.
+ * Just break up the extent address and hand off to xfs_free_ag_extent
+ * after fixing up the freelist.
+ */
+int				/* error */
+xfs_free_extent(
+	struct xfs_trans	*tp,	/* transaction pointer */
+	xfs_fsblock_t		bno,	/* starting block number of extent */
+	xfs_extlen_t		len)	/* length of extent */
+{
+	struct xfs_mount	*mp = tp->t_mountp;
+	struct xfs_buf		*agbp;
+	xfs_agnumber_t		agno = XFS_FSB_TO_AGNO(mp, bno);
+	xfs_agblock_t		agbno = XFS_FSB_TO_AGBNO(mp, bno);
+	int			error;
+
+	ASSERT(len != 0);
+
+	error = xfs_free_extent_fix_freelist(tp, agno, &agbp);
+	if (error)
+		return error;
+
+	XFS_WANT_CORRUPTED_GOTO(mp, agbno < mp->m_sb.sb_agblocks, err);
 
 	/* validate the extent size is legal now we have the agf locked */
-	if (args.agbno + len >
-			be32_to_cpu(XFS_BUF_TO_AGF(args.agbp)->agf_length)) {
-		error = -EFSCORRUPTED;
-		goto error0;
-	}
+	XFS_WANT_CORRUPTED_GOTO(mp,
+			agbno + len <= be32_to_cpu(XFS_BUF_TO_AGF(agbp)->agf_length),
+			err);
 
-	error = xfs_free_ag_extent(tp, args.agbp, args.agno, args.agbno, len, 0);
-	if (!error)
-		xfs_extent_busy_insert(tp, args.agno, args.agbno, len, 0);
-error0:
-	xfs_perag_put(args.pag);
+	error = xfs_free_ag_extent(tp, agbp, agno, agbno, len, 0);
+	if (error)
+		goto err;
+
+	xfs_extent_busy_insert(tp, agno, agbno, len, 0);
+	return 0;
+
+err:
+	xfs_trans_brelse(tp, agbp);
 	return error;
 }
diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index 92a66ba..cf268b2 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -229,5 +229,7 @@ xfs_alloc_get_rec(
 int xfs_read_agf(struct xfs_mount *mp, struct xfs_trans *tp,
 			xfs_agnumber_t agno, int flags, struct xfs_buf **bpp);
 int xfs_alloc_fix_freelist(struct xfs_alloc_arg *args, int flags);
+int xfs_free_extent_fix_freelist(struct xfs_trans *tp, xfs_agnumber_t agno,
+		struct xfs_buf **agbp);
 
 #endif	/* __XFS_ALLOC_H__ */



* [PATCH 009/119] xfs: convert list of extents to free into a regular list
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (7 preceding siblings ...)
  2016-06-17  1:18 ` [PATCH 008/119] xfs: separate freelist fixing into a separate helper Darrick J. Wong
@ 2016-06-17  1:18 ` Darrick J. Wong
  2016-06-17 11:59   ` Christoph Hellwig
  2016-06-17  1:18 ` [PATCH 010/119] xfs: create a standard btree size calculator code Darrick J. Wong
                   ` (109 subsequent siblings)
  118 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:18 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

In struct xfs_bmap_free, convert the open-coded free extent list to
a regular list, then use list_sort to sort it prior to processing.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c |   39 +++++++++++----------------------------
 fs/xfs/libxfs/xfs_bmap.h |   14 ++++++++------
 fs/xfs/xfs_bmap_util.c   |   32 +++++++++++++++++++++++++-------
 fs/xfs/xfs_bmap_util.h   |    1 -
 fs/xfs/xfs_super.c       |    5 +++--
 5 files changed, 47 insertions(+), 44 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index ea7b3df..a5d207a 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -575,9 +575,7 @@ xfs_bmap_add_free(
 	xfs_fsblock_t		bno,		/* fs block number of extent */
 	xfs_filblks_t		len)		/* length of extent */
 {
-	xfs_bmap_free_item_t	*cur;		/* current (next) element */
-	xfs_bmap_free_item_t	*new;		/* new element */
-	xfs_bmap_free_item_t	*prev;		/* previous element */
+	struct xfs_bmap_free_item	*new;		/* new element */
 #ifdef DEBUG
 	xfs_agnumber_t		agno;
 	xfs_agblock_t		agbno;
@@ -597,17 +595,7 @@ xfs_bmap_add_free(
 	new = kmem_zone_alloc(xfs_bmap_free_item_zone, KM_SLEEP);
 	new->xbfi_startblock = bno;
 	new->xbfi_blockcount = (xfs_extlen_t)len;
-	for (prev = NULL, cur = flist->xbf_first;
-	     cur != NULL;
-	     prev = cur, cur = cur->xbfi_next) {
-		if (cur->xbfi_startblock >= bno)
-			break;
-	}
-	if (prev)
-		prev->xbfi_next = new;
-	else
-		flist->xbf_first = new;
-	new->xbfi_next = cur;
+	list_add(&new->xbfi_list, &flist->xbf_flist);
 	flist->xbf_count++;
 }
 
@@ -617,14 +605,10 @@ xfs_bmap_add_free(
  */
 void
 xfs_bmap_del_free(
-	xfs_bmap_free_t		*flist,	/* free item list header */
-	xfs_bmap_free_item_t	*prev,	/* previous item on list, if any */
-	xfs_bmap_free_item_t	*free)	/* list item to be freed */
+	struct xfs_bmap_free		*flist,	/* free item list header */
+	struct xfs_bmap_free_item	*free)	/* list item to be freed */
 {
-	if (prev)
-		prev->xbfi_next = free->xbfi_next;
-	else
-		flist->xbf_first = free->xbfi_next;
+	list_del(&free->xbfi_list);
 	flist->xbf_count--;
 	kmem_zone_free(xfs_bmap_free_item_zone, free);
 }
@@ -634,17 +618,16 @@ xfs_bmap_del_free(
  */
 void
 xfs_bmap_cancel(
-	xfs_bmap_free_t		*flist)	/* list of bmap_free_items */
+	struct xfs_bmap_free		*flist)	/* list of bmap_free_items */
 {
-	xfs_bmap_free_item_t	*free;	/* free list item */
-	xfs_bmap_free_item_t	*next;
+	struct xfs_bmap_free_item	*free;	/* free list item */
 
 	if (flist->xbf_count == 0)
 		return;
-	ASSERT(flist->xbf_first != NULL);
-	for (free = flist->xbf_first; free; free = next) {
-		next = free->xbfi_next;
-		xfs_bmap_del_free(flist, NULL, free);
+	while (!list_empty(&flist->xbf_flist)) {
+		free = list_first_entry(&flist->xbf_flist,
+				struct xfs_bmap_free_item, xbfi_list);
+		xfs_bmap_del_free(flist, free);
 	}
 	ASSERT(flist->xbf_count == 0);
 }
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 0b2f72c..0ef4c6b 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -62,12 +62,12 @@ struct xfs_bmalloca {
  * List of extents to be free "later".
  * The list is kept sorted on xbf_startblock.
  */
-typedef struct xfs_bmap_free_item
+struct xfs_bmap_free_item
 {
 	xfs_fsblock_t		xbfi_startblock;/* starting fs block number */
 	xfs_extlen_t		xbfi_blockcount;/* number of blocks in extent */
-	struct xfs_bmap_free_item *xbfi_next;	/* link to next entry */
-} xfs_bmap_free_item_t;
+	struct list_head	xbfi_list;
+};
 
 /*
  * Header for free extent list.
@@ -85,7 +85,7 @@ typedef struct xfs_bmap_free_item
  */
 typedef	struct xfs_bmap_free
 {
-	xfs_bmap_free_item_t	*xbf_first;	/* list of to-be-free extents */
+	struct list_head	xbf_flist;	/* list of to-be-free extents */
 	int			xbf_count;	/* count of items on list */
 	int			xbf_low;	/* alloc in low mode */
 } xfs_bmap_free_t;
@@ -141,8 +141,10 @@ static inline int xfs_bmapi_aflag(int w)
 
 static inline void xfs_bmap_init(xfs_bmap_free_t *flp, xfs_fsblock_t *fbp)
 {
-	((flp)->xbf_first = NULL, (flp)->xbf_count = 0, \
-		(flp)->xbf_low = 0, *(fbp) = NULLFSBLOCK);
+	INIT_LIST_HEAD(&flp->xbf_flist);
+	flp->xbf_count = 0;
+	flp->xbf_low = 0;
+	*fbp = NULLFSBLOCK;
 }
 
 /*
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 28c42fb..1aac0ba 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -79,6 +79,23 @@ xfs_zero_extent(
 		GFP_NOFS, true);
 }
 
+/* Sort bmap items by AG. */
+static int
+xfs_bmap_free_list_cmp(
+	void			*priv,
+	struct list_head	*a,
+	struct list_head	*b)
+{
+	struct xfs_mount	*mp = priv;
+	struct xfs_bmap_free_item	*ra;
+	struct xfs_bmap_free_item	*rb;
+
+	ra = container_of(a, struct xfs_bmap_free_item, xbfi_list);
+	rb = container_of(b, struct xfs_bmap_free_item, xbfi_list);
+	return  XFS_FSB_TO_AGNO(mp, ra->xbfi_startblock) -
+		XFS_FSB_TO_AGNO(mp, rb->xbfi_startblock);
+}
+
 /*
  * Routine to be called at transaction's end by xfs_bmapi, xfs_bunmapi
  * caller.  Frees all the extents that need freeing, which must be done
@@ -99,14 +116,15 @@ xfs_bmap_finish(
 	int				error;	/* error return value */
 	int				committed;/* xact committed or not */
 	struct xfs_bmap_free_item	*free;	/* free extent item */
-	struct xfs_bmap_free_item	*next;	/* next item on free list */
 
 	ASSERT((*tp)->t_flags & XFS_TRANS_PERM_LOG_RES);
 	if (flist->xbf_count == 0)
 		return 0;
 
+	list_sort((*tp)->t_mountp, &flist->xbf_flist, xfs_bmap_free_list_cmp);
+
 	efi = xfs_trans_get_efi(*tp, flist->xbf_count);
-	for (free = flist->xbf_first; free; free = free->xbfi_next)
+	list_for_each_entry(free, &flist->xbf_flist, xbfi_list)
 		xfs_trans_log_efi_extent(*tp, efi, free->xbfi_startblock,
 			free->xbfi_blockcount);
 
@@ -136,15 +154,15 @@ xfs_bmap_finish(
 	 * on error.
 	 */
 	efd = xfs_trans_get_efd(*tp, efi, flist->xbf_count);
-	for (free = flist->xbf_first; free != NULL; free = next) {
-		next = free->xbfi_next;
-
+	while (!list_empty(&flist->xbf_flist)) {
+		free = list_first_entry(&flist->xbf_flist,
+				struct xfs_bmap_free_item, xbfi_list);
 		error = xfs_trans_free_extent(*tp, efd, free->xbfi_startblock,
 					      free->xbfi_blockcount);
 		if (error)
 			return error;
 
-		xfs_bmap_del_free(flist, NULL, free);
+		xfs_bmap_del_free(flist, free);
 	}
 
 	return 0;
@@ -797,7 +815,7 @@ xfs_bmap_punch_delalloc_range(
 		if (error)
 			break;
 
-		ASSERT(!flist.xbf_count && !flist.xbf_first);
+		ASSERT(!flist.xbf_count && list_empty(&flist.xbf_flist));
 next_block:
 		start_fsb++;
 		remaining--;
diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
index 1492348..f200714 100644
--- a/fs/xfs/xfs_bmap_util.h
+++ b/fs/xfs/xfs_bmap_util.h
@@ -41,7 +41,6 @@ int	xfs_getbmap(struct xfs_inode *ip, struct getbmapx *bmv,
 
 /* functions in xfs_bmap.c that are only needed by xfs_bmap_util.c */
 void	xfs_bmap_del_free(struct xfs_bmap_free *flist,
-			  struct xfs_bmap_free_item *prev,
 			  struct xfs_bmap_free_item *free);
 int	xfs_bmap_extsize_align(struct xfs_mount *mp, struct xfs_bmbt_irec *gotp,
 			       struct xfs_bmbt_irec *prevp, xfs_extlen_t extsz,
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 4700f09..09722a7 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1692,8 +1692,9 @@ xfs_init_zones(void)
 	if (!xfs_log_ticket_zone)
 		goto out_free_ioend_bioset;
 
-	xfs_bmap_free_item_zone = kmem_zone_init(sizeof(xfs_bmap_free_item_t),
-						"xfs_bmap_free_item");
+	xfs_bmap_free_item_zone = kmem_zone_init(
+			sizeof(struct xfs_bmap_free_item),
+			"xfs_bmap_free_item");
 	if (!xfs_bmap_free_item_zone)
 		goto out_destroy_log_ticket_zone;
 


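The conversion above replaces an open-coded, insertion-sorted singly linked
list with an unordered list that is sorted once, by allocation group (AG)
number, just before processing.  Below is a minimal userspace sketch of that
pattern; the names (free_item, AGBLKLOG, sort_by_agno) are illustrative, not
the kernel's, and an insertion sort stands in for the merge sort that
list_sort() performs with xfs_bmap_free_list_cmp() as comparator:

```c
/*
 * Userspace sketch: to-be-freed extents are appended to a plain list in
 * any order, then sorted by AG number just before processing.  The kernel
 * version uses struct list_head plus list_sort() instead.
 */
#include <assert.h>
#include <stddef.h>

#define AGBLKLOG 16	/* assume 2^16 blocks per AG for this sketch */

struct free_item {
	unsigned long long	startblock;	/* fs block number */
	unsigned int		blockcount;	/* blocks in this extent */
	struct free_item	*next;
};

static unsigned int fsb_to_agno(unsigned long long fsb)
{
	return (unsigned int)(fsb >> AGBLKLOG);
}

/* Sort the list ascending by AG number, mirroring the patch's comparator. */
static struct free_item *sort_by_agno(struct free_item *head)
{
	struct free_item	*sorted = NULL;

	while (head) {
		struct free_item	*item = head;
		struct free_item	**pos = &sorted;

		head = head->next;
		/* find the first entry with a larger AG number */
		while (*pos && fsb_to_agno((*pos)->startblock) <=
			       fsb_to_agno(item->startblock))
			pos = &(*pos)->next;
		item->next = *pos;
		*pos = item;
	}
	return sorted;
}
```

The comparator in the patch simply returns the difference of the two AG
numbers, which yields exactly this ascending ordering; sorting once before
the processing loop is cheaper than keeping the list sorted on every insert.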

* [PATCH 010/119] xfs: create a standard btree size calculator code
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (8 preceding siblings ...)
  2016-06-17  1:18 ` [PATCH 009/119] xfs: convert list of extents to free into a regular list Darrick J. Wong
@ 2016-06-17  1:18 ` Darrick J. Wong
  2016-06-20 14:31   ` Brian Foster
  2016-06-17  1:19 ` [PATCH 011/119] xfs: refactor btree maxlevels computation Darrick J. Wong
                   ` (108 subsequent siblings)
  118 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:18 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Create a helper to calculate the number of blocks needed to store a
given number of records in an AG btree.  This will be used (much)
later when we get to the refcount btree.

v2: Use a helper function instead of a macro.
v3: We can (theoretically) store more than 2^32 records in a btree, so
    widen the fields to accept that.
v4: Don't modify xfs_bmap_worst_indlen; the purpose of /that/ function
    is to estimate the worst-case number of blocks needed for a bmbt
    expansion, not to calculate the space required to store nr records.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_btree.c |   27 +++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_btree.h |    3 +++
 2 files changed, 30 insertions(+)


diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 105979d..5eb4e40 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -4156,3 +4156,30 @@ xfs_btree_sblock_verify(
 
 	return true;
 }
+
+/*
+ * Calculate the number of blocks needed to store a given number of records
+ * in a short-format (per-AG metadata) btree.
+ */
+xfs_extlen_t
+xfs_btree_calc_size(
+	struct xfs_mount	*mp,
+	uint			*limits,
+	unsigned long long	len)
+{
+	int			level;
+	int			maxrecs;
+	xfs_extlen_t		rval;
+
+	maxrecs = limits[0];
+	for (level = 0, rval = 0; len > 0; level++) {
+		len += maxrecs - 1;
+		do_div(len, maxrecs);
+		rval += len;
+		if (len == 1)
+			return rval;
+		if (level == 0)
+			maxrecs = limits[1];
+	}
+	return rval;
+}
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index 9a88839..b330f19 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -475,4 +475,7 @@ static inline int xfs_btree_get_level(struct xfs_btree_block *block)
 bool xfs_btree_sblock_v5hdr_verify(struct xfs_buf *bp);
 bool xfs_btree_sblock_verify(struct xfs_buf *bp, unsigned int max_recs);
 
+xfs_extlen_t xfs_btree_calc_size(struct xfs_mount *mp, uint *limits,
+		unsigned long long len);
+
 #endif	/* __XFS_BTREE_H__ */


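The arithmetic in xfs_btree_calc_size() above is a per-level ceiling
division: level 0 needs ceil(len / leaf-maxrecs) blocks, each higher level
needs ceil(previous-level-blocks / node-maxrecs), and the sum stops once a
single block (the root) suffices.  A standalone sketch of the same
computation, with illustrative names and plain integer division in place of
the kernel's do_div():

```c
/* Model of the btree size calculation: sum the blocks needed per level. */
#include <assert.h>

typedef unsigned long long u64;

static u64 btree_calc_size(unsigned int leaf_maxrecs,
			   unsigned int node_maxrecs, u64 len)
{
	unsigned int	maxrecs = leaf_maxrecs;
	u64		rval = 0;
	int		level;

	for (level = 0; len > 0; level++) {
		len = (len + maxrecs - 1) / maxrecs;	/* blocks this level */
		rval += len;
		if (len == 1)				/* reached the root */
			return rval;
		if (level == 0)
			maxrecs = node_maxrecs;		/* node fanout above */
	}
	return rval;
}
```

For example, with a leaf fanout of 10 and a node fanout of 4, 100 records
need 10 leaves, 3 level-1 nodes, and 1 root: 14 blocks in total.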

* [PATCH 011/119] xfs: refactor btree maxlevels computation
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (9 preceding siblings ...)
  2016-06-17  1:18 ` [PATCH 010/119] xfs: create a standard btree size calculator code Darrick J. Wong
@ 2016-06-17  1:19 ` Darrick J. Wong
  2016-06-20 14:31   ` Brian Foster
  2016-06-17  1:19 ` [PATCH 012/119] xfs: during btree split, save new block key & ptr for future insertion Darrick J. Wong
                   ` (107 subsequent siblings)
  118 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:19 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Create a common function to calculate the maximum height of a per-AG
btree.  This will eventually be used by the rmapbt and refcountbt code
to calculate appropriate maxlevels values for each.  This is important
because the verifiers and the transaction block reservations depend on
accurate estimates of how many blocks are needed to satisfy a btree split.

We were mistakenly using the max bnobt height for all the btrees.
This is dangerous: the larger records and keys in an rmapbt make it
quite possible for the rmapbt to grow taller than the bnobt, at which
point we can run out of transaction block reservation.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_alloc.c  |   15 ++-------------
 fs/xfs/libxfs/xfs_btree.c  |   19 +++++++++++++++++++
 fs/xfs/libxfs/xfs_btree.h  |    2 ++
 fs/xfs/libxfs/xfs_ialloc.c |   19 +++++--------------
 4 files changed, 28 insertions(+), 27 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 1c76a0e..c366889 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -1839,19 +1839,8 @@ void
 xfs_alloc_compute_maxlevels(
 	xfs_mount_t	*mp)	/* file system mount structure */
 {
-	int		level;
-	uint		maxblocks;
-	uint		maxleafents;
-	int		minleafrecs;
-	int		minnoderecs;
-
-	maxleafents = (mp->m_sb.sb_agblocks + 1) / 2;
-	minleafrecs = mp->m_alloc_mnr[0];
-	minnoderecs = mp->m_alloc_mnr[1];
-	maxblocks = (maxleafents + minleafrecs - 1) / minleafrecs;
-	for (level = 1; maxblocks > 1; level++)
-		maxblocks = (maxblocks + minnoderecs - 1) / minnoderecs;
-	mp->m_ag_maxlevels = level;
+	mp->m_ag_maxlevels = xfs_btree_compute_maxlevels(mp, mp->m_alloc_mnr,
+			(mp->m_sb.sb_agblocks + 1) / 2);
 }
 
 /*
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 5eb4e40..046fbcf 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -4158,6 +4158,25 @@ xfs_btree_sblock_verify(
 }
 
 /*
+ * Calculate the number of btree levels needed to store a given number of
+ * records in a short-format btree.
+ */
+uint
+xfs_btree_compute_maxlevels(
+	struct xfs_mount	*mp,
+	uint			*limits,
+	unsigned long		len)
+{
+	uint			level;
+	unsigned long		maxblocks;
+
+	maxblocks = (len + limits[0] - 1) / limits[0];
+	for (level = 1; maxblocks > 1; level++)
+		maxblocks = (maxblocks + limits[1] - 1) / limits[1];
+	return level;
+}
+
+/*
  * Calculate the number of blocks needed to store a given number of records
  * in a short-format (per-AG metadata) btree.
  */
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index b330f19..b955e5d 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -477,5 +477,7 @@ bool xfs_btree_sblock_verify(struct xfs_buf *bp, unsigned int max_recs);
 
 xfs_extlen_t xfs_btree_calc_size(struct xfs_mount *mp, uint *limits,
 		unsigned long long len);
+uint xfs_btree_compute_maxlevels(struct xfs_mount *mp, uint *limits,
+		unsigned long len);
 
 #endif	/* __XFS_BTREE_H__ */
diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 9d0003c..cda7269 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -2394,20 +2394,11 @@ void
 xfs_ialloc_compute_maxlevels(
 	xfs_mount_t	*mp)		/* file system mount structure */
 {
-	int		level;
-	uint		maxblocks;
-	uint		maxleafents;
-	int		minleafrecs;
-	int		minnoderecs;
-
-	maxleafents = (1LL << XFS_INO_AGINO_BITS(mp)) >>
-		XFS_INODES_PER_CHUNK_LOG;
-	minleafrecs = mp->m_inobt_mnr[0];
-	minnoderecs = mp->m_inobt_mnr[1];
-	maxblocks = (maxleafents + minleafrecs - 1) / minleafrecs;
-	for (level = 1; maxblocks > 1; level++)
-		maxblocks = (maxblocks + minnoderecs - 1) / minnoderecs;
-	mp->m_in_maxlevels = level;
+	uint		inodes;
+
+	inodes = (1LL << XFS_INO_AGINO_BITS(mp)) >> XFS_INODES_PER_CHUNK_LOG;
+	mp->m_in_maxlevels = xfs_btree_compute_maxlevels(mp, mp->m_inobt_mnr,
+							 inodes);
 }
 
 /*


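xfs_btree_compute_maxlevels() above takes the worst case: every block holds
only the minimum record count (limits[0] for leaves, limits[1] for nodes),
so the height is the number of ceiling divisions needed to shrink len
records down to a single root block.  A standalone sketch of the same logic
(names are illustrative; the kernel passes the limits as an array):

```c
/* Worst-case btree height for len records, given minimum per-block fanouts. */
#include <assert.h>

static unsigned int btree_compute_maxlevels(unsigned int min_leafrecs,
					    unsigned int min_noderecs,
					    unsigned long len)
{
	unsigned int	level;
	unsigned long	maxblocks;

	/* worst case: every leaf holds only the minimum record count */
	maxblocks = (len + min_leafrecs - 1) / min_leafrecs;
	for (level = 1; maxblocks > 1; level++)
		maxblocks = (maxblocks + min_noderecs - 1) / min_noderecs;
	return level;
}
```

With minimum fanouts of 2 at every level, 16 records need at most 8 leaves,
4 and then 2 interior blocks, and a root: a height of 4.  Running this per
btree type (with that btree's own min-records limits) is what lets the
rmapbt and refcountbt get maxlevels values independent of the bnobt's.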

* [PATCH 012/119] xfs: during btree split, save new block key & ptr for future insertion
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (10 preceding siblings ...)
  2016-06-17  1:19 ` [PATCH 011/119] xfs: refactor btree maxlevels computation Darrick J. Wong
@ 2016-06-17  1:19 ` Darrick J. Wong
  2016-06-21 13:00   ` Brian Foster
  2016-06-17  1:19 ` [PATCH 013/119] xfs: support btrees with overlapping intervals for keys Darrick J. Wong
                   ` (106 subsequent siblings)
  118 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:19 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

When a btree block has to be split, we pass the new block's ptr from
xfs_btree_split() back to xfs_btree_insert() via a pointer parameter;
however, we pass the block's key through the cursor's record.  It is a
little weird to "initialize" a record from a key since the non-key
attributes will have garbage values.

When we go to add support for interval queries, we have to be able to
pass the lowest and highest keys accessible via a pointer.  There's no
clean way to pass this back through the cursor's record field.
Therefore, pass the key directly back to xfs_btree_insert() the same
way that we pass the btree_ptr.

As a bonus, we no longer need init_rec_from_key and can drop it from the
codebase.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_alloc_btree.c  |   12 ----------
 fs/xfs/libxfs/xfs_bmap_btree.c   |   12 ----------
 fs/xfs/libxfs/xfs_btree.c        |   44 +++++++++++++++++++-------------------
 fs/xfs/libxfs/xfs_btree.h        |    2 --
 fs/xfs/libxfs/xfs_ialloc_btree.c |   10 ---------
 5 files changed, 22 insertions(+), 58 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c
index d9b42425..5ba2dac 100644
--- a/fs/xfs/libxfs/xfs_alloc_btree.c
+++ b/fs/xfs/libxfs/xfs_alloc_btree.c
@@ -212,17 +212,6 @@ xfs_allocbt_init_key_from_rec(
 }
 
 STATIC void
-xfs_allocbt_init_rec_from_key(
-	union xfs_btree_key	*key,
-	union xfs_btree_rec	*rec)
-{
-	ASSERT(key->alloc.ar_startblock != 0);
-
-	rec->alloc.ar_startblock = key->alloc.ar_startblock;
-	rec->alloc.ar_blockcount = key->alloc.ar_blockcount;
-}
-
-STATIC void
 xfs_allocbt_init_rec_from_cur(
 	struct xfs_btree_cur	*cur,
 	union xfs_btree_rec	*rec)
@@ -406,7 +395,6 @@ static const struct xfs_btree_ops xfs_allocbt_ops = {
 	.get_minrecs		= xfs_allocbt_get_minrecs,
 	.get_maxrecs		= xfs_allocbt_get_maxrecs,
 	.init_key_from_rec	= xfs_allocbt_init_key_from_rec,
-	.init_rec_from_key	= xfs_allocbt_init_rec_from_key,
 	.init_rec_from_cur	= xfs_allocbt_init_rec_from_cur,
 	.init_ptr_from_cur	= xfs_allocbt_init_ptr_from_cur,
 	.key_diff		= xfs_allocbt_key_diff,
diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
index db0c71e..714b387 100644
--- a/fs/xfs/libxfs/xfs_bmap_btree.c
+++ b/fs/xfs/libxfs/xfs_bmap_btree.c
@@ -600,17 +600,6 @@ xfs_bmbt_init_key_from_rec(
 }
 
 STATIC void
-xfs_bmbt_init_rec_from_key(
-	union xfs_btree_key	*key,
-	union xfs_btree_rec	*rec)
-{
-	ASSERT(key->bmbt.br_startoff != 0);
-
-	xfs_bmbt_disk_set_allf(&rec->bmbt, be64_to_cpu(key->bmbt.br_startoff),
-			       0, 0, XFS_EXT_NORM);
-}
-
-STATIC void
 xfs_bmbt_init_rec_from_cur(
 	struct xfs_btree_cur	*cur,
 	union xfs_btree_rec	*rec)
@@ -760,7 +749,6 @@ static const struct xfs_btree_ops xfs_bmbt_ops = {
 	.get_minrecs		= xfs_bmbt_get_minrecs,
 	.get_dmaxrecs		= xfs_bmbt_get_dmaxrecs,
 	.init_key_from_rec	= xfs_bmbt_init_key_from_rec,
-	.init_rec_from_key	= xfs_bmbt_init_rec_from_key,
 	.init_rec_from_cur	= xfs_bmbt_init_rec_from_cur,
 	.init_ptr_from_cur	= xfs_bmbt_init_ptr_from_cur,
 	.key_diff		= xfs_bmbt_key_diff,
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 046fbcf..a096539 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -2862,10 +2862,9 @@ xfs_btree_make_block_unfull(
 	int			*index,	/* new tree index */
 	union xfs_btree_ptr	*nptr,	/* new btree ptr */
 	struct xfs_btree_cur	**ncur,	/* new btree cursor */
-	union xfs_btree_rec	*nrec,	/* new record */
+	union xfs_btree_key	*key, /* key of new block */
 	int			*stat)
 {
-	union xfs_btree_key	key;	/* new btree key value */
 	int			error = 0;
 
 	if ((cur->bc_flags & XFS_BTREE_ROOT_IN_INODE) &&
@@ -2910,13 +2909,12 @@ xfs_btree_make_block_unfull(
 	 * If this works we have to re-set our variables because we
 	 * could be in a different block now.
 	 */
-	error = xfs_btree_split(cur, level, nptr, &key, ncur, stat);
+	error = xfs_btree_split(cur, level, nptr, key, ncur, stat);
 	if (error || *stat == 0)
 		return error;
 
 
 	*index = cur->bc_ptrs[level];
-	cur->bc_ops->init_rec_from_key(&key, nrec);
 	return 0;
 }
 
@@ -2929,16 +2927,16 @@ xfs_btree_insrec(
 	struct xfs_btree_cur	*cur,	/* btree cursor */
 	int			level,	/* level to insert record at */
 	union xfs_btree_ptr	*ptrp,	/* i/o: block number inserted */
-	union xfs_btree_rec	*recp,	/* i/o: record data inserted */
+	union xfs_btree_key	*key,	/* i/o: block key for ptrp */
 	struct xfs_btree_cur	**curp,	/* output: new cursor replacing cur */
 	int			*stat)	/* success/failure */
 {
 	struct xfs_btree_block	*block;	/* btree block */
 	struct xfs_buf		*bp;	/* buffer for block */
-	union xfs_btree_key	key;	/* btree key */
 	union xfs_btree_ptr	nptr;	/* new block ptr */
 	struct xfs_btree_cur	*ncur;	/* new btree cursor */
-	union xfs_btree_rec	nrec;	/* new record count */
+	union xfs_btree_key	nkey;	/* new block key */
+	union xfs_btree_rec	rec;	/* record to insert */
 	int			optr;	/* old key/record index */
 	int			ptr;	/* key/record index */
 	int			numrecs;/* number of records */
@@ -2947,8 +2945,14 @@ xfs_btree_insrec(
 	int			i;
 #endif
 
+	/* Make a key out of the record data to be inserted, and save it. */
+	if (level == 0) {
+		cur->bc_ops->init_rec_from_cur(cur, &rec);
+		cur->bc_ops->init_key_from_rec(key, &rec);
+	}
+
 	XFS_BTREE_TRACE_CURSOR(cur, XBT_ENTRY);
-	XFS_BTREE_TRACE_ARGIPR(cur, level, *ptrp, recp);
+	XFS_BTREE_TRACE_ARGIPR(cur, level, *ptrp, &rec);
 
 	ncur = NULL;
 
@@ -2973,9 +2977,6 @@ xfs_btree_insrec(
 		return 0;
 	}
 
-	/* Make a key out of the record data to be inserted, and save it. */
-	cur->bc_ops->init_key_from_rec(&key, recp);
-
 	optr = ptr;
 
 	XFS_BTREE_STATS_INC(cur, insrec);
@@ -2992,10 +2993,10 @@ xfs_btree_insrec(
 	/* Check that the new entry is being inserted in the right place. */
 	if (ptr <= numrecs) {
 		if (level == 0) {
-			ASSERT(cur->bc_ops->recs_inorder(cur, recp,
+			ASSERT(cur->bc_ops->recs_inorder(cur, &rec,
 				xfs_btree_rec_addr(cur, ptr, block)));
 		} else {
-			ASSERT(cur->bc_ops->keys_inorder(cur, &key,
+			ASSERT(cur->bc_ops->keys_inorder(cur, key,
 				xfs_btree_key_addr(cur, ptr, block)));
 		}
 	}
@@ -3008,7 +3009,7 @@ xfs_btree_insrec(
 	xfs_btree_set_ptr_null(cur, &nptr);
 	if (numrecs == cur->bc_ops->get_maxrecs(cur, level)) {
 		error = xfs_btree_make_block_unfull(cur, level, numrecs,
-					&optr, &ptr, &nptr, &ncur, &nrec, stat);
+					&optr, &ptr, &nptr, &ncur, &nkey, stat);
 		if (error || *stat == 0)
 			goto error0;
 	}
@@ -3058,7 +3059,7 @@ xfs_btree_insrec(
 #endif
 
 		/* Now put the new data in, bump numrecs and log it. */
-		xfs_btree_copy_keys(cur, kp, &key, 1);
+		xfs_btree_copy_keys(cur, kp, key, 1);
 		xfs_btree_copy_ptrs(cur, pp, ptrp, 1);
 		numrecs++;
 		xfs_btree_set_numrecs(block, numrecs);
@@ -3079,7 +3080,7 @@ xfs_btree_insrec(
 		xfs_btree_shift_recs(cur, rp, 1, numrecs - ptr + 1);
 
 		/* Now put the new data in, bump numrecs and log it. */
-		xfs_btree_copy_recs(cur, rp, recp, 1);
+		xfs_btree_copy_recs(cur, rp, &rec, 1);
 		xfs_btree_set_numrecs(block, ++numrecs);
 		xfs_btree_log_recs(cur, bp, ptr, numrecs);
 #ifdef DEBUG
@@ -3095,7 +3096,7 @@ xfs_btree_insrec(
 
 	/* If we inserted at the start of a block, update the parents' keys. */
 	if (optr == 1) {
-		error = xfs_btree_updkey(cur, &key, level + 1);
+		error = xfs_btree_updkey(cur, key, level + 1);
 		if (error)
 			goto error0;
 	}
@@ -3105,7 +3106,7 @@ xfs_btree_insrec(
 	 * we are at the far right edge of the tree, update it.
 	 */
 	if (xfs_btree_is_lastrec(cur, block, level)) {
-		cur->bc_ops->update_lastrec(cur, block, recp,
+		cur->bc_ops->update_lastrec(cur, block, &rec,
 					    ptr, LASTREC_INSREC);
 	}
 
@@ -3115,7 +3116,7 @@ xfs_btree_insrec(
 	 */
 	*ptrp = nptr;
 	if (!xfs_btree_ptr_is_null(cur, &nptr)) {
-		*recp = nrec;
+		*key = nkey;
 		*curp = ncur;
 	}
 
@@ -3146,14 +3147,13 @@ xfs_btree_insert(
 	union xfs_btree_ptr	nptr;	/* new block number (split result) */
 	struct xfs_btree_cur	*ncur;	/* new cursor (split result) */
 	struct xfs_btree_cur	*pcur;	/* previous level's cursor */
-	union xfs_btree_rec	rec;	/* record to insert */
+	union xfs_btree_key	key;	/* key of block to insert */
 
 	level = 0;
 	ncur = NULL;
 	pcur = cur;
 
 	xfs_btree_set_ptr_null(cur, &nptr);
-	cur->bc_ops->init_rec_from_cur(cur, &rec);
 
 	/*
 	 * Loop going up the tree, starting at the leaf level.
@@ -3165,7 +3165,7 @@ xfs_btree_insert(
 		 * Insert nrec/nptr into this level of the tree.
 		 * Note if we fail, nptr will be null.
 		 */
-		error = xfs_btree_insrec(pcur, level, &nptr, &rec, &ncur, &i);
+		error = xfs_btree_insrec(pcur, level, &nptr, &key, &ncur, &i);
 		if (error) {
 			if (pcur != cur)
 				xfs_btree_del_cursor(pcur, XFS_BTREE_ERROR);
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index b955e5d..b99c018 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -158,8 +158,6 @@ struct xfs_btree_ops {
 	/* init values of btree structures */
 	void	(*init_key_from_rec)(union xfs_btree_key *key,
 				     union xfs_btree_rec *rec);
-	void	(*init_rec_from_key)(union xfs_btree_key *key,
-				     union xfs_btree_rec *rec);
 	void	(*init_rec_from_cur)(struct xfs_btree_cur *cur,
 				     union xfs_btree_rec *rec);
 	void	(*init_ptr_from_cur)(struct xfs_btree_cur *cur,
diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
index 89c21d7..88da2ad 100644
--- a/fs/xfs/libxfs/xfs_ialloc_btree.c
+++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
@@ -146,14 +146,6 @@ xfs_inobt_init_key_from_rec(
 }
 
 STATIC void
-xfs_inobt_init_rec_from_key(
-	union xfs_btree_key	*key,
-	union xfs_btree_rec	*rec)
-{
-	rec->inobt.ir_startino = key->inobt.ir_startino;
-}
-
-STATIC void
 xfs_inobt_init_rec_from_cur(
 	struct xfs_btree_cur	*cur,
 	union xfs_btree_rec	*rec)
@@ -314,7 +306,6 @@ static const struct xfs_btree_ops xfs_inobt_ops = {
 	.get_minrecs		= xfs_inobt_get_minrecs,
 	.get_maxrecs		= xfs_inobt_get_maxrecs,
 	.init_key_from_rec	= xfs_inobt_init_key_from_rec,
-	.init_rec_from_key	= xfs_inobt_init_rec_from_key,
 	.init_rec_from_cur	= xfs_inobt_init_rec_from_cur,
 	.init_ptr_from_cur	= xfs_inobt_init_ptr_from_cur,
 	.key_diff		= xfs_inobt_key_diff,
@@ -336,7 +327,6 @@ static const struct xfs_btree_ops xfs_finobt_ops = {
 	.get_minrecs		= xfs_inobt_get_minrecs,
 	.get_maxrecs		= xfs_inobt_get_maxrecs,
 	.init_key_from_rec	= xfs_inobt_init_key_from_rec,
-	.init_rec_from_key	= xfs_inobt_init_rec_from_key,
 	.init_rec_from_cur	= xfs_inobt_init_rec_from_cur,
 	.init_ptr_from_cur	= xfs_finobt_init_ptr_from_cur,
 	.key_diff		= xfs_inobt_key_diff,


^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 013/119] xfs: support btrees with overlapping intervals for keys
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (11 preceding siblings ...)
  2016-06-17  1:19 ` [PATCH 012/119] xfs: during btree split, save new block key & ptr for future insertion Darrick J. Wong
@ 2016-06-17  1:19 ` Darrick J. Wong
  2016-06-22 15:17   ` Brian Foster
  2016-07-06  4:59   ` Dave Chinner
  2016-06-17  1:19 ` [PATCH 014/119] xfs: introduce interval queries on btrees Darrick J. Wong
                   ` (105 subsequent siblings)
  118 siblings, 2 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:19 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

On a filesystem with both reflink and reverse mapping enabled, it's
possible to have multiple rmap records referring to the same blocks on
disk.  When overlapping intervals are possible, querying a classic
btree to find all records intersecting a given interval is inefficient
because we cannot use the left side of the search interval to filter
out non-matching records the same way that we can use the existing
btree key to filter out records coming after the right side of the
search interval.  This will become important once we want to use the
rmap btree to rebuild BMBTs, or implement the (future) fsmap ioctl.

(For the non-overlapping case, we can perform such queries trivially
by starting at the left side of the interval and walking the tree
until we pass the right side.)

Therefore, extend the btree code to come closer to supporting
intervals as a first-class record attribute.  This involves widening
the btree node's key space to store both the lowest key reachable via
the node pointer (as the btree does now) and the highest key reachable
via the same pointer, and teaching the btree modification functions to
keep the highest-key records up to date.

This behavior can be turned on via a new btree ops flag so that btrees
that cannot store overlapping intervals don't pay the overhead costs
in terms of extra code and disk format changes.

v2: When we're deleting a record in a btree that supports overlapping
interval records and the deletion results in two btree blocks being
joined, we defer updating the high/low keys until all possible joins
(at higher levels in the tree) have finished.  At this point, the
btree pointers at all levels have been updated to remove the empty
blocks and we can update the low and high keys.

When we're doing this, we must be careful to update the keys of all
node pointers up to the root instead of stopping at the first set of
keys that don't need updating.  This is because a single deletion can
cause blocks to join at multiple levels of the tree, so we need to
update everything all the way back to the root.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_btree.c |  379 +++++++++++++++++++++++++++++++++++++++++----
 fs/xfs/libxfs/xfs_btree.h |   16 ++
 fs/xfs/xfs_trace.h        |   36 ++++
 3 files changed, 395 insertions(+), 36 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index a096539..afcafd6 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -52,6 +52,11 @@ static const __uint32_t xfs_magics[2][XFS_BTNUM_MAX] = {
 	xfs_magics[!!((cur)->bc_flags & XFS_BTREE_CRC_BLOCKS)][cur->bc_btnum]
 
 
+struct xfs_btree_double_key {
+	union xfs_btree_key	low;
+	union xfs_btree_key	high;
+};
+
 STATIC int				/* error (0 or EFSCORRUPTED) */
 xfs_btree_check_lblock(
 	struct xfs_btree_cur	*cur,	/* btree cursor */
@@ -428,6 +433,30 @@ xfs_btree_dup_cursor(
  * into a btree block (xfs_btree_*_offset) or return a pointer to the given
  * record, key or pointer (xfs_btree_*_addr).  Note that all addressing
  * inside the btree block is done using indices starting at one, not zero!
+ *
+ * If XFS_BTREE_OVERLAPPING is set, then this btree supports keys containing
+ * overlapping intervals.  In such a tree, records are still sorted lowest to
+ * highest and indexed by the smallest key value that refers to the record.
+ * However, nodes are different: each pointer has two associated keys -- one
+ * indexing the lowest key available in the block(s) below (the same behavior
+ * as the key in a regular btree) and another indexing the highest key
+ * available in the block(s) below.  Because records are /not/ sorted by the
+ * highest key, all leaf block updates require us to compute the highest key
+ * that matches any record in the leaf and to recursively update the high keys
+ * in the nodes going further up in the tree, if necessary.  Nodes look like
+ * this:
+ *
+ *		+--------+-----+-----+-----+-----+-----+-------+-------+-----+
+ * Non-Leaf:	| header | lo1 | hi1 | lo2 | hi2 | ... | ptr 1 | ptr 2 | ... |
+ *		+--------+-----+-----+-----+-----+-----+-------+-------+-----+
+ *
+ * To perform an interval query on an overlapped tree, perform the usual
+ * depth-first search and use the low and high keys to decide if we can skip
+ * that particular node.  If a leaf node is reached, return the records that
+ * intersect the interval.  Note that an interval query may return numerous
+ * entries.  For a non-overlapped tree, simply search for the record associated
+ * with the lowest key and iterate forward until a non-matching record is
+ * found.
  */
 
 /*
@@ -445,6 +474,17 @@ static inline size_t xfs_btree_block_len(struct xfs_btree_cur *cur)
 	return XFS_BTREE_SBLOCK_LEN;
 }
 
+/* Return size of btree block keys for this btree instance. */
+static inline size_t xfs_btree_key_len(struct xfs_btree_cur *cur)
+{
+	size_t			len;
+
+	len = cur->bc_ops->key_len;
+	if (cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING)
+		len *= 2;
+	return len;
+}
+
 /*
  * Return size of btree block pointers for this btree instance.
  */
@@ -475,7 +515,19 @@ xfs_btree_key_offset(
 	int			n)
 {
 	return xfs_btree_block_len(cur) +
-		(n - 1) * cur->bc_ops->key_len;
+		(n - 1) * xfs_btree_key_len(cur);
+}
+
+/*
+ * Calculate offset of the n-th high key in a btree block.
+ */
+STATIC size_t
+xfs_btree_high_key_offset(
+	struct xfs_btree_cur	*cur,
+	int			n)
+{
+	return xfs_btree_block_len(cur) +
+		(n - 1) * xfs_btree_key_len(cur) + cur->bc_ops->key_len;
 }
 
 /*
@@ -488,7 +540,7 @@ xfs_btree_ptr_offset(
 	int			level)
 {
 	return xfs_btree_block_len(cur) +
-		cur->bc_ops->get_maxrecs(cur, level) * cur->bc_ops->key_len +
+		cur->bc_ops->get_maxrecs(cur, level) * xfs_btree_key_len(cur) +
 		(n - 1) * xfs_btree_ptr_len(cur);
 }
 
@@ -519,6 +571,19 @@ xfs_btree_key_addr(
 }
 
 /*
+ * Return a pointer to the n-th high key in the btree block.
+ */
+STATIC union xfs_btree_key *
+xfs_btree_high_key_addr(
+	struct xfs_btree_cur	*cur,
+	int			n,
+	struct xfs_btree_block	*block)
+{
+	return (union xfs_btree_key *)
+		((char *)block + xfs_btree_high_key_offset(cur, n));
+}
+
+/*
  * Return a pointer to the n-th block pointer in the btree block.
  */
 STATIC union xfs_btree_ptr *
@@ -1217,7 +1282,7 @@ xfs_btree_copy_keys(
 	int			numkeys)
 {
 	ASSERT(numkeys >= 0);
-	memcpy(dst_key, src_key, numkeys * cur->bc_ops->key_len);
+	memcpy(dst_key, src_key, numkeys * xfs_btree_key_len(cur));
 }
 
 /*
@@ -1263,8 +1328,8 @@ xfs_btree_shift_keys(
 	ASSERT(numkeys >= 0);
 	ASSERT(dir == 1 || dir == -1);
 
-	dst_key = (char *)key + (dir * cur->bc_ops->key_len);
-	memmove(dst_key, key, numkeys * cur->bc_ops->key_len);
+	dst_key = (char *)key + (dir * xfs_btree_key_len(cur));
+	memmove(dst_key, key, numkeys * xfs_btree_key_len(cur));
 }
 
 /*
@@ -1879,6 +1944,180 @@ error0:
 	return error;
 }
 
+/* Determine the low and high keys of a leaf block */
+STATIC void
+xfs_btree_find_leaf_keys(
+	struct xfs_btree_cur	*cur,
+	struct xfs_btree_block	*block,
+	union xfs_btree_key	*low,
+	union xfs_btree_key	*high)
+{
+	int			n;
+	union xfs_btree_rec	*rec;
+	union xfs_btree_key	max_hkey;
+	union xfs_btree_key	hkey;
+
+	rec = xfs_btree_rec_addr(cur, 1, block);
+	cur->bc_ops->init_key_from_rec(low, rec);
+
+	if (!(cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING))
+		return;
+
+	cur->bc_ops->init_high_key_from_rec(&max_hkey, rec);
+	for (n = 2; n <= xfs_btree_get_numrecs(block); n++) {
+		rec = xfs_btree_rec_addr(cur, n, block);
+		cur->bc_ops->init_high_key_from_rec(&hkey, rec);
+		if (cur->bc_ops->diff_two_keys(cur, &max_hkey, &hkey) > 0)
+			max_hkey = hkey;
+	}
+
+	*high = max_hkey;
+}
+
+/* Determine the low and high keys of a node block */
+STATIC void
+xfs_btree_find_node_keys(
+	struct xfs_btree_cur	*cur,
+	struct xfs_btree_block	*block,
+	union xfs_btree_key	*low,
+	union xfs_btree_key	*high)
+{
+	int			n;
+	union xfs_btree_key	*hkey;
+	union xfs_btree_key	*max_hkey;
+
+	*low = *xfs_btree_key_addr(cur, 1, block);
+
+	if (!(cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING))
+		return;
+
+	max_hkey = xfs_btree_high_key_addr(cur, 1, block);
+	for (n = 2; n <= xfs_btree_get_numrecs(block); n++) {
+		hkey = xfs_btree_high_key_addr(cur, n, block);
+		if (cur->bc_ops->diff_two_keys(cur, max_hkey, hkey) > 0)
+			max_hkey = hkey;
+	}
+
+	*high = *max_hkey;
+}
+
+/*
+ * Update parental low & high keys from some block all the way back to the
+ * root of the btree.
+ */
+STATIC int
+__xfs_btree_updkeys(
+	struct xfs_btree_cur	*cur,
+	int			level,
+	struct xfs_btree_block	*block,
+	struct xfs_buf		*bp0,
+	bool			force_all)
+{
+	union xfs_btree_key	lkey;	/* keys from current level */
+	union xfs_btree_key	hkey;
+	union xfs_btree_key	*nlkey;	/* keys from the next level up */
+	union xfs_btree_key	*nhkey;
+	struct xfs_buf		*bp;
+	int			ptr = -1;
+
+	if (!(cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING))
+		return 0;
+
+	if (level + 1 >= cur->bc_nlevels)
+		return 0;
+
+	trace_xfs_btree_updkeys(cur, level, bp0);
+
+	if (level == 0)
+		xfs_btree_find_leaf_keys(cur, block, &lkey, &hkey);
+	else
+		xfs_btree_find_node_keys(cur, block, &lkey, &hkey);
+	for (level++; level < cur->bc_nlevels; level++) {
+		block = xfs_btree_get_block(cur, level, &bp);
+		trace_xfs_btree_updkeys(cur, level, bp);
+		ptr = cur->bc_ptrs[level];
+		nlkey = xfs_btree_key_addr(cur, ptr, block);
+		nhkey = xfs_btree_high_key_addr(cur, ptr, block);
+		if (!(cur->bc_ops->diff_two_keys(cur, nlkey, &lkey) != 0 ||
+		      cur->bc_ops->diff_two_keys(cur, nhkey, &hkey) != 0) &&
+		    !force_all)
+			break;
+		memcpy(nlkey, &lkey, cur->bc_ops->key_len);
+		memcpy(nhkey, &hkey, cur->bc_ops->key_len);
+		xfs_btree_log_keys(cur, bp, ptr, ptr);
+		if (level + 1 >= cur->bc_nlevels)
+			break;
+		xfs_btree_find_node_keys(cur, block, &lkey, &hkey);
+	}
+
+	return 0;
+}
+
+/*
+ * Update all the keys from a sibling block at some level in the cursor back
+ * to the root, stopping when we find a key pair that doesn't need updating.
+ */
+STATIC int
+xfs_btree_sibling_updkeys(
+	struct xfs_btree_cur	*cur,
+	int			level,
+	int			ptr,
+	struct xfs_btree_block	*block,
+	struct xfs_buf		*bp0)
+{
+	struct xfs_btree_cur	*ncur;
+	int			stat;
+	int			error;
+
+	error = xfs_btree_dup_cursor(cur, &ncur);
+	if (error)
+		return error;
+
+	if (level + 1 >= ncur->bc_nlevels)
+		error = -EDOM;
+	else if (ptr == XFS_BB_RIGHTSIB)
+		error = xfs_btree_increment(ncur, level + 1, &stat);
+	else if (ptr == XFS_BB_LEFTSIB)
+		error = xfs_btree_decrement(ncur, level + 1, &stat);
+	else
+		error = -EBADE;
+	if (error || !stat)
+		return error;
+
+	error = __xfs_btree_updkeys(ncur, level, block, bp0, false);
+	xfs_btree_del_cursor(ncur, XFS_BTREE_NOERROR);
+	return error;
+}
+
+/*
+ * Update all the keys from some level in cursor back to the root, stopping
+ * when we find a key pair that doesn't need updating.
+ */
+STATIC int
+xfs_btree_updkeys(
+	struct xfs_btree_cur	*cur,
+	int			level)
+{
+	struct xfs_buf		*bp;
+	struct xfs_btree_block	*block;
+
+	block = xfs_btree_get_block(cur, level, &bp);
+	return __xfs_btree_updkeys(cur, level, block, bp, false);
+}
+
+/* Update all the keys from some level in cursor back to the root. */
+STATIC int
+xfs_btree_updkeys_force(
+	struct xfs_btree_cur	*cur,
+	int			level)
+{
+	struct xfs_buf		*bp;
+	struct xfs_btree_block	*block;
+
+	block = xfs_btree_get_block(cur, level, &bp);
+	return __xfs_btree_updkeys(cur, level, block, bp, true);
+}
+
 /*
  * Update keys at all levels from here to the root along the cursor's path.
  */
@@ -1893,6 +2132,9 @@ xfs_btree_updkey(
 	union xfs_btree_key	*kp;
 	int			ptr;
 
+	if (cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING)
+		return 0;
+
 	XFS_BTREE_TRACE_CURSOR(cur, XBT_ENTRY);
 	XFS_BTREE_TRACE_ARGIK(cur, level, keyp);
 
@@ -1970,7 +2212,8 @@ xfs_btree_update(
 					    ptr, LASTREC_UPDATE);
 	}
 
-	/* Updating first rec in leaf. Pass new key value up to our parent. */
+	/* Pass new key value up to our parent. */
+	xfs_btree_updkeys(cur, 0);
 	if (ptr == 1) {
 		union xfs_btree_key	key;
 
@@ -2149,7 +2392,9 @@ xfs_btree_lshift(
 		rkp = &key;
 	}
 
-	/* Update the parent key values of right. */
+	/* Update the parent key values of left and right. */
+	xfs_btree_sibling_updkeys(cur, level, XFS_BB_LEFTSIB, left, lbp);
+	xfs_btree_updkeys(cur, level);
 	error = xfs_btree_updkey(cur, rkp, level + 1);
 	if (error)
 		goto error0;
@@ -2321,6 +2566,9 @@ xfs_btree_rshift(
 	if (error)
 		goto error1;
 
+	/* Update left and right parent pointers */
+	xfs_btree_updkeys(cur, level);
+	xfs_btree_updkeys(tcur, level);
 	error = xfs_btree_updkey(tcur, rkp, level + 1);
 	if (error)
 		goto error1;
@@ -2356,7 +2604,7 @@ __xfs_btree_split(
 	struct xfs_btree_cur	*cur,
 	int			level,
 	union xfs_btree_ptr	*ptrp,
-	union xfs_btree_key	*key,
+	struct xfs_btree_double_key	*key,
 	struct xfs_btree_cur	**curp,
 	int			*stat)		/* success/failure */
 {
@@ -2452,9 +2700,6 @@ __xfs_btree_split(
 
 		xfs_btree_log_keys(cur, rbp, 1, rrecs);
 		xfs_btree_log_ptrs(cur, rbp, 1, rrecs);
-
-		/* Grab the keys to the entries moved to the right block */
-		xfs_btree_copy_keys(cur, key, rkp, 1);
 	} else {
 		/* It's a leaf.  Move records.  */
 		union xfs_btree_rec	*lrp;	/* left record pointer */
@@ -2465,12 +2710,8 @@ __xfs_btree_split(
 
 		xfs_btree_copy_recs(cur, rrp, lrp, rrecs);
 		xfs_btree_log_recs(cur, rbp, 1, rrecs);
-
-		cur->bc_ops->init_key_from_rec(key,
-			xfs_btree_rec_addr(cur, 1, right));
 	}
 
-
 	/*
 	 * Find the left block number by looking in the buffer.
 	 * Adjust numrecs, sibling pointers.
@@ -2484,6 +2725,12 @@ __xfs_btree_split(
 	xfs_btree_set_numrecs(left, lrecs);
 	xfs_btree_set_numrecs(right, xfs_btree_get_numrecs(right) + rrecs);
 
+	/* Find the low & high keys for the new block. */
+	if (level > 0)
+		xfs_btree_find_node_keys(cur, right, &key->low, &key->high);
+	else
+		xfs_btree_find_leaf_keys(cur, right, &key->low, &key->high);
+
 	xfs_btree_log_block(cur, rbp, XFS_BB_ALL_BITS);
 	xfs_btree_log_block(cur, lbp, XFS_BB_NUMRECS | XFS_BB_RIGHTSIB);
 
@@ -2499,6 +2746,10 @@ __xfs_btree_split(
 		xfs_btree_set_sibling(cur, rrblock, &rptr, XFS_BB_LEFTSIB);
 		xfs_btree_log_block(cur, rrbp, XFS_BB_LEFTSIB);
 	}
+
+	/* Update the left block's keys... */
+	xfs_btree_updkeys(cur, level);
+
 	/*
 	 * If the cursor is really in the right block, move it there.
 	 * If it's just pointing past the last entry in left, then we'll
@@ -2537,7 +2788,7 @@ struct xfs_btree_split_args {
 	struct xfs_btree_cur	*cur;
 	int			level;
 	union xfs_btree_ptr	*ptrp;
-	union xfs_btree_key	*key;
+	struct xfs_btree_double_key	*key;
 	struct xfs_btree_cur	**curp;
 	int			*stat;		/* success/failure */
 	int			result;
@@ -2586,7 +2837,7 @@ xfs_btree_split(
 	struct xfs_btree_cur	*cur,
 	int			level,
 	union xfs_btree_ptr	*ptrp,
-	union xfs_btree_key	*key,
+	struct xfs_btree_double_key	*key,
 	struct xfs_btree_cur	**curp,
 	int			*stat)		/* success/failure */
 {
@@ -2806,27 +3057,27 @@ xfs_btree_new_root(
 		bp = lbp;
 		nptr = 2;
 	}
+
 	/* Fill in the new block's btree header and log it. */
 	xfs_btree_init_block_cur(cur, nbp, cur->bc_nlevels, 2);
 	xfs_btree_log_block(cur, nbp, XFS_BB_ALL_BITS);
 	ASSERT(!xfs_btree_ptr_is_null(cur, &lptr) &&
 			!xfs_btree_ptr_is_null(cur, &rptr));
-
 	/* Fill in the key data in the new root. */
 	if (xfs_btree_get_level(left) > 0) {
-		xfs_btree_copy_keys(cur,
+		xfs_btree_find_node_keys(cur, left,
 				xfs_btree_key_addr(cur, 1, new),
-				xfs_btree_key_addr(cur, 1, left), 1);
-		xfs_btree_copy_keys(cur,
+				xfs_btree_high_key_addr(cur, 1, new));
+		xfs_btree_find_node_keys(cur, right,
 				xfs_btree_key_addr(cur, 2, new),
-				xfs_btree_key_addr(cur, 1, right), 1);
+				xfs_btree_high_key_addr(cur, 2, new));
 	} else {
-		cur->bc_ops->init_key_from_rec(
-				xfs_btree_key_addr(cur, 1, new),
-				xfs_btree_rec_addr(cur, 1, left));
-		cur->bc_ops->init_key_from_rec(
-				xfs_btree_key_addr(cur, 2, new),
-				xfs_btree_rec_addr(cur, 1, right));
+		xfs_btree_find_leaf_keys(cur, left,
+			xfs_btree_key_addr(cur, 1, new),
+			xfs_btree_high_key_addr(cur, 1, new));
+		xfs_btree_find_leaf_keys(cur, right,
+			xfs_btree_key_addr(cur, 2, new),
+			xfs_btree_high_key_addr(cur, 2, new));
 	}
 	xfs_btree_log_keys(cur, nbp, 1, 2);
 
@@ -2837,6 +3088,7 @@ xfs_btree_new_root(
 		xfs_btree_ptr_addr(cur, 2, new), &rptr, 1);
 	xfs_btree_log_ptrs(cur, nbp, 1, 2);
 
+
 	/* Fix up the cursor. */
 	xfs_btree_setbuf(cur, cur->bc_nlevels, nbp);
 	cur->bc_ptrs[cur->bc_nlevels] = nptr;
@@ -2862,7 +3114,7 @@ xfs_btree_make_block_unfull(
 	int			*index,	/* new tree index */
 	union xfs_btree_ptr	*nptr,	/* new btree ptr */
 	struct xfs_btree_cur	**ncur,	/* new btree cursor */
-	union xfs_btree_key	*key, /* key of new block */
+	struct xfs_btree_double_key	*key,	/* key of new block */
 	int			*stat)
 {
 	int			error = 0;
@@ -2918,6 +3170,22 @@ xfs_btree_make_block_unfull(
 	return 0;
 }
 
+/* Copy a double key into a btree block. */
+static void
+xfs_btree_copy_double_keys(
+	struct xfs_btree_cur	*cur,
+	int			ptr,
+	struct xfs_btree_block	*block,
+	struct xfs_btree_double_key	*key)
+{
+	memcpy(xfs_btree_key_addr(cur, ptr, block), &key->low,
+			cur->bc_ops->key_len);
+
+	if (cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING)
+		memcpy(xfs_btree_high_key_addr(cur, ptr, block), &key->high,
+				cur->bc_ops->key_len);
+}
+
 /*
  * Insert one record/level.  Return information to the caller
  * allowing the next level up to proceed if necessary.
@@ -2927,7 +3195,7 @@ xfs_btree_insrec(
 	struct xfs_btree_cur	*cur,	/* btree cursor */
 	int			level,	/* level to insert record at */
 	union xfs_btree_ptr	*ptrp,	/* i/o: block number inserted */
-	union xfs_btree_key	*key,	/* i/o: block key for ptrp */
+	struct xfs_btree_double_key	*key, /* i/o: block key for ptrp */
 	struct xfs_btree_cur	**curp,	/* output: new cursor replacing cur */
 	int			*stat)	/* success/failure */
 {
@@ -2935,7 +3203,7 @@ xfs_btree_insrec(
 	struct xfs_buf		*bp;	/* buffer for block */
 	union xfs_btree_ptr	nptr;	/* new block ptr */
 	struct xfs_btree_cur	*ncur;	/* new btree cursor */
-	union xfs_btree_key	nkey;	/* new block key */
+	struct xfs_btree_double_key	nkey;	/* new block key */
 	union xfs_btree_rec	rec;	/* record to insert */
 	int			optr;	/* old key/record index */
 	int			ptr;	/* key/record index */
@@ -2944,11 +3212,12 @@ xfs_btree_insrec(
 #ifdef DEBUG
 	int			i;
 #endif
+	xfs_daddr_t		old_bn;
 
 	/* Make a key out of the record data to be inserted, and save it. */
 	if (level == 0) {
 		cur->bc_ops->init_rec_from_cur(cur, &rec);
-		cur->bc_ops->init_key_from_rec(key, &rec);
+		cur->bc_ops->init_key_from_rec(&key->low, &rec);
 	}
 
 	XFS_BTREE_TRACE_CURSOR(cur, XBT_ENTRY);
@@ -2983,6 +3252,7 @@ xfs_btree_insrec(
 
 	/* Get pointers to the btree buffer and block. */
 	block = xfs_btree_get_block(cur, level, &bp);
+	old_bn = bp ? bp->b_bn : XFS_BUF_DADDR_NULL;
 	numrecs = xfs_btree_get_numrecs(block);
 
 #ifdef DEBUG
@@ -2996,7 +3266,7 @@ xfs_btree_insrec(
 			ASSERT(cur->bc_ops->recs_inorder(cur, &rec,
 				xfs_btree_rec_addr(cur, ptr, block)));
 		} else {
-			ASSERT(cur->bc_ops->keys_inorder(cur, key,
+			ASSERT(cur->bc_ops->keys_inorder(cur, &key->low,
 				xfs_btree_key_addr(cur, ptr, block)));
 		}
 	}
@@ -3059,7 +3329,7 @@ xfs_btree_insrec(
 #endif
 
 		/* Now put the new data in, bump numrecs and log it. */
-		xfs_btree_copy_keys(cur, kp, key, 1);
+		xfs_btree_copy_double_keys(cur, ptr, block, key);
 		xfs_btree_copy_ptrs(cur, pp, ptrp, 1);
 		numrecs++;
 		xfs_btree_set_numrecs(block, numrecs);
@@ -3095,8 +3365,24 @@ xfs_btree_insrec(
 	xfs_btree_log_block(cur, bp, XFS_BB_NUMRECS);
 
 	/* If we inserted at the start of a block, update the parents' keys. */
+	if (ncur && bp->b_bn != old_bn) {
+		/*
+		 * We just inserted into a new tree block, which means that
+		 * the key for the block is in nkey, not the tree.
+		 */
+		if (level == 0)
+			xfs_btree_find_leaf_keys(cur, block, &nkey.low,
+					&nkey.high);
+		else
+			xfs_btree_find_node_keys(cur, block, &nkey.low,
+					&nkey.high);
+	} else {
+		/* Updating the left block, do it the standard way. */
+		xfs_btree_updkeys(cur, level);
+	}
+
 	if (optr == 1) {
-		error = xfs_btree_updkey(cur, key, level + 1);
+		error = xfs_btree_updkey(cur, &key->low, level + 1);
 		if (error)
 			goto error0;
 	}
@@ -3147,7 +3433,7 @@ xfs_btree_insert(
 	union xfs_btree_ptr	nptr;	/* new block number (split result) */
 	struct xfs_btree_cur	*ncur;	/* new cursor (split result) */
 	struct xfs_btree_cur	*pcur;	/* previous level's cursor */
-	union xfs_btree_key	key;	/* key of block to insert */
+	struct xfs_btree_double_key	key;	/* key of block to insert */
 
 	level = 0;
 	ncur = NULL;
@@ -3552,6 +3838,7 @@ xfs_btree_delrec(
 	 * If we deleted the leftmost entry in the block, update the
 	 * key values above us in the tree.
 	 */
+	xfs_btree_updkeys(cur, level);
 	if (ptr == 1) {
 		error = xfs_btree_updkey(cur, keyp, level + 1);
 		if (error)
@@ -3882,6 +4169,16 @@ xfs_btree_delrec(
 	if (level > 0)
 		cur->bc_ptrs[level]--;
 
+	/*
+	 * We combined blocks, so we have to update the parent keys if the
+	 * btree supports overlapped intervals.  However, bc_ptrs[level + 1]
+	 * points to the old block so that the caller knows which record to
+	 * delete.  Therefore, the caller must be savvy enough to call updkeys
+	 * for us if we return stat == 2.  The other exit points from this
+	 * function don't require deletions further up the tree, so they can
+	 * call updkeys directly.
+	 */
+
 	XFS_BTREE_TRACE_CURSOR(cur, XBT_EXIT);
 	/* Return value means the next level up has something to do. */
 	*stat = 2;
@@ -3907,6 +4204,7 @@ xfs_btree_delete(
 	int			error;	/* error return value */
 	int			level;
 	int			i;
+	bool			joined = false;
 
 	XFS_BTREE_TRACE_CURSOR(cur, XBT_ENTRY);
 
@@ -3920,8 +4218,17 @@ xfs_btree_delete(
 		error = xfs_btree_delrec(cur, level, &i);
 		if (error)
 			goto error0;
+		if (i == 2)
+			joined = true;
 	}
 
+	/*
+	 * If we combined blocks as part of deleting the record, delrec won't
+	 * have updated the parent keys so we have to do that here.
+	 */
+	if (joined)
+		xfs_btree_updkeys_force(cur, 0);
+
 	if (i == 0) {
 		for (level = 1; level < cur->bc_nlevels; level++) {
 			if (cur->bc_ptrs[level] == 0) {
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index b99c018..a5ec6c7 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -126,6 +126,9 @@ struct xfs_btree_ops {
 	size_t	key_len;
 	size_t	rec_len;
 
+	/* flags */
+	uint	flags;
+
 	/* cursor operations */
 	struct xfs_btree_cur *(*dup_cursor)(struct xfs_btree_cur *);
 	void	(*update_cursor)(struct xfs_btree_cur *src,
@@ -162,11 +165,21 @@ struct xfs_btree_ops {
 				     union xfs_btree_rec *rec);
 	void	(*init_ptr_from_cur)(struct xfs_btree_cur *cur,
 				     union xfs_btree_ptr *ptr);
+	void	(*init_high_key_from_rec)(union xfs_btree_key *key,
+					  union xfs_btree_rec *rec);
 
 	/* difference between key value and cursor value */
 	__int64_t (*key_diff)(struct xfs_btree_cur *cur,
 			      union xfs_btree_key *key);
 
+	/*
+	 * Difference between key2 and key1 -- positive if key2 > key1,
+	 * negative if key2 < key1, and zero if equal.
+	 */
+	__int64_t (*diff_two_keys)(struct xfs_btree_cur *cur,
+				   union xfs_btree_key *key1,
+				   union xfs_btree_key *key2);
+
 	const struct xfs_buf_ops	*buf_ops;
 
 #if defined(DEBUG) || defined(XFS_WARN)
@@ -182,6 +195,9 @@ struct xfs_btree_ops {
 #endif
 };
 
+/* btree ops flags */
+#define XFS_BTREE_OPS_OVERLAPPING	(1<<0)	/* overlapping intervals */
+
 /*
  * Reasons for the update_lastrec method to be called.
  */
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 68f27f7..ffea28c 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -38,6 +38,7 @@ struct xlog_recover_item;
 struct xfs_buf_log_format;
 struct xfs_inode_log_format;
 struct xfs_bmbt_irec;
+struct xfs_btree_cur;
 
 DECLARE_EVENT_CLASS(xfs_attr_list_class,
 	TP_PROTO(struct xfs_attr_list_context *ctx),
@@ -2183,6 +2184,41 @@ DEFINE_DISCARD_EVENT(xfs_discard_toosmall);
 DEFINE_DISCARD_EVENT(xfs_discard_exclude);
 DEFINE_DISCARD_EVENT(xfs_discard_busy);
 
+/* btree cursor events */
+DECLARE_EVENT_CLASS(xfs_btree_cur_class,
+	TP_PROTO(struct xfs_btree_cur *cur, int level, struct xfs_buf *bp),
+	TP_ARGS(cur, level, bp),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_btnum_t, btnum)
+		__field(int, level)
+		__field(int, nlevels)
+		__field(int, ptr)
+		__field(xfs_daddr_t, daddr)
+	),
+	TP_fast_assign(
+		__entry->dev = cur->bc_mp->m_super->s_dev;
+		__entry->btnum = cur->bc_btnum;
+		__entry->level = level;
+		__entry->nlevels = cur->bc_nlevels;
+		__entry->ptr = cur->bc_ptrs[level];
+		__entry->daddr = bp->b_bn;
+	),
+	TP_printk("dev %d:%d btnum %d level %d/%d ptr %d daddr 0x%llx",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->btnum,
+		  __entry->level,
+		  __entry->nlevels,
+		  __entry->ptr,
+		  (unsigned long long)__entry->daddr)
+)
+
+#define DEFINE_BTREE_CUR_EVENT(name) \
+DEFINE_EVENT(xfs_btree_cur_class, name, \
+	TP_PROTO(struct xfs_btree_cur *cur, int level, struct xfs_buf *bp), \
+	TP_ARGS(cur, level, bp))
+DEFINE_BTREE_CUR_EVENT(xfs_btree_updkeys);
+
 #endif /* _TRACE_XFS_H */
 
 #undef TRACE_INCLUDE_PATH



* [PATCH 014/119] xfs: introduce interval queries on btrees
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (12 preceding siblings ...)
  2016-06-17  1:19 ` [PATCH 013/119] xfs: support btrees with overlapping intervals for keys Darrick J. Wong
@ 2016-06-17  1:19 ` Darrick J. Wong
  2016-06-22 15:18   ` Brian Foster
  2016-06-17  1:19 ` [PATCH 015/119] xfs: refactor btree owner change into a separate visit-blocks function Darrick J. Wong
                   ` (104 subsequent siblings)
  118 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:19 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Create a function to query all btree records mapping to a given range
of keys.  This will be used in subsequent patches to query the reverse
mapping btree for the extents mapped to a range of physical blocks,
though the generic code can be used for any range query.

v2: add some shortcuts so that we can jump out of processing once
we know there won't be any more records to find.
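The simple (non-overlapped) strategy described here — start at the leftmost candidate record, skip records that end below the low key, and bail out as soon as a record starts past the high key — can be sketched outside the kernel with a plain sorted array standing in for the btree leaf records. This is only an illustration of the control flow; none of the names below are XFS APIs:

```c
#include <stddef.h>

/* Illustrative record: an extent starting at 'start' spanning 'len' blocks. */
struct rec {
	unsigned long long start;
	unsigned long long len;
};

#define QUERY_RANGE_CONTINUE	0
#define QUERY_RANGE_ABORT	1

typedef int (*query_range_fn)(const struct rec *rec, void *priv);

/* Example callback: count how many records the query visits. */
static int count_cb(const struct rec *rec, void *priv)
{
	(void)rec;
	(*(int *)priv)++;
	return QUERY_RANGE_CONTINUE;
}

/*
 * Walk records sorted by 'start' and invoke fn() on every record that
 * overlaps [low, high]: skip records that end entirely below 'low',
 * and stop once a record starts beyond 'high' (since the records are
 * sorted, nothing further can overlap).
 */
static int query_range(const struct rec *recs, size_t nrecs,
		       unsigned long long low, unsigned long long high,
		       query_range_fn fn, void *priv)
{
	size_t i;
	int ret;

	for (i = 0; i < nrecs; i++) {
		if (recs[i].start + recs[i].len - 1 < low)
			continue;	/* record entirely below the range */
		if (recs[i].start > high)
			break;		/* shortcut: no more overlaps possible */
		ret = fn(&recs[i], priv);
		if (ret < 0 || ret == QUERY_RANGE_ABORT)
			return ret;
	}
	return 0;
}
```

The real code does the same walk with `xfs_btree_lookup(XFS_LOOKUP_LE)` plus `xfs_btree_increment()`, and the overlapped-btree variant replaces the linear scan with a keyed descent through interior nodes.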

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_btree.c |  249 +++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_btree.h |   22 +++-
 fs/xfs/xfs_trace.h        |    1 
 3 files changed, 267 insertions(+), 5 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index afcafd6..5f5cf23 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -4509,3 +4509,252 @@ xfs_btree_calc_size(
 	}
 	return rval;
 }
+
+/* Query a regular btree for all records overlapping a given interval. */
+STATIC int
+xfs_btree_simple_query_range(
+	struct xfs_btree_cur		*cur,
+	union xfs_btree_irec		*low_rec,
+	union xfs_btree_irec		*high_rec,
+	xfs_btree_query_range_fn	fn,
+	void				*priv)
+{
+	union xfs_btree_rec		*recp;
+	union xfs_btree_rec		rec;
+	union xfs_btree_key		low_key;
+	union xfs_btree_key		high_key;
+	union xfs_btree_key		rec_key;
+	__int64_t			diff;
+	int				stat;
+	bool				firstrec = true;
+	int				error;
+
+	ASSERT(cur->bc_ops->init_high_key_from_rec);
+
+	/* Find the keys of both ends of the interval. */
+	cur->bc_rec = *high_rec;
+	cur->bc_ops->init_rec_from_cur(cur, &rec);
+	cur->bc_ops->init_key_from_rec(&high_key, &rec);
+
+	cur->bc_rec = *low_rec;
+	cur->bc_ops->init_rec_from_cur(cur, &rec);
+	cur->bc_ops->init_key_from_rec(&low_key, &rec);
+
+	/* Find the leftmost record. */
+	stat = 0;
+	error = xfs_btree_lookup(cur, XFS_LOOKUP_LE, &stat);
+	if (error)
+		goto out;
+
+	while (stat) {
+		/* Find the record. */
+		error = xfs_btree_get_rec(cur, &recp, &stat);
+		if (error || !stat)
+			break;
+
+		/* Can we tell if this record is too low? */
+		if (firstrec) {
+			cur->bc_rec = *low_rec;
+			cur->bc_ops->init_high_key_from_rec(&rec_key, recp);
+			diff = cur->bc_ops->key_diff(cur, &rec_key);
+			if (diff < 0)
+				goto advloop;
+		}
+		firstrec = false;
+
+		/* Have we gone past the end? */
+		cur->bc_rec = *high_rec;
+		cur->bc_ops->init_key_from_rec(&rec_key, recp);
+		diff = cur->bc_ops->key_diff(cur, &rec_key);
+		if (diff > 0)
+			break;
+
+		/* Callback */
+		error = fn(cur, recp, priv);
+		if (error < 0 || error == XFS_BTREE_QUERY_RANGE_ABORT)
+			break;
+
+advloop:
+		/* Move on to the next record. */
+		error = xfs_btree_increment(cur, 0, &stat);
+		if (error)
+			break;
+	}
+
+out:
+	return error;
+}
+
+/*
+ * Query an overlapped interval btree for all records overlapping a given
+ * interval.
+ */
+STATIC int
+xfs_btree_overlapped_query_range(
+	struct xfs_btree_cur		*cur,
+	union xfs_btree_irec		*low_rec,
+	union xfs_btree_irec		*high_rec,
+	xfs_btree_query_range_fn	fn,
+	void				*priv)
+{
+	union xfs_btree_ptr		ptr;
+	union xfs_btree_ptr		*pp;
+	union xfs_btree_key		rec_key;
+	union xfs_btree_key		low_key;
+	union xfs_btree_key		high_key;
+	union xfs_btree_key		*lkp;
+	union xfs_btree_key		*hkp;
+	union xfs_btree_rec		rec;
+	union xfs_btree_rec		*recp;
+	struct xfs_btree_block		*block;
+	__int64_t			ldiff;
+	__int64_t			hdiff;
+	int				level;
+	struct xfs_buf			*bp;
+	int				i;
+	int				error;
+
+	/* Find the keys of both ends of the interval. */
+	cur->bc_rec = *high_rec;
+	cur->bc_ops->init_rec_from_cur(cur, &rec);
+	cur->bc_ops->init_key_from_rec(&high_key, &rec);
+
+	cur->bc_rec = *low_rec;
+	cur->bc_ops->init_rec_from_cur(cur, &rec);
+	cur->bc_ops->init_key_from_rec(&low_key, &rec);
+
+	/* Load the root of the btree. */
+	level = cur->bc_nlevels - 1;
+	cur->bc_ops->init_ptr_from_cur(cur, &ptr);
+	error = xfs_btree_lookup_get_block(cur, level, &ptr, &block);
+	if (error)
+		return error;
+	xfs_btree_get_block(cur, level, &bp);
+	trace_xfs_btree_overlapped_query_range(cur, level, bp);
+#ifdef DEBUG
+	error = xfs_btree_check_block(cur, block, level, bp);
+	if (error)
+		goto out;
+#endif
+	cur->bc_ptrs[level] = 1;
+
+	while (level < cur->bc_nlevels) {
+		block = XFS_BUF_TO_BLOCK(cur->bc_bufs[level]);
+
+		if (level == 0) {
+			/* End of leaf, pop back towards the root. */
+			if (cur->bc_ptrs[level] >
+			    be16_to_cpu(block->bb_numrecs)) {
+leaf_pop_up:
+				if (level < cur->bc_nlevels - 1)
+					cur->bc_ptrs[level + 1]++;
+				level++;
+				continue;
+			}
+
+			recp = xfs_btree_rec_addr(cur, cur->bc_ptrs[0], block);
+
+			cur->bc_ops->init_high_key_from_rec(&rec_key, recp);
+			ldiff = cur->bc_ops->diff_two_keys(cur, &low_key,
+					&rec_key);
+
+			cur->bc_ops->init_key_from_rec(&rec_key, recp);
+			hdiff = cur->bc_ops->diff_two_keys(cur, &rec_key,
+					&high_key);
+
+			/* If the record matches, callback */
+			if (ldiff >= 0 && hdiff >= 0) {
+				error = fn(cur, recp, priv);
+				if (error < 0 ||
+				    error == XFS_BTREE_QUERY_RANGE_ABORT)
+					break;
+			} else if (hdiff < 0) {
+				/* Record is larger than high key; pop. */
+				goto leaf_pop_up;
+			}
+			cur->bc_ptrs[level]++;
+			continue;
+		}
+
+		/* End of node, pop back towards the root. */
+		if (cur->bc_ptrs[level] > be16_to_cpu(block->bb_numrecs)) {
+node_pop_up:
+			if (level < cur->bc_nlevels - 1)
+				cur->bc_ptrs[level + 1]++;
+			level++;
+			continue;
+		}
+
+		lkp = xfs_btree_key_addr(cur, cur->bc_ptrs[level], block);
+		hkp = xfs_btree_high_key_addr(cur, cur->bc_ptrs[level], block);
+		pp = xfs_btree_ptr_addr(cur, cur->bc_ptrs[level], block);
+
+		ldiff = cur->bc_ops->diff_two_keys(cur, &low_key, hkp);
+		hdiff = cur->bc_ops->diff_two_keys(cur, lkp, &high_key);
+
+		/* If the key matches, drill another level deeper. */
+		if (ldiff >= 0 && hdiff >= 0) {
+			level--;
+			error = xfs_btree_lookup_get_block(cur, level, pp,
+					&block);
+			if (error)
+				goto out;
+			xfs_btree_get_block(cur, level, &bp);
+			trace_xfs_btree_overlapped_query_range(cur, level, bp);
+#ifdef DEBUG
+			error = xfs_btree_check_block(cur, block, level, bp);
+			if (error)
+				goto out;
+#endif
+			cur->bc_ptrs[level] = 1;
+			continue;
+		} else if (hdiff < 0) {
+			/* The low key is larger than the upper range; pop. */
+			goto node_pop_up;
+		}
+		cur->bc_ptrs[level]++;
+	}
+
+out:
+	/*
+	 * If we don't end this function with the cursor pointing at a record
+	 * block, a subsequent non-error cursor deletion will not release
+	 * node-level buffers, causing a buffer leak.  This is quite possible
+	 * with a zero-results range query, so release the buffers if we
+	 * failed to return any results.
+	 */
+	if (cur->bc_bufs[0] == NULL) {
+		for (i = 0; i < cur->bc_nlevels; i++) {
+			if (cur->bc_bufs[i]) {
+				xfs_trans_brelse(cur->bc_tp, cur->bc_bufs[i]);
+				cur->bc_bufs[i] = NULL;
+				cur->bc_ptrs[i] = 0;
+				cur->bc_ra[i] = 0;
+			}
+		}
+	}
+
+	return error;
+}
+
+/*
+ * Query a btree for all records overlapping a given interval of keys.  The
+ * supplied function will be called with each record found; return one of the
+ * XFS_BTREE_QUERY_RANGE_{CONTINUE,ABORT} values or the usual negative error
+ * code.  This function returns XFS_BTREE_QUERY_RANGE_ABORT, zero, or a
+ * negative error code.
+ */
+int
+xfs_btree_query_range(
+	struct xfs_btree_cur		*cur,
+	union xfs_btree_irec		*low_rec,
+	union xfs_btree_irec		*high_rec,
+	xfs_btree_query_range_fn	fn,
+	void				*priv)
+{
+	if (!(cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING))
+		return xfs_btree_simple_query_range(cur, low_rec,
+				high_rec, fn, priv);
+	return xfs_btree_overlapped_query_range(cur, low_rec, high_rec,
+			fn, priv);
+}
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index a5ec6c7..898fee5 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -206,6 +206,12 @@ struct xfs_btree_ops {
 #define LASTREC_DELREC	2
 
 
+union xfs_btree_irec {
+	xfs_alloc_rec_incore_t		a;
+	xfs_bmbt_irec_t			b;
+	xfs_inobt_rec_incore_t		i;
+};
+
 /*
  * Btree cursor structure.
  * This collects all information needed by the btree code in one place.
@@ -216,11 +222,7 @@ typedef struct xfs_btree_cur
 	struct xfs_mount	*bc_mp;	/* file system mount struct */
 	const struct xfs_btree_ops *bc_ops;
 	uint			bc_flags; /* btree features - below */
-	union {
-		xfs_alloc_rec_incore_t	a;
-		xfs_bmbt_irec_t		b;
-		xfs_inobt_rec_incore_t	i;
-	}		bc_rec;		/* current insert/search record value */
+	union xfs_btree_irec	bc_rec;	/* current insert/search record value */
 	struct xfs_buf	*bc_bufs[XFS_BTREE_MAXLEVELS];	/* buf ptr per level */
 	int		bc_ptrs[XFS_BTREE_MAXLEVELS];	/* key/record # */
 	__uint8_t	bc_ra[XFS_BTREE_MAXLEVELS];	/* readahead bits */
@@ -494,4 +496,14 @@ xfs_extlen_t xfs_btree_calc_size(struct xfs_mount *mp, uint *limits,
 uint xfs_btree_compute_maxlevels(struct xfs_mount *mp, uint *limits,
 		unsigned long len);
 
+/* return codes */
+#define XFS_BTREE_QUERY_RANGE_CONTINUE	0	/* keep iterating */
+#define XFS_BTREE_QUERY_RANGE_ABORT	1	/* stop iterating */
+typedef int (*xfs_btree_query_range_fn)(struct xfs_btree_cur *cur,
+		union xfs_btree_rec *rec, void *priv);
+
+int xfs_btree_query_range(struct xfs_btree_cur *cur,
+		union xfs_btree_irec *low_rec, union xfs_btree_irec *high_rec,
+		xfs_btree_query_range_fn fn, void *priv);
+
 #endif	/* __XFS_BTREE_H__ */
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index ffea28c..f0ac9c9 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -2218,6 +2218,7 @@ DEFINE_EVENT(xfs_btree_cur_class, name, \
 	TP_PROTO(struct xfs_btree_cur *cur, int level, struct xfs_buf *bp), \
 	TP_ARGS(cur, level, bp))
 DEFINE_BTREE_CUR_EVENT(xfs_btree_updkeys);
+DEFINE_BTREE_CUR_EVENT(xfs_btree_overlapped_query_range);
 
 #endif /* _TRACE_XFS_H */
 



* [PATCH 015/119] xfs: refactor btree owner change into a separate visit-blocks function
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (13 preceding siblings ...)
  2016-06-17  1:19 ` [PATCH 014/119] xfs: introduce interval queries on btrees Darrick J. Wong
@ 2016-06-17  1:19 ` Darrick J. Wong
  2016-06-23 17:19   ` Brian Foster
  2016-06-17  1:19 ` [PATCH 016/119] xfs: move deferred operations into a separate file Darrick J. Wong
                   ` (103 subsequent siblings)
  118 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:19 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Refactor the btree owner-change code into a more generic apparatus
that visits all blocks in a btree.  We'll use this in a subsequent
patch to count btree blocks for AG reservations.
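The shape of this refactor — pull the traversal out into a generic walker that takes a callback and an opaque `void *data`, then reimplement the original owner change as one such callback — can be sketched with a toy sibling chain standing in for a btree level. The names here are illustrative only, not the XFS functions:

```c
#include <stddef.h>

/* Toy stand-in for a chain of btree blocks at one level. */
struct block {
	unsigned long long owner;
	struct block *rightsib;
};

typedef int (*visit_blocks_fn)(struct block *blk, void *data);

/* Generic walk: call fn() on every block, stopping on the first error. */
static int visit_blocks(struct block *head, visit_blocks_fn fn, void *data)
{
	struct block *blk;
	int error;

	for (blk = head; blk != NULL; blk = blk->rightsib) {
		error = fn(blk, data);
		if (error)
			return error;
	}
	return 0;
}

/* The old special-purpose operation, reimplemented as a callback. */
struct change_owner_info {
	unsigned long long new_owner;
};

static int block_change_owner(struct block *blk, void *data)
{
	struct change_owner_info *coi = data;

	blk->owner = coi->new_owner;
	return 0;
}

static int change_owner(struct block *head, unsigned long long new_owner)
{
	struct change_owner_info coi = { new_owner };

	return visit_blocks(head, block_change_owner, &coi);
}
```

The kernel version additionally walks level by level from the root, issues readahead for the right sibling, and uses `-ENOENT` from the per-block helper to signal the end of a level rather than a real error.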

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_btree.c |  141 +++++++++++++++++++++++++++++----------------
 fs/xfs/libxfs/xfs_btree.h |    5 ++
 2 files changed, 96 insertions(+), 50 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 5f5cf23..eac876a 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -4289,6 +4289,81 @@ xfs_btree_get_rec(
 	return 0;
 }
 
+/* Visit a block in a btree. */
+STATIC int
+xfs_btree_visit_block(
+	struct xfs_btree_cur		*cur,
+	int				level,
+	xfs_btree_visit_blocks_fn	fn,
+	void				*data)
+{
+	struct xfs_btree_block		*block;
+	struct xfs_buf			*bp;
+	union xfs_btree_ptr		rptr;
+	int				error;
+
+	/* do right sibling readahead */
+	xfs_btree_readahead(cur, level, XFS_BTCUR_RIGHTRA);
+	block = xfs_btree_get_block(cur, level, &bp);
+
+	/* process the block */
+	error = fn(cur, level, data);
+	if (error)
+		return error;
+
+	/* now read rh sibling block for next iteration */
+	xfs_btree_get_sibling(cur, block, &rptr, XFS_BB_RIGHTSIB);
+	if (xfs_btree_ptr_is_null(cur, &rptr))
+		return -ENOENT;
+
+	return xfs_btree_lookup_get_block(cur, level, &rptr, &block);
+}
+
+
+/* Visit every block in a btree. */
+int
+xfs_btree_visit_blocks(
+	struct xfs_btree_cur		*cur,
+	xfs_btree_visit_blocks_fn	fn,
+	void				*data)
+{
+	union xfs_btree_ptr		lptr;
+	int				level;
+	struct xfs_btree_block		*block = NULL;
+	int				error = 0;
+
+	cur->bc_ops->init_ptr_from_cur(cur, &lptr);
+
+	/* for each level */
+	for (level = cur->bc_nlevels - 1; level >= 0; level--) {
+		/* grab the left hand block */
+		error = xfs_btree_lookup_get_block(cur, level, &lptr, &block);
+		if (error)
+			return error;
+
+		/* readahead the left most block for the next level down */
+		if (level > 0) {
+			union xfs_btree_ptr     *ptr;
+
+			ptr = xfs_btree_ptr_addr(cur, 1, block);
+			xfs_btree_readahead_ptr(cur, ptr, 1);
+
+			/* save for the next iteration of the loop */
+			lptr = *ptr;
+		}
+
+		/* for each buffer in the level */
+		do {
+			error = xfs_btree_visit_block(cur, level, fn, data);
+		} while (!error);
+
+		if (error != -ENOENT)
+			return error;
+	}
+
+	return 0;
+}
+
 /*
  * Change the owner of a btree.
  *
@@ -4313,26 +4388,27 @@ xfs_btree_get_rec(
  * just queue the modified buffer as delayed write buffer so the transaction
  * recovery completion writes the changes to disk.
  */
+struct xfs_btree_block_change_owner_info {
+	__uint64_t		new_owner;
+	struct list_head	*buffer_list;
+};
+
 static int
 xfs_btree_block_change_owner(
 	struct xfs_btree_cur	*cur,
 	int			level,
-	__uint64_t		new_owner,
-	struct list_head	*buffer_list)
+	void			*data)
 {
+	struct xfs_btree_block_change_owner_info	*bbcoi = data;
 	struct xfs_btree_block	*block;
 	struct xfs_buf		*bp;
-	union xfs_btree_ptr     rptr;
-
-	/* do right sibling readahead */
-	xfs_btree_readahead(cur, level, XFS_BTCUR_RIGHTRA);
 
 	/* modify the owner */
 	block = xfs_btree_get_block(cur, level, &bp);
 	if (cur->bc_flags & XFS_BTREE_LONG_PTRS)
-		block->bb_u.l.bb_owner = cpu_to_be64(new_owner);
+		block->bb_u.l.bb_owner = cpu_to_be64(bbcoi->new_owner);
 	else
-		block->bb_u.s.bb_owner = cpu_to_be32(new_owner);
+		block->bb_u.s.bb_owner = cpu_to_be32(bbcoi->new_owner);
 
 	/*
 	 * If the block is a root block hosted in an inode, we might not have a
@@ -4346,19 +4422,14 @@ xfs_btree_block_change_owner(
 			xfs_trans_ordered_buf(cur->bc_tp, bp);
 			xfs_btree_log_block(cur, bp, XFS_BB_OWNER);
 		} else {
-			xfs_buf_delwri_queue(bp, buffer_list);
+			xfs_buf_delwri_queue(bp, bbcoi->buffer_list);
 		}
 	} else {
 		ASSERT(cur->bc_flags & XFS_BTREE_ROOT_IN_INODE);
 		ASSERT(level == cur->bc_nlevels - 1);
 	}
 
-	/* now read rh sibling block for next iteration */
-	xfs_btree_get_sibling(cur, block, &rptr, XFS_BB_RIGHTSIB);
-	if (xfs_btree_ptr_is_null(cur, &rptr))
-		return -ENOENT;
-
-	return xfs_btree_lookup_get_block(cur, level, &rptr, &block);
+	return 0;
 }
 
 int
@@ -4367,43 +4438,13 @@ xfs_btree_change_owner(
 	__uint64_t		new_owner,
 	struct list_head	*buffer_list)
 {
-	union xfs_btree_ptr     lptr;
-	int			level;
-	struct xfs_btree_block	*block = NULL;
-	int			error = 0;
+	struct xfs_btree_block_change_owner_info	bbcoi;
 
-	cur->bc_ops->init_ptr_from_cur(cur, &lptr);
+	bbcoi.new_owner = new_owner;
+	bbcoi.buffer_list = buffer_list;
 
-	/* for each level */
-	for (level = cur->bc_nlevels - 1; level >= 0; level--) {
-		/* grab the left hand block */
-		error = xfs_btree_lookup_get_block(cur, level, &lptr, &block);
-		if (error)
-			return error;
-
-		/* readahead the left most block for the next level down */
-		if (level > 0) {
-			union xfs_btree_ptr     *ptr;
-
-			ptr = xfs_btree_ptr_addr(cur, 1, block);
-			xfs_btree_readahead_ptr(cur, ptr, 1);
-
-			/* save for the next iteration of the loop */
-			lptr = *ptr;
-		}
-
-		/* for each buffer in the level */
-		do {
-			error = xfs_btree_block_change_owner(cur, level,
-							     new_owner,
-							     buffer_list);
-		} while (!error);
-
-		if (error != -ENOENT)
-			return error;
-	}
-
-	return 0;
+	return xfs_btree_visit_blocks(cur, xfs_btree_block_change_owner,
+			&bbcoi);
 }
 
 /**
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index 898fee5..0ec3055 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -506,4 +506,9 @@ int xfs_btree_query_range(struct xfs_btree_cur *cur,
 		union xfs_btree_irec *low_rec, union xfs_btree_irec *high_rec,
 		xfs_btree_query_range_fn fn, void *priv);
 
+typedef int (*xfs_btree_visit_blocks_fn)(struct xfs_btree_cur *cur, int level,
+		void *data);
+int xfs_btree_visit_blocks(struct xfs_btree_cur *cur,
+		xfs_btree_visit_blocks_fn fn, void *data);
+
 #endif	/* __XFS_BTREE_H__ */



* [PATCH 016/119] xfs: move deferred operations into a separate file
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (14 preceding siblings ...)
  2016-06-17  1:19 ` [PATCH 015/119] xfs: refactor btree owner change into a separate visit-blocks function Darrick J. Wong
@ 2016-06-17  1:19 ` Darrick J. Wong
  2016-06-27 13:14   ` Brian Foster
  2016-06-17  1:19 ` [PATCH 017/119] xfs: add tracepoints for the deferred ops mechanism Darrick J. Wong
                   ` (102 subsequent siblings)
  118 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:19 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

All the code around struct xfs_bmap_free basically implements a
deferred operation framework through which we can roll transactions
(to unlock buffers and avoid violating lock order rules) while
managing all the necessary log redo items.  Previously we only used
this code to free extents after some sort of mapping operation, but
with the advent of rmap and reflink, we suddenly need to do more than
that.

With that in mind, xfs_bmap_free really becomes a deferred ops control
structure.  Rename the structure and move the deferred ops into their
own file to avoid further bloating of the bmap code.

v2: actually sort the work items by AG to avoid deadlocks
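The intake-list batching that `xfs_defer_add()` performs — append the new work item to the last pending batch if it has the same type and room left, otherwise open a fresh batch at the tail — can be sketched with a minimal singly linked list. This is a simplified illustration of the list discipline only; the real structure also carries the intent item, the work list, and the op-type vtable:

```c
#include <stdlib.h>

/* Toy deferred-work batch: items of one type are grouped together. */
struct pending {
	int type;
	int count;
	int max_items;		/* 0 means "no limit" */
	struct pending *next;
};

struct defer_ops {
	struct pending *head, *tail;
};

/*
 * Mirror the intake logic: reuse the last batch if it has the same
 * type and has not hit its item limit, else start a new batch at the
 * end of the list.  (Allocation failure is ignored in this sketch.)
 */
static void defer_add(struct defer_ops *dop, int type, int max_items)
{
	struct pending *dfp = dop->tail;

	if (dfp && (dfp->type != type ||
		    (dfp->max_items && dfp->count >= dfp->max_items)))
		dfp = NULL;
	if (!dfp) {
		dfp = calloc(1, sizeof(*dfp));
		dfp->type = type;
		dfp->max_items = max_items;
		if (dop->tail)
			dop->tail->next = dfp;
		else
			dop->head = dfp;
		dop->tail = dfp;
	}
	dfp->count++;
}
```

Capping `max_items` per batch is what keeps a single log intent item from exceeding the transaction reservation, as the comment block in the new file explains.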

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile           |    2 
 fs/xfs/libxfs/xfs_defer.c |  471 +++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_defer.h |   96 +++++++++
 fs/xfs/xfs_defer_item.c   |   36 +++
 fs/xfs/xfs_super.c        |    2 
 5 files changed, 607 insertions(+)
 create mode 100644 fs/xfs/libxfs/xfs_defer.c
 create mode 100644 fs/xfs/libxfs/xfs_defer.h
 create mode 100644 fs/xfs/xfs_defer_item.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 3542d94..ad46a2d 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -39,6 +39,7 @@ xfs-y				+= $(addprefix libxfs/, \
 				   xfs_btree.o \
 				   xfs_da_btree.o \
 				   xfs_da_format.o \
+				   xfs_defer.o \
 				   xfs_dir2.o \
 				   xfs_dir2_block.o \
 				   xfs_dir2_data.o \
@@ -66,6 +67,7 @@ xfs-y				+= xfs_aops.o \
 				   xfs_attr_list.o \
 				   xfs_bmap_util.o \
 				   xfs_buf.o \
+				   xfs_defer_item.o \
 				   xfs_dir2_readdir.o \
 				   xfs_discard.o \
 				   xfs_error.o \
diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
new file mode 100644
index 0000000..ad14e33e
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_defer.c
@@ -0,0 +1,471 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_bit.h"
+#include "xfs_sb.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_trans.h"
+#include "xfs_trace.h"
+
+/*
+ * Deferred Operations in XFS
+ *
+ * Due to the way locking rules work in XFS, certain transactions (block
+ * mapping and unmapping, typically) have permanent reservations so that
+ * we can roll the transaction to adhere to AG locking order rules and
+ * to unlock buffers between metadata updates.  Prior to rmap/reflink,
+ * the mapping code had a mechanism to perform these deferrals for
+ * extents that were going to be freed; this code makes that facility
+ * more generic.
+ *
+ * When adding the reverse mapping and reflink features, it became
+ * necessary to perform complex remapping multi-transactions to comply
+ * with AG locking order rules, and to be able to spread a single
+ * refcount update operation (an operation on an n-block extent can
+ * update as many as n records!) among multiple transactions.  XFS can
+ * roll a transaction to facilitate this, but using this facility
+ * requires us to log "intent" items in case log recovery needs to
+ * redo the operation, and to log "done" items to indicate that redo
+ * is not necessary.
+ *
+ * The xfs_defer_ops structure tracks incoming deferred work (which is
+ * work that has not yet had an intent logged) in xfs_defer_intake.
+ * There is one xfs_defer_intake for each type of deferrable
+ * operation.  Each new deferral is placed in the op's intake list,
+ * where it waits for the caller to finish the deferred operations.
+ *
+ * Finishing a set of deferred operations is an involved process.  To
+ * start, we define "rolling a deferred-op transaction" as follows:
+ *
+ * > For each xfs_defer_intake,
+ *   - Sort the items on the intake list in AG order.
+ *   - Create a log intent item for that type.
+ *   - Attach to it the items on the intake list.
+ *   - Stash the intent+items for later in an xfs_defer_pending.
+ *   - Attach the xfs_defer_pending to the xfs_defer_ops work list.
+ * > Roll the transaction.
+ *
+ * NOTE: To avoid exceeding the transaction reservation, we limit the
+ * number of items that we attach to a given xfs_defer_pending.
+ *
+ * The actual finishing process looks like this:
+ *
+ * > For each xfs_defer_pending in the xfs_defer_ops work list,
+ *   - Roll the deferred-op transaction as above.
+ *   - Create a log done item for that type, and attach it to the
+ *     intent item.
+ *   - For each work item attached to the intent item,
+ *     * Perform the described action.
+ *     * Attach the work item to the log done item.
+ *     * If the result of doing the work was -EAGAIN, log a fresh
+ *       intent item and attach all remaining work items to it.  Put
+ *       the xfs_defer_pending item back on the work list, and repeat
+ *       the loop.  This allows us to make partial progress even if
+ *       the transaction is too full to finish the job.
+ *
+ * The key here is that we must log an intent item for all pending
+ * work items every time we roll the transaction, and that we must log
+ * a done item as soon as the work is completed.  With this mechanism
+ * we can perform complex remapping operations, chaining intent items
+ * as needed.
+ *
+ * This is an example of remapping the extent (E, E+B) into file X at
+ * offset A and dealing with the extent (C, C+B) already being mapped
+ * there:
+ * +-------------------------------------------------+
+ * | Unmap file X startblock C offset A length B     | t0
+ * | Intent to reduce refcount for extent (C, B)     |
+ * | Intent to remove rmap (X, C, A, B)              |
+ * | Intent to free extent (D, 1) (bmbt block)       |
+ * | Intent to map (X, A, B) at startblock E         |
+ * +-------------------------------------------------+
+ * | Map file X startblock E offset A length B       | t1
+ * | Done mapping (X, E, A, B)                       |
+ * | Intent to increase refcount for extent (E, B)   |
+ * | Intent to add rmap (X, E, A, B)                 |
+ * +-------------------------------------------------+
+ * | Reduce refcount for extent (C, B)               | t2
+ * | Done reducing refcount for extent (C, B)        |
+ * | Increase refcount for extent (E, B)             |
+ * | Done increasing refcount for extent (E, B)      |
+ * | Intent to free extent (C, B)                    |
+ * | Intent to free extent (F, 1) (refcountbt block) |
+ * | Intent to remove rmap (F, 1, REFC)              |
+ * +-------------------------------------------------+
+ * | Remove rmap (X, C, A, B)                        | t3
+ * | Done removing rmap (X, C, A, B)                 |
+ * | Add rmap (X, E, A, B)                           |
+ * | Done adding rmap (X, E, A, B)                   |
+ * | Remove rmap (F, 1, REFC)                        |
+ * | Done removing rmap (F, 1, REFC)                 |
+ * +-------------------------------------------------+
+ * | Free extent (C, B)                              | t4
+ * | Done freeing extent (C, B)                      |
+ * | Free extent (D, 1)                              |
+ * | Done freeing extent (D, 1)                      |
+ * | Free extent (F, 1)                              |
+ * | Done freeing extent (F, 1)                      |
+ * +-------------------------------------------------+
+ *
+ * If we should crash before t2 commits, log recovery replays
+ * the following intent items:
+ *
+ * - Intent to reduce refcount for extent (C, B)
+ * - Intent to remove rmap (X, C, A, B)
+ * - Intent to free extent (D, 1) (bmbt block)
+ * - Intent to increase refcount for extent (E, B)
+ * - Intent to add rmap (X, E, A, B)
+ *
+ * In the process of recovering, it should also generate and take care
+ * of these intent items:
+ *
+ * - Intent to free extent (C, B)
+ * - Intent to free extent (F, 1) (refcountbt block)
+ * - Intent to remove rmap (F, 1, REFC)
+ */
+
+static const struct xfs_defer_op_type *defer_op_types[XFS_DEFER_OPS_TYPE_MAX];
+
+/*
+ * For each pending item in the intake list, log its intent item and the
+ * associated extents, then add the entire intake list to the end of
+ * the pending list.
+ */
+STATIC void
+xfs_defer_intake_work(
+	struct xfs_trans		*tp,
+	struct xfs_defer_ops		*dop)
+{
+	struct list_head		*li;
+	struct xfs_defer_pending	*dfp;
+
+	list_for_each_entry(dfp, &dop->dop_intake, dfp_list) {
+		dfp->dfp_intent = dfp->dfp_type->create_intent(tp,
+				dfp->dfp_count);
+		list_sort(tp->t_mountp, &dfp->dfp_work,
+				dfp->dfp_type->diff_items);
+		list_for_each(li, &dfp->dfp_work)
+			dfp->dfp_type->log_item(tp, dfp->dfp_intent, li);
+	}
+
+	list_splice_tail_init(&dop->dop_intake, &dop->dop_pending);
+}
+
+/* Abort all the intents that were committed. */
+STATIC void
+xfs_defer_trans_abort(
+	struct xfs_trans		*tp,
+	struct xfs_defer_ops		*dop,
+	int				error)
+{
+	struct xfs_defer_pending	*dfp;
+
+	/*
+	 * If the transaction was committed, drop the intent reference
+	 * since we're bailing out of here. The other reference is
+	 * dropped when the intent hits the AIL.  If the transaction
+	 * was not committed, the intent is freed by the intent item
+	 * unlock handler on abort.
+	 */
+	if (!dop->dop_committed)
+		return;
+
+	/* Abort intent items. */
+	list_for_each_entry(dfp, &dop->dop_pending, dfp_list) {
+		if (dfp->dfp_committed)
+			dfp->dfp_type->abort_intent(dfp->dfp_intent);
+	}
+
+	/* Shut down FS. */
+	xfs_force_shutdown(tp->t_mountp, (error == -EFSCORRUPTED) ?
+			SHUTDOWN_CORRUPT_INCORE : SHUTDOWN_META_IO_ERROR);
+}
+
+/* Roll a transaction so we can do some deferred op processing. */
+STATIC int
+xfs_defer_trans_roll(
+	struct xfs_trans		**tp,
+	struct xfs_defer_ops		*dop,
+	struct xfs_inode		*ip)
+{
+	int				i;
+	int				error;
+
+	/* Log all the joined inodes except the one we passed in. */
+	for (i = 0; i < XFS_DEFER_OPS_NR_INODES && dop->dop_inodes[i]; i++) {
+		if (dop->dop_inodes[i] == ip)
+			continue;
+		xfs_trans_log_inode(*tp, dop->dop_inodes[i], XFS_ILOG_CORE);
+	}
+
+	/* Roll the transaction. */
+	error = xfs_trans_roll(tp, ip);
+	if (error) {
+		xfs_defer_trans_abort(*tp, dop, error);
+		return error;
+	}
+	dop->dop_committed = true;
+
+	/* Rejoin the joined inodes to the new transaction, except ip. */
+	for (i = 0; i < XFS_DEFER_OPS_NR_INODES && dop->dop_inodes[i]; i++) {
+		if (dop->dop_inodes[i] == ip)
+			continue;
+		xfs_trans_ijoin(*tp, dop->dop_inodes[i], 0);
+	}
+
+	return error;
+}
+
+/* Do we have any work items to finish? */
+bool
+xfs_defer_has_unfinished_work(
+	struct xfs_defer_ops		*dop)
+{
+	return !list_empty(&dop->dop_pending) || !list_empty(&dop->dop_intake);
+}
+
+/*
+ * Add this inode to the deferred op.  Each joined inode is relogged
+ * each time we roll the transaction, in addition to any inode passed
+ * to xfs_defer_finish().
+ */
+int
+xfs_defer_join(
+	struct xfs_defer_ops		*dop,
+	struct xfs_inode		*ip)
+{
+	int				i;
+
+	for (i = 0; i < XFS_DEFER_OPS_NR_INODES; i++) {
+		if (dop->dop_inodes[i] == ip)
+			return 0;
+		else if (dop->dop_inodes[i] == NULL) {
+			dop->dop_inodes[i] = ip;
+			return 0;
+		}
+	}
+
+	return -EFSCORRUPTED;
+}
+
+/*
+ * Finish all the pending work.  This involves logging intent items for
+ * any work items that wandered in since the last transaction roll (if
+ * one has even happened), rolling the transaction, and finishing the
+ * work items in the first item on the logged-and-pending list.
+ *
+ * If an inode is provided, relog it to the new transaction.
+ */
+int
+xfs_defer_finish(
+	struct xfs_trans		**tp,
+	struct xfs_defer_ops		*dop,
+	struct xfs_inode		*ip)
+{
+	struct xfs_defer_pending	*dfp;
+	struct list_head		*li;
+	struct list_head		*n;
+	void				*done_item = NULL;
+	void				*state;
+	int				error = 0;
+	void				(*cleanup_fn)(struct xfs_trans *, void *, int);
+
+	ASSERT((*tp)->t_flags & XFS_TRANS_PERM_LOG_RES);
+
+	/* Until we run out of pending work to finish... */
+	while (xfs_defer_has_unfinished_work(dop)) {
+		/* Log intents for work items sitting in the intake. */
+		xfs_defer_intake_work(*tp, dop);
+
+		/* Roll the transaction. */
+		error = xfs_defer_trans_roll(tp, dop, ip);
+		if (error)
+			goto out;
+
+		/* Mark all pending intents as committed. */
+		list_for_each_entry_reverse(dfp, &dop->dop_pending, dfp_list) {
+			if (dfp->dfp_committed)
+				break;
+			dfp->dfp_committed = true;
+		}
+
+		/* Log an intent-done item for the first pending item. */
+		dfp = list_first_entry(&dop->dop_pending,
+				struct xfs_defer_pending, dfp_list);
+		done_item = dfp->dfp_type->create_done(*tp, dfp->dfp_intent,
+				dfp->dfp_count);
+		cleanup_fn = dfp->dfp_type->finish_cleanup;
+
+		/* Finish the work items. */
+		state = NULL;
+		list_for_each_safe(li, n, &dfp->dfp_work) {
+			list_del(li);
+			dfp->dfp_count--;
+			error = dfp->dfp_type->finish_item(*tp, dop, li,
+					done_item, &state);
+			if (error == -EAGAIN) {
+				/*
+				 * If the caller needs to try again, put the
+				 * item back on the pending list and jump out
+				 * for further processing.
+				 */
+				list_add(li, &dfp->dfp_work);
+				dfp->dfp_count++;
+				break;
+			} else if (error) {
+				/*
+				 * Clean up after ourselves and jump out.
+				 * xfs_defer_cancel will take care of freeing
+				 * all these lists and stuff.
+				 */
+				if (cleanup_fn)
+					cleanup_fn(*tp, state, error);
+				xfs_defer_trans_abort(*tp, dop, error);
+				goto out;
+			}
+		}
+		if (error == -EAGAIN) {
+			/*
+			 * Log a new intent, relog all the remaining work
+			 * items to the new intent, attach the new intent to
+			 * the dfp, and leave the dfp at the head of the list
+			 * for further processing.
+			 */
+			dfp->dfp_intent = dfp->dfp_type->create_intent(*tp,
+					dfp->dfp_count);
+			list_for_each(li, &dfp->dfp_work)
+				dfp->dfp_type->log_item(*tp, dfp->dfp_intent,
+						li);
+		} else {
+			/* Done with the dfp, free it. */
+			list_del(&dfp->dfp_list);
+			kmem_free(dfp);
+		}
+
+		if (cleanup_fn)
+			cleanup_fn(*tp, state, error);
+	}
+
+out:
+	return error;
+}
+
+/*
+ * Free up any items left in the list.
+ */
+void
+xfs_defer_cancel(
+	struct xfs_defer_ops		*dop)
+{
+	struct xfs_defer_pending	*dfp;
+	struct xfs_defer_pending	*pli;
+	struct list_head		*pwi;
+	struct list_head		*n;
+
+	/*
+	 * Free the pending items.  Caller should already have arranged
+	 * for the intent items to be released.
+	 */
+	list_for_each_entry_safe(dfp, pli, &dop->dop_intake, dfp_list) {
+		list_del(&dfp->dfp_list);
+		list_for_each_safe(pwi, n, &dfp->dfp_work) {
+			list_del(pwi);
+			dfp->dfp_count--;
+			dfp->dfp_type->cancel_item(pwi);
+		}
+		ASSERT(dfp->dfp_count == 0);
+		kmem_free(dfp);
+	}
+	list_for_each_entry_safe(dfp, pli, &dop->dop_pending, dfp_list) {
+		list_del(&dfp->dfp_list);
+		list_for_each_safe(pwi, n, &dfp->dfp_work) {
+			list_del(pwi);
+			dfp->dfp_count--;
+			dfp->dfp_type->cancel_item(pwi);
+		}
+		ASSERT(dfp->dfp_count == 0);
+		kmem_free(dfp);
+	}
+}
+
+/* Add an item for later deferred processing. */
+void
+xfs_defer_add(
+	struct xfs_defer_ops		*dop,
+	enum xfs_defer_ops_type		type,
+	struct list_head		*li)
+{
+	struct xfs_defer_pending	*dfp = NULL;
+
+	/*
+	 * If the last pending item on the intake list has the same type
+	 * and is not yet full, reuse it.  Else, create a new pending
+	 * item at the end of the intake list.
+	 */
+	if (!list_empty(&dop->dop_intake)) {
+		dfp = list_last_entry(&dop->dop_intake,
+				struct xfs_defer_pending, dfp_list);
+		if (dfp->dfp_type->type != type ||
+		    (dfp->dfp_type->max_items &&
+		     dfp->dfp_count >= dfp->dfp_type->max_items))
+			dfp = NULL;
+	}
+	if (!dfp) {
+		dfp = kmem_alloc(sizeof(struct xfs_defer_pending),
+				KM_SLEEP | KM_NOFS);
+		dfp->dfp_type = defer_op_types[type];
+		dfp->dfp_committed = false;
+		dfp->dfp_intent = NULL;
+		dfp->dfp_count = 0;
+		INIT_LIST_HEAD(&dfp->dfp_work);
+		list_add_tail(&dfp->dfp_list, &dop->dop_intake);
+	}
+
+	list_add_tail(li, &dfp->dfp_work);
+	dfp->dfp_count++;
+}
+
+/* Register a deferred operation type. */
+void
+xfs_defer_init_op_type(
+	const struct xfs_defer_op_type	*type)
+{
+	defer_op_types[type->type] = type;
+}
+
+/* Initialize a deferred operation. */
+void
+xfs_defer_init(
+	struct xfs_defer_ops		*dop,
+	xfs_fsblock_t			*fbp)
+{
+	dop->dop_committed = false;
+	dop->dop_low = false;
+	memset(&dop->dop_inodes, 0, sizeof(dop->dop_inodes));
+	*fbp = NULLFSBLOCK;
+	INIT_LIST_HEAD(&dop->dop_intake);
+	INIT_LIST_HEAD(&dop->dop_pending);
+}
diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
new file mode 100644
index 0000000..85c7a3a
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_defer.h
@@ -0,0 +1,96 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#ifndef __XFS_DEFER_H__
+#define	__XFS_DEFER_H__
+
+struct xfs_defer_op_type;
+
+/*
+ * Save a log intent item and a list of extents, so that we can replay
+ * whatever action had to happen to the extent list and log the done
+ * item.
+ */
+struct xfs_defer_pending {
+	const struct xfs_defer_op_type	*dfp_type;	/* function pointers */
+	struct list_head		dfp_list;	/* pending items */
+	bool				dfp_committed;	/* committed trans? */
+	void				*dfp_intent;	/* log intent item */
+	struct list_head		dfp_work;	/* work items */
+	unsigned int			dfp_count;	/* # extent items */
+};
+
+/*
+ * Header for deferred operation list.
+ *
+ * dop_low is used by the allocator to activate the lowspace algorithm -
+ * when free space is running low the extent allocator may choose to
+ * allocate an extent from an AG without leaving sufficient space for
+ * a btree split when inserting the new extent.  In this case the allocator
+ * will enable the lowspace algorithm which is supposed to allow further
+ * allocations (such as btree splits and newroots) to allocate from
+ * sequential AGs.  In order to avoid locking AGs out of order the lowspace
+ * algorithm will start searching for free space from AG 0.  If the correct
+ * transaction reservations have been made then this algorithm will eventually
+ * find all the space it needs.
+ */
+enum xfs_defer_ops_type {
+	XFS_DEFER_OPS_TYPE_MAX,
+};
+
+#define XFS_DEFER_OPS_NR_INODES	2	/* join up to two inodes */
+
+struct xfs_defer_ops {
+	bool			dop_committed;	/* did any trans commit? */
+	bool			dop_low;	/* alloc in low mode */
+	struct list_head	dop_intake;	/* unlogged pending work */
+	struct list_head	dop_pending;	/* logged pending work */
+
+	/* relog these inodes with each roll */
+	struct xfs_inode	*dop_inodes[XFS_DEFER_OPS_NR_INODES];
+};
+
+void xfs_defer_add(struct xfs_defer_ops *dop, enum xfs_defer_ops_type type,
+		struct list_head *h);
+int xfs_defer_finish(struct xfs_trans **tp, struct xfs_defer_ops *dop,
+		struct xfs_inode *ip);
+void xfs_defer_cancel(struct xfs_defer_ops *dop);
+void xfs_defer_init(struct xfs_defer_ops *dop, xfs_fsblock_t *fbp);
+bool xfs_defer_has_unfinished_work(struct xfs_defer_ops *dop);
+int xfs_defer_join(struct xfs_defer_ops *dop, struct xfs_inode *ip);
+
+/* Description of a deferred type. */
+struct xfs_defer_op_type {
+	enum xfs_defer_ops_type	type;
+	unsigned int		max_items;
+	void (*abort_intent)(void *);
+	void *(*create_done)(struct xfs_trans *, void *, unsigned int);
+	int (*finish_item)(struct xfs_trans *, struct xfs_defer_ops *,
+			struct list_head *, void *, void **);
+	void (*finish_cleanup)(struct xfs_trans *, void *, int);
+	void (*cancel_item)(struct list_head *);
+	int (*diff_items)(void *, struct list_head *, struct list_head *);
+	void *(*create_intent)(struct xfs_trans *, uint);
+	void (*log_item)(struct xfs_trans *, void *, struct list_head *);
+};
+
+void xfs_defer_init_op_type(const struct xfs_defer_op_type *type);
+void xfs_defer_init_types(void);
+
+#endif /* __XFS_DEFER_H__ */
diff --git a/fs/xfs/xfs_defer_item.c b/fs/xfs/xfs_defer_item.c
new file mode 100644
index 0000000..849088d
--- /dev/null
+++ b/fs/xfs/xfs_defer_item.c
@@ -0,0 +1,36 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_bit.h"
+#include "xfs_sb.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_trans.h"
+
+/* Initialize the deferred operation types. */
+void
+xfs_defer_init_types(void)
+{
+}
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 09722a7..bf63f6d 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -46,6 +46,7 @@
 #include "xfs_quota.h"
 #include "xfs_sysfs.h"
 #include "xfs_ondisk.h"
+#include "xfs_defer.h"
 
 #include <linux/namei.h>
 #include <linux/init.h>
@@ -1850,6 +1851,7 @@ init_xfs_fs(void)
 	printk(KERN_INFO XFS_VERSION_STRING " with "
 			 XFS_BUILD_OPTIONS " enabled\n");
 
+	xfs_defer_init_types();
 	xfs_dir_startup();
 
 	error = xfs_init_zones();


^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 017/119] xfs: add tracepoints for the deferred ops mechanism
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (15 preceding siblings ...)
  2016-06-17  1:19 ` [PATCH 016/119] xfs: move deferred operations into a separate file Darrick J. Wong
@ 2016-06-17  1:19 ` Darrick J. Wong
  2016-06-27 13:15   ` Brian Foster
  2016-06-17  1:19 ` [PATCH 018/119] xfs: enable the xfs_defer mechanism to process extents to free Darrick J. Wong
                   ` (101 subsequent siblings)
  118 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:19 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Add tracepoints for the internals of the deferred ops mechanism
and tracepoint classes for clients of the dops, to make debugging
easier.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_defer.c |   19 ++++
 fs/xfs/xfs_defer_item.c   |    1 
 fs/xfs/xfs_trace.c        |    1 
 fs/xfs/xfs_trace.h        |  198 +++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 219 insertions(+)


diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
index ad14e33e..b4e7faa 100644
--- a/fs/xfs/libxfs/xfs_defer.c
+++ b/fs/xfs/libxfs/xfs_defer.c
@@ -163,6 +163,7 @@ xfs_defer_intake_work(
 	struct xfs_defer_pending	*dfp;
 
 	list_for_each_entry(dfp, &dop->dop_intake, dfp_list) {
+		trace_xfs_defer_intake_work(tp->t_mountp, dfp);
 		dfp->dfp_intent = dfp->dfp_type->create_intent(tp,
 				dfp->dfp_count);
 		list_sort(tp->t_mountp, &dfp->dfp_work,
@@ -183,6 +184,7 @@ xfs_defer_trans_abort(
 {
 	struct xfs_defer_pending	*dfp;
 
+	trace_xfs_defer_trans_abort(tp->t_mountp, dop);
 	/*
 	 * If the transaction was committed, drop the intent reference
 	 * since we're bailing out of here. The other reference is
@@ -195,6 +197,7 @@ xfs_defer_trans_abort(
 
 	/* Abort intent items. */
 	list_for_each_entry(dfp, &dop->dop_pending, dfp_list) {
+		trace_xfs_defer_pending_abort(tp->t_mountp, dfp);
 		if (dfp->dfp_committed)
 			dfp->dfp_type->abort_intent(dfp->dfp_intent);
 	}
@@ -221,9 +224,12 @@ xfs_defer_trans_roll(
 		xfs_trans_log_inode(*tp, dop->dop_inodes[i], XFS_ILOG_CORE);
 	}
 
+	trace_xfs_defer_trans_roll((*tp)->t_mountp, dop);
+
 	/* Roll the transaction. */
 	error = xfs_trans_roll(tp, ip);
 	if (error) {
+		trace_xfs_defer_trans_roll_error((*tp)->t_mountp, dop, error);
 		xfs_defer_trans_abort(*tp, dop, error);
 		return error;
 	}
@@ -295,6 +301,8 @@ xfs_defer_finish(
 
 	ASSERT((*tp)->t_flags & XFS_TRANS_PERM_LOG_RES);
 
+	trace_xfs_defer_finish((*tp)->t_mountp, dop);
+
 	/* Until we run out of pending work to finish... */
 	while (xfs_defer_has_unfinished_work(dop)) {
 		/* Log intents for work items sitting in the intake. */
@@ -309,12 +317,14 @@ xfs_defer_finish(
 		list_for_each_entry_reverse(dfp, &dop->dop_pending, dfp_list) {
 			if (dfp->dfp_committed)
 				break;
+			trace_xfs_defer_pending_commit((*tp)->t_mountp, dfp);
 			dfp->dfp_committed = true;
 		}
 
 		/* Log an intent-done item for the first pending item. */
 		dfp = list_first_entry(&dop->dop_pending,
 				struct xfs_defer_pending, dfp_list);
+		trace_xfs_defer_pending_finish((*tp)->t_mountp, dfp);
 		done_item = dfp->dfp_type->create_done(*tp, dfp->dfp_intent,
 				dfp->dfp_count);
 		cleanup_fn = dfp->dfp_type->finish_cleanup;
@@ -370,6 +380,10 @@ xfs_defer_finish(
 	}
 
 out:
+	if (error)
+		trace_xfs_defer_finish_error((*tp)->t_mountp, dop, error);
+	else
+		trace_xfs_defer_finish_done((*tp)->t_mountp, dop);
 	return error;
 }
 
@@ -385,11 +399,14 @@ xfs_defer_cancel(
 	struct list_head		*pwi;
 	struct list_head		*n;
 
+	trace_xfs_defer_cancel(NULL, dop);
+
 	/*
 	 * Free the pending items.  Caller should already have arranged
 	 * for the intent items to be released.
 	 */
 	list_for_each_entry_safe(dfp, pli, &dop->dop_intake, dfp_list) {
+		trace_xfs_defer_intake_cancel(NULL, dfp);
 		list_del(&dfp->dfp_list);
 		list_for_each_safe(pwi, n, &dfp->dfp_work) {
 			list_del(pwi);
@@ -400,6 +417,7 @@ xfs_defer_cancel(
 		kmem_free(dfp);
 	}
 	list_for_each_entry_safe(dfp, pli, &dop->dop_pending, dfp_list) {
+		trace_xfs_defer_pending_cancel(NULL, dfp);
 		list_del(&dfp->dfp_list);
 		list_for_each_safe(pwi, n, &dfp->dfp_work) {
 			list_del(pwi);
@@ -468,4 +486,5 @@ xfs_defer_init(
 	*fbp = NULLFSBLOCK;
 	INIT_LIST_HEAD(&dop->dop_intake);
 	INIT_LIST_HEAD(&dop->dop_pending);
+	trace_xfs_defer_init(NULL, dop);
 }
diff --git a/fs/xfs/xfs_defer_item.c b/fs/xfs/xfs_defer_item.c
index 849088d..4c2ba28 100644
--- a/fs/xfs/xfs_defer_item.c
+++ b/fs/xfs/xfs_defer_item.c
@@ -28,6 +28,7 @@
 #include "xfs_mount.h"
 #include "xfs_defer.h"
 #include "xfs_trans.h"
+#include "xfs_trace.h"
 
 /* Initialize the deferred operation types. */
 void
diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c
index 13a0298..3971527 100644
--- a/fs/xfs/xfs_trace.c
+++ b/fs/xfs/xfs_trace.c
@@ -22,6 +22,7 @@
 #include "xfs_log_format.h"
 #include "xfs_trans_resv.h"
 #include "xfs_mount.h"
+#include "xfs_defer.h"
 #include "xfs_da_format.h"
 #include "xfs_inode.h"
 #include "xfs_btree.h"
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index f0ac9c9..5923014 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -2220,6 +2220,204 @@ DEFINE_EVENT(xfs_btree_cur_class, name, \
 DEFINE_BTREE_CUR_EVENT(xfs_btree_updkeys);
 DEFINE_BTREE_CUR_EVENT(xfs_btree_overlapped_query_range);
 
+/* deferred ops */
+struct xfs_defer_pending;
+struct xfs_defer_intake;
+struct xfs_defer_ops;
+
+DECLARE_EVENT_CLASS(xfs_defer_class,
+	TP_PROTO(struct xfs_mount *mp, struct xfs_defer_ops *dop),
+	TP_ARGS(mp, dop),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(void *, dop)
+		__field(bool, committed)
+		__field(bool, low)
+	),
+	TP_fast_assign(
+		__entry->dev = mp ? mp->m_super->s_dev : 0;
+		__entry->dop = dop;
+		__entry->committed = dop->dop_committed;
+		__entry->low = dop->dop_low;
+	),
+	TP_printk("dev %d:%d ops %p committed %d low %d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->dop,
+		  __entry->committed,
+		  __entry->low)
+)
+#define DEFINE_DEFER_EVENT(name) \
+DEFINE_EVENT(xfs_defer_class, name, \
+	TP_PROTO(struct xfs_mount *mp, struct xfs_defer_ops *dop), \
+	TP_ARGS(mp, dop))
+
+DECLARE_EVENT_CLASS(xfs_defer_error_class,
+	TP_PROTO(struct xfs_mount *mp, struct xfs_defer_ops *dop, int error),
+	TP_ARGS(mp, dop, error),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(void *, dop)
+		__field(bool, committed)
+		__field(bool, low)
+		__field(int, error)
+	),
+	TP_fast_assign(
+		__entry->dev = mp ? mp->m_super->s_dev : 0;
+		__entry->dop = dop;
+		__entry->committed = dop->dop_committed;
+		__entry->low = dop->dop_low;
+		__entry->error = error;
+	),
+	TP_printk("dev %d:%d ops %p committed %d low %d err %d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->dop,
+		  __entry->committed,
+		  __entry->low,
+		  __entry->error)
+)
+#define DEFINE_DEFER_ERROR_EVENT(name) \
+DEFINE_EVENT(xfs_defer_error_class, name, \
+	TP_PROTO(struct xfs_mount *mp, struct xfs_defer_ops *dop, int error), \
+	TP_ARGS(mp, dop, error))
+
+DECLARE_EVENT_CLASS(xfs_defer_pending_class,
+	TP_PROTO(struct xfs_mount *mp, struct xfs_defer_pending *dfp),
+	TP_ARGS(mp, dfp),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(int, type)
+		__field(void *, intent)
+		__field(bool, committed)
+		__field(int, nr)
+	),
+	TP_fast_assign(
+		__entry->dev = mp ? mp->m_super->s_dev : 0;
+		__entry->type = dfp->dfp_type->type;
+		__entry->intent = dfp->dfp_intent;
+		__entry->committed = dfp->dfp_committed;
+		__entry->nr = dfp->dfp_count;
+	),
+	TP_printk("dev %d:%d optype %d intent %p committed %d nr %d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->type,
+		  __entry->intent,
+		  __entry->committed,
+		  __entry->nr)
+)
+#define DEFINE_DEFER_PENDING_EVENT(name) \
+DEFINE_EVENT(xfs_defer_pending_class, name, \
+	TP_PROTO(struct xfs_mount *mp, struct xfs_defer_pending *dfp), \
+	TP_ARGS(mp, dfp))
+
+DECLARE_EVENT_CLASS(xfs_phys_extent_deferred_class,
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
+		 int type, xfs_agblock_t agbno, xfs_extlen_t len),
+	TP_ARGS(mp, agno, type, agbno, len),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(int, type)
+		__field(xfs_agblock_t, agbno)
+		__field(xfs_extlen_t, len)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->agno = agno;
+		__entry->type = type;
+		__entry->agbno = agbno;
+		__entry->len = len;
+	),
+	TP_printk("dev %d:%d op %d agno %u agbno %u len %u",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->type,
+		  __entry->agno,
+		  __entry->agbno,
+		  __entry->len)
+);
+#define DEFINE_PHYS_EXTENT_DEFERRED_EVENT(name) \
+DEFINE_EVENT(xfs_phys_extent_deferred_class, name, \
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, \
+		 int type, \
+		 xfs_agblock_t bno, \
+		 xfs_extlen_t len), \
+	TP_ARGS(mp, agno, type, bno, len))
+
+DECLARE_EVENT_CLASS(xfs_map_extent_deferred_class,
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
+		 int op,
+		 xfs_agblock_t agbno,
+		 xfs_ino_t ino,
+		 int whichfork,
+		 xfs_fileoff_t offset,
+		 xfs_filblks_t len,
+		 xfs_exntst_t state),
+	TP_ARGS(mp, agno, op, agbno, ino, whichfork, offset, len, state),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_ino_t, ino)
+		__field(xfs_agblock_t, agbno)
+		__field(int, whichfork)
+		__field(xfs_fileoff_t, l_loff)
+		__field(xfs_filblks_t, l_len)
+		__field(xfs_exntst_t, l_state)
+		__field(int, op)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->agno = agno;
+		__entry->ino = ino;
+		__entry->agbno = agbno;
+		__entry->whichfork = whichfork;
+		__entry->l_loff = offset;
+		__entry->l_len = len;
+		__entry->l_state = state;
+		__entry->op = op;
+	),
+	TP_printk("dev %d:%d op %d agno %u agbno %u owner %lld %s offset %llu len %llu state %d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->op,
+		  __entry->agno,
+		  __entry->agbno,
+		  __entry->ino,
+		  __entry->whichfork == XFS_ATTR_FORK ? "attr" : "data",
+		  __entry->l_loff,
+		  __entry->l_len,
+		  __entry->l_state)
+);
+#define DEFINE_MAP_EXTENT_DEFERRED_EVENT(name) \
+DEFINE_EVENT(xfs_map_extent_deferred_class, name, \
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, \
+		 int op, \
+		 xfs_agblock_t agbno, \
+		 xfs_ino_t ino, \
+		 int whichfork, \
+		 xfs_fileoff_t offset, \
+		 xfs_filblks_t len, \
+		 xfs_exntst_t state), \
+	TP_ARGS(mp, agno, op, agbno, ino, whichfork, offset, len, state))
+
+DEFINE_DEFER_EVENT(xfs_defer_init);
+DEFINE_DEFER_EVENT(xfs_defer_cancel);
+DEFINE_DEFER_EVENT(xfs_defer_trans_roll);
+DEFINE_DEFER_EVENT(xfs_defer_trans_abort);
+DEFINE_DEFER_EVENT(xfs_defer_finish);
+DEFINE_DEFER_EVENT(xfs_defer_finish_done);
+
+DEFINE_DEFER_ERROR_EVENT(xfs_defer_trans_roll_error);
+DEFINE_DEFER_ERROR_EVENT(xfs_defer_finish_error);
+DEFINE_DEFER_ERROR_EVENT(xfs_defer_op_finish_error);
+
+DEFINE_DEFER_PENDING_EVENT(xfs_defer_intake_work);
+DEFINE_DEFER_PENDING_EVENT(xfs_defer_intake_cancel);
+DEFINE_DEFER_PENDING_EVENT(xfs_defer_pending_commit);
+DEFINE_DEFER_PENDING_EVENT(xfs_defer_pending_cancel);
+DEFINE_DEFER_PENDING_EVENT(xfs_defer_pending_finish);
+DEFINE_DEFER_PENDING_EVENT(xfs_defer_pending_abort);
+
+DEFINE_PHYS_EXTENT_DEFERRED_EVENT(xfs_defer_phys_extent);
+DEFINE_MAP_EXTENT_DEFERRED_EVENT(xfs_defer_map_extent);
+
 #endif /* _TRACE_XFS_H */
 
 #undef TRACE_INCLUDE_PATH



* [PATCH 018/119] xfs: enable the xfs_defer mechanism to process extents to free
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (16 preceding siblings ...)
  2016-06-17  1:19 ` [PATCH 017/119] xfs: add tracepoints for the deferred ops mechanism Darrick J. Wong
@ 2016-06-17  1:19 ` Darrick J. Wong
  2016-06-27 13:15   ` Brian Foster
  2016-06-17  1:19 ` [PATCH 019/119] xfs: rework xfs_bmap_free callers to use xfs_defer_ops Darrick J. Wong
                   ` (100 subsequent siblings)
  118 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:19 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Connect the xfs_defer mechanism with the pieces that we'll need to
handle deferred extent freeing.  We'll wire up the existing code to
our new deferred mechanism later.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_defer.h |    1 
 fs/xfs/xfs_defer_item.c   |  108 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 109 insertions(+)


diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
index 85c7a3a..743fc32 100644
--- a/fs/xfs/libxfs/xfs_defer.h
+++ b/fs/xfs/libxfs/xfs_defer.h
@@ -51,6 +51,7 @@ struct xfs_defer_pending {
  * find all the space it needs.
  */
 enum xfs_defer_ops_type {
+	XFS_DEFER_OPS_TYPE_FREE,
 	XFS_DEFER_OPS_TYPE_MAX,
 };
 
diff --git a/fs/xfs/xfs_defer_item.c b/fs/xfs/xfs_defer_item.c
index 4c2ba28..127a54e 100644
--- a/fs/xfs/xfs_defer_item.c
+++ b/fs/xfs/xfs_defer_item.c
@@ -29,9 +29,117 @@
 #include "xfs_defer.h"
 #include "xfs_trans.h"
 #include "xfs_trace.h"
+#include "xfs_bmap.h"
+#include "xfs_extfree_item.h"
+
+/* Extent Freeing */
+
+/* Sort bmap items by AG. */
+static int
+xfs_bmap_free_diff_items(
+	void				*priv,
+	struct list_head		*a,
+	struct list_head		*b)
+{
+	struct xfs_mount		*mp = priv;
+	struct xfs_bmap_free_item	*ra;
+	struct xfs_bmap_free_item	*rb;
+
+	ra = container_of(a, struct xfs_bmap_free_item, xbfi_list);
+	rb = container_of(b, struct xfs_bmap_free_item, xbfi_list);
+	return  XFS_FSB_TO_AGNO(mp, ra->xbfi_startblock) -
+		XFS_FSB_TO_AGNO(mp, rb->xbfi_startblock);
+}
+
+/* Get an EFI. */
+STATIC void *
+xfs_bmap_free_create_intent(
+	struct xfs_trans		*tp,
+	unsigned int			count)
+{
+	return xfs_trans_get_efi(tp, count);
+}
+
+/* Log a free extent to the intent item. */
+STATIC void
+xfs_bmap_free_log_item(
+	struct xfs_trans		*tp,
+	void				*intent,
+	struct list_head		*item)
+{
+	struct xfs_bmap_free_item	*free;
+
+	free = container_of(item, struct xfs_bmap_free_item, xbfi_list);
+	xfs_trans_log_efi_extent(tp, intent, free->xbfi_startblock,
+			free->xbfi_blockcount);
+}
+
+/* Get an EFD so we can process all the free extents. */
+STATIC void *
+xfs_bmap_free_create_done(
+	struct xfs_trans		*tp,
+	void				*intent,
+	unsigned int			count)
+{
+	return xfs_trans_get_efd(tp, intent, count);
+}
+
+/* Process a free extent. */
+STATIC int
+xfs_bmap_free_finish_item(
+	struct xfs_trans		*tp,
+	struct xfs_defer_ops		*dop,
+	struct list_head		*item,
+	void				*done_item,
+	void				**state)
+{
+	struct xfs_bmap_free_item	*free;
+	int				error;
+
+	free = container_of(item, struct xfs_bmap_free_item, xbfi_list);
+	error = xfs_trans_free_extent(tp, done_item,
+			free->xbfi_startblock,
+			free->xbfi_blockcount);
+	kmem_free(free);
+	return error;
+}
+
+/* Abort all pending EFIs. */
+STATIC void
+xfs_bmap_free_abort_intent(
+	void				*intent)
+{
+	xfs_efi_release(intent);
+}
+
+/* Cancel a free extent. */
+STATIC void
+xfs_bmap_free_cancel_item(
+	struct list_head		*item)
+{
+	struct xfs_bmap_free_item	*free;
+
+	free = container_of(item, struct xfs_bmap_free_item, xbfi_list);
+	kmem_free(free);
+}
+
+const struct xfs_defer_op_type xfs_extent_free_defer_type = {
+	.type		= XFS_DEFER_OPS_TYPE_FREE,
+	.max_items	= XFS_EFI_MAX_FAST_EXTENTS,
+	.diff_items	= xfs_bmap_free_diff_items,
+	.create_intent	= xfs_bmap_free_create_intent,
+	.abort_intent	= xfs_bmap_free_abort_intent,
+	.log_item	= xfs_bmap_free_log_item,
+	.create_done	= xfs_bmap_free_create_done,
+	.finish_item	= xfs_bmap_free_finish_item,
+	.cancel_item	= xfs_bmap_free_cancel_item,
+};
+
+/* Deferred Item Initialization */
 
 /* Initialize the deferred operation types. */
 void
 xfs_defer_init_types(void)
 {
+	xfs_defer_init_op_type(&xfs_extent_free_defer_type);
 }



* [PATCH 019/119] xfs: rework xfs_bmap_free callers to use xfs_defer_ops
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (17 preceding siblings ...)
  2016-06-17  1:19 ` [PATCH 018/119] xfs: enable the xfs_defer mechanism to process extents to free Darrick J. Wong
@ 2016-06-17  1:19 ` Darrick J. Wong
  2016-06-17  1:20 ` [PATCH 020/119] xfs: change xfs_bmap_{finish, cancel, init, free} -> xfs_defer_* Darrick J. Wong
                   ` (99 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:19 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Restructure everything that used xfs_bmap_free to use xfs_defer_ops
instead.  For now we'll just remove the old symbols and play some
cpp magic to make it work; in the next patch we'll actually rename
everything.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_alloc.c       |    1 
 fs/xfs/libxfs/xfs_attr.c        |    1 
 fs/xfs/libxfs/xfs_attr_remote.c |    1 
 fs/xfs/libxfs/xfs_bmap.c        |   55 +++++------------------
 fs/xfs/libxfs/xfs_bmap.h        |   32 --------------
 fs/xfs/libxfs/xfs_bmap_btree.c  |    5 +-
 fs/xfs/libxfs/xfs_btree.c       |    1 
 fs/xfs/libxfs/xfs_defer.h       |    7 +++
 fs/xfs/libxfs/xfs_dir2.c        |    1 
 fs/xfs/libxfs/xfs_ialloc.c      |    1 
 fs/xfs/libxfs/xfs_inode_buf.c   |    1 
 fs/xfs/libxfs/xfs_sb.c          |    1 
 fs/xfs/xfs_bmap_util.c          |   92 +--------------------------------------
 fs/xfs/xfs_bmap_util.h          |    2 -
 fs/xfs/xfs_dquot.c              |    1 
 fs/xfs/xfs_filestream.c         |    3 +
 fs/xfs/xfs_fsops.c              |    1 
 fs/xfs/xfs_inode.c              |    1 
 fs/xfs/xfs_iomap.c              |    1 
 fs/xfs/xfs_log_recover.c        |    1 
 fs/xfs/xfs_mount.c              |    1 
 fs/xfs/xfs_rtalloc.c            |    1 
 fs/xfs/xfs_symlink.c            |    1 
 fs/xfs/xfs_trace.c              |    1 
 24 files changed, 42 insertions(+), 171 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index c366889..c06b463 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -24,6 +24,7 @@
 #include "xfs_bit.h"
 #include "xfs_sb.h"
 #include "xfs_mount.h"
+#include "xfs_defer.h"
 #include "xfs_inode.h"
 #include "xfs_btree.h"
 #include "xfs_alloc_btree.h"
diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
index 4e126f4..79d3a30 100644
--- a/fs/xfs/libxfs/xfs_attr.c
+++ b/fs/xfs/libxfs/xfs_attr.c
@@ -23,6 +23,7 @@
 #include "xfs_trans_resv.h"
 #include "xfs_bit.h"
 #include "xfs_mount.h"
+#include "xfs_defer.h"
 #include "xfs_da_format.h"
 #include "xfs_da_btree.h"
 #include "xfs_attr_sf.h"
diff --git a/fs/xfs/libxfs/xfs_attr_remote.c b/fs/xfs/libxfs/xfs_attr_remote.c
index a572532..93a9ce1 100644
--- a/fs/xfs/libxfs/xfs_attr_remote.c
+++ b/fs/xfs/libxfs/xfs_attr_remote.c
@@ -24,6 +24,7 @@
 #include "xfs_trans_resv.h"
 #include "xfs_bit.h"
 #include "xfs_mount.h"
+#include "xfs_defer.h"
 #include "xfs_da_format.h"
 #include "xfs_da_btree.h"
 #include "xfs_inode.h"
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index a5d207a..64ca97f 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -24,6 +24,7 @@
 #include "xfs_bit.h"
 #include "xfs_sb.h"
 #include "xfs_mount.h"
+#include "xfs_defer.h"
 #include "xfs_da_format.h"
 #include "xfs_da_btree.h"
 #include "xfs_dir2.h"
@@ -595,41 +596,7 @@ xfs_bmap_add_free(
 	new = kmem_zone_alloc(xfs_bmap_free_item_zone, KM_SLEEP);
 	new->xbfi_startblock = bno;
 	new->xbfi_blockcount = (xfs_extlen_t)len;
-	list_add(&new->xbfi_list, &flist->xbf_flist);
-	flist->xbf_count++;
-}
-
-/*
- * Remove the entry "free" from the free item list.  Prev points to the
- * previous entry, unless "free" is the head of the list.
- */
-void
-xfs_bmap_del_free(
-	struct xfs_bmap_free		*flist,	/* free item list header */
-	struct xfs_bmap_free_item	*free)	/* list item to be freed */
-{
-	list_del(&free->xbfi_list);
-	flist->xbf_count--;
-	kmem_zone_free(xfs_bmap_free_item_zone, free);
-}
-
-/*
- * Free up any items left in the list.
- */
-void
-xfs_bmap_cancel(
-	struct xfs_bmap_free		*flist)	/* list of bmap_free_items */
-{
-	struct xfs_bmap_free_item	*free;	/* free list item */
-
-	if (flist->xbf_count == 0)
-		return;
-	while (!list_empty(&flist->xbf_flist)) {
-		free = list_first_entry(&flist->xbf_flist,
-				struct xfs_bmap_free_item, xbfi_list);
-		xfs_bmap_del_free(flist, free);
-	}
-	ASSERT(flist->xbf_count == 0);
+	xfs_defer_add(flist, XFS_DEFER_OPS_TYPE_FREE, &new->xbfi_list);
 }
 
 /*
@@ -767,7 +734,7 @@ xfs_bmap_extents_to_btree(
 	if (*firstblock == NULLFSBLOCK) {
 		args.type = XFS_ALLOCTYPE_START_BNO;
 		args.fsbno = XFS_INO_TO_FSB(mp, ip->i_ino);
-	} else if (flist->xbf_low) {
+	} else if (flist->dop_low) {
 		args.type = XFS_ALLOCTYPE_START_BNO;
 		args.fsbno = *firstblock;
 	} else {
@@ -788,7 +755,7 @@ xfs_bmap_extents_to_btree(
 	ASSERT(args.fsbno != NULLFSBLOCK);
 	ASSERT(*firstblock == NULLFSBLOCK ||
 	       args.agno == XFS_FSB_TO_AGNO(mp, *firstblock) ||
-	       (flist->xbf_low &&
+	       (flist->dop_low &&
 		args.agno > XFS_FSB_TO_AGNO(mp, *firstblock)));
 	*firstblock = cur->bc_private.b.firstblock = args.fsbno;
 	cur->bc_private.b.allocated++;
@@ -3708,7 +3675,7 @@ xfs_bmap_btalloc(
 			error = xfs_bmap_btalloc_nullfb(ap, &args, &blen);
 		if (error)
 			return error;
-	} else if (ap->flist->xbf_low) {
+	} else if (ap->flist->dop_low) {
 		if (xfs_inode_is_filestream(ap->ip))
 			args.type = XFS_ALLOCTYPE_FIRST_AG;
 		else
@@ -3741,7 +3708,7 @@ xfs_bmap_btalloc(
 	 * is >= the stripe unit and the allocation offset is
 	 * at the end of file.
 	 */
-	if (!ap->flist->xbf_low && ap->aeof) {
+	if (!ap->flist->dop_low && ap->aeof) {
 		if (!ap->offset) {
 			args.alignment = stripe_align;
 			atype = args.type;
@@ -3834,7 +3801,7 @@ xfs_bmap_btalloc(
 		args.minleft = 0;
 		if ((error = xfs_alloc_vextent(&args)))
 			return error;
-		ap->flist->xbf_low = 1;
+		ap->flist->dop_low = true;
 	}
 	if (args.fsbno != NULLFSBLOCK) {
 		/*
@@ -3844,7 +3811,7 @@ xfs_bmap_btalloc(
 		ASSERT(*ap->firstblock == NULLFSBLOCK ||
 		       XFS_FSB_TO_AGNO(mp, *ap->firstblock) ==
 		       XFS_FSB_TO_AGNO(mp, args.fsbno) ||
-		       (ap->flist->xbf_low &&
+		       (ap->flist->dop_low &&
 			XFS_FSB_TO_AGNO(mp, *ap->firstblock) <
 			XFS_FSB_TO_AGNO(mp, args.fsbno)));
 
@@ -3852,7 +3819,7 @@ xfs_bmap_btalloc(
 		if (*ap->firstblock == NULLFSBLOCK)
 			*ap->firstblock = args.fsbno;
 		ASSERT(nullfb || fb_agno == args.agno ||
-		       (ap->flist->xbf_low && fb_agno < args.agno));
+		       (ap->flist->dop_low && fb_agno < args.agno));
 		ap->length = args.len;
 		ap->ip->i_d.di_nblocks += args.len;
 		xfs_trans_log_inode(ap->tp, ap->ip, XFS_ILOG_CORE);
@@ -4319,7 +4286,7 @@ xfs_bmapi_allocate(
 	if (error)
 		return error;
 
-	if (bma->flist->xbf_low)
+	if (bma->flist->dop_low)
 		bma->minleft = 0;
 	if (bma->cur)
 		bma->cur->bc_private.b.firstblock = *bma->firstblock;
@@ -4684,7 +4651,7 @@ error0:
 			       XFS_FSB_TO_AGNO(mp, *firstblock) ==
 			       XFS_FSB_TO_AGNO(mp,
 				       bma.cur->bc_private.b.firstblock) ||
-			       (flist->xbf_low &&
+			       (flist->dop_low &&
 				XFS_FSB_TO_AGNO(mp, *firstblock) <
 				XFS_FSB_TO_AGNO(mp,
 					bma.cur->bc_private.b.firstblock)));
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 0ef4c6b..6681bd9 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -69,27 +69,6 @@ struct xfs_bmap_free_item
 	struct list_head	xbfi_list;
 };
 
-/*
- * Header for free extent list.
- *
- * xbf_low is used by the allocator to activate the lowspace algorithm -
- * when free space is running low the extent allocator may choose to
- * allocate an extent from an AG without leaving sufficient space for
- * a btree split when inserting the new extent.  In this case the allocator
- * will enable the lowspace algorithm which is supposed to allow further
- * allocations (such as btree splits and newroots) to allocate from
- * sequential AGs.  In order to avoid locking AGs out of order the lowspace
- * algorithm will start searching for free space from AG 0.  If the correct
- * transaction reservations have been made then this algorithm will eventually
- * find all the space it needs.
- */
-typedef	struct xfs_bmap_free
-{
-	struct list_head	xbf_flist;	/* list of to-be-free extents */
-	int			xbf_count;	/* count of items on list */
-	int			xbf_low;	/* alloc in low mode */
-} xfs_bmap_free_t;
-
 #define	XFS_BMAP_MAX_NMAP	4
 
 /*
@@ -139,14 +118,6 @@ static inline int xfs_bmapi_aflag(int w)
 #define	DELAYSTARTBLOCK		((xfs_fsblock_t)-1LL)
 #define	HOLESTARTBLOCK		((xfs_fsblock_t)-2LL)
 
-static inline void xfs_bmap_init(xfs_bmap_free_t *flp, xfs_fsblock_t *fbp)
-{
-	INIT_LIST_HEAD(&flp->xbf_flist);
-	flp->xbf_count = 0;
-	flp->xbf_low = 0;
-	*fbp = NULLFSBLOCK;
-}
-
 /*
  * Flags for xfs_bmap_add_extent*.
  */
@@ -195,9 +166,6 @@ int	xfs_bmap_add_attrfork(struct xfs_inode *ip, int size, int rsvd);
 void	xfs_bmap_local_to_extents_empty(struct xfs_inode *ip, int whichfork);
 void	xfs_bmap_add_free(struct xfs_mount *mp, struct xfs_bmap_free *flist,
 			  xfs_fsblock_t bno, xfs_filblks_t len);
-void	xfs_bmap_cancel(struct xfs_bmap_free *flist);
-int	xfs_bmap_finish(struct xfs_trans **tp, struct xfs_bmap_free *flist,
-			struct xfs_inode *ip);
 void	xfs_bmap_compute_maxlevels(struct xfs_mount *mp, int whichfork);
 int	xfs_bmap_first_unused(struct xfs_trans *tp, struct xfs_inode *ip,
 		xfs_extlen_t len, xfs_fileoff_t *unused, int whichfork);
diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
index 714b387..fa5e3a5 100644
--- a/fs/xfs/libxfs/xfs_bmap_btree.c
+++ b/fs/xfs/libxfs/xfs_bmap_btree.c
@@ -23,6 +23,7 @@
 #include "xfs_trans_resv.h"
 #include "xfs_bit.h"
 #include "xfs_mount.h"
+#include "xfs_defer.h"
 #include "xfs_inode.h"
 #include "xfs_trans.h"
 #include "xfs_inode_item.h"
@@ -462,7 +463,7 @@ xfs_bmbt_alloc_block(
 		 * block allocation here and corrupt the filesystem.
 		 */
 		args.minleft = args.tp->t_blk_res;
-	} else if (cur->bc_private.b.flist->xbf_low) {
+	} else if (cur->bc_private.b.flist->dop_low) {
 		args.type = XFS_ALLOCTYPE_START_BNO;
 	} else {
 		args.type = XFS_ALLOCTYPE_NEAR_BNO;
@@ -490,7 +491,7 @@ xfs_bmbt_alloc_block(
 		error = xfs_alloc_vextent(&args);
 		if (error)
 			goto error0;
-		cur->bc_private.b.flist->xbf_low = 1;
+		cur->bc_private.b.flist->dop_low = true;
 	}
 	if (args.fsbno == NULLFSBLOCK) {
 		XFS_BTREE_TRACE_CURSOR(cur, XBT_EXIT);
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index eac876a..5b3743a 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -23,6 +23,7 @@
 #include "xfs_trans_resv.h"
 #include "xfs_bit.h"
 #include "xfs_mount.h"
+#include "xfs_defer.h"
 #include "xfs_inode.h"
 #include "xfs_trans.h"
 #include "xfs_inode_item.h"
diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
index 743fc32..4c05ba6 100644
--- a/fs/xfs/libxfs/xfs_defer.h
+++ b/fs/xfs/libxfs/xfs_defer.h
@@ -94,4 +94,11 @@ struct xfs_defer_op_type {
 void xfs_defer_init_op_type(const struct xfs_defer_op_type *type);
 void xfs_defer_init_types(void);
 
+/* XXX: compatibility shims, will go away in the next patch */
+#define xfs_bmap_finish		xfs_defer_finish
+#define xfs_bmap_cancel		xfs_defer_cancel
+#define xfs_bmap_init		xfs_defer_init
+#define xfs_bmap_free		xfs_defer_ops
+typedef struct xfs_defer_ops	xfs_bmap_free_t;
+
 #endif /* __XFS_DEFER_H__ */
diff --git a/fs/xfs/libxfs/xfs_dir2.c b/fs/xfs/libxfs/xfs_dir2.c
index af0f9d1..945c0345 100644
--- a/fs/xfs/libxfs/xfs_dir2.c
+++ b/fs/xfs/libxfs/xfs_dir2.c
@@ -21,6 +21,7 @@
 #include "xfs_log_format.h"
 #include "xfs_trans_resv.h"
 #include "xfs_mount.h"
+#include "xfs_defer.h"
 #include "xfs_da_format.h"
 #include "xfs_da_btree.h"
 #include "xfs_inode.h"
diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index cda7269..9ae9a43 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -24,6 +24,7 @@
 #include "xfs_bit.h"
 #include "xfs_sb.h"
 #include "xfs_mount.h"
+#include "xfs_defer.h"
 #include "xfs_inode.h"
 #include "xfs_btree.h"
 #include "xfs_ialloc.h"
diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index 794fa66..44f325c 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -22,6 +22,7 @@
 #include "xfs_log_format.h"
 #include "xfs_trans_resv.h"
 #include "xfs_mount.h"
+#include "xfs_defer.h"
 #include "xfs_inode.h"
 #include "xfs_error.h"
 #include "xfs_cksum.h"
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index 09d6fd0..a544686 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -24,6 +24,7 @@
 #include "xfs_bit.h"
 #include "xfs_sb.h"
 #include "xfs_mount.h"
+#include "xfs_defer.h"
 #include "xfs_inode.h"
 #include "xfs_ialloc.h"
 #include "xfs_alloc.h"
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 1aac0ba..972a27a 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -25,6 +25,7 @@
 #include "xfs_bit.h"
 #include "xfs_mount.h"
 #include "xfs_da_format.h"
+#include "xfs_defer.h"
 #include "xfs_inode.h"
 #include "xfs_btree.h"
 #include "xfs_trans.h"
@@ -79,95 +80,6 @@ xfs_zero_extent(
 		GFP_NOFS, true);
 }
 
-/* Sort bmap items by AG. */
-static int
-xfs_bmap_free_list_cmp(
-	void			*priv,
-	struct list_head	*a,
-	struct list_head	*b)
-{
-	struct xfs_mount	*mp = priv;
-	struct xfs_bmap_free_item	*ra;
-	struct xfs_bmap_free_item	*rb;
-
-	ra = container_of(a, struct xfs_bmap_free_item, xbfi_list);
-	rb = container_of(b, struct xfs_bmap_free_item, xbfi_list);
-	return  XFS_FSB_TO_AGNO(mp, ra->xbfi_startblock) -
-		XFS_FSB_TO_AGNO(mp, rb->xbfi_startblock);
-}
-
-/*
- * Routine to be called at transaction's end by xfs_bmapi, xfs_bunmapi
- * caller.  Frees all the extents that need freeing, which must be done
- * last due to locking considerations.  We never free any extents in
- * the first transaction.
- *
- * If an inode *ip is provided, rejoin it to the transaction if
- * the transaction was committed.
- */
-int						/* error */
-xfs_bmap_finish(
-	struct xfs_trans		**tp,	/* transaction pointer addr */
-	struct xfs_bmap_free		*flist,	/* i/o: list extents to free */
-	struct xfs_inode		*ip)
-{
-	struct xfs_efd_log_item		*efd;	/* extent free data */
-	struct xfs_efi_log_item		*efi;	/* extent free intention */
-	int				error;	/* error return value */
-	int				committed;/* xact committed or not */
-	struct xfs_bmap_free_item	*free;	/* free extent item */
-
-	ASSERT((*tp)->t_flags & XFS_TRANS_PERM_LOG_RES);
-	if (flist->xbf_count == 0)
-		return 0;
-
-	list_sort((*tp)->t_mountp, &flist->xbf_flist, xfs_bmap_free_list_cmp);
-
-	efi = xfs_trans_get_efi(*tp, flist->xbf_count);
-	list_for_each_entry(free, &flist->xbf_flist, xbfi_list)
-		xfs_trans_log_efi_extent(*tp, efi, free->xbfi_startblock,
-			free->xbfi_blockcount);
-
-	error = __xfs_trans_roll(tp, ip, &committed);
-	if (error) {
-		/*
-		 * If the transaction was committed, drop the EFD reference
-		 * since we're bailing out of here. The other reference is
-		 * dropped when the EFI hits the AIL.
-		 *
-		 * If the transaction was not committed, the EFI is freed by the
-		 * EFI item unlock handler on abort. Also, we have a new
-		 * transaction so we should return committed=1 even though we're
-		 * returning an error.
-		 */
-		if (committed) {
-			xfs_efi_release(efi);
-			xfs_force_shutdown((*tp)->t_mountp,
-					   SHUTDOWN_META_IO_ERROR);
-		}
-		return error;
-	}
-
-	/*
-	 * Get an EFD and free each extent in the list, logging to the EFD in
-	 * the process. The remaining bmap free list is cleaned up by the caller
-	 * on error.
-	 */
-	efd = xfs_trans_get_efd(*tp, efi, flist->xbf_count);
-	while (!list_empty(&flist->xbf_flist)) {
-		free = list_first_entry(&flist->xbf_flist,
-				struct xfs_bmap_free_item, xbfi_list);
-		error = xfs_trans_free_extent(*tp, efd, free->xbfi_startblock,
-					      free->xbfi_blockcount);
-		if (error)
-			return error;
-
-		xfs_bmap_del_free(flist, free);
-	}
-
-	return 0;
-}
-
 int
 xfs_bmap_rtalloc(
 	struct xfs_bmalloca	*ap)	/* bmap alloc argument struct */
@@ -815,7 +727,7 @@ xfs_bmap_punch_delalloc_range(
 		if (error)
 			break;
 
-		ASSERT(!flist.xbf_count && list_empty(&flist.xbf_flist));
+		ASSERT(!xfs_defer_has_unfinished_work(&flist));
 next_block:
 		start_fsb++;
 		remaining--;
diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
index f200714..51aadde 100644
--- a/fs/xfs/xfs_bmap_util.h
+++ b/fs/xfs/xfs_bmap_util.h
@@ -40,8 +40,6 @@ int	xfs_getbmap(struct xfs_inode *ip, struct getbmapx *bmv,
 		xfs_bmap_format_t formatter, void *arg);
 
 /* functions in xfs_bmap.c that are only needed by xfs_bmap_util.c */
-void	xfs_bmap_del_free(struct xfs_bmap_free *flist,
-			  struct xfs_bmap_free_item *free);
 int	xfs_bmap_extsize_align(struct xfs_mount *mp, struct xfs_bmbt_irec *gotp,
 			       struct xfs_bmbt_irec *prevp, xfs_extlen_t extsz,
 			       int rt, int eof, int delay, int convert,
diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
index e064665..be17f0a 100644
--- a/fs/xfs/xfs_dquot.c
+++ b/fs/xfs/xfs_dquot.c
@@ -23,6 +23,7 @@
 #include "xfs_trans_resv.h"
 #include "xfs_bit.h"
 #include "xfs_mount.h"
+#include "xfs_defer.h"
 #include "xfs_inode.h"
 #include "xfs_bmap.h"
 #include "xfs_bmap_util.h"
diff --git a/fs/xfs/xfs_filestream.c b/fs/xfs/xfs_filestream.c
index a51353a..3e990fb 100644
--- a/fs/xfs/xfs_filestream.c
+++ b/fs/xfs/xfs_filestream.c
@@ -22,6 +22,7 @@
 #include "xfs_trans_resv.h"
 #include "xfs_sb.h"
 #include "xfs_mount.h"
+#include "xfs_defer.h"
 #include "xfs_inode.h"
 #include "xfs_bmap.h"
 #include "xfs_bmap_util.h"
@@ -385,7 +386,7 @@ xfs_filestream_new_ag(
 	}
 
 	flags = (ap->userdata ? XFS_PICK_USERDATA : 0) |
-	        (ap->flist->xbf_low ? XFS_PICK_LOWSPACE : 0);
+	        (ap->flist->dop_low ? XFS_PICK_LOWSPACE : 0);
 
 	err = xfs_filestream_pick_ag(pip, startag, agp, flags, minlen);
 
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index b4d7582..064fce1 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -23,6 +23,7 @@
 #include "xfs_trans_resv.h"
 #include "xfs_sb.h"
 #include "xfs_mount.h"
+#include "xfs_defer.h"
 #include "xfs_da_format.h"
 #include "xfs_da_btree.h"
 #include "xfs_inode.h"
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 8825bcf..d2389bb 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -25,6 +25,7 @@
 #include "xfs_trans_resv.h"
 #include "xfs_sb.h"
 #include "xfs_mount.h"
+#include "xfs_defer.h"
 #include "xfs_inode.h"
 #include "xfs_da_format.h"
 #include "xfs_da_btree.h"
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 5839135..b090bc1 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -22,6 +22,7 @@
 #include "xfs_log_format.h"
 #include "xfs_trans_resv.h"
 #include "xfs_mount.h"
+#include "xfs_defer.h"
 #include "xfs_inode.h"
 #include "xfs_btree.h"
 #include "xfs_bmap_btree.h"
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 8359978..080b54b 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -24,6 +24,7 @@
 #include "xfs_bit.h"
 #include "xfs_sb.h"
 #include "xfs_mount.h"
+#include "xfs_defer.h"
 #include "xfs_da_format.h"
 #include "xfs_da_btree.h"
 #include "xfs_inode.h"
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index e39b023..bf63682 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -24,6 +24,7 @@
 #include "xfs_bit.h"
 #include "xfs_sb.h"
 #include "xfs_mount.h"
+#include "xfs_defer.h"
 #include "xfs_da_format.h"
 #include "xfs_da_btree.h"
 #include "xfs_inode.h"
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index 3938b37..627f7e6 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -23,6 +23,7 @@
 #include "xfs_trans_resv.h"
 #include "xfs_bit.h"
 #include "xfs_mount.h"
+#include "xfs_defer.h"
 #include "xfs_inode.h"
 #include "xfs_bmap.h"
 #include "xfs_bmap_util.h"
diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c
index 08a46c6..20af47b 100644
--- a/fs/xfs/xfs_symlink.c
+++ b/fs/xfs/xfs_symlink.c
@@ -26,6 +26,7 @@
 #include "xfs_mount.h"
 #include "xfs_da_format.h"
 #include "xfs_da_btree.h"
+#include "xfs_defer.h"
 #include "xfs_dir2.h"
 #include "xfs_inode.h"
 #include "xfs_ialloc.h"
diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c
index 3971527..7f17ae6 100644
--- a/fs/xfs/xfs_trace.c
+++ b/fs/xfs/xfs_trace.c
@@ -24,6 +24,7 @@
 #include "xfs_mount.h"
 #include "xfs_defer.h"
 #include "xfs_da_format.h"
+#include "xfs_defer.h"
 #include "xfs_inode.h"
 #include "xfs_btree.h"
 #include "xfs_da_btree.h"



* [PATCH 020/119] xfs: change xfs_bmap_{finish, cancel, init, free} -> xfs_defer_*
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (18 preceding siblings ...)
  2016-06-17  1:19 ` [PATCH 019/119] xfs: rework xfs_bmap_free callers to use xfs_defer_ops Darrick J. Wong
@ 2016-06-17  1:20 ` Darrick J. Wong
  2016-06-30  0:11   ` Darrick J. Wong
  2016-06-17  1:20 ` [PATCH 021/119] xfs: rename flist/free_list to dfops Darrick J. Wong
                   ` (98 subsequent siblings)
  118 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:20 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Drop the compatibility shims that we were using to integrate the new
deferred operation mechanism into the existing code.  No new code.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_attr.c        |   58 ++++++++++++++++++------------------
 fs/xfs/libxfs/xfs_attr_remote.c |   14 ++++-----
 fs/xfs/libxfs/xfs_bmap.c        |   38 ++++++++++++------------
 fs/xfs/libxfs/xfs_bmap.h        |   10 +++---
 fs/xfs/libxfs/xfs_btree.h       |    5 ++-
 fs/xfs/libxfs/xfs_da_btree.h    |    4 +--
 fs/xfs/libxfs/xfs_defer.h       |    7 ----
 fs/xfs/libxfs/xfs_dir2.c        |    6 ++--
 fs/xfs/libxfs/xfs_dir2.h        |    8 +++--
 fs/xfs/libxfs/xfs_ialloc.c      |    6 ++--
 fs/xfs/libxfs/xfs_ialloc.h      |    2 +
 fs/xfs/libxfs/xfs_trans_resv.c  |    4 +--
 fs/xfs/xfs_bmap_util.c          |   28 +++++++++---------
 fs/xfs/xfs_dquot.c              |   10 +++---
 fs/xfs/xfs_inode.c              |   62 ++++++++++++++++++++-------------------
 fs/xfs/xfs_inode.h              |    4 +--
 fs/xfs/xfs_iomap.c              |   24 ++++++++-------
 fs/xfs/xfs_rtalloc.c            |    8 +++--
 fs/xfs/xfs_symlink.c            |   16 +++++-----
 19 files changed, 154 insertions(+), 160 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
index 79d3a30..66baf97 100644
--- a/fs/xfs/libxfs/xfs_attr.c
+++ b/fs/xfs/libxfs/xfs_attr.c
@@ -204,7 +204,7 @@ xfs_attr_set(
 {
 	struct xfs_mount	*mp = dp->i_mount;
 	struct xfs_da_args	args;
-	struct xfs_bmap_free	flist;
+	struct xfs_defer_ops	flist;
 	struct xfs_trans_res	tres;
 	xfs_fsblock_t		firstblock;
 	int			rsvd = (flags & ATTR_ROOT) != 0;
@@ -317,13 +317,13 @@ xfs_attr_set(
 		 * It won't fit in the shortform, transform to a leaf block.
 		 * GROT: another possible req'mt for a double-split btree op.
 		 */
-		xfs_bmap_init(args.flist, args.firstblock);
+		xfs_defer_init(args.flist, args.firstblock);
 		error = xfs_attr_shortform_to_leaf(&args);
 		if (!error)
-			error = xfs_bmap_finish(&args.trans, args.flist, dp);
+			error = xfs_defer_finish(&args.trans, args.flist, dp);
 		if (error) {
 			args.trans = NULL;
-			xfs_bmap_cancel(&flist);
+			xfs_defer_cancel(&flist);
 			goto out;
 		}
 
@@ -383,7 +383,7 @@ xfs_attr_remove(
 {
 	struct xfs_mount	*mp = dp->i_mount;
 	struct xfs_da_args	args;
-	struct xfs_bmap_free	flist;
+	struct xfs_defer_ops	flist;
 	xfs_fsblock_t		firstblock;
 	int			error;
 
@@ -585,13 +585,13 @@ xfs_attr_leaf_addname(xfs_da_args_t *args)
 		 * Commit that transaction so that the node_addname() call
 		 * can manage its own transactions.
 		 */
-		xfs_bmap_init(args->flist, args->firstblock);
+		xfs_defer_init(args->flist, args->firstblock);
 		error = xfs_attr3_leaf_to_node(args);
 		if (!error)
-			error = xfs_bmap_finish(&args->trans, args->flist, dp);
+			error = xfs_defer_finish(&args->trans, args->flist, dp);
 		if (error) {
 			args->trans = NULL;
-			xfs_bmap_cancel(args->flist);
+			xfs_defer_cancel(args->flist);
 			return error;
 		}
 
@@ -675,15 +675,15 @@ xfs_attr_leaf_addname(xfs_da_args_t *args)
 		 * If the result is small enough, shrink it all into the inode.
 		 */
 		if ((forkoff = xfs_attr_shortform_allfit(bp, dp))) {
-			xfs_bmap_init(args->flist, args->firstblock);
+			xfs_defer_init(args->flist, args->firstblock);
 			error = xfs_attr3_leaf_to_shortform(bp, args, forkoff);
 			/* bp is gone due to xfs_da_shrink_inode */
 			if (!error)
-				error = xfs_bmap_finish(&args->trans,
+				error = xfs_defer_finish(&args->trans,
 							args->flist, dp);
 			if (error) {
 				args->trans = NULL;
-				xfs_bmap_cancel(args->flist);
+				xfs_defer_cancel(args->flist);
 				return error;
 			}
 		}
@@ -738,14 +738,14 @@ xfs_attr_leaf_removename(xfs_da_args_t *args)
 	 * If the result is small enough, shrink it all into the inode.
 	 */
 	if ((forkoff = xfs_attr_shortform_allfit(bp, dp))) {
-		xfs_bmap_init(args->flist, args->firstblock);
+		xfs_defer_init(args->flist, args->firstblock);
 		error = xfs_attr3_leaf_to_shortform(bp, args, forkoff);
 		/* bp is gone due to xfs_da_shrink_inode */
 		if (!error)
-			error = xfs_bmap_finish(&args->trans, args->flist, dp);
+			error = xfs_defer_finish(&args->trans, args->flist, dp);
 		if (error) {
 			args->trans = NULL;
-			xfs_bmap_cancel(args->flist);
+			xfs_defer_cancel(args->flist);
 			return error;
 		}
 	}
@@ -864,14 +864,14 @@ restart:
 			 */
 			xfs_da_state_free(state);
 			state = NULL;
-			xfs_bmap_init(args->flist, args->firstblock);
+			xfs_defer_init(args->flist, args->firstblock);
 			error = xfs_attr3_leaf_to_node(args);
 			if (!error)
-				error = xfs_bmap_finish(&args->trans,
+				error = xfs_defer_finish(&args->trans,
 							args->flist, dp);
 			if (error) {
 				args->trans = NULL;
-				xfs_bmap_cancel(args->flist);
+				xfs_defer_cancel(args->flist);
 				goto out;
 			}
 
@@ -892,13 +892,13 @@ restart:
 		 * in the index/blkno/rmtblkno/rmtblkcnt fields and
 		 * in the index2/blkno2/rmtblkno2/rmtblkcnt2 fields.
 		 */
-		xfs_bmap_init(args->flist, args->firstblock);
+		xfs_defer_init(args->flist, args->firstblock);
 		error = xfs_da3_split(state);
 		if (!error)
-			error = xfs_bmap_finish(&args->trans, args->flist, dp);
+			error = xfs_defer_finish(&args->trans, args->flist, dp);
 		if (error) {
 			args->trans = NULL;
-			xfs_bmap_cancel(args->flist);
+			xfs_defer_cancel(args->flist);
 			goto out;
 		}
 	} else {
@@ -991,14 +991,14 @@ restart:
 		 * Check to see if the tree needs to be collapsed.
 		 */
 		if (retval && (state->path.active > 1)) {
-			xfs_bmap_init(args->flist, args->firstblock);
+			xfs_defer_init(args->flist, args->firstblock);
 			error = xfs_da3_join(state);
 			if (!error)
-				error = xfs_bmap_finish(&args->trans,
+				error = xfs_defer_finish(&args->trans,
 							args->flist, dp);
 			if (error) {
 				args->trans = NULL;
-				xfs_bmap_cancel(args->flist);
+				xfs_defer_cancel(args->flist);
 				goto out;
 			}
 		}
@@ -1114,13 +1114,13 @@ xfs_attr_node_removename(xfs_da_args_t *args)
 	 * Check to see if the tree needs to be collapsed.
 	 */
 	if (retval && (state->path.active > 1)) {
-		xfs_bmap_init(args->flist, args->firstblock);
+		xfs_defer_init(args->flist, args->firstblock);
 		error = xfs_da3_join(state);
 		if (!error)
-			error = xfs_bmap_finish(&args->trans, args->flist, dp);
+			error = xfs_defer_finish(&args->trans, args->flist, dp);
 		if (error) {
 			args->trans = NULL;
-			xfs_bmap_cancel(args->flist);
+			xfs_defer_cancel(args->flist);
 			goto out;
 		}
 		/*
@@ -1147,15 +1147,15 @@ xfs_attr_node_removename(xfs_da_args_t *args)
 			goto out;
 
 		if ((forkoff = xfs_attr_shortform_allfit(bp, dp))) {
-			xfs_bmap_init(args->flist, args->firstblock);
+			xfs_defer_init(args->flist, args->firstblock);
 			error = xfs_attr3_leaf_to_shortform(bp, args, forkoff);
 			/* bp is gone due to xfs_da_shrink_inode */
 			if (!error)
-				error = xfs_bmap_finish(&args->trans,
+				error = xfs_defer_finish(&args->trans,
 							args->flist, dp);
 			if (error) {
 				args->trans = NULL;
-				xfs_bmap_cancel(args->flist);
+				xfs_defer_cancel(args->flist);
 				goto out;
 			}
 		} else
diff --git a/fs/xfs/libxfs/xfs_attr_remote.c b/fs/xfs/libxfs/xfs_attr_remote.c
index 93a9ce1..aabb516 100644
--- a/fs/xfs/libxfs/xfs_attr_remote.c
+++ b/fs/xfs/libxfs/xfs_attr_remote.c
@@ -461,16 +461,16 @@ xfs_attr_rmtval_set(
 		 * extent and then crash then the block may not contain the
 		 * correct metadata after log recovery occurs.
 		 */
-		xfs_bmap_init(args->flist, args->firstblock);
+		xfs_defer_init(args->flist, args->firstblock);
 		nmap = 1;
 		error = xfs_bmapi_write(args->trans, dp, (xfs_fileoff_t)lblkno,
 				  blkcnt, XFS_BMAPI_ATTRFORK, args->firstblock,
 				  args->total, &map, &nmap, args->flist);
 		if (!error)
-			error = xfs_bmap_finish(&args->trans, args->flist, dp);
+			error = xfs_defer_finish(&args->trans, args->flist, dp);
 		if (error) {
 			args->trans = NULL;
-			xfs_bmap_cancel(args->flist);
+			xfs_defer_cancel(args->flist);
 			return error;
 		}
 
@@ -504,7 +504,7 @@ xfs_attr_rmtval_set(
 
 		ASSERT(blkcnt > 0);
 
-		xfs_bmap_init(args->flist, args->firstblock);
+		xfs_defer_init(args->flist, args->firstblock);
 		nmap = 1;
 		error = xfs_bmapi_read(dp, (xfs_fileoff_t)lblkno,
 				       blkcnt, &map, &nmap,
@@ -604,16 +604,16 @@ xfs_attr_rmtval_remove(
 	blkcnt = args->rmtblkcnt;
 	done = 0;
 	while (!done) {
-		xfs_bmap_init(args->flist, args->firstblock);
+		xfs_defer_init(args->flist, args->firstblock);
 		error = xfs_bunmapi(args->trans, args->dp, lblkno, blkcnt,
 				    XFS_BMAPI_ATTRFORK, 1, args->firstblock,
 				    args->flist, &done);
 		if (!error)
-			error = xfs_bmap_finish(&args->trans, args->flist,
+			error = xfs_defer_finish(&args->trans, args->flist,
 						args->dp);
 		if (error) {
 			args->trans = NULL;
-			xfs_bmap_cancel(args->flist);
+			xfs_defer_cancel(args->flist);
 			return error;
 		}
 
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 64ca97f..45ce7bd 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -572,7 +572,7 @@ xfs_bmap_validate_ret(
 void
 xfs_bmap_add_free(
 	struct xfs_mount	*mp,		/* mount point structure */
-	struct xfs_bmap_free	*flist,		/* list of extents */
+	struct xfs_defer_ops	*flist,		/* list of extents */
 	xfs_fsblock_t		bno,		/* fs block number of extent */
 	xfs_filblks_t		len)		/* length of extent */
 {
@@ -672,7 +672,7 @@ xfs_bmap_extents_to_btree(
 	xfs_trans_t		*tp,		/* transaction pointer */
 	xfs_inode_t		*ip,		/* incore inode pointer */
 	xfs_fsblock_t		*firstblock,	/* first-block-allocated */
-	xfs_bmap_free_t		*flist,		/* blocks freed in xaction */
+	struct xfs_defer_ops	*flist,		/* blocks freed in xaction */
 	xfs_btree_cur_t		**curp,		/* cursor returned to caller */
 	int			wasdel,		/* converting a delayed alloc */
 	int			*logflagsp,	/* inode logging flags */
@@ -940,7 +940,7 @@ xfs_bmap_add_attrfork_btree(
 	xfs_trans_t		*tp,		/* transaction pointer */
 	xfs_inode_t		*ip,		/* incore inode pointer */
 	xfs_fsblock_t		*firstblock,	/* first block allocated */
-	xfs_bmap_free_t		*flist,		/* blocks to free at commit */
+	struct xfs_defer_ops	*flist,		/* blocks to free at commit */
 	int			*flags)		/* inode logging flags */
 {
 	xfs_btree_cur_t		*cur;		/* btree cursor */
@@ -983,7 +983,7 @@ xfs_bmap_add_attrfork_extents(
 	xfs_trans_t		*tp,		/* transaction pointer */
 	xfs_inode_t		*ip,		/* incore inode pointer */
 	xfs_fsblock_t		*firstblock,	/* first block allocated */
-	xfs_bmap_free_t		*flist,		/* blocks to free at commit */
+	struct xfs_defer_ops	*flist,		/* blocks to free at commit */
 	int			*flags)		/* inode logging flags */
 {
 	xfs_btree_cur_t		*cur;		/* bmap btree cursor */
@@ -1018,7 +1018,7 @@ xfs_bmap_add_attrfork_local(
 	xfs_trans_t		*tp,		/* transaction pointer */
 	xfs_inode_t		*ip,		/* incore inode pointer */
 	xfs_fsblock_t		*firstblock,	/* first block allocated */
-	xfs_bmap_free_t		*flist,		/* blocks to free at commit */
+	struct xfs_defer_ops	*flist,		/* blocks to free at commit */
 	int			*flags)		/* inode logging flags */
 {
 	xfs_da_args_t		dargs;		/* args for dir/attr code */
@@ -1059,7 +1059,7 @@ xfs_bmap_add_attrfork(
 	int			rsvd)		/* xact may use reserved blks */
 {
 	xfs_fsblock_t		firstblock;	/* 1st block/ag allocated */
-	xfs_bmap_free_t		flist;		/* freed extent records */
+	struct xfs_defer_ops	flist;		/* freed extent records */
 	xfs_mount_t		*mp;		/* mount structure */
 	xfs_trans_t		*tp;		/* transaction pointer */
 	int			blks;		/* space reservation */
@@ -1125,7 +1125,7 @@ xfs_bmap_add_attrfork(
 	ip->i_afp = kmem_zone_zalloc(xfs_ifork_zone, KM_SLEEP);
 	ip->i_afp->if_flags = XFS_IFEXTENTS;
 	logflags = 0;
-	xfs_bmap_init(&flist, &firstblock);
+	xfs_defer_init(&flist, &firstblock);
 	switch (ip->i_d.di_format) {
 	case XFS_DINODE_FMT_LOCAL:
 		error = xfs_bmap_add_attrfork_local(tp, ip, &firstblock, &flist,
@@ -1165,7 +1165,7 @@ xfs_bmap_add_attrfork(
 			xfs_log_sb(tp);
 	}
 
-	error = xfs_bmap_finish(&tp, &flist, NULL);
+	error = xfs_defer_finish(&tp, &flist, NULL);
 	if (error)
 		goto bmap_cancel;
 	error = xfs_trans_commit(tp);
@@ -1173,7 +1173,7 @@ xfs_bmap_add_attrfork(
 	return error;
 
 bmap_cancel:
-	xfs_bmap_cancel(&flist);
+	xfs_defer_cancel(&flist);
 trans_cancel:
 	xfs_trans_cancel(tp);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
@@ -2214,7 +2214,7 @@ xfs_bmap_add_extent_unwritten_real(
 	xfs_btree_cur_t		**curp,	/* if *curp is null, not a btree */
 	xfs_bmbt_irec_t		*new,	/* new data to add to file extents */
 	xfs_fsblock_t		*first,	/* pointer to firstblock variable */
-	xfs_bmap_free_t		*flist,	/* list of extents to be freed */
+	struct xfs_defer_ops	*flist,	/* list of extents to be freed */
 	int			*logflagsp) /* inode logging flags */
 {
 	xfs_btree_cur_t		*cur;	/* btree cursor */
@@ -4447,7 +4447,7 @@ xfs_bmapi_write(
 	xfs_extlen_t		total,		/* total blocks needed */
 	struct xfs_bmbt_irec	*mval,		/* output: map values */
 	int			*nmap,		/* i/o: mval size/count */
-	struct xfs_bmap_free	*flist)		/* i/o: list extents to free */
+	struct xfs_defer_ops	*flist)		/* i/o: list extents to free */
 {
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_ifork	*ifp;
@@ -4735,7 +4735,7 @@ xfs_bmap_del_extent(
 	xfs_inode_t		*ip,	/* incore inode pointer */
 	xfs_trans_t		*tp,	/* current transaction pointer */
 	xfs_extnum_t		*idx,	/* extent number to update/delete */
-	xfs_bmap_free_t		*flist,	/* list of extents to be freed */
+	struct xfs_defer_ops	*flist,	/* list of extents to be freed */
 	xfs_btree_cur_t		*cur,	/* if null, not a btree */
 	xfs_bmbt_irec_t		*del,	/* data to remove from extents */
 	int			*logflagsp, /* inode logging flags */
@@ -5064,7 +5064,7 @@ xfs_bunmapi(
 	xfs_extnum_t		nexts,		/* number of extents max */
 	xfs_fsblock_t		*firstblock,	/* first allocated block
 						   controls a.g. for allocs */
-	xfs_bmap_free_t		*flist,		/* i/o: list extents to free */
+	struct xfs_defer_ops	*flist,		/* i/o: list extents to free */
 	int			*done)		/* set if not done yet */
 {
 	xfs_btree_cur_t		*cur;		/* bmap btree cursor */
@@ -5678,7 +5678,7 @@ xfs_bmap_shift_extents(
 	int			*done,
 	xfs_fileoff_t		stop_fsb,
 	xfs_fsblock_t		*firstblock,
-	struct xfs_bmap_free	*flist,
+	struct xfs_defer_ops	*flist,
 	enum shift_direction	direction,
 	int			num_exts)
 {
@@ -5832,7 +5832,7 @@ xfs_bmap_split_extent_at(
 	struct xfs_inode	*ip,
 	xfs_fileoff_t		split_fsb,
 	xfs_fsblock_t		*firstfsb,
-	struct xfs_bmap_free	*free_list)
+	struct xfs_defer_ops	*free_list)
 {
 	int				whichfork = XFS_DATA_FORK;
 	struct xfs_btree_cur		*cur = NULL;
@@ -5971,7 +5971,7 @@ xfs_bmap_split_extent(
 {
 	struct xfs_mount        *mp = ip->i_mount;
 	struct xfs_trans        *tp;
-	struct xfs_bmap_free    free_list;
+	struct xfs_defer_ops    free_list;
 	xfs_fsblock_t           firstfsb;
 	int                     error;
 
@@ -5983,21 +5983,21 @@ xfs_bmap_split_extent(
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
 	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
 
-	xfs_bmap_init(&free_list, &firstfsb);
+	xfs_defer_init(&free_list, &firstfsb);
 
 	error = xfs_bmap_split_extent_at(tp, ip, split_fsb,
 			&firstfsb, &free_list);
 	if (error)
 		goto out;
 
-	error = xfs_bmap_finish(&tp, &free_list, NULL);
+	error = xfs_defer_finish(&tp, &free_list, NULL);
 	if (error)
 		goto out;
 
 	return xfs_trans_commit(tp);
 
 out:
-	xfs_bmap_cancel(&free_list);
+	xfs_defer_cancel(&free_list);
 	xfs_trans_cancel(tp);
 	return error;
 }
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 6681bd9..e2a0425 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -32,7 +32,7 @@ extern kmem_zone_t	*xfs_bmap_free_item_zone;
  */
 struct xfs_bmalloca {
 	xfs_fsblock_t		*firstblock; /* i/o first block allocated */
-	struct xfs_bmap_free	*flist;	/* bmap freelist */
+	struct xfs_defer_ops	*flist;	/* bmap freelist */
 	struct xfs_trans	*tp;	/* transaction pointer */
 	struct xfs_inode	*ip;	/* incore inode pointer */
 	struct xfs_bmbt_irec	prev;	/* extent before the new one */
@@ -164,7 +164,7 @@ void	xfs_bmap_trace_exlist(struct xfs_inode *ip, xfs_extnum_t cnt,
 
 int	xfs_bmap_add_attrfork(struct xfs_inode *ip, int size, int rsvd);
 void	xfs_bmap_local_to_extents_empty(struct xfs_inode *ip, int whichfork);
-void	xfs_bmap_add_free(struct xfs_mount *mp, struct xfs_bmap_free *flist,
+void	xfs_bmap_add_free(struct xfs_mount *mp, struct xfs_defer_ops *flist,
 			  xfs_fsblock_t bno, xfs_filblks_t len);
 void	xfs_bmap_compute_maxlevels(struct xfs_mount *mp, int whichfork);
 int	xfs_bmap_first_unused(struct xfs_trans *tp, struct xfs_inode *ip,
@@ -186,18 +186,18 @@ int	xfs_bmapi_write(struct xfs_trans *tp, struct xfs_inode *ip,
 		xfs_fileoff_t bno, xfs_filblks_t len, int flags,
 		xfs_fsblock_t *firstblock, xfs_extlen_t total,
 		struct xfs_bmbt_irec *mval, int *nmap,
-		struct xfs_bmap_free *flist);
+		struct xfs_defer_ops *flist);
 int	xfs_bunmapi(struct xfs_trans *tp, struct xfs_inode *ip,
 		xfs_fileoff_t bno, xfs_filblks_t len, int flags,
 		xfs_extnum_t nexts, xfs_fsblock_t *firstblock,
-		struct xfs_bmap_free *flist, int *done);
+		struct xfs_defer_ops *flist, int *done);
 int	xfs_check_nostate_extents(struct xfs_ifork *ifp, xfs_extnum_t idx,
 		xfs_extnum_t num);
 uint	xfs_default_attroffset(struct xfs_inode *ip);
 int	xfs_bmap_shift_extents(struct xfs_trans *tp, struct xfs_inode *ip,
 		xfs_fileoff_t *next_fsb, xfs_fileoff_t offset_shift_fsb,
 		int *done, xfs_fileoff_t stop_fsb, xfs_fsblock_t *firstblock,
-		struct xfs_bmap_free *flist, enum shift_direction direction,
+		struct xfs_defer_ops *flist, enum shift_direction direction,
 		int num_exts);
 int	xfs_bmap_split_extent(struct xfs_inode *ip, xfs_fileoff_t split_offset);
 
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index 0ec3055..ae714a8 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -19,7 +19,7 @@
 #define	__XFS_BTREE_H__
 
 struct xfs_buf;
-struct xfs_bmap_free;
+struct xfs_defer_ops;
 struct xfs_inode;
 struct xfs_mount;
 struct xfs_trans;
@@ -234,11 +234,12 @@ typedef struct xfs_btree_cur
 	union {
 		struct {			/* needed for BNO, CNT, INO */
 			struct xfs_buf	*agbp;	/* agf/agi buffer pointer */
+			struct xfs_defer_ops *flist;	/* deferred updates */
 			xfs_agnumber_t	agno;	/* ag number */
 		} a;
 		struct {			/* needed for BMAP */
 			struct xfs_inode *ip;	/* pointer to our inode */
-			struct xfs_bmap_free *flist;	/* list to free after */
+			struct xfs_defer_ops *flist;	/* deferred updates */
 			xfs_fsblock_t	firstblock;	/* 1st blk allocated */
 			int		allocated;	/* count of alloced */
 			short		forksize;	/* fork's inode space */
diff --git a/fs/xfs/libxfs/xfs_da_btree.h b/fs/xfs/libxfs/xfs_da_btree.h
index 6e153e3..249813a 100644
--- a/fs/xfs/libxfs/xfs_da_btree.h
+++ b/fs/xfs/libxfs/xfs_da_btree.h
@@ -19,7 +19,7 @@
 #ifndef __XFS_DA_BTREE_H__
 #define	__XFS_DA_BTREE_H__
 
-struct xfs_bmap_free;
+struct xfs_defer_ops;
 struct xfs_inode;
 struct xfs_trans;
 struct zone;
@@ -70,7 +70,7 @@ typedef struct xfs_da_args {
 	xfs_ino_t	inumber;	/* input/output inode number */
 	struct xfs_inode *dp;		/* directory inode to manipulate */
 	xfs_fsblock_t	*firstblock;	/* ptr to firstblock for bmap calls */
-	struct xfs_bmap_free *flist;	/* ptr to freelist for bmap_finish */
+	struct xfs_defer_ops *flist;	/* ptr to freelist for bmap_finish */
 	struct xfs_trans *trans;	/* current trans (changes over time) */
 	xfs_extlen_t	total;		/* total blocks needed, for 1st bmap */
 	int		whichfork;	/* data or attribute fork */
diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
index 4c05ba6..743fc32 100644
--- a/fs/xfs/libxfs/xfs_defer.h
+++ b/fs/xfs/libxfs/xfs_defer.h
@@ -94,11 +94,4 @@ struct xfs_defer_op_type {
 void xfs_defer_init_op_type(const struct xfs_defer_op_type *type);
 void xfs_defer_init_types(void);
 
-/* XXX: compatibility shims, will go away in the next patch */
-#define xfs_bmap_finish		xfs_defer_finish
-#define xfs_bmap_cancel		xfs_defer_cancel
-#define xfs_bmap_init		xfs_defer_init
-#define xfs_bmap_free		xfs_defer_ops
-typedef struct xfs_defer_ops	xfs_bmap_free_t;
-
 #endif /* __XFS_DEFER_H__ */
diff --git a/fs/xfs/libxfs/xfs_dir2.c b/fs/xfs/libxfs/xfs_dir2.c
index 945c0345..0523100 100644
--- a/fs/xfs/libxfs/xfs_dir2.c
+++ b/fs/xfs/libxfs/xfs_dir2.c
@@ -260,7 +260,7 @@ xfs_dir_createname(
 	struct xfs_name		*name,
 	xfs_ino_t		inum,		/* new entry inode number */
 	xfs_fsblock_t		*first,		/* bmap's firstblock */
-	xfs_bmap_free_t		*flist,		/* bmap's freeblock list */
+	struct xfs_defer_ops	*flist,		/* bmap's freeblock list */
 	xfs_extlen_t		total)		/* bmap's total block count */
 {
 	struct xfs_da_args	*args;
@@ -437,7 +437,7 @@ xfs_dir_removename(
 	struct xfs_name	*name,
 	xfs_ino_t	ino,
 	xfs_fsblock_t	*first,		/* bmap's firstblock */
-	xfs_bmap_free_t	*flist,		/* bmap's freeblock list */
+	struct xfs_defer_ops	*flist,		/* bmap's freeblock list */
 	xfs_extlen_t	total)		/* bmap's total block count */
 {
 	struct xfs_da_args *args;
@@ -499,7 +499,7 @@ xfs_dir_replace(
 	struct xfs_name	*name,		/* name of entry to replace */
 	xfs_ino_t	inum,		/* new inode number */
 	xfs_fsblock_t	*first,		/* bmap's firstblock */
-	xfs_bmap_free_t	*flist,		/* bmap's freeblock list */
+	struct xfs_defer_ops	*flist,		/* bmap's freeblock list */
 	xfs_extlen_t	total)		/* bmap's total block count */
 {
 	struct xfs_da_args *args;
diff --git a/fs/xfs/libxfs/xfs_dir2.h b/fs/xfs/libxfs/xfs_dir2.h
index 0a62e73..5737d85 100644
--- a/fs/xfs/libxfs/xfs_dir2.h
+++ b/fs/xfs/libxfs/xfs_dir2.h
@@ -18,7 +18,7 @@
 #ifndef __XFS_DIR2_H__
 #define __XFS_DIR2_H__
 
-struct xfs_bmap_free;
+struct xfs_defer_ops;
 struct xfs_da_args;
 struct xfs_inode;
 struct xfs_mount;
@@ -129,18 +129,18 @@ extern int xfs_dir_init(struct xfs_trans *tp, struct xfs_inode *dp,
 extern int xfs_dir_createname(struct xfs_trans *tp, struct xfs_inode *dp,
 				struct xfs_name *name, xfs_ino_t inum,
 				xfs_fsblock_t *first,
-				struct xfs_bmap_free *flist, xfs_extlen_t tot);
+				struct xfs_defer_ops *flist, xfs_extlen_t tot);
 extern int xfs_dir_lookup(struct xfs_trans *tp, struct xfs_inode *dp,
 				struct xfs_name *name, xfs_ino_t *inum,
 				struct xfs_name *ci_name);
 extern int xfs_dir_removename(struct xfs_trans *tp, struct xfs_inode *dp,
 				struct xfs_name *name, xfs_ino_t ino,
 				xfs_fsblock_t *first,
-				struct xfs_bmap_free *flist, xfs_extlen_t tot);
+				struct xfs_defer_ops *flist, xfs_extlen_t tot);
 extern int xfs_dir_replace(struct xfs_trans *tp, struct xfs_inode *dp,
 				struct xfs_name *name, xfs_ino_t inum,
 				xfs_fsblock_t *first,
-				struct xfs_bmap_free *flist, xfs_extlen_t tot);
+				struct xfs_defer_ops *flist, xfs_extlen_t tot);
 extern int xfs_dir_canenter(struct xfs_trans *tp, struct xfs_inode *dp,
 				struct xfs_name *name);
 
diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 9ae9a43..f2e29a1 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -1818,7 +1818,7 @@ xfs_difree_inode_chunk(
 	struct xfs_mount		*mp,
 	xfs_agnumber_t			agno,
 	struct xfs_inobt_rec_incore	*rec,
-	struct xfs_bmap_free		*flist)
+	struct xfs_defer_ops		*flist)
 {
 	xfs_agblock_t	sagbno = XFS_AGINO_TO_AGBNO(mp, rec->ir_startino);
 	int		startidx, endidx;
@@ -1890,7 +1890,7 @@ xfs_difree_inobt(
 	struct xfs_trans		*tp,
 	struct xfs_buf			*agbp,
 	xfs_agino_t			agino,
-	struct xfs_bmap_free		*flist,
+	struct xfs_defer_ops		*flist,
 	struct xfs_icluster		*xic,
 	struct xfs_inobt_rec_incore	*orec)
 {
@@ -2122,7 +2122,7 @@ int
 xfs_difree(
 	struct xfs_trans	*tp,		/* transaction pointer */
 	xfs_ino_t		inode,		/* inode to be freed */
-	struct xfs_bmap_free	*flist,		/* extents to free */
+	struct xfs_defer_ops	*flist,		/* extents to free */
 	struct xfs_icluster	*xic)	/* cluster info if deleted */
 {
 	/* REFERENCED */
diff --git a/fs/xfs/libxfs/xfs_ialloc.h b/fs/xfs/libxfs/xfs_ialloc.h
index 6e450df..2e06b67 100644
--- a/fs/xfs/libxfs/xfs_ialloc.h
+++ b/fs/xfs/libxfs/xfs_ialloc.h
@@ -95,7 +95,7 @@ int					/* error */
 xfs_difree(
 	struct xfs_trans *tp,		/* transaction pointer */
 	xfs_ino_t	inode,		/* inode to be freed */
-	struct xfs_bmap_free *flist,	/* extents to free */
+	struct xfs_defer_ops *flist,	/* extents to free */
 	struct xfs_icluster *ifree);	/* cluster info if deleted */
 
 /*
diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
index 68cb1e7..4c7eb9d 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.c
+++ b/fs/xfs/libxfs/xfs_trans_resv.c
@@ -153,9 +153,9 @@ xfs_calc_finobt_res(
  * item logged to try to account for the overhead of the transaction mechanism.
  *
  * Note:  Most of the reservations underestimate the number of allocation
- * groups into which they could free extents in the xfs_bmap_finish() call.
+ * groups into which they could free extents in the xfs_defer_finish() call.
  * This is because the number in the worst case is quite high and quite
- * unusual.  In order to fix this we need to change xfs_bmap_finish() to free
+ * unusual.  In order to fix this we need to change xfs_defer_finish() to free
  * extents in only a single AG at a time.  This will require changes to the
  * EFI code as well, however, so that the EFI for the extents not freed is
  * logged again in each transaction.  See SGI PV #261917.
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 972a27a..928dfa4 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -685,7 +685,7 @@ xfs_bmap_punch_delalloc_range(
 		xfs_bmbt_irec_t	imap;
 		int		nimaps = 1;
 		xfs_fsblock_t	firstblock;
-		xfs_bmap_free_t flist;
+		struct xfs_defer_ops flist;
 
 		/*
 		 * Map the range first and check that it is a delalloc extent
@@ -721,7 +721,7 @@ xfs_bmap_punch_delalloc_range(
 		 * allocated or freed for a delalloc extent and hence we need
 		 * don't cancel or finish them after the xfs_bunmapi() call.
 		 */
-		xfs_bmap_init(&flist, &firstblock);
+		xfs_defer_init(&flist, &firstblock);
 		error = xfs_bunmapi(NULL, ip, start_fsb, 1, 0, 1, &firstblock,
 					&flist, &done);
 		if (error)
@@ -884,7 +884,7 @@ xfs_alloc_file_space(
 	int			rt;
 	xfs_trans_t		*tp;
 	xfs_bmbt_irec_t		imaps[1], *imapp;
-	xfs_bmap_free_t		free_list;
+	struct xfs_defer_ops	free_list;
 	uint			qblocks, resblks, resrtextents;
 	int			error;
 
@@ -975,7 +975,7 @@ xfs_alloc_file_space(
 
 		xfs_trans_ijoin(tp, ip, 0);
 
-		xfs_bmap_init(&free_list, &firstfsb);
+		xfs_defer_init(&free_list, &firstfsb);
 		error = xfs_bmapi_write(tp, ip, startoffset_fsb,
 					allocatesize_fsb, alloc_type, &firstfsb,
 					resblks, imapp, &nimaps, &free_list);
@@ -985,7 +985,7 @@ xfs_alloc_file_space(
 		/*
 		 * Complete the transaction
 		 */
-		error = xfs_bmap_finish(&tp, &free_list, NULL);
+		error = xfs_defer_finish(&tp, &free_list, NULL);
 		if (error)
 			goto error0;
 
@@ -1008,7 +1008,7 @@ xfs_alloc_file_space(
 	return error;
 
 error0:	/* Cancel bmap, unlock inode, unreserve quota blocks, cancel trans */
-	xfs_bmap_cancel(&free_list);
+	xfs_defer_cancel(&free_list);
 	xfs_trans_unreserve_quota_nblks(tp, ip, (long)qblocks, 0, quota_flag);
 
 error1:	/* Just cancel transaction */
@@ -1122,7 +1122,7 @@ xfs_free_file_space(
 	xfs_fileoff_t		endoffset_fsb;
 	int			error;
 	xfs_fsblock_t		firstfsb;
-	xfs_bmap_free_t		free_list;
+	struct xfs_defer_ops	free_list;
 	xfs_bmbt_irec_t		imap;
 	xfs_off_t		ioffset;
 	xfs_off_t		iendoffset;
@@ -1245,7 +1245,7 @@ xfs_free_file_space(
 		/*
 		 * issue the bunmapi() call to free the blocks
 		 */
-		xfs_bmap_init(&free_list, &firstfsb);
+		xfs_defer_init(&free_list, &firstfsb);
 		error = xfs_bunmapi(tp, ip, startoffset_fsb,
 				  endoffset_fsb - startoffset_fsb,
 				  0, 2, &firstfsb, &free_list, &done);
@@ -1255,7 +1255,7 @@ xfs_free_file_space(
 		/*
 		 * complete the transaction
 		 */
-		error = xfs_bmap_finish(&tp, &free_list, NULL);
+	error = xfs_defer_finish(&tp, &free_list, NULL);
 		if (error)
 			goto error0;
 
@@ -1267,7 +1267,7 @@ xfs_free_file_space(
 	return error;
 
  error0:
-	xfs_bmap_cancel(&free_list);
+	xfs_defer_cancel(&free_list);
  error1:
 	xfs_trans_cancel(tp);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
@@ -1333,7 +1333,7 @@ xfs_shift_file_space(
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_trans	*tp;
 	int			error;
-	struct xfs_bmap_free	free_list;
+	struct xfs_defer_ops	free_list;
 	xfs_fsblock_t		first_block;
 	xfs_fileoff_t		stop_fsb;
 	xfs_fileoff_t		next_fsb;
@@ -1411,7 +1411,7 @@ xfs_shift_file_space(
 
 		xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
 
-		xfs_bmap_init(&free_list, &first_block);
+		xfs_defer_init(&free_list, &first_block);
 
 		/*
 		 * We are using the write transaction in which max 2 bmbt
@@ -1423,7 +1423,7 @@ xfs_shift_file_space(
 		if (error)
 			goto out_bmap_cancel;
 
-		error = xfs_bmap_finish(&tp, &free_list, NULL);
+		error = xfs_defer_finish(&tp, &free_list, NULL);
 		if (error)
 			goto out_bmap_cancel;
 
@@ -1433,7 +1433,7 @@ xfs_shift_file_space(
 	return error;
 
 out_bmap_cancel:
-	xfs_bmap_cancel(&free_list);
+	xfs_defer_cancel(&free_list);
 out_trans_cancel:
 	xfs_trans_cancel(tp);
 	return error;
diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
index be17f0a..764e1cc 100644
--- a/fs/xfs/xfs_dquot.c
+++ b/fs/xfs/xfs_dquot.c
@@ -307,7 +307,7 @@ xfs_qm_dqalloc(
 	xfs_buf_t	**O_bpp)
 {
 	xfs_fsblock_t	firstblock;
-	xfs_bmap_free_t flist;
+	struct xfs_defer_ops flist;
 	xfs_bmbt_irec_t map;
 	int		nmaps, error;
 	xfs_buf_t	*bp;
@@ -320,7 +320,7 @@ xfs_qm_dqalloc(
 	/*
 	 * Initialize the bmap freelist prior to calling bmapi code.
 	 */
-	xfs_bmap_init(&flist, &firstblock);
+	xfs_defer_init(&flist, &firstblock);
 	xfs_ilock(quotip, XFS_ILOCK_EXCL);
 	/*
 	 * Return if this type of quotas is turned off while we didn't
@@ -368,7 +368,7 @@ xfs_qm_dqalloc(
 			      dqp->dq_flags & XFS_DQ_ALLTYPES, bp);
 
 	/*
-	 * xfs_bmap_finish() may commit the current transaction and
+	 * xfs_defer_finish() may commit the current transaction and
 	 * start a second transaction if the freelist is not empty.
 	 *
 	 * Since we still want to modify this buffer, we need to
@@ -382,7 +382,7 @@ xfs_qm_dqalloc(
 
 	xfs_trans_bhold(tp, bp);
 
-	error = xfs_bmap_finish(tpp, &flist, NULL);
+	error = xfs_defer_finish(tpp, &flist, NULL);
 	if (error)
 		goto error1;
 
@@ -398,7 +398,7 @@ xfs_qm_dqalloc(
 	return 0;
 
 error1:
-	xfs_bmap_cancel(&flist);
+	xfs_defer_cancel(&flist);
 error0:
 	xfs_iunlock(quotip, XFS_ILOCK_EXCL);
 
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index d2389bb..3ce50da 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1123,7 +1123,7 @@ xfs_create(
 	struct xfs_inode	*ip = NULL;
 	struct xfs_trans	*tp = NULL;
 	int			error;
-	xfs_bmap_free_t		free_list;
+	struct xfs_defer_ops	free_list;
 	xfs_fsblock_t		first_block;
 	bool                    unlock_dp_on_error = false;
 	prid_t			prid;
@@ -1183,7 +1183,7 @@ xfs_create(
 		      XFS_IOLOCK_PARENT | XFS_ILOCK_PARENT);
 	unlock_dp_on_error = true;
 
-	xfs_bmap_init(&free_list, &first_block);
+	xfs_defer_init(&free_list, &first_block);
 
 	/*
 	 * Reserve disk quota and the inode.
@@ -1254,7 +1254,7 @@ xfs_create(
 	 */
 	xfs_qm_vop_create_dqattach(tp, ip, udqp, gdqp, pdqp);
 
-	error = xfs_bmap_finish(&tp, &free_list, NULL);
+	error = xfs_defer_finish(&tp, &free_list, NULL);
 	if (error)
 		goto out_bmap_cancel;
 
@@ -1270,7 +1270,7 @@ xfs_create(
 	return 0;
 
  out_bmap_cancel:
-	xfs_bmap_cancel(&free_list);
+	xfs_defer_cancel(&free_list);
  out_trans_cancel:
 	xfs_trans_cancel(tp);
  out_release_inode:
@@ -1402,7 +1402,7 @@ xfs_link(
 	xfs_mount_t		*mp = tdp->i_mount;
 	xfs_trans_t		*tp;
 	int			error;
-	xfs_bmap_free_t         free_list;
+	struct xfs_defer_ops	free_list;
 	xfs_fsblock_t           first_block;
 	int			resblks;
 
@@ -1453,7 +1453,7 @@ xfs_link(
 			goto error_return;
 	}
 
-	xfs_bmap_init(&free_list, &first_block);
+	xfs_defer_init(&free_list, &first_block);
 
 	/*
 	 * Handle initial link state of O_TMPFILE inode
@@ -1483,9 +1483,9 @@ xfs_link(
 	if (mp->m_flags & (XFS_MOUNT_WSYNC|XFS_MOUNT_DIRSYNC))
 		xfs_trans_set_sync(tp);
 
-	error = xfs_bmap_finish(&tp, &free_list, NULL);
+	error = xfs_defer_finish(&tp, &free_list, NULL);
 	if (error) {
-		xfs_bmap_cancel(&free_list);
+		xfs_defer_cancel(&free_list);
 		goto error_return;
 	}
 
@@ -1527,7 +1527,7 @@ xfs_itruncate_extents(
 {
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_trans	*tp = *tpp;
-	xfs_bmap_free_t		free_list;
+	struct xfs_defer_ops	free_list;
 	xfs_fsblock_t		first_block;
 	xfs_fileoff_t		first_unmap_block;
 	xfs_fileoff_t		last_block;
@@ -1563,7 +1563,7 @@ xfs_itruncate_extents(
 	ASSERT(first_unmap_block < last_block);
 	unmap_len = last_block - first_unmap_block + 1;
 	while (!done) {
-		xfs_bmap_init(&free_list, &first_block);
+		xfs_defer_init(&free_list, &first_block);
 		error = xfs_bunmapi(tp, ip,
 				    first_unmap_block, unmap_len,
 				    xfs_bmapi_aflag(whichfork),
@@ -1577,7 +1577,7 @@ xfs_itruncate_extents(
 		 * Duplicate the transaction that has the permanent
 		 * reservation and commit the old transaction.
 		 */
-		error = xfs_bmap_finish(&tp, &free_list, ip);
+		error = xfs_defer_finish(&tp, &free_list, ip);
 		if (error)
 			goto out_bmap_cancel;
 
@@ -1603,7 +1603,7 @@ out_bmap_cancel:
 	 * the transaction can be properly aborted.  We just need to make sure
 	 * we're not holding any resources that we were not when we came in.
 	 */
-	xfs_bmap_cancel(&free_list);
+	xfs_defer_cancel(&free_list);
 	goto out;
 }
 
@@ -1744,7 +1744,7 @@ STATIC int
 xfs_inactive_ifree(
 	struct xfs_inode *ip)
 {
-	xfs_bmap_free_t		free_list;
+	struct xfs_defer_ops	free_list;
 	xfs_fsblock_t		first_block;
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_trans	*tp;
@@ -1781,7 +1781,7 @@ xfs_inactive_ifree(
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
 	xfs_trans_ijoin(tp, ip, 0);
 
-	xfs_bmap_init(&free_list, &first_block);
+	xfs_defer_init(&free_list, &first_block);
 	error = xfs_ifree(tp, ip, &free_list);
 	if (error) {
 		/*
@@ -1808,11 +1808,11 @@ xfs_inactive_ifree(
 	 * Just ignore errors at this point.  There is nothing we can do except
 	 * to try to keep going. Make sure it's not a silent error.
 	 */
-	error = xfs_bmap_finish(&tp, &free_list, NULL);
+	error = xfs_defer_finish(&tp, &free_list, NULL);
 	if (error) {
-		xfs_notice(mp, "%s: xfs_bmap_finish returned error %d",
+		xfs_notice(mp, "%s: xfs_defer_finish returned error %d",
 			__func__, error);
-		xfs_bmap_cancel(&free_list);
+		xfs_defer_cancel(&free_list);
 	}
 	error = xfs_trans_commit(tp);
 	if (error)
@@ -2368,7 +2368,7 @@ int
 xfs_ifree(
 	xfs_trans_t	*tp,
 	xfs_inode_t	*ip,
-	xfs_bmap_free_t	*flist)
+	struct xfs_defer_ops	*flist)
 {
 	int			error;
 	struct xfs_icluster	xic = { 0 };
@@ -2475,7 +2475,7 @@ xfs_iunpin_wait(
  * directory entry.
  *
  * This is still safe from a transactional point of view - it is not until we
- * get to xfs_bmap_finish() that we have the possibility of multiple
+ * get to xfs_defer_finish() that we have the possibility of multiple
  * transactions in this operation. Hence as long as we remove the directory
  * entry and drop the link count in the first transaction of the remove
  * operation, there are no transactional constraints on the ordering here.
@@ -2490,7 +2490,7 @@ xfs_remove(
 	xfs_trans_t             *tp = NULL;
 	int			is_dir = S_ISDIR(VFS_I(ip)->i_mode);
 	int                     error = 0;
-	xfs_bmap_free_t         free_list;
+	struct xfs_defer_ops	free_list;
 	xfs_fsblock_t           first_block;
 	uint			resblks;
 
@@ -2572,7 +2572,7 @@ xfs_remove(
 	if (error)
 		goto out_trans_cancel;
 
-	xfs_bmap_init(&free_list, &first_block);
+	xfs_defer_init(&free_list, &first_block);
 	error = xfs_dir_removename(tp, dp, name, ip->i_ino,
 					&first_block, &free_list, resblks);
 	if (error) {
@@ -2588,7 +2588,7 @@ xfs_remove(
 	if (mp->m_flags & (XFS_MOUNT_WSYNC|XFS_MOUNT_DIRSYNC))
 		xfs_trans_set_sync(tp);
 
-	error = xfs_bmap_finish(&tp, &free_list, NULL);
+	error = xfs_defer_finish(&tp, &free_list, NULL);
 	if (error)
 		goto out_bmap_cancel;
 
@@ -2602,7 +2602,7 @@ xfs_remove(
 	return 0;
 
  out_bmap_cancel:
-	xfs_bmap_cancel(&free_list);
+	xfs_defer_cancel(&free_list);
  out_trans_cancel:
 	xfs_trans_cancel(tp);
  std_return:
@@ -2663,7 +2663,7 @@ xfs_sort_for_rename(
 static int
 xfs_finish_rename(
 	struct xfs_trans	*tp,
-	struct xfs_bmap_free	*free_list)
+	struct xfs_defer_ops	*free_list)
 {
 	int			error;
 
@@ -2674,9 +2674,9 @@ xfs_finish_rename(
 	if (tp->t_mountp->m_flags & (XFS_MOUNT_WSYNC|XFS_MOUNT_DIRSYNC))
 		xfs_trans_set_sync(tp);
 
-	error = xfs_bmap_finish(&tp, free_list, NULL);
+	error = xfs_defer_finish(&tp, free_list, NULL);
 	if (error) {
-		xfs_bmap_cancel(free_list);
+		xfs_defer_cancel(free_list);
 		xfs_trans_cancel(tp);
 		return error;
 	}
@@ -2698,7 +2698,7 @@ xfs_cross_rename(
 	struct xfs_inode	*dp2,
 	struct xfs_name		*name2,
 	struct xfs_inode	*ip2,
-	struct xfs_bmap_free	*free_list,
+	struct xfs_defer_ops	*free_list,
 	xfs_fsblock_t		*first_block,
 	int			spaceres)
 {
@@ -2801,7 +2801,7 @@ xfs_cross_rename(
 	return xfs_finish_rename(tp, free_list);
 
 out_trans_abort:
-	xfs_bmap_cancel(free_list);
+	xfs_defer_cancel(free_list);
 	xfs_trans_cancel(tp);
 	return error;
 }
@@ -2856,7 +2856,7 @@ xfs_rename(
 {
 	struct xfs_mount	*mp = src_dp->i_mount;
 	struct xfs_trans	*tp;
-	struct xfs_bmap_free	free_list;
+	struct xfs_defer_ops	free_list;
 	xfs_fsblock_t		first_block;
 	struct xfs_inode	*wip = NULL;		/* whiteout inode */
 	struct xfs_inode	*inodes[__XFS_SORT_INODES];
@@ -2945,7 +2945,7 @@ xfs_rename(
 		goto out_trans_cancel;
 	}
 
-	xfs_bmap_init(&free_list, &first_block);
+	xfs_defer_init(&free_list, &first_block);
 
 	/* RENAME_EXCHANGE is unique from here on. */
 	if (flags & RENAME_EXCHANGE)
@@ -3131,7 +3131,7 @@ xfs_rename(
 	return error;
 
 out_bmap_cancel:
-	xfs_bmap_cancel(&free_list);
+	xfs_defer_cancel(&free_list);
 out_trans_cancel:
 	xfs_trans_cancel(tp);
 out_release_wip:
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 99d7522..633f2af 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -27,7 +27,7 @@
 struct xfs_dinode;
 struct xfs_inode;
 struct xfs_buf;
-struct xfs_bmap_free;
+struct xfs_defer_ops;
 struct xfs_bmbt_irec;
 struct xfs_inode_log_item;
 struct xfs_mount;
@@ -398,7 +398,7 @@ uint		xfs_ilock_attr_map_shared(struct xfs_inode *);
 
 uint		xfs_ip2xflags(struct xfs_inode *);
 int		xfs_ifree(struct xfs_trans *, xfs_inode_t *,
-			   struct xfs_bmap_free *);
+			   struct xfs_defer_ops *);
 int		xfs_itruncate_extents(struct xfs_trans **, struct xfs_inode *,
 				      int, xfs_fsize_t);
 void		xfs_iext_realloc(xfs_inode_t *, int, int);
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index b090bc1..cb7abe84 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -128,7 +128,7 @@ xfs_iomap_write_direct(
 	int		quota_flag;
 	int		rt;
 	xfs_trans_t	*tp;
-	xfs_bmap_free_t free_list;
+	struct xfs_defer_ops free_list;
 	uint		qblocks, resblks, resrtextents;
 	int		error;
 	int		lockmode;
@@ -231,7 +231,7 @@ xfs_iomap_write_direct(
 	 * From this point onwards we overwrite the imap pointer that the
 	 * caller gave to us.
 	 */
-	xfs_bmap_init(&free_list, &firstfsb);
+	xfs_defer_init(&free_list, &firstfsb);
 	nimaps = 1;
 	error = xfs_bmapi_write(tp, ip, offset_fsb, count_fsb,
 				bmapi_flags, &firstfsb, resblks, imap,
@@ -242,7 +242,7 @@ xfs_iomap_write_direct(
 	/*
 	 * Complete the transaction
 	 */
-	error = xfs_bmap_finish(&tp, &free_list, NULL);
+	error = xfs_defer_finish(&tp, &free_list, NULL);
 	if (error)
 		goto out_bmap_cancel;
 
@@ -266,7 +266,7 @@ out_unlock:
 	return error;
 
 out_bmap_cancel:
-	xfs_bmap_cancel(&free_list);
+	xfs_defer_cancel(&free_list);
 	xfs_trans_unreserve_quota_nblks(tp, ip, (long)qblocks, 0, quota_flag);
 out_trans_cancel:
 	xfs_trans_cancel(tp);
@@ -685,7 +685,7 @@ xfs_iomap_write_allocate(
 	xfs_fileoff_t	offset_fsb, last_block;
 	xfs_fileoff_t	end_fsb, map_start_fsb;
 	xfs_fsblock_t	first_block;
-	xfs_bmap_free_t	free_list;
+	struct xfs_defer_ops	free_list;
 	xfs_filblks_t	count_fsb;
 	xfs_trans_t	*tp;
 	int		nimaps;
@@ -727,7 +727,7 @@ xfs_iomap_write_allocate(
 			xfs_ilock(ip, XFS_ILOCK_EXCL);
 			xfs_trans_ijoin(tp, ip, 0);
 
-			xfs_bmap_init(&free_list, &first_block);
+			xfs_defer_init(&free_list, &first_block);
 
 			/*
 			 * it is possible that the extents have changed since
@@ -787,7 +787,7 @@ xfs_iomap_write_allocate(
 			if (error)
 				goto trans_cancel;
 
-			error = xfs_bmap_finish(&tp, &free_list, NULL);
+			error = xfs_defer_finish(&tp, &free_list, NULL);
 			if (error)
 				goto trans_cancel;
 
@@ -821,7 +821,7 @@ xfs_iomap_write_allocate(
 	}
 
 trans_cancel:
-	xfs_bmap_cancel(&free_list);
+	xfs_defer_cancel(&free_list);
 	xfs_trans_cancel(tp);
 error0:
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
@@ -842,7 +842,7 @@ xfs_iomap_write_unwritten(
 	int		nimaps;
 	xfs_trans_t	*tp;
 	xfs_bmbt_irec_t imap;
-	xfs_bmap_free_t free_list;
+	struct xfs_defer_ops free_list;
 	xfs_fsize_t	i_size;
 	uint		resblks;
 	int		error;
@@ -886,7 +886,7 @@ xfs_iomap_write_unwritten(
 		/*
 		 * Modify the unwritten extent state of the buffer.
 		 */
-		xfs_bmap_init(&free_list, &firstfsb);
+		xfs_defer_init(&free_list, &firstfsb);
 		nimaps = 1;
 		error = xfs_bmapi_write(tp, ip, offset_fsb, count_fsb,
 					XFS_BMAPI_CONVERT, &firstfsb, resblks,
@@ -909,7 +909,7 @@ xfs_iomap_write_unwritten(
 			xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
 		}
 
-		error = xfs_bmap_finish(&tp, &free_list, NULL);
+		error = xfs_defer_finish(&tp, &free_list, NULL);
 		if (error)
 			goto error_on_bmapi_transaction;
 
@@ -936,7 +936,7 @@ xfs_iomap_write_unwritten(
 	return 0;
 
 error_on_bmapi_transaction:
-	xfs_bmap_cancel(&free_list);
+	xfs_defer_cancel(&free_list);
 	xfs_trans_cancel(tp);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	return error;
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index 627f7e6..c761a6a 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -770,7 +770,7 @@ xfs_growfs_rt_alloc(
 	xfs_daddr_t		d;		/* disk block address */
 	int			error;		/* error return value */
 	xfs_fsblock_t		firstblock;/* first block allocated in xaction */
-	struct xfs_bmap_free	flist;		/* list of freed blocks */
+	struct xfs_defer_ops	flist;		/* list of freed blocks */
 	xfs_fsblock_t		fsbno;		/* filesystem block for bno */
 	struct xfs_bmbt_irec	map;		/* block map output */
 	int			nmap;		/* number of block maps */
@@ -795,7 +795,7 @@ xfs_growfs_rt_alloc(
 		xfs_ilock(ip, XFS_ILOCK_EXCL);
 		xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
 
-		xfs_bmap_init(&flist, &firstblock);
+		xfs_defer_init(&flist, &firstblock);
 		/*
 		 * Allocate blocks to the bitmap file.
 		 */
@@ -810,7 +810,7 @@ xfs_growfs_rt_alloc(
 		/*
 		 * Free any blocks freed up in the transaction, then commit.
 		 */
-		error = xfs_bmap_finish(&tp, &flist, NULL);
+		error = xfs_defer_finish(&tp, &flist, NULL);
 		if (error)
 			goto out_bmap_cancel;
 		error = xfs_trans_commit(tp);
@@ -863,7 +863,7 @@ xfs_growfs_rt_alloc(
 	return 0;
 
 out_bmap_cancel:
-	xfs_bmap_cancel(&flist);
+	xfs_defer_cancel(&flist);
 out_trans_cancel:
 	xfs_trans_cancel(tp);
 	return error;
diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c
index 20af47b..3b005ec 100644
--- a/fs/xfs/xfs_symlink.c
+++ b/fs/xfs/xfs_symlink.c
@@ -173,7 +173,7 @@ xfs_symlink(
 	struct xfs_inode	*ip = NULL;
 	int			error = 0;
 	int			pathlen;
-	struct xfs_bmap_free	free_list;
+	struct xfs_defer_ops	free_list;
 	xfs_fsblock_t		first_block;
 	bool                    unlock_dp_on_error = false;
 	xfs_fileoff_t		first_fsb;
@@ -270,7 +270,7 @@ xfs_symlink(
 	 * Initialize the bmap freelist prior to calling either
 	 * bmapi or the directory create code.
 	 */
-	xfs_bmap_init(&free_list, &first_block);
+	xfs_defer_init(&free_list, &first_block);
 
 	/*
 	 * Allocate an inode for the symlink.
@@ -377,7 +377,7 @@ xfs_symlink(
 		xfs_trans_set_sync(tp);
 	}
 
-	error = xfs_bmap_finish(&tp, &free_list, NULL);
+	error = xfs_defer_finish(&tp, &free_list, NULL);
 	if (error)
 		goto out_bmap_cancel;
 
@@ -393,7 +393,7 @@ xfs_symlink(
 	return 0;
 
 out_bmap_cancel:
-	xfs_bmap_cancel(&free_list);
+	xfs_defer_cancel(&free_list);
 out_trans_cancel:
 	xfs_trans_cancel(tp);
 out_release_inode:
@@ -427,7 +427,7 @@ xfs_inactive_symlink_rmt(
 	int		done;
 	int		error;
 	xfs_fsblock_t	first_block;
-	xfs_bmap_free_t	free_list;
+	struct xfs_defer_ops	free_list;
 	int		i;
 	xfs_mount_t	*mp;
 	xfs_bmbt_irec_t	mval[XFS_SYMLINK_MAPS];
@@ -466,7 +466,7 @@ xfs_inactive_symlink_rmt(
 	 * Find the block(s) so we can inval and unmap them.
 	 */
 	done = 0;
-	xfs_bmap_init(&free_list, &first_block);
+	xfs_defer_init(&free_list, &first_block);
 	nmaps = ARRAY_SIZE(mval);
 	error = xfs_bmapi_read(ip, 0, xfs_symlink_blocks(mp, size),
 				mval, &nmaps, 0);
@@ -496,7 +496,7 @@ xfs_inactive_symlink_rmt(
 	/*
 	 * Commit the first transaction.  This logs the EFI and the inode.
 	 */
-	error = xfs_bmap_finish(&tp, &free_list, ip);
+	error = xfs_defer_finish(&tp, &free_list, ip);
 	if (error)
 		goto error_bmap_cancel;
 	/*
@@ -526,7 +526,7 @@ xfs_inactive_symlink_rmt(
 	return 0;
 
 error_bmap_cancel:
-	xfs_bmap_cancel(&free_list);
+	xfs_defer_cancel(&free_list);
 error_trans_cancel:
 	xfs_trans_cancel(tp);
 error_unlock:


^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 021/119] xfs: rename flist/free_list to dfops
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (19 preceding siblings ...)
  2016-06-17  1:20 ` [PATCH 020/119] xfs: change xfs_bmap_{finish, cancel, init, free} -> xfs_defer_* Darrick J. Wong
@ 2016-06-17  1:20 ` Darrick J. Wong
  2016-06-17  1:20 ` [PATCH 022/119] xfs: add tracepoints and error injection for deferred extent freeing Darrick J. Wong
                   ` (97 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:20 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Mechanical change of flist/free_list to dfops, since they're now
deferred ops, not just a freeing list.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_attr.c        |   62 ++++++++++-----------
 fs/xfs/libxfs/xfs_attr_leaf.c   |    4 +
 fs/xfs/libxfs/xfs_attr_remote.c |   18 +++---
 fs/xfs/libxfs/xfs_bmap.c        |  116 ++++++++++++++++++++-------------------
 fs/xfs/libxfs/xfs_bmap.h        |   10 ++-
 fs/xfs/libxfs/xfs_bmap_btree.c  |   14 ++---
 fs/xfs/libxfs/xfs_btree.h       |    4 +
 fs/xfs/libxfs/xfs_da_btree.c    |    6 +-
 fs/xfs/libxfs/xfs_da_btree.h    |    2 -
 fs/xfs/libxfs/xfs_dir2.c        |   14 ++---
 fs/xfs/libxfs/xfs_dir2.h        |    6 +-
 fs/xfs/libxfs/xfs_ialloc.c      |   14 ++---
 fs/xfs/libxfs/xfs_ialloc.h      |    2 -
 fs/xfs/xfs_bmap_util.c          |   40 +++++++------
 fs/xfs/xfs_dquot.c              |   10 ++-
 fs/xfs/xfs_filestream.c         |    2 -
 fs/xfs/xfs_inode.c              |   94 ++++++++++++++++----------------
 fs/xfs/xfs_iomap.c              |   30 +++++-----
 fs/xfs/xfs_rtalloc.c            |   10 ++-
 fs/xfs/xfs_symlink.c            |   24 ++++----
 20 files changed, 241 insertions(+), 241 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
index 66baf97..af1ecb1 100644
--- a/fs/xfs/libxfs/xfs_attr.c
+++ b/fs/xfs/libxfs/xfs_attr.c
@@ -204,7 +204,7 @@ xfs_attr_set(
 {
 	struct xfs_mount	*mp = dp->i_mount;
 	struct xfs_da_args	args;
-	struct xfs_defer_ops	flist;
+	struct xfs_defer_ops	dfops;
 	struct xfs_trans_res	tres;
 	xfs_fsblock_t		firstblock;
 	int			rsvd = (flags & ATTR_ROOT) != 0;
@@ -222,7 +222,7 @@ xfs_attr_set(
 	args.value = value;
 	args.valuelen = valuelen;
 	args.firstblock = &firstblock;
-	args.flist = &flist;
+	args.dfops = &dfops;
 	args.op_flags = XFS_DA_OP_ADDNAME | XFS_DA_OP_OKNOENT;
 	args.total = xfs_attr_calc_size(&args, &local);
 
@@ -317,13 +317,13 @@ xfs_attr_set(
 		 * It won't fit in the shortform, transform to a leaf block.
 		 * GROT: another possible req'mt for a double-split btree op.
 		 */
-		xfs_defer_init(args.flist, args.firstblock);
+		xfs_defer_init(args.dfops, args.firstblock);
 		error = xfs_attr_shortform_to_leaf(&args);
 		if (!error)
-			error = xfs_defer_finish(&args.trans, args.flist, dp);
+			error = xfs_defer_finish(&args.trans, args.dfops, dp);
 		if (error) {
 			args.trans = NULL;
-			xfs_defer_cancel(&flist);
+			xfs_defer_cancel(&dfops);
 			goto out;
 		}
 
@@ -383,7 +383,7 @@ xfs_attr_remove(
 {
 	struct xfs_mount	*mp = dp->i_mount;
 	struct xfs_da_args	args;
-	struct xfs_defer_ops	flist;
+	struct xfs_defer_ops	dfops;
 	xfs_fsblock_t		firstblock;
 	int			error;
 
@@ -400,7 +400,7 @@ xfs_attr_remove(
 		return error;
 
 	args.firstblock = &firstblock;
-	args.flist = &flist;
+	args.dfops = &dfops;
 
 	/*
 	 * we have no control over the attribute names that userspace passes us
@@ -585,13 +585,13 @@ xfs_attr_leaf_addname(xfs_da_args_t *args)
 		 * Commit that transaction so that the node_addname() call
 		 * can manage its own transactions.
 		 */
-		xfs_defer_init(args->flist, args->firstblock);
+		xfs_defer_init(args->dfops, args->firstblock);
 		error = xfs_attr3_leaf_to_node(args);
 		if (!error)
-			error = xfs_defer_finish(&args->trans, args->flist, dp);
+			error = xfs_defer_finish(&args->trans, args->dfops, dp);
 		if (error) {
 			args->trans = NULL;
-			xfs_defer_cancel(args->flist);
+			xfs_defer_cancel(args->dfops);
 			return error;
 		}
 
@@ -675,15 +675,15 @@ xfs_attr_leaf_addname(xfs_da_args_t *args)
 		 * If the result is small enough, shrink it all into the inode.
 		 */
 		if ((forkoff = xfs_attr_shortform_allfit(bp, dp))) {
-			xfs_defer_init(args->flist, args->firstblock);
+			xfs_defer_init(args->dfops, args->firstblock);
 			error = xfs_attr3_leaf_to_shortform(bp, args, forkoff);
 			/* bp is gone due to xfs_da_shrink_inode */
 			if (!error)
 				error = xfs_defer_finish(&args->trans,
-							args->flist, dp);
+							args->dfops, dp);
 			if (error) {
 				args->trans = NULL;
-				xfs_defer_cancel(args->flist);
+				xfs_defer_cancel(args->dfops);
 				return error;
 			}
 		}
@@ -738,14 +738,14 @@ xfs_attr_leaf_removename(xfs_da_args_t *args)
 	 * If the result is small enough, shrink it all into the inode.
 	 */
 	if ((forkoff = xfs_attr_shortform_allfit(bp, dp))) {
-		xfs_defer_init(args->flist, args->firstblock);
+		xfs_defer_init(args->dfops, args->firstblock);
 		error = xfs_attr3_leaf_to_shortform(bp, args, forkoff);
 		/* bp is gone due to xfs_da_shrink_inode */
 		if (!error)
-			error = xfs_defer_finish(&args->trans, args->flist, dp);
+			error = xfs_defer_finish(&args->trans, args->dfops, dp);
 		if (error) {
 			args->trans = NULL;
-			xfs_defer_cancel(args->flist);
+			xfs_defer_cancel(args->dfops);
 			return error;
 		}
 	}
@@ -864,14 +864,14 @@ restart:
 			 */
 			xfs_da_state_free(state);
 			state = NULL;
-			xfs_defer_init(args->flist, args->firstblock);
+			xfs_defer_init(args->dfops, args->firstblock);
 			error = xfs_attr3_leaf_to_node(args);
 			if (!error)
 				error = xfs_defer_finish(&args->trans,
-							args->flist, dp);
+							args->dfops, dp);
 			if (error) {
 				args->trans = NULL;
-				xfs_defer_cancel(args->flist);
+				xfs_defer_cancel(args->dfops);
 				goto out;
 			}
 
@@ -892,13 +892,13 @@ restart:
 		 * in the index/blkno/rmtblkno/rmtblkcnt fields and
 		 * in the index2/blkno2/rmtblkno2/rmtblkcnt2 fields.
 		 */
-		xfs_defer_init(args->flist, args->firstblock);
+		xfs_defer_init(args->dfops, args->firstblock);
 		error = xfs_da3_split(state);
 		if (!error)
-			error = xfs_defer_finish(&args->trans, args->flist, dp);
+			error = xfs_defer_finish(&args->trans, args->dfops, dp);
 		if (error) {
 			args->trans = NULL;
-			xfs_defer_cancel(args->flist);
+			xfs_defer_cancel(args->dfops);
 			goto out;
 		}
 	} else {
@@ -991,14 +991,14 @@ restart:
 		 * Check to see if the tree needs to be collapsed.
 		 */
 		if (retval && (state->path.active > 1)) {
-			xfs_defer_init(args->flist, args->firstblock);
+			xfs_defer_init(args->dfops, args->firstblock);
 			error = xfs_da3_join(state);
 			if (!error)
 				error = xfs_defer_finish(&args->trans,
-							args->flist, dp);
+							args->dfops, dp);
 			if (error) {
 				args->trans = NULL;
-				xfs_defer_cancel(args->flist);
+				xfs_defer_cancel(args->dfops);
 				goto out;
 			}
 		}
@@ -1114,13 +1114,13 @@ xfs_attr_node_removename(xfs_da_args_t *args)
 	 * Check to see if the tree needs to be collapsed.
 	 */
 	if (retval && (state->path.active > 1)) {
-		xfs_defer_init(args->flist, args->firstblock);
+		xfs_defer_init(args->dfops, args->firstblock);
 		error = xfs_da3_join(state);
 		if (!error)
-			error = xfs_defer_finish(&args->trans, args->flist, dp);
+			error = xfs_defer_finish(&args->trans, args->dfops, dp);
 		if (error) {
 			args->trans = NULL;
-			xfs_defer_cancel(args->flist);
+			xfs_defer_cancel(args->dfops);
 			goto out;
 		}
 		/*
@@ -1147,15 +1147,15 @@ xfs_attr_node_removename(xfs_da_args_t *args)
 			goto out;
 
 		if ((forkoff = xfs_attr_shortform_allfit(bp, dp))) {
-			xfs_defer_init(args->flist, args->firstblock);
+			xfs_defer_init(args->dfops, args->firstblock);
 			error = xfs_attr3_leaf_to_shortform(bp, args, forkoff);
 			/* bp is gone due to xfs_da_shrink_inode */
 			if (!error)
 				error = xfs_defer_finish(&args->trans,
-							args->flist, dp);
+							args->dfops, dp);
 			if (error) {
 				args->trans = NULL;
-				xfs_defer_cancel(args->flist);
+				xfs_defer_cancel(args->dfops);
 				goto out;
 			}
 		} else
diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c
index 01a5ecf..8ea91f3 100644
--- a/fs/xfs/libxfs/xfs_attr_leaf.c
+++ b/fs/xfs/libxfs/xfs_attr_leaf.c
@@ -792,7 +792,7 @@ xfs_attr_shortform_to_leaf(xfs_da_args_t *args)
 	nargs.dp = dp;
 	nargs.geo = args->geo;
 	nargs.firstblock = args->firstblock;
-	nargs.flist = args->flist;
+	nargs.dfops = args->dfops;
 	nargs.total = args->total;
 	nargs.whichfork = XFS_ATTR_FORK;
 	nargs.trans = args->trans;
@@ -922,7 +922,7 @@ xfs_attr3_leaf_to_shortform(
 	nargs.geo = args->geo;
 	nargs.dp = dp;
 	nargs.firstblock = args->firstblock;
-	nargs.flist = args->flist;
+	nargs.dfops = args->dfops;
 	nargs.total = args->total;
 	nargs.whichfork = XFS_ATTR_FORK;
 	nargs.trans = args->trans;
diff --git a/fs/xfs/libxfs/xfs_attr_remote.c b/fs/xfs/libxfs/xfs_attr_remote.c
index aabb516..d52f525 100644
--- a/fs/xfs/libxfs/xfs_attr_remote.c
+++ b/fs/xfs/libxfs/xfs_attr_remote.c
@@ -461,16 +461,16 @@ xfs_attr_rmtval_set(
 		 * extent and then crash then the block may not contain the
 		 * correct metadata after log recovery occurs.
 		 */
-		xfs_defer_init(args->flist, args->firstblock);
+		xfs_defer_init(args->dfops, args->firstblock);
 		nmap = 1;
 		error = xfs_bmapi_write(args->trans, dp, (xfs_fileoff_t)lblkno,
 				  blkcnt, XFS_BMAPI_ATTRFORK, args->firstblock,
-				  args->total, &map, &nmap, args->flist);
+				  args->total, &map, &nmap, args->dfops);
 		if (!error)
-			error = xfs_defer_finish(&args->trans, args->flist, dp);
+			error = xfs_defer_finish(&args->trans, args->dfops, dp);
 		if (error) {
 			args->trans = NULL;
-			xfs_defer_cancel(args->flist);
+			xfs_defer_cancel(args->dfops);
 			return error;
 		}
 
@@ -504,7 +504,7 @@ xfs_attr_rmtval_set(
 
 		ASSERT(blkcnt > 0);
 
-		xfs_defer_init(args->flist, args->firstblock);
+		xfs_defer_init(args->dfops, args->firstblock);
 		nmap = 1;
 		error = xfs_bmapi_read(dp, (xfs_fileoff_t)lblkno,
 				       blkcnt, &map, &nmap,
@@ -604,16 +604,16 @@ xfs_attr_rmtval_remove(
 	blkcnt = args->rmtblkcnt;
 	done = 0;
 	while (!done) {
-		xfs_defer_init(args->flist, args->firstblock);
+		xfs_defer_init(args->dfops, args->firstblock);
 		error = xfs_bunmapi(args->trans, args->dp, lblkno, blkcnt,
 				    XFS_BMAPI_ATTRFORK, 1, args->firstblock,
-				    args->flist, &done);
+				    args->dfops, &done);
 		if (!error)
-			error = xfs_defer_finish(&args->trans, args->flist,
+			error = xfs_defer_finish(&args->trans, args->dfops,
 						args->dp);
 		if (error) {
 			args->trans = NULL;
-			xfs_defer_cancel(args->flist);
+			xfs_defer_cancel(args->dfops);
 			return error;
 		}
 
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 45ce7bd..85061a0 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -572,7 +572,7 @@ xfs_bmap_validate_ret(
 void
 xfs_bmap_add_free(
 	struct xfs_mount	*mp,		/* mount point structure */
-	struct xfs_defer_ops	*flist,		/* list of extents */
+	struct xfs_defer_ops	*dfops,		/* list of extents */
 	xfs_fsblock_t		bno,		/* fs block number of extent */
 	xfs_filblks_t		len)		/* length of extent */
 {
@@ -596,7 +596,7 @@ xfs_bmap_add_free(
 	new = kmem_zone_alloc(xfs_bmap_free_item_zone, KM_SLEEP);
 	new->xbfi_startblock = bno;
 	new->xbfi_blockcount = (xfs_extlen_t)len;
-	xfs_defer_add(flist, XFS_DEFER_OPS_TYPE_FREE, &new->xbfi_list);
+	xfs_defer_add(dfops, XFS_DEFER_OPS_TYPE_FREE, &new->xbfi_list);
 }
 
 /*
@@ -649,7 +649,7 @@ xfs_bmap_btree_to_extents(
 	cblock = XFS_BUF_TO_BLOCK(cbp);
 	if ((error = xfs_btree_check_block(cur, cblock, 0, cbp)))
 		return error;
-	xfs_bmap_add_free(mp, cur->bc_private.b.flist, cbno, 1);
+	xfs_bmap_add_free(mp, cur->bc_private.b.dfops, cbno, 1);
 	ip->i_d.di_nblocks--;
 	xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT, -1L);
 	xfs_trans_binval(tp, cbp);
@@ -672,7 +672,7 @@ xfs_bmap_extents_to_btree(
 	xfs_trans_t		*tp,		/* transaction pointer */
 	xfs_inode_t		*ip,		/* incore inode pointer */
 	xfs_fsblock_t		*firstblock,	/* first-block-allocated */
-	struct xfs_defer_ops	*flist,		/* blocks freed in xaction */
+	struct xfs_defer_ops	*dfops,		/* blocks freed in xaction */
 	xfs_btree_cur_t		**curp,		/* cursor returned to caller */
 	int			wasdel,		/* converting a delayed alloc */
 	int			*logflagsp,	/* inode logging flags */
@@ -721,7 +721,7 @@ xfs_bmap_extents_to_btree(
 	 */
 	cur = xfs_bmbt_init_cursor(mp, tp, ip, whichfork);
 	cur->bc_private.b.firstblock = *firstblock;
-	cur->bc_private.b.flist = flist;
+	cur->bc_private.b.dfops = dfops;
 	cur->bc_private.b.flags = wasdel ? XFS_BTCUR_BPRV_WASDEL : 0;
 	/*
 	 * Convert to a btree with two levels, one record in root.
@@ -734,7 +734,7 @@ xfs_bmap_extents_to_btree(
 	if (*firstblock == NULLFSBLOCK) {
 		args.type = XFS_ALLOCTYPE_START_BNO;
 		args.fsbno = XFS_INO_TO_FSB(mp, ip->i_ino);
-	} else if (flist->dop_low) {
+	} else if (dfops->dop_low) {
 		args.type = XFS_ALLOCTYPE_START_BNO;
 		args.fsbno = *firstblock;
 	} else {
@@ -755,7 +755,7 @@ xfs_bmap_extents_to_btree(
 	ASSERT(args.fsbno != NULLFSBLOCK);
 	ASSERT(*firstblock == NULLFSBLOCK ||
 	       args.agno == XFS_FSB_TO_AGNO(mp, *firstblock) ||
-	       (flist->dop_low &&
+	       (dfops->dop_low &&
 		args.agno > XFS_FSB_TO_AGNO(mp, *firstblock)));
 	*firstblock = cur->bc_private.b.firstblock = args.fsbno;
 	cur->bc_private.b.allocated++;
@@ -940,7 +940,7 @@ xfs_bmap_add_attrfork_btree(
 	xfs_trans_t		*tp,		/* transaction pointer */
 	xfs_inode_t		*ip,		/* incore inode pointer */
 	xfs_fsblock_t		*firstblock,	/* first block allocated */
-	struct xfs_defer_ops	*flist,		/* blocks to free at commit */
+	struct xfs_defer_ops	*dfops,		/* blocks to free at commit */
 	int			*flags)		/* inode logging flags */
 {
 	xfs_btree_cur_t		*cur;		/* btree cursor */
@@ -953,7 +953,7 @@ xfs_bmap_add_attrfork_btree(
 		*flags |= XFS_ILOG_DBROOT;
 	else {
 		cur = xfs_bmbt_init_cursor(mp, tp, ip, XFS_DATA_FORK);
-		cur->bc_private.b.flist = flist;
+		cur->bc_private.b.dfops = dfops;
 		cur->bc_private.b.firstblock = *firstblock;
 		if ((error = xfs_bmbt_lookup_ge(cur, 0, 0, 0, &stat)))
 			goto error0;
@@ -983,7 +983,7 @@ xfs_bmap_add_attrfork_extents(
 	xfs_trans_t		*tp,		/* transaction pointer */
 	xfs_inode_t		*ip,		/* incore inode pointer */
 	xfs_fsblock_t		*firstblock,	/* first block allocated */
-	struct xfs_defer_ops	*flist,		/* blocks to free at commit */
+	struct xfs_defer_ops	*dfops,		/* blocks to free at commit */
 	int			*flags)		/* inode logging flags */
 {
 	xfs_btree_cur_t		*cur;		/* bmap btree cursor */
@@ -992,7 +992,7 @@ xfs_bmap_add_attrfork_extents(
 	if (ip->i_d.di_nextents * sizeof(xfs_bmbt_rec_t) <= XFS_IFORK_DSIZE(ip))
 		return 0;
 	cur = NULL;
-	error = xfs_bmap_extents_to_btree(tp, ip, firstblock, flist, &cur, 0,
+	error = xfs_bmap_extents_to_btree(tp, ip, firstblock, dfops, &cur, 0,
 		flags, XFS_DATA_FORK);
 	if (cur) {
 		cur->bc_private.b.allocated = 0;
@@ -1018,7 +1018,7 @@ xfs_bmap_add_attrfork_local(
 	xfs_trans_t		*tp,		/* transaction pointer */
 	xfs_inode_t		*ip,		/* incore inode pointer */
 	xfs_fsblock_t		*firstblock,	/* first block allocated */
-	struct xfs_defer_ops	*flist,		/* blocks to free at commit */
+	struct xfs_defer_ops	*dfops,		/* blocks to free at commit */
 	int			*flags)		/* inode logging flags */
 {
 	xfs_da_args_t		dargs;		/* args for dir/attr code */
@@ -1031,7 +1031,7 @@ xfs_bmap_add_attrfork_local(
 		dargs.geo = ip->i_mount->m_dir_geo;
 		dargs.dp = ip;
 		dargs.firstblock = firstblock;
-		dargs.flist = flist;
+		dargs.dfops = dfops;
 		dargs.total = dargs.geo->fsbcount;
 		dargs.whichfork = XFS_DATA_FORK;
 		dargs.trans = tp;
@@ -1059,7 +1059,7 @@ xfs_bmap_add_attrfork(
 	int			rsvd)		/* xact may use reserved blks */
 {
 	xfs_fsblock_t		firstblock;	/* 1st block/ag allocated */
-	struct xfs_defer_ops	flist;		/* freed extent records */
+	struct xfs_defer_ops	dfops;		/* freed extent records */
 	xfs_mount_t		*mp;		/* mount structure */
 	xfs_trans_t		*tp;		/* transaction pointer */
 	int			blks;		/* space reservation */
@@ -1125,18 +1125,18 @@ xfs_bmap_add_attrfork(
 	ip->i_afp = kmem_zone_zalloc(xfs_ifork_zone, KM_SLEEP);
 	ip->i_afp->if_flags = XFS_IFEXTENTS;
 	logflags = 0;
-	xfs_defer_init(&flist, &firstblock);
+	xfs_defer_init(&dfops, &firstblock);
 	switch (ip->i_d.di_format) {
 	case XFS_DINODE_FMT_LOCAL:
-		error = xfs_bmap_add_attrfork_local(tp, ip, &firstblock, &flist,
+		error = xfs_bmap_add_attrfork_local(tp, ip, &firstblock, &dfops,
 			&logflags);
 		break;
 	case XFS_DINODE_FMT_EXTENTS:
 		error = xfs_bmap_add_attrfork_extents(tp, ip, &firstblock,
-			&flist, &logflags);
+			&dfops, &logflags);
 		break;
 	case XFS_DINODE_FMT_BTREE:
-		error = xfs_bmap_add_attrfork_btree(tp, ip, &firstblock, &flist,
+		error = xfs_bmap_add_attrfork_btree(tp, ip, &firstblock, &dfops,
 			&logflags);
 		break;
 	default:
@@ -1165,7 +1165,7 @@ xfs_bmap_add_attrfork(
 			xfs_log_sb(tp);
 	}
 
-	error = xfs_defer_finish(&tp, &flist, NULL);
+	error = xfs_defer_finish(&tp, &dfops, NULL);
 	if (error)
 		goto bmap_cancel;
 	error = xfs_trans_commit(tp);
@@ -1173,7 +1173,7 @@ xfs_bmap_add_attrfork(
 	return error;
 
 bmap_cancel:
-	xfs_defer_cancel(&flist);
+	xfs_defer_cancel(&dfops);
 trans_cancel:
 	xfs_trans_cancel(tp);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
@@ -1970,7 +1970,7 @@ xfs_bmap_add_extent_delay_real(
 
 		if (xfs_bmap_needs_btree(bma->ip, whichfork)) {
 			error = xfs_bmap_extents_to_btree(bma->tp, bma->ip,
-					bma->firstblock, bma->flist,
+					bma->firstblock, bma->dfops,
 					&bma->cur, 1, &tmp_rval, whichfork);
 			rval |= tmp_rval;
 			if (error)
@@ -2054,7 +2054,7 @@ xfs_bmap_add_extent_delay_real(
 
 		if (xfs_bmap_needs_btree(bma->ip, whichfork)) {
 			error = xfs_bmap_extents_to_btree(bma->tp, bma->ip,
-				bma->firstblock, bma->flist, &bma->cur, 1,
+				bma->firstblock, bma->dfops, &bma->cur, 1,
 				&tmp_rval, whichfork);
 			rval |= tmp_rval;
 			if (error)
@@ -2123,7 +2123,7 @@ xfs_bmap_add_extent_delay_real(
 
 		if (xfs_bmap_needs_btree(bma->ip, whichfork)) {
 			error = xfs_bmap_extents_to_btree(bma->tp, bma->ip,
-					bma->firstblock, bma->flist, &bma->cur,
+					bma->firstblock, bma->dfops, &bma->cur,
 					1, &tmp_rval, whichfork);
 			rval |= tmp_rval;
 			if (error)
@@ -2172,7 +2172,7 @@ xfs_bmap_add_extent_delay_real(
 
 		ASSERT(bma->cur == NULL);
 		error = xfs_bmap_extents_to_btree(bma->tp, bma->ip,
-				bma->firstblock, bma->flist, &bma->cur,
+				bma->firstblock, bma->dfops, &bma->cur,
 				da_old > 0, &tmp_logflags, whichfork);
 		bma->logflags |= tmp_logflags;
 		if (error)
@@ -2214,7 +2214,7 @@ xfs_bmap_add_extent_unwritten_real(
 	xfs_btree_cur_t		**curp,	/* if *curp is null, not a btree */
 	xfs_bmbt_irec_t		*new,	/* new data to add to file extents */
 	xfs_fsblock_t		*first,	/* pointer to firstblock variable */
-	struct xfs_defer_ops	*flist,	/* list of extents to be freed */
+	struct xfs_defer_ops	*dfops,	/* list of extents to be freed */
 	int			*logflagsp) /* inode logging flags */
 {
 	xfs_btree_cur_t		*cur;	/* btree cursor */
@@ -2707,7 +2707,7 @@ xfs_bmap_add_extent_unwritten_real(
 		int	tmp_logflags;	/* partial log flag return val */
 
 		ASSERT(cur == NULL);
-		error = xfs_bmap_extents_to_btree(tp, ip, first, flist, &cur,
+		error = xfs_bmap_extents_to_btree(tp, ip, first, dfops, &cur,
 				0, &tmp_logflags, XFS_DATA_FORK);
 		*logflagsp |= tmp_logflags;
 		if (error)
@@ -3100,7 +3100,7 @@ xfs_bmap_add_extent_hole_real(
 
 		ASSERT(bma->cur == NULL);
 		error = xfs_bmap_extents_to_btree(bma->tp, bma->ip,
-				bma->firstblock, bma->flist, &bma->cur,
+				bma->firstblock, bma->dfops, &bma->cur,
 				0, &tmp_logflags, whichfork);
 		bma->logflags |= tmp_logflags;
 		if (error)
@@ -3675,7 +3675,7 @@ xfs_bmap_btalloc(
 			error = xfs_bmap_btalloc_nullfb(ap, &args, &blen);
 		if (error)
 			return error;
-	} else if (ap->flist->dop_low) {
+	} else if (ap->dfops->dop_low) {
 		if (xfs_inode_is_filestream(ap->ip))
 			args.type = XFS_ALLOCTYPE_FIRST_AG;
 		else
@@ -3708,7 +3708,7 @@ xfs_bmap_btalloc(
 	 * is >= the stripe unit and the allocation offset is
 	 * at the end of file.
 	 */
-	if (!ap->flist->dop_low && ap->aeof) {
+	if (!ap->dfops->dop_low && ap->aeof) {
 		if (!ap->offset) {
 			args.alignment = stripe_align;
 			atype = args.type;
@@ -3801,7 +3801,7 @@ xfs_bmap_btalloc(
 		args.minleft = 0;
 		if ((error = xfs_alloc_vextent(&args)))
 			return error;
-		ap->flist->dop_low = true;
+		ap->dfops->dop_low = true;
 	}
 	if (args.fsbno != NULLFSBLOCK) {
 		/*
@@ -3811,7 +3811,7 @@ xfs_bmap_btalloc(
 		ASSERT(*ap->firstblock == NULLFSBLOCK ||
 		       XFS_FSB_TO_AGNO(mp, *ap->firstblock) ==
 		       XFS_FSB_TO_AGNO(mp, args.fsbno) ||
-		       (ap->flist->dop_low &&
+		       (ap->dfops->dop_low &&
 			XFS_FSB_TO_AGNO(mp, *ap->firstblock) <
 			XFS_FSB_TO_AGNO(mp, args.fsbno)));
 
@@ -3819,7 +3819,7 @@ xfs_bmap_btalloc(
 		if (*ap->firstblock == NULLFSBLOCK)
 			*ap->firstblock = args.fsbno;
 		ASSERT(nullfb || fb_agno == args.agno ||
-		       (ap->flist->dop_low && fb_agno < args.agno));
+		       (ap->dfops->dop_low && fb_agno < args.agno));
 		ap->length = args.len;
 		ap->ip->i_d.di_nblocks += args.len;
 		xfs_trans_log_inode(ap->tp, ap->ip, XFS_ILOG_CORE);
@@ -4286,7 +4286,7 @@ xfs_bmapi_allocate(
 	if (error)
 		return error;
 
-	if (bma->flist->dop_low)
+	if (bma->dfops->dop_low)
 		bma->minleft = 0;
 	if (bma->cur)
 		bma->cur->bc_private.b.firstblock = *bma->firstblock;
@@ -4295,7 +4295,7 @@ xfs_bmapi_allocate(
 	if ((ifp->if_flags & XFS_IFBROOT) && !bma->cur) {
 		bma->cur = xfs_bmbt_init_cursor(mp, bma->tp, bma->ip, whichfork);
 		bma->cur->bc_private.b.firstblock = *bma->firstblock;
-		bma->cur->bc_private.b.flist = bma->flist;
+		bma->cur->bc_private.b.dfops = bma->dfops;
 	}
 	/*
 	 * Bump the number of extents we've allocated
@@ -4376,7 +4376,7 @@ xfs_bmapi_convert_unwritten(
 		bma->cur = xfs_bmbt_init_cursor(bma->ip->i_mount, bma->tp,
 					bma->ip, whichfork);
 		bma->cur->bc_private.b.firstblock = *bma->firstblock;
-		bma->cur->bc_private.b.flist = bma->flist;
+		bma->cur->bc_private.b.dfops = bma->dfops;
 	}
 	mval->br_state = (mval->br_state == XFS_EXT_UNWRITTEN)
 				? XFS_EXT_NORM : XFS_EXT_UNWRITTEN;
@@ -4393,7 +4393,7 @@ xfs_bmapi_convert_unwritten(
 	}
 
 	error = xfs_bmap_add_extent_unwritten_real(bma->tp, bma->ip, &bma->idx,
-			&bma->cur, mval, bma->firstblock, bma->flist,
+			&bma->cur, mval, bma->firstblock, bma->dfops,
 			&tmp_logflags);
 	/*
 	 * Log the inode core unconditionally in the unwritten extent conversion
@@ -4447,7 +4447,7 @@ xfs_bmapi_write(
 	xfs_extlen_t		total,		/* total blocks needed */
 	struct xfs_bmbt_irec	*mval,		/* output: map values */
 	int			*nmap,		/* i/o: mval size/count */
-	struct xfs_defer_ops	*flist)		/* i/o: list extents to free */
+	struct xfs_defer_ops	*dfops)		/* i/o: list extents to free */
 {
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_ifork	*ifp;
@@ -4537,7 +4537,7 @@ xfs_bmapi_write(
 	bma.ip = ip;
 	bma.total = total;
 	bma.userdata = 0;
-	bma.flist = flist;
+	bma.dfops = dfops;
 	bma.firstblock = firstblock;
 
 	while (bno < end && n < *nmap) {
@@ -4651,7 +4651,7 @@ error0:
 			       XFS_FSB_TO_AGNO(mp, *firstblock) ==
 			       XFS_FSB_TO_AGNO(mp,
 				       bma.cur->bc_private.b.firstblock) ||
-			       (flist->dop_low &&
+			       (dfops->dop_low &&
 				XFS_FSB_TO_AGNO(mp, *firstblock) <
 				XFS_FSB_TO_AGNO(mp,
 					bma.cur->bc_private.b.firstblock)));
@@ -4735,7 +4735,7 @@ xfs_bmap_del_extent(
 	xfs_inode_t		*ip,	/* incore inode pointer */
 	xfs_trans_t		*tp,	/* current transaction pointer */
 	xfs_extnum_t		*idx,	/* extent number to update/delete */
-	struct xfs_defer_ops	*flist,	/* list of extents to be freed */
+	struct xfs_defer_ops	*dfops,	/* list of extents to be freed */
 	xfs_btree_cur_t		*cur,	/* if null, not a btree */
 	xfs_bmbt_irec_t		*del,	/* data to remove from extents */
 	int			*logflagsp, /* inode logging flags */
@@ -5023,7 +5023,7 @@ xfs_bmap_del_extent(
 	 * If we need to, add to list of extents to delete.
 	 */
 	if (do_fx)
-		xfs_bmap_add_free(mp, flist, del->br_startblock,
+		xfs_bmap_add_free(mp, dfops, del->br_startblock,
 			del->br_blockcount);
 	/*
 	 * Adjust inode # blocks in the file.
@@ -5064,7 +5064,7 @@ xfs_bunmapi(
 	xfs_extnum_t		nexts,		/* number of extents max */
 	xfs_fsblock_t		*firstblock,	/* first allocated block
 						   controls a.g. for allocs */
-	struct xfs_defer_ops	*flist,		/* i/o: list extents to free */
+	struct xfs_defer_ops	*dfops,		/* i/o: list extents to free */
 	int			*done)		/* set if not done yet */
 {
 	xfs_btree_cur_t		*cur;		/* bmap btree cursor */
@@ -5137,7 +5137,7 @@ xfs_bunmapi(
 		ASSERT(XFS_IFORK_FORMAT(ip, whichfork) == XFS_DINODE_FMT_BTREE);
 		cur = xfs_bmbt_init_cursor(mp, tp, ip, whichfork);
 		cur->bc_private.b.firstblock = *firstblock;
-		cur->bc_private.b.flist = flist;
+		cur->bc_private.b.dfops = dfops;
 		cur->bc_private.b.flags = 0;
 	} else
 		cur = NULL;
@@ -5229,7 +5229,7 @@ xfs_bunmapi(
 			}
 			del.br_state = XFS_EXT_UNWRITTEN;
 			error = xfs_bmap_add_extent_unwritten_real(tp, ip,
-					&lastx, &cur, &del, firstblock, flist,
+					&lastx, &cur, &del, firstblock, dfops,
 					&logflags);
 			if (error)
 				goto error0;
@@ -5288,7 +5288,7 @@ xfs_bunmapi(
 				lastx--;
 				error = xfs_bmap_add_extent_unwritten_real(tp,
 						ip, &lastx, &cur, &prev,
-						firstblock, flist, &logflags);
+						firstblock, dfops, &logflags);
 				if (error)
 					goto error0;
 				goto nodelete;
@@ -5297,7 +5297,7 @@ xfs_bunmapi(
 				del.br_state = XFS_EXT_UNWRITTEN;
 				error = xfs_bmap_add_extent_unwritten_real(tp,
 						ip, &lastx, &cur, &del,
-						firstblock, flist, &logflags);
+						firstblock, dfops, &logflags);
 				if (error)
 					goto error0;
 				goto nodelete;
@@ -5355,7 +5355,7 @@ xfs_bunmapi(
 		} else if (cur)
 			cur->bc_private.b.flags &= ~XFS_BTCUR_BPRV_WASDEL;
 
-		error = xfs_bmap_del_extent(ip, tp, &lastx, flist, cur, &del,
+		error = xfs_bmap_del_extent(ip, tp, &lastx, dfops, cur, &del,
 				&tmp_logflags, whichfork);
 		logflags |= tmp_logflags;
 		if (error)
@@ -5389,7 +5389,7 @@ nodelete:
 	 */
 	if (xfs_bmap_needs_btree(ip, whichfork)) {
 		ASSERT(cur == NULL);
-		error = xfs_bmap_extents_to_btree(tp, ip, firstblock, flist,
+		error = xfs_bmap_extents_to_btree(tp, ip, firstblock, dfops,
 			&cur, 0, &tmp_logflags, whichfork);
 		logflags |= tmp_logflags;
 		if (error)
@@ -5678,7 +5678,7 @@ xfs_bmap_shift_extents(
 	int			*done,
 	xfs_fileoff_t		stop_fsb,
 	xfs_fsblock_t		*firstblock,
-	struct xfs_defer_ops	*flist,
+	struct xfs_defer_ops	*dfops,
 	enum shift_direction	direction,
 	int			num_exts)
 {
@@ -5723,7 +5723,7 @@ xfs_bmap_shift_extents(
 	if (ifp->if_flags & XFS_IFBROOT) {
 		cur = xfs_bmbt_init_cursor(mp, tp, ip, whichfork);
 		cur->bc_private.b.firstblock = *firstblock;
-		cur->bc_private.b.flist = flist;
+		cur->bc_private.b.dfops = dfops;
 		cur->bc_private.b.flags = 0;
 	}
 
@@ -5832,7 +5832,7 @@ xfs_bmap_split_extent_at(
 	struct xfs_inode	*ip,
 	xfs_fileoff_t		split_fsb,
 	xfs_fsblock_t		*firstfsb,
-	struct xfs_defer_ops	*free_list)
+	struct xfs_defer_ops	*dfops)
 {
 	int				whichfork = XFS_DATA_FORK;
 	struct xfs_btree_cur		*cur = NULL;
@@ -5894,7 +5894,7 @@ xfs_bmap_split_extent_at(
 	if (ifp->if_flags & XFS_IFBROOT) {
 		cur = xfs_bmbt_init_cursor(mp, tp, ip, whichfork);
 		cur->bc_private.b.firstblock = *firstfsb;
-		cur->bc_private.b.flist = free_list;
+		cur->bc_private.b.dfops = dfops;
 		cur->bc_private.b.flags = 0;
 		error = xfs_bmbt_lookup_eq(cur, got.br_startoff,
 				got.br_startblock,
@@ -5947,7 +5947,7 @@ xfs_bmap_split_extent_at(
 		int tmp_logflags; /* partial log flag return val */
 
 		ASSERT(cur == NULL);
-		error = xfs_bmap_extents_to_btree(tp, ip, firstfsb, free_list,
+		error = xfs_bmap_extents_to_btree(tp, ip, firstfsb, dfops,
 				&cur, 0, &tmp_logflags, whichfork);
 		logflags |= tmp_logflags;
 	}
@@ -5971,7 +5971,7 @@ xfs_bmap_split_extent(
 {
 	struct xfs_mount        *mp = ip->i_mount;
 	struct xfs_trans        *tp;
-	struct xfs_defer_ops    free_list;
+	struct xfs_defer_ops    dfops;
 	xfs_fsblock_t           firstfsb;
 	int                     error;
 
@@ -5983,21 +5983,21 @@ xfs_bmap_split_extent(
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
 	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
 
-	xfs_defer_init(&free_list, &firstfsb);
+	xfs_defer_init(&dfops, &firstfsb);
 
 	error = xfs_bmap_split_extent_at(tp, ip, split_fsb,
-			&firstfsb, &free_list);
+			&firstfsb, &dfops);
 	if (error)
 		goto out;
 
-	error = xfs_defer_finish(&tp, &free_list, NULL);
+	error = xfs_defer_finish(&tp, &dfops, NULL);
 	if (error)
 		goto out;
 
 	return xfs_trans_commit(tp);
 
 out:
-	xfs_defer_cancel(&free_list);
+	xfs_defer_cancel(&dfops);
 	xfs_trans_cancel(tp);
 	return error;
 }
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index e2a0425..8c5f530 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -32,7 +32,7 @@ extern kmem_zone_t	*xfs_bmap_free_item_zone;
  */
 struct xfs_bmalloca {
 	xfs_fsblock_t		*firstblock; /* i/o first block allocated */
-	struct xfs_defer_ops	*flist;	/* bmap freelist */
+	struct xfs_defer_ops	*dfops;	/* bmap freelist */
 	struct xfs_trans	*tp;	/* transaction pointer */
 	struct xfs_inode	*ip;	/* incore inode pointer */
 	struct xfs_bmbt_irec	prev;	/* extent before the new one */
@@ -164,7 +164,7 @@ void	xfs_bmap_trace_exlist(struct xfs_inode *ip, xfs_extnum_t cnt,
 
 int	xfs_bmap_add_attrfork(struct xfs_inode *ip, int size, int rsvd);
 void	xfs_bmap_local_to_extents_empty(struct xfs_inode *ip, int whichfork);
-void	xfs_bmap_add_free(struct xfs_mount *mp, struct xfs_defer_ops *flist,
+void	xfs_bmap_add_free(struct xfs_mount *mp, struct xfs_defer_ops *dfops,
 			  xfs_fsblock_t bno, xfs_filblks_t len);
 void	xfs_bmap_compute_maxlevels(struct xfs_mount *mp, int whichfork);
 int	xfs_bmap_first_unused(struct xfs_trans *tp, struct xfs_inode *ip,
@@ -186,18 +186,18 @@ int	xfs_bmapi_write(struct xfs_trans *tp, struct xfs_inode *ip,
 		xfs_fileoff_t bno, xfs_filblks_t len, int flags,
 		xfs_fsblock_t *firstblock, xfs_extlen_t total,
 		struct xfs_bmbt_irec *mval, int *nmap,
-		struct xfs_defer_ops *flist);
+		struct xfs_defer_ops *dfops);
 int	xfs_bunmapi(struct xfs_trans *tp, struct xfs_inode *ip,
 		xfs_fileoff_t bno, xfs_filblks_t len, int flags,
 		xfs_extnum_t nexts, xfs_fsblock_t *firstblock,
-		struct xfs_defer_ops *flist, int *done);
+		struct xfs_defer_ops *dfops, int *done);
 int	xfs_check_nostate_extents(struct xfs_ifork *ifp, xfs_extnum_t idx,
 		xfs_extnum_t num);
 uint	xfs_default_attroffset(struct xfs_inode *ip);
 int	xfs_bmap_shift_extents(struct xfs_trans *tp, struct xfs_inode *ip,
 		xfs_fileoff_t *next_fsb, xfs_fileoff_t offset_shift_fsb,
 		int *done, xfs_fileoff_t stop_fsb, xfs_fsblock_t *firstblock,
-		struct xfs_defer_ops *flist, enum shift_direction direction,
+		struct xfs_defer_ops *dfops, enum shift_direction direction,
 		int num_exts);
 int	xfs_bmap_split_extent(struct xfs_inode *ip, xfs_fileoff_t split_offset);
 
diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
index fa5e3a5..18b5361 100644
--- a/fs/xfs/libxfs/xfs_bmap_btree.c
+++ b/fs/xfs/libxfs/xfs_bmap_btree.c
@@ -407,11 +407,11 @@ xfs_bmbt_dup_cursor(
 			cur->bc_private.b.ip, cur->bc_private.b.whichfork);
 
 	/*
-	 * Copy the firstblock, flist, and flags values,
+	 * Copy the firstblock, dfops, and flags values,
 	 * since init cursor doesn't get them.
 	 */
 	new->bc_private.b.firstblock = cur->bc_private.b.firstblock;
-	new->bc_private.b.flist = cur->bc_private.b.flist;
+	new->bc_private.b.dfops = cur->bc_private.b.dfops;
 	new->bc_private.b.flags = cur->bc_private.b.flags;
 
 	return new;
@@ -424,7 +424,7 @@ xfs_bmbt_update_cursor(
 {
 	ASSERT((dst->bc_private.b.firstblock != NULLFSBLOCK) ||
 	       (dst->bc_private.b.ip->i_d.di_flags & XFS_DIFLAG_REALTIME));
-	ASSERT(dst->bc_private.b.flist == src->bc_private.b.flist);
+	ASSERT(dst->bc_private.b.dfops == src->bc_private.b.dfops);
 
 	dst->bc_private.b.allocated += src->bc_private.b.allocated;
 	dst->bc_private.b.firstblock = src->bc_private.b.firstblock;
@@ -463,7 +463,7 @@ xfs_bmbt_alloc_block(
 		 * block allocation here and corrupt the filesystem.
 		 */
 		args.minleft = args.tp->t_blk_res;
-	} else if (cur->bc_private.b.flist->dop_low) {
+	} else if (cur->bc_private.b.dfops->dop_low) {
 		args.type = XFS_ALLOCTYPE_START_BNO;
 	} else {
 		args.type = XFS_ALLOCTYPE_NEAR_BNO;
@@ -491,7 +491,7 @@ xfs_bmbt_alloc_block(
 		error = xfs_alloc_vextent(&args);
 		if (error)
 			goto error0;
-		cur->bc_private.b.flist->dop_low = true;
+		cur->bc_private.b.dfops->dop_low = true;
 	}
 	if (args.fsbno == NULLFSBLOCK) {
 		XFS_BTREE_TRACE_CURSOR(cur, XBT_EXIT);
@@ -527,7 +527,7 @@ xfs_bmbt_free_block(
 	struct xfs_trans	*tp = cur->bc_tp;
 	xfs_fsblock_t		fsbno = XFS_DADDR_TO_FSB(mp, XFS_BUF_ADDR(bp));
 
-	xfs_bmap_add_free(mp, cur->bc_private.b.flist, fsbno, 1);
+	xfs_bmap_add_free(mp, cur->bc_private.b.dfops, fsbno, 1);
 	ip->i_d.di_nblocks--;
 
 	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
@@ -789,7 +789,7 @@ xfs_bmbt_init_cursor(
 	cur->bc_private.b.forksize = XFS_IFORK_SIZE(ip, whichfork);
 	cur->bc_private.b.ip = ip;
 	cur->bc_private.b.firstblock = NULLFSBLOCK;
-	cur->bc_private.b.flist = NULL;
+	cur->bc_private.b.dfops = NULL;
 	cur->bc_private.b.allocated = 0;
 	cur->bc_private.b.flags = 0;
 	cur->bc_private.b.whichfork = whichfork;
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index ae714a8..7483cac 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -234,12 +234,12 @@ typedef struct xfs_btree_cur
 	union {
 		struct {			/* needed for BNO, CNT, INO */
 			struct xfs_buf	*agbp;	/* agf/agi buffer pointer */
-			struct xfs_defer_ops *flist;	/* deferred updates */
+			struct xfs_defer_ops *dfops;	/* deferred updates */
 			xfs_agnumber_t	agno;	/* ag number */
 		} a;
 		struct {			/* needed for BMAP */
 			struct xfs_inode *ip;	/* pointer to our inode */
-			struct xfs_defer_ops *flist;	/* deferred updates */
+			struct xfs_defer_ops *dfops;	/* deferred updates */
 			xfs_fsblock_t	firstblock;	/* 1st blk allocated */
 			int		allocated;	/* count of alloced */
 			short		forksize;	/* fork's inode space */
diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c
index 097bf77..68594c7 100644
--- a/fs/xfs/libxfs/xfs_da_btree.c
+++ b/fs/xfs/libxfs/xfs_da_btree.c
@@ -2030,7 +2030,7 @@ xfs_da_grow_inode_int(
 	error = xfs_bmapi_write(tp, dp, *bno, count,
 			xfs_bmapi_aflag(w)|XFS_BMAPI_METADATA|XFS_BMAPI_CONTIG,
 			args->firstblock, args->total, &map, &nmap,
-			args->flist);
+			args->dfops);
 	if (error)
 		return error;
 
@@ -2053,7 +2053,7 @@ xfs_da_grow_inode_int(
 			error = xfs_bmapi_write(tp, dp, b, c,
 					xfs_bmapi_aflag(w)|XFS_BMAPI_METADATA,
 					args->firstblock, args->total,
-					&mapp[mapi], &nmap, args->flist);
+					&mapp[mapi], &nmap, args->dfops);
 			if (error)
 				goto out_free_map;
 			if (nmap < 1)
@@ -2363,7 +2363,7 @@ xfs_da_shrink_inode(
 		 */
 		error = xfs_bunmapi(tp, dp, dead_blkno, count,
 				    xfs_bmapi_aflag(w), 0, args->firstblock,
-				    args->flist, &done);
+				    args->dfops, &done);
 		if (error == -ENOSPC) {
 			if (w != XFS_DATA_FORK)
 				break;
diff --git a/fs/xfs/libxfs/xfs_da_btree.h b/fs/xfs/libxfs/xfs_da_btree.h
index 249813a..98c75cb 100644
--- a/fs/xfs/libxfs/xfs_da_btree.h
+++ b/fs/xfs/libxfs/xfs_da_btree.h
@@ -70,7 +70,7 @@ typedef struct xfs_da_args {
 	xfs_ino_t	inumber;	/* input/output inode number */
 	struct xfs_inode *dp;		/* directory inode to manipulate */
 	xfs_fsblock_t	*firstblock;	/* ptr to firstblock for bmap calls */
-	struct xfs_defer_ops *flist;	/* ptr to freelist for bmap_finish */
+	struct xfs_defer_ops *dfops;	/* ptr to freelist for bmap_finish */
 	struct xfs_trans *trans;	/* current trans (changes over time) */
 	xfs_extlen_t	total;		/* total blocks needed, for 1st bmap */
 	int		whichfork;	/* data or attribute fork */
diff --git a/fs/xfs/libxfs/xfs_dir2.c b/fs/xfs/libxfs/xfs_dir2.c
index 0523100..20a96dd 100644
--- a/fs/xfs/libxfs/xfs_dir2.c
+++ b/fs/xfs/libxfs/xfs_dir2.c
@@ -260,7 +260,7 @@ xfs_dir_createname(
 	struct xfs_name		*name,
 	xfs_ino_t		inum,		/* new entry inode number */
 	xfs_fsblock_t		*first,		/* bmap's firstblock */
-	struct xfs_defer_ops	*flist,		/* bmap's freeblock list */
+	struct xfs_defer_ops	*dfops,		/* bmap's freeblock list */
 	xfs_extlen_t		total)		/* bmap's total block count */
 {
 	struct xfs_da_args	*args;
@@ -287,7 +287,7 @@ xfs_dir_createname(
 	args->inumber = inum;
 	args->dp = dp;
 	args->firstblock = first;
-	args->flist = flist;
+	args->dfops = dfops;
 	args->total = total;
 	args->whichfork = XFS_DATA_FORK;
 	args->trans = tp;
@@ -437,7 +437,7 @@ xfs_dir_removename(
 	struct xfs_name	*name,
 	xfs_ino_t	ino,
 	xfs_fsblock_t	*first,		/* bmap's firstblock */
-	struct xfs_defer_ops	*flist,		/* bmap's freeblock list */
+	struct xfs_defer_ops	*dfops,		/* bmap's freeblock list */
 	xfs_extlen_t	total)		/* bmap's total block count */
 {
 	struct xfs_da_args *args;
@@ -459,7 +459,7 @@ xfs_dir_removename(
 	args->inumber = ino;
 	args->dp = dp;
 	args->firstblock = first;
-	args->flist = flist;
+	args->dfops = dfops;
 	args->total = total;
 	args->whichfork = XFS_DATA_FORK;
 	args->trans = tp;
@@ -499,7 +499,7 @@ xfs_dir_replace(
 	struct xfs_name	*name,		/* name of entry to replace */
 	xfs_ino_t	inum,		/* new inode number */
 	xfs_fsblock_t	*first,		/* bmap's firstblock */
-	struct xfs_defer_ops	*flist,		/* bmap's freeblock list */
+	struct xfs_defer_ops	*dfops,		/* bmap's freeblock list */
 	xfs_extlen_t	total)		/* bmap's total block count */
 {
 	struct xfs_da_args *args;
@@ -524,7 +524,7 @@ xfs_dir_replace(
 	args->inumber = inum;
 	args->dp = dp;
 	args->firstblock = first;
-	args->flist = flist;
+	args->dfops = dfops;
 	args->total = total;
 	args->whichfork = XFS_DATA_FORK;
 	args->trans = tp;
@@ -681,7 +681,7 @@ xfs_dir2_shrink_inode(
 
 	/* Unmap the fsblock(s). */
 	error = xfs_bunmapi(tp, dp, da, args->geo->fsbcount, 0, 0,
-			    args->firstblock, args->flist, &done);
+			    args->firstblock, args->dfops, &done);
 	if (error) {
 		/*
 		 * ENOSPC actually can happen if we're in a removename with no
diff --git a/fs/xfs/libxfs/xfs_dir2.h b/fs/xfs/libxfs/xfs_dir2.h
index 5737d85..a9bab0e 100644
--- a/fs/xfs/libxfs/xfs_dir2.h
+++ b/fs/xfs/libxfs/xfs_dir2.h
@@ -129,18 +129,18 @@ extern int xfs_dir_init(struct xfs_trans *tp, struct xfs_inode *dp,
 extern int xfs_dir_createname(struct xfs_trans *tp, struct xfs_inode *dp,
 				struct xfs_name *name, xfs_ino_t inum,
 				xfs_fsblock_t *first,
-				struct xfs_defer_ops *flist, xfs_extlen_t tot);
+				struct xfs_defer_ops *dfops, xfs_extlen_t tot);
 extern int xfs_dir_lookup(struct xfs_trans *tp, struct xfs_inode *dp,
 				struct xfs_name *name, xfs_ino_t *inum,
 				struct xfs_name *ci_name);
 extern int xfs_dir_removename(struct xfs_trans *tp, struct xfs_inode *dp,
 				struct xfs_name *name, xfs_ino_t ino,
 				xfs_fsblock_t *first,
-				struct xfs_defer_ops *flist, xfs_extlen_t tot);
+				struct xfs_defer_ops *dfops, xfs_extlen_t tot);
 extern int xfs_dir_replace(struct xfs_trans *tp, struct xfs_inode *dp,
 				struct xfs_name *name, xfs_ino_t inum,
 				xfs_fsblock_t *first,
-				struct xfs_defer_ops *flist, xfs_extlen_t tot);
+				struct xfs_defer_ops *dfops, xfs_extlen_t tot);
 extern int xfs_dir_canenter(struct xfs_trans *tp, struct xfs_inode *dp,
 				struct xfs_name *name);
 
diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index f2e29a1..dbc3e35 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -1818,7 +1818,7 @@ xfs_difree_inode_chunk(
 	struct xfs_mount		*mp,
 	xfs_agnumber_t			agno,
 	struct xfs_inobt_rec_incore	*rec,
-	struct xfs_defer_ops		*flist)
+	struct xfs_defer_ops		*dfops)
 {
 	xfs_agblock_t	sagbno = XFS_AGINO_TO_AGBNO(mp, rec->ir_startino);
 	int		startidx, endidx;
@@ -1829,7 +1829,7 @@ xfs_difree_inode_chunk(
 
 	if (!xfs_inobt_issparse(rec->ir_holemask)) {
 		/* not sparse, calculate extent info directly */
-		xfs_bmap_add_free(mp, flist, XFS_AGB_TO_FSB(mp, agno, sagbno),
+		xfs_bmap_add_free(mp, dfops, XFS_AGB_TO_FSB(mp, agno, sagbno),
 				  mp->m_ialloc_blks);
 		return;
 	}
@@ -1873,7 +1873,7 @@ xfs_difree_inode_chunk(
 
 		ASSERT(agbno % mp->m_sb.sb_spino_align == 0);
 		ASSERT(contigblk % mp->m_sb.sb_spino_align == 0);
-		xfs_bmap_add_free(mp, flist, XFS_AGB_TO_FSB(mp, agno, agbno),
+		xfs_bmap_add_free(mp, dfops, XFS_AGB_TO_FSB(mp, agno, agbno),
 				  contigblk);
 
 		/* reset range to current bit and carry on... */
@@ -1890,7 +1890,7 @@ xfs_difree_inobt(
 	struct xfs_trans		*tp,
 	struct xfs_buf			*agbp,
 	xfs_agino_t			agino,
-	struct xfs_defer_ops		*flist,
+	struct xfs_defer_ops		*dfops,
 	struct xfs_icluster		*xic,
 	struct xfs_inobt_rec_incore	*orec)
 {
@@ -1977,7 +1977,7 @@ xfs_difree_inobt(
 			goto error0;
 		}
 
-		xfs_difree_inode_chunk(mp, agno, &rec, flist);
+		xfs_difree_inode_chunk(mp, agno, &rec, dfops);
 	} else {
 		xic->deleted = 0;
 
@@ -2122,7 +2122,7 @@ int
 xfs_difree(
 	struct xfs_trans	*tp,		/* transaction pointer */
 	xfs_ino_t		inode,		/* inode to be freed */
-	struct xfs_defer_ops	*flist,		/* extents to free */
+	struct xfs_defer_ops	*dfops,		/* extents to free */
 	struct xfs_icluster	*xic)	/* cluster info if deleted */
 {
 	/* REFERENCED */
@@ -2174,7 +2174,7 @@ xfs_difree(
 	/*
 	 * Fix up the inode allocation btree.
 	 */
-	error = xfs_difree_inobt(mp, tp, agbp, agino, flist, xic, &rec);
+	error = xfs_difree_inobt(mp, tp, agbp, agino, dfops, xic, &rec);
 	if (error)
 		goto error0;
 
diff --git a/fs/xfs/libxfs/xfs_ialloc.h b/fs/xfs/libxfs/xfs_ialloc.h
index 2e06b67..0bb8966 100644
--- a/fs/xfs/libxfs/xfs_ialloc.h
+++ b/fs/xfs/libxfs/xfs_ialloc.h
@@ -95,7 +95,7 @@ int					/* error */
 xfs_difree(
 	struct xfs_trans *tp,		/* transaction pointer */
 	xfs_ino_t	inode,		/* inode to be freed */
-	struct xfs_defer_ops *flist,	/* extents to free */
+	struct xfs_defer_ops *dfops,	/* extents to free */
 	struct xfs_icluster *ifree);	/* cluster info if deleted */
 
 /*
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 928dfa4..62d194e 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -685,7 +685,7 @@ xfs_bmap_punch_delalloc_range(
 		xfs_bmbt_irec_t	imap;
 		int		nimaps = 1;
 		xfs_fsblock_t	firstblock;
-		struct xfs_defer_ops flist;
+		struct xfs_defer_ops dfops;
 
 		/*
 		 * Map the range first and check that it is a delalloc extent
@@ -716,18 +716,18 @@ xfs_bmap_punch_delalloc_range(
 		WARN_ON(imap.br_blockcount == 0);
 
 		/*
-		 * Note: while we initialise the firstblock/flist pair, they
+		 * Note: while we initialise the firstblock/dfops pair, they
 		 * should never be used because blocks should never be
 		 * allocated or freed for a delalloc extent and hence we need
 		 * don't cancel or finish them after the xfs_bunmapi() call.
 		 */
-		xfs_defer_init(&flist, &firstblock);
+		xfs_defer_init(&dfops, &firstblock);
 		error = xfs_bunmapi(NULL, ip, start_fsb, 1, 0, 1, &firstblock,
-					&flist, &done);
+					&dfops, &done);
 		if (error)
 			break;
 
-		ASSERT(!xfs_defer_has_unfinished_work(&flist));
+		ASSERT(!xfs_defer_has_unfinished_work(&dfops));
 next_block:
 		start_fsb++;
 		remaining--;
@@ -884,7 +884,7 @@ xfs_alloc_file_space(
 	int			rt;
 	xfs_trans_t		*tp;
 	xfs_bmbt_irec_t		imaps[1], *imapp;
-	struct xfs_defer_ops	free_list;
+	struct xfs_defer_ops	dfops;
 	uint			qblocks, resblks, resrtextents;
 	int			error;
 
@@ -975,17 +975,17 @@ xfs_alloc_file_space(
 
 		xfs_trans_ijoin(tp, ip, 0);
 
-		xfs_defer_init(&free_list, &firstfsb);
+		xfs_defer_init(&dfops, &firstfsb);
 		error = xfs_bmapi_write(tp, ip, startoffset_fsb,
 					allocatesize_fsb, alloc_type, &firstfsb,
-					resblks, imapp, &nimaps, &free_list);
+					resblks, imapp, &nimaps, &dfops);
 		if (error)
 			goto error0;
 
 		/*
 		 * Complete the transaction
 		 */
-		error = xfs_defer_finish(&tp, &free_list, NULL);
+		error = xfs_defer_finish(&tp, &dfops, NULL);
 		if (error)
 			goto error0;
 
@@ -1008,7 +1008,7 @@ xfs_alloc_file_space(
 	return error;
 
 error0:	/* Cancel bmap, unlock inode, unreserve quota blocks, cancel trans */
-	xfs_defer_cancel(&free_list);
+	xfs_defer_cancel(&dfops);
 	xfs_trans_unreserve_quota_nblks(tp, ip, (long)qblocks, 0, quota_flag);
 
 error1:	/* Just cancel transaction */
@@ -1122,7 +1122,7 @@ xfs_free_file_space(
 	xfs_fileoff_t		endoffset_fsb;
 	int			error;
 	xfs_fsblock_t		firstfsb;
-	struct xfs_defer_ops	free_list;
+	struct xfs_defer_ops	dfops;
 	xfs_bmbt_irec_t		imap;
 	xfs_off_t		ioffset;
 	xfs_off_t		iendoffset;
@@ -1245,17 +1245,17 @@ xfs_free_file_space(
 		/*
 		 * issue the bunmapi() call to free the blocks
 		 */
-		xfs_defer_init(&free_list, &firstfsb);
+		xfs_defer_init(&dfops, &firstfsb);
 		error = xfs_bunmapi(tp, ip, startoffset_fsb,
 				  endoffset_fsb - startoffset_fsb,
-				  0, 2, &firstfsb, &free_list, &done);
+				  0, 2, &firstfsb, &dfops, &done);
 		if (error)
 			goto error0;
 
 		/*
 		 * complete the transaction
 		 */
-		error = xfs_defer_finish(&tp, &free_list, ip);
+		error = xfs_defer_finish(&tp, &dfops, ip);
 		if (error)
 			goto error0;
 
@@ -1267,7 +1267,7 @@ xfs_free_file_space(
 	return error;
 
  error0:
-	xfs_defer_cancel(&free_list);
+	xfs_defer_cancel(&dfops);
  error1:
 	xfs_trans_cancel(tp);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
@@ -1333,7 +1333,7 @@ xfs_shift_file_space(
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_trans	*tp;
 	int			error;
-	struct xfs_defer_ops	free_list;
+	struct xfs_defer_ops	dfops;
 	xfs_fsblock_t		first_block;
 	xfs_fileoff_t		stop_fsb;
 	xfs_fileoff_t		next_fsb;
@@ -1411,19 +1411,19 @@ xfs_shift_file_space(
 
 		xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
 
-		xfs_defer_init(&free_list, &first_block);
+		xfs_defer_init(&dfops, &first_block);
 
 		/*
 		 * We are using the write transaction in which max 2 bmbt
 		 * updates are allowed
 		 */
 		error = xfs_bmap_shift_extents(tp, ip, &next_fsb, shift_fsb,
-				&done, stop_fsb, &first_block, &free_list,
+				&done, stop_fsb, &first_block, &dfops,
 				direction, XFS_BMAP_MAX_SHIFT_EXTENTS);
 		if (error)
 			goto out_bmap_cancel;
 
-		error = xfs_defer_finish(&tp, &free_list, NULL);
+		error = xfs_defer_finish(&tp, &dfops, NULL);
 		if (error)
 			goto out_bmap_cancel;
 
@@ -1433,7 +1433,7 @@ xfs_shift_file_space(
 	return error;
 
 out_bmap_cancel:
-	xfs_defer_cancel(&free_list);
+	xfs_defer_cancel(&dfops);
 out_trans_cancel:
 	xfs_trans_cancel(tp);
 	return error;
diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
index 764e1cc..8ca21b8 100644
--- a/fs/xfs/xfs_dquot.c
+++ b/fs/xfs/xfs_dquot.c
@@ -307,7 +307,7 @@ xfs_qm_dqalloc(
 	xfs_buf_t	**O_bpp)
 {
 	xfs_fsblock_t	firstblock;
-	struct xfs_defer_ops flist;
+	struct xfs_defer_ops dfops;
 	xfs_bmbt_irec_t map;
 	int		nmaps, error;
 	xfs_buf_t	*bp;
@@ -320,7 +320,7 @@ xfs_qm_dqalloc(
 	/*
 	 * Initialize the bmap freelist prior to calling bmapi code.
 	 */
-	xfs_defer_init(&flist, &firstblock);
+	xfs_defer_init(&dfops, &firstblock);
 	xfs_ilock(quotip, XFS_ILOCK_EXCL);
 	/*
 	 * Return if this type of quotas is turned off while we didn't
@@ -336,7 +336,7 @@ xfs_qm_dqalloc(
 	error = xfs_bmapi_write(tp, quotip, offset_fsb,
 				XFS_DQUOT_CLUSTER_SIZE_FSB, XFS_BMAPI_METADATA,
 				&firstblock, XFS_QM_DQALLOC_SPACE_RES(mp),
-				&map, &nmaps, &flist);
+				&map, &nmaps, &dfops);
 	if (error)
 		goto error0;
 	ASSERT(map.br_blockcount == XFS_DQUOT_CLUSTER_SIZE_FSB);
@@ -382,7 +382,7 @@ xfs_qm_dqalloc(
 
 	xfs_trans_bhold(tp, bp);
 
-	error = xfs_defer_finish(tpp, &flist, NULL);
+	error = xfs_defer_finish(tpp, &dfops, NULL);
 	if (error)
 		goto error1;
 
@@ -398,7 +398,7 @@ xfs_qm_dqalloc(
 	return 0;
 
 error1:
-	xfs_defer_cancel(&flist);
+	xfs_defer_cancel(&dfops);
 error0:
 	xfs_iunlock(quotip, XFS_ILOCK_EXCL);
 
diff --git a/fs/xfs/xfs_filestream.c b/fs/xfs/xfs_filestream.c
index 3e990fb..4a33a33 100644
--- a/fs/xfs/xfs_filestream.c
+++ b/fs/xfs/xfs_filestream.c
@@ -386,7 +386,7 @@ xfs_filestream_new_ag(
 	}
 
 	flags = (ap->userdata ? XFS_PICK_USERDATA : 0) |
-	        (ap->flist->dop_low ? XFS_PICK_LOWSPACE : 0);
+	        (ap->dfops->dop_low ? XFS_PICK_LOWSPACE : 0);
 
 	err = xfs_filestream_pick_ag(pip, startag, agp, flags, minlen);
 
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 3ce50da..e08eaea 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1123,7 +1123,7 @@ xfs_create(
 	struct xfs_inode	*ip = NULL;
 	struct xfs_trans	*tp = NULL;
 	int			error;
-	struct xfs_defer_ops	free_list;
+	struct xfs_defer_ops	dfops;
 	xfs_fsblock_t		first_block;
 	bool                    unlock_dp_on_error = false;
 	prid_t			prid;
@@ -1183,7 +1183,7 @@ xfs_create(
 		      XFS_IOLOCK_PARENT | XFS_ILOCK_PARENT);
 	unlock_dp_on_error = true;
 
-	xfs_defer_init(&free_list, &first_block);
+	xfs_defer_init(&dfops, &first_block);
 
 	/*
 	 * Reserve disk quota and the inode.
@@ -1220,7 +1220,7 @@ xfs_create(
 	unlock_dp_on_error = false;
 
 	error = xfs_dir_createname(tp, dp, name, ip->i_ino,
-					&first_block, &free_list, resblks ?
+					&first_block, &dfops, resblks ?
 					resblks - XFS_IALLOC_SPACE_RES(mp) : 0);
 	if (error) {
 		ASSERT(error != -ENOSPC);
@@ -1254,7 +1254,7 @@ xfs_create(
 	 */
 	xfs_qm_vop_create_dqattach(tp, ip, udqp, gdqp, pdqp);
 
-	error = xfs_defer_finish(&tp, &free_list, NULL);
+	error = xfs_defer_finish(&tp, &dfops, NULL);
 	if (error)
 		goto out_bmap_cancel;
 
@@ -1270,7 +1270,7 @@ xfs_create(
 	return 0;
 
  out_bmap_cancel:
-	xfs_defer_cancel(&free_list);
+	xfs_defer_cancel(&dfops);
  out_trans_cancel:
 	xfs_trans_cancel(tp);
  out_release_inode:
@@ -1402,7 +1402,7 @@ xfs_link(
 	xfs_mount_t		*mp = tdp->i_mount;
 	xfs_trans_t		*tp;
 	int			error;
-	struct xfs_defer_ops	free_list;
+	struct xfs_defer_ops	dfops;
 	xfs_fsblock_t           first_block;
 	int			resblks;
 
@@ -1453,7 +1453,7 @@ xfs_link(
 			goto error_return;
 	}
 
-	xfs_defer_init(&free_list, &first_block);
+	xfs_defer_init(&dfops, &first_block);
 
 	/*
 	 * Handle initial link state of O_TMPFILE inode
@@ -1465,7 +1465,7 @@ xfs_link(
 	}
 
 	error = xfs_dir_createname(tp, tdp, target_name, sip->i_ino,
-					&first_block, &free_list, resblks);
+					&first_block, &dfops, resblks);
 	if (error)
 		goto error_return;
 	xfs_trans_ichgtime(tp, tdp, XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG);
@@ -1483,9 +1483,9 @@ xfs_link(
 	if (mp->m_flags & (XFS_MOUNT_WSYNC|XFS_MOUNT_DIRSYNC))
 		xfs_trans_set_sync(tp);
 
-	error = xfs_defer_finish(&tp, &free_list, NULL);
+	error = xfs_defer_finish(&tp, &dfops, NULL);
 	if (error) {
-		xfs_defer_cancel(&free_list);
+		xfs_defer_cancel(&dfops);
 		goto error_return;
 	}
 
@@ -1527,7 +1527,7 @@ xfs_itruncate_extents(
 {
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_trans	*tp = *tpp;
-	struct xfs_defer_ops	free_list;
+	struct xfs_defer_ops	dfops;
 	xfs_fsblock_t		first_block;
 	xfs_fileoff_t		first_unmap_block;
 	xfs_fileoff_t		last_block;
@@ -1563,12 +1563,12 @@ xfs_itruncate_extents(
 	ASSERT(first_unmap_block < last_block);
 	unmap_len = last_block - first_unmap_block + 1;
 	while (!done) {
-		xfs_defer_init(&free_list, &first_block);
+		xfs_defer_init(&dfops, &first_block);
 		error = xfs_bunmapi(tp, ip,
 				    first_unmap_block, unmap_len,
 				    xfs_bmapi_aflag(whichfork),
 				    XFS_ITRUNC_MAX_EXTENTS,
-				    &first_block, &free_list,
+				    &first_block, &dfops,
 				    &done);
 		if (error)
 			goto out_bmap_cancel;
@@ -1577,7 +1577,7 @@ xfs_itruncate_extents(
 		 * Duplicate the transaction that has the permanent
 		 * reservation and commit the old transaction.
 		 */
-		error = xfs_defer_finish(&tp, &free_list, ip);
+		error = xfs_defer_finish(&tp, &dfops, ip);
 		if (error)
 			goto out_bmap_cancel;
 
@@ -1603,7 +1603,7 @@ out_bmap_cancel:
 	 * the transaction can be properly aborted.  We just need to make sure
 	 * we're not holding any resources that we were not when we came in.
 	 */
-	xfs_defer_cancel(&free_list);
+	xfs_defer_cancel(&dfops);
 	goto out;
 }
 
@@ -1744,7 +1744,7 @@ STATIC int
 xfs_inactive_ifree(
 	struct xfs_inode *ip)
 {
-	struct xfs_defer_ops	free_list;
+	struct xfs_defer_ops	dfops;
 	xfs_fsblock_t		first_block;
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_trans	*tp;
@@ -1781,8 +1781,8 @@ xfs_inactive_ifree(
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
 	xfs_trans_ijoin(tp, ip, 0);
 
-	xfs_defer_init(&free_list, &first_block);
-	error = xfs_ifree(tp, ip, &free_list);
+	xfs_defer_init(&dfops, &first_block);
+	error = xfs_ifree(tp, ip, &dfops);
 	if (error) {
 		/*
 		 * If we fail to free the inode, shut down.  The cancel
@@ -1808,11 +1808,11 @@ xfs_inactive_ifree(
 	 * Just ignore errors at this point.  There is nothing we can do except
 	 * to try to keep going. Make sure it's not a silent error.
 	 */
-	error = xfs_defer_finish(&tp, &free_list, NULL);
+	error = xfs_defer_finish(&tp, &dfops, NULL);
 	if (error) {
 		xfs_notice(mp, "%s: xfs_defer_finish returned error %d",
 			__func__, error);
-		xfs_defer_cancel(&free_list);
+		xfs_defer_cancel(&dfops);
 	}
 	error = xfs_trans_commit(tp);
 	if (error)
@@ -2368,7 +2368,7 @@ int
 xfs_ifree(
 	xfs_trans_t	*tp,
 	xfs_inode_t	*ip,
-	struct xfs_defer_ops	*flist)
+	struct xfs_defer_ops	*dfops)
 {
 	int			error;
 	struct xfs_icluster	xic = { 0 };
@@ -2387,7 +2387,7 @@ xfs_ifree(
 	if (error)
 		return error;
 
-	error = xfs_difree(tp, ip->i_ino, flist, &xic);
+	error = xfs_difree(tp, ip->i_ino, dfops, &xic);
 	if (error)
 		return error;
 
@@ -2490,7 +2490,7 @@ xfs_remove(
 	xfs_trans_t             *tp = NULL;
 	int			is_dir = S_ISDIR(VFS_I(ip)->i_mode);
 	int                     error = 0;
-	struct xfs_defer_ops	free_list;
+	struct xfs_defer_ops	dfops;
 	xfs_fsblock_t           first_block;
 	uint			resblks;
 
@@ -2572,9 +2572,9 @@ xfs_remove(
 	if (error)
 		goto out_trans_cancel;
 
-	xfs_defer_init(&free_list, &first_block);
+	xfs_defer_init(&dfops, &first_block);
 	error = xfs_dir_removename(tp, dp, name, ip->i_ino,
-					&first_block, &free_list, resblks);
+					&first_block, &dfops, resblks);
 	if (error) {
 		ASSERT(error != -ENOENT);
 		goto out_bmap_cancel;
@@ -2588,7 +2588,7 @@ xfs_remove(
 	if (mp->m_flags & (XFS_MOUNT_WSYNC|XFS_MOUNT_DIRSYNC))
 		xfs_trans_set_sync(tp);
 
-	error = xfs_defer_finish(&tp, &free_list, NULL);
+	error = xfs_defer_finish(&tp, &dfops, NULL);
 	if (error)
 		goto out_bmap_cancel;
 
@@ -2602,7 +2602,7 @@ xfs_remove(
 	return 0;
 
  out_bmap_cancel:
-	xfs_defer_cancel(&free_list);
+	xfs_defer_cancel(&dfops);
  out_trans_cancel:
 	xfs_trans_cancel(tp);
  std_return:
@@ -2663,7 +2663,7 @@ xfs_sort_for_rename(
 static int
 xfs_finish_rename(
 	struct xfs_trans	*tp,
-	struct xfs_defer_ops	*free_list)
+	struct xfs_defer_ops	*dfops)
 {
 	int			error;
 
@@ -2674,9 +2674,9 @@ xfs_finish_rename(
 	if (tp->t_mountp->m_flags & (XFS_MOUNT_WSYNC|XFS_MOUNT_DIRSYNC))
 		xfs_trans_set_sync(tp);
 
-	error = xfs_defer_finish(&tp, free_list, NULL);
+	error = xfs_defer_finish(&tp, dfops, NULL);
 	if (error) {
-		xfs_defer_cancel(free_list);
+		xfs_defer_cancel(dfops);
 		xfs_trans_cancel(tp);
 		return error;
 	}
@@ -2698,7 +2698,7 @@ xfs_cross_rename(
 	struct xfs_inode	*dp2,
 	struct xfs_name		*name2,
 	struct xfs_inode	*ip2,
-	struct xfs_defer_ops	*free_list,
+	struct xfs_defer_ops	*dfops,
 	xfs_fsblock_t		*first_block,
 	int			spaceres)
 {
@@ -2710,14 +2710,14 @@ xfs_cross_rename(
 	/* Swap inode number for dirent in first parent */
 	error = xfs_dir_replace(tp, dp1, name1,
 				ip2->i_ino,
-				first_block, free_list, spaceres);
+				first_block, dfops, spaceres);
 	if (error)
 		goto out_trans_abort;
 
 	/* Swap inode number for dirent in second parent */
 	error = xfs_dir_replace(tp, dp2, name2,
 				ip1->i_ino,
-				first_block, free_list, spaceres);
+				first_block, dfops, spaceres);
 	if (error)
 		goto out_trans_abort;
 
@@ -2732,7 +2732,7 @@ xfs_cross_rename(
 		if (S_ISDIR(VFS_I(ip2)->i_mode)) {
 			error = xfs_dir_replace(tp, ip2, &xfs_name_dotdot,
 						dp1->i_ino, first_block,
-						free_list, spaceres);
+						dfops, spaceres);
 			if (error)
 				goto out_trans_abort;
 
@@ -2759,7 +2759,7 @@ xfs_cross_rename(
 		if (S_ISDIR(VFS_I(ip1)->i_mode)) {
 			error = xfs_dir_replace(tp, ip1, &xfs_name_dotdot,
 						dp2->i_ino, first_block,
-						free_list, spaceres);
+						dfops, spaceres);
 			if (error)
 				goto out_trans_abort;
 
@@ -2798,10 +2798,10 @@ xfs_cross_rename(
 	}
 	xfs_trans_ichgtime(tp, dp1, XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG);
 	xfs_trans_log_inode(tp, dp1, XFS_ILOG_CORE);
-	return xfs_finish_rename(tp, free_list);
+	return xfs_finish_rename(tp, dfops);
 
 out_trans_abort:
-	xfs_defer_cancel(free_list);
+	xfs_defer_cancel(dfops);
 	xfs_trans_cancel(tp);
 	return error;
 }
@@ -2856,7 +2856,7 @@ xfs_rename(
 {
 	struct xfs_mount	*mp = src_dp->i_mount;
 	struct xfs_trans	*tp;
-	struct xfs_defer_ops	free_list;
+	struct xfs_defer_ops	dfops;
 	xfs_fsblock_t		first_block;
 	struct xfs_inode	*wip = NULL;		/* whiteout inode */
 	struct xfs_inode	*inodes[__XFS_SORT_INODES];
@@ -2945,13 +2945,13 @@ xfs_rename(
 		goto out_trans_cancel;
 	}
 
-	xfs_defer_init(&free_list, &first_block);
+	xfs_defer_init(&dfops, &first_block);
 
 	/* RENAME_EXCHANGE is unique from here on. */
 	if (flags & RENAME_EXCHANGE)
 		return xfs_cross_rename(tp, src_dp, src_name, src_ip,
 					target_dp, target_name, target_ip,
-					&free_list, &first_block, spaceres);
+					&dfops, &first_block, spaceres);
 
 	/*
 	 * Set up the target.
@@ -2973,7 +2973,7 @@ xfs_rename(
 		 */
 		error = xfs_dir_createname(tp, target_dp, target_name,
 						src_ip->i_ino, &first_block,
-						&free_list, spaceres);
+						&dfops, spaceres);
 		if (error)
 			goto out_bmap_cancel;
 
@@ -3013,7 +3013,7 @@ xfs_rename(
 		 */
 		error = xfs_dir_replace(tp, target_dp, target_name,
 					src_ip->i_ino,
-					&first_block, &free_list, spaceres);
+					&first_block, &dfops, spaceres);
 		if (error)
 			goto out_bmap_cancel;
 
@@ -3048,7 +3048,7 @@ xfs_rename(
 		 */
 		error = xfs_dir_replace(tp, src_ip, &xfs_name_dotdot,
 					target_dp->i_ino,
-					&first_block, &free_list, spaceres);
+					&first_block, &dfops, spaceres);
 		ASSERT(error != -EEXIST);
 		if (error)
 			goto out_bmap_cancel;
@@ -3087,10 +3087,10 @@ xfs_rename(
 	 */
 	if (wip) {
 		error = xfs_dir_replace(tp, src_dp, src_name, wip->i_ino,
-					&first_block, &free_list, spaceres);
+					&first_block, &dfops, spaceres);
 	} else
 		error = xfs_dir_removename(tp, src_dp, src_name, src_ip->i_ino,
-					   &first_block, &free_list, spaceres);
+					   &first_block, &dfops, spaceres);
 	if (error)
 		goto out_bmap_cancel;
 
@@ -3125,13 +3125,13 @@ xfs_rename(
 	if (new_parent)
 		xfs_trans_log_inode(tp, target_dp, XFS_ILOG_CORE);
 
-	error = xfs_finish_rename(tp, &free_list);
+	error = xfs_finish_rename(tp, &dfops);
 	if (wip)
 		IRELE(wip);
 	return error;
 
 out_bmap_cancel:
-	xfs_defer_cancel(&free_list);
+	xfs_defer_cancel(&dfops);
 out_trans_cancel:
 	xfs_trans_cancel(tp);
 out_release_wip:
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index cb7abe84..61b61f51 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -128,7 +128,7 @@ xfs_iomap_write_direct(
 	int		quota_flag;
 	int		rt;
 	xfs_trans_t	*tp;
-	struct xfs_defer_ops free_list;
+	struct xfs_defer_ops dfops;
 	uint		qblocks, resblks, resrtextents;
 	int		error;
 	int		lockmode;
@@ -231,18 +231,18 @@ xfs_iomap_write_direct(
 	 * From this point onwards we overwrite the imap pointer that the
 	 * caller gave to us.
 	 */
-	xfs_defer_init(&free_list, &firstfsb);
+	xfs_defer_init(&dfops, &firstfsb);
 	nimaps = 1;
 	error = xfs_bmapi_write(tp, ip, offset_fsb, count_fsb,
 				bmapi_flags, &firstfsb, resblks, imap,
-				&nimaps, &free_list);
+				&nimaps, &dfops);
 	if (error)
 		goto out_bmap_cancel;
 
 	/*
 	 * Complete the transaction
 	 */
-	error = xfs_defer_finish(&tp, &free_list, NULL);
+	error = xfs_defer_finish(&tp, &dfops, NULL);
 	if (error)
 		goto out_bmap_cancel;
 
@@ -266,7 +266,7 @@ out_unlock:
 	return error;
 
 out_bmap_cancel:
-	xfs_defer_cancel(&free_list);
+	xfs_defer_cancel(&dfops);
 	xfs_trans_unreserve_quota_nblks(tp, ip, (long)qblocks, 0, quota_flag);
 out_trans_cancel:
 	xfs_trans_cancel(tp);
@@ -685,7 +685,7 @@ xfs_iomap_write_allocate(
 	xfs_fileoff_t	offset_fsb, last_block;
 	xfs_fileoff_t	end_fsb, map_start_fsb;
 	xfs_fsblock_t	first_block;
-	struct xfs_defer_ops	free_list;
+	struct xfs_defer_ops	dfops;
 	xfs_filblks_t	count_fsb;
 	xfs_trans_t	*tp;
 	int		nimaps;
@@ -727,7 +727,7 @@ xfs_iomap_write_allocate(
 			xfs_ilock(ip, XFS_ILOCK_EXCL);
 			xfs_trans_ijoin(tp, ip, 0);
 
-			xfs_defer_init(&free_list, &first_block);
+			xfs_defer_init(&dfops, &first_block);
 
 			/*
 			 * it is possible that the extents have changed since
@@ -783,11 +783,11 @@ xfs_iomap_write_allocate(
 			error = xfs_bmapi_write(tp, ip, map_start_fsb,
 						count_fsb, 0, &first_block,
 						nres, imap, &nimaps,
-						&free_list);
+						&dfops);
 			if (error)
 				goto trans_cancel;
 
-			error = xfs_defer_finish(&tp, &free_list, NULL);
+			error = xfs_defer_finish(&tp, &dfops, NULL);
 			if (error)
 				goto trans_cancel;
 
@@ -821,7 +821,7 @@ xfs_iomap_write_allocate(
 	}
 
 trans_cancel:
-	xfs_defer_cancel(&free_list);
+	xfs_defer_cancel(&dfops);
 	xfs_trans_cancel(tp);
 error0:
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
@@ -842,7 +842,7 @@ xfs_iomap_write_unwritten(
 	int		nimaps;
 	xfs_trans_t	*tp;
 	xfs_bmbt_irec_t imap;
-	struct xfs_defer_ops free_list;
+	struct xfs_defer_ops dfops;
 	xfs_fsize_t	i_size;
 	uint		resblks;
 	int		error;
@@ -886,11 +886,11 @@ xfs_iomap_write_unwritten(
 		/*
 		 * Modify the unwritten extent state of the buffer.
 		 */
-		xfs_defer_init(&free_list, &firstfsb);
+		xfs_defer_init(&dfops, &firstfsb);
 		nimaps = 1;
 		error = xfs_bmapi_write(tp, ip, offset_fsb, count_fsb,
 					XFS_BMAPI_CONVERT, &firstfsb, resblks,
-					&imap, &nimaps, &free_list);
+					&imap, &nimaps, &dfops);
 		if (error)
 			goto error_on_bmapi_transaction;
 
@@ -909,7 +909,7 @@ xfs_iomap_write_unwritten(
 			xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
 		}
 
-		error = xfs_defer_finish(&tp, &free_list, NULL);
+		error = xfs_defer_finish(&tp, &dfops, NULL);
 		if (error)
 			goto error_on_bmapi_transaction;
 
@@ -936,7 +936,7 @@ xfs_iomap_write_unwritten(
 	return 0;
 
 error_on_bmapi_transaction:
-	xfs_defer_cancel(&free_list);
+	xfs_defer_cancel(&dfops);
 	xfs_trans_cancel(tp);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	return error;
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index c761a6a..802bcc3 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -770,7 +770,7 @@ xfs_growfs_rt_alloc(
 	xfs_daddr_t		d;		/* disk block address */
 	int			error;		/* error return value */
 	xfs_fsblock_t		firstblock;/* first block allocated in xaction */
-	struct xfs_defer_ops	flist;		/* list of freed blocks */
+	struct xfs_defer_ops	dfops;		/* list of freed blocks */
 	xfs_fsblock_t		fsbno;		/* filesystem block for bno */
 	struct xfs_bmbt_irec	map;		/* block map output */
 	int			nmap;		/* number of block maps */
@@ -795,14 +795,14 @@ xfs_growfs_rt_alloc(
 		xfs_ilock(ip, XFS_ILOCK_EXCL);
 		xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
 
-		xfs_defer_init(&flist, &firstblock);
+		xfs_defer_init(&dfops, &firstblock);
 		/*
 		 * Allocate blocks to the bitmap file.
 		 */
 		nmap = 1;
 		error = xfs_bmapi_write(tp, ip, oblocks, nblocks - oblocks,
 					XFS_BMAPI_METADATA, &firstblock,
-					resblks, &map, &nmap, &flist);
+					resblks, &map, &nmap, &dfops);
 		if (!error && nmap < 1)
 			error = -ENOSPC;
 		if (error)
@@ -810,7 +810,7 @@ xfs_growfs_rt_alloc(
 		/*
 		 * Free any blocks freed up in the transaction, then commit.
 		 */
-		error = xfs_defer_finish(&tp, &flist, NULL);
+		error = xfs_defer_finish(&tp, &dfops, NULL);
 		if (error)
 			goto out_bmap_cancel;
 		error = xfs_trans_commit(tp);
@@ -863,7 +863,7 @@ xfs_growfs_rt_alloc(
 	return 0;
 
 out_bmap_cancel:
-	xfs_defer_cancel(&flist);
+	xfs_defer_cancel(&dfops);
 out_trans_cancel:
 	xfs_trans_cancel(tp);
 	return error;
diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c
index 3b005ec..58142ae 100644
--- a/fs/xfs/xfs_symlink.c
+++ b/fs/xfs/xfs_symlink.c
@@ -173,7 +173,7 @@ xfs_symlink(
 	struct xfs_inode	*ip = NULL;
 	int			error = 0;
 	int			pathlen;
-	struct xfs_defer_ops	free_list;
+	struct xfs_defer_ops	dfops;
 	xfs_fsblock_t		first_block;
 	bool                    unlock_dp_on_error = false;
 	xfs_fileoff_t		first_fsb;
@@ -270,7 +270,7 @@ xfs_symlink(
 	 * Initialize the bmap freelist prior to calling either
 	 * bmapi or the directory create code.
 	 */
-	xfs_defer_init(&free_list, &first_block);
+	xfs_defer_init(&dfops, &first_block);
 
 	/*
 	 * Allocate an inode for the symlink.
@@ -314,7 +314,7 @@ xfs_symlink(
 
 		error = xfs_bmapi_write(tp, ip, first_fsb, fs_blocks,
 				  XFS_BMAPI_METADATA, &first_block, resblks,
-				  mval, &nmaps, &free_list);
+				  mval, &nmaps, &dfops);
 		if (error)
 			goto out_bmap_cancel;
 
@@ -362,7 +362,7 @@ xfs_symlink(
 	 * Create the directory entry for the symlink.
 	 */
 	error = xfs_dir_createname(tp, dp, link_name, ip->i_ino,
-					&first_block, &free_list, resblks);
+					&first_block, &dfops, resblks);
 	if (error)
 		goto out_bmap_cancel;
 	xfs_trans_ichgtime(tp, dp, XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG);
@@ -377,7 +377,7 @@ xfs_symlink(
 		xfs_trans_set_sync(tp);
 	}
 
-	error = xfs_defer_finish(&tp, &free_list, NULL);
+	error = xfs_defer_finish(&tp, &dfops, NULL);
 	if (error)
 		goto out_bmap_cancel;
 
@@ -393,7 +393,7 @@ xfs_symlink(
 	return 0;
 
 out_bmap_cancel:
-	xfs_defer_cancel(&free_list);
+	xfs_defer_cancel(&dfops);
 out_trans_cancel:
 	xfs_trans_cancel(tp);
 out_release_inode:
@@ -427,7 +427,7 @@ xfs_inactive_symlink_rmt(
 	int		done;
 	int		error;
 	xfs_fsblock_t	first_block;
-	struct xfs_defer_ops	free_list;
+	struct xfs_defer_ops	dfops;
 	int		i;
 	xfs_mount_t	*mp;
 	xfs_bmbt_irec_t	mval[XFS_SYMLINK_MAPS];
@@ -466,7 +466,7 @@ xfs_inactive_symlink_rmt(
 	 * Find the block(s) so we can inval and unmap them.
 	 */
 	done = 0;
-	xfs_defer_init(&free_list, &first_block);
+	xfs_defer_init(&dfops, &first_block);
 	nmaps = ARRAY_SIZE(mval);
 	error = xfs_bmapi_read(ip, 0, xfs_symlink_blocks(mp, size),
 				mval, &nmaps, 0);
@@ -486,17 +486,17 @@ xfs_inactive_symlink_rmt(
 		xfs_trans_binval(tp, bp);
 	}
 	/*
-	 * Unmap the dead block(s) to the free_list.
+	 * Unmap the dead block(s) to the dfops.
 	 */
 	error = xfs_bunmapi(tp, ip, 0, size, 0, nmaps,
-			    &first_block, &free_list, &done);
+			    &first_block, &dfops, &done);
 	if (error)
 		goto error_bmap_cancel;
 	ASSERT(done);
 	/*
 	 * Commit the first transaction.  This logs the EFI and the inode.
 	 */
-	error = xfs_defer_finish(&tp, &free_list, ip);
+	error = xfs_defer_finish(&tp, &dfops, ip);
 	if (error)
 		goto error_bmap_cancel;
 	/*
@@ -526,7 +526,7 @@ xfs_inactive_symlink_rmt(
 	return 0;
 
 error_bmap_cancel:
-	xfs_defer_cancel(&free_list);
+	xfs_defer_cancel(&dfops);
 error_trans_cancel:
 	xfs_trans_cancel(tp);
 error_unlock:



* [PATCH 022/119] xfs: add tracepoints and error injection for deferred extent freeing
From: Darrick J. Wong @ 2016-06-17  1:20 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Add a couple of tracepoints for the deferred extent free operation and
a site for injecting errors while finishing the operation.  This makes
it easier to debug deferred ops and test log redo.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_alloc.c |    7 +++++++
 fs/xfs/libxfs/xfs_bmap.c  |    2 ++
 fs/xfs/xfs_error.h        |    4 +++-
 fs/xfs/xfs_trace.h        |    5 ++++-
 4 files changed, 16 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index c06b463..56c8690 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2704,6 +2704,13 @@ xfs_free_extent(
 
 	ASSERT(len != 0);
 
+	trace_xfs_bmap_free_deferred(mp, agno, 0, agbno, len);
+
+	if (XFS_TEST_ERROR(false, mp,
+			XFS_ERRTAG_FREE_EXTENT,
+			XFS_RANDOM_FREE_EXTENT))
+		return -EIO;
+
 	error = xfs_free_extent_fix_freelist(tp, agno, &agbp);
 	if (error)
 		return error;
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 85061a0..3a6d3e3 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -596,6 +596,8 @@ xfs_bmap_add_free(
 	new = kmem_zone_alloc(xfs_bmap_free_item_zone, KM_SLEEP);
 	new->xbfi_startblock = bno;
 	new->xbfi_blockcount = (xfs_extlen_t)len;
+	trace_xfs_bmap_free_defer(mp, XFS_FSB_TO_AGNO(mp, bno), 0,
+			XFS_FSB_TO_AGBNO(mp, bno), len);
 	xfs_defer_add(dfops, XFS_DEFER_OPS_TYPE_FREE, &new->xbfi_list);
 }
 
diff --git a/fs/xfs/xfs_error.h b/fs/xfs/xfs_error.h
index 4ed3042..ee4680e 100644
--- a/fs/xfs/xfs_error.h
+++ b/fs/xfs/xfs_error.h
@@ -90,7 +90,8 @@ extern void xfs_verifier_error(struct xfs_buf *bp);
 #define XFS_ERRTAG_STRATCMPL_IOERR			19
 #define XFS_ERRTAG_DIOWRITE_IOERR			20
 #define XFS_ERRTAG_BMAPIFORMAT				21
-#define XFS_ERRTAG_MAX					22
+#define XFS_ERRTAG_FREE_EXTENT				22
+#define XFS_ERRTAG_MAX					23
 
 /*
  * Random factors for above tags, 1 means always, 2 means 1/2 time, etc.
@@ -117,6 +118,7 @@ extern void xfs_verifier_error(struct xfs_buf *bp);
 #define XFS_RANDOM_STRATCMPL_IOERR			(XFS_RANDOM_DEFAULT/10)
 #define XFS_RANDOM_DIOWRITE_IOERR			(XFS_RANDOM_DEFAULT/10)
 #define	XFS_RANDOM_BMAPIFORMAT				XFS_RANDOM_DEFAULT
+#define XFS_RANDOM_FREE_EXTENT				1
 
 #ifdef DEBUG
 extern int xfs_error_test_active;
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 5923014..777a89c 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -2415,9 +2415,12 @@ DEFINE_DEFER_PENDING_EVENT(xfs_defer_pending_cancel);
 DEFINE_DEFER_PENDING_EVENT(xfs_defer_pending_finish);
 DEFINE_DEFER_PENDING_EVENT(xfs_defer_pending_abort);
 
-DEFINE_PHYS_EXTENT_DEFERRED_EVENT(xfs_defer_phys_extent);
 DEFINE_MAP_EXTENT_DEFERRED_EVENT(xfs_defer_map_extent);
 
+#define DEFINE_BMAP_FREE_DEFERRED_EVENT DEFINE_PHYS_EXTENT_DEFERRED_EVENT
+DEFINE_BMAP_FREE_DEFERRED_EVENT(xfs_bmap_free_defer);
+DEFINE_BMAP_FREE_DEFERRED_EVENT(xfs_bmap_free_deferred);
+
 #endif /* _TRACE_XFS_H */
 
 #undef TRACE_INCLUDE_PATH



* [PATCH 023/119] xfs: introduce rmap btree definitions
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (21 preceding siblings ...)
  2016-06-17  1:20 ` [PATCH 022/119] xfs: add tracepoints and error injection for deferred extent freeing Darrick J. Wong
@ 2016-06-17  1:20 ` Darrick J. Wong
  2016-06-30 17:32   ` Brian Foster
  2016-06-17  1:20 ` [PATCH 024/119] xfs: add rmap btree stats infrastructure Darrick J. Wong
                   ` (95 subsequent siblings)
  118 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:20 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs, Dave Chinner

From: Dave Chinner <dchinner@redhat.com>

Add new per-ag rmap btree definitions to the per-ag structures. The
rmap btree will sit in the empty slots on disk after the free space
btrees, and hence form a part of the array of space management
btrees. This requires the definition of the btree to be contiguous
with the free space btrees.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
---
 fs/xfs/libxfs/xfs_alloc.c  |    6 ++++++
 fs/xfs/libxfs/xfs_btree.c  |    4 ++--
 fs/xfs/libxfs/xfs_btree.h  |    3 +++
 fs/xfs/libxfs/xfs_format.h |   22 +++++++++++++++++-----
 fs/xfs/libxfs/xfs_types.h  |    4 ++--
 5 files changed, 30 insertions(+), 9 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 56c8690..b61e9c6 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2272,6 +2272,10 @@ xfs_agf_verify(
 	    be32_to_cpu(agf->agf_levels[XFS_BTNUM_CNT]) > XFS_BTREE_MAXLEVELS)
 		return false;
 
+	if (xfs_sb_version_hasrmapbt(&mp->m_sb) &&
+	    be32_to_cpu(agf->agf_levels[XFS_BTNUM_RMAP]) > XFS_BTREE_MAXLEVELS)
+		return false;
+
 	/*
 	 * during growfs operations, the perag is not fully initialised,
 	 * so we can't use it for any useful checking. growfs ensures we can't
@@ -2403,6 +2407,8 @@ xfs_alloc_read_agf(
 			be32_to_cpu(agf->agf_levels[XFS_BTNUM_BNOi]);
 		pag->pagf_levels[XFS_BTNUM_CNTi] =
 			be32_to_cpu(agf->agf_levels[XFS_BTNUM_CNTi]);
+		pag->pagf_levels[XFS_BTNUM_RMAPi] =
+			be32_to_cpu(agf->agf_levels[XFS_BTNUM_RMAPi]);
 		spin_lock_init(&pag->pagb_lock);
 		pag->pagb_count = 0;
 #ifdef __KERNEL__
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 5b3743a..624b572 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -44,9 +44,9 @@ kmem_zone_t	*xfs_btree_cur_zone;
  * Btree magic numbers.
  */
 static const __uint32_t xfs_magics[2][XFS_BTNUM_MAX] = {
-	{ XFS_ABTB_MAGIC, XFS_ABTC_MAGIC, XFS_BMAP_MAGIC, XFS_IBT_MAGIC,
+	{ XFS_ABTB_MAGIC, XFS_ABTC_MAGIC, 0, XFS_BMAP_MAGIC, XFS_IBT_MAGIC,
 	  XFS_FIBT_MAGIC },
-	{ XFS_ABTB_CRC_MAGIC, XFS_ABTC_CRC_MAGIC,
+	{ XFS_ABTB_CRC_MAGIC, XFS_ABTC_CRC_MAGIC, XFS_RMAP_CRC_MAGIC,
 	  XFS_BMAP_CRC_MAGIC, XFS_IBT_CRC_MAGIC, XFS_FIBT_CRC_MAGIC }
 };
 #define xfs_btree_magic(cur) \
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index 7483cac..202fdd3 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -63,6 +63,7 @@ union xfs_btree_rec {
 #define	XFS_BTNUM_BMAP	((xfs_btnum_t)XFS_BTNUM_BMAPi)
 #define	XFS_BTNUM_INO	((xfs_btnum_t)XFS_BTNUM_INOi)
 #define	XFS_BTNUM_FINO	((xfs_btnum_t)XFS_BTNUM_FINOi)
+#define	XFS_BTNUM_RMAP	((xfs_btnum_t)XFS_BTNUM_RMAPi)
 
 /*
  * For logging record fields.
@@ -95,6 +96,7 @@ do {    \
 	case XFS_BTNUM_BMAP: __XFS_BTREE_STATS_INC(__mp, bmbt, stat); break; \
 	case XFS_BTNUM_INO: __XFS_BTREE_STATS_INC(__mp, ibt, stat); break; \
 	case XFS_BTNUM_FINO: __XFS_BTREE_STATS_INC(__mp, fibt, stat); break; \
+	case XFS_BTNUM_RMAP: break;	\
 	case XFS_BTNUM_MAX: ASSERT(0); __mp = __mp /* fucking gcc */ ; break; \
 	}       \
 } while (0)
@@ -115,6 +117,7 @@ do {    \
 		__XFS_BTREE_STATS_ADD(__mp, ibt, stat, val); break; \
 	case XFS_BTNUM_FINO:	\
 		__XFS_BTREE_STATS_ADD(__mp, fibt, stat, val); break; \
+	case XFS_BTNUM_RMAP: break; \
 	case XFS_BTNUM_MAX: ASSERT(0); __mp = __mp /* fucking gcc */ ; break; \
 	}       \
 } while (0)
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index ba528b3..8ca4a3d 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -455,6 +455,7 @@ xfs_sb_has_compat_feature(
 }
 
 #define XFS_SB_FEAT_RO_COMPAT_FINOBT   (1 << 0)		/* free inode btree */
+#define XFS_SB_FEAT_RO_COMPAT_RMAPBT   (1 << 1)		/* reverse map btree */
 #define XFS_SB_FEAT_RO_COMPAT_ALL \
 		(XFS_SB_FEAT_RO_COMPAT_FINOBT)
 #define XFS_SB_FEAT_RO_COMPAT_UNKNOWN	~XFS_SB_FEAT_RO_COMPAT_ALL
@@ -538,6 +539,12 @@ static inline bool xfs_sb_version_hasmetauuid(struct xfs_sb *sbp)
 		(sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_META_UUID);
 }
 
+static inline bool xfs_sb_version_hasrmapbt(struct xfs_sb *sbp)
+{
+	return (XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5) &&
+		(sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_RMAPBT);
+}
+
 /*
  * end of superblock version macros
  */
@@ -598,10 +605,10 @@ xfs_is_quota_inode(struct xfs_sb *sbp, xfs_ino_t ino)
 #define	XFS_AGI_GOOD_VERSION(v)	((v) == XFS_AGI_VERSION)
 
 /*
- * Btree number 0 is bno, 1 is cnt.  This value gives the size of the
+ * Btree number 0 is bno, 1 is cnt, 2 is rmap. This value gives the size of the
  * arrays below.
  */
-#define	XFS_BTNUM_AGF	((int)XFS_BTNUM_CNTi + 1)
+#define	XFS_BTNUM_AGF	((int)XFS_BTNUM_RMAPi + 1)
 
 /*
  * The second word of agf_levels in the first a.g. overlaps the EFS
@@ -618,12 +625,10 @@ typedef struct xfs_agf {
 	__be32		agf_seqno;	/* sequence # starting from 0 */
 	__be32		agf_length;	/* size in blocks of a.g. */
 	/*
-	 * Freespace information
+	 * Freespace and rmap information
 	 */
 	__be32		agf_roots[XFS_BTNUM_AGF];	/* root blocks */
-	__be32		agf_spare0;	/* spare field */
 	__be32		agf_levels[XFS_BTNUM_AGF];	/* btree levels */
-	__be32		agf_spare1;	/* spare field */
 
 	__be32		agf_flfirst;	/* first freelist block's index */
 	__be32		agf_fllast;	/* last freelist block's index */
@@ -1307,6 +1312,13 @@ typedef __be32 xfs_inobt_ptr_t;
 #define	XFS_FIBT_BLOCK(mp)		((xfs_agblock_t)(XFS_IBT_BLOCK(mp) + 1))
 
 /*
+ * Reverse mapping btree format definitions
+ *
+ * There is a btree for the reverse map per allocation group
+ */
+#define	XFS_RMAP_CRC_MAGIC	0x524d4233	/* 'RMB3' */
+
+/*
  * The first data block of an AG depends on whether the filesystem was formatted
  * with the finobt feature. If so, account for the finobt reserved root btree
  * block.
diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
index f0d145a..da87796 100644
--- a/fs/xfs/libxfs/xfs_types.h
+++ b/fs/xfs/libxfs/xfs_types.h
@@ -111,8 +111,8 @@ typedef enum {
 } xfs_lookup_t;
 
 typedef enum {
-	XFS_BTNUM_BNOi, XFS_BTNUM_CNTi, XFS_BTNUM_BMAPi, XFS_BTNUM_INOi,
-	XFS_BTNUM_FINOi, XFS_BTNUM_MAX
+	XFS_BTNUM_BNOi, XFS_BTNUM_CNTi, XFS_BTNUM_RMAPi, XFS_BTNUM_BMAPi,
+	XFS_BTNUM_INOi, XFS_BTNUM_FINOi, XFS_BTNUM_MAX
 } xfs_btnum_t;
 
 struct xfs_name {



* [PATCH 024/119] xfs: add rmap btree stats infrastructure
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (22 preceding siblings ...)
  2016-06-17  1:20 ` [PATCH 023/119] xfs: introduce rmap btree definitions Darrick J. Wong
@ 2016-06-17  1:20 ` Darrick J. Wong
  2016-06-30 17:32   ` Brian Foster
  2016-06-17  1:20 ` [PATCH 025/119] xfs: rmap btree add more reserved blocks Darrick J. Wong
                   ` (94 subsequent siblings)
  118 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:20 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs, Dave Chinner

From: Dave Chinner <dchinner@redhat.com>

The rmap btree will require the same stats as all the other generic
btrees, so add all the code for that now.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
---
 fs/xfs/libxfs/xfs_btree.h |    5 +++--
 fs/xfs/xfs_stats.c        |    1 +
 fs/xfs/xfs_stats.h        |   18 +++++++++++++++++-
 3 files changed, 21 insertions(+), 3 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index 202fdd3..a29067c 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -96,7 +96,7 @@ do {    \
 	case XFS_BTNUM_BMAP: __XFS_BTREE_STATS_INC(__mp, bmbt, stat); break; \
 	case XFS_BTNUM_INO: __XFS_BTREE_STATS_INC(__mp, ibt, stat); break; \
 	case XFS_BTNUM_FINO: __XFS_BTREE_STATS_INC(__mp, fibt, stat); break; \
-	case XFS_BTNUM_RMAP: break;	\
+	case XFS_BTNUM_RMAP: __XFS_BTREE_STATS_INC(__mp, rmap, stat); break; \
 	case XFS_BTNUM_MAX: ASSERT(0); __mp = __mp /* fucking gcc */ ; break; \
 	}       \
 } while (0)
@@ -117,7 +117,8 @@ do {    \
 		__XFS_BTREE_STATS_ADD(__mp, ibt, stat, val); break; \
 	case XFS_BTNUM_FINO:	\
 		__XFS_BTREE_STATS_ADD(__mp, fibt, stat, val); break; \
-	case XFS_BTNUM_RMAP: break; \
+	case XFS_BTNUM_RMAP:	\
+		__XFS_BTREE_STATS_ADD(__mp, rmap, stat, val); break; \
 	case XFS_BTNUM_MAX: ASSERT(0); __mp = __mp /* fucking gcc */ ; break; \
 	}       \
 } while (0)
diff --git a/fs/xfs/xfs_stats.c b/fs/xfs/xfs_stats.c
index 8686df6..f04f547 100644
--- a/fs/xfs/xfs_stats.c
+++ b/fs/xfs/xfs_stats.c
@@ -61,6 +61,7 @@ int xfs_stats_format(struct xfsstats __percpu *stats, char *buf)
 		{ "bmbt2",		XFSSTAT_END_BMBT_V2		},
 		{ "ibt2",		XFSSTAT_END_IBT_V2		},
 		{ "fibt2",		XFSSTAT_END_FIBT_V2		},
+		{ "rmapbt",		XFSSTAT_END_RMAP_V2		},
 		/* we print both series of quota information together */
 		{ "qm",			XFSSTAT_END_QM			},
 	};
diff --git a/fs/xfs/xfs_stats.h b/fs/xfs/xfs_stats.h
index 483b0ef..657865f 100644
--- a/fs/xfs/xfs_stats.h
+++ b/fs/xfs/xfs_stats.h
@@ -197,7 +197,23 @@ struct xfsstats {
 	__uint32_t		xs_fibt_2_alloc;
 	__uint32_t		xs_fibt_2_free;
 	__uint32_t		xs_fibt_2_moves;
-#define XFSSTAT_END_XQMSTAT		(XFSSTAT_END_FIBT_V2+6)
+#define XFSSTAT_END_RMAP_V2		(XFSSTAT_END_FIBT_V2+15)
+	__uint32_t		xs_rmap_2_lookup;
+	__uint32_t		xs_rmap_2_compare;
+	__uint32_t		xs_rmap_2_insrec;
+	__uint32_t		xs_rmap_2_delrec;
+	__uint32_t		xs_rmap_2_newroot;
+	__uint32_t		xs_rmap_2_killroot;
+	__uint32_t		xs_rmap_2_increment;
+	__uint32_t		xs_rmap_2_decrement;
+	__uint32_t		xs_rmap_2_lshift;
+	__uint32_t		xs_rmap_2_rshift;
+	__uint32_t		xs_rmap_2_split;
+	__uint32_t		xs_rmap_2_join;
+	__uint32_t		xs_rmap_2_alloc;
+	__uint32_t		xs_rmap_2_free;
+	__uint32_t		xs_rmap_2_moves;
+#define XFSSTAT_END_XQMSTAT		(XFSSTAT_END_RMAP_V2+6)
 	__uint32_t		xs_qm_dqreclaims;
 	__uint32_t		xs_qm_dqreclaim_misses;
 	__uint32_t		xs_qm_dquot_dups;



* [PATCH 025/119] xfs: rmap btree add more reserved blocks
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (23 preceding siblings ...)
  2016-06-17  1:20 ` [PATCH 024/119] xfs: add rmap btree stats infrastructure Darrick J. Wong
@ 2016-06-17  1:20 ` Darrick J. Wong
  2016-06-30 17:32   ` Brian Foster
  2016-06-17  1:20 ` [PATCH 026/119] xfs: add owner field to extent allocation and freeing Darrick J. Wong
                   ` (93 subsequent siblings)
  118 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:20 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs, Dave Chinner

From: Dave Chinner <dchinner@redhat.com>

XFS reserves a small amount of space in each AG for the minimum
number of free blocks needed for operation. Adding the rmap btree
increases the number of reserved blocks, but it also increases the
complexity of the calculation as the free inode btree is optional
(like the rmapbt).

Rather than calculate the prealloc blocks every time we need to
check it, add a function to calculate it at mount time and store it
in the struct xfs_mount, and convert the XFS_PREALLOC_BLOCKS macro
to simply use the xfs_mount variable directly.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
---
 fs/xfs/libxfs/xfs_alloc.c  |   11 +++++++++++
 fs/xfs/libxfs/xfs_alloc.h  |    2 ++
 fs/xfs/libxfs/xfs_format.h |    9 +--------
 fs/xfs/xfs_fsops.c         |    6 +++---
 fs/xfs/xfs_mount.c         |    2 ++
 fs/xfs/xfs_mount.h         |    1 +
 6 files changed, 20 insertions(+), 11 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index b61e9c6..fb00042 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -50,6 +50,17 @@ STATIC int xfs_alloc_ag_vextent_size(xfs_alloc_arg_t *);
 STATIC int xfs_alloc_ag_vextent_small(xfs_alloc_arg_t *,
 		xfs_btree_cur_t *, xfs_agblock_t *, xfs_extlen_t *, int *);
 
+xfs_extlen_t
+xfs_prealloc_blocks(
+	struct xfs_mount	*mp)
+{
+	if (xfs_sb_version_hasrmapbt(&mp->m_sb))
+		return XFS_RMAP_BLOCK(mp) + 1;
+	if (xfs_sb_version_hasfinobt(&mp->m_sb))
+		return XFS_FIBT_BLOCK(mp) + 1;
+	return XFS_IBT_BLOCK(mp) + 1;
+}
+
 /*
  * Lookup the record equal to [bno, len] in the btree given by cur.
  */
diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index cf268b2..20b54aa 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -232,4 +232,6 @@ int xfs_alloc_fix_freelist(struct xfs_alloc_arg *args, int flags);
 int xfs_free_extent_fix_freelist(struct xfs_trans *tp, xfs_agnumber_t agno,
 		struct xfs_buf **agbp);
 
+xfs_extlen_t xfs_prealloc_blocks(struct xfs_mount *mp);
+
 #endif	/* __XFS_ALLOC_H__ */
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 8ca4a3d..b5b0901 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -1318,18 +1318,11 @@ typedef __be32 xfs_inobt_ptr_t;
  */
 #define	XFS_RMAP_CRC_MAGIC	0x524d4233	/* 'RMB3' */
 
-/*
- * The first data block of an AG depends on whether the filesystem was formatted
- * with the finobt feature. If so, account for the finobt reserved root btree
- * block.
- */
-#define XFS_PREALLOC_BLOCKS(mp) \
+#define	XFS_RMAP_BLOCK(mp) \
 	(xfs_sb_version_hasfinobt(&((mp)->m_sb)) ? \
 	 XFS_FIBT_BLOCK(mp) + 1 : \
 	 XFS_IBT_BLOCK(mp) + 1)
 
-
-
 /*
  * BMAP Btree format definitions
  *
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 064fce1..62162d4 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -243,7 +243,7 @@ xfs_growfs_data_private(
 		agf->agf_flfirst = cpu_to_be32(1);
 		agf->agf_fllast = 0;
 		agf->agf_flcount = 0;
-		tmpsize = agsize - XFS_PREALLOC_BLOCKS(mp);
+		tmpsize = agsize - mp->m_ag_prealloc_blocks;
 		agf->agf_freeblks = cpu_to_be32(tmpsize);
 		agf->agf_longest = cpu_to_be32(tmpsize);
 		if (xfs_sb_version_hascrc(&mp->m_sb))
@@ -340,7 +340,7 @@ xfs_growfs_data_private(
 						agno, 0);
 
 		arec = XFS_ALLOC_REC_ADDR(mp, XFS_BUF_TO_BLOCK(bp), 1);
-		arec->ar_startblock = cpu_to_be32(XFS_PREALLOC_BLOCKS(mp));
+		arec->ar_startblock = cpu_to_be32(mp->m_ag_prealloc_blocks);
 		arec->ar_blockcount = cpu_to_be32(
 			agsize - be32_to_cpu(arec->ar_startblock));
 
@@ -369,7 +369,7 @@ xfs_growfs_data_private(
 						agno, 0);
 
 		arec = XFS_ALLOC_REC_ADDR(mp, XFS_BUF_TO_BLOCK(bp), 1);
-		arec->ar_startblock = cpu_to_be32(XFS_PREALLOC_BLOCKS(mp));
+		arec->ar_startblock = cpu_to_be32(mp->m_ag_prealloc_blocks);
 		arec->ar_blockcount = cpu_to_be32(
 			agsize - be32_to_cpu(arec->ar_startblock));
 		nfree += be32_to_cpu(arec->ar_blockcount);
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index bf63682..b4153f0 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -231,6 +231,8 @@ xfs_initialize_perag(
 
 	if (maxagi)
 		*maxagi = index;
+
+	mp->m_ag_prealloc_blocks = xfs_prealloc_blocks(mp);
 	return 0;
 
 out_unwind:
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index c1b798c..0537b1f 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -119,6 +119,7 @@ typedef struct xfs_mount {
 	uint			m_ag_maxlevels;	/* XFS_AG_MAXLEVELS */
 	uint			m_bm_maxlevels[2]; /* XFS_BM_MAXLEVELS */
 	uint			m_in_maxlevels;	/* max inobt btree levels. */
+	xfs_extlen_t		m_ag_prealloc_blocks; /* reserved ag blocks */
 	struct radix_tree_root	m_perag_tree;	/* per-ag accounting info */
 	spinlock_t		m_perag_lock;	/* lock for m_perag_tree */
 	struct mutex		m_growlock;	/* growfs mutex */



* [PATCH 026/119] xfs: add owner field to extent allocation and freeing
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (24 preceding siblings ...)
  2016-06-17  1:20 ` [PATCH 025/119] xfs: rmap btree add more reserved blocks Darrick J. Wong
@ 2016-06-17  1:20 ` Darrick J. Wong
  2016-07-06  4:01   ` Dave Chinner
  2016-07-07 15:12   ` Brian Foster
  2016-06-17  1:20 ` [PATCH 027/119] xfs: introduce rmap extent operation stubs Darrick J. Wong
                   ` (92 subsequent siblings)
  118 siblings, 2 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:20 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs, Dave Chinner

For the rmap btree to work, we have to feed the extent owner
information to the allocation and freeing functions. This
information is what will end up in the rmap btree that tracks
allocated extents. While we technically don't need the owner
information when freeing extents, passing it allows us to validate
that the extent we are removing from the rmap btree actually
belonged to the owner we expected it to belong to.

We also define a special set of owner values for internal metadata
that would otherwise have no owner. This allows us to tell the
difference between metadata owned by different per-ag btrees, as
well as static fs metadata (e.g. AG headers) and internal journal
blocks.

There are also a couple of special cases we need to take care of -
during EFI recovery, we don't actually know who the original owner
was, so we need to pass a wildcard to indicate that we aren't
checking the owner for validity. We also need special handling in
growfs, as we "free" the space in the last AG when extending it, but
because it's new space it has no actual owner...

While touching the xfs_bmap_add_free() function, re-order the
parameters to put the struct xfs_mount first.

Extend the owner field to include both the owner type and some sort
of index within the owner.  The index field will be used to support
reverse mappings when reflink is enabled.

This is based upon a patch originally from Dave Chinner. It has been
extended to add more owner information with the intent of helping
recovery operations when things go wrong (e.g. offset of user data
block in a file).

v2: When we're freeing extents from an EFI, we don't have the owner
information available (rmap updates have their own redo items).
xfs_free_extent therefore doesn't need to do an rmap update, but the
log replay code doesn't signal this correctly.  Fix it so that it
does.

[dchinner: de-shout the xfs_rmap_*_owner helpers]
[darrick: minor style fixes suggested by Christoph Hellwig]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
---
 fs/xfs/libxfs/xfs_alloc.c        |   11 +++++-
 fs/xfs/libxfs/xfs_alloc.h        |    4 ++
 fs/xfs/libxfs/xfs_bmap.c         |   17 ++++++++--
 fs/xfs/libxfs/xfs_bmap.h         |    4 ++
 fs/xfs/libxfs/xfs_bmap_btree.c   |    6 +++-
 fs/xfs/libxfs/xfs_format.h       |   65 ++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_ialloc.c       |    7 +++-
 fs/xfs/libxfs/xfs_ialloc_btree.c |    7 ++++
 fs/xfs/xfs_defer_item.c          |    3 +-
 fs/xfs/xfs_fsops.c               |   16 +++++++--
 fs/xfs/xfs_log_recover.c         |    5 ++-
 fs/xfs/xfs_trans.h               |    2 +
 fs/xfs/xfs_trans_extfree.c       |    5 ++-
 13 files changed, 131 insertions(+), 21 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index fb00042..eed26f9 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -1596,6 +1596,7 @@ xfs_free_ag_extent(
 	xfs_agnumber_t	agno,	/* allocation group number */
 	xfs_agblock_t	bno,	/* starting block number */
 	xfs_extlen_t	len,	/* length of extent */
+	struct xfs_owner_info	*oinfo,	/* extent owner */
 	int		isfl)	/* set if is freelist blocks - no sb acctg */
 {
 	xfs_btree_cur_t	*bno_cur;	/* cursor for by-block btree */
@@ -2005,13 +2006,15 @@ xfs_alloc_fix_freelist(
 	 * back on the free list? Maybe we should only do this when space is
 	 * getting low or the AGFL is more than half full?
 	 */
+	xfs_rmap_ag_owner(&targs.oinfo, XFS_RMAP_OWN_AG);
 	while (pag->pagf_flcount > need) {
 		struct xfs_buf	*bp;
 
 		error = xfs_alloc_get_freelist(tp, agbp, &bno, 0);
 		if (error)
 			goto out_agbp_relse;
-		error = xfs_free_ag_extent(tp, agbp, args->agno, bno, 1, 1);
+		error = xfs_free_ag_extent(tp, agbp, args->agno, bno, 1,
+					   &targs.oinfo, 1);
 		if (error)
 			goto out_agbp_relse;
 		bp = xfs_btree_get_bufs(mp, tp, args->agno, bno, 0);
@@ -2021,6 +2024,7 @@ xfs_alloc_fix_freelist(
 	memset(&targs, 0, sizeof(targs));
 	targs.tp = tp;
 	targs.mp = mp;
+	xfs_rmap_ag_owner(&targs.oinfo, XFS_RMAP_OWN_AG);
 	targs.agbp = agbp;
 	targs.agno = args->agno;
 	targs.alignment = targs.minlen = targs.prod = targs.isfl = 1;
@@ -2711,7 +2715,8 @@ int				/* error */
 xfs_free_extent(
 	struct xfs_trans	*tp,	/* transaction pointer */
 	xfs_fsblock_t		bno,	/* starting block number of extent */
-	xfs_extlen_t		len)	/* length of extent */
+	xfs_extlen_t		len,	/* length of extent */
+	struct xfs_owner_info	*oinfo)	/* extent owner */
 {
 	struct xfs_mount	*mp = tp->t_mountp;
 	struct xfs_buf		*agbp;
@@ -2739,7 +2744,7 @@ xfs_free_extent(
 			agbno + len <= be32_to_cpu(XFS_BUF_TO_AGF(agbp)->agf_length),
 			err);
 
-	error = xfs_free_ag_extent(tp, agbp, agno, agbno, len, 0);
+	error = xfs_free_ag_extent(tp, agbp, agno, agbno, len, oinfo, 0);
 	if (error)
 		goto err;
 
diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index 20b54aa..0721a48 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -123,6 +123,7 @@ typedef struct xfs_alloc_arg {
 	char		isfl;		/* set if is freelist blocks - !acctg */
 	char		userdata;	/* mask defining userdata treatment */
 	xfs_fsblock_t	firstblock;	/* io first block allocated */
+	struct xfs_owner_info	oinfo;	/* owner of blocks being allocated */
 } xfs_alloc_arg_t;
 
 /*
@@ -210,7 +211,8 @@ int				/* error */
 xfs_free_extent(
 	struct xfs_trans *tp,	/* transaction pointer */
 	xfs_fsblock_t	bno,	/* starting block number of extent */
-	xfs_extlen_t	len);	/* length of extent */
+	xfs_extlen_t	len,	/* length of extent */
+	struct xfs_owner_info	*oinfo);	/* extent owner */
 
 int				/* error */
 xfs_alloc_lookup_ge(
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 3a6d3e3..2c28f2a 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -574,7 +574,8 @@ xfs_bmap_add_free(
 	struct xfs_mount	*mp,		/* mount point structure */
 	struct xfs_defer_ops	*dfops,		/* list of extents */
 	xfs_fsblock_t		bno,		/* fs block number of extent */
-	xfs_filblks_t		len)		/* length of extent */
+	xfs_filblks_t		len,		/* length of extent */
+	struct xfs_owner_info	*oinfo)		/* extent owner */
 {
 	struct xfs_bmap_free_item	*new;		/* new element */
 #ifdef DEBUG
@@ -593,9 +594,14 @@ xfs_bmap_add_free(
 	ASSERT(agbno + len <= mp->m_sb.sb_agblocks);
 #endif
 	ASSERT(xfs_bmap_free_item_zone != NULL);
+
 	new = kmem_zone_alloc(xfs_bmap_free_item_zone, KM_SLEEP);
 	new->xbfi_startblock = bno;
 	new->xbfi_blockcount = (xfs_extlen_t)len;
+	if (oinfo)
+		memcpy(&new->xbfi_oinfo, oinfo, sizeof(struct xfs_owner_info));
+	else
+		memset(&new->xbfi_oinfo, 0, sizeof(struct xfs_owner_info));
 	trace_xfs_bmap_free_defer(mp, XFS_FSB_TO_AGNO(mp, bno), 0,
 			XFS_FSB_TO_AGBNO(mp, bno), len);
 	xfs_defer_add(dfops, XFS_DEFER_OPS_TYPE_FREE, &new->xbfi_list);
@@ -628,6 +634,7 @@ xfs_bmap_btree_to_extents(
 	xfs_mount_t		*mp;	/* mount point structure */
 	__be64			*pp;	/* ptr to block address */
 	struct xfs_btree_block	*rblock;/* root btree block */
+	struct xfs_owner_info	oinfo;
 
 	mp = ip->i_mount;
 	ifp = XFS_IFORK_PTR(ip, whichfork);
@@ -651,7 +658,8 @@ xfs_bmap_btree_to_extents(
 	cblock = XFS_BUF_TO_BLOCK(cbp);
 	if ((error = xfs_btree_check_block(cur, cblock, 0, cbp)))
 		return error;
-	xfs_bmap_add_free(mp, cur->bc_private.b.dfops, cbno, 1);
+	xfs_rmap_ino_bmbt_owner(&oinfo, ip->i_ino, whichfork);
+	xfs_bmap_add_free(mp, cur->bc_private.b.dfops, cbno, 1, &oinfo);
 	ip->i_d.di_nblocks--;
 	xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT, -1L);
 	xfs_trans_binval(tp, cbp);
@@ -732,6 +740,7 @@ xfs_bmap_extents_to_btree(
 	memset(&args, 0, sizeof(args));
 	args.tp = tp;
 	args.mp = mp;
+	xfs_rmap_ino_bmbt_owner(&args.oinfo, ip->i_ino, whichfork);
 	args.firstblock = *firstblock;
 	if (*firstblock == NULLFSBLOCK) {
 		args.type = XFS_ALLOCTYPE_START_BNO;
@@ -878,6 +887,7 @@ xfs_bmap_local_to_extents(
 	memset(&args, 0, sizeof(args));
 	args.tp = tp;
 	args.mp = ip->i_mount;
+	xfs_rmap_ino_owner(&args.oinfo, ip->i_ino, whichfork, 0);
 	args.firstblock = *firstblock;
 	/*
 	 * Allocate a block.  We know we need only one, since the
@@ -4839,6 +4849,7 @@ xfs_bmap_del_extent(
 		nblks = 0;
 		do_fx = 0;
 	}
+
 	/*
 	 * Set flag value to use in switch statement.
 	 * Left-contig is 2, right-contig is 1.
@@ -5026,7 +5037,7 @@ xfs_bmap_del_extent(
 	 */
 	if (do_fx)
 		xfs_bmap_add_free(mp, dfops, del->br_startblock,
-			del->br_blockcount);
+				  del->br_blockcount, NULL);
 	/*
 	 * Adjust inode # blocks in the file.
 	 */
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 8c5f530..862ea464 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -67,6 +67,7 @@ struct xfs_bmap_free_item
 	xfs_fsblock_t		xbfi_startblock;/* starting fs block number */
 	xfs_extlen_t		xbfi_blockcount;/* number of blocks in extent */
 	struct list_head	xbfi_list;
+	struct xfs_owner_info	xbfi_oinfo;	/* extent owner */
 };
 
 #define	XFS_BMAP_MAX_NMAP	4
@@ -165,7 +166,8 @@ void	xfs_bmap_trace_exlist(struct xfs_inode *ip, xfs_extnum_t cnt,
 int	xfs_bmap_add_attrfork(struct xfs_inode *ip, int size, int rsvd);
 void	xfs_bmap_local_to_extents_empty(struct xfs_inode *ip, int whichfork);
 void	xfs_bmap_add_free(struct xfs_mount *mp, struct xfs_defer_ops *dfops,
-			  xfs_fsblock_t bno, xfs_filblks_t len);
+			  xfs_fsblock_t bno, xfs_filblks_t len,
+			  struct xfs_owner_info *oinfo);
 void	xfs_bmap_compute_maxlevels(struct xfs_mount *mp, int whichfork);
 int	xfs_bmap_first_unused(struct xfs_trans *tp, struct xfs_inode *ip,
 		xfs_extlen_t len, xfs_fileoff_t *unused, int whichfork);
diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
index 18b5361..3e68f9a 100644
--- a/fs/xfs/libxfs/xfs_bmap_btree.c
+++ b/fs/xfs/libxfs/xfs_bmap_btree.c
@@ -447,6 +447,8 @@ xfs_bmbt_alloc_block(
 	args.mp = cur->bc_mp;
 	args.fsbno = cur->bc_private.b.firstblock;
 	args.firstblock = args.fsbno;
+	xfs_rmap_ino_bmbt_owner(&args.oinfo, cur->bc_private.b.ip->i_ino,
+			cur->bc_private.b.whichfork);
 
 	if (args.fsbno == NULLFSBLOCK) {
 		args.fsbno = be64_to_cpu(start->l);
@@ -526,8 +528,10 @@ xfs_bmbt_free_block(
 	struct xfs_inode	*ip = cur->bc_private.b.ip;
 	struct xfs_trans	*tp = cur->bc_tp;
 	xfs_fsblock_t		fsbno = XFS_DADDR_TO_FSB(mp, XFS_BUF_ADDR(bp));
+	struct xfs_owner_info	oinfo;
 
-	xfs_bmap_add_free(mp, cur->bc_private.b.dfops, fsbno, 1);
+	xfs_rmap_ino_bmbt_owner(&oinfo, ip->i_ino, cur->bc_private.b.whichfork);
+	xfs_bmap_add_free(mp, cur->bc_private.b.dfops, fsbno, 1, &oinfo);
 	ip->i_d.di_nblocks--;
 
 	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index b5b0901..97f354f 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -1318,6 +1318,71 @@ typedef __be32 xfs_inobt_ptr_t;
  */
 #define	XFS_RMAP_CRC_MAGIC	0x524d4233	/* 'RMB3' */
 
+/*
+ * Ownership info for an extent.  This is used to create reverse-mapping
+ * entries.
+ */
+#define XFS_OWNER_INFO_ATTR_FORK	(1 << 0)
+#define XFS_OWNER_INFO_BMBT_BLOCK	(1 << 1)
+struct xfs_owner_info {
+	uint64_t		oi_owner;
+	xfs_fileoff_t		oi_offset;
+	unsigned int		oi_flags;
+};
+
+static inline void
+xfs_rmap_ag_owner(
+	struct xfs_owner_info	*oi,
+	uint64_t		owner)
+{
+	oi->oi_owner = owner;
+	oi->oi_offset = 0;
+	oi->oi_flags = 0;
+}
+
+static inline void
+xfs_rmap_ino_bmbt_owner(
+	struct xfs_owner_info	*oi,
+	xfs_ino_t		ino,
+	int			whichfork)
+{
+	oi->oi_owner = ino;
+	oi->oi_offset = 0;
+	oi->oi_flags = XFS_OWNER_INFO_BMBT_BLOCK;
+	if (whichfork == XFS_ATTR_FORK)
+		oi->oi_flags |= XFS_OWNER_INFO_ATTR_FORK;
+}
+
+static inline void
+xfs_rmap_ino_owner(
+	struct xfs_owner_info	*oi,
+	xfs_ino_t		ino,
+	int			whichfork,
+	xfs_fileoff_t		offset)
+{
+	oi->oi_owner = ino;
+	oi->oi_offset = offset;
+	oi->oi_flags = 0;
+	if (whichfork == XFS_ATTR_FORK)
+		oi->oi_flags |= XFS_OWNER_INFO_ATTR_FORK;
+}
+
+/*
+ * Special owner types.
+ *
+ * Seeing as we only support up to 8EB, we have the upper bit of the owner field
+ * to tell us we have a special owner value. We use these for static metadata
+ * allocated at mkfs/growfs time, as well as for freespace management metadata.
+ */
+#define XFS_RMAP_OWN_NULL	(-1ULL)	/* No owner, for growfs */
+#define XFS_RMAP_OWN_UNKNOWN	(-2ULL)	/* Unknown owner, for EFI recovery */
+#define XFS_RMAP_OWN_FS		(-3ULL)	/* static fs metadata */
+#define XFS_RMAP_OWN_LOG	(-4ULL)	/* static fs metadata */
+#define XFS_RMAP_OWN_AG		(-5ULL)	/* AG freespace btree blocks */
+#define XFS_RMAP_OWN_INOBT	(-6ULL)	/* Inode btree blocks */
+#define XFS_RMAP_OWN_INODES	(-7ULL)	/* Inode chunk */
+#define XFS_RMAP_OWN_MIN	(-8ULL) /* guard */
+
 #define	XFS_RMAP_BLOCK(mp) \
 	(xfs_sb_version_hasfinobt(&((mp)->m_sb)) ? \
 	 XFS_FIBT_BLOCK(mp) + 1 : \
diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index dbc3e35..1982561 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -615,6 +615,7 @@ xfs_ialloc_ag_alloc(
 	args.tp = tp;
 	args.mp = tp->t_mountp;
 	args.fsbno = NULLFSBLOCK;
+	xfs_rmap_ag_owner(&args.oinfo, XFS_RMAP_OWN_INODES);
 
 #ifdef DEBUG
 	/* randomly do sparse inode allocations */
@@ -1825,12 +1826,14 @@ xfs_difree_inode_chunk(
 	int		nextbit;
 	xfs_agblock_t	agbno;
 	int		contigblk;
+	struct xfs_owner_info	oinfo;
 	DECLARE_BITMAP(holemask, XFS_INOBT_HOLEMASK_BITS);
+	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_INODES);
 
 	if (!xfs_inobt_issparse(rec->ir_holemask)) {
 		/* not sparse, calculate extent info directly */
 		xfs_bmap_add_free(mp, dfops, XFS_AGB_TO_FSB(mp, agno, sagbno),
-				  mp->m_ialloc_blks);
+				  mp->m_ialloc_blks, &oinfo);
 		return;
 	}
 
@@ -1874,7 +1877,7 @@ xfs_difree_inode_chunk(
 		ASSERT(agbno % mp->m_sb.sb_spino_align == 0);
 		ASSERT(contigblk % mp->m_sb.sb_spino_align == 0);
 		xfs_bmap_add_free(mp, dfops, XFS_AGB_TO_FSB(mp, agno, agbno),
-				  contigblk);
+				  contigblk, &oinfo);
 
 		/* reset range to current bit and carry on... */
 		startidx = endidx = nextbit;
diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
index 88da2ad..f9ea86b 100644
--- a/fs/xfs/libxfs/xfs_ialloc_btree.c
+++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
@@ -96,6 +96,7 @@ xfs_inobt_alloc_block(
 	memset(&args, 0, sizeof(args));
 	args.tp = cur->bc_tp;
 	args.mp = cur->bc_mp;
+	xfs_rmap_ag_owner(&args.oinfo, XFS_RMAP_OWN_INOBT);
 	args.fsbno = XFS_AGB_TO_FSB(args.mp, cur->bc_private.a.agno, sbno);
 	args.minlen = 1;
 	args.maxlen = 1;
@@ -125,8 +126,12 @@ xfs_inobt_free_block(
 	struct xfs_btree_cur	*cur,
 	struct xfs_buf		*bp)
 {
+	struct xfs_owner_info	oinfo;
+
+	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_INOBT);
 	return xfs_free_extent(cur->bc_tp,
-			XFS_DADDR_TO_FSB(cur->bc_mp, XFS_BUF_ADDR(bp)), 1);
+			XFS_DADDR_TO_FSB(cur->bc_mp, XFS_BUF_ADDR(bp)), 1,
+			&oinfo);
 }
 
 STATIC int
diff --git a/fs/xfs/xfs_defer_item.c b/fs/xfs/xfs_defer_item.c
index 127a54e..1c2d556 100644
--- a/fs/xfs/xfs_defer_item.c
+++ b/fs/xfs/xfs_defer_item.c
@@ -99,7 +99,8 @@ xfs_bmap_free_finish_item(
 	free = container_of(item, struct xfs_bmap_free_item, xbfi_list);
 	error = xfs_trans_free_extent(tp, done_item,
 			free->xbfi_startblock,
-			free->xbfi_blockcount);
+			free->xbfi_blockcount,
+			&free->xbfi_oinfo);
 	kmem_free(free);
 	return error;
 }
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 62162d4..d60bb97 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -436,6 +436,8 @@ xfs_growfs_data_private(
 	 * There are new blocks in the old last a.g.
 	 */
 	if (new) {
+		struct xfs_owner_info	oinfo;
+
 		/*
 		 * Change the agi length.
 		 */
@@ -463,14 +465,20 @@ xfs_growfs_data_private(
 		       be32_to_cpu(agi->agi_length));
 
 		xfs_alloc_log_agf(tp, bp, XFS_AGF_LENGTH);
+
 		/*
 		 * Free the new space.
+		 *
+		 * XFS_RMAP_OWN_NULL is used here to indicate that the newly
+		 * grown space has no owner, so no rmap records are created.
 		 */
-		error = xfs_free_extent(tp, XFS_AGB_TO_FSB(mp, agno,
-			be32_to_cpu(agf->agf_length) - new), new);
-		if (error) {
+		xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_NULL);
+		error = xfs_free_extent(tp,
+				XFS_AGB_TO_FSB(mp, agno,
+					be32_to_cpu(agf->agf_length) - new),
+				new, &oinfo);
+		if (error)
 			goto error0;
-		}
 	}
 
 	/*
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 080b54b..0c41bd2 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -4180,6 +4180,7 @@ xlog_recover_process_efi(
 	int			error = 0;
 	xfs_extent_t		*extp;
 	xfs_fsblock_t		startblock_fsb;
+	struct xfs_owner_info	oinfo;
 
 	ASSERT(!test_bit(XFS_EFI_RECOVERED, &efip->efi_flags));
 
@@ -4211,10 +4212,12 @@ xlog_recover_process_efi(
 		return error;
 	efdp = xfs_trans_get_efd(tp, efip, efip->efi_format.efi_nextents);
 
+	oinfo.oi_owner = 0;
 	for (i = 0; i < efip->efi_format.efi_nextents; i++) {
 		extp = &(efip->efi_format.efi_extents[i]);
 		error = xfs_trans_free_extent(tp, efdp, extp->ext_start,
-					      extp->ext_len);
+					      extp->ext_len,
+					      &oinfo);
 		if (error)
 			goto abort_error;
 
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 9a462e8..f8d363f 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -219,7 +219,7 @@ struct xfs_efd_log_item	*xfs_trans_get_efd(xfs_trans_t *,
 				  uint);
 int		xfs_trans_free_extent(struct xfs_trans *,
 				      struct xfs_efd_log_item *, xfs_fsblock_t,
-				      xfs_extlen_t);
+				      xfs_extlen_t, struct xfs_owner_info *);
 int		xfs_trans_commit(struct xfs_trans *);
 int		__xfs_trans_roll(struct xfs_trans **, struct xfs_inode *, int *);
 int		xfs_trans_roll(struct xfs_trans **, struct xfs_inode *);
diff --git a/fs/xfs/xfs_trans_extfree.c b/fs/xfs/xfs_trans_extfree.c
index a96ae54..d1b8833 100644
--- a/fs/xfs/xfs_trans_extfree.c
+++ b/fs/xfs/xfs_trans_extfree.c
@@ -118,13 +118,14 @@ xfs_trans_free_extent(
 	struct xfs_trans	*tp,
 	struct xfs_efd_log_item	*efdp,
 	xfs_fsblock_t		start_block,
-	xfs_extlen_t		ext_len)
+	xfs_extlen_t		ext_len,
+	struct xfs_owner_info	*oinfo)
 {
 	uint			next_extent;
 	struct xfs_extent	*extp;
 	int			error;
 
-	error = xfs_free_extent(tp, start_block, ext_len);
+	error = xfs_free_extent(tp, start_block, ext_len, oinfo);
 
 	/*
 	 * Mark the transaction dirty, even on error. This ensures the



* [PATCH 027/119] xfs: introduce rmap extent operation stubs
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
  2016-06-17  1:20 ` [PATCH 026/119] xfs: add owner field to extent allocation and freeing Darrick J. Wong
@ 2016-06-17  1:20 ` Darrick J. Wong
  2016-06-17  1:20 ` [PATCH 028/119] xfs: define the on-disk rmap btree format Darrick J. Wong
From: Darrick J. Wong @ 2016-06-17  1:20 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs, Dave Chinner

From: Dave Chinner <dchinner@redhat.com>

Add the stubs into the extent allocation and freeing paths that the
rmap btree implementation will hook into. While doing this, add the
trace points that will be used to track rmap btree extent
manipulations.

[darrick.wong@oracle.com: Extend the stubs to take full owner info.]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
---
 fs/xfs/Makefile                |    1 
 fs/xfs/libxfs/xfs_alloc.c      |   18 ++++++++
 fs/xfs/libxfs/xfs_rmap.c       |   90 ++++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_rmap_btree.h |   30 +++++++++++++
 fs/xfs/xfs_trace.h             |   47 +++++++++++++++++++++
 5 files changed, 185 insertions(+), 1 deletion(-)
 create mode 100644 fs/xfs/libxfs/xfs_rmap.c
 create mode 100644 fs/xfs/libxfs/xfs_rmap_btree.h


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index ad46a2d..06dd760 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -52,6 +52,7 @@ xfs-y				+= $(addprefix libxfs/, \
 				   xfs_inode_fork.o \
 				   xfs_inode_buf.o \
 				   xfs_log_rlimit.o \
+				   xfs_rmap.o \
 				   xfs_sb.o \
 				   xfs_symlink_remote.o \
 				   xfs_trans_resv.o \
diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index eed26f9..570ca17 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -27,6 +27,7 @@
 #include "xfs_defer.h"
 #include "xfs_inode.h"
 #include "xfs_btree.h"
+#include "xfs_rmap_btree.h"
 #include "xfs_alloc_btree.h"
 #include "xfs_alloc.h"
 #include "xfs_extent_busy.h"
@@ -648,6 +649,14 @@ xfs_alloc_ag_vextent(
 	ASSERT(!args->wasfromfl || !args->isfl);
 	ASSERT(args->agbno % args->alignment == 0);
 
+	/* if not file data, insert new block into the reverse map btree */
+	if (args->oinfo.oi_owner) {
+		error = xfs_rmap_alloc(args->tp, args->agbp, args->agno,
+				       args->agbno, args->len, &args->oinfo);
+		if (error)
+			return error;
+	}
+
 	if (!args->wasfromfl) {
 		error = xfs_alloc_update_counters(args->tp, args->pag,
 						  args->agbp,
@@ -1614,12 +1623,19 @@ xfs_free_ag_extent(
 	xfs_extlen_t	nlen;		/* new length of freespace */
 	xfs_perag_t	*pag;		/* per allocation group data */
 
+	bno_cur = cnt_cur = NULL;
 	mp = tp->t_mountp;
+
+	if (oinfo->oi_owner) {
+		error = xfs_rmap_free(tp, agbp, agno, bno, len, oinfo);
+		if (error)
+			goto error0;
+	}
+
 	/*
 	 * Allocate and initialize a cursor for the by-block btree.
 	 */
 	bno_cur = xfs_allocbt_init_cursor(mp, tp, agbp, agno, XFS_BTNUM_BNO);
-	cnt_cur = NULL;
 	/*
 	 * Look for a neighboring block on the left (lower block numbers)
 	 * that is contiguous with this space.
diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
new file mode 100644
index 0000000..d1fd471
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_rmap.c
@@ -0,0 +1,90 @@
+
+/*
+ * Copyright (c) 2014 Red Hat, Inc.
+ * All Rights Reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_bit.h"
+#include "xfs_sb.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_da_format.h"
+#include "xfs_da_btree.h"
+#include "xfs_btree.h"
+#include "xfs_trans.h"
+#include "xfs_alloc.h"
+#include "xfs_rmap_btree.h"
+#include "xfs_trans_space.h"
+#include "xfs_trace.h"
+#include "xfs_error.h"
+#include "xfs_extent_busy.h"
+
+int
+xfs_rmap_free(
+	struct xfs_trans	*tp,
+	struct xfs_buf		*agbp,
+	xfs_agnumber_t		agno,
+	xfs_agblock_t		bno,
+	xfs_extlen_t		len,
+	struct xfs_owner_info	*oinfo)
+{
+	struct xfs_mount	*mp = tp->t_mountp;
+	int			error = 0;
+
+	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
+		return 0;
+
+	trace_xfs_rmap_free_extent(mp, agno, bno, len, false, oinfo);
+	if (1)
+		goto out_error;
+	trace_xfs_rmap_free_extent_done(mp, agno, bno, len, false, oinfo);
+	return 0;
+
+out_error:
+	trace_xfs_rmap_free_extent_error(mp, agno, bno, len, false, oinfo);
+	return error;
+}
+
+int
+xfs_rmap_alloc(
+	struct xfs_trans	*tp,
+	struct xfs_buf		*agbp,
+	xfs_agnumber_t		agno,
+	xfs_agblock_t		bno,
+	xfs_extlen_t		len,
+	struct xfs_owner_info	*oinfo)
+{
+	struct xfs_mount	*mp = tp->t_mountp;
+	int			error = 0;
+
+	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
+		return 0;
+
+	trace_xfs_rmap_alloc_extent(mp, agno, bno, len, false, oinfo);
+	if (1)
+		goto out_error;
+	trace_xfs_rmap_alloc_extent_done(mp, agno, bno, len, false, oinfo);
+	return 0;
+
+out_error:
+	trace_xfs_rmap_alloc_extent_error(mp, agno, bno, len, false, oinfo);
+	return error;
+}
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
new file mode 100644
index 0000000..a3b8f90
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_rmap_btree.h
@@ -0,0 +1,30 @@
+/*
+ * Copyright (c) 2014 Red Hat, Inc.
+ * All Rights Reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+#ifndef __XFS_RMAP_BTREE_H__
+#define	__XFS_RMAP_BTREE_H__
+
+struct xfs_buf;
+
+int xfs_rmap_alloc(struct xfs_trans *tp, struct xfs_buf *agbp,
+		   xfs_agnumber_t agno, xfs_agblock_t bno, xfs_extlen_t len,
+		   struct xfs_owner_info *oinfo);
+int xfs_rmap_free(struct xfs_trans *tp, struct xfs_buf *agbp,
+		  xfs_agnumber_t agno, xfs_agblock_t bno, xfs_extlen_t len,
+		  struct xfs_owner_info *oinfo);
+
+#endif	/* __XFS_RMAP_BTREE_H__ */
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 777a89c..4872fbd 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -2421,6 +2421,53 @@ DEFINE_MAP_EXTENT_DEFERRED_EVENT(xfs_defer_map_extent);
 DEFINE_BMAP_FREE_DEFERRED_EVENT(xfs_bmap_free_defer);
 DEFINE_BMAP_FREE_DEFERRED_EVENT(xfs_bmap_free_deferred);
 
+/* rmap tracepoints */
+DECLARE_EVENT_CLASS(xfs_rmap_class,
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
+		 xfs_agblock_t agbno, xfs_extlen_t len, bool unwritten,
+		 struct xfs_owner_info *oinfo),
+	TP_ARGS(mp, agno, agbno, len, unwritten, oinfo),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_agblock_t, agbno)
+		__field(xfs_extlen_t, len)
+		__field(uint64_t, owner)
+		__field(uint64_t, offset)
+		__field(unsigned long, flags)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->agno = agno;
+		__entry->agbno = agbno;
+		__entry->len = len;
+		__entry->owner = oinfo->oi_owner;
+		__entry->offset = oinfo->oi_offset;
+		__entry->flags = oinfo->oi_flags;
+	),
+	TP_printk("dev %d:%d agno %u agbno %u len %u owner %lld offset %llu flags 0x%lx",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->agno,
+		  __entry->agbno,
+		  __entry->len,
+		  __entry->owner,
+		  __entry->offset,
+		  __entry->flags)
+);
+#define DEFINE_RMAP_EVENT(name) \
+DEFINE_EVENT(xfs_rmap_class, name, \
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, \
+		 xfs_agblock_t agbno, xfs_extlen_t len, bool unwritten, \
+		 struct xfs_owner_info *oinfo), \
+	TP_ARGS(mp, agno, agbno, len, unwritten, oinfo))
+
+DEFINE_RMAP_EVENT(xfs_rmap_free_extent);
+DEFINE_RMAP_EVENT(xfs_rmap_free_extent_done);
+DEFINE_RMAP_EVENT(xfs_rmap_free_extent_error);
+DEFINE_RMAP_EVENT(xfs_rmap_alloc_extent);
+DEFINE_RMAP_EVENT(xfs_rmap_alloc_extent_done);
+DEFINE_RMAP_EVENT(xfs_rmap_alloc_extent_error);
+
 #endif /* _TRACE_XFS_H */
 
 #undef TRACE_INCLUDE_PATH



* [PATCH 028/119] xfs: define the on-disk rmap btree format
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
  2016-06-17  1:20 ` [PATCH 027/119] xfs: introduce rmap extent operation stubs Darrick J. Wong
@ 2016-06-17  1:20 ` Darrick J. Wong
  2016-07-06  4:05   ` Dave Chinner
  2016-07-07 18:41   ` Brian Foster
  2016-06-17  1:20 ` [PATCH 029/119] xfs: add rmap btree growfs support Darrick J. Wong
From: Darrick J. Wong @ 2016-06-17  1:20 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs, Dave Chinner

From: Dave Chinner <dchinner@redhat.com>

Now that we have all the surrounding call infrastructure in place, we
can start filling out the rmap btree implementation. Start with the
on-disk btree format; add everything needed to read, write, and
manipulate rmap btree blocks. This prepares the way for adding the
btree operations implementation.

[darrick: record owner and offset info in rmap btree]
[darrick: fork, bmbt and unwritten state in rmap btree]
[darrick: flags are a separate field in xfs_rmap_irec]
[darrick: calculate maxlevels separately]
[darrick: move the 'unwritten' bit into unused parts of rm_offset]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
---
 fs/xfs/Makefile                |    1 
 fs/xfs/libxfs/xfs_btree.c      |    3 +
 fs/xfs/libxfs/xfs_btree.h      |   18 ++--
 fs/xfs/libxfs/xfs_format.h     |  140 +++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_rmap_btree.c |  180 ++++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_rmap_btree.h |   32 +++++++
 fs/xfs/libxfs/xfs_sb.c         |    6 +
 fs/xfs/libxfs/xfs_shared.h     |    2 
 fs/xfs/xfs_mount.c             |    2 
 fs/xfs/xfs_mount.h             |    3 +
 fs/xfs/xfs_ondisk.h            |    3 +
 fs/xfs/xfs_trace.h             |    2 
 12 files changed, 384 insertions(+), 8 deletions(-)
 create mode 100644 fs/xfs/libxfs/xfs_rmap_btree.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 06dd760..2de8c20 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -53,6 +53,7 @@ xfs-y				+= $(addprefix libxfs/, \
 				   xfs_inode_buf.o \
 				   xfs_log_rlimit.o \
 				   xfs_rmap.o \
+				   xfs_rmap_btree.o \
 				   xfs_sb.o \
 				   xfs_symlink_remote.o \
 				   xfs_trans_resv.o \
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 624b572..4b90419 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -1210,6 +1210,9 @@ xfs_btree_set_refs(
 	case XFS_BTNUM_BMAP:
 		xfs_buf_set_ref(bp, XFS_BMAP_BTREE_REF);
 		break;
+	case XFS_BTNUM_RMAP:
+		xfs_buf_set_ref(bp, XFS_RMAP_BTREE_REF);
+		break;
 	default:
 		ASSERT(0);
 	}
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index a29067c..90ea2a7 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -38,17 +38,19 @@ union xfs_btree_ptr {
 };
 
 union xfs_btree_key {
-	xfs_bmbt_key_t		bmbt;
-	xfs_bmdr_key_t		bmbr;	/* bmbt root block */
-	xfs_alloc_key_t		alloc;
-	xfs_inobt_key_t		inobt;
+	struct xfs_bmbt_key		bmbt;
+	xfs_bmdr_key_t			bmbr;	/* bmbt root block */
+	xfs_alloc_key_t			alloc;
+	struct xfs_inobt_key		inobt;
+	struct xfs_rmap_key		rmap;
 };
 
 union xfs_btree_rec {
-	xfs_bmbt_rec_t		bmbt;
-	xfs_bmdr_rec_t		bmbr;	/* bmbt root block */
-	xfs_alloc_rec_t		alloc;
-	xfs_inobt_rec_t		inobt;
+	struct xfs_bmbt_rec		bmbt;
+	xfs_bmdr_rec_t			bmbr;	/* bmbt root block */
+	struct xfs_alloc_rec		alloc;
+	struct xfs_inobt_rec		inobt;
+	struct xfs_rmap_rec		rmap;
 };
 
 /*
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 97f354f..6efc7a3 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -1383,11 +1383,151 @@ xfs_rmap_ino_owner(
 #define XFS_RMAP_OWN_INODES	(-7ULL)	/* Inode chunk */
 #define XFS_RMAP_OWN_MIN	(-8ULL) /* guard */
 
+#define XFS_RMAP_NON_INODE_OWNER(owner)	(!!((owner) & (1ULL << 63)))
+
+/*
+ * Data record structure
+ */
+struct xfs_rmap_rec {
+	__be32		rm_startblock;	/* extent start block */
+	__be32		rm_blockcount;	/* extent length */
+	__be64		rm_owner;	/* extent owner */
+	__be64		rm_offset;	/* offset within the owner */
+};
+
+/*
+ * rmap btree record
+ *  rm_offset:63 is the attribute fork flag
+ *  rm_offset:62 is the bmbt block flag
+ *  rm_offset:61 is the unwritten extent flag (same as l0:63 in bmbt)
+ *  rm_offset:54-60 aren't used and should be zero
+ *  rm_offset:0-53 is the block offset within the inode
+ */
+#define XFS_RMAP_OFF_ATTR_FORK	((__uint64_t)1ULL << 63)
+#define XFS_RMAP_OFF_BMBT_BLOCK	((__uint64_t)1ULL << 62)
+#define XFS_RMAP_OFF_UNWRITTEN	((__uint64_t)1ULL << 61)
+
+#define XFS_RMAP_LEN_MAX	((__uint32_t)~0U)
+#define XFS_RMAP_OFF_FLAGS	(XFS_RMAP_OFF_ATTR_FORK | \
+				 XFS_RMAP_OFF_BMBT_BLOCK | \
+				 XFS_RMAP_OFF_UNWRITTEN)
+#define XFS_RMAP_OFF_MASK	((__uint64_t)0x3FFFFFFFFFFFFFULL)
+
+#define XFS_RMAP_OFF(off)		((off) & XFS_RMAP_OFF_MASK)
+
+#define XFS_RMAP_IS_BMBT_BLOCK(off)	(!!((off) & XFS_RMAP_OFF_BMBT_BLOCK))
+#define XFS_RMAP_IS_ATTR_FORK(off)	(!!((off) & XFS_RMAP_OFF_ATTR_FORK))
+#define XFS_RMAP_IS_UNWRITTEN(off)	(!!((off) & XFS_RMAP_OFF_UNWRITTEN))
+
+#define RMAPBT_STARTBLOCK_BITLEN	32
+#define RMAPBT_BLOCKCOUNT_BITLEN	32
+#define RMAPBT_OWNER_BITLEN		64
+#define RMAPBT_ATTRFLAG_BITLEN		1
+#define RMAPBT_BMBTFLAG_BITLEN		1
+#define RMAPBT_EXNTFLAG_BITLEN		1
+#define RMAPBT_UNUSED_OFFSET_BITLEN	7
+#define RMAPBT_OFFSET_BITLEN		54
+
+#define XFS_RMAP_ATTR_FORK		(1 << 0)
+#define XFS_RMAP_BMBT_BLOCK		(1 << 1)
+#define XFS_RMAP_UNWRITTEN		(1 << 2)
+#define XFS_RMAP_KEY_FLAGS		(XFS_RMAP_ATTR_FORK | \
+					 XFS_RMAP_BMBT_BLOCK)
+#define XFS_RMAP_REC_FLAGS		(XFS_RMAP_UNWRITTEN)
+struct xfs_rmap_irec {
+	xfs_agblock_t	rm_startblock;	/* extent start block */
+	xfs_extlen_t	rm_blockcount;	/* extent length */
+	__uint64_t	rm_owner;	/* extent owner */
+	__uint64_t	rm_offset;	/* offset within the owner */
+	unsigned int	rm_flags;	/* state flags */
+};
+
+static inline __u64
+xfs_rmap_irec_offset_pack(
+	const struct xfs_rmap_irec	*irec)
+{
+	__u64			x;
+
+	x = XFS_RMAP_OFF(irec->rm_offset);
+	if (irec->rm_flags & XFS_RMAP_ATTR_FORK)
+		x |= XFS_RMAP_OFF_ATTR_FORK;
+	if (irec->rm_flags & XFS_RMAP_BMBT_BLOCK)
+		x |= XFS_RMAP_OFF_BMBT_BLOCK;
+	if (irec->rm_flags & XFS_RMAP_UNWRITTEN)
+		x |= XFS_RMAP_OFF_UNWRITTEN;
+	return x;
+}
+
+static inline int
+xfs_rmap_irec_offset_unpack(
+	__u64			offset,
+	struct xfs_rmap_irec	*irec)
+{
+	if (offset & ~(XFS_RMAP_OFF_MASK | XFS_RMAP_OFF_FLAGS))
+		return -EFSCORRUPTED;
+	irec->rm_offset = XFS_RMAP_OFF(offset);
+	if (offset & XFS_RMAP_OFF_ATTR_FORK)
+		irec->rm_flags |= XFS_RMAP_ATTR_FORK;
+	if (offset & XFS_RMAP_OFF_BMBT_BLOCK)
+		irec->rm_flags |= XFS_RMAP_BMBT_BLOCK;
+	if (offset & XFS_RMAP_OFF_UNWRITTEN)
+		irec->rm_flags |= XFS_RMAP_UNWRITTEN;
+	return 0;
+}
+
+/*
+ * Key structure
+ *
+ * We don't use the length for lookups
+ */
+struct xfs_rmap_key {
+	__be32		rm_startblock;	/* extent start block */
+	__be64		rm_owner;	/* extent owner */
+	__be64		rm_offset;	/* offset within the owner */
+} __attribute__((packed));
+
+/* btree pointer type */
+typedef __be32 xfs_rmap_ptr_t;
+
 #define	XFS_RMAP_BLOCK(mp) \
 	(xfs_sb_version_hasfinobt(&((mp)->m_sb)) ? \
 	 XFS_FIBT_BLOCK(mp) + 1 : \
 	 XFS_IBT_BLOCK(mp) + 1)
 
+static inline void
+xfs_owner_info_unpack(
+	struct xfs_owner_info	*oinfo,
+	uint64_t		*owner,
+	uint64_t		*offset,
+	unsigned int		*flags)
+{
+	unsigned int		r = 0;
+
+	*owner = oinfo->oi_owner;
+	*offset = oinfo->oi_offset;
+	if (oinfo->oi_flags & XFS_OWNER_INFO_ATTR_FORK)
+		r |= XFS_RMAP_ATTR_FORK;
+	if (oinfo->oi_flags & XFS_OWNER_INFO_BMBT_BLOCK)
+		r |= XFS_RMAP_BMBT_BLOCK;
+	*flags = r;
+}
+
+static inline void
+xfs_owner_info_pack(
+	struct xfs_owner_info	*oinfo,
+	uint64_t		owner,
+	uint64_t		offset,
+	unsigned int		flags)
+{
+	oinfo->oi_owner = owner;
+	oinfo->oi_offset = XFS_RMAP_OFF(offset);
+	oinfo->oi_flags = 0;
+	if (flags & XFS_RMAP_ATTR_FORK)
+		oinfo->oi_flags |= XFS_OWNER_INFO_ATTR_FORK;
+	if (flags & XFS_RMAP_BMBT_BLOCK)
+		oinfo->oi_flags |= XFS_OWNER_INFO_BMBT_BLOCK;
+}
+
 /*
  * BMAP Btree format definitions
  *
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
new file mode 100644
index 0000000..7a35c78
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_rmap_btree.c
@@ -0,0 +1,180 @@
+/*
+ * Copyright (c) 2014 Red Hat, Inc.
+ * All Rights Reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_bit.h"
+#include "xfs_sb.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_inode.h"
+#include "xfs_trans.h"
+#include "xfs_alloc.h"
+#include "xfs_btree.h"
+#include "xfs_rmap_btree.h"
+#include "xfs_trace.h"
+#include "xfs_cksum.h"
+#include "xfs_error.h"
+#include "xfs_extent_busy.h"
+
+static struct xfs_btree_cur *
+xfs_rmapbt_dup_cursor(
+	struct xfs_btree_cur	*cur)
+{
+	return xfs_rmapbt_init_cursor(cur->bc_mp, cur->bc_tp,
+			cur->bc_private.a.agbp, cur->bc_private.a.agno);
+}
+
+static bool
+xfs_rmapbt_verify(
+	struct xfs_buf		*bp)
+{
+	struct xfs_mount	*mp = bp->b_target->bt_mount;
+	struct xfs_btree_block	*block = XFS_BUF_TO_BLOCK(bp);
+	struct xfs_perag	*pag = bp->b_pag;
+	unsigned int		level;
+
+	/*
+	 * magic number and level verification
+	 *
+	 * During growfs operations, we can't verify the exact level or owner as
+	 * the perag is not fully initialised and hence not attached to the
+	 * buffer.  In this case, check against the maximum tree depth.
+	 *
+	 * Similarly, during log recovery we will have a perag structure
+	 * attached, but the agf information will not yet have been initialised
+	 * from the on disk AGF. Again, we can only check against maximum limits
+	 * in this case.
+	 */
+	if (block->bb_magic != cpu_to_be32(XFS_RMAP_CRC_MAGIC))
+		return false;
+
+	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
+		return false;
+	if (!xfs_btree_sblock_v5hdr_verify(bp))
+		return false;
+
+	level = be16_to_cpu(block->bb_level);
+	if (pag && pag->pagf_init) {
+		if (level >= pag->pagf_levels[XFS_BTNUM_RMAPi])
+			return false;
+	} else if (level >= mp->m_rmap_maxlevels)
+		return false;
+
+	return xfs_btree_sblock_verify(bp, mp->m_rmap_mxr[level != 0]);
+}
+
+static void
+xfs_rmapbt_read_verify(
+	struct xfs_buf	*bp)
+{
+	if (!xfs_btree_sblock_verify_crc(bp))
+		xfs_buf_ioerror(bp, -EFSBADCRC);
+	else if (!xfs_rmapbt_verify(bp))
+		xfs_buf_ioerror(bp, -EFSCORRUPTED);
+
+	if (bp->b_error) {
+		trace_xfs_btree_corrupt(bp, _RET_IP_);
+		xfs_verifier_error(bp);
+	}
+}
+
+static void
+xfs_rmapbt_write_verify(
+	struct xfs_buf	*bp)
+{
+	if (!xfs_rmapbt_verify(bp)) {
+		trace_xfs_btree_corrupt(bp, _RET_IP_);
+		xfs_buf_ioerror(bp, -EFSCORRUPTED);
+		xfs_verifier_error(bp);
+		return;
+	}
+	xfs_btree_sblock_calc_crc(bp);
+
+}
+
+const struct xfs_buf_ops xfs_rmapbt_buf_ops = {
+	.name			= "xfs_rmapbt",
+	.verify_read		= xfs_rmapbt_read_verify,
+	.verify_write		= xfs_rmapbt_write_verify,
+};
+
+static const struct xfs_btree_ops xfs_rmapbt_ops = {
+	.rec_len		= sizeof(struct xfs_rmap_rec),
+	.key_len		= sizeof(struct xfs_rmap_key),
+
+	.dup_cursor		= xfs_rmapbt_dup_cursor,
+	.buf_ops		= &xfs_rmapbt_buf_ops,
+};
+
+/*
+ * Allocate a new allocation btree cursor.
+ */
+struct xfs_btree_cur *
+xfs_rmapbt_init_cursor(
+	struct xfs_mount	*mp,
+	struct xfs_trans	*tp,
+	struct xfs_buf		*agbp,
+	xfs_agnumber_t		agno)
+{
+	struct xfs_agf		*agf = XFS_BUF_TO_AGF(agbp);
+	struct xfs_btree_cur	*cur;
+
+	cur = kmem_zone_zalloc(xfs_btree_cur_zone, KM_NOFS);
+	cur->bc_tp = tp;
+	cur->bc_mp = mp;
+	cur->bc_btnum = XFS_BTNUM_RMAP;
+	cur->bc_flags = XFS_BTREE_CRC_BLOCKS;
+	cur->bc_blocklog = mp->m_sb.sb_blocklog;
+	cur->bc_ops = &xfs_rmapbt_ops;
+	cur->bc_nlevels = be32_to_cpu(agf->agf_levels[XFS_BTNUM_RMAP]);
+
+	cur->bc_private.a.agbp = agbp;
+	cur->bc_private.a.agno = agno;
+
+	return cur;
+}
+
+/*
+ * Calculate number of records in an rmap btree block.
+ */
+int
+xfs_rmapbt_maxrecs(
+	struct xfs_mount	*mp,
+	int			blocklen,
+	int			leaf)
+{
+	blocklen -= XFS_RMAP_BLOCK_LEN;
+
+	if (leaf)
+		return blocklen / sizeof(struct xfs_rmap_rec);
+	return blocklen /
+		(sizeof(struct xfs_rmap_key) + sizeof(xfs_rmap_ptr_t));
+}
+
+/* Compute the maximum height of an rmap btree. */
+void
+xfs_rmapbt_compute_maxlevels(
+	struct xfs_mount		*mp)
+{
+	mp->m_rmap_maxlevels = xfs_btree_compute_maxlevels(mp,
+			mp->m_rmap_mnr, mp->m_sb.sb_agblocks);
+}
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
index a3b8f90..462767f 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.h
+++ b/fs/xfs/libxfs/xfs_rmap_btree.h
@@ -19,6 +19,38 @@
 #define	__XFS_RMAP_BTREE_H__
 
 struct xfs_buf;
+struct xfs_btree_cur;
+struct xfs_mount;
+
+/* rmaps only exist on crc enabled filesystems */
+#define XFS_RMAP_BLOCK_LEN	XFS_BTREE_SBLOCK_CRC_LEN
+
+/*
+ * Record, key, and pointer address macros for btree blocks.
+ *
+ * (note that some of these may appear unused, but they are used in userspace)
+ */
+#define XFS_RMAP_REC_ADDR(block, index) \
+	((struct xfs_rmap_rec *) \
+		((char *)(block) + XFS_RMAP_BLOCK_LEN + \
+		 (((index) - 1) * sizeof(struct xfs_rmap_rec))))
+
+#define XFS_RMAP_KEY_ADDR(block, index) \
+	((struct xfs_rmap_key *) \
+		((char *)(block) + XFS_RMAP_BLOCK_LEN + \
+		 ((index) - 1) * sizeof(struct xfs_rmap_key)))
+
+#define XFS_RMAP_PTR_ADDR(block, index, maxrecs) \
+	((xfs_rmap_ptr_t *) \
+		((char *)(block) + XFS_RMAP_BLOCK_LEN + \
+		 (maxrecs) * sizeof(struct xfs_rmap_key) + \
+		 ((index) - 1) * sizeof(xfs_rmap_ptr_t)))
+
+struct xfs_btree_cur *xfs_rmapbt_init_cursor(struct xfs_mount *mp,
+				struct xfs_trans *tp, struct xfs_buf *bp,
+				xfs_agnumber_t agno);
+int xfs_rmapbt_maxrecs(struct xfs_mount *mp, int blocklen, int leaf);
+extern void xfs_rmapbt_compute_maxlevels(struct xfs_mount *mp);
 
 int xfs_rmap_alloc(struct xfs_trans *tp, struct xfs_buf *agbp,
 		   xfs_agnumber_t agno, xfs_agblock_t bno, xfs_extlen_t len,
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index a544686..f86226b 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -37,6 +37,7 @@
 #include "xfs_alloc_btree.h"
 #include "xfs_ialloc_btree.h"
 #include "xfs_log.h"
+#include "xfs_rmap_btree.h"
 
 /*
  * Physical superblock buffer manipulations. Shared with libxfs in userspace.
@@ -734,6 +735,11 @@ xfs_sb_mount_common(
 	mp->m_bmap_dmnr[0] = mp->m_bmap_dmxr[0] / 2;
 	mp->m_bmap_dmnr[1] = mp->m_bmap_dmxr[1] / 2;
 
+	mp->m_rmap_mxr[0] = xfs_rmapbt_maxrecs(mp, sbp->sb_blocksize, 1);
+	mp->m_rmap_mxr[1] = xfs_rmapbt_maxrecs(mp, sbp->sb_blocksize, 0);
+	mp->m_rmap_mnr[0] = mp->m_rmap_mxr[0] / 2;
+	mp->m_rmap_mnr[1] = mp->m_rmap_mxr[1] / 2;
+
 	mp->m_bsize = XFS_FSB_TO_BB(mp, 1);
 	mp->m_ialloc_inos = (int)MAX((__uint16_t)XFS_INODES_PER_CHUNK,
 					sbp->sb_inopblock);
diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
index 16002b5..0c5b30b 100644
--- a/fs/xfs/libxfs/xfs_shared.h
+++ b/fs/xfs/libxfs/xfs_shared.h
@@ -38,6 +38,7 @@ extern const struct xfs_buf_ops xfs_agi_buf_ops;
 extern const struct xfs_buf_ops xfs_agf_buf_ops;
 extern const struct xfs_buf_ops xfs_agfl_buf_ops;
 extern const struct xfs_buf_ops xfs_allocbt_buf_ops;
+extern const struct xfs_buf_ops xfs_rmapbt_buf_ops;
 extern const struct xfs_buf_ops xfs_attr3_leaf_buf_ops;
 extern const struct xfs_buf_ops xfs_attr3_rmt_buf_ops;
 extern const struct xfs_buf_ops xfs_bmbt_buf_ops;
@@ -116,6 +117,7 @@ int	xfs_log_calc_minimum_size(struct xfs_mount *);
 #define	XFS_INO_BTREE_REF	3
 #define	XFS_ALLOC_BTREE_REF	2
 #define	XFS_BMAP_BTREE_REF	2
+#define	XFS_RMAP_BTREE_REF	2
 #define	XFS_DIR_BTREE_REF	2
 #define	XFS_INO_REF		2
 #define	XFS_ATTR_BTREE_REF	1
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index b4153f0..8af1c88 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -42,6 +42,7 @@
 #include "xfs_trace.h"
 #include "xfs_icache.h"
 #include "xfs_sysfs.h"
+#include "xfs_rmap_btree.h"
 
 
 static DEFINE_MUTEX(xfs_uuid_table_mutex);
@@ -680,6 +681,7 @@ xfs_mountfs(
 	xfs_bmap_compute_maxlevels(mp, XFS_DATA_FORK);
 	xfs_bmap_compute_maxlevels(mp, XFS_ATTR_FORK);
 	xfs_ialloc_compute_maxlevels(mp);
+	xfs_rmapbt_compute_maxlevels(mp);
 
 	xfs_set_maxicount(mp);
 
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 0537b1f..0ed0f29 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -116,9 +116,12 @@ typedef struct xfs_mount {
 	uint			m_bmap_dmnr[2];	/* min bmap btree records */
 	uint			m_inobt_mxr[2];	/* max inobt btree records */
 	uint			m_inobt_mnr[2];	/* min inobt btree records */
+	uint			m_rmap_mxr[2];	/* max rmap btree records */
+	uint			m_rmap_mnr[2];	/* min rmap btree records */
 	uint			m_ag_maxlevels;	/* XFS_AG_MAXLEVELS */
 	uint			m_bm_maxlevels[2]; /* XFS_BM_MAXLEVELS */
 	uint			m_in_maxlevels;	/* max inobt btree levels. */
+	uint			m_rmap_maxlevels; /* max rmap btree levels */
 	xfs_extlen_t		m_ag_prealloc_blocks; /* reserved ag blocks */
 	struct radix_tree_root	m_perag_tree;	/* per-ag accounting info */
 	spinlock_t		m_perag_lock;	/* lock for m_perag_tree */
diff --git a/fs/xfs/xfs_ondisk.h b/fs/xfs/xfs_ondisk.h
index 0272301..48d544f 100644
--- a/fs/xfs/xfs_ondisk.h
+++ b/fs/xfs/xfs_ondisk.h
@@ -47,11 +47,14 @@ xfs_check_ondisk_structs(void)
 	XFS_CHECK_STRUCT_SIZE(struct xfs_dsymlink_hdr,		56);
 	XFS_CHECK_STRUCT_SIZE(struct xfs_inobt_key,		4);
 	XFS_CHECK_STRUCT_SIZE(struct xfs_inobt_rec,		16);
+	XFS_CHECK_STRUCT_SIZE(struct xfs_rmap_key,		20);
+	XFS_CHECK_STRUCT_SIZE(struct xfs_rmap_rec,		24);
 	XFS_CHECK_STRUCT_SIZE(struct xfs_timestamp,		8);
 	XFS_CHECK_STRUCT_SIZE(xfs_alloc_key_t,			8);
 	XFS_CHECK_STRUCT_SIZE(xfs_alloc_ptr_t,			4);
 	XFS_CHECK_STRUCT_SIZE(xfs_alloc_rec_t,			8);
 	XFS_CHECK_STRUCT_SIZE(xfs_inobt_ptr_t,			4);
+	XFS_CHECK_STRUCT_SIZE(xfs_rmap_ptr_t,			4);
 
 	/* dir/attr trees */
 	XFS_CHECK_STRUCT_SIZE(struct xfs_attr3_leaf_hdr,	80);
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 4872fbd..b4ee9c8 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -2444,6 +2444,8 @@ DECLARE_EVENT_CLASS(xfs_rmap_class,
 		__entry->owner = oinfo->oi_owner;
 		__entry->offset = oinfo->oi_offset;
 		__entry->flags = oinfo->oi_flags;
+		if (unwritten)
+			__entry->flags |= XFS_RMAP_UNWRITTEN;
 	),
 	TP_printk("dev %d:%d agno %u agbno %u len %u owner %lld offset %llu flags 0x%lx",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),


^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 029/119] xfs: add rmap btree growfs support
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (27 preceding siblings ...)
  2016-06-17  1:20 ` [PATCH 028/119] xfs: define the on-disk rmap btree format Darrick J. Wong
@ 2016-06-17  1:20 ` Darrick J. Wong
  2016-06-17  1:21 ` [PATCH 030/119] xfs: rmap btree transaction reservations Darrick J. Wong
                   ` (89 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:20 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs, Dave Chinner

From: Dave Chinner <dchinner@redhat.com>

Now that we can read and write rmap btree blocks, we can add support
to the growfs code to initialise new rmap btree blocks.

[darrick.wong@oracle.com: fill out the rmap offset fields]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
---
 fs/xfs/xfs_fsops.c |   73 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 73 insertions(+)


diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index d60bb97..8a85e49 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -33,6 +33,7 @@
 #include "xfs_btree.h"
 #include "xfs_alloc_btree.h"
 #include "xfs_alloc.h"
+#include "xfs_rmap_btree.h"
 #include "xfs_ialloc.h"
 #include "xfs_fsops.h"
 #include "xfs_itable.h"
@@ -240,6 +241,12 @@ xfs_growfs_data_private(
 		agf->agf_roots[XFS_BTNUM_CNTi] = cpu_to_be32(XFS_CNT_BLOCK(mp));
 		agf->agf_levels[XFS_BTNUM_BNOi] = cpu_to_be32(1);
 		agf->agf_levels[XFS_BTNUM_CNTi] = cpu_to_be32(1);
+		if (xfs_sb_version_hasrmapbt(&mp->m_sb)) {
+			agf->agf_roots[XFS_BTNUM_RMAPi] =
+						cpu_to_be32(XFS_RMAP_BLOCK(mp));
+			agf->agf_levels[XFS_BTNUM_RMAPi] = cpu_to_be32(1);
+		}
+
 		agf->agf_flfirst = cpu_to_be32(1);
 		agf->agf_fllast = 0;
 		agf->agf_flcount = 0;
@@ -379,6 +386,72 @@ xfs_growfs_data_private(
 		if (error)
 			goto error0;
 
+		/* RMAP btree root block */
+		if (xfs_sb_version_hasrmapbt(&mp->m_sb)) {
+			struct xfs_rmap_rec	*rrec;
+			struct xfs_btree_block	*block;
+
+			bp = xfs_growfs_get_hdr_buf(mp,
+				XFS_AGB_TO_DADDR(mp, agno, XFS_RMAP_BLOCK(mp)),
+				BTOBB(mp->m_sb.sb_blocksize), 0,
+				&xfs_rmapbt_buf_ops);
+			if (!bp) {
+				error = -ENOMEM;
+				goto error0;
+			}
+
+			xfs_btree_init_block(mp, bp, XFS_RMAP_CRC_MAGIC, 0, 0,
+						agno, XFS_BTREE_CRC_BLOCKS);
+			block = XFS_BUF_TO_BLOCK(bp);
+
+
+			/*
+			 * Mark the AG header regions as static metadata.  The
+			 * BNO btree block is the first block after the headers,
+			 * so its location defines the size of the region the
+			 * static metadata consumes.
+			 *
+			 * Note: unlike mkfs, we never have to account for log
+			 * space when growing the data regions
+			 */
+			rrec = XFS_RMAP_REC_ADDR(block, 1);
+			rrec->rm_startblock = 0;
+			rrec->rm_blockcount = cpu_to_be32(XFS_BNO_BLOCK(mp));
+			rrec->rm_owner = cpu_to_be64(XFS_RMAP_OWN_FS);
+			rrec->rm_offset = 0;
+			be16_add_cpu(&block->bb_numrecs, 1);
+
+			/* account freespace btree root blocks */
+			rrec = XFS_RMAP_REC_ADDR(block, 2);
+			rrec->rm_startblock = cpu_to_be32(XFS_BNO_BLOCK(mp));
+			rrec->rm_blockcount = cpu_to_be32(2);
+			rrec->rm_owner = cpu_to_be64(XFS_RMAP_OWN_AG);
+			rrec->rm_offset = 0;
+			be16_add_cpu(&block->bb_numrecs, 1);
+
+			/* account inode btree root blocks */
+			rrec = XFS_RMAP_REC_ADDR(block, 3);
+			rrec->rm_startblock = cpu_to_be32(XFS_IBT_BLOCK(mp));
+			rrec->rm_blockcount = cpu_to_be32(XFS_RMAP_BLOCK(mp) -
+							XFS_IBT_BLOCK(mp));
+			rrec->rm_owner = cpu_to_be64(XFS_RMAP_OWN_INOBT);
+			rrec->rm_offset = 0;
+			be16_add_cpu(&block->bb_numrecs, 1);
+
+			/* account for rmap btree root */
+			rrec = XFS_RMAP_REC_ADDR(block, 4);
+			rrec->rm_startblock = cpu_to_be32(XFS_RMAP_BLOCK(mp));
+			rrec->rm_blockcount = cpu_to_be32(1);
+			rrec->rm_owner = cpu_to_be64(XFS_RMAP_OWN_AG);
+			rrec->rm_offset = 0;
+			be16_add_cpu(&block->bb_numrecs, 1);
+
+			error = xfs_bwrite(bp);
+			xfs_buf_relse(bp);
+			if (error)
+				goto error0;
+		}
+
 		/*
 		 * INO btree root block
 		 */



* [PATCH 030/119] xfs: rmap btree transaction reservations
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (28 preceding siblings ...)
  2016-06-17  1:20 ` [PATCH 029/119] xfs: add rmap btree growfs support Darrick J. Wong
@ 2016-06-17  1:21 ` Darrick J. Wong
  2016-07-08 13:21   ` Brian Foster
  2016-06-17  1:21 ` [PATCH 031/119] xfs: rmap btree requires more reserved free space Darrick J. Wong
                   ` (88 subsequent siblings)
  118 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:21 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs, Dave Chinner

The rmap btrees will use the AGFL as the block allocation source, so
we need to ensure that the transaction reservations reflect the fact
that this tree is modified by allocation and freeing. Hence we need
to extend all the extent allocation/free reservations used in
transactions to handle this.

Note that this also gets rid of the unused XFS_ALLOCFREE_LOG_RES
macro, as we now do buffer reservations based on the number of
buffers logged via xfs_calc_buf_res(). Hence we only need the buffer
count calculation now.

[darrick: use rmap_maxlevels when calculating log block resv]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_trans_resv.c |   58 ++++++++++++++++++++++++++++------------
 fs/xfs/libxfs/xfs_trans_resv.h |   10 -------
 2 files changed, 41 insertions(+), 27 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
index 4c7eb9d..301ef2f 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.c
+++ b/fs/xfs/libxfs/xfs_trans_resv.c
@@ -64,6 +64,30 @@ xfs_calc_buf_res(
 }
 
 /*
+ * Per-extent log reservation for the btree changes involved in freeing or
+ * allocating an extent.  In classic XFS there are two trees that may be
+ * modified (bnobt + cntbt).  With rmap enabled, a third tree (the rmapbt)
+ * may be modified as well.  The number of blocks reserved follows the formula:
+ *
+ * num trees * ((2 blocks/level * max depth) - 1)
+ *
+ * Keep in mind that max depth is calculated separately for each type of tree.
+ */
+static uint
+xfs_allocfree_log_count(
+	struct xfs_mount *mp,
+	uint		num_ops)
+{
+	uint		blocks;
+
+	blocks = num_ops * 2 * (2 * mp->m_ag_maxlevels - 1);
+	if (xfs_sb_version_hasrmapbt(&mp->m_sb))
+		blocks += num_ops * (2 * mp->m_rmap_maxlevels - 1);
+
+	return blocks;
+}
+
+/*
  * Logging inodes is really tricksy. They are logged in memory format,
  * which means that what we write into the log doesn't directly translate into
  * the amount of space they use on disk.
@@ -126,7 +150,7 @@ xfs_calc_inode_res(
  */
 STATIC uint
 xfs_calc_finobt_res(
-	struct xfs_mount 	*mp,
+	struct xfs_mount	*mp,
 	int			alloc,
 	int			modify)
 {
@@ -137,7 +161,7 @@ xfs_calc_finobt_res(
 
 	res = xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1));
 	if (alloc)
-		res += xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 1), 
+		res += xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
 					XFS_FSB_TO_B(mp, 1));
 	if (modify)
 		res += (uint)XFS_FSB_TO_B(mp, 1);
@@ -188,10 +212,10 @@ xfs_calc_write_reservation(
 		     xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK),
 				      XFS_FSB_TO_B(mp, 1)) +
 		     xfs_calc_buf_res(3, mp->m_sb.sb_sectsize) +
-		     xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 2),
+		     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 2),
 				      XFS_FSB_TO_B(mp, 1))),
 		    (xfs_calc_buf_res(5, mp->m_sb.sb_sectsize) +
-		     xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 2),
+		     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 2),
 				      XFS_FSB_TO_B(mp, 1))));
 }
 
@@ -217,10 +241,10 @@ xfs_calc_itruncate_reservation(
 		     xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK) + 1,
 				      XFS_FSB_TO_B(mp, 1))),
 		    (xfs_calc_buf_res(9, mp->m_sb.sb_sectsize) +
-		     xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 4),
+		     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 4),
 				      XFS_FSB_TO_B(mp, 1)) +
 		    xfs_calc_buf_res(5, 0) +
-		    xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 1),
+		    xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
 				     XFS_FSB_TO_B(mp, 1)) +
 		    xfs_calc_buf_res(2 + mp->m_ialloc_blks +
 				     mp->m_in_maxlevels, 0)));
@@ -247,7 +271,7 @@ xfs_calc_rename_reservation(
 		     xfs_calc_buf_res(2 * XFS_DIROP_LOG_COUNT(mp),
 				      XFS_FSB_TO_B(mp, 1))),
 		    (xfs_calc_buf_res(7, mp->m_sb.sb_sectsize) +
-		     xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 3),
+		     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 3),
 				      XFS_FSB_TO_B(mp, 1))));
 }
 
@@ -286,7 +310,7 @@ xfs_calc_link_reservation(
 		     xfs_calc_buf_res(XFS_DIROP_LOG_COUNT(mp),
 				      XFS_FSB_TO_B(mp, 1))),
 		    (xfs_calc_buf_res(3, mp->m_sb.sb_sectsize) +
-		     xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 1),
+		     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
 				      XFS_FSB_TO_B(mp, 1))));
 }
 
@@ -324,7 +348,7 @@ xfs_calc_remove_reservation(
 		     xfs_calc_buf_res(XFS_DIROP_LOG_COUNT(mp),
 				      XFS_FSB_TO_B(mp, 1))),
 		    (xfs_calc_buf_res(4, mp->m_sb.sb_sectsize) +
-		     xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 2),
+		     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 2),
 				      XFS_FSB_TO_B(mp, 1))));
 }
 
@@ -371,7 +395,7 @@ xfs_calc_create_resv_alloc(
 		mp->m_sb.sb_sectsize +
 		xfs_calc_buf_res(mp->m_ialloc_blks, XFS_FSB_TO_B(mp, 1)) +
 		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
-		xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 1),
+		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
 				 XFS_FSB_TO_B(mp, 1));
 }
 
@@ -399,7 +423,7 @@ xfs_calc_icreate_resv_alloc(
 	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
 		mp->m_sb.sb_sectsize +
 		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
-		xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 1),
+		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
 				 XFS_FSB_TO_B(mp, 1)) +
 		xfs_calc_finobt_res(mp, 0, 0);
 }
@@ -483,7 +507,7 @@ xfs_calc_ifree_reservation(
 		xfs_calc_buf_res(1, 0) +
 		xfs_calc_buf_res(2 + mp->m_ialloc_blks +
 				 mp->m_in_maxlevels, 0) +
-		xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 1),
+		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
 				 XFS_FSB_TO_B(mp, 1)) +
 		xfs_calc_finobt_res(mp, 0, 1);
 }
@@ -513,7 +537,7 @@ xfs_calc_growdata_reservation(
 	struct xfs_mount	*mp)
 {
 	return xfs_calc_buf_res(3, mp->m_sb.sb_sectsize) +
-		xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 1),
+		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
 				 XFS_FSB_TO_B(mp, 1));
 }
 
@@ -535,7 +559,7 @@ xfs_calc_growrtalloc_reservation(
 		xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK),
 				 XFS_FSB_TO_B(mp, 1)) +
 		xfs_calc_inode_res(mp, 1) +
-		xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 1),
+		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
 				 XFS_FSB_TO_B(mp, 1));
 }
 
@@ -611,7 +635,7 @@ xfs_calc_addafork_reservation(
 		xfs_calc_buf_res(1, mp->m_dir_geo->blksize) +
 		xfs_calc_buf_res(XFS_DAENTER_BMAP1B(mp, XFS_DATA_FORK) + 1,
 				 XFS_FSB_TO_B(mp, 1)) +
-		xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 1),
+		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
 				 XFS_FSB_TO_B(mp, 1));
 }
 
@@ -634,7 +658,7 @@ xfs_calc_attrinval_reservation(
 		    xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK),
 				     XFS_FSB_TO_B(mp, 1))),
 		   (xfs_calc_buf_res(9, mp->m_sb.sb_sectsize) +
-		    xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 4),
+		    xfs_calc_buf_res(xfs_allocfree_log_count(mp, 4),
 				     XFS_FSB_TO_B(mp, 1))));
 }
 
@@ -701,7 +725,7 @@ xfs_calc_attrrm_reservation(
 					XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK)) +
 		     xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK), 0)),
 		    (xfs_calc_buf_res(5, mp->m_sb.sb_sectsize) +
-		     xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 2),
+		     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 2),
 				      XFS_FSB_TO_B(mp, 1))));
 }
 
diff --git a/fs/xfs/libxfs/xfs_trans_resv.h b/fs/xfs/libxfs/xfs_trans_resv.h
index 7978150..0eb46ed 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.h
+++ b/fs/xfs/libxfs/xfs_trans_resv.h
@@ -68,16 +68,6 @@ struct xfs_trans_resv {
 #define M_RES(mp)	(&(mp)->m_resv)
 
 /*
- * Per-extent log reservation for the allocation btree changes
- * involved in freeing or allocating an extent.
- * 2 trees * (2 blocks/level * max depth - 1) * block size
- */
-#define	XFS_ALLOCFREE_LOG_RES(mp,nx) \
-	((nx) * (2 * XFS_FSB_TO_B((mp), 2 * (mp)->m_ag_maxlevels - 1)))
-#define	XFS_ALLOCFREE_LOG_COUNT(mp,nx) \
-	((nx) * (2 * (2 * (mp)->m_ag_maxlevels - 1)))
-
-/*
  * Per-directory log reservation for any directory change.
  * dir blocks: (1 btree block per level + data block + free block) * dblock size
  * bmap btree: (levels + 2) * max depth * block size



* [PATCH 031/119] xfs: rmap btree requires more reserved free space
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (29 preceding siblings ...)
  2016-06-17  1:21 ` [PATCH 030/119] xfs: rmap btree transaction reservations Darrick J. Wong
@ 2016-06-17  1:21 ` Darrick J. Wong
  2016-07-08 13:21   ` Brian Foster
  2016-06-17  1:21 ` [PATCH 032/119] xfs: add rmap btree operations Darrick J. Wong
                   ` (87 subsequent siblings)
  118 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:21 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs, Dave Chinner

From: Dave Chinner <dchinner@redhat.com>

The rmap btree is allocated from the AGFL, which means we have to
ensure ENOSPC is reported to userspace before we run out of free
space in each AG. The last allocation in an AG can cause a full
height rmap btree split, and that means we have to reserve at least
this many blocks *in each AG* to be placed on the AGFL at ENOSPC.
Update the various space calculation functions to handle this.

Also, because the macros are now executing conditional code and are
called quite frequently, convert them to functions that initialise
variables in the struct xfs_mount, use the new variables everywhere
and document the calculations better.

v2: If rmapbt is disabled, it is incorrect to require 1 extra AGFL block
for the rmapbt (due to the + 1); the entire clause needs to be gated
on the feature flag.

v3: Use m_rmap_maxlevels to determine min_free.

[darrick.wong@oracle.com: don't reserve blocks if !rmap]
[dchinner@redhat.com: update m_ag_max_usable after growfs]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
---
 fs/xfs/libxfs/xfs_alloc.c |   71 +++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_alloc.h |   41 +++-----------------------
 fs/xfs/libxfs/xfs_bmap.c  |    2 +
 fs/xfs/libxfs/xfs_sb.c    |    2 +
 fs/xfs/xfs_discard.c      |    2 +
 fs/xfs/xfs_fsops.c        |    5 ++-
 fs/xfs/xfs_log_recover.c  |    1 +
 fs/xfs/xfs_mount.c        |    2 +
 fs/xfs/xfs_mount.h        |    2 +
 fs/xfs/xfs_super.c        |    2 +
 10 files changed, 88 insertions(+), 42 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 570ca17..4c8ffd4 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -63,6 +63,72 @@ xfs_prealloc_blocks(
 }
 
 /*
+ * In order to avoid ENOSPC-related deadlock caused by out-of-order locking of
+ * AGF buffer (PV 947395), we place constraints on the relationship among
+ * actual allocations for data blocks, freelist blocks, and potential file data
+ * bmap btree blocks. However, these restrictions may result in no actual space
+ * allocated for a delayed extent, for example, a data block in a certain AG is
+ * allocated but there is no additional block for the additional bmap btree
+ * block due to a split of the bmap btree of the file. The result of this may
+ * lead to an infinite loop when the file gets flushed to disk and all delayed
+ * extents need to be actually allocated. To get around this, we explicitly set
+ * aside a few blocks which will not be reserved in delayed allocation.
+ *
+ * The minimum number of freelist blocks needed is 4 fsbs _per AG_ when we
+ * are not using rmap btrees; a potential split of a file's bmap btree
+ * requires 1 fsb, so we set the number of set-aside blocks to 4 + 4*agcount
+ * when not using rmap btrees.
+ *
+ * When rmap btrees are active, we have to consider that using the last block
+ * in the AG can cause a full height rmap btree split and we need enough blocks
+ * on the AGFL to be able to handle this. That means we have, in addition to
+ * the above consideration, another (2 * mp->m_rmap_maxlevels) - 1 blocks required
+ * to be available to the free list.
+ */
+unsigned int
+xfs_alloc_set_aside(
+	struct xfs_mount *mp)
+{
+	unsigned int	blocks;
+
+	blocks = 4 + (mp->m_sb.sb_agcount * XFS_ALLOC_AGFL_RESERVE);
+	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
+		return blocks;
+	return blocks + (mp->m_sb.sb_agcount * (2 * mp->m_rmap_maxlevels) - 1);
+}
+
+/*
+ * When deciding how much space to allocate out of an AG, we limit the
+ * allocation maximum size to the size of the AG. However, we cannot use all the
+ * blocks in the AG - some are permanently used by metadata. These
+ * blocks are generally:
+ *	- the AG superblock, AGF, AGI and AGFL
+ *	- the AGF (bno and cnt) and AGI btree root blocks, and optionally
+ *	  the AGI free inode and rmap btree root blocks.
+ *	- blocks on the AGFL according to xfs_alloc_set_aside() limits
+ *
+ * The AG headers are sector sized, so the amount of space they take up is
+ * dependent on filesystem geometry. The others are all single blocks.
+ */
+unsigned int
+xfs_alloc_ag_max_usable(struct xfs_mount *mp)
+{
+	unsigned int	blocks;
+
+	blocks = XFS_BB_TO_FSB(mp, XFS_FSS_TO_BB(mp, 4)); /* ag headers */
+	blocks += XFS_ALLOC_AGFL_RESERVE;
+	blocks += 3;			/* AGF, AGI btree root blocks */
+	if (xfs_sb_version_hasfinobt(&mp->m_sb))
+		blocks++;		/* finobt root block */
+	if (xfs_sb_version_hasrmapbt(&mp->m_sb)) {
+		/* rmap root block + full tree split on full AG */
+		blocks += 1 + (2 * mp->m_ag_maxlevels) - 1;
+	}
+
+	return mp->m_sb.sb_agblocks - blocks;
+}
+
+/*
  * Lookup the record equal to [bno, len] in the btree given by cur.
  */
 STATIC int				/* error */
@@ -1904,6 +1970,11 @@ xfs_alloc_min_freelist(
 	/* space needed by-size freespace btree */
 	min_free += min_t(unsigned int, pag->pagf_levels[XFS_BTNUM_CNTi] + 1,
 				       mp->m_ag_maxlevels);
+	/* space needed for the reverse mapping btree */
+	if (xfs_sb_version_hasrmapbt(&mp->m_sb))
+		min_free += min_t(unsigned int,
+				  pag->pagf_levels[XFS_BTNUM_RMAPi] + 1,
+				  mp->m_rmap_maxlevels);
 
 	return min_free;
 }
diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index 0721a48..7b6c66b 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -56,42 +56,6 @@ typedef unsigned int xfs_alloctype_t;
 #define	XFS_ALLOC_FLAG_FREEING	0x00000002  /* indicate caller is freeing extents*/
 
 /*
- * In order to avoid ENOSPC-related deadlock caused by
- * out-of-order locking of AGF buffer (PV 947395), we place
- * constraints on the relationship among actual allocations for
- * data blocks, freelist blocks, and potential file data bmap
- * btree blocks. However, these restrictions may result in no
- * actual space allocated for a delayed extent, for example, a data
- * block in a certain AG is allocated but there is no additional
- * block for the additional bmap btree block due to a split of the
- * bmap btree of the file. The result of this may lead to an
- * infinite loop in xfssyncd when the file gets flushed to disk and
- * all delayed extents need to be actually allocated. To get around
- * this, we explicitly set aside a few blocks which will not be
- * reserved in delayed allocation. Considering the minimum number of
- * needed freelist blocks is 4 fsbs _per AG_, a potential split of file's bmap
- * btree requires 1 fsb, so we set the number of set-aside blocks
- * to 4 + 4*agcount.
- */
-#define XFS_ALLOC_SET_ASIDE(mp)  (4 + ((mp)->m_sb.sb_agcount * 4))
-
-/*
- * When deciding how much space to allocate out of an AG, we limit the
- * allocation maximum size to the size the AG. However, we cannot use all the
- * blocks in the AG - some are permanently used by metadata. These
- * blocks are generally:
- *	- the AG superblock, AGF, AGI and AGFL
- *	- the AGF (bno and cnt) and AGI btree root blocks
- *	- 4 blocks on the AGFL according to XFS_ALLOC_SET_ASIDE() limits
- *
- * The AG headers are sector sized, so the amount of space they take up is
- * dependent on filesystem geometry. The others are all single blocks.
- */
-#define XFS_ALLOC_AG_MAX_USABLE(mp)	\
-	((mp)->m_sb.sb_agblocks - XFS_BB_TO_FSB(mp, XFS_FSS_TO_BB(mp, 4)) - 7)
-
-
-/*
  * Argument structure for xfs_alloc routines.
  * This is turned into a structure to avoid having 20 arguments passed
  * down several levels of the stack.
@@ -133,6 +97,11 @@ typedef struct xfs_alloc_arg {
 #define XFS_ALLOC_INITIAL_USER_DATA	(1 << 1)/* special case start of file */
 #define XFS_ALLOC_USERDATA_ZERO		(1 << 2)/* zero extent on allocation */
 
+/* freespace limit calculations */
+#define XFS_ALLOC_AGFL_RESERVE	4
+unsigned int xfs_alloc_set_aside(struct xfs_mount *mp);
+unsigned int xfs_alloc_ag_max_usable(struct xfs_mount *mp);
+
 xfs_extlen_t xfs_alloc_longest_free_extent(struct xfs_mount *mp,
 		struct xfs_perag *pag, xfs_extlen_t need);
 unsigned int xfs_alloc_min_freelist(struct xfs_mount *mp,
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 2c28f2a..61c0231 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3672,7 +3672,7 @@ xfs_bmap_btalloc(
 	args.fsbno = ap->blkno;
 
 	/* Trim the allocation back to the maximum an AG can fit. */
-	args.maxlen = MIN(ap->length, XFS_ALLOC_AG_MAX_USABLE(mp));
+	args.maxlen = MIN(ap->length, mp->m_ag_max_usable);
 	args.firstblock = *ap->firstblock;
 	blen = 0;
 	if (nullfb) {
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index f86226b..59c9f59 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -749,6 +749,8 @@ xfs_sb_mount_common(
 		mp->m_ialloc_min_blks = sbp->sb_spino_align;
 	else
 		mp->m_ialloc_min_blks = mp->m_ialloc_blks;
+	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
+	mp->m_ag_max_usable = xfs_alloc_ag_max_usable(mp);
 }
 
 /*
diff --git a/fs/xfs/xfs_discard.c b/fs/xfs/xfs_discard.c
index 272c3f8..4ff499a 100644
--- a/fs/xfs/xfs_discard.c
+++ b/fs/xfs/xfs_discard.c
@@ -179,7 +179,7 @@ xfs_ioc_trim(
 	 * matter as trimming blocks is an advisory interface.
 	 */
 	if (range.start >= XFS_FSB_TO_B(mp, mp->m_sb.sb_dblocks) ||
-	    range.minlen > XFS_FSB_TO_B(mp, XFS_ALLOC_AG_MAX_USABLE(mp)) ||
+	    range.minlen > XFS_FSB_TO_B(mp, mp->m_ag_max_usable) ||
 	    range.len < mp->m_sb.sb_blocksize)
 		return -EINVAL;
 
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 8a85e49..3772f6c 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -583,6 +583,7 @@ xfs_growfs_data_private(
 	} else
 		mp->m_maxicount = 0;
 	xfs_set_low_space_thresholds(mp);
+	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
 
 	/* update secondary superblocks. */
 	for (agno = 1; agno < nagcount; agno++) {
@@ -720,7 +721,7 @@ xfs_fs_counts(
 	cnt->allocino = percpu_counter_read_positive(&mp->m_icount);
 	cnt->freeino = percpu_counter_read_positive(&mp->m_ifree);
 	cnt->freedata = percpu_counter_read_positive(&mp->m_fdblocks) -
-							XFS_ALLOC_SET_ASIDE(mp);
+						mp->m_alloc_set_aside;
 
 	spin_lock(&mp->m_sb_lock);
 	cnt->freertx = mp->m_sb.sb_frextents;
@@ -793,7 +794,7 @@ retry:
 		__int64_t	free;
 
 		free = percpu_counter_sum(&mp->m_fdblocks) -
-							XFS_ALLOC_SET_ASIDE(mp);
+						mp->m_alloc_set_aside;
 		if (!free)
 			goto out; /* ENOSPC and fdblks_delta = 0 */
 
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 0c41bd2..b33187b 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -5027,6 +5027,7 @@ xlog_do_recover(
 		xfs_warn(mp, "Failed post-recovery per-ag init: %d", error);
 		return error;
 	}
+	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
 
 	xlog_recover_check_summary(log);
 
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 8af1c88..879f3ef 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -1219,7 +1219,7 @@ xfs_mod_fdblocks(
 		batch = XFS_FDBLOCKS_BATCH;
 
 	__percpu_counter_add(&mp->m_fdblocks, delta, batch);
-	if (__percpu_counter_compare(&mp->m_fdblocks, XFS_ALLOC_SET_ASIDE(mp),
+	if (__percpu_counter_compare(&mp->m_fdblocks, mp->m_alloc_set_aside,
 				     XFS_FDBLOCKS_BATCH) >= 0) {
 		/* we had space! */
 		return 0;
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 0ed0f29..b36676c 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -123,6 +123,8 @@ typedef struct xfs_mount {
 	uint			m_in_maxlevels;	/* max inobt btree levels. */
 	uint			m_rmap_maxlevels; /* max rmap btree levels */
 	xfs_extlen_t		m_ag_prealloc_blocks; /* reserved ag blocks */
+	uint			m_alloc_set_aside; /* space we can't use */
+	uint			m_ag_max_usable; /* max space per AG */
 	struct radix_tree_root	m_perag_tree;	/* per-ag accounting info */
 	spinlock_t		m_perag_lock;	/* lock for m_perag_tree */
 	struct mutex		m_growlock;	/* growfs mutex */
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index bf63f6d..1575849 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1076,7 +1076,7 @@ xfs_fs_statfs(
 	statp->f_blocks = sbp->sb_dblocks - lsize;
 	spin_unlock(&mp->m_sb_lock);
 
-	statp->f_bfree = fdblocks - XFS_ALLOC_SET_ASIDE(mp);
+	statp->f_bfree = fdblocks - mp->m_alloc_set_aside;
 	statp->f_bavail = statp->f_bfree;
 
 	fakeinos = statp->f_bfree << sbp->sb_inopblog;



* [PATCH 032/119] xfs: add rmap btree operations
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (30 preceding siblings ...)
  2016-06-17  1:21 ` [PATCH 031/119] xfs: rmap btree requires more reserved free space Darrick J. Wong
@ 2016-06-17  1:21 ` Darrick J. Wong
  2016-07-08 18:33   ` Brian Foster
  2016-06-17  1:21 ` [PATCH 033/119] xfs: support overlapping intervals in the rmap btree Darrick J. Wong
                   ` (86 subsequent siblings)
  118 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:21 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs, Dave Chinner

From: Dave Chinner <dchinner@redhat.com>

Implement the generic btree operations needed to manipulate rmap
btree blocks. This is very similar to the per-ag freespace btree
implementation, and uses the AGFL for allocation and freeing of
blocks.

Adapt the rmap btree to store owner offsets within each rmap record,
and to handle the primary key being redefined as the tuple
[agblk, owner, offset].  The expansion of the primary key is crucial
to allowing multiple owners per extent.

[darrick: adapt the btree ops to deal with offsets]
[darrick: remove init_rec_from_key]
[darrick: move unwritten bit to rm_offset]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
---
 fs/xfs/libxfs/xfs_btree.h      |    1 
 fs/xfs/libxfs/xfs_rmap.c       |   96 ++++++++++++++++
 fs/xfs/libxfs/xfs_rmap_btree.c |  243 ++++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_rmap_btree.h |    9 +
 fs/xfs/xfs_trace.h             |    3 
 5 files changed, 352 insertions(+)


diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index 90ea2a7..9963c48 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -216,6 +216,7 @@ union xfs_btree_irec {
 	xfs_alloc_rec_incore_t		a;
 	xfs_bmbt_irec_t			b;
 	xfs_inobt_rec_incore_t		i;
+	struct xfs_rmap_irec		r;
 };
 
 /*
diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
index d1fd471..c6a5a0b 100644
--- a/fs/xfs/libxfs/xfs_rmap.c
+++ b/fs/xfs/libxfs/xfs_rmap.c
@@ -37,6 +37,102 @@
 #include "xfs_error.h"
 #include "xfs_extent_busy.h"
 
+/*
+ * Lookup the first record less than or equal to [bno, len, owner, offset]
+ * in the btree given by cur.
+ */
+int
+xfs_rmap_lookup_le(
+	struct xfs_btree_cur	*cur,
+	xfs_agblock_t		bno,
+	xfs_extlen_t		len,
+	uint64_t		owner,
+	uint64_t		offset,
+	unsigned int		flags,
+	int			*stat)
+{
+	cur->bc_rec.r.rm_startblock = bno;
+	cur->bc_rec.r.rm_blockcount = len;
+	cur->bc_rec.r.rm_owner = owner;
+	cur->bc_rec.r.rm_offset = offset;
+	cur->bc_rec.r.rm_flags = flags;
+	return xfs_btree_lookup(cur, XFS_LOOKUP_LE, stat);
+}
+
+/*
+ * Lookup the record exactly matching [bno, len, owner, offset]
+ * in the btree given by cur.
+ */
+int
+xfs_rmap_lookup_eq(
+	struct xfs_btree_cur	*cur,
+	xfs_agblock_t		bno,
+	xfs_extlen_t		len,
+	uint64_t		owner,
+	uint64_t		offset,
+	unsigned int		flags,
+	int			*stat)
+{
+	cur->bc_rec.r.rm_startblock = bno;
+	cur->bc_rec.r.rm_blockcount = len;
+	cur->bc_rec.r.rm_owner = owner;
+	cur->bc_rec.r.rm_offset = offset;
+	cur->bc_rec.r.rm_flags = flags;
+	return xfs_btree_lookup(cur, XFS_LOOKUP_EQ, stat);
+}
+
+/*
+ * Update the record referred to by cur to the value given
+ * by [bno, len, owner, offset].
+ * This either works (return 0) or gets an EFSCORRUPTED error.
+ */
+STATIC int
+xfs_rmap_update(
+	struct xfs_btree_cur	*cur,
+	struct xfs_rmap_irec	*irec)
+{
+	union xfs_btree_rec	rec;
+
+	rec.rmap.rm_startblock = cpu_to_be32(irec->rm_startblock);
+	rec.rmap.rm_blockcount = cpu_to_be32(irec->rm_blockcount);
+	rec.rmap.rm_owner = cpu_to_be64(irec->rm_owner);
+	rec.rmap.rm_offset = cpu_to_be64(
+			xfs_rmap_irec_offset_pack(irec));
+	return xfs_btree_update(cur, &rec);
+}
+
+static int
+xfs_rmapbt_btrec_to_irec(
+	union xfs_btree_rec	*rec,
+	struct xfs_rmap_irec	*irec)
+{
+	irec->rm_flags = 0;
+	irec->rm_startblock = be32_to_cpu(rec->rmap.rm_startblock);
+	irec->rm_blockcount = be32_to_cpu(rec->rmap.rm_blockcount);
+	irec->rm_owner = be64_to_cpu(rec->rmap.rm_owner);
+	return xfs_rmap_irec_offset_unpack(be64_to_cpu(rec->rmap.rm_offset),
+			irec);
+}
+
+/*
+ * Get the data from the pointed-to record.
+ */
+int
+xfs_rmap_get_rec(
+	struct xfs_btree_cur	*cur,
+	struct xfs_rmap_irec	*irec,
+	int			*stat)
+{
+	union xfs_btree_rec	*rec;
+	int			error;
+
+	error = xfs_btree_get_rec(cur, &rec, stat);
+	if (error || !*stat)
+		return error;
+
+	return xfs_rmapbt_btrec_to_irec(rec, irec);
+}
+
 int
 xfs_rmap_free(
 	struct xfs_trans	*tp,
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
index 7a35c78..c50c725 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.c
+++ b/fs/xfs/libxfs/xfs_rmap_btree.c
@@ -35,6 +35,31 @@
 #include "xfs_error.h"
 #include "xfs_extent_busy.h"
 
+/*
+ * Reverse map btree.
+ *
+ * This is a per-ag tree used to track the owner(s) of a given extent. With
+ * reflink it is possible for there to be multiple owners, which is a departure
+ * from classic XFS. Owner records for data extents are inserted when the
+ * extent is mapped and removed when an extent is unmapped.  Owner records for
+ * all other block types (i.e. metadata) are inserted when an extent is
+ * allocated and removed when an extent is freed. There can only be one owner
+ * of a metadata extent, usually an inode or some other metadata structure like
+ * an AG btree.
+ *
+ * The rmap btree is part of the free space management, so blocks for the tree
+ * are sourced from the agfl. Hence we need transaction reservation support for
+ * this tree so that the freelist is always large enough. This also impacts on
+ * the minimum space we need to leave free in the AG.
+ *
+ * The tree is ordered by [ag block, owner, offset]. This is a large key size,
+ * but it is the only way to enforce unique keys when a block can be owned by
+ * multiple files at any offset. There's no need to order/search by extent
+ * size for online updating/management of the tree. It is intended that most
+ * reverse lookups will be to find the owner(s) of a particular block, or to
+ * try to recover tree and file data from corrupt primary metadata.
+ */
+
 static struct xfs_btree_cur *
 xfs_rmapbt_dup_cursor(
 	struct xfs_btree_cur	*cur)
@@ -43,6 +68,173 @@ xfs_rmapbt_dup_cursor(
 			cur->bc_private.a.agbp, cur->bc_private.a.agno);
 }
 
+STATIC void
+xfs_rmapbt_set_root(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_ptr	*ptr,
+	int			inc)
+{
+	struct xfs_buf		*agbp = cur->bc_private.a.agbp;
+	struct xfs_agf		*agf = XFS_BUF_TO_AGF(agbp);
+	xfs_agnumber_t		seqno = be32_to_cpu(agf->agf_seqno);
+	int			btnum = cur->bc_btnum;
+	struct xfs_perag	*pag = xfs_perag_get(cur->bc_mp, seqno);
+
+	ASSERT(ptr->s != 0);
+
+	agf->agf_roots[btnum] = ptr->s;
+	be32_add_cpu(&agf->agf_levels[btnum], inc);
+	pag->pagf_levels[btnum] += inc;
+	xfs_perag_put(pag);
+
+	xfs_alloc_log_agf(cur->bc_tp, agbp, XFS_AGF_ROOTS | XFS_AGF_LEVELS);
+}
+
+STATIC int
+xfs_rmapbt_alloc_block(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_ptr	*start,
+	union xfs_btree_ptr	*new,
+	int			*stat)
+{
+	int			error;
+	xfs_agblock_t		bno;
+
+	XFS_BTREE_TRACE_CURSOR(cur, XBT_ENTRY);
+
+	/* Allocate the new block from the freelist. If we can't, give up.  */
+	error = xfs_alloc_get_freelist(cur->bc_tp, cur->bc_private.a.agbp,
+				       &bno, 1);
+	if (error) {
+		XFS_BTREE_TRACE_CURSOR(cur, XBT_ERROR);
+		return error;
+	}
+
+	trace_xfs_rmapbt_alloc_block(cur->bc_mp, cur->bc_private.a.agno,
+			bno, 1);
+	if (bno == NULLAGBLOCK) {
+		XFS_BTREE_TRACE_CURSOR(cur, XBT_EXIT);
+		*stat = 0;
+		return 0;
+	}
+
+	xfs_extent_busy_reuse(cur->bc_mp, cur->bc_private.a.agno, bno, 1,
+			false);
+
+	xfs_trans_agbtree_delta(cur->bc_tp, 1);
+	new->s = cpu_to_be32(bno);
+
+	XFS_BTREE_TRACE_CURSOR(cur, XBT_EXIT);
+	*stat = 1;
+	return 0;
+}
+
+STATIC int
+xfs_rmapbt_free_block(
+	struct xfs_btree_cur	*cur,
+	struct xfs_buf		*bp)
+{
+	struct xfs_buf		*agbp = cur->bc_private.a.agbp;
+	struct xfs_agf		*agf = XFS_BUF_TO_AGF(agbp);
+	xfs_agblock_t		bno;
+	int			error;
+
+	bno = xfs_daddr_to_agbno(cur->bc_mp, XFS_BUF_ADDR(bp));
+	trace_xfs_rmapbt_free_block(cur->bc_mp, cur->bc_private.a.agno,
+			bno, 1);
+	error = xfs_alloc_put_freelist(cur->bc_tp, agbp, NULL, bno, 1);
+	if (error)
+		return error;
+
+	xfs_extent_busy_insert(cur->bc_tp, be32_to_cpu(agf->agf_seqno), bno, 1,
+			      XFS_EXTENT_BUSY_SKIP_DISCARD);
+	xfs_trans_agbtree_delta(cur->bc_tp, -1);
+
+	xfs_trans_binval(cur->bc_tp, bp);
+	return 0;
+}
+
+STATIC int
+xfs_rmapbt_get_minrecs(
+	struct xfs_btree_cur	*cur,
+	int			level)
+{
+	return cur->bc_mp->m_rmap_mnr[level != 0];
+}
+
+STATIC int
+xfs_rmapbt_get_maxrecs(
+	struct xfs_btree_cur	*cur,
+	int			level)
+{
+	return cur->bc_mp->m_rmap_mxr[level != 0];
+}
+
+STATIC void
+xfs_rmapbt_init_key_from_rec(
+	union xfs_btree_key	*key,
+	union xfs_btree_rec	*rec)
+{
+	key->rmap.rm_startblock = rec->rmap.rm_startblock;
+	key->rmap.rm_owner = rec->rmap.rm_owner;
+	key->rmap.rm_offset = rec->rmap.rm_offset;
+}
+
+STATIC void
+xfs_rmapbt_init_rec_from_cur(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_rec	*rec)
+{
+	rec->rmap.rm_startblock = cpu_to_be32(cur->bc_rec.r.rm_startblock);
+	rec->rmap.rm_blockcount = cpu_to_be32(cur->bc_rec.r.rm_blockcount);
+	rec->rmap.rm_owner = cpu_to_be64(cur->bc_rec.r.rm_owner);
+	rec->rmap.rm_offset = cpu_to_be64(
+			xfs_rmap_irec_offset_pack(&cur->bc_rec.r));
+}
+
+STATIC void
+xfs_rmapbt_init_ptr_from_cur(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_ptr	*ptr)
+{
+	struct xfs_agf		*agf = XFS_BUF_TO_AGF(cur->bc_private.a.agbp);
+
+	ASSERT(cur->bc_private.a.agno == be32_to_cpu(agf->agf_seqno));
+	ASSERT(agf->agf_roots[cur->bc_btnum] != 0);
+
+	ptr->s = agf->agf_roots[cur->bc_btnum];
+}
+
+STATIC __int64_t
+xfs_rmapbt_key_diff(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_key	*key)
+{
+	struct xfs_rmap_irec	*rec = &cur->bc_rec.r;
+	struct xfs_rmap_key	*kp = &key->rmap;
+	__u64			x, y;
+	__int64_t		d;
+
+	d = (__int64_t)be32_to_cpu(kp->rm_startblock) - rec->rm_startblock;
+	if (d)
+		return d;
+
+	x = be64_to_cpu(kp->rm_owner);
+	y = rec->rm_owner;
+	if (x > y)
+		return 1;
+	else if (y > x)
+		return -1;
+
+	x = XFS_RMAP_OFF(be64_to_cpu(kp->rm_offset));
+	y = rec->rm_offset;
+	if (x > y)
+		return 1;
+	else if (y > x)
+		return -1;
+	return 0;
+}
+
 static bool
 xfs_rmapbt_verify(
 	struct xfs_buf		*bp)
@@ -117,12 +309,63 @@ const struct xfs_buf_ops xfs_rmapbt_buf_ops = {
 	.verify_write		= xfs_rmapbt_write_verify,
 };
 
+#if defined(DEBUG) || defined(XFS_WARN)
+STATIC int
+xfs_rmapbt_keys_inorder(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_key	*k1,
+	union xfs_btree_key	*k2)
+{
+	if (be32_to_cpu(k1->rmap.rm_startblock) <
+	    be32_to_cpu(k2->rmap.rm_startblock))
+		return 1;
+	if (be64_to_cpu(k1->rmap.rm_owner) <
+	    be64_to_cpu(k2->rmap.rm_owner))
+		return 1;
+	if (XFS_RMAP_OFF(be64_to_cpu(k1->rmap.rm_offset)) <=
+	    XFS_RMAP_OFF(be64_to_cpu(k2->rmap.rm_offset)))
+		return 1;
+	return 0;
+}
+
+STATIC int
+xfs_rmapbt_recs_inorder(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_rec	*r1,
+	union xfs_btree_rec	*r2)
+{
+	if (be32_to_cpu(r1->rmap.rm_startblock) <
+	    be32_to_cpu(r2->rmap.rm_startblock))
+		return 1;
+	if (XFS_RMAP_OFF(be64_to_cpu(r1->rmap.rm_offset)) <
+	    XFS_RMAP_OFF(be64_to_cpu(r2->rmap.rm_offset)))
+		return 1;
+	if (be64_to_cpu(r1->rmap.rm_owner) <=
+	    be64_to_cpu(r2->rmap.rm_owner))
+		return 1;
+	return 0;
+}
+#endif	/* DEBUG */
+
 static const struct xfs_btree_ops xfs_rmapbt_ops = {
 	.rec_len		= sizeof(struct xfs_rmap_rec),
 	.key_len		= sizeof(struct xfs_rmap_key),
 
 	.dup_cursor		= xfs_rmapbt_dup_cursor,
+	.set_root		= xfs_rmapbt_set_root,
+	.alloc_block		= xfs_rmapbt_alloc_block,
+	.free_block		= xfs_rmapbt_free_block,
+	.get_minrecs		= xfs_rmapbt_get_minrecs,
+	.get_maxrecs		= xfs_rmapbt_get_maxrecs,
+	.init_key_from_rec	= xfs_rmapbt_init_key_from_rec,
+	.init_rec_from_cur	= xfs_rmapbt_init_rec_from_cur,
+	.init_ptr_from_cur	= xfs_rmapbt_init_ptr_from_cur,
+	.key_diff		= xfs_rmapbt_key_diff,
 	.buf_ops		= &xfs_rmapbt_buf_ops,
+#if defined(DEBUG) || defined(XFS_WARN)
+	.keys_inorder		= xfs_rmapbt_keys_inorder,
+	.recs_inorder		= xfs_rmapbt_recs_inorder,
+#endif
 };
 
 /*
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
index 462767f..17fa383 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.h
+++ b/fs/xfs/libxfs/xfs_rmap_btree.h
@@ -52,6 +52,15 @@ struct xfs_btree_cur *xfs_rmapbt_init_cursor(struct xfs_mount *mp,
 int xfs_rmapbt_maxrecs(struct xfs_mount *mp, int blocklen, int leaf);
 extern void xfs_rmapbt_compute_maxlevels(struct xfs_mount *mp);
 
+int xfs_rmap_lookup_le(struct xfs_btree_cur *cur, xfs_agblock_t bno,
+		xfs_extlen_t len, uint64_t owner, uint64_t offset,
+		unsigned int flags, int *stat);
+int xfs_rmap_lookup_eq(struct xfs_btree_cur *cur, xfs_agblock_t bno,
+		xfs_extlen_t len, uint64_t owner, uint64_t offset,
+		unsigned int flags, int *stat);
+int xfs_rmap_get_rec(struct xfs_btree_cur *cur, struct xfs_rmap_irec *irec,
+		int *stat);
+
 int xfs_rmap_alloc(struct xfs_trans *tp, struct xfs_buf *agbp,
 		   xfs_agnumber_t agno, xfs_agblock_t bno, xfs_extlen_t len,
 		   struct xfs_owner_info *oinfo);
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index b4ee9c8..28bd991 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -2470,6 +2470,9 @@ DEFINE_RMAP_EVENT(xfs_rmap_alloc_extent);
 DEFINE_RMAP_EVENT(xfs_rmap_alloc_extent_done);
 DEFINE_RMAP_EVENT(xfs_rmap_alloc_extent_error);
 
+DEFINE_BUSY_EVENT(xfs_rmapbt_alloc_block);
+DEFINE_BUSY_EVENT(xfs_rmapbt_free_block);
+
 #endif /* _TRACE_XFS_H */
 
 #undef TRACE_INCLUDE_PATH


^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 033/119] xfs: support overlapping intervals in the rmap btree
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (31 preceding siblings ...)
  2016-06-17  1:21 ` [PATCH 032/119] xfs: add rmap btree operations Darrick J. Wong
@ 2016-06-17  1:21 ` Darrick J. Wong
  2016-07-08 18:33   ` Brian Foster
  2016-06-17  1:21 ` [PATCH 034/119] xfs: teach rmapbt to support interval queries Darrick J. Wong
                   ` (85 subsequent siblings)
  118 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:21 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Now that the generic btree code supports overlapping intervals, plug
the rmap btree into this functionality.  We will need it to find
potential left neighbors in xfs_rmap_{alloc,free} later in the patch
set.

v2: Fix bit manipulation bug when generating high key offset.
v3: Move unwritten bit to rm_offset.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_rmap_btree.c |   59 +++++++++++++++++++++++++++++++++++++++-
 fs/xfs/libxfs/xfs_rmap_btree.h |   10 +++++--
 2 files changed, 66 insertions(+), 3 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
index c50c725..9adb930 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.c
+++ b/fs/xfs/libxfs/xfs_rmap_btree.c
@@ -181,6 +181,28 @@ xfs_rmapbt_init_key_from_rec(
 }
 
 STATIC void
+xfs_rmapbt_init_high_key_from_rec(
+	union xfs_btree_key	*key,
+	union xfs_btree_rec	*rec)
+{
+	__uint64_t		off;
+	int			adj;
+
+	adj = be32_to_cpu(rec->rmap.rm_blockcount) - 1;
+
+	key->rmap.rm_startblock = rec->rmap.rm_startblock;
+	be32_add_cpu(&key->rmap.rm_startblock, adj);
+	key->rmap.rm_owner = rec->rmap.rm_owner;
+	key->rmap.rm_offset = rec->rmap.rm_offset;
+	if (XFS_RMAP_NON_INODE_OWNER(be64_to_cpu(rec->rmap.rm_owner)) ||
+	    XFS_RMAP_IS_BMBT_BLOCK(be64_to_cpu(rec->rmap.rm_offset)))
+		return;
+	off = be64_to_cpu(key->rmap.rm_offset);
+	off = (XFS_RMAP_OFF(off) + adj) | (off & ~XFS_RMAP_OFF_MASK);
+	key->rmap.rm_offset = cpu_to_be64(off);
+}
+
+STATIC void
 xfs_rmapbt_init_rec_from_cur(
 	struct xfs_btree_cur	*cur,
 	union xfs_btree_rec	*rec)
@@ -235,6 +257,38 @@ xfs_rmapbt_key_diff(
 	return 0;
 }
 
+STATIC __int64_t
+xfs_rmapbt_diff_two_keys(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_key	*k1,
+	union xfs_btree_key	*k2)
+{
+	struct xfs_rmap_key	*kp1 = &k1->rmap;
+	struct xfs_rmap_key	*kp2 = &k2->rmap;
+	__int64_t		d;
+	__u64			x, y;
+
+	d = (__int64_t)be32_to_cpu(kp2->rm_startblock) -
+		       be32_to_cpu(kp1->rm_startblock);
+	if (d)
+		return d;
+
+	x = be64_to_cpu(kp2->rm_owner);
+	y = be64_to_cpu(kp1->rm_owner);
+	if (x > y)
+		return 1;
+	else if (y > x)
+		return -1;
+
+	x = XFS_RMAP_OFF(be64_to_cpu(kp2->rm_offset));
+	y = XFS_RMAP_OFF(be64_to_cpu(kp1->rm_offset));
+	if (x > y)
+		return 1;
+	else if (y > x)
+		return -1;
+	return 0;
+}
+
 static bool
 xfs_rmapbt_verify(
 	struct xfs_buf		*bp)
@@ -350,6 +404,7 @@ xfs_rmapbt_recs_inorder(
 static const struct xfs_btree_ops xfs_rmapbt_ops = {
 	.rec_len		= sizeof(struct xfs_rmap_rec),
 	.key_len		= sizeof(struct xfs_rmap_key),
+	.flags			= XFS_BTREE_OPS_OVERLAPPING,
 
 	.dup_cursor		= xfs_rmapbt_dup_cursor,
 	.set_root		= xfs_rmapbt_set_root,
@@ -358,10 +413,12 @@ static const struct xfs_btree_ops xfs_rmapbt_ops = {
 	.get_minrecs		= xfs_rmapbt_get_minrecs,
 	.get_maxrecs		= xfs_rmapbt_get_maxrecs,
 	.init_key_from_rec	= xfs_rmapbt_init_key_from_rec,
+	.init_high_key_from_rec	= xfs_rmapbt_init_high_key_from_rec,
 	.init_rec_from_cur	= xfs_rmapbt_init_rec_from_cur,
 	.init_ptr_from_cur	= xfs_rmapbt_init_ptr_from_cur,
 	.key_diff		= xfs_rmapbt_key_diff,
 	.buf_ops		= &xfs_rmapbt_buf_ops,
+	.diff_two_keys		= xfs_rmapbt_diff_two_keys,
 #if defined(DEBUG) || defined(XFS_WARN)
 	.keys_inorder		= xfs_rmapbt_keys_inorder,
 	.recs_inorder		= xfs_rmapbt_recs_inorder,
@@ -410,7 +467,7 @@ xfs_rmapbt_maxrecs(
 	if (leaf)
 		return blocklen / sizeof(struct xfs_rmap_rec);
 	return blocklen /
-		(sizeof(struct xfs_rmap_key) + sizeof(xfs_rmap_ptr_t));
+		(2 * sizeof(struct xfs_rmap_key) + sizeof(xfs_rmap_ptr_t));
 }
 
 /* Compute the maximum height of an rmap btree. */
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
index 17fa383..796071c 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.h
+++ b/fs/xfs/libxfs/xfs_rmap_btree.h
@@ -38,12 +38,18 @@ struct xfs_mount;
 #define XFS_RMAP_KEY_ADDR(block, index) \
 	((struct xfs_rmap_key *) \
 		((char *)(block) + XFS_RMAP_BLOCK_LEN + \
-		 ((index) - 1) * sizeof(struct xfs_rmap_key)))
+		 ((index) - 1) * 2 * sizeof(struct xfs_rmap_key)))
+
+#define XFS_RMAP_HIGH_KEY_ADDR(block, index) \
+	((struct xfs_rmap_key *) \
+		((char *)(block) + XFS_RMAP_BLOCK_LEN + \
+		 sizeof(struct xfs_rmap_key) + \
+		 ((index) - 1) * 2 * sizeof(struct xfs_rmap_key)))
 
 #define XFS_RMAP_PTR_ADDR(block, index, maxrecs) \
 	((xfs_rmap_ptr_t *) \
 		((char *)(block) + XFS_RMAP_BLOCK_LEN + \
-		 (maxrecs) * sizeof(struct xfs_rmap_key) + \
+		 (maxrecs) * 2 * sizeof(struct xfs_rmap_key) + \
 		 ((index) - 1) * sizeof(xfs_rmap_ptr_t)))
 
 struct xfs_btree_cur *xfs_rmapbt_init_cursor(struct xfs_mount *mp,


^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 034/119] xfs: teach rmapbt to support interval queries
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (32 preceding siblings ...)
  2016-06-17  1:21 ` [PATCH 033/119] xfs: support overlapping intervals in the rmap btree Darrick J. Wong
@ 2016-06-17  1:21 ` Darrick J. Wong
  2016-07-08 18:34   ` Brian Foster
  2016-06-17  1:21 ` [PATCH 035/119] xfs: add tracepoints for the rmap functions Darrick J. Wong
                   ` (84 subsequent siblings)
  118 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:21 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Now that the generic btree code supports querying all records within a
range of keys, use that functionality to allow us to ask for all the
extents mapped to a range of physical blocks.

v2: Move unwritten bit to rm_offset.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_rmap.c       |   43 ++++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_rmap_btree.h |    9 ++++++++
 2 files changed, 52 insertions(+)


diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
index c6a5a0b..0e1721a 100644
--- a/fs/xfs/libxfs/xfs_rmap.c
+++ b/fs/xfs/libxfs/xfs_rmap.c
@@ -184,3 +184,46 @@ out_error:
 	trace_xfs_rmap_alloc_extent_error(mp, agno, bno, len, false, oinfo);
 	return error;
 }
+
+struct xfs_rmapbt_query_range_info {
+	xfs_rmapbt_query_range_fn	fn;
+	void				*priv;
+};
+
+/* Format btree record and pass to our callback. */
+STATIC int
+xfs_rmapbt_query_range_helper(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_rec	*rec,
+	void			*priv)
+{
+	struct xfs_rmapbt_query_range_info	*query = priv;
+	struct xfs_rmap_irec			irec;
+	int					error;
+
+	error = xfs_rmapbt_btrec_to_irec(rec, &irec);
+	if (error)
+		return error;
+	return query->fn(cur, &irec, query->priv);
+}
+
+/* Find all rmaps between two keys. */
+int
+xfs_rmapbt_query_range(
+	struct xfs_btree_cur		*cur,
+	struct xfs_rmap_irec		*low_rec,
+	struct xfs_rmap_irec		*high_rec,
+	xfs_rmapbt_query_range_fn	fn,
+	void				*priv)
+{
+	union xfs_btree_irec		low_brec;
+	union xfs_btree_irec		high_brec;
+	struct xfs_rmapbt_query_range_info	query;
+
+	low_brec.r = *low_rec;
+	high_brec.r = *high_rec;
+	query.priv = priv;
+	query.fn = fn;
+	return xfs_btree_query_range(cur, &low_brec, &high_brec,
+			xfs_rmapbt_query_range_helper, &query);
+}
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
index 796071c..e926c6e 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.h
+++ b/fs/xfs/libxfs/xfs_rmap_btree.h
@@ -74,4 +74,13 @@ int xfs_rmap_free(struct xfs_trans *tp, struct xfs_buf *agbp,
 		  xfs_agnumber_t agno, xfs_agblock_t bno, xfs_extlen_t len,
 		  struct xfs_owner_info *oinfo);
 
+typedef int (*xfs_rmapbt_query_range_fn)(
+	struct xfs_btree_cur	*cur,
+	struct xfs_rmap_irec	*rec,
+	void			*priv);
+
+int xfs_rmapbt_query_range(struct xfs_btree_cur *cur,
+		struct xfs_rmap_irec *low_rec, struct xfs_rmap_irec *high_rec,
+		xfs_rmapbt_query_range_fn fn, void *priv);
+
 #endif	/* __XFS_RMAP_BTREE_H__ */


^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 035/119] xfs: add tracepoints for the rmap functions
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (33 preceding siblings ...)
  2016-06-17  1:21 ` [PATCH 034/119] xfs: teach rmapbt to support interval queries Darrick J. Wong
@ 2016-06-17  1:21 ` Darrick J. Wong
  2016-07-08 18:34   ` Brian Foster
  2016-06-17  1:21 ` [PATCH 036/119] xfs: add an extent to the rmap btree Darrick J. Wong
                   ` (83 subsequent siblings)
  118 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:21 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_trace.h |   81 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 79 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 28bd991..6daafaf 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -2415,8 +2415,6 @@ DEFINE_DEFER_PENDING_EVENT(xfs_defer_pending_cancel);
 DEFINE_DEFER_PENDING_EVENT(xfs_defer_pending_finish);
 DEFINE_DEFER_PENDING_EVENT(xfs_defer_pending_abort);
 
-DEFINE_MAP_EXTENT_DEFERRED_EVENT(xfs_defer_map_extent);
-
 #define DEFINE_BMAP_FREE_DEFERRED_EVENT DEFINE_PHYS_EXTENT_DEFERRED_EVENT
 DEFINE_BMAP_FREE_DEFERRED_EVENT(xfs_bmap_free_defer);
 DEFINE_BMAP_FREE_DEFERRED_EVENT(xfs_bmap_free_deferred);
@@ -2463,6 +2461,36 @@ DEFINE_EVENT(xfs_rmap_class, name, \
 		 struct xfs_owner_info *oinfo), \
 	TP_ARGS(mp, agno, agbno, len, unwritten, oinfo))
 
+/* simple AG-based error/%ip tracepoint class */
+DECLARE_EVENT_CLASS(xfs_ag_error_class,
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, int error,
+		 unsigned long caller_ip),
+	TP_ARGS(mp, agno, error, caller_ip),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(int, error)
+		__field(unsigned long, caller_ip)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->agno = agno;
+		__entry->error = error;
+		__entry->caller_ip = caller_ip;
+	),
+	TP_printk("dev %d:%d agno %u error %d caller %ps",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->agno,
+		  __entry->error,
+		  (char *)__entry->caller_ip)
+);
+
+#define DEFINE_AG_ERROR_EVENT(name) \
+DEFINE_EVENT(xfs_ag_error_class, name, \
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, int error, \
+		 unsigned long caller_ip), \
+	TP_ARGS(mp, agno, error, caller_ip))
+
 DEFINE_RMAP_EVENT(xfs_rmap_free_extent);
 DEFINE_RMAP_EVENT(xfs_rmap_free_extent_done);
 DEFINE_RMAP_EVENT(xfs_rmap_free_extent_error);
@@ -2470,8 +2498,57 @@ DEFINE_RMAP_EVENT(xfs_rmap_alloc_extent);
 DEFINE_RMAP_EVENT(xfs_rmap_alloc_extent_done);
 DEFINE_RMAP_EVENT(xfs_rmap_alloc_extent_error);
 
+DECLARE_EVENT_CLASS(xfs_rmapbt_class,
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
+		 xfs_agblock_t agbno, xfs_extlen_t len,
+		 uint64_t owner, uint64_t offset, unsigned int flags),
+	TP_ARGS(mp, agno, agbno, len, owner, offset, flags),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_agblock_t, agbno)
+		__field(xfs_extlen_t, len)
+		__field(uint64_t, owner)
+		__field(uint64_t, offset)
+		__field(unsigned int, flags)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->agno = agno;
+		__entry->agbno = agbno;
+		__entry->len = len;
+		__entry->owner = owner;
+		__entry->offset = offset;
+		__entry->flags = flags;
+	),
+	TP_printk("dev %d:%d agno %u agbno %u len %u owner %lld offset %llu flags 0x%x",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->agno,
+		  __entry->agbno,
+		  __entry->len,
+		  __entry->owner,
+		  __entry->offset,
+		  __entry->flags)
+);
+#define DEFINE_RMAPBT_EVENT(name) \
+DEFINE_EVENT(xfs_rmapbt_class, name, \
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, \
+		 xfs_agblock_t agbno, xfs_extlen_t len, \
+		 uint64_t owner, uint64_t offset, unsigned int flags), \
+	TP_ARGS(mp, agno, agbno, len, owner, offset, flags))
+
+#define DEFINE_RMAP_DEFERRED_EVENT DEFINE_MAP_EXTENT_DEFERRED_EVENT
+DEFINE_RMAP_DEFERRED_EVENT(xfs_rmap_defer);
+DEFINE_RMAP_DEFERRED_EVENT(xfs_rmap_deferred);
+
 DEFINE_BUSY_EVENT(xfs_rmapbt_alloc_block);
 DEFINE_BUSY_EVENT(xfs_rmapbt_free_block);
+DEFINE_RMAPBT_EVENT(xfs_rmapbt_update);
+DEFINE_RMAPBT_EVENT(xfs_rmapbt_insert);
+DEFINE_RMAPBT_EVENT(xfs_rmapbt_delete);
+DEFINE_AG_ERROR_EVENT(xfs_rmapbt_insert_error);
+DEFINE_AG_ERROR_EVENT(xfs_rmapbt_delete_error);
+DEFINE_AG_ERROR_EVENT(xfs_rmapbt_update_error);
 
 #endif /* _TRACE_XFS_H */
 


^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 036/119] xfs: add an extent to the rmap btree
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (34 preceding siblings ...)
  2016-06-17  1:21 ` [PATCH 035/119] xfs: add tracepoints for the rmap functions Darrick J. Wong
@ 2016-06-17  1:21 ` Darrick J. Wong
  2016-07-11 18:49   ` Brian Foster
  2016-06-17  1:21 ` [PATCH 037/119] xfs: remove an extent from " Darrick J. Wong
                   ` (82 subsequent siblings)
  118 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:21 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs, Dave Chinner

From: Dave Chinner <dchinner@redhat.com>

Now that all the btree, free space and transaction infrastructure is
in place, we can finally add the code to insert reverse mappings into
the rmap btree. Freeing will be done in a separate patch, so only the
addition operation is covered here.

v2: Update alloc function to handle non-shared file data.  Isolate the
part that makes changes from the part that initializes the rmap
cursor; this will be useful for deferred updates.

[darrick: handle owner offsets when adding rmaps]
[dchinner: remove remaining debug printk statements]
[darrick: move unwritten bit to rm_offset]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
---
 fs/xfs/libxfs/xfs_rmap.c       |  225 +++++++++++++++++++++++++++++++++++++++-
 fs/xfs/libxfs/xfs_rmap_btree.h |    1 
 fs/xfs/xfs_trace.h             |    2 
 3 files changed, 223 insertions(+), 5 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
index 0e1721a..196e952 100644
--- a/fs/xfs/libxfs/xfs_rmap.c
+++ b/fs/xfs/libxfs/xfs_rmap.c
@@ -159,6 +159,218 @@ out_error:
 	return error;
 }
 
+/*
+ * A mergeable rmap should have the same owner, cannot be unwritten, and
+ * must be a bmbt rmap if we're asking about a bmbt rmap.
+ */
+static bool
+xfs_rmap_is_mergeable(
+	struct xfs_rmap_irec	*irec,
+	uint64_t		owner,
+	uint64_t		offset,
+	xfs_extlen_t		len,
+	unsigned int		flags)
+{
+	if (irec->rm_owner == XFS_RMAP_OWN_NULL)
+		return false;
+	if (irec->rm_owner != owner)
+		return false;
+	if ((flags & XFS_RMAP_UNWRITTEN) ^
+	    (irec->rm_flags & XFS_RMAP_UNWRITTEN))
+		return false;
+	if ((flags & XFS_RMAP_ATTR_FORK) ^
+	    (irec->rm_flags & XFS_RMAP_ATTR_FORK))
+		return false;
+	if ((flags & XFS_RMAP_BMBT_BLOCK) ^
+	    (irec->rm_flags & XFS_RMAP_BMBT_BLOCK))
+		return false;
+	return true;
+}
+
+/*
+ * When we allocate a new block, the first thing we do is add a reference to
+ * the extent in the rmap btree. This takes the form of a [agbno, length,
+ * owner, offset] record.  Flags are encoded in the high bits of the offset
+ * field.
+ */
+STATIC int
+__xfs_rmap_alloc(
+	struct xfs_btree_cur	*cur,
+	xfs_agblock_t		bno,
+	xfs_extlen_t		len,
+	bool			unwritten,
+	struct xfs_owner_info	*oinfo)
+{
+	struct xfs_mount	*mp = cur->bc_mp;
+	struct xfs_rmap_irec	ltrec;
+	struct xfs_rmap_irec	gtrec;
+	int			have_gt;
+	int			have_lt;
+	int			error = 0;
+	int			i;
+	uint64_t		owner;
+	uint64_t		offset;
+	unsigned int		flags = 0;
+	bool			ignore_off;
+
+	xfs_owner_info_unpack(oinfo, &owner, &offset, &flags);
+	ignore_off = XFS_RMAP_NON_INODE_OWNER(owner) ||
+			(flags & XFS_RMAP_BMBT_BLOCK);
+	if (unwritten)
+		flags |= XFS_RMAP_UNWRITTEN;
+	trace_xfs_rmap_alloc_extent(mp, cur->bc_private.a.agno, bno, len,
+			unwritten, oinfo);
+
+	/*
+	 * For the initial lookup, look for an exact match or the left-adjacent
+	 * record for our insertion point. This will also give us the record for
+	 * start block contiguity tests.
+	 */
+	error = xfs_rmap_lookup_le(cur, bno, len, owner, offset, flags,
+			&have_lt);
+	if (error)
+		goto out_error;
+	XFS_WANT_CORRUPTED_GOTO(mp, have_lt == 1, out_error);
+
+	error = xfs_rmap_get_rec(cur, &ltrec, &have_lt);
+	if (error)
+		goto out_error;
+	XFS_WANT_CORRUPTED_GOTO(mp, have_lt == 1, out_error);
+	trace_xfs_rmap_lookup_le_range_result(cur->bc_mp,
+			cur->bc_private.a.agno, ltrec.rm_startblock,
+			ltrec.rm_blockcount, ltrec.rm_owner,
+			ltrec.rm_offset, ltrec.rm_flags);
+
+	if (!xfs_rmap_is_mergeable(&ltrec, owner, offset, len, flags))
+		have_lt = 0;
+
+	XFS_WANT_CORRUPTED_GOTO(mp,
+		have_lt == 0 ||
+		ltrec.rm_startblock + ltrec.rm_blockcount <= bno, out_error);
+
+	/*
+	 * Increment the cursor to see if we have a right-adjacent record to our
+	 * insertion point. This will give us the record for end block
+	 * contiguity tests.
+	 */
+	error = xfs_btree_increment(cur, 0, &have_gt);
+	if (error)
+		goto out_error;
+	if (have_gt) {
+		error = xfs_rmap_get_rec(cur, &gtrec, &have_gt);
+		if (error)
+			goto out_error;
+		XFS_WANT_CORRUPTED_GOTO(mp, have_gt == 1, out_error);
+		XFS_WANT_CORRUPTED_GOTO(mp, bno + len <= gtrec.rm_startblock,
+					out_error);
+		trace_xfs_rmap_map_gtrec(cur->bc_mp,
+			cur->bc_private.a.agno, gtrec.rm_startblock,
+			gtrec.rm_blockcount, gtrec.rm_owner,
+			gtrec.rm_offset, gtrec.rm_flags);
+		if (!xfs_rmap_is_mergeable(&gtrec, owner, offset, len, flags))
+			have_gt = 0;
+	}
+
+	/*
+	 * Note: cursor currently points one record to the right of ltrec, even
+	 * if there is no record in the tree to the right.
+	 */
+	if (have_lt &&
+	    ltrec.rm_startblock + ltrec.rm_blockcount == bno &&
+	    (ignore_off || ltrec.rm_offset + ltrec.rm_blockcount == offset)) {
+		/*
+		 * left edge contiguous, merge into left record.
+		 *
+		 *       ltbno     ltlen
+		 * orig:   |ooooooooo|
+		 * adding:           |aaaaaaaaa|
+		 * result: |rrrrrrrrrrrrrrrrrrr|
+		 *                  bno       len
+		 */
+		ltrec.rm_blockcount += len;
+		if (have_gt &&
+		    bno + len == gtrec.rm_startblock &&
+		    (ignore_off || offset + len == gtrec.rm_offset) &&
+		    (unsigned long)ltrec.rm_blockcount + len +
+				gtrec.rm_blockcount <= XFS_RMAP_LEN_MAX) {
+			/*
+			 * right edge also contiguous, delete right record
+			 * and merge into left record.
+			 *
+			 *       ltbno     ltlen    gtbno     gtlen
+			 * orig:   |ooooooooo|         |ooooooooo|
+			 * adding:           |aaaaaaaaa|
+			 * result: |rrrrrrrrrrrrrrrrrrrrrrrrrrrrr|
+			 */
+			ltrec.rm_blockcount += gtrec.rm_blockcount;
+			trace_xfs_rmapbt_delete(mp, cur->bc_private.a.agno,
+					gtrec.rm_startblock,
+					gtrec.rm_blockcount,
+					gtrec.rm_owner,
+					gtrec.rm_offset,
+					gtrec.rm_flags);
+			error = xfs_btree_delete(cur, &i);
+			if (error)
+				goto out_error;
+			XFS_WANT_CORRUPTED_GOTO(mp, i == 1, out_error);
+		}
+
+		/* point the cursor back to the left record and update */
+		error = xfs_btree_decrement(cur, 0, &have_gt);
+		if (error)
+			goto out_error;
+		error = xfs_rmap_update(cur, &ltrec);
+		if (error)
+			goto out_error;
+	} else if (have_gt &&
+		   bno + len == gtrec.rm_startblock &&
+		   (ignore_off || offset + len == gtrec.rm_offset)) {
+		/*
+		 * right edge contiguous, merge into right record.
+		 *
+		 *                 gtbno     gtlen
+		 * Orig:             |ooooooooo|
+		 * adding: |aaaaaaaaa|
+		 * Result: |rrrrrrrrrrrrrrrrrrr|
+		 *        bno       len
+		 */
+		gtrec.rm_startblock = bno;
+		gtrec.rm_blockcount += len;
+		if (!ignore_off)
+			gtrec.rm_offset = offset;
+		error = xfs_rmap_update(cur, &gtrec);
+		if (error)
+			goto out_error;
+	} else {
+		/*
+		 * no contiguous edge with identical owner, insert
+		 * new record at current cursor position.
+		 */
+		cur->bc_rec.r.rm_startblock = bno;
+		cur->bc_rec.r.rm_blockcount = len;
+		cur->bc_rec.r.rm_owner = owner;
+		cur->bc_rec.r.rm_offset = offset;
+		cur->bc_rec.r.rm_flags = flags;
+		trace_xfs_rmapbt_insert(mp, cur->bc_private.a.agno, bno, len,
+			owner, offset, flags);
+		error = xfs_btree_insert(cur, &i);
+		if (error)
+			goto out_error;
+		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, out_error);
+	}
+
+	trace_xfs_rmap_alloc_extent_done(mp, cur->bc_private.a.agno, bno, len,
+			unwritten, oinfo);
+out_error:
+	if (error)
+		trace_xfs_rmap_alloc_extent_error(mp, cur->bc_private.a.agno,
+				bno, len, unwritten, oinfo);
+	return error;
+}
+
+/*
+ * Add a reference to an extent in the rmap btree.
+ */
 int
 xfs_rmap_alloc(
 	struct xfs_trans	*tp,
@@ -169,19 +381,22 @@ xfs_rmap_alloc(
 	struct xfs_owner_info	*oinfo)
 {
 	struct xfs_mount	*mp = tp->t_mountp;
-	int			error = 0;
+	struct xfs_btree_cur	*cur;
+	int			error;
 
 	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
 		return 0;
 
-	trace_xfs_rmap_alloc_extent(mp, agno, bno, len, false, oinfo);
-	if (1)
+	cur = xfs_rmapbt_init_cursor(mp, tp, agbp, agno);
+	error = __xfs_rmap_alloc(cur, bno, len, false, oinfo);
+	if (error)
 		goto out_error;
-	trace_xfs_rmap_alloc_extent_done(mp, agno, bno, len, false, oinfo);
+
+	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
 	return 0;
 
 out_error:
-	trace_xfs_rmap_alloc_extent_error(mp, agno, bno, len, false, oinfo);
+	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
 	return error;
 }
 
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
index e926c6e..9d92da5 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.h
+++ b/fs/xfs/libxfs/xfs_rmap_btree.h
@@ -67,6 +67,7 @@ int xfs_rmap_lookup_eq(struct xfs_btree_cur *cur, xfs_agblock_t bno,
 int xfs_rmap_get_rec(struct xfs_btree_cur *cur, struct xfs_rmap_irec *irec,
 		int *stat);
 
+/* functions for updating the rmapbt for bmbt blocks and AG btree blocks */
 int xfs_rmap_alloc(struct xfs_trans *tp, struct xfs_buf *agbp,
 		   xfs_agnumber_t agno, xfs_agblock_t bno, xfs_extlen_t len,
 		   struct xfs_owner_info *oinfo);
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 6daafaf..3ebceb0 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -2549,6 +2549,8 @@ DEFINE_RMAPBT_EVENT(xfs_rmapbt_delete);
 DEFINE_AG_ERROR_EVENT(xfs_rmapbt_insert_error);
 DEFINE_AG_ERROR_EVENT(xfs_rmapbt_delete_error);
 DEFINE_AG_ERROR_EVENT(xfs_rmapbt_update_error);
+DEFINE_RMAPBT_EVENT(xfs_rmap_lookup_le_range_result);
+DEFINE_RMAPBT_EVENT(xfs_rmap_map_gtrec);
 
 #endif /* _TRACE_XFS_H */
 


^ permalink raw reply related	[flat|nested] 236+ messages in thread
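
[Editor's illustration] The left/right contiguity tests in __xfs_rmap_alloc above can be sketched as a standalone userspace routine. The types and helper names below are hypothetical simplifications, not the kernel API; the owner check stands in for xfs_rmap_is_mergeable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for struct xfs_rmap_irec; illustrative only. */
struct rmap_rec {
	uint32_t startblock;
	uint32_t blockcount;
	uint64_t owner;
	uint64_t offset;
};

#define RMAP_LEN_MAX	UINT32_MAX

/*
 * Decide how a newly mapped extent [bno, bno + len) combines with its
 * neighbors: 0 = insert a new record, 1 = merge into the left record,
 * 2 = merge into the right record, 3 = merge both (the right record
 * would be deleted, as in the patch).
 */
static int merge_decision(const struct rmap_rec *lt, const struct rmap_rec *gt,
			  uint32_t bno, uint32_t len, uint64_t offset,
			  uint64_t owner)
{
	bool left = lt && lt->owner == owner &&
		    lt->startblock + lt->blockcount == bno &&
		    lt->offset + lt->blockcount == offset;
	bool right = gt && gt->owner == owner &&
		     bno + len == gt->startblock &&
		     offset + len == gt->offset;

	/* Widen before adding so the length-overflow test cannot wrap. */
	if (left && right &&
	    (uint64_t)lt->blockcount + len + gt->blockcount <= RMAP_LEN_MAX)
		return 3;
	if (left)
		return 1;
	if (right)
		return 2;
	return 0;
}
```

With a left record covering blocks [10, 15) at offset 100 and a right record at [20, 24) offset 110, mapping [15, 20) at offset 105 merges both; an owner mismatch forces a plain insert.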

* [PATCH 037/119] xfs: remove an extent from the rmap btree
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (35 preceding siblings ...)
  2016-06-17  1:21 ` [PATCH 036/119] xfs: add an extent to the rmap btree Darrick J. Wong
@ 2016-06-17  1:21 ` Darrick J. Wong
  2016-07-11 18:49   ` Brian Foster
  2016-06-17  1:21 ` [PATCH 038/119] xfs: convert unwritten status of reverse mappings Darrick J. Wong
                   ` (81 subsequent siblings)
  118 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:21 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs, Dave Chinner

From: Dave Chinner <dchinner@redhat.com>

Now that we have records in the rmap btree, we need to remove them
when extents are freed. This needs to find the relevant record in
the btree and remove/trim/split it accordingly.

v2: Update the free function to deal with non-shared file data, and
isolate the part that does the rmap update from the part that deals
with cursors.  This will be useful for deferred ops.

[darrick.wong@oracle.com: make rmap routines handle the enlarged keyspace]
[dchinner: remove remaining unused debug printks]
[darrick: fix a bug when growfs in an AG with an rmap ending at EOFS]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
---
 fs/xfs/libxfs/xfs_rmap.c |  220 +++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 215 insertions(+), 5 deletions(-)
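[Editor's illustration] The remove/trim/split logic this patch adds can be modelled outside the kernel with a small interval classifier. Everything below is a hypothetical sketch of the four cases in __xfs_rmap_free, not the kernel code itself.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Classify how freeing [bno, bno + len) affects an existing record
 * [start, start + count); the caller has already verified full coverage.
 */
enum free_case { FREE_EXACT, FREE_LEFT, FREE_RIGHT, FREE_MIDDLE };

static enum free_case classify_free(uint32_t start, uint32_t count,
				    uint32_t bno, uint32_t len)
{
	if (start == bno && count == len)
		return FREE_EXACT;	/* delete the record outright */
	if (start == bno)
		return FREE_LEFT;	/* move start forward, trim length */
	if (start + count == bno + len)
		return FREE_RIGHT;	/* trim length only */
	return FREE_MIDDLE;		/* split into two records */
}

/* For the middle case, compute the two surviving pieces. */
static void split_middle(uint32_t start, uint32_t count,
			 uint32_t bno, uint32_t len,
			 uint32_t *left_count, uint32_t *right_start,
			 uint32_t *right_count)
{
	*left_count = bno - start;		/* left piece keeps start */
	*right_start = bno + len;
	*right_count = count - len - *left_count;
}
```

For a record [10, 30), freeing [15, 20) is the middle case: the left piece keeps 5 blocks and a new record [20, 30) carries the remaining 10, matching the "orig_len - len - ltrec.rm_blockcount" arithmetic in the patch.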


diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
index 196e952..1043c63 100644
--- a/fs/xfs/libxfs/xfs_rmap.c
+++ b/fs/xfs/libxfs/xfs_rmap.c
@@ -133,6 +133,212 @@ xfs_rmap_get_rec(
 	return xfs_rmapbt_btrec_to_irec(rec, irec);
 }
 
+/*
+ * Find the extent in the rmap btree and remove it.
+ *
+ * The record we find should always be an exact match for the extent that we're
+ * looking for, since we insert them into the btree without modification.
+ *
+ * Special Case #1: when growing the filesystem, we "free" an extent when
+ * growing the last AG. This extent is new space and so it is not tracked as
+ * used space in the btree. The growfs code will pass in an owner of
+ * XFS_RMAP_OWN_NULL to indicate that there is expected to be no owner of
+ * this extent. We verify this by checking that the extent lookup results
+ * in a record that does not overlap.
+ *
+ * Special Case #2: EFIs do not record the owner of the extent, so when
+ * recovering EFIs from the log we pass in XFS_RMAP_OWN_UNKNOWN to tell the rmap
+ * btree to ignore the owner (i.e. wildcard match) so we don't trigger
+ * corruption checks during log recovery.
+ */
+STATIC int
+__xfs_rmap_free(
+	struct xfs_btree_cur	*cur,
+	xfs_agblock_t		bno,
+	xfs_extlen_t		len,
+	bool			unwritten,
+	struct xfs_owner_info	*oinfo)
+{
+	struct xfs_mount	*mp = cur->bc_mp;
+	struct xfs_rmap_irec	ltrec;
+	uint64_t		ltoff;
+	int			error = 0;
+	int			i;
+	uint64_t		owner;
+	uint64_t		offset;
+	unsigned int		flags;
+	bool			ignore_off;
+
+	xfs_owner_info_unpack(oinfo, &owner, &offset, &flags);
+	ignore_off = XFS_RMAP_NON_INODE_OWNER(owner) ||
+			(flags & XFS_RMAP_BMBT_BLOCK);
+	if (unwritten)
+		flags |= XFS_RMAP_UNWRITTEN;
+	trace_xfs_rmap_free_extent(mp, cur->bc_private.a.agno, bno, len,
+			unwritten, oinfo);
+
+	/*
+	 * We should always have a left record because there's a static record
+	 * for the AG headers at rm_startblock == 0 created by mkfs/growfs that
+	 * will not ever be removed from the tree.
+	 */
+	error = xfs_rmap_lookup_le(cur, bno, len, owner, offset, flags, &i);
+	if (error)
+		goto out_error;
+	XFS_WANT_CORRUPTED_GOTO(mp, i == 1, out_error);
+
+	error = xfs_rmap_get_rec(cur, &ltrec, &i);
+	if (error)
+		goto out_error;
+	XFS_WANT_CORRUPTED_GOTO(mp, i == 1, out_error);
+	trace_xfs_rmap_lookup_le_range_result(cur->bc_mp,
+			cur->bc_private.a.agno, ltrec.rm_startblock,
+			ltrec.rm_blockcount, ltrec.rm_owner,
+			ltrec.rm_offset, ltrec.rm_flags);
+	ltoff = ltrec.rm_offset;
+
+	/*
+	 * For growfs, the incoming extent must be beyond the left record we
+	 * just found as it is new space and won't be used by anyone. This is
+	 * just a corruption check as we don't actually do anything with this
+	 * extent.  Note that we need to use >= instead of > because it might
+	 * be the case that the "left" extent goes all the way to EOFS.
+	 */
+	if (owner == XFS_RMAP_OWN_NULL) {
+		XFS_WANT_CORRUPTED_GOTO(mp, bno >= ltrec.rm_startblock +
+						ltrec.rm_blockcount, out_error);
+		goto out_done;
+	}
+
+	/* Make sure the unwritten flag matches. */
+	XFS_WANT_CORRUPTED_GOTO(mp, (flags & XFS_RMAP_UNWRITTEN) ==
+			(ltrec.rm_flags & XFS_RMAP_UNWRITTEN), out_error);
+
+	/* Make sure the extent we found covers the entire freeing range. */
+	XFS_WANT_CORRUPTED_GOTO(mp, ltrec.rm_startblock <= bno &&
+		ltrec.rm_startblock + ltrec.rm_blockcount >=
+		bno + len, out_error);
+
+	/* Make sure the owner matches what we expect to find in the tree. */
+	XFS_WANT_CORRUPTED_GOTO(mp, owner == ltrec.rm_owner ||
+				    XFS_RMAP_NON_INODE_OWNER(owner), out_error);
+
+	/* Check the offset, if necessary. */
+	if (!XFS_RMAP_NON_INODE_OWNER(owner)) {
+		if (flags & XFS_RMAP_BMBT_BLOCK) {
+			XFS_WANT_CORRUPTED_GOTO(mp,
+					ltrec.rm_flags & XFS_RMAP_BMBT_BLOCK,
+					out_error);
+		} else {
+			XFS_WANT_CORRUPTED_GOTO(mp,
+					ltrec.rm_offset <= offset, out_error);
+			XFS_WANT_CORRUPTED_GOTO(mp,
+					ltoff + ltrec.rm_blockcount >= offset + len,
+					out_error);
+		}
+	}
+
+	if (ltrec.rm_startblock == bno && ltrec.rm_blockcount == len) {
+		/* exact match, simply remove the record from rmap tree */
+		trace_xfs_rmapbt_delete(mp, cur->bc_private.a.agno,
+				ltrec.rm_startblock, ltrec.rm_blockcount,
+				ltrec.rm_owner, ltrec.rm_offset,
+				ltrec.rm_flags);
+		error = xfs_btree_delete(cur, &i);
+		if (error)
+			goto out_error;
+		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, out_error);
+	} else if (ltrec.rm_startblock == bno) {
+		/*
+		 * overlap left hand side of extent: move the start, trim the
+		 * length and update the current record.
+		 *
+		 *       ltbno                ltlen
+		 * Orig:    |oooooooooooooooooooo|
+		 * Freeing: |fffffffff|
+		 * Result:            |rrrrrrrrrr|
+		 *         bno       len
+		 */
+		ltrec.rm_startblock += len;
+		ltrec.rm_blockcount -= len;
+		if (!ignore_off)
+			ltrec.rm_offset += len;
+		error = xfs_rmap_update(cur, &ltrec);
+		if (error)
+			goto out_error;
+	} else if (ltrec.rm_startblock + ltrec.rm_blockcount == bno + len) {
+		/*
+		 * overlap right hand side of extent: trim the length and update
+		 * the current record.
+		 *
+		 *       ltbno                ltlen
+		 * Orig:    |oooooooooooooooooooo|
+		 * Freeing:            |fffffffff|
+		 * Result:  |rrrrrrrrrr|
+		 *                    bno       len
+		 */
+		ltrec.rm_blockcount -= len;
+		error = xfs_rmap_update(cur, &ltrec);
+		if (error)
+			goto out_error;
+	} else {
+
+		/*
+		 * overlap middle of extent: trim the length of the existing
+		 * record to the length of the new left-extent size, increment
+		 * the insertion position so we can insert a new record
+		 * containing the remaining right-extent space.
+		 *
+		 *       ltbno                ltlen
+		 * Orig:    |oooooooooooooooooooo|
+		 * Freeing:       |fffffffff|
+		 * Result:  |rrrrr|         |rrrr|
+		 *               bno       len
+		 */
+		xfs_extlen_t	orig_len = ltrec.rm_blockcount;
+
+		ltrec.rm_blockcount = bno - ltrec.rm_startblock;
+		error = xfs_rmap_update(cur, &ltrec);
+		if (error)
+			goto out_error;
+
+		error = xfs_btree_increment(cur, 0, &i);
+		if (error)
+			goto out_error;
+
+		cur->bc_rec.r.rm_startblock = bno + len;
+		cur->bc_rec.r.rm_blockcount = orig_len - len -
+						     ltrec.rm_blockcount;
+		cur->bc_rec.r.rm_owner = ltrec.rm_owner;
+		if (ignore_off)
+			cur->bc_rec.r.rm_offset = 0;
+		else
+			cur->bc_rec.r.rm_offset = offset + len;
+		cur->bc_rec.r.rm_flags = flags;
+		trace_xfs_rmapbt_insert(mp, cur->bc_private.a.agno,
+				cur->bc_rec.r.rm_startblock,
+				cur->bc_rec.r.rm_blockcount,
+				cur->bc_rec.r.rm_owner,
+				cur->bc_rec.r.rm_offset,
+				cur->bc_rec.r.rm_flags);
+		error = xfs_btree_insert(cur, &i);
+		if (error)
+			goto out_error;
+	}
+
+out_done:
+	trace_xfs_rmap_free_extent_done(mp, cur->bc_private.a.agno, bno, len,
+			unwritten, oinfo);
+out_error:
+	if (error)
+		trace_xfs_rmap_free_extent_error(mp, cur->bc_private.a.agno,
+				bno, len, unwritten, oinfo);
+	return error;
+}
+
+/*
+ * Remove a reference to an extent in the rmap btree.
+ */
 int
 xfs_rmap_free(
 	struct xfs_trans	*tp,
@@ -143,19 +349,23 @@ xfs_rmap_free(
 	struct xfs_owner_info	*oinfo)
 {
 	struct xfs_mount	*mp = tp->t_mountp;
-	int			error = 0;
+	struct xfs_btree_cur	*cur;
+	int			error;
 
 	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
 		return 0;
 
-	trace_xfs_rmap_free_extent(mp, agno, bno, len, false, oinfo);
-	if (1)
+	cur = xfs_rmapbt_init_cursor(mp, tp, agbp, agno);
+
+	error = __xfs_rmap_free(cur, bno, len, false, oinfo);
+	if (error)
 		goto out_error;
-	trace_xfs_rmap_free_extent_done(mp, agno, bno, len, false, oinfo);
+
+	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
 	return 0;
 
 out_error:
-	trace_xfs_rmap_free_extent_error(mp, agno, bno, len, false, oinfo);
+	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
 	return error;
 }
 



* [PATCH 038/119] xfs: convert unwritten status of reverse mappings
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (36 preceding siblings ...)
  2016-06-17  1:21 ` [PATCH 037/119] xfs: remove an extent from " Darrick J. Wong
@ 2016-06-17  1:21 ` Darrick J. Wong
  2016-06-30  0:15   ` Darrick J. Wong
  2016-07-13 18:27   ` Brian Foster
  2016-06-17  1:22 ` [PATCH 039/119] xfs: add rmap btree insert and delete helpers Darrick J. Wong
                   ` (80 subsequent siblings)
  118 siblings, 2 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:21 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Provide a function to convert an unwritten extent to a real one and
vice versa.

v2: Move unwritten bit to rm_offset.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_rmap.c |  442 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_trace.h       |    6 +
 2 files changed, 448 insertions(+)
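[Editor's illustration] The conversion below switches on a bitmask of FILLING and CONTIG state bits. The sketch here recomputes that mask in isolation, with the neighbor-merge tests reduced to precomputed booleans for brevity; the bit values match the patch, but the function is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirror of the patch's state bits (same values as in the diff). */
#define RMAP_LEFT_CONTIG	(1 << 0)
#define RMAP_RIGHT_CONTIG	(1 << 1)
#define RMAP_LEFT_FILLING	(1 << 2)
#define RMAP_RIGHT_FILLING	(1 << 3)

/*
 * Compute the switch key: which edges of the previous record
 * [prev_off, prev_off + prev_len) the converted range [off, off + len)
 * reaches, plus whether the left/right neighbors merge with it.
 */
static int convert_state(uint64_t prev_off, uint32_t prev_len,
			 uint64_t off, uint32_t len,
			 bool left_merges, bool right_merges)
{
	int state = 0;

	if (prev_off == off)
		state |= RMAP_LEFT_FILLING;
	if (prev_off + prev_len == off + len)
		state |= RMAP_RIGHT_FILLING;
	if (left_merges)
		state |= RMAP_LEFT_CONTIG;
	if (right_merges)
		state |= RMAP_RIGHT_CONTIG;
	return state;
}
```

Converting the whole previous record with no mergeable neighbors yields LEFT_FILLING | RIGHT_FILLING (the "update in place" case); converting a strict middle slice yields state 0, the one-extent-becomes-three case.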


diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
index 1043c63..53ba14e 100644
--- a/fs/xfs/libxfs/xfs_rmap.c
+++ b/fs/xfs/libxfs/xfs_rmap.c
@@ -610,6 +610,448 @@ out_error:
 	return error;
 }
 
+#define RMAP_LEFT_CONTIG	(1 << 0)
+#define RMAP_RIGHT_CONTIG	(1 << 1)
+#define RMAP_LEFT_FILLING	(1 << 2)
+#define RMAP_RIGHT_FILLING	(1 << 3)
+#define RMAP_LEFT_VALID		(1 << 6)
+#define RMAP_RIGHT_VALID	(1 << 7)
+
+#define LEFT		r[0]
+#define RIGHT		r[1]
+#define PREV		r[2]
+#define NEW		r[3]
+
+/*
+ * Convert an unwritten extent to a real extent or vice versa.
+ * Does not handle overlapping extents.
+ */
+STATIC int
+__xfs_rmap_convert(
+	struct xfs_btree_cur	*cur,
+	xfs_agblock_t		bno,
+	xfs_extlen_t		len,
+	bool			unwritten,
+	struct xfs_owner_info	*oinfo)
+{
+	struct xfs_mount	*mp = cur->bc_mp;
+	struct xfs_rmap_irec	r[4];	/* neighbor extent entries */
+					/* left is 0, right is 1, prev is 2 */
+					/* new is 3 */
+	uint64_t		owner;
+	uint64_t		offset;
+	uint64_t		new_endoff;
+	unsigned int		oldext;
+	unsigned int		newext;
+	unsigned int		flags = 0;
+	int			i;
+	int			state = 0;
+	int			error;
+
+	xfs_owner_info_unpack(oinfo, &owner, &offset, &flags);
+	ASSERT(!(XFS_RMAP_NON_INODE_OWNER(owner) ||
+			(flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))));
+	oldext = unwritten ? XFS_RMAP_UNWRITTEN : 0;
+	new_endoff = offset + len;
+	trace_xfs_rmap_convert(mp, cur->bc_private.a.agno, bno, len,
+			unwritten, oinfo);
+
+	/*
+	 * For the initial lookup, look for an exact match or the left-adjacent
+	 * record for our insertion point. This will also give us the record for
+	 * start block contiguity tests.
+	 */
+	error = xfs_rmap_lookup_le(cur, bno, len, owner, offset, oldext, &i);
+	if (error)
+		goto done;
+	XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
+
+	error = xfs_rmap_get_rec(cur, &PREV, &i);
+	if (error)
+		goto done;
+	XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
+	trace_xfs_rmap_lookup_le_range_result(cur->bc_mp,
+			cur->bc_private.a.agno, PREV.rm_startblock,
+			PREV.rm_blockcount, PREV.rm_owner,
+			PREV.rm_offset, PREV.rm_flags);
+
+	ASSERT(PREV.rm_offset <= offset);
+	ASSERT(PREV.rm_offset + PREV.rm_blockcount >= new_endoff);
+	ASSERT((PREV.rm_flags & XFS_RMAP_UNWRITTEN) == oldext);
+	newext = ~oldext & XFS_RMAP_UNWRITTEN;
+
+	/*
+	 * Set flags determining what part of the previous oldext allocation
+	 * extent is being replaced by a newext allocation.
+	 */
+	if (PREV.rm_offset == offset)
+		state |= RMAP_LEFT_FILLING;
+	if (PREV.rm_offset + PREV.rm_blockcount == new_endoff)
+		state |= RMAP_RIGHT_FILLING;
+
+	/*
+	 * Decrement the cursor to see if we have a left-adjacent record to our
+	 * insertion point. This will give us the record for end block
+	 * contiguity tests.
+	 */
+	error = xfs_btree_decrement(cur, 0, &i);
+	if (error)
+		goto done;
+	if (i) {
+		state |= RMAP_LEFT_VALID;
+		error = xfs_rmap_get_rec(cur, &LEFT, &i);
+		if (error)
+			goto done;
+		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
+		XFS_WANT_CORRUPTED_GOTO(mp,
+				LEFT.rm_startblock + LEFT.rm_blockcount <= bno,
+				done);
+		trace_xfs_rmap_find_left_neighbor_result(cur->bc_mp,
+				cur->bc_private.a.agno, LEFT.rm_startblock,
+				LEFT.rm_blockcount, LEFT.rm_owner,
+				LEFT.rm_offset, LEFT.rm_flags);
+		if (LEFT.rm_startblock + LEFT.rm_blockcount == bno &&
+		    LEFT.rm_offset + LEFT.rm_blockcount == offset &&
+		    xfs_rmap_is_mergeable(&LEFT, owner, offset, len, newext))
+			state |= RMAP_LEFT_CONTIG;
+	}
+
+	/*
+	 * Increment the cursor to see if we have a right-adjacent record to our
+	 * insertion point. This will give us the record for end block
+	 * contiguity tests.
+	 */
+	error = xfs_btree_increment(cur, 0, &i);
+	if (error)
+		goto done;
+	XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
+	error = xfs_btree_increment(cur, 0, &i);
+	if (error)
+		goto done;
+	if (i) {
+		state |= RMAP_RIGHT_VALID;
+		error = xfs_rmap_get_rec(cur, &RIGHT, &i);
+		if (error)
+			goto done;
+		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
+		XFS_WANT_CORRUPTED_GOTO(mp, bno + len <= RIGHT.rm_startblock,
+					done);
+		trace_xfs_rmap_convert_gtrec(cur->bc_mp,
+				cur->bc_private.a.agno, RIGHT.rm_startblock,
+				RIGHT.rm_blockcount, RIGHT.rm_owner,
+				RIGHT.rm_offset, RIGHT.rm_flags);
+		if (bno + len == RIGHT.rm_startblock &&
+		    offset + len == RIGHT.rm_offset &&
+		    xfs_rmap_is_mergeable(&RIGHT, owner, offset, len, newext))
+			state |= RMAP_RIGHT_CONTIG;
+	}
+
+	/* check that left + prev + right is not too long */
+	if ((state & (RMAP_LEFT_FILLING | RMAP_LEFT_CONTIG |
+			 RMAP_RIGHT_FILLING | RMAP_RIGHT_CONTIG)) ==
+	    (RMAP_LEFT_FILLING | RMAP_LEFT_CONTIG |
+	     RMAP_RIGHT_FILLING | RMAP_RIGHT_CONTIG) &&
+	    (unsigned long)LEFT.rm_blockcount + len +
+	     RIGHT.rm_blockcount > XFS_RMAP_LEN_MAX)
+		state &= ~RMAP_RIGHT_CONTIG;
+
+	trace_xfs_rmap_convert_state(mp, cur->bc_private.a.agno, state,
+			_RET_IP_);
+
+	/* reset the cursor back to PREV */
+	error = xfs_rmap_lookup_le(cur, bno, len, owner, offset, oldext, &i);
+	if (error)
+		goto done;
+	XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
+
+	/*
+	 * Switch out based on the FILLING and CONTIG state bits.
+	 */
+	switch (state & (RMAP_LEFT_FILLING | RMAP_LEFT_CONTIG |
+			 RMAP_RIGHT_FILLING | RMAP_RIGHT_CONTIG)) {
+	case RMAP_LEFT_FILLING | RMAP_LEFT_CONTIG |
+	     RMAP_RIGHT_FILLING | RMAP_RIGHT_CONTIG:
+		/*
+		 * Setting all of a previous oldext extent to newext.
+		 * The left and right neighbors are both contiguous with new.
+		 */
+		error = xfs_btree_increment(cur, 0, &i);
+		if (error)
+			goto done;
+		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
+		trace_xfs_rmapbt_delete(mp, cur->bc_private.a.agno,
+				RIGHT.rm_startblock, RIGHT.rm_blockcount,
+				RIGHT.rm_owner, RIGHT.rm_offset,
+				RIGHT.rm_flags);
+		error = xfs_btree_delete(cur, &i);
+		if (error)
+			goto done;
+		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
+		error = xfs_btree_decrement(cur, 0, &i);
+		if (error)
+			goto done;
+		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
+		trace_xfs_rmapbt_delete(mp, cur->bc_private.a.agno,
+				PREV.rm_startblock, PREV.rm_blockcount,
+				PREV.rm_owner, PREV.rm_offset,
+				PREV.rm_flags);
+		error = xfs_btree_delete(cur, &i);
+		if (error)
+			goto done;
+		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
+		error = xfs_btree_decrement(cur, 0, &i);
+		if (error)
+			goto done;
+		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
+		NEW = LEFT;
+		NEW.rm_blockcount += PREV.rm_blockcount + RIGHT.rm_blockcount;
+		error = xfs_rmap_update(cur, &NEW);
+		if (error)
+			goto done;
+		break;
+
+	case RMAP_LEFT_FILLING | RMAP_RIGHT_FILLING | RMAP_LEFT_CONTIG:
+		/*
+		 * Setting all of a previous oldext extent to newext.
+		 * The left neighbor is contiguous, the right is not.
+		 */
+		trace_xfs_rmapbt_delete(mp, cur->bc_private.a.agno,
+				PREV.rm_startblock, PREV.rm_blockcount,
+				PREV.rm_owner, PREV.rm_offset,
+				PREV.rm_flags);
+		error = xfs_btree_delete(cur, &i);
+		if (error)
+			goto done;
+		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
+		error = xfs_btree_decrement(cur, 0, &i);
+		if (error)
+			goto done;
+		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
+		NEW = LEFT;
+		NEW.rm_blockcount += PREV.rm_blockcount;
+		error = xfs_rmap_update(cur, &NEW);
+		if (error)
+			goto done;
+		break;
+
+	case RMAP_LEFT_FILLING | RMAP_RIGHT_FILLING | RMAP_RIGHT_CONTIG:
+		/*
+		 * Setting all of a previous oldext extent to newext.
+		 * The right neighbor is contiguous, the left is not.
+		 */
+		error = xfs_btree_increment(cur, 0, &i);
+		if (error)
+			goto done;
+		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
+		trace_xfs_rmapbt_delete(mp, cur->bc_private.a.agno,
+				RIGHT.rm_startblock, RIGHT.rm_blockcount,
+				RIGHT.rm_owner, RIGHT.rm_offset,
+				RIGHT.rm_flags);
+		error = xfs_btree_delete(cur, &i);
+		if (error)
+			goto done;
+		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
+		error = xfs_btree_decrement(cur, 0, &i);
+		if (error)
+			goto done;
+		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
+		NEW.rm_startblock = bno;
+		NEW.rm_owner = owner;
+		NEW.rm_offset = offset;
+		NEW.rm_blockcount = len + RIGHT.rm_blockcount;
+		NEW.rm_flags = newext;
+		error = xfs_rmap_update(cur, &NEW);
+		if (error)
+			goto done;
+		break;
+
+	case RMAP_LEFT_FILLING | RMAP_RIGHT_FILLING:
+		/*
+		 * Setting all of a previous oldext extent to newext.
+		 * Neither the left nor right neighbors are contiguous with
+		 * the new one.
+		 */
+		NEW = PREV;
+		NEW.rm_flags = newext;
+		error = xfs_rmap_update(cur, &NEW);
+		if (error)
+			goto done;
+		break;
+
+	case RMAP_LEFT_FILLING | RMAP_LEFT_CONTIG:
+		/*
+		 * Setting the first part of a previous oldext extent to newext.
+		 * The left neighbor is contiguous.
+		 */
+		NEW = PREV;
+		NEW.rm_offset += len;
+		NEW.rm_startblock += len;
+		NEW.rm_blockcount -= len;
+		error = xfs_rmap_update(cur, &NEW);
+		if (error)
+			goto done;
+		error = xfs_btree_decrement(cur, 0, &i);
+		if (error)
+			goto done;
+		NEW = LEFT;
+		NEW.rm_blockcount += len;
+		error = xfs_rmap_update(cur, &NEW);
+		if (error)
+			goto done;
+		break;
+
+	case RMAP_LEFT_FILLING:
+		/*
+		 * Setting the first part of a previous oldext extent to newext.
+		 * The left neighbor is not contiguous.
+		 */
+		NEW = PREV;
+		NEW.rm_startblock += len;
+		NEW.rm_offset += len;
+		NEW.rm_blockcount -= len;
+		error = xfs_rmap_update(cur, &NEW);
+		if (error)
+			goto done;
+		NEW.rm_startblock = bno;
+		NEW.rm_owner = owner;
+		NEW.rm_offset = offset;
+		NEW.rm_blockcount = len;
+		NEW.rm_flags = newext;
+		cur->bc_rec.r = NEW;
+		trace_xfs_rmapbt_insert(mp, cur->bc_private.a.agno, bno,
+				len, owner, offset, newext);
+		error = xfs_btree_insert(cur, &i);
+		if (error)
+			goto done;
+		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
+		break;
+
+	case RMAP_RIGHT_FILLING | RMAP_RIGHT_CONTIG:
+		/*
+		 * Setting the last part of a previous oldext extent to newext.
+		 * The right neighbor is contiguous with the new allocation.
+		 */
+		NEW = PREV;
+		NEW.rm_blockcount -= len;
+		error = xfs_rmap_update(cur, &NEW);
+		if (error)
+			goto done;
+		error = xfs_btree_increment(cur, 0, &i);
+		if (error)
+			goto done;
+		NEW = RIGHT;
+		NEW.rm_offset = offset;
+		NEW.rm_startblock = bno;
+		NEW.rm_blockcount += len;
+		error = xfs_rmap_update(cur, &NEW);
+		if (error)
+			goto done;
+		break;
+
+	case RMAP_RIGHT_FILLING:
+		/*
+		 * Setting the last part of a previous oldext extent to newext.
+		 * The right neighbor is not contiguous.
+		 */
+		NEW = PREV;
+		NEW.rm_blockcount -= len;
+		error = xfs_rmap_update(cur, &NEW);
+		if (error)
+			goto done;
+		error = xfs_rmap_lookup_eq(cur, bno, len, owner, offset,
+				oldext, &i);
+		if (error)
+			goto done;
+		XFS_WANT_CORRUPTED_GOTO(mp, i == 0, done);
+		NEW.rm_startblock = bno;
+		NEW.rm_owner = owner;
+		NEW.rm_offset = offset;
+		NEW.rm_blockcount = len;
+		NEW.rm_flags = newext;
+		cur->bc_rec.r = NEW;
+		trace_xfs_rmapbt_insert(mp, cur->bc_private.a.agno, bno,
+				len, owner, offset, newext);
+		error = xfs_btree_insert(cur, &i);
+		if (error)
+			goto done;
+		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
+		break;
+
+	case 0:
+		/*
+		 * Setting the middle part of a previous oldext extent to
+		 * newext.  Contiguity is impossible here.
+		 * One extent becomes three extents.
+		 */
+		/* new right extent - oldext */
+		NEW.rm_startblock = bno + len;
+		NEW.rm_owner = owner;
+		NEW.rm_offset = new_endoff;
+		NEW.rm_blockcount = PREV.rm_offset + PREV.rm_blockcount -
+				new_endoff;
+		NEW.rm_flags = PREV.rm_flags;
+		error = xfs_rmap_update(cur, &NEW);
+		if (error)
+			goto done;
+		/* new left extent - oldext */
+		NEW = PREV;
+		NEW.rm_blockcount = offset - PREV.rm_offset;
+		cur->bc_rec.r = NEW;
+		trace_xfs_rmapbt_insert(mp, cur->bc_private.a.agno,
+				NEW.rm_startblock, NEW.rm_blockcount,
+				NEW.rm_owner, NEW.rm_offset,
+				NEW.rm_flags);
+		error = xfs_btree_insert(cur, &i);
+		if (error)
+			goto done;
+		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
+		/*
+		 * Reset the cursor to the position of the new extent
+		 * we are about to insert as we can't trust it after
+		 * the previous insert.
+		 */
+		error = xfs_rmap_lookup_eq(cur, bno, len, owner, offset,
+				oldext, &i);
+		if (error)
+			goto done;
+		XFS_WANT_CORRUPTED_GOTO(mp, i == 0, done);
+		/* new middle extent - newext */
+		cur->bc_rec.r.rm_flags &= ~XFS_RMAP_UNWRITTEN;
+		cur->bc_rec.r.rm_flags |= newext;
+		trace_xfs_rmapbt_insert(mp, cur->bc_private.a.agno, bno, len,
+				owner, offset, newext);
+		error = xfs_btree_insert(cur, &i);
+		if (error)
+			goto done;
+		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
+		break;
+
+	case RMAP_LEFT_FILLING | RMAP_LEFT_CONTIG | RMAP_RIGHT_CONTIG:
+	case RMAP_RIGHT_FILLING | RMAP_LEFT_CONTIG | RMAP_RIGHT_CONTIG:
+	case RMAP_LEFT_FILLING | RMAP_RIGHT_CONTIG:
+	case RMAP_RIGHT_FILLING | RMAP_LEFT_CONTIG:
+	case RMAP_LEFT_CONTIG | RMAP_RIGHT_CONTIG:
+	case RMAP_LEFT_CONTIG:
+	case RMAP_RIGHT_CONTIG:
+		/*
+		 * These cases are all impossible.
+		 */
+		ASSERT(0);
+	}
+
+	trace_xfs_rmap_convert_done(mp, cur->bc_private.a.agno, bno, len,
+			unwritten, oinfo);
+done:
+	if (error)
+		trace_xfs_rmap_convert_error(cur->bc_mp,
+				cur->bc_private.a.agno, error, _RET_IP_);
+	return error;
+}
+
+#undef	NEW
+#undef	LEFT
+#undef	RIGHT
+#undef	PREV
+
 struct xfs_rmapbt_query_range_info {
 	xfs_rmapbt_query_range_fn	fn;
 	void				*priv;
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 3ebceb0..6466adc 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -2497,6 +2497,10 @@ DEFINE_RMAP_EVENT(xfs_rmap_free_extent_error);
 DEFINE_RMAP_EVENT(xfs_rmap_alloc_extent);
 DEFINE_RMAP_EVENT(xfs_rmap_alloc_extent_done);
 DEFINE_RMAP_EVENT(xfs_rmap_alloc_extent_error);
+DEFINE_RMAP_EVENT(xfs_rmap_convert);
+DEFINE_RMAP_EVENT(xfs_rmap_convert_done);
+DEFINE_AG_ERROR_EVENT(xfs_rmap_convert_error);
+DEFINE_AG_ERROR_EVENT(xfs_rmap_convert_state);
 
 DECLARE_EVENT_CLASS(xfs_rmapbt_class,
 	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
@@ -2551,6 +2555,8 @@ DEFINE_AG_ERROR_EVENT(xfs_rmapbt_delete_error);
 DEFINE_AG_ERROR_EVENT(xfs_rmapbt_update_error);
 DEFINE_RMAPBT_EVENT(xfs_rmap_lookup_le_range_result);
 DEFINE_RMAPBT_EVENT(xfs_rmap_map_gtrec);
+DEFINE_RMAPBT_EVENT(xfs_rmap_convert_gtrec);
+DEFINE_RMAPBT_EVENT(xfs_rmap_find_left_neighbor_result);
 
 #endif /* _TRACE_XFS_H */
 



* [PATCH 039/119] xfs: add rmap btree insert and delete helpers
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (37 preceding siblings ...)
  2016-06-17  1:21 ` [PATCH 038/119] xfs: convert unwritten status of reverse mappings Darrick J. Wong
@ 2016-06-17  1:22 ` Darrick J. Wong
  2016-07-13 18:28   ` Brian Foster
  2016-06-17  1:22 ` [PATCH 040/119] xfs: create helpers for mapping, unmapping, and converting file fork extents Darrick J. Wong
                   ` (79 subsequent siblings)
  118 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:22 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs, Dave Chinner

Add a couple of helper functions to encapsulate rmap btree insert and
delete operations.  Add tracepoints to the update function.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
---
 fs/xfs/libxfs/xfs_rmap.c       |   78 +++++++++++++++++++++++++++++++++++++++-
 fs/xfs/libxfs/xfs_rmap_btree.h |    3 ++
 2 files changed, 80 insertions(+), 1 deletion(-)
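[Editor's illustration] Both helpers follow the same shape: look up the record by exact key first, verify presence (delete) or absence (insert), then perform the operation. The toy below models that pattern with an unsorted array in place of the btree; all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAXREC 8
static uint32_t recs[MAXREC];
static size_t nrecs;

/* Exact-match lookup, analogous to xfs_rmap_lookup_eq. */
static int lookup_eq(uint32_t key, size_t *idx)
{
	for (size_t i = 0; i < nrecs; i++) {
		if (recs[i] == key) {
			*idx = i;
			return 1;
		}
	}
	return 0;
}

/* Like xfs_rmapbt_insert: refuse if an identical record already exists. */
static int toy_insert(uint32_t key)
{
	size_t idx;

	if (lookup_eq(key, &idx))
		return -1;	/* duplicate: the kernel flags corruption */
	if (nrecs == MAXREC)
		return -2;
	recs[nrecs++] = key;
	return 0;
}

/* Like xfs_rmapbt_delete: refuse if the record is missing. */
static int toy_delete(uint32_t key)
{
	size_t idx;

	if (!lookup_eq(key, &idx))
		return -1;	/* missing: the kernel flags corruption */
	recs[idx] = recs[--nrecs];
	return 0;
}
```

The kernel versions turn the two failure paths into XFS_WANT_CORRUPTED_GOTO checks plus an error tracepoint, which is the whole point of wrapping raw xfs_btree_insert/xfs_btree_delete calls.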


diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
index 53ba14e..f92eaa1 100644
--- a/fs/xfs/libxfs/xfs_rmap.c
+++ b/fs/xfs/libxfs/xfs_rmap.c
@@ -92,13 +92,89 @@ xfs_rmap_update(
 	struct xfs_rmap_irec	*irec)
 {
 	union xfs_btree_rec	rec;
+	int			error;
+
+	trace_xfs_rmapbt_update(cur->bc_mp, cur->bc_private.a.agno,
+			irec->rm_startblock, irec->rm_blockcount,
+			irec->rm_owner, irec->rm_offset, irec->rm_flags);
 
 	rec.rmap.rm_startblock = cpu_to_be32(irec->rm_startblock);
 	rec.rmap.rm_blockcount = cpu_to_be32(irec->rm_blockcount);
 	rec.rmap.rm_owner = cpu_to_be64(irec->rm_owner);
 	rec.rmap.rm_offset = cpu_to_be64(
 			xfs_rmap_irec_offset_pack(irec));
-	return xfs_btree_update(cur, &rec);
+	error = xfs_btree_update(cur, &rec);
+	if (error)
+		trace_xfs_rmapbt_update_error(cur->bc_mp,
+				cur->bc_private.a.agno, error, _RET_IP_);
+	return error;
+}
+
+int
+xfs_rmapbt_insert(
+	struct xfs_btree_cur	*rcur,
+	xfs_agblock_t		agbno,
+	xfs_extlen_t		len,
+	uint64_t		owner,
+	uint64_t		offset,
+	unsigned int		flags)
+{
+	int			i;
+	int			error;
+
+	trace_xfs_rmapbt_insert(rcur->bc_mp, rcur->bc_private.a.agno, agbno,
+			len, owner, offset, flags);
+
+	error = xfs_rmap_lookup_eq(rcur, agbno, len, owner, offset, flags, &i);
+	if (error)
+		goto done;
+	XFS_WANT_CORRUPTED_GOTO(rcur->bc_mp, i == 0, done);
+
+	rcur->bc_rec.r.rm_startblock = agbno;
+	rcur->bc_rec.r.rm_blockcount = len;
+	rcur->bc_rec.r.rm_owner = owner;
+	rcur->bc_rec.r.rm_offset = offset;
+	rcur->bc_rec.r.rm_flags = flags;
+	error = xfs_btree_insert(rcur, &i);
+	if (error)
+		goto done;
+	XFS_WANT_CORRUPTED_GOTO(rcur->bc_mp, i == 1, done);
+done:
+	if (error)
+		trace_xfs_rmapbt_insert_error(rcur->bc_mp,
+				rcur->bc_private.a.agno, error, _RET_IP_);
+	return error;
+}
+
+STATIC int
+xfs_rmapbt_delete(
+	struct xfs_btree_cur	*rcur,
+	xfs_agblock_t		agbno,
+	xfs_extlen_t		len,
+	uint64_t		owner,
+	uint64_t		offset,
+	unsigned int		flags)
+{
+	int			i;
+	int			error;
+
+	trace_xfs_rmapbt_delete(rcur->bc_mp, rcur->bc_private.a.agno, agbno,
+			len, owner, offset, flags);
+
+	error = xfs_rmap_lookup_eq(rcur, agbno, len, owner, offset, flags, &i);
+	if (error)
+		goto done;
+	XFS_WANT_CORRUPTED_GOTO(rcur->bc_mp, i == 1, done);
+
+	error = xfs_btree_delete(rcur, &i);
+	if (error)
+		goto done;
+	XFS_WANT_CORRUPTED_GOTO(rcur->bc_mp, i == 1, done);
+done:
+	if (error)
+		trace_xfs_rmapbt_delete_error(rcur->bc_mp,
+				rcur->bc_private.a.agno, error, _RET_IP_);
+	return error;
 }
 
 static int
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
index 9d92da5..6674340 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.h
+++ b/fs/xfs/libxfs/xfs_rmap_btree.h
@@ -64,6 +64,9 @@ int xfs_rmap_lookup_le(struct xfs_btree_cur *cur, xfs_agblock_t bno,
 int xfs_rmap_lookup_eq(struct xfs_btree_cur *cur, xfs_agblock_t bno,
 		xfs_extlen_t len, uint64_t owner, uint64_t offset,
 		unsigned int flags, int *stat);
+int xfs_rmapbt_insert(struct xfs_btree_cur *rcur, xfs_agblock_t agbno,
+		xfs_extlen_t len, uint64_t owner, uint64_t offset,
+		unsigned int flags);
 int xfs_rmap_get_rec(struct xfs_btree_cur *cur, struct xfs_rmap_irec *irec,
 		int *stat);
 


^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 040/119] xfs: create helpers for mapping, unmapping, and converting file fork extents
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (38 preceding siblings ...)
  2016-06-17  1:22 ` [PATCH 039/119] xfs: add rmap btree insert and delete helpers Darrick J. Wong
@ 2016-06-17  1:22 ` Darrick J. Wong
  2016-07-13 18:28   ` Brian Foster
  2016-06-17  1:22 ` [PATCH 041/119] xfs: create rmap update intent log items Darrick J. Wong
                   ` (78 subsequent siblings)
  118 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:22 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Create helper functions to assist with mapping, unmapping, and
converting the flag status of extents in a file's data/attr forks.  For
non-shared files these simply wrap the _alloc, _free, and _convert
functions; when reflink arrives, these functions will be augmented to
deal with shared extents.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_rmap.c |   42 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)


diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
index f92eaa1..76fc5c2 100644
--- a/fs/xfs/libxfs/xfs_rmap.c
+++ b/fs/xfs/libxfs/xfs_rmap.c
@@ -1123,11 +1123,53 @@ done:
 	return error;
 }
 
+/*
+ * Convert an unwritten extent to a real extent or vice versa.
+ */
+STATIC int
+xfs_rmap_convert(
+	struct xfs_btree_cur	*cur,
+	xfs_agblock_t		bno,
+	xfs_extlen_t		len,
+	bool			unwritten,
+	struct xfs_owner_info	*oinfo)
+{
+	return __xfs_rmap_convert(cur, bno, len, unwritten, oinfo);
+}
+
 #undef	NEW
 #undef	LEFT
 #undef	RIGHT
 #undef	PREV
 
+/*
+ * Find an extent in the rmap btree and unmap it.
+ */
+STATIC int
+xfs_rmap_unmap(
+	struct xfs_btree_cur	*cur,
+	xfs_agblock_t		bno,
+	xfs_extlen_t		len,
+	bool			unwritten,
+	struct xfs_owner_info	*oinfo)
+{
+	return __xfs_rmap_free(cur, bno, len, unwritten, oinfo);
+}
+
+/*
+ * Find an extent in the rmap btree and map it.
+ */
+STATIC int
+xfs_rmap_map(
+	struct xfs_btree_cur	*cur,
+	xfs_agblock_t		bno,
+	xfs_extlen_t		len,
+	bool			unwritten,
+	struct xfs_owner_info	*oinfo)
+{
+	return __xfs_rmap_alloc(cur, bno, len, unwritten, oinfo);
+}
+
 struct xfs_rmapbt_query_range_info {
 	xfs_rmapbt_query_range_fn	fn;
 	void				*priv;



* [PATCH 041/119] xfs: create rmap update intent log items
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (39 preceding siblings ...)
  2016-06-17  1:22 ` [PATCH 040/119] xfs: create helpers for mapping, unmapping, and converting file fork extents Darrick J. Wong
@ 2016-06-17  1:22 ` Darrick J. Wong
  2016-07-15 18:33   ` Brian Foster
  2016-06-17  1:22 ` [PATCH 042/119] xfs: log rmap intent items Darrick J. Wong
                   ` (77 subsequent siblings)
  118 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:22 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Create rmap update intent/done log items to record redo information in
the log.  Because we need to roll transactions between updating the
bmbt mapping and updating the reverse mapping, we also have to track
the status of the metadata updates that will be recorded in the
post-roll transactions, just in case we crash before committing the
final transaction.  This mechanism enables log recovery to finish what
was already started.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile                |    1 
 fs/xfs/libxfs/xfs_log_format.h |   67 ++++++
 fs/xfs/libxfs/xfs_rmap_btree.h |   19 ++
 fs/xfs/xfs_rmap_item.c         |  459 ++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_rmap_item.h         |  100 +++++++++
 fs/xfs/xfs_super.c             |   21 ++
 6 files changed, 665 insertions(+), 2 deletions(-)
 create mode 100644 fs/xfs/xfs_rmap_item.c
 create mode 100644 fs/xfs/xfs_rmap_item.h


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 2de8c20..8ae0a10 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -104,6 +104,7 @@ xfs-y				+= xfs_log.o \
 				   xfs_extfree_item.o \
 				   xfs_icreate_item.o \
 				   xfs_inode_item.o \
+				   xfs_rmap_item.o \
 				   xfs_log_recover.o \
 				   xfs_trans_ail.o \
 				   xfs_trans_buf.o \
diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
index e5baba3..b9627b7 100644
--- a/fs/xfs/libxfs/xfs_log_format.h
+++ b/fs/xfs/libxfs/xfs_log_format.h
@@ -110,7 +110,9 @@ static inline uint xlog_get_cycle(char *ptr)
 #define XLOG_REG_TYPE_COMMIT		18
 #define XLOG_REG_TYPE_TRANSHDR		19
 #define XLOG_REG_TYPE_ICREATE		20
-#define XLOG_REG_TYPE_MAX		20
+#define XLOG_REG_TYPE_RUI_FORMAT	21
+#define XLOG_REG_TYPE_RUD_FORMAT	22
+#define XLOG_REG_TYPE_MAX		22
 
 /*
  * Flags to log operation header
@@ -227,6 +229,8 @@ typedef struct xfs_trans_header {
 #define	XFS_LI_DQUOT		0x123d
 #define	XFS_LI_QUOTAOFF		0x123e
 #define	XFS_LI_ICREATE		0x123f
+#define	XFS_LI_RUI		0x1240	/* rmap update intent */
+#define	XFS_LI_RUD		0x1241	/* rmap update done */
 
 #define XFS_LI_TYPE_DESC \
 	{ XFS_LI_EFI,		"XFS_LI_EFI" }, \
@@ -236,7 +240,9 @@ typedef struct xfs_trans_header {
 	{ XFS_LI_BUF,		"XFS_LI_BUF" }, \
 	{ XFS_LI_DQUOT,		"XFS_LI_DQUOT" }, \
 	{ XFS_LI_QUOTAOFF,	"XFS_LI_QUOTAOFF" }, \
-	{ XFS_LI_ICREATE,	"XFS_LI_ICREATE" }
+	{ XFS_LI_ICREATE,	"XFS_LI_ICREATE" }, \
+	{ XFS_LI_RUI,		"XFS_LI_RUI" }, \
+	{ XFS_LI_RUD,		"XFS_LI_RUD" }
 
 /*
  * Inode Log Item Format definitions.
@@ -604,6 +610,63 @@ typedef struct xfs_efd_log_format_64 {
 } xfs_efd_log_format_64_t;
 
 /*
+ * RUI/RUD (reverse mapping) log format definitions
+ */
+struct xfs_map_extent {
+	__uint64_t		me_owner;
+	__uint64_t		me_startblock;
+	__uint64_t		me_startoff;
+	__uint32_t		me_len;
+	__uint32_t		me_flags;
+};
+
+/* rmap me_flags: upper bits are flags, lower byte is type code */
+#define XFS_RMAP_EXTENT_MAP		1
+#define XFS_RMAP_EXTENT_MAP_SHARED	2
+#define XFS_RMAP_EXTENT_UNMAP		3
+#define XFS_RMAP_EXTENT_UNMAP_SHARED	4
+#define XFS_RMAP_EXTENT_CONVERT		5
+#define XFS_RMAP_EXTENT_CONVERT_SHARED	6
+#define XFS_RMAP_EXTENT_ALLOC		7
+#define XFS_RMAP_EXTENT_FREE		8
+#define XFS_RMAP_EXTENT_TYPE_MASK	0xFF
+
+#define XFS_RMAP_EXTENT_ATTR_FORK	(1U << 31)
+#define XFS_RMAP_EXTENT_BMBT_BLOCK	(1U << 30)
+#define XFS_RMAP_EXTENT_UNWRITTEN	(1U << 29)
+
+#define XFS_RMAP_EXTENT_FLAGS		(XFS_RMAP_EXTENT_TYPE_MASK | \
+					 XFS_RMAP_EXTENT_ATTR_FORK | \
+					 XFS_RMAP_EXTENT_BMBT_BLOCK | \
+					 XFS_RMAP_EXTENT_UNWRITTEN)
+
+/*
+ * This is the structure used to lay out an rui log item in the
+ * log.  The rui_extents field is a variable size array whose
+ * size is given by rui_nextents.
+ */
+struct xfs_rui_log_format {
+	__uint16_t		rui_type;	/* rui log item type */
+	__uint16_t		rui_size;	/* size of this item */
+	__uint32_t		rui_nextents;	/* # extents to rmap */
+	__uint64_t		rui_id;		/* rui identifier */
+	struct xfs_map_extent	rui_extents[1];	/* array of extents to rmap */
+};
+
+/*
+ * This is the structure used to lay out an rud log item in the
+ * log.  The rud_extents array is a variable size array whose
+ * size is given by rud_nextents.
+ */
+struct xfs_rud_log_format {
+	__uint16_t		rud_type;	/* rud log item type */
+	__uint16_t		rud_size;	/* size of this item */
+	__uint32_t		rud_nextents;	/* # of extents rmapped */
+	__uint64_t		rud_rui_id;	/* id of corresponding rui */
+	struct xfs_map_extent	rud_extents[1];	/* array of extents rmapped */
+};
+
+/*
  * Dquot Log format definitions.
  *
  * The first two fields must be the type and size fitting into
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
index 6674340..aff60dc 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.h
+++ b/fs/xfs/libxfs/xfs_rmap_btree.h
@@ -87,4 +87,23 @@ int xfs_rmapbt_query_range(struct xfs_btree_cur *cur,
 		struct xfs_rmap_irec *low_rec, struct xfs_rmap_irec *high_rec,
 		xfs_rmapbt_query_range_fn fn, void *priv);
 
+enum xfs_rmap_intent_type {
+	XFS_RMAP_MAP,
+	XFS_RMAP_MAP_SHARED,
+	XFS_RMAP_UNMAP,
+	XFS_RMAP_UNMAP_SHARED,
+	XFS_RMAP_CONVERT,
+	XFS_RMAP_CONVERT_SHARED,
+	XFS_RMAP_ALLOC,
+	XFS_RMAP_FREE,
+};
+
+struct xfs_rmap_intent {
+	struct list_head			ri_list;
+	enum xfs_rmap_intent_type		ri_type;
+	__uint64_t				ri_owner;
+	int					ri_whichfork;
+	struct xfs_bmbt_irec			ri_bmap;
+};
+
 #endif	/* __XFS_RMAP_BTREE_H__ */
diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c
new file mode 100644
index 0000000..91a3b2c
--- /dev/null
+++ b/fs/xfs/xfs_rmap_item.c
@@ -0,0 +1,459 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_trans.h"
+#include "xfs_trans_priv.h"
+#include "xfs_buf_item.h"
+#include "xfs_rmap_item.h"
+#include "xfs_log.h"
+
+
+kmem_zone_t	*xfs_rui_zone;
+kmem_zone_t	*xfs_rud_zone;
+
+static inline struct xfs_rui_log_item *RUI_ITEM(struct xfs_log_item *lip)
+{
+	return container_of(lip, struct xfs_rui_log_item, rui_item);
+}
+
+void
+xfs_rui_item_free(
+	struct xfs_rui_log_item	*ruip)
+{
+	if (ruip->rui_format.rui_nextents > XFS_RUI_MAX_FAST_EXTENTS)
+		kmem_free(ruip);
+	else
+		kmem_zone_free(xfs_rui_zone, ruip);
+}
+
+/*
+ * This returns the number of iovecs needed to log the given rui item.
+ * We only need 1 iovec for an rui item.  It just logs the rui_log_format
+ * structure.
+ */
+static inline int
+xfs_rui_item_sizeof(
+	struct xfs_rui_log_item *ruip)
+{
+	return sizeof(struct xfs_rui_log_format) +
+			(ruip->rui_format.rui_nextents - 1) *
+			sizeof(struct xfs_map_extent);
+}
+
+STATIC void
+xfs_rui_item_size(
+	struct xfs_log_item	*lip,
+	int			*nvecs,
+	int			*nbytes)
+{
+	*nvecs += 1;
+	*nbytes += xfs_rui_item_sizeof(RUI_ITEM(lip));
+}
+
+/*
+ * This is called to fill in the vector of log iovecs for the
+ * given rui log item. We use only 1 iovec, and we point that
+ * at the rui_log_format structure embedded in the rui item.
+ * It is at this point that we assert that all of the extent
+ * slots in the rui item have been filled.
+ */
+STATIC void
+xfs_rui_item_format(
+	struct xfs_log_item	*lip,
+	struct xfs_log_vec	*lv)
+{
+	struct xfs_rui_log_item	*ruip = RUI_ITEM(lip);
+	struct xfs_log_iovec	*vecp = NULL;
+
+	ASSERT(atomic_read(&ruip->rui_next_extent) ==
+			ruip->rui_format.rui_nextents);
+
+	ruip->rui_format.rui_type = XFS_LI_RUI;
+	ruip->rui_format.rui_size = 1;
+
+	xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_RUI_FORMAT, &ruip->rui_format,
+			xfs_rui_item_sizeof(ruip));
+}
+
+/*
+ * Pinning has no meaning for an rui item, so just return.
+ */
+STATIC void
+xfs_rui_item_pin(
+	struct xfs_log_item	*lip)
+{
+}
+
+/*
+ * The unpin operation is the last place an RUI is manipulated in the log. It is
+ * either inserted in the AIL or aborted in the event of a log I/O error. In
+ * either case, the RUI transaction has been successfully committed to make it
+ * this far. Therefore, we expect whoever committed the RUI to either construct
+ * and commit the RUD or drop the RUD's reference in the event of error. Simply
+ * drop the log's RUI reference now that the log is done with it.
+ */
+STATIC void
+xfs_rui_item_unpin(
+	struct xfs_log_item	*lip,
+	int			remove)
+{
+	struct xfs_rui_log_item	*ruip = RUI_ITEM(lip);
+
+	xfs_rui_release(ruip);
+}
+
+/*
+ * RUI items have no locking or pushing.  However, since RUIs are pulled from
+ * the AIL when their corresponding RUDs are committed to disk, their situation
+ * is very similar to being pinned.  Return XFS_ITEM_PINNED so that the caller
+ * will eventually flush the log.  This should help in getting the RUI out of
+ * the AIL.
+ */
+STATIC uint
+xfs_rui_item_push(
+	struct xfs_log_item	*lip,
+	struct list_head	*buffer_list)
+{
+	return XFS_ITEM_PINNED;
+}
+
+/*
+ * The RUI has been either committed or aborted if the transaction has been
+ * cancelled. If the transaction was cancelled, an RUD isn't going to be
+ * constructed and thus we free the RUI here directly.
+ */
+STATIC void
+xfs_rui_item_unlock(
+	struct xfs_log_item	*lip)
+{
+	if (lip->li_flags & XFS_LI_ABORTED)
+		xfs_rui_item_free(RUI_ITEM(lip));
+}
+
+/*
+ * The RUI is logged only once and cannot be moved in the log, so simply return
+ * the lsn at which it's been logged.
+ */
+STATIC xfs_lsn_t
+xfs_rui_item_committed(
+	struct xfs_log_item	*lip,
+	xfs_lsn_t		lsn)
+{
+	return lsn;
+}
+
+/*
+ * The RUI dependency tracking op doesn't do squat.  It can't because
+ * it doesn't know where the rmap update is coming from.  The dependency
+ * tracking has to be handled by the "enclosing" metadata object.  For
+ * example, for inodes, the inode is locked throughout the rmap update,
+ * so the dependency should be recorded there.
+ */
+STATIC void
+xfs_rui_item_committing(
+	struct xfs_log_item	*lip,
+	xfs_lsn_t		lsn)
+{
+}
+
+/*
+ * This is the ops vector shared by all rui log items.
+ */
+static const struct xfs_item_ops xfs_rui_item_ops = {
+	.iop_size	= xfs_rui_item_size,
+	.iop_format	= xfs_rui_item_format,
+	.iop_pin	= xfs_rui_item_pin,
+	.iop_unpin	= xfs_rui_item_unpin,
+	.iop_unlock	= xfs_rui_item_unlock,
+	.iop_committed	= xfs_rui_item_committed,
+	.iop_push	= xfs_rui_item_push,
+	.iop_committing = xfs_rui_item_committing,
+};
+
+/*
+ * Allocate and initialize an rui item with the given number of extents.
+ */
+struct xfs_rui_log_item *
+xfs_rui_init(
+	struct xfs_mount		*mp,
+	uint				nextents)
+
+{
+	struct xfs_rui_log_item		*ruip;
+	uint				size;
+
+	ASSERT(nextents > 0);
+	if (nextents > XFS_RUI_MAX_FAST_EXTENTS) {
+		size = (uint)(sizeof(struct xfs_rui_log_item) +
+			((nextents - 1) * sizeof(struct xfs_map_extent)));
+		ruip = kmem_zalloc(size, KM_SLEEP);
+	} else {
+		ruip = kmem_zone_zalloc(xfs_rui_zone, KM_SLEEP);
+	}
+
+	xfs_log_item_init(mp, &ruip->rui_item, XFS_LI_RUI, &xfs_rui_item_ops);
+	ruip->rui_format.rui_nextents = nextents;
+	ruip->rui_format.rui_id = (uintptr_t)(void *)ruip;
+	atomic_set(&ruip->rui_next_extent, 0);
+	atomic_set(&ruip->rui_refcount, 2);
+
+	return ruip;
+}
+
+/*
+ * Copy an RUI format buffer from the given buf into the destination
+ * RUI format structure.  The RUI/RUD items were designed not to need any
+ * special alignment handling.
+ */
+int
+xfs_rui_copy_format(
+	struct xfs_log_iovec		*buf,
+	struct xfs_rui_log_format	*dst_rui_fmt)
+{
+	struct xfs_rui_log_format	*src_rui_fmt;
+	uint				len;
+
+	src_rui_fmt = buf->i_addr;
+	len = sizeof(struct xfs_rui_log_format) +
+			(src_rui_fmt->rui_nextents - 1) *
+			sizeof(struct xfs_map_extent);
+
+	if (buf->i_len == len) {
+		memcpy((char *)dst_rui_fmt, (char *)src_rui_fmt, len);
+		return 0;
+	}
+	return -EFSCORRUPTED;
+}
+
+/*
+ * Freeing the RUI requires that we remove it from the AIL if it has already
+ * been placed there. However, the RUI may not yet have been placed in the AIL
+ * when called by xfs_rui_release() from RUD processing due to the ordering of
+ * committed vs unpin operations in bulk insert operations. Hence the reference
+ * count to ensure only the last caller frees the RUI.
+ */
+void
+xfs_rui_release(
+	struct xfs_rui_log_item	*ruip)
+{
+	if (atomic_dec_and_test(&ruip->rui_refcount)) {
+		xfs_trans_ail_remove(&ruip->rui_item, SHUTDOWN_LOG_IO_ERROR);
+		xfs_rui_item_free(ruip);
+	}
+}
+
+static inline struct xfs_rud_log_item *RUD_ITEM(struct xfs_log_item *lip)
+{
+	return container_of(lip, struct xfs_rud_log_item, rud_item);
+}
+
+STATIC void
+xfs_rud_item_free(struct xfs_rud_log_item *rudp)
+{
+	if (rudp->rud_format.rud_nextents > XFS_RUD_MAX_FAST_EXTENTS)
+		kmem_free(rudp);
+	else
+		kmem_zone_free(xfs_rud_zone, rudp);
+}
+
+/*
+ * This returns the number of iovecs needed to log the given rud item.
+ * We only need 1 iovec for an rud item.  It just logs the rud_log_format
+ * structure.
+ */
+static inline int
+xfs_rud_item_sizeof(
+	struct xfs_rud_log_item	*rudp)
+{
+	return sizeof(struct xfs_rud_log_format) +
+			(rudp->rud_format.rud_nextents - 1) *
+			sizeof(struct xfs_map_extent);
+}
+
+STATIC void
+xfs_rud_item_size(
+	struct xfs_log_item	*lip,
+	int			*nvecs,
+	int			*nbytes)
+{
+	*nvecs += 1;
+	*nbytes += xfs_rud_item_sizeof(RUD_ITEM(lip));
+}
+
+/*
+ * This is called to fill in the vector of log iovecs for the
+ * given rud log item. We use only 1 iovec, and we point that
+ * at the rud_log_format structure embedded in the rud item.
+ * It is at this point that we assert that all of the extent
+ * slots in the rud item have been filled.
+ */
+STATIC void
+xfs_rud_item_format(
+	struct xfs_log_item	*lip,
+	struct xfs_log_vec	*lv)
+{
+	struct xfs_rud_log_item	*rudp = RUD_ITEM(lip);
+	struct xfs_log_iovec	*vecp = NULL;
+
+	ASSERT(rudp->rud_next_extent == rudp->rud_format.rud_nextents);
+
+	rudp->rud_format.rud_type = XFS_LI_RUD;
+	rudp->rud_format.rud_size = 1;
+
+	xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_RUD_FORMAT, &rudp->rud_format,
+			xfs_rud_item_sizeof(rudp));
+}
+
+/*
+ * Pinning has no meaning for an rud item, so just return.
+ */
+STATIC void
+xfs_rud_item_pin(
+	struct xfs_log_item	*lip)
+{
+}
+
+/*
+ * Since pinning has no meaning for an rud item, unpinning does
+ * not either.
+ */
+STATIC void
+xfs_rud_item_unpin(
+	struct xfs_log_item	*lip,
+	int			remove)
+{
+}
+
+/*
+ * There isn't much you can do to push on an rud item.  It is simply stuck
+ * waiting for the log to be flushed to disk.
+ */
+STATIC uint
+xfs_rud_item_push(
+	struct xfs_log_item	*lip,
+	struct list_head	*buffer_list)
+{
+	return XFS_ITEM_PINNED;
+}
+
+/*
+ * The RUD is either committed or aborted if the transaction is cancelled. If
+ * the transaction is cancelled, drop our reference to the RUI and free the
+ * RUD.
+ */
+STATIC void
+xfs_rud_item_unlock(
+	struct xfs_log_item	*lip)
+{
+	struct xfs_rud_log_item	*rudp = RUD_ITEM(lip);
+
+	if (lip->li_flags & XFS_LI_ABORTED) {
+		xfs_rui_release(rudp->rud_ruip);
+		xfs_rud_item_free(rudp);
+	}
+}
+
+/*
+ * When the rud item is committed to disk, all we need to do is delete our
+ * reference to our partner rui item and then free ourselves. Since we're
+ * freeing ourselves we must return -1 to keep the transaction code from
+ * further referencing this item.
+ */
+STATIC xfs_lsn_t
+xfs_rud_item_committed(
+	struct xfs_log_item	*lip,
+	xfs_lsn_t		lsn)
+{
+	struct xfs_rud_log_item	*rudp = RUD_ITEM(lip);
+
+	/*
+	 * Drop the RUI reference regardless of whether the RUD has been
+	 * aborted. Once the RUD transaction is constructed, it is the sole
+	 * responsibility of the RUD to release the RUI (even if the RUI is
+	 * aborted due to log I/O error).
+	 */
+	xfs_rui_release(rudp->rud_ruip);
+	xfs_rud_item_free(rudp);
+
+	return (xfs_lsn_t)-1;
+}
+
+/*
+ * The RUD dependency tracking op doesn't do squat.  It can't because
+ * it doesn't know where the rmap update is coming from.  The dependency
+ * tracking has to be handled by the "enclosing" metadata object.  For
+ * example, for inodes, the inode is locked throughout the rmap update,
+ * so the dependency should be recorded there.
+ */
+STATIC void
+xfs_rud_item_committing(
+	struct xfs_log_item	*lip,
+	xfs_lsn_t		lsn)
+{
+}
+
+/*
+ * This is the ops vector shared by all rud log items.
+ */
+static const struct xfs_item_ops xfs_rud_item_ops = {
+	.iop_size	= xfs_rud_item_size,
+	.iop_format	= xfs_rud_item_format,
+	.iop_pin	= xfs_rud_item_pin,
+	.iop_unpin	= xfs_rud_item_unpin,
+	.iop_unlock	= xfs_rud_item_unlock,
+	.iop_committed	= xfs_rud_item_committed,
+	.iop_push	= xfs_rud_item_push,
+	.iop_committing = xfs_rud_item_committing,
+};
+
+/*
+ * Allocate and initialize an rud item with the given number of extents.
+ */
+struct xfs_rud_log_item *
+xfs_rud_init(
+	struct xfs_mount		*mp,
+	struct xfs_rui_log_item		*ruip,
+	uint				nextents)
+
+{
+	struct xfs_rud_log_item	*rudp;
+	uint			size;
+
+	ASSERT(nextents > 0);
+	if (nextents > XFS_RUD_MAX_FAST_EXTENTS) {
+		size = (uint)(sizeof(struct xfs_rud_log_item) +
+			((nextents - 1) * sizeof(struct xfs_map_extent)));
+		rudp = kmem_zalloc(size, KM_SLEEP);
+	} else {
+		rudp = kmem_zone_zalloc(xfs_rud_zone, KM_SLEEP);
+	}
+
+	xfs_log_item_init(mp, &rudp->rud_item, XFS_LI_RUD, &xfs_rud_item_ops);
+	rudp->rud_ruip = ruip;
+	rudp->rud_format.rud_nextents = nextents;
+	rudp->rud_format.rud_rui_id = ruip->rui_format.rui_id;
+
+	return rudp;
+}
diff --git a/fs/xfs/xfs_rmap_item.h b/fs/xfs/xfs_rmap_item.h
new file mode 100644
index 0000000..bd36ab5
--- /dev/null
+++ b/fs/xfs/xfs_rmap_item.h
@@ -0,0 +1,100 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#ifndef	__XFS_RMAP_ITEM_H__
+#define	__XFS_RMAP_ITEM_H__
+
+/*
+ * There are (currently) three pairs of rmap btree redo item types: map, unmap,
+ * and convert.  The common abbreviations for these are RUI (rmap update
+ * intent) and RUD (rmap update done).  The redo item type is encoded in the
+ * flags field of each xfs_map_extent.
+ *
+ * *I items should be recorded in the *first* of a series of rolled
+ * transactions, and the *D items should be recorded in the same transaction
+ * that records the associated rmapbt updates.  Typically, the first
+ * transaction will record a bmbt update, followed by some number of
+ * transactions containing rmapbt updates, and finally transactions with any
+ * bnobt/cntbt updates.
+ *
+ * Should the system crash after the commit of the first transaction but
+ * before the commit of the final transaction in a series, log recovery will
+ * use the redo information recorded by the intent items to replay the
+ * (rmapbt/bnobt/cntbt) metadata updates in the non-first transaction.
+ */
+
+/* kernel only RUI/RUD definitions */
+
+struct xfs_mount;
+struct kmem_zone;
+
+/*
+ * Max number of extents in fast allocation path.
+ */
+#define	XFS_RUI_MAX_FAST_EXTENTS	16
+
+/*
+ * Define RUI flag bits. Manipulated by set/clear/test_bit operators.
+ */
+#define	XFS_RUI_RECOVERED		1
+
+/*
+ * This is the "rmap update intent" log item.  It is used to log the fact that
+ * some reverse mappings need to change.  It is used in conjunction with the
+ * "rmap update done" log item described below.
+ *
+ * These log items follow the same rules as struct xfs_efi_log_item; see the
+ * comments about that structure (in xfs_extfree_item.h) for more details.
+ */
+struct xfs_rui_log_item {
+	struct xfs_log_item		rui_item;
+	atomic_t			rui_refcount;
+	atomic_t			rui_next_extent;
+	unsigned long			rui_flags;	/* misc flags */
+	struct xfs_rui_log_format	rui_format;
+};
+
+/*
+ * This is the "rmap update done" log item.  It is used to log the fact that
+ * some rmapbt updates mentioned in an earlier rui item have been performed.
+ */
+struct xfs_rud_log_item {
+	struct xfs_log_item		rud_item;
+	struct xfs_rui_log_item		*rud_ruip;
+	uint				rud_next_extent;
+	struct xfs_rud_log_format	rud_format;
+};
+
+/*
+ * Max number of extents in fast allocation path.
+ */
+#define	XFS_RUD_MAX_FAST_EXTENTS	16
+
+extern struct kmem_zone	*xfs_rui_zone;
+extern struct kmem_zone	*xfs_rud_zone;
+
+struct xfs_rui_log_item *xfs_rui_init(struct xfs_mount *, uint);
+struct xfs_rud_log_item *xfs_rud_init(struct xfs_mount *,
+		struct xfs_rui_log_item *, uint);
+int xfs_rui_copy_format(struct xfs_log_iovec *buf,
+		struct xfs_rui_log_format *dst_rui_fmt);
+void xfs_rui_item_free(struct xfs_rui_log_item *);
+void xfs_rui_release(struct xfs_rui_log_item *);
+
+#endif	/* __XFS_RMAP_ITEM_H__ */
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 1575849..a8300e4 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -47,6 +47,7 @@
 #include "xfs_sysfs.h"
 #include "xfs_ondisk.h"
 #include "xfs_defer.h"
+#include "xfs_rmap_item.h"
 
 #include <linux/namei.h>
 #include <linux/init.h>
@@ -1762,8 +1763,26 @@ xfs_init_zones(void)
 	if (!xfs_icreate_zone)
 		goto out_destroy_ili_zone;
 
+	xfs_rud_zone = kmem_zone_init((sizeof(struct xfs_rud_log_item) +
+			((XFS_RUD_MAX_FAST_EXTENTS - 1) *
+				 sizeof(struct xfs_map_extent))),
+			"xfs_rud_item");
+	if (!xfs_rud_zone)
+		goto out_destroy_icreate_zone;
+
+	xfs_rui_zone = kmem_zone_init((sizeof(struct xfs_rui_log_item) +
+			((XFS_RUI_MAX_FAST_EXTENTS - 1) *
+				sizeof(struct xfs_map_extent))),
+			"xfs_rui_item");
+	if (!xfs_rui_zone)
+		goto out_destroy_rud_zone;
+
 	return 0;
 
+ out_destroy_rud_zone:
+	kmem_zone_destroy(xfs_rud_zone);
+ out_destroy_icreate_zone:
+	kmem_zone_destroy(xfs_icreate_zone);
  out_destroy_ili_zone:
 	kmem_zone_destroy(xfs_ili_zone);
  out_destroy_inode_zone:
@@ -1802,6 +1821,8 @@ xfs_destroy_zones(void)
 	 * destroy caches.
 	 */
 	rcu_barrier();
+	kmem_zone_destroy(xfs_rui_zone);
+	kmem_zone_destroy(xfs_rud_zone);
 	kmem_zone_destroy(xfs_icreate_zone);
 	kmem_zone_destroy(xfs_ili_zone);
 	kmem_zone_destroy(xfs_inode_zone);



* [PATCH 042/119] xfs: log rmap intent items
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (40 preceding siblings ...)
  2016-06-17  1:22 ` [PATCH 041/119] xfs: create rmap update intent log items Darrick J. Wong
@ 2016-06-17  1:22 ` Darrick J. Wong
  2016-07-15 18:33   ` Brian Foster
  2016-06-17  1:22 ` [PATCH 043/119] xfs: enable the xfs_defer mechanism to process rmaps to update Darrick J. Wong
                   ` (76 subsequent siblings)
  118 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:22 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Provide a mechanism for higher levels to create RUI/RUD items and
submit them to the log, along with a stub function to deal with
recovered RUI items.  These parts will be connected to the rmapbt in a
later patch.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile          |    1 
 fs/xfs/xfs_log_recover.c |  344 +++++++++++++++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_trans.h       |   17 ++
 fs/xfs/xfs_trans_rmap.c  |  235 +++++++++++++++++++++++++++++++
 4 files changed, 589 insertions(+), 8 deletions(-)
 create mode 100644 fs/xfs/xfs_trans_rmap.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 8ae0a10..1980110 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -110,6 +110,7 @@ xfs-y				+= xfs_log.o \
 				   xfs_trans_buf.o \
 				   xfs_trans_extfree.o \
 				   xfs_trans_inode.o \
+				   xfs_trans_rmap.o \
 
 # optional features
 xfs-$(CONFIG_XFS_QUOTA)		+= xfs_dquot.o \
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index b33187b..c9fe0c4 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -44,6 +44,7 @@
 #include "xfs_bmap_btree.h"
 #include "xfs_error.h"
 #include "xfs_dir2.h"
+#include "xfs_rmap_item.h"
 
 #define BLK_AVG(blk1, blk2)	((blk1+blk2) >> 1)
 
@@ -1912,6 +1913,8 @@ xlog_recover_reorder_trans(
 		case XFS_LI_QUOTAOFF:
 		case XFS_LI_EFD:
 		case XFS_LI_EFI:
+		case XFS_LI_RUI:
+		case XFS_LI_RUD:
 			trace_xfs_log_recover_item_reorder_tail(log,
 							trans, item, pass);
 			list_move_tail(&item->ri_list, &inode_list);
@@ -3416,6 +3419,101 @@ xlog_recover_efd_pass2(
 }
 
 /*
+ * This routine is called to create an in-core extent rmap update
+ * item from the rui format structure which was logged on disk.
+ * It allocates an in-core rui, copies the extents from the format
+ * structure into it, and adds the rui to the AIL with the given
+ * LSN.
+ */
+STATIC int
+xlog_recover_rui_pass2(
+	struct xlog			*log,
+	struct xlog_recover_item	*item,
+	xfs_lsn_t			lsn)
+{
+	int				error;
+	struct xfs_mount		*mp = log->l_mp;
+	struct xfs_rui_log_item		*ruip;
+	struct xfs_rui_log_format	*rui_formatp;
+
+	rui_formatp = item->ri_buf[0].i_addr;
+
+	ruip = xfs_rui_init(mp, rui_formatp->rui_nextents);
+	error = xfs_rui_copy_format(&item->ri_buf[0], &ruip->rui_format);
+	if (error) {
+		xfs_rui_item_free(ruip);
+		return error;
+	}
+	atomic_set(&ruip->rui_next_extent, rui_formatp->rui_nextents);
+
+	spin_lock(&log->l_ailp->xa_lock);
+	/*
+	 * The RUI has two references. One for the RUD and one for RUI to ensure
+	 * it makes it into the AIL. Insert the RUI into the AIL directly and
+	 * drop the RUI reference. Note that xfs_trans_ail_update() drops the
+	 * AIL lock.
+	 */
+	xfs_trans_ail_update(log->l_ailp, &ruip->rui_item, lsn);
+	xfs_rui_release(ruip);
+	return 0;
+}
+
+
+/*
+ * This routine is called when an RUD format structure is found in a committed
+ * transaction in the log. Its purpose is to cancel the corresponding RUI if it
+ * was still in the log. To do this it searches the AIL for the RUI with an id
+ * equal to that in the RUD format structure. If we find it we drop the RUD
+ * reference, which removes the RUI from the AIL and frees it.
+ */
+STATIC int
+xlog_recover_rud_pass2(
+	struct xlog			*log,
+	struct xlog_recover_item	*item)
+{
+	struct xfs_rud_log_format	*rud_formatp;
+	struct xfs_rui_log_item		*ruip = NULL;
+	struct xfs_log_item		*lip;
+	__uint64_t			rui_id;
+	struct xfs_ail_cursor		cur;
+	struct xfs_ail			*ailp = log->l_ailp;
+
+	rud_formatp = item->ri_buf[0].i_addr;
+	ASSERT(item->ri_buf[0].i_len == (sizeof(struct xfs_rud_log_format) +
+			((rud_formatp->rud_nextents - 1) *
+			sizeof(struct xfs_map_extent))));
+	rui_id = rud_formatp->rud_rui_id;
+
+	/*
+	 * Search for the RUI with the id in the RUD format structure in the
+	 * AIL.
+	 */
+	spin_lock(&ailp->xa_lock);
+	lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
+	while (lip != NULL) {
+		if (lip->li_type == XFS_LI_RUI) {
+			ruip = (struct xfs_rui_log_item *)lip;
+			if (ruip->rui_format.rui_id == rui_id) {
+				/*
+				 * Drop the RUD reference to the RUI. This
+				 * removes the RUI from the AIL and frees it.
+				 */
+				spin_unlock(&ailp->xa_lock);
+				xfs_rui_release(ruip);
+				spin_lock(&ailp->xa_lock);
+				break;
+			}
+		}
+		lip = xfs_trans_ail_cursor_next(ailp, &cur);
+	}
+
+	xfs_trans_ail_cursor_done(&cur);
+	spin_unlock(&ailp->xa_lock);
+
+	return 0;
+}
+
+/*
  * This routine is called when an inode create format structure is found in a
 * committed transaction in the log.  Its purpose is to initialise the inodes
  * being allocated on disk. This requires us to get inode cluster buffers that
@@ -3640,6 +3738,8 @@ xlog_recover_ra_pass2(
 	case XFS_LI_EFI:
 	case XFS_LI_EFD:
 	case XFS_LI_QUOTAOFF:
+	case XFS_LI_RUI:
+	case XFS_LI_RUD:
 	default:
 		break;
 	}
@@ -3663,6 +3763,8 @@ xlog_recover_commit_pass1(
 	case XFS_LI_EFD:
 	case XFS_LI_DQUOT:
 	case XFS_LI_ICREATE:
+	case XFS_LI_RUI:
+	case XFS_LI_RUD:
 		/* nothing to do in pass 1 */
 		return 0;
 	default:
@@ -3693,6 +3795,10 @@ xlog_recover_commit_pass2(
 		return xlog_recover_efi_pass2(log, item, trans->r_lsn);
 	case XFS_LI_EFD:
 		return xlog_recover_efd_pass2(log, item);
+	case XFS_LI_RUI:
+		return xlog_recover_rui_pass2(log, item, trans->r_lsn);
+	case XFS_LI_RUD:
+		return xlog_recover_rud_pass2(log, item);
 	case XFS_LI_DQUOT:
 		return xlog_recover_dquot_pass2(log, buffer_list, item,
 						trans->r_lsn);
@@ -4165,6 +4271,18 @@ xlog_recover_process_data(
 	return 0;
 }
 
+/* Is this log item a deferred action intent? */
+static inline bool xlog_item_is_intent(struct xfs_log_item *lip)
+{
+	switch (lip->li_type) {
+	case XFS_LI_EFI:
+	case XFS_LI_RUI:
+		return true;
+	default:
+		return false;
+	}
+}
+
 /*
  * Process an extent free intent item that was recovered from
  * the log.  We need to free the extents that it describes.
@@ -4265,17 +4383,23 @@ xlog_recover_process_efis(
 	lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
 	while (lip != NULL) {
 		/*
-		 * We're done when we see something other than an EFI.
-		 * There should be no EFIs left in the AIL now.
+		 * We're done when we see something other than an intent.
+		 * There should be no intents left in the AIL now.
 		 */
-		if (lip->li_type != XFS_LI_EFI) {
+		if (!xlog_item_is_intent(lip)) {
 #ifdef DEBUG
 			for (; lip; lip = xfs_trans_ail_cursor_next(ailp, &cur))
-				ASSERT(lip->li_type != XFS_LI_EFI);
+				ASSERT(!xlog_item_is_intent(lip));
 #endif
 			break;
 		}
 
+		/* Skip anything that isn't an EFI */
+		if (lip->li_type != XFS_LI_EFI) {
+			lip = xfs_trans_ail_cursor_next(ailp, &cur);
+			continue;
+		}
+
 		/*
 		 * Skip EFIs that we've already processed.
 		 */
@@ -4320,14 +4444,20 @@ xlog_recover_cancel_efis(
 		 * We're done when we see something other than an EFI.
 		 * There should be no EFIs left in the AIL now.
 		 */
-		if (lip->li_type != XFS_LI_EFI) {
+		if (!xlog_item_is_intent(lip)) {
 #ifdef DEBUG
 			for (; lip; lip = xfs_trans_ail_cursor_next(ailp, &cur))
-				ASSERT(lip->li_type != XFS_LI_EFI);
+				ASSERT(!xlog_item_is_intent(lip));
 #endif
 			break;
 		}
 
+		/* Skip anything that isn't an EFI */
+		if (lip->li_type != XFS_LI_EFI) {
+			lip = xfs_trans_ail_cursor_next(ailp, &cur);
+			continue;
+		}
+
 		efip = container_of(lip, struct xfs_efi_log_item, efi_item);
 
 		spin_unlock(&ailp->xa_lock);
@@ -4343,6 +4473,190 @@ xlog_recover_cancel_efis(
 }
 
 /*
+ * Process an rmap update intent item that was recovered from the log.
+ * We need to update the rmapbt.
+ */
+STATIC int
+xlog_recover_process_rui(
+	struct xfs_mount		*mp,
+	struct xfs_rui_log_item		*ruip)
+{
+	int				i;
+	int				error = 0;
+	struct xfs_map_extent		*rmap;
+	xfs_fsblock_t			startblock_fsb;
+	bool				op_ok;
+
+	ASSERT(!test_bit(XFS_RUI_RECOVERED, &ruip->rui_flags));
+
+	/*
+	 * First check the validity of the extents described by the
+	 * RUI.  If any are bad, then assume that all are bad and
+	 * just toss the RUI.
+	 */
+	for (i = 0; i < ruip->rui_format.rui_nextents; i++) {
+		rmap = &(ruip->rui_format.rui_extents[i]);
+		startblock_fsb = XFS_BB_TO_FSB(mp,
+				   XFS_FSB_TO_DADDR(mp, rmap->me_startblock));
+		switch (rmap->me_flags & XFS_RMAP_EXTENT_TYPE_MASK) {
+		case XFS_RMAP_EXTENT_MAP:
+		case XFS_RMAP_EXTENT_MAP_SHARED:
+		case XFS_RMAP_EXTENT_UNMAP:
+		case XFS_RMAP_EXTENT_UNMAP_SHARED:
+		case XFS_RMAP_EXTENT_CONVERT:
+		case XFS_RMAP_EXTENT_CONVERT_SHARED:
+		case XFS_RMAP_EXTENT_ALLOC:
+		case XFS_RMAP_EXTENT_FREE:
+			op_ok = true;
+			break;
+		default:
+			op_ok = false;
+			break;
+		}
+		if (!op_ok || (startblock_fsb == 0) ||
+		    (rmap->me_len == 0) ||
+		    (startblock_fsb >= mp->m_sb.sb_dblocks) ||
+		    (rmap->me_len >= mp->m_sb.sb_agblocks) ||
+		    (rmap->me_flags & ~XFS_RMAP_EXTENT_FLAGS)) {
+			/*
+			 * This will pull the RUI from the AIL and
+			 * free the memory associated with it.
+			 */
+			set_bit(XFS_RUI_RECOVERED, &ruip->rui_flags);
+			xfs_rui_release(ruip);
+			return -EIO;
+		}
+	}
+
+	/* XXX: do nothing for now */
+	set_bit(XFS_RUI_RECOVERED, &ruip->rui_flags);
+	xfs_rui_release(ruip);
+	return error;
+}
+
+/*
+ * When this is called, all of the RUIs which did not have
+ * corresponding RUDs should be in the AIL.  What we do now
+ * is update the rmaps associated with each one.
+ *
+ * Since we process the RUIs in normal transactions, they
+ * will be removed at some point after the commit.  This prevents
+ * us from just walking down the list processing each one.
+ * We'll use a flag in the RUI to skip those that we've already
+ * processed and use the AIL iteration mechanism's generation
+ * count to try to speed this up at least a bit.
+ *
+ * When we start, we know that the RUIs are the only things in
+ * the AIL.  As we process them, however, other items are added
+ * to the AIL.  Since everything added to the AIL must come after
+ * everything already in the AIL, we stop processing as soon as
+ * we see something other than an RUI in the AIL.
+ */
+STATIC int
+xlog_recover_process_ruis(
+	struct xlog		*log)
+{
+	struct xfs_log_item	*lip;
+	struct xfs_rui_log_item	*ruip;
+	int			error = 0;
+	struct xfs_ail_cursor	cur;
+	struct xfs_ail		*ailp;
+
+	ailp = log->l_ailp;
+	spin_lock(&ailp->xa_lock);
+	lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
+	while (lip != NULL) {
+		/*
+		 * We're done when we see something other than an intent.
+		 * There should be no intents left in the AIL now.
+		 */
+		if (!xlog_item_is_intent(lip)) {
+#ifdef DEBUG
+			for (; lip; lip = xfs_trans_ail_cursor_next(ailp, &cur))
+				ASSERT(!xlog_item_is_intent(lip));
+#endif
+			break;
+		}
+
+		/* Skip anything that isn't an RUI */
+		if (lip->li_type != XFS_LI_RUI) {
+			lip = xfs_trans_ail_cursor_next(ailp, &cur);
+			continue;
+		}
+
+		/*
+		 * Skip RUIs that we've already processed.
+		 */
+		ruip = container_of(lip, struct xfs_rui_log_item, rui_item);
+		if (test_bit(XFS_RUI_RECOVERED, &ruip->rui_flags)) {
+			lip = xfs_trans_ail_cursor_next(ailp, &cur);
+			continue;
+		}
+
+		spin_unlock(&ailp->xa_lock);
+		error = xlog_recover_process_rui(log->l_mp, ruip);
+		spin_lock(&ailp->xa_lock);
+		if (error)
+			goto out;
+		lip = xfs_trans_ail_cursor_next(ailp, &cur);
+	}
+out:
+	xfs_trans_ail_cursor_done(&cur);
+	spin_unlock(&ailp->xa_lock);
+	return error;
+}
+
+/*
+ * A cancel occurs when the mount has failed and we're bailing out. Release all
+ * pending RUIs so they don't pin the AIL.
+ */
+STATIC int
+xlog_recover_cancel_ruis(
+	struct xlog		*log)
+{
+	struct xfs_log_item	*lip;
+	struct xfs_rui_log_item	*ruip;
+	int			error = 0;
+	struct xfs_ail_cursor	cur;
+	struct xfs_ail		*ailp;
+
+	ailp = log->l_ailp;
+	spin_lock(&ailp->xa_lock);
+	lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
+	while (lip != NULL) {
+		/*
+		 * We're done when we see something other than an intent.
+		 * There should be no intents left in the AIL now.
+		 */
+		if (!xlog_item_is_intent(lip)) {
+#ifdef DEBUG
+			for (; lip; lip = xfs_trans_ail_cursor_next(ailp, &cur))
+				ASSERT(!xlog_item_is_intent(lip));
+#endif
+			break;
+		}
+
+		/* Skip anything that isn't an RUI */
+		if (lip->li_type != XFS_LI_RUI) {
+			lip = xfs_trans_ail_cursor_next(ailp, &cur);
+			continue;
+		}
+
+		ruip = container_of(lip, struct xfs_rui_log_item, rui_item);
+
+		spin_unlock(&ailp->xa_lock);
+		xfs_rui_release(ruip);
+		spin_lock(&ailp->xa_lock);
+
+		lip = xfs_trans_ail_cursor_next(ailp, &cur);
+	}
+
+	xfs_trans_ail_cursor_done(&cur);
+	spin_unlock(&ailp->xa_lock);
+	return error;
+}
+
+/*
  * This routine performs a transaction to null out a bad inode pointer
  * in an agi unlinked inode hash bucket.
  */
@@ -5144,11 +5458,19 @@ xlog_recover_finish(
 	 */
 	if (log->l_flags & XLOG_RECOVERY_NEEDED) {
 		int	error;
+
+		error = xlog_recover_process_ruis(log);
+		if (error) {
+			xfs_alert(log->l_mp, "Failed to recover RUIs");
+			return error;
+		}
+
 		error = xlog_recover_process_efis(log);
 		if (error) {
 			xfs_alert(log->l_mp, "Failed to recover EFIs");
 			return error;
 		}
+
 		/*
 		 * Sync the log to get all the EFIs out of the AIL.
 		 * This isn't absolutely necessary, but it helps in
@@ -5176,9 +5498,15 @@ xlog_recover_cancel(
 	struct xlog	*log)
 {
 	int		error = 0;
+	int		err2;
 
-	if (log->l_flags & XLOG_RECOVERY_NEEDED)
-		error = xlog_recover_cancel_efis(log);
+	if (log->l_flags & XLOG_RECOVERY_NEEDED) {
+		error = xlog_recover_cancel_ruis(log);
+
+		err2 = xlog_recover_cancel_efis(log);
+		if (err2 && !error)
+			error = err2;
+	}
 
 	return error;
 }
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index f8d363f..c48be63 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -235,4 +235,21 @@ void		xfs_trans_buf_copy_type(struct xfs_buf *dst_bp,
 extern kmem_zone_t	*xfs_trans_zone;
 extern kmem_zone_t	*xfs_log_item_desc_zone;
 
+enum xfs_rmap_intent_type;
+
+struct xfs_rui_log_item *xfs_trans_get_rui(struct xfs_trans *tp, uint nextents);
+void xfs_trans_log_start_rmap_update(struct xfs_trans *tp,
+		struct xfs_rui_log_item *ruip, enum xfs_rmap_intent_type type,
+		__uint64_t owner, int whichfork, xfs_fileoff_t startoff,
+		xfs_fsblock_t startblock, xfs_filblks_t blockcount,
+		xfs_exntst_t state);
+
+struct xfs_rud_log_item *xfs_trans_get_rud(struct xfs_trans *tp,
+		struct xfs_rui_log_item *ruip, uint nextents);
+int xfs_trans_log_finish_rmap_update(struct xfs_trans *tp,
+		struct xfs_rud_log_item *rudp, enum xfs_rmap_intent_type type,
+		__uint64_t owner, int whichfork, xfs_fileoff_t startoff,
+		xfs_fsblock_t startblock, xfs_filblks_t blockcount,
+		xfs_exntst_t state);
+
 #endif	/* __XFS_TRANS_H__ */
diff --git a/fs/xfs/xfs_trans_rmap.c b/fs/xfs/xfs_trans_rmap.c
new file mode 100644
index 0000000..b55a725
--- /dev/null
+++ b/fs/xfs/xfs_trans_rmap.c
@@ -0,0 +1,235 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_trans.h"
+#include "xfs_trans_priv.h"
+#include "xfs_rmap_item.h"
+#include "xfs_alloc.h"
+#include "xfs_rmap_btree.h"
+
+/*
+ * This routine is called to allocate an "rmap update intent"
+ * log item that will hold nextents worth of extents.  The
+ * caller must use all nextents extents, because we are not
+ * flexible about this at all.
+ */
+struct xfs_rui_log_item *
+xfs_trans_get_rui(
+	struct xfs_trans		*tp,
+	uint				nextents)
+{
+	struct xfs_rui_log_item		*ruip;
+
+	ASSERT(tp != NULL);
+	ASSERT(nextents > 0);
+
+	ruip = xfs_rui_init(tp->t_mountp, nextents);
+	ASSERT(ruip != NULL);
+
+	/*
+	 * Get a log_item_desc to point at the new item.
+	 */
+	xfs_trans_add_item(tp, &ruip->rui_item);
+	return ruip;
+}
+
+/*
+ * This routine is called to record one reverse mapping update in the
+ * RUI.  It should be called once for each pending rmap update.
+ */
+void
+xfs_trans_log_start_rmap_update(
+	struct xfs_trans		*tp,
+	struct xfs_rui_log_item		*ruip,
+	enum xfs_rmap_intent_type	type,
+	__uint64_t			owner,
+	int				whichfork,
+	xfs_fileoff_t			startoff,
+	xfs_fsblock_t			startblock,
+	xfs_filblks_t			blockcount,
+	xfs_exntst_t			state)
+{
+	uint				next_extent;
+	struct xfs_map_extent		*rmap;
+
+	tp->t_flags |= XFS_TRANS_DIRTY;
+	ruip->rui_item.li_desc->lid_flags |= XFS_LID_DIRTY;
+
+	/*
+	 * atomic_inc_return gives us the value after the increment;
+	 * we want to use it as an array index so we need to subtract 1 from
+	 * it.
+	 */
+	next_extent = atomic_inc_return(&ruip->rui_next_extent) - 1;
+	ASSERT(next_extent < ruip->rui_format.rui_nextents);
+	rmap = &(ruip->rui_format.rui_extents[next_extent]);
+	rmap->me_owner = owner;
+	rmap->me_startblock = startblock;
+	rmap->me_startoff = startoff;
+	rmap->me_len = blockcount;
+	rmap->me_flags = 0;
+	if (state == XFS_EXT_UNWRITTEN)
+		rmap->me_flags |= XFS_RMAP_EXTENT_UNWRITTEN;
+	if (whichfork == XFS_ATTR_FORK)
+		rmap->me_flags |= XFS_RMAP_EXTENT_ATTR_FORK;
+	switch (type) {
+	case XFS_RMAP_MAP:
+		rmap->me_flags |= XFS_RMAP_EXTENT_MAP;
+		break;
+	case XFS_RMAP_MAP_SHARED:
+		rmap->me_flags |= XFS_RMAP_EXTENT_MAP_SHARED;
+		break;
+	case XFS_RMAP_UNMAP:
+		rmap->me_flags |= XFS_RMAP_EXTENT_UNMAP;
+		break;
+	case XFS_RMAP_UNMAP_SHARED:
+		rmap->me_flags |= XFS_RMAP_EXTENT_UNMAP_SHARED;
+		break;
+	case XFS_RMAP_CONVERT:
+		rmap->me_flags |= XFS_RMAP_EXTENT_CONVERT;
+		break;
+	case XFS_RMAP_CONVERT_SHARED:
+		rmap->me_flags |= XFS_RMAP_EXTENT_CONVERT_SHARED;
+		break;
+	case XFS_RMAP_ALLOC:
+		rmap->me_flags |= XFS_RMAP_EXTENT_ALLOC;
+		break;
+	case XFS_RMAP_FREE:
+		rmap->me_flags |= XFS_RMAP_EXTENT_FREE;
+		break;
+	default:
+		ASSERT(0);
+	}
+}
+
+/*
+ * This routine is called to allocate an "rmap update done"
+ * log item that will hold nextents worth of extents.  The
+ * caller must use all nextents extents, because we are not
+ * flexible about this at all.
+ */
+struct xfs_rud_log_item *
+xfs_trans_get_rud(
+	struct xfs_trans		*tp,
+	struct xfs_rui_log_item		*ruip,
+	uint				nextents)
+{
+	struct xfs_rud_log_item		*rudp;
+
+	ASSERT(tp != NULL);
+	ASSERT(nextents > 0);
+
+	rudp = xfs_rud_init(tp->t_mountp, ruip, nextents);
+	ASSERT(rudp != NULL);
+
+	/*
+	 * Get a log_item_desc to point at the new item.
+	 */
+	xfs_trans_add_item(tp, &rudp->rud_item);
+	return rudp;
+}
+
+/*
+ * Finish an rmap update and log it to the RUD. Note that the transaction is
+ * marked dirty regardless of whether the rmap update succeeds or fails to
+ * support the RUI/RUD lifecycle rules.
+ */
+int
+xfs_trans_log_finish_rmap_update(
+	struct xfs_trans		*tp,
+	struct xfs_rud_log_item		*rudp,
+	enum xfs_rmap_intent_type	type,
+	__uint64_t			owner,
+	int				whichfork,
+	xfs_fileoff_t			startoff,
+	xfs_fsblock_t			startblock,
+	xfs_filblks_t			blockcount,
+	xfs_exntst_t			state)
+{
+	uint				next_extent;
+	struct xfs_map_extent		*rmap;
+	int				error;
+
+	/* XXX: actually finish the rmap update here */
+	error = -EFSCORRUPTED;
+
+	/*
+	 * Mark the transaction dirty, even on error. This ensures the
+	 * transaction is aborted, which:
+	 *
+	 * 1.) releases the RUI and frees the RUD
+	 * 2.) shuts down the filesystem
+	 */
+	tp->t_flags |= XFS_TRANS_DIRTY;
+	rudp->rud_item.li_desc->lid_flags |= XFS_LID_DIRTY;
+
+	next_extent = rudp->rud_next_extent;
+	ASSERT(next_extent < rudp->rud_format.rud_nextents);
+	rmap = &(rudp->rud_format.rud_extents[next_extent]);
+	rmap->me_owner = owner;
+	rmap->me_startblock = startblock;
+	rmap->me_startoff = startoff;
+	rmap->me_len = blockcount;
+	rmap->me_flags = 0;
+	if (state == XFS_EXT_UNWRITTEN)
+		rmap->me_flags |= XFS_RMAP_EXTENT_UNWRITTEN;
+	if (whichfork == XFS_ATTR_FORK)
+		rmap->me_flags |= XFS_RMAP_EXTENT_ATTR_FORK;
+	switch (type) {
+	case XFS_RMAP_MAP:
+		rmap->me_flags |= XFS_RMAP_EXTENT_MAP;
+		break;
+	case XFS_RMAP_MAP_SHARED:
+		rmap->me_flags |= XFS_RMAP_EXTENT_MAP_SHARED;
+		break;
+	case XFS_RMAP_UNMAP:
+		rmap->me_flags |= XFS_RMAP_EXTENT_UNMAP;
+		break;
+	case XFS_RMAP_UNMAP_SHARED:
+		rmap->me_flags |= XFS_RMAP_EXTENT_UNMAP_SHARED;
+		break;
+	case XFS_RMAP_CONVERT:
+		rmap->me_flags |= XFS_RMAP_EXTENT_CONVERT;
+		break;
+	case XFS_RMAP_CONVERT_SHARED:
+		rmap->me_flags |= XFS_RMAP_EXTENT_CONVERT_SHARED;
+		break;
+	case XFS_RMAP_ALLOC:
+		rmap->me_flags |= XFS_RMAP_EXTENT_ALLOC;
+		break;
+	case XFS_RMAP_FREE:
+		rmap->me_flags |= XFS_RMAP_EXTENT_FREE;
+		break;
+	default:
+		ASSERT(0);
+	}
+	rudp->rud_next_extent++;
+
+	return error;
+}


^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 043/119] xfs: enable the xfs_defer mechanism to process rmaps to update
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (41 preceding siblings ...)
  2016-06-17  1:22 ` [PATCH 042/119] xfs: log rmap intent items Darrick J. Wong
@ 2016-06-17  1:22 ` Darrick J. Wong
  2016-07-15 18:33   ` Brian Foster
  2016-06-17  1:22 ` [PATCH 044/119] xfs: propagate bmap updates to rmapbt Darrick J. Wong
                   ` (75 subsequent siblings)
  118 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:22 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Connect the xfs_defer mechanism with the pieces that we'll need to
handle deferred rmap updates.  We'll wire up the existing code to
our new deferred mechanism later.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_defer.h |    1 
 fs/xfs/xfs_defer_item.c   |  124 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 125 insertions(+)


diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
index 743fc32..920642e62 100644
--- a/fs/xfs/libxfs/xfs_defer.h
+++ b/fs/xfs/libxfs/xfs_defer.h
@@ -51,6 +51,7 @@ struct xfs_defer_pending {
  * find all the space it needs.
  */
 enum xfs_defer_ops_type {
+	XFS_DEFER_OPS_TYPE_RMAP,
 	XFS_DEFER_OPS_TYPE_FREE,
 	XFS_DEFER_OPS_TYPE_MAX,
 };
diff --git a/fs/xfs/xfs_defer_item.c b/fs/xfs/xfs_defer_item.c
index 1c2d556..dbd10fc 100644
--- a/fs/xfs/xfs_defer_item.c
+++ b/fs/xfs/xfs_defer_item.c
@@ -31,6 +31,8 @@
 #include "xfs_trace.h"
 #include "xfs_bmap.h"
 #include "xfs_extfree_item.h"
+#include "xfs_rmap_btree.h"
+#include "xfs_rmap_item.h"
 
 /* Extent Freeing */
 
@@ -136,11 +138,133 @@ const struct xfs_defer_op_type xfs_extent_free_defer_type = {
 	.cancel_item	= xfs_bmap_free_cancel_item,
 };
 
+/* Reverse Mapping */
+
+/* Sort rmap intents by AG. */
+static int
+xfs_rmap_update_diff_items(
+	void				*priv,
+	struct list_head		*a,
+	struct list_head		*b)
+{
+	struct xfs_mount		*mp = priv;
+	struct xfs_rmap_intent		*ra;
+	struct xfs_rmap_intent		*rb;
+
+	ra = container_of(a, struct xfs_rmap_intent, ri_list);
+	rb = container_of(b, struct xfs_rmap_intent, ri_list);
+	return  XFS_FSB_TO_AGNO(mp, ra->ri_bmap.br_startblock) -
+		XFS_FSB_TO_AGNO(mp, rb->ri_bmap.br_startblock);
+}
+
+/* Get an RUI. */
+STATIC void *
+xfs_rmap_update_create_intent(
+	struct xfs_trans		*tp,
+	unsigned int			count)
+{
+	return xfs_trans_get_rui(tp, count);
+}
+
+/* Log rmap updates in the intent item. */
+STATIC void
+xfs_rmap_update_log_item(
+	struct xfs_trans		*tp,
+	void				*intent,
+	struct list_head		*item)
+{
+	struct xfs_rmap_intent		*rmap;
+
+	rmap = container_of(item, struct xfs_rmap_intent, ri_list);
+	xfs_trans_log_start_rmap_update(tp, intent, rmap->ri_type,
+			rmap->ri_owner, rmap->ri_whichfork,
+			rmap->ri_bmap.br_startoff,
+			rmap->ri_bmap.br_startblock,
+			rmap->ri_bmap.br_blockcount,
+			rmap->ri_bmap.br_state);
+}
+
+/* Get an RUD so we can process all the deferred rmap updates. */
+STATIC void *
+xfs_rmap_update_create_done(
+	struct xfs_trans		*tp,
+	void				*intent,
+	unsigned int			count)
+{
+	return xfs_trans_get_rud(tp, intent, count);
+}
+
+/* Process a deferred rmap update. */
+STATIC int
+xfs_rmap_update_finish_item(
+	struct xfs_trans		*tp,
+	struct xfs_defer_ops		*dop,
+	struct list_head		*item,
+	void				*done_item,
+	void				**state)
+{
+	struct xfs_rmap_intent		*rmap;
+	int				error;
+
+	rmap = container_of(item, struct xfs_rmap_intent, ri_list);
+	error = xfs_trans_log_finish_rmap_update(tp, done_item,
+			rmap->ri_type,
+			rmap->ri_owner, rmap->ri_whichfork,
+			rmap->ri_bmap.br_startoff,
+			rmap->ri_bmap.br_startblock,
+			rmap->ri_bmap.br_blockcount,
+			rmap->ri_bmap.br_state);
+	kmem_free(rmap);
+	return error;
+}
+
+/* Clean up after processing deferred rmaps. */
+STATIC void
+xfs_rmap_update_finish_cleanup(
+	struct xfs_trans	*tp,
+	void			*state,
+	int			error)
+{
+}
+
+/* Abort all pending RUIs. */
+STATIC void
+xfs_rmap_update_abort_intent(
+	void				*intent)
+{
+	xfs_rui_release(intent);
+}
+
+/* Cancel a deferred rmap update. */
+STATIC void
+xfs_rmap_update_cancel_item(
+	struct list_head		*item)
+{
+	struct xfs_rmap_intent		*rmap;
+
+	rmap = container_of(item, struct xfs_rmap_intent, ri_list);
+	kmem_free(rmap);
+}
+
+const struct xfs_defer_op_type xfs_rmap_update_defer_type = {
+	.type		= XFS_DEFER_OPS_TYPE_RMAP,
+	.max_items	= XFS_RUI_MAX_FAST_EXTENTS,
+	.diff_items	= xfs_rmap_update_diff_items,
+	.create_intent	= xfs_rmap_update_create_intent,
+	.abort_intent	= xfs_rmap_update_abort_intent,
+	.log_item	= xfs_rmap_update_log_item,
+	.create_done	= xfs_rmap_update_create_done,
+	.finish_item	= xfs_rmap_update_finish_item,
+	.finish_cleanup = xfs_rmap_update_finish_cleanup,
+	.cancel_item	= xfs_rmap_update_cancel_item,
+};
+
 /* Deferred Item Initialization */
 
 /* Initialize the deferred operation types. */
 void
 xfs_defer_init_types(void)
 {
+	xfs_defer_init_op_type(&xfs_rmap_update_defer_type);
 	xfs_defer_init_op_type(&xfs_extent_free_defer_type);
 }



* [PATCH 044/119] xfs: propagate bmap updates to rmapbt
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (42 preceding siblings ...)
  2016-06-17  1:22 ` [PATCH 043/119] xfs: enable the xfs_defer mechanism to process rmaps to update Darrick J. Wong
@ 2016-06-17  1:22 ` Darrick J. Wong
  2016-07-15 18:33   ` Brian Foster
  2016-06-17  1:22 ` [PATCH 045/119] xfs: add rmap btree geometry feature flag Darrick J. Wong
                   ` (74 subsequent siblings)
  118 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:22 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

When we map, unmap, or convert an extent in a file's data or attr
fork, schedule a respective update in the rmapbt.  Previous versions
of this patch required a 1:1 correspondence between bmap and rmap,
but this is no longer true.

v2: Remove the 1:1 correspondence requirement now that we have the
ability to make interval queries against the rmapbt.  Update the
commit message to reflect the broad restructuring of this patch.
Fix the bmap shift code to adjust the rmaps correctly.

v3: Use the deferred operations code to handle redo operations
atomically and deadlock free.  Plumb in all five rmap actions
(map, unmap, convert extent, alloc, free); we'll use the first
three now for file data, and reflink will want the last two.
Add an error injection site to test log recovery.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c       |   56 ++++++++-
 fs/xfs/libxfs/xfs_rmap.c       |  252 ++++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_rmap_btree.h |   24 ++++
 fs/xfs/xfs_bmap_util.c         |    1 
 fs/xfs/xfs_defer_item.c        |    6 +
 fs/xfs/xfs_error.h             |    4 -
 fs/xfs/xfs_log_recover.c       |   56 +++++++++
 fs/xfs/xfs_trans.h             |    3 
 fs/xfs/xfs_trans_rmap.c        |    7 +
 9 files changed, 393 insertions(+), 16 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 61c0231..507fd74 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -46,6 +46,7 @@
 #include "xfs_symlink.h"
 #include "xfs_attr_leaf.h"
 #include "xfs_filestream.h"
+#include "xfs_rmap_btree.h"
 
 
 kmem_zone_t		*xfs_bmap_free_item_zone;
@@ -2178,6 +2179,11 @@ xfs_bmap_add_extent_delay_real(
 		ASSERT(0);
 	}
 
+	/* add reverse mapping */
+	error = xfs_rmap_map_extent(mp, bma->dfops, bma->ip, whichfork, new);
+	if (error)
+		goto done;
+
 	/* convert to a btree if necessary */
 	if (xfs_bmap_needs_btree(bma->ip, whichfork)) {
 		int	tmp_logflags;	/* partial log flag return val */
@@ -2714,6 +2720,11 @@ xfs_bmap_add_extent_unwritten_real(
 		ASSERT(0);
 	}
 
+	/* update reverse mappings */
+	error = xfs_rmap_convert_extent(mp, dfops, ip, XFS_DATA_FORK, new);
+	if (error)
+		goto done;
+
 	/* convert to a btree if necessary */
 	if (xfs_bmap_needs_btree(ip, XFS_DATA_FORK)) {
 		int	tmp_logflags;	/* partial log flag return val */
@@ -3106,6 +3117,11 @@ xfs_bmap_add_extent_hole_real(
 		break;
 	}
 
+	/* add reverse mapping */
+	error = xfs_rmap_map_extent(mp, bma->dfops, bma->ip, whichfork, new);
+	if (error)
+		goto done;
+
 	/* convert to a btree if necessary */
 	if (xfs_bmap_needs_btree(bma->ip, whichfork)) {
 		int	tmp_logflags;	/* partial log flag return val */
@@ -5032,6 +5048,14 @@ xfs_bmap_del_extent(
 		++*idx;
 		break;
 	}
+
+	/* remove reverse mapping */
+	if (!delay) {
+		error = xfs_rmap_unmap_extent(mp, dfops, ip, whichfork, del);
+		if (error)
+			goto done;
+	}
+
 	/*
 	 * If we need to, add to list of extents to delete.
 	 */
@@ -5569,7 +5593,8 @@ xfs_bmse_shift_one(
 	struct xfs_bmbt_rec_host	*gotp,
 	struct xfs_btree_cur		*cur,
 	int				*logflags,
-	enum shift_direction		direction)
+	enum shift_direction		direction,
+	struct xfs_defer_ops		*dfops)
 {
 	struct xfs_ifork		*ifp;
 	struct xfs_mount		*mp;
@@ -5617,9 +5642,13 @@ xfs_bmse_shift_one(
 		/* check whether to merge the extent or shift it down */
 		if (xfs_bmse_can_merge(&adj_irec, &got,
 				       offset_shift_fsb)) {
-			return xfs_bmse_merge(ip, whichfork, offset_shift_fsb,
-					      *current_ext, gotp, adj_irecp,
-					      cur, logflags);
+			error = xfs_bmse_merge(ip, whichfork, offset_shift_fsb,
+					       *current_ext, gotp, adj_irecp,
+					       cur, logflags);
+			if (error)
+				return error;
+			adj_irec = got;
+			goto update_rmap;
 		}
 	} else {
 		startoff = got.br_startoff + offset_shift_fsb;
@@ -5656,9 +5685,10 @@ update_current_ext:
 		(*current_ext)--;
 	xfs_bmbt_set_startoff(gotp, startoff);
 	*logflags |= XFS_ILOG_CORE;
+	adj_irec = got;
 	if (!cur) {
 		*logflags |= XFS_ILOG_DEXT;
-		return 0;
+		goto update_rmap;
 	}
 
 	error = xfs_bmbt_lookup_eq(cur, got.br_startoff, got.br_startblock,
@@ -5668,8 +5698,18 @@ update_current_ext:
 	XFS_WANT_CORRUPTED_RETURN(mp, i == 1);
 
 	got.br_startoff = startoff;
-	return xfs_bmbt_update(cur, got.br_startoff, got.br_startblock,
-			       got.br_blockcount, got.br_state);
+	error = xfs_bmbt_update(cur, got.br_startoff, got.br_startblock,
+			got.br_blockcount, got.br_state);
+	if (error)
+		return error;
+
+update_rmap:
+	/* update reverse mapping */
+	error = xfs_rmap_unmap_extent(mp, dfops, ip, whichfork, &adj_irec);
+	if (error)
+		return error;
+	adj_irec.br_startoff = startoff;
+	return xfs_rmap_map_extent(mp, dfops, ip, whichfork, &adj_irec);
 }
 
 /*
@@ -5797,7 +5837,7 @@ xfs_bmap_shift_extents(
 	while (nexts++ < num_exts) {
 		error = xfs_bmse_shift_one(ip, whichfork, offset_shift_fsb,
 					   &current_ext, gotp, cur, &logflags,
-					   direction);
+					   direction, dfops);
 		if (error)
 			goto del_cursor;
 		/*
diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
index 76fc5c2..f179ea4 100644
--- a/fs/xfs/libxfs/xfs_rmap.c
+++ b/fs/xfs/libxfs/xfs_rmap.c
@@ -36,6 +36,8 @@
 #include "xfs_trace.h"
 #include "xfs_error.h"
 #include "xfs_extent_busy.h"
+#include "xfs_bmap.h"
+#include "xfs_inode.h"
 
 /*
  * Lookup the first record less than or equal to [bno, len, owner, offset]
@@ -1212,3 +1214,253 @@ xfs_rmapbt_query_range(
 	return xfs_btree_query_range(cur, &low_brec, &high_brec,
 			xfs_rmapbt_query_range_helper, &query);
 }
+
+/* Clean up after calling xfs_rmap_finish_one. */
+void
+xfs_rmap_finish_one_cleanup(
+	struct xfs_trans	*tp,
+	struct xfs_btree_cur	*rcur,
+	int			error)
+{
+	struct xfs_buf		*agbp;
+
+	if (rcur == NULL)
+		return;
+	agbp = rcur->bc_private.a.agbp;
+	xfs_btree_del_cursor(rcur, error ? XFS_BTREE_ERROR : XFS_BTREE_NOERROR);
+	xfs_trans_brelse(tp, agbp);
+}
+
+/*
+ * Process one of the deferred rmap operations.  We pass back the
+ * btree cursor to maintain our lock on the rmapbt between calls.
+ * This saves time and eliminates a buffer deadlock between the
+ * superblock and the AGF because we'll always grab them in the same
+ * order.
+ */
+int
+xfs_rmap_finish_one(
+	struct xfs_trans		*tp,
+	enum xfs_rmap_intent_type	type,
+	__uint64_t			owner,
+	int				whichfork,
+	xfs_fileoff_t			startoff,
+	xfs_fsblock_t			startblock,
+	xfs_filblks_t			blockcount,
+	xfs_exntst_t			state,
+	struct xfs_btree_cur		**pcur)
+{
+	struct xfs_mount		*mp = tp->t_mountp;
+	struct xfs_btree_cur		*rcur;
+	struct xfs_buf			*agbp = NULL;
+	int				error = 0;
+	xfs_agnumber_t			agno;
+	struct xfs_owner_info		oinfo;
+	xfs_agblock_t			bno;
+	bool				unwritten;
+
+	agno = XFS_FSB_TO_AGNO(mp, startblock);
+	ASSERT(agno != NULLAGNUMBER);
+	bno = XFS_FSB_TO_AGBNO(mp, startblock);
+
+	trace_xfs_rmap_deferred(mp, agno, type, bno, owner, whichfork,
+			startoff, blockcount, state);
+
+	if (XFS_TEST_ERROR(false, mp,
+			XFS_ERRTAG_RMAP_FINISH_ONE,
+			XFS_RANDOM_RMAP_FINISH_ONE))
+		return -EIO;
+
+	/*
+	 * If we haven't gotten a cursor or the cursor AG doesn't match
+	 * the startblock, get one now.
+	 */
+	rcur = *pcur;
+	if (rcur != NULL && rcur->bc_private.a.agno != agno) {
+		xfs_rmap_finish_one_cleanup(tp, rcur, 0);
+		rcur = NULL;
+		*pcur = NULL;
+	}
+	if (rcur == NULL) {
+		error = xfs_free_extent_fix_freelist(tp, agno, &agbp);
+		if (error)
+			return error;
+		if (!agbp)
+			return -EFSCORRUPTED;
+
+		rcur = xfs_rmapbt_init_cursor(mp, tp, agbp, agno);
+		if (!rcur) {
+			error = -ENOMEM;
+			goto out_cur;
+		}
+	}
+	*pcur = rcur;
+
+	xfs_rmap_ino_owner(&oinfo, owner, whichfork, startoff);
+	unwritten = state == XFS_EXT_UNWRITTEN;
+	bno = XFS_FSB_TO_AGBNO(rcur->bc_mp, startblock);
+
+	switch (type) {
+	case XFS_RMAP_MAP:
+		error = xfs_rmap_map(rcur, bno, blockcount, unwritten, &oinfo);
+		break;
+	case XFS_RMAP_UNMAP:
+		error = xfs_rmap_unmap(rcur, bno, blockcount, unwritten,
+				&oinfo);
+		break;
+	case XFS_RMAP_CONVERT:
+		error = xfs_rmap_convert(rcur, bno, blockcount, !unwritten,
+				&oinfo);
+		break;
+	case XFS_RMAP_ALLOC:
+		error = __xfs_rmap_alloc(rcur, bno, blockcount, unwritten,
+				&oinfo);
+		break;
+	case XFS_RMAP_FREE:
+		error = __xfs_rmap_free(rcur, bno, blockcount, unwritten,
+				&oinfo);
+		break;
+	default:
+		ASSERT(0);
+		error = -EFSCORRUPTED;
+	}
+	return error;
+
+out_cur:
+	xfs_trans_brelse(tp, agbp);
+
+	return error;
+}
+
+/*
+ * Record a rmap intent; the list is kept sorted first by AG and then by
+ * increasing age.
+ */
+static int
+__xfs_rmap_add(
+	struct xfs_mount	*mp,
+	struct xfs_defer_ops	*dfops,
+	struct xfs_rmap_intent	*ri)
+{
+	struct xfs_rmap_intent	*new;
+
+	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
+		return 0;
+
+	trace_xfs_rmap_defer(mp, XFS_FSB_TO_AGNO(mp, ri->ri_bmap.br_startblock),
+			ri->ri_type,
+			XFS_FSB_TO_AGBNO(mp, ri->ri_bmap.br_startblock),
+			ri->ri_owner, ri->ri_whichfork,
+			ri->ri_bmap.br_startoff,
+			ri->ri_bmap.br_blockcount,
+			ri->ri_bmap.br_state);
+
+	new = kmem_zalloc(sizeof(struct xfs_rmap_intent), KM_SLEEP | KM_NOFS);
+	*new = *ri;
+
+	xfs_defer_add(dfops, XFS_DEFER_OPS_TYPE_RMAP, &new->ri_list);
+	return 0;
+}
+
+/* Map an extent into a file. */
+int
+xfs_rmap_map_extent(
+	struct xfs_mount	*mp,
+	struct xfs_defer_ops	*dfops,
+	struct xfs_inode	*ip,
+	int			whichfork,
+	struct xfs_bmbt_irec	*PREV)
+{
+	struct xfs_rmap_intent	ri;
+
+	ri.ri_type = XFS_RMAP_MAP;
+	ri.ri_owner = ip->i_ino;
+	ri.ri_whichfork = whichfork;
+	ri.ri_bmap = *PREV;
+
+	return __xfs_rmap_add(mp, dfops, &ri);
+}
+
+/* Unmap an extent out of a file. */
+int
+xfs_rmap_unmap_extent(
+	struct xfs_mount	*mp,
+	struct xfs_defer_ops	*dfops,
+	struct xfs_inode	*ip,
+	int			whichfork,
+	struct xfs_bmbt_irec	*PREV)
+{
+	struct xfs_rmap_intent	ri;
+
+	ri.ri_type = XFS_RMAP_UNMAP;
+	ri.ri_owner = ip->i_ino;
+	ri.ri_whichfork = whichfork;
+	ri.ri_bmap = *PREV;
+
+	return __xfs_rmap_add(mp, dfops, &ri);
+}
+
+/* Convert a data fork extent from unwritten to real or vice versa. */
+int
+xfs_rmap_convert_extent(
+	struct xfs_mount	*mp,
+	struct xfs_defer_ops	*dfops,
+	struct xfs_inode	*ip,
+	int			whichfork,
+	struct xfs_bmbt_irec	*PREV)
+{
+	struct xfs_rmap_intent	ri;
+
+	ri.ri_type = XFS_RMAP_CONVERT;
+	ri.ri_owner = ip->i_ino;
+	ri.ri_whichfork = whichfork;
+	ri.ri_bmap = *PREV;
+
+	return __xfs_rmap_add(mp, dfops, &ri);
+}
+
+/* Schedule the creation of an rmap for non-file data. */
+int
+xfs_rmap_alloc_defer(
+	struct xfs_mount	*mp,
+	struct xfs_defer_ops	*dfops,
+	xfs_agnumber_t		agno,
+	xfs_agblock_t		bno,
+	xfs_extlen_t		len,
+	__uint64_t		owner)
+{
+	struct xfs_rmap_intent	ri;
+
+	ri.ri_type = XFS_RMAP_ALLOC;
+	ri.ri_owner = owner;
+	ri.ri_whichfork = XFS_DATA_FORK;
+	ri.ri_bmap.br_startblock = XFS_AGB_TO_FSB(mp, agno, bno);
+	ri.ri_bmap.br_blockcount = len;
+	ri.ri_bmap.br_startoff = 0;
+	ri.ri_bmap.br_state = XFS_EXT_NORM;
+
+	return __xfs_rmap_add(mp, dfops, &ri);
+}
+
+/* Schedule the deletion of an rmap for non-file data. */
+int
+xfs_rmap_free_defer(
+	struct xfs_mount	*mp,
+	struct xfs_defer_ops	*dfops,
+	xfs_agnumber_t		agno,
+	xfs_agblock_t		bno,
+	xfs_extlen_t		len,
+	__uint64_t		owner)
+{
+	struct xfs_rmap_intent	ri;
+
+	ri.ri_type = XFS_RMAP_FREE;
+	ri.ri_owner = owner;
+	ri.ri_whichfork = XFS_DATA_FORK;
+	ri.ri_bmap.br_startblock = XFS_AGB_TO_FSB(mp, agno, bno);
+	ri.ri_bmap.br_blockcount = len;
+	ri.ri_bmap.br_startoff = 0;
+	ri.ri_bmap.br_state = XFS_EXT_NORM;
+
+	return __xfs_rmap_add(mp, dfops, &ri);
+}
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
index aff60dc..5df406e 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.h
+++ b/fs/xfs/libxfs/xfs_rmap_btree.h
@@ -106,4 +106,28 @@ struct xfs_rmap_intent {
 	struct xfs_bmbt_irec			ri_bmap;
 };
 
+/* functions for updating the rmapbt based on bmbt map/unmap operations */
+int xfs_rmap_map_extent(struct xfs_mount *mp, struct xfs_defer_ops *dfops,
+		struct xfs_inode *ip, int whichfork,
+		struct xfs_bmbt_irec *imap);
+int xfs_rmap_unmap_extent(struct xfs_mount *mp, struct xfs_defer_ops *dfops,
+		struct xfs_inode *ip, int whichfork,
+		struct xfs_bmbt_irec *imap);
+int xfs_rmap_convert_extent(struct xfs_mount *mp, struct xfs_defer_ops *dfops,
+		struct xfs_inode *ip, int whichfork,
+		struct xfs_bmbt_irec *imap);
+int xfs_rmap_alloc_defer(struct xfs_mount *mp, struct xfs_defer_ops *dfops,
+		xfs_agnumber_t agno, xfs_agblock_t bno, xfs_extlen_t len,
+		__uint64_t owner);
+int xfs_rmap_free_defer(struct xfs_mount *mp, struct xfs_defer_ops *dfops,
+		xfs_agnumber_t agno, xfs_agblock_t bno, xfs_extlen_t len,
+		__uint64_t owner);
+
+void xfs_rmap_finish_one_cleanup(struct xfs_trans *tp,
+		struct xfs_btree_cur *rcur, int error);
+int xfs_rmap_finish_one(struct xfs_trans *tp, enum xfs_rmap_intent_type type,
+		__uint64_t owner, int whichfork, xfs_fileoff_t startoff,
+		xfs_fsblock_t startblock, xfs_filblks_t blockcount,
+		xfs_exntst_t state, struct xfs_btree_cur **pcur);
+
 #endif	/* __XFS_RMAP_BTREE_H__ */
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 62d194e..450fd49 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -41,6 +41,7 @@
 #include "xfs_trace.h"
 #include "xfs_icache.h"
 #include "xfs_log.h"
+#include "xfs_rmap_btree.h"
 
 /* Kernel only BMAP related definitions and functions */
 
diff --git a/fs/xfs/xfs_defer_item.c b/fs/xfs/xfs_defer_item.c
index dbd10fc..9ed060d 100644
--- a/fs/xfs/xfs_defer_item.c
+++ b/fs/xfs/xfs_defer_item.c
@@ -213,7 +213,8 @@ xfs_rmap_update_finish_item(
 			rmap->ri_bmap.br_startoff,
 			rmap->ri_bmap.br_startblock,
 			rmap->ri_bmap.br_blockcount,
-			rmap->ri_bmap.br_state);
+			rmap->ri_bmap.br_state,
+			(struct xfs_btree_cur **)state);
 	kmem_free(rmap);
 	return error;
 }
@@ -225,6 +226,9 @@ xfs_rmap_update_finish_cleanup(
 	void			*state,
 	int			error)
 {
+	struct xfs_btree_cur	*rcur = state;
+
+	xfs_rmap_finish_one_cleanup(tp, rcur, error);
 }
 
 /* Abort all pending RUIs. */
diff --git a/fs/xfs/xfs_error.h b/fs/xfs/xfs_error.h
index ee4680e..6bc614c 100644
--- a/fs/xfs/xfs_error.h
+++ b/fs/xfs/xfs_error.h
@@ -91,7 +91,8 @@ extern void xfs_verifier_error(struct xfs_buf *bp);
 #define XFS_ERRTAG_DIOWRITE_IOERR			20
 #define XFS_ERRTAG_BMAPIFORMAT				21
 #define XFS_ERRTAG_FREE_EXTENT				22
-#define XFS_ERRTAG_MAX					23
+#define XFS_ERRTAG_RMAP_FINISH_ONE			23
+#define XFS_ERRTAG_MAX					24
 
 /*
  * Random factors for above tags, 1 means always, 2 means 1/2 time, etc.
@@ -119,6 +120,7 @@ extern void xfs_verifier_error(struct xfs_buf *bp);
 #define XFS_RANDOM_DIOWRITE_IOERR			(XFS_RANDOM_DEFAULT/10)
 #define	XFS_RANDOM_BMAPIFORMAT				XFS_RANDOM_DEFAULT
 #define XFS_RANDOM_FREE_EXTENT				1
+#define XFS_RANDOM_RMAP_FINISH_ONE			1
 
 #ifdef DEBUG
 extern int xfs_error_test_active;
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index c9fe0c4..f7f9635 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -45,6 +45,7 @@
 #include "xfs_error.h"
 #include "xfs_dir2.h"
 #include "xfs_rmap_item.h"
+#include "xfs_rmap_btree.h"
 
 #define BLK_AVG(blk1, blk2)	((blk1+blk2) >> 1)
 
@@ -4486,6 +4487,12 @@ xlog_recover_process_rui(
 	struct xfs_map_extent		*rmap;
 	xfs_fsblock_t			startblock_fsb;
 	bool				op_ok;
+	struct xfs_rud_log_item		*rudp;
+	enum xfs_rmap_intent_type	type;
+	int				whichfork;
+	xfs_exntst_t			state;
+	struct xfs_trans		*tp;
+	struct xfs_btree_cur		*rcur = NULL;
 
 	ASSERT(!test_bit(XFS_RUI_RECOVERED, &ruip->rui_flags));
 
@@ -4528,9 +4535,54 @@ xlog_recover_process_rui(
 		}
 	}
 
-	/* XXX: do nothing for now */
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
+	if (error)
+		return error;
+	rudp = xfs_trans_get_rud(tp, ruip, ruip->rui_format.rui_nextents);
+
+	for (i = 0; i < ruip->rui_format.rui_nextents; i++) {
+		rmap = &(ruip->rui_format.rui_extents[i]);
+		state = (rmap->me_flags & XFS_RMAP_EXTENT_UNWRITTEN) ?
+				XFS_EXT_UNWRITTEN : XFS_EXT_NORM;
+		whichfork = (rmap->me_flags & XFS_RMAP_EXTENT_ATTR_FORK) ?
+				XFS_ATTR_FORK : XFS_DATA_FORK;
+		switch (rmap->me_flags & XFS_RMAP_EXTENT_TYPE_MASK) {
+		case XFS_RMAP_EXTENT_MAP:
+			type = XFS_RMAP_MAP;
+			break;
+		case XFS_RMAP_EXTENT_UNMAP:
+			type = XFS_RMAP_UNMAP;
+			break;
+		case XFS_RMAP_EXTENT_CONVERT:
+			type = XFS_RMAP_CONVERT;
+			break;
+		case XFS_RMAP_EXTENT_ALLOC:
+			type = XFS_RMAP_ALLOC;
+			break;
+		case XFS_RMAP_EXTENT_FREE:
+			type = XFS_RMAP_FREE;
+			break;
+		default:
+			error = -EFSCORRUPTED;
+			goto abort_error;
+		}
+		error = xfs_trans_log_finish_rmap_update(tp, rudp, type,
+				rmap->me_owner, whichfork,
+				rmap->me_startoff, rmap->me_startblock,
+				rmap->me_len, state, &rcur);
+		if (error)
+			goto abort_error;
+
+	}
+
+	xfs_rmap_finish_one_cleanup(tp, rcur, error);
 	set_bit(XFS_RUI_RECOVERED, &ruip->rui_flags);
-	xfs_rui_release(ruip);
+	error = xfs_trans_commit(tp);
+	return error;
+
+abort_error:
+	xfs_rmap_finish_one_cleanup(tp, rcur, error);
+	xfs_trans_cancel(tp);
 	return error;
 }
 
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index c48be63..f59d934 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -244,12 +244,13 @@ void xfs_trans_log_start_rmap_update(struct xfs_trans *tp,
 		xfs_fsblock_t startblock, xfs_filblks_t blockcount,
 		xfs_exntst_t state);
 
+struct xfs_btree_cur;
 struct xfs_rud_log_item *xfs_trans_get_rud(struct xfs_trans *tp,
 		struct xfs_rui_log_item *ruip, uint nextents);
 int xfs_trans_log_finish_rmap_update(struct xfs_trans *tp,
 		struct xfs_rud_log_item *rudp, enum xfs_rmap_intent_type type,
 		__uint64_t owner, int whichfork, xfs_fileoff_t startoff,
 		xfs_fsblock_t startblock, xfs_filblks_t blockcount,
-		xfs_exntst_t state);
+		xfs_exntst_t state, struct xfs_btree_cur **pcur);
 
 #endif	/* __XFS_TRANS_H__ */
diff --git a/fs/xfs/xfs_trans_rmap.c b/fs/xfs/xfs_trans_rmap.c
index b55a725..0c0df18 100644
--- a/fs/xfs/xfs_trans_rmap.c
+++ b/fs/xfs/xfs_trans_rmap.c
@@ -170,14 +170,15 @@ xfs_trans_log_finish_rmap_update(
 	xfs_fileoff_t			startoff,
 	xfs_fsblock_t			startblock,
 	xfs_filblks_t			blockcount,
-	xfs_exntst_t			state)
+	xfs_exntst_t			state,
+	struct xfs_btree_cur		**pcur)
 {
 	uint				next_extent;
 	struct xfs_map_extent		*rmap;
 	int				error;
 
-	/* XXX: actually finish the rmap update here */
-	error = -EFSCORRUPTED;
+	error = xfs_rmap_finish_one(tp, type, owner, whichfork, startoff,
+			startblock, blockcount, state, pcur);
 
 	/*
 	 * Mark the transaction dirty, even on error. This ensures the


^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 045/119] xfs: add rmap btree geometry feature flag
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (43 preceding siblings ...)
  2016-06-17  1:22 ` [PATCH 044/119] xfs: propagate bmap updates to rmapbt Darrick J. Wong
@ 2016-06-17  1:22 ` Darrick J. Wong
  2016-07-18 13:34   ` Brian Foster
  2016-06-17  1:22 ` [PATCH 046/119] xfs: add rmap btree block detection to log recovery Darrick J. Wong
                   ` (73 subsequent siblings)
  118 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:22 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs, Dave Chinner

From: Dave Chinner <dchinner@redhat.com>

Add a geometry feature flag so that xfs_info and other userspace
utilities know the filesystem is using this feature.
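For illustration, a userspace consumer would test the new bit in the geometry flags word. The flag value below mirrors the one added by this patch; the helper function around it is a hypothetical sketch, not code from xfsprogs:

```c
#include <assert.h>

/* Flag value taken from the patch below; the helper is illustrative only. */
#define XFS_FSOP_GEOM_FLAGS_RMAPBT	0x80000	/* reverse mapping btree */

/* Return nonzero if the geometry flags report an rmap btree. */
static int fs_has_rmapbt(unsigned int geom_flags)
{
	return (geom_flags & XFS_FSOP_GEOM_FLAGS_RMAPBT) != 0;
}
```

A tool like xfs_info would obtain the flags word from the XFS_IOC_FSGEOMETRY ioctl and test bits in this manner.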

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_fs.h |    1 +
 fs/xfs/xfs_fsops.c     |    4 +++-
 2 files changed, 4 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index f5ec9c5..7945505 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -206,6 +206,7 @@ typedef struct xfs_fsop_resblks {
 #define XFS_FSOP_GEOM_FLAGS_FTYPE	0x10000	/* inode directory types */
 #define XFS_FSOP_GEOM_FLAGS_FINOBT	0x20000	/* free inode btree */
 #define XFS_FSOP_GEOM_FLAGS_SPINODES	0x40000	/* sparse inode chunks	*/
+#define XFS_FSOP_GEOM_FLAGS_RMAPBT	0x80000	/* Reverse mapping btree */
 
 /*
  * Minimum and maximum sizes need for growth checks.
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 3772f6c..5980d5c 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -105,7 +105,9 @@ xfs_fs_geometry(
 			(xfs_sb_version_hasfinobt(&mp->m_sb) ?
 				XFS_FSOP_GEOM_FLAGS_FINOBT : 0) |
 			(xfs_sb_version_hassparseinodes(&mp->m_sb) ?
-				XFS_FSOP_GEOM_FLAGS_SPINODES : 0);
+				XFS_FSOP_GEOM_FLAGS_SPINODES : 0) |
+			(xfs_sb_version_hasrmapbt(&mp->m_sb) ?
+				XFS_FSOP_GEOM_FLAGS_RMAPBT : 0);
 		geo->logsectsize = xfs_sb_version_hassector(&mp->m_sb) ?
 				mp->m_sb.sb_logsectsize : BBSIZE;
 		geo->rtsectsize = mp->m_sb.sb_blocksize;



* [PATCH 046/119] xfs: add rmap btree block detection to log recovery
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (44 preceding siblings ...)
  2016-06-17  1:22 ` [PATCH 045/119] xfs: add rmap btree geometry feature flag Darrick J. Wong
@ 2016-06-17  1:22 ` Darrick J. Wong
  2016-07-18 13:34   ` Brian Foster
  2016-06-17  1:22 ` [PATCH 047/119] xfs: disable XFS_IOC_SWAPEXT when rmap btree is enabled Darrick J. Wong
                   ` (72 subsequent siblings)
  118 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:22 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs, Dave Chinner

From: Dave Chinner <dchinner@redhat.com>

So such blocks can be correctly identified and have their operations
structures attached, to validate that recovery has not resulted in a
corrupt block.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_recover.c |    4 ++++
 1 file changed, 4 insertions(+)


diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index f7f9635..dbfbc26 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -2233,6 +2233,7 @@ xlog_recover_get_buf_lsn(
 	case XFS_ABTC_CRC_MAGIC:
 	case XFS_ABTB_MAGIC:
 	case XFS_ABTC_MAGIC:
+	case XFS_RMAP_CRC_MAGIC:
 	case XFS_IBT_CRC_MAGIC:
 	case XFS_IBT_MAGIC: {
 		struct xfs_btree_block *btb = blk;
@@ -2401,6 +2402,9 @@ xlog_recover_validate_buf_type(
 		case XFS_BMAP_MAGIC:
 			bp->b_ops = &xfs_bmbt_buf_ops;
 			break;
+		case XFS_RMAP_CRC_MAGIC:
+			bp->b_ops = &xfs_rmapbt_buf_ops;
+			break;
 		default:
 			xfs_warn(mp, "Bad btree block magic!");
 			ASSERT(0);



* [PATCH 047/119] xfs: disable XFS_IOC_SWAPEXT when rmap btree is enabled
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (45 preceding siblings ...)
  2016-06-17  1:22 ` [PATCH 046/119] xfs: add rmap btree block detection to log recovery Darrick J. Wong
@ 2016-06-17  1:22 ` Darrick J. Wong
  2016-07-18 13:34   ` Brian Foster
  2016-06-17  1:22 ` [PATCH 048/119] xfs: don't update rmapbt when fixing agfl Darrick J. Wong
                   ` (71 subsequent siblings)
  118 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:22 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs, Dave Chinner

Swapping extents between two inodes requires the owner to be updated
in the rmap tree for all the extents that are swapped. This code
does not yet exist, so switch off the XFS_IOC_SWAPEXT ioctl until
support has been implemented. This will need to be done before the
rmap btree code can have the experimental tag removed.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
[darrick.wong@oracle.com: fix extent swapping when rmap enabled]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_bmap_util.c |    4 ++++
 1 file changed, 4 insertions(+)


diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 450fd49..8666873 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1618,6 +1618,10 @@ xfs_swap_extents(
 	__uint64_t	tmp;
 	int		lock_flags;
 
+	/* XXX: we can't do this with rmap, will fix later */
+	if (xfs_sb_version_hasrmapbt(&mp->m_sb))
+		return -EOPNOTSUPP;
+
 	tempifp = kmem_alloc(sizeof(xfs_ifork_t), KM_MAYFAIL);
 	if (!tempifp) {
 		error = -ENOMEM;



* [PATCH 048/119] xfs: don't update rmapbt when fixing agfl
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (46 preceding siblings ...)
  2016-06-17  1:22 ` [PATCH 047/119] xfs: disable XFS_IOC_SWAPEXT when rmap btree is enabled Darrick J. Wong
@ 2016-06-17  1:22 ` Darrick J. Wong
  2016-07-18 13:34   ` Brian Foster
  2016-06-17  1:23 ` [PATCH 049/119] xfs: enable the rmap btree functionality Darrick J. Wong
                   ` (70 subsequent siblings)
  118 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:22 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Allow a caller of xfs_alloc_fix_freelist to disable rmapbt updates
when fixing the AG freelist.  xfs_repair needs this during phase 5
to be able to adjust the freelist while it's reconstructing the rmap
btree; the missing entries will be added back at the very end of
phase 5 once the AGFL contents settle down.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_alloc.c |   40 ++++++++++++++++++++++++++--------------
 fs/xfs/libxfs/xfs_alloc.h |    3 +++
 2 files changed, 29 insertions(+), 14 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 4c8ffd4..6eabab1 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2092,26 +2092,38 @@ xfs_alloc_fix_freelist(
 	 * anything other than extra overhead when we need to put more blocks
 	 * back on the free list? Maybe we should only do this when space is
 	 * getting low or the AGFL is more than half full?
+	 *
+	 * The NOSHRINK flag prevents the AGFL from being shrunk if it's too
+	 * big; the NORMAP flag prevents AGFL expand/shrink operations from
+	 * updating the rmapbt.  Both flags are used in xfs_repair while we're
+	 * rebuilding the rmapbt, and neither are used by the kernel.  They're
+	 * both required to ensure that rmaps are correctly recorded for the
+	 * regenerated AGFL, bnobt, and cntbt.  See repair/phase5.c and
+	 * repair/rmap.c in xfsprogs for details.
 	 */
-	xfs_rmap_ag_owner(&targs.oinfo, XFS_RMAP_OWN_AG);
-	while (pag->pagf_flcount > need) {
-		struct xfs_buf	*bp;
+	memset(&targs, 0, sizeof(targs));
+	if (!(flags & XFS_ALLOC_FLAG_NOSHRINK)) {
+		if (!(flags & XFS_ALLOC_FLAG_NORMAP))
+			xfs_rmap_ag_owner(&targs.oinfo, XFS_RMAP_OWN_AG);
+		while (pag->pagf_flcount > need) {
+			struct xfs_buf	*bp;
 
-		error = xfs_alloc_get_freelist(tp, agbp, &bno, 0);
-		if (error)
-			goto out_agbp_relse;
-		error = xfs_free_ag_extent(tp, agbp, args->agno, bno, 1,
-					   &targs.oinfo, 1);
-		if (error)
-			goto out_agbp_relse;
-		bp = xfs_btree_get_bufs(mp, tp, args->agno, bno, 0);
-		xfs_trans_binval(tp, bp);
+			error = xfs_alloc_get_freelist(tp, agbp, &bno, 0);
+			if (error)
+				goto out_agbp_relse;
+			error = xfs_free_ag_extent(tp, agbp, args->agno, bno, 1,
+						   &targs.oinfo, 1);
+			if (error)
+				goto out_agbp_relse;
+			bp = xfs_btree_get_bufs(mp, tp, args->agno, bno, 0);
+			xfs_trans_binval(tp, bp);
+		}
 	}
 
-	memset(&targs, 0, sizeof(targs));
 	targs.tp = tp;
 	targs.mp = mp;
-	xfs_rmap_ag_owner(&targs.oinfo, XFS_RMAP_OWN_AG);
+	if (!(flags & XFS_ALLOC_FLAG_NORMAP))
+		xfs_rmap_ag_owner(&targs.oinfo, XFS_RMAP_OWN_AG);
 	targs.agbp = agbp;
 	targs.agno = args->agno;
 	targs.alignment = targs.minlen = targs.prod = targs.isfl = 1;
diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index 7b6c66b..7b9e67e 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -54,6 +54,9 @@ typedef unsigned int xfs_alloctype_t;
  */
 #define	XFS_ALLOC_FLAG_TRYLOCK	0x00000001  /* use trylock for buffer locking */
 #define	XFS_ALLOC_FLAG_FREEING	0x00000002  /* indicate caller is freeing extents*/
+#define	XFS_ALLOC_FLAG_NORMAP	0x00000004  /* don't modify the rmapbt */
+#define	XFS_ALLOC_FLAG_NOSHRINK	0x00000008  /* don't shrink the freelist */
+
 
 /*
  * Argument structure for xfs_alloc routines.



* [PATCH 049/119] xfs: enable the rmap btree functionality
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (47 preceding siblings ...)
  2016-06-17  1:22 ` [PATCH 048/119] xfs: don't update rmapbt when fixing agfl Darrick J. Wong
@ 2016-06-17  1:23 ` Darrick J. Wong
  2016-07-18 13:34   ` Brian Foster
  2016-06-17  1:23 ` [PATCH 050/119] xfs: count the blocks in a btree Darrick J. Wong
                   ` (69 subsequent siblings)
  118 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:23 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs, Dave Chinner

From: Dave Chinner <dchinner@redhat.com>

Add the feature flag to the supported matrix so that the kernel can
mount and use rmap btree enabled filesystems.

v2: Move the EXPERIMENTAL message to fill_super so it only prints once.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
[darrick.wong@oracle.com: move the experimental tag]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_format.h |    3 ++-
 fs/xfs/xfs_super.c         |    4 ++++
 2 files changed, 6 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 6efc7a3..1b08237 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -457,7 +457,8 @@ xfs_sb_has_compat_feature(
 #define XFS_SB_FEAT_RO_COMPAT_FINOBT   (1 << 0)		/* free inode btree */
 #define XFS_SB_FEAT_RO_COMPAT_RMAPBT   (1 << 1)		/* reverse map btree */
 #define XFS_SB_FEAT_RO_COMPAT_ALL \
-		(XFS_SB_FEAT_RO_COMPAT_FINOBT)
+		(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
+		 XFS_SB_FEAT_RO_COMPAT_RMAPBT)
 #define XFS_SB_FEAT_RO_COMPAT_UNKNOWN	~XFS_SB_FEAT_RO_COMPAT_ALL
 static inline bool
 xfs_sb_has_ro_compat_feature(
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index a8300e4..9328821 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1571,6 +1571,10 @@ xfs_fs_fill_super(
 		xfs_alert(mp,
 	"EXPERIMENTAL sparse inode feature enabled. Use at your own risk!");
 
+	if (xfs_sb_version_hasrmapbt(&mp->m_sb))
+		xfs_alert(mp,
+	"EXPERIMENTAL reverse mapping btree feature enabled. Use at your own risk!");
+
 	error = xfs_mountfs(mp);
 	if (error)
 		goto out_filestream_unmount;



* [PATCH 050/119] xfs: count the blocks in a btree
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (48 preceding siblings ...)
  2016-06-17  1:23 ` [PATCH 049/119] xfs: enable the rmap btree functionality Darrick J. Wong
@ 2016-06-17  1:23 ` Darrick J. Wong
  2016-06-17  1:23 ` [PATCH 051/119] xfs: introduce tracepoints for AG reservation code Darrick J. Wong
                   ` (68 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:23 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Provide a helper method to count the number of blocks in a short form
btree.  The refcount and rmap btrees need to know the number of blocks
already in use to set up their per-AG block reservations during mount.

v2: Use btree_visit_blocks instead of open-coding our own traversal
routine.
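The visitor-based counting described above can be shown with a small stand-alone sketch. The types and function names here are hypothetical stand-ins, not the kernel API: a generic per-block callback receives a void * cookie, and the counting helper simply passes a pointer to a counter through it, the same pattern xfs_btree_count_blocks_helper uses with xfs_btree_visit_blocks:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a btree block; not the real XFS structures. */
struct node {
	struct node *left, *right;
};

typedef int (*visit_fn)(struct node *n, void *data);

/* Walk every node and invoke the callback, like xfs_btree_visit_blocks. */
static int visit_blocks(struct node *n, visit_fn fn, void *data)
{
	int error;

	if (n == NULL)
		return 0;
	error = fn(n, data);
	if (error)
		return error;
	error = visit_blocks(n->left, fn, data);
	if (error)
		return error;
	return visit_blocks(n->right, fn, data);
}

/* The counting callback just bumps the counter passed via the cookie. */
static int count_helper(struct node *n, void *data)
{
	unsigned int *blocks = data;

	(*blocks)++;
	return 0;
}

/* Count the nodes in a tree, mirroring xfs_btree_count_blocks. */
static unsigned int count_blocks(struct node *root)
{
	unsigned int blocks = 0;

	visit_blocks(root, count_helper, &blocks);
	return blocks;
}
```

The design choice is the same as in the patch: reusing the existing traversal means the counter never needs to know the tree's shape.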

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_btree.c |   22 ++++++++++++++++++++++
 fs/xfs/libxfs/xfs_btree.h |    2 ++
 2 files changed, 24 insertions(+)


diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 4b90419..50b2c32 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -4803,3 +4803,25 @@ xfs_btree_query_range(
 	return xfs_btree_overlapped_query_range(cur, low_rec, high_rec,
 			fn, priv);
 }
+
+int
+xfs_btree_count_blocks_helper(
+	struct xfs_btree_cur	*cur,
+	int			level,
+	void			*data)
+{
+	xfs_extlen_t		*blocks = data;
+	(*blocks)++;
+
+	return 0;
+}
+
+/* Count the blocks in a btree and return the result in *blocks. */
+int
+xfs_btree_count_blocks(
+	struct xfs_btree_cur	*cur,
+	xfs_extlen_t		*blocks)
+{
+	return xfs_btree_visit_blocks(cur, xfs_btree_count_blocks_helper,
+			blocks);
+}
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index 9963c48..6fa13a9 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -519,4 +519,6 @@ typedef int (*xfs_btree_visit_blocks_fn)(struct xfs_btree_cur *cur, int level,
 int xfs_btree_visit_blocks(struct xfs_btree_cur *cur,
 		xfs_btree_visit_blocks_fn fn, void *data);
 
+int xfs_btree_count_blocks(struct xfs_btree_cur *cur, xfs_extlen_t *blocks);
+
 #endif	/* __XFS_BTREE_H__ */



* [PATCH 051/119] xfs: introduce tracepoints for AG reservation code
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (49 preceding siblings ...)
  2016-06-17  1:23 ` [PATCH 050/119] xfs: count the blocks in a btree Darrick J. Wong
@ 2016-06-17  1:23 ` Darrick J. Wong
  2016-06-17  1:23 ` [PATCH 052/119] xfs: set up per-AG free space reservations Darrick J. Wong
                   ` (67 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:23 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_trace.h |   69 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 69 insertions(+)


diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 6466adc..c50479a 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -2558,6 +2558,75 @@ DEFINE_RMAPBT_EVENT(xfs_rmap_map_gtrec);
 DEFINE_RMAPBT_EVENT(xfs_rmap_convert_gtrec);
 DEFINE_RMAPBT_EVENT(xfs_rmap_find_left_neighbor_result);
 
+/* dummy definitions to avoid breaking bisectability; will be removed later */
+#ifndef XFS_AG_RESV_DUMMY
+#define XFS_AG_RESV_DUMMY
+enum xfs_ag_resv_type {
+	XFS_AG_RESV_NONE = 0,
+	XFS_AG_RESV_METADATA,
+	XFS_AG_RESV_AGFL,
+};
+struct xfs_ag_resv {
+	xfs_extlen_t	ar_reserved;
+	xfs_extlen_t	ar_asked;
+};
+#define xfs_perag_resv(...)	NULL
+#endif
+
+/* per-AG reservation */
+DECLARE_EVENT_CLASS(xfs_ag_resv_class,
+	TP_PROTO(struct xfs_perag *pag, enum xfs_ag_resv_type resv,
+		 xfs_extlen_t len),
+	TP_ARGS(pag, resv, len),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(int, resv)
+		__field(xfs_extlen_t, freeblks)
+		__field(xfs_extlen_t, flcount)
+		__field(xfs_extlen_t, reserved)
+		__field(xfs_extlen_t, asked)
+		__field(xfs_extlen_t, len)
+	),
+	TP_fast_assign(
+		struct xfs_ag_resv	*r = xfs_perag_resv(pag, resv);
+
+		__entry->dev = pag->pag_mount->m_super->s_dev;
+		__entry->agno = pag->pag_agno;
+		__entry->resv = resv;
+		__entry->freeblks = pag->pagf_freeblks;
+		__entry->flcount = pag->pagf_flcount;
+		__entry->reserved = r ? r->ar_reserved : 0;
+		__entry->asked = r ? r->ar_asked : 0;
+		__entry->len = len;
+	),
+	TP_printk("dev %d:%d agno %u resv %d freeblks %u flcount %u resv %u ask %u len %u\n",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->agno,
+		  __entry->resv,
+		  __entry->freeblks,
+		  __entry->flcount,
+		  __entry->reserved,
+		  __entry->asked,
+		  __entry->len)
+)
+#define DEFINE_AG_RESV_EVENT(name) \
+DEFINE_EVENT(xfs_ag_resv_class, name, \
+	TP_PROTO(struct xfs_perag *pag, enum xfs_ag_resv_type type, \
+		 xfs_extlen_t len), \
+	TP_ARGS(pag, type, len))
+
+/* per-AG reservation tracepoints */
+DEFINE_AG_RESV_EVENT(xfs_ag_resv_init);
+DEFINE_AG_RESV_EVENT(xfs_ag_resv_free);
+DEFINE_AG_RESV_EVENT(xfs_ag_resv_alloc_extent);
+DEFINE_AG_RESV_EVENT(xfs_ag_resv_free_extent);
+DEFINE_AG_RESV_EVENT(xfs_ag_resv_critical);
+DEFINE_AG_RESV_EVENT(xfs_ag_resv_needed);
+
+DEFINE_AG_ERROR_EVENT(xfs_ag_resv_free_error);
+DEFINE_AG_ERROR_EVENT(xfs_ag_resv_init_error);
+
 #endif /* _TRACE_XFS_H */
 
 #undef TRACE_INCLUDE_PATH



* [PATCH 052/119] xfs: set up per-AG free space reservations
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (50 preceding siblings ...)
  2016-06-17  1:23 ` [PATCH 051/119] xfs: introduce tracepoints for AG reservation code Darrick J. Wong
@ 2016-06-17  1:23 ` Darrick J. Wong
  2016-06-17  1:23 ` [PATCH 053/119] xfs: define tracepoints for refcount btree activities Darrick J. Wong
                   ` (66 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:23 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

One unfortunate quirk of the reference count btree -- it can expand in
size when blocks are written to *other* allocation groups if, say, one
large extent becomes a lot of tiny extents.  Since we don't want to
start throwing errors in the middle of CoWing, we need to reserve some
blocks to handle future expansion.

Use the count of how many reserved blocks we need to have on hand to
create a virtual reservation in the AG.  Through selective clamping of
the maximum length of allocation requests and of the length of the
longest free extent, we can make it look like there's less free space
in the AG unless the reservation owner is asking for blocks.

In other words, play some accounting tricks in-core to make sure that
we always have blocks available.  On the plus side, there's nothing to
clean up if we crash, which is contrast to the strategy that the rough
draft used (actually removing extents from the freespace btrees).

v2: There's really only two kinds of per-AG reservation pools -- one
to feed the AGFL (rmapbt), and one to feed everything else
(refcountbt).  Bearing that in mind, we can embed the reservation
controls in xfs_perag and greatly simplify the block accounting.
Furthermore, fix some longstanding accounting bugs that were a direct
result of the goofy "allocate a block and later fix up the accounting"
strategy by integrating the reservation accounting code more tightly
with the allocator.  This eliminates the ENOSPC complaints resulting
from refcount btree splits during truncate operations.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile                  |    1 
 fs/xfs/libxfs/xfs_ag_resv.c      |  318 ++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_ag_resv.h      |   35 ++++
 fs/xfs/libxfs/xfs_alloc.c        |   93 ++++++++---
 fs/xfs/libxfs/xfs_alloc.h        |    8 +
 fs/xfs/libxfs/xfs_bmap.c         |    6 -
 fs/xfs/libxfs/xfs_ialloc_btree.c |    2 
 fs/xfs/xfs_filestream.c          |    4 
 fs/xfs/xfs_fsops.c               |    2 
 fs/xfs/xfs_mount.h               |   34 ++++
 fs/xfs/xfs_trace.h               |   36 +---
 fs/xfs/xfs_trans_extfree.c       |    3 
 12 files changed, 485 insertions(+), 57 deletions(-)
 create mode 100644 fs/xfs/libxfs/xfs_ag_resv.c
 create mode 100644 fs/xfs/libxfs/xfs_ag_resv.h


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 1980110..c7a864e 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -52,6 +52,7 @@ xfs-y				+= $(addprefix libxfs/, \
 				   xfs_inode_fork.o \
 				   xfs_inode_buf.o \
 				   xfs_log_rlimit.o \
+				   xfs_ag_resv.o \
 				   xfs_rmap.o \
 				   xfs_rmap_btree.o \
 				   xfs_sb.o \
diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c
new file mode 100644
index 0000000..4d390b7
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_ag_resv.c
@@ -0,0 +1,318 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_sb.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_alloc.h"
+#include "xfs_error.h"
+#include "xfs_trace.h"
+#include "xfs_cksum.h"
+#include "xfs_trans.h"
+#include "xfs_bit.h"
+#include "xfs_bmap.h"
+#include "xfs_bmap_btree.h"
+#include "xfs_ag_resv.h"
+#include "xfs_trans_space.h"
+#include "xfs_rmap_btree.h"
+#include "xfs_btree.h"
+
+/*
+ * Per-AG Block Reservations
+ *
+ * For some kinds of allocation group metadata structures, it is advantageous
+ * to reserve a small number of blocks in each AG so that future expansions of
+ * that data structure do not encounter ENOSPC because errors during a btree
+ * split cause the filesystem to go offline.
+ *
+ * Prior to the introduction of reflink, this wasn't an issue because the free
+ * space btrees maintain a reserve of space (the AGFL) to handle any expansion
+ * that may be necessary; and allocations of other metadata (inodes, BMBT,
+ * dir/attr) aren't restricted to a single AG.  However, with reflink it is
+ * possible to allocate all the space in an AG, have subsequent reflink/CoW
+ * activity expand the refcount btree, and discover that there's no space left
+ * to handle that expansion.  Since we can calculate the maximum size of the
+ * refcount btree, we can reserve space for it and avoid ENOSPC.
+ *
+ * Handling per-AG reservations consists of four changes to the allocator's
+ * behavior:  First, because these reservations are always needed, we decrease
+ * the ag_max_usable counter to reflect the size of the AG after the reserved
+ * blocks are taken.  Second, the reservations must be reflected in the
+ * fdblocks count to maintain proper accounting.  Third, each AG must maintain
+ * its own reserved block counter so that we can calculate the amount of space
+ * that must remain free to maintain the reservations.  Fourth, the "remaining
+ * reserved blocks" count must be used when calculating the length of the
+ * longest free extent in an AG and to clamp maxlen in the per-AG allocation
+ * functions.  In other words, we maintain a virtual allocation via in-core
+ * accounting tricks so that we don't have to clean up after a crash. :)
+ *
+ * Reserved blocks can be managed by passing one of the enum xfs_ag_resv_type
+ * values via struct xfs_alloc_arg or directly to the xfs_free_extent
+ * function.  It might seem a little funny to maintain a reservoir of blocks
+ * to feed another reservoir, but the AGFL only holds enough blocks to get
+ * through the next transaction.  The per-AG reservation is to ensure (we
+ * hope) that each AG never runs out of blocks.  Each data structure wanting
+ * to use the reservation system should update ask/used in xfs_ag_resv_init.
+ */
+
+/*
+ * Are we critically low on blocks?  For now we'll define that as the number
+ * of blocks we can get our hands on being less than 10% of what we reserved
+ * or less than some arbitrary number (eight).
+ */
+bool
+xfs_ag_resv_critical(
+	struct xfs_perag		*pag,
+	enum xfs_ag_resv_type		type)
+{
+	xfs_extlen_t			avail;
+	xfs_extlen_t			orig;
+
+	switch (type) {
+	case XFS_AG_RESV_METADATA:
+		avail = pag->pagf_freeblks - pag->pag_agfl_resv.ar_reserved;
+		orig = pag->pag_meta_resv.ar_asked;
+		break;
+	case XFS_AG_RESV_AGFL:
+		avail = pag->pagf_freeblks + pag->pagf_flcount -
+			pag->pag_meta_resv.ar_reserved;
+		orig = pag->pag_agfl_resv.ar_asked;
+		break;
+	default:
+		ASSERT(0);
+		return false;
+	}
+
+	trace_xfs_ag_resv_critical(pag, type, avail);
+
+	return avail < orig / 10 || avail < XFS_BTREE_MAXLEVELS;
+}
+
+/*
+ * How many blocks are reserved but not used, and therefore must not be
+ * allocated away?
+ */
+xfs_extlen_t
+xfs_ag_resv_needed(
+	struct xfs_perag		*pag,
+	enum xfs_ag_resv_type		type)
+{
+	xfs_extlen_t			len;
+
+	len = pag->pag_meta_resv.ar_reserved + pag->pag_agfl_resv.ar_reserved;
+	switch (type) {
+	case XFS_AG_RESV_METADATA:
+	case XFS_AG_RESV_AGFL:
+		len -= xfs_perag_resv(pag, type)->ar_reserved;
+		break;
+	case XFS_AG_RESV_NONE:
+		/* empty */
+		break;
+	default:
+		ASSERT(0);
+	}
+
+	trace_xfs_ag_resv_needed(pag, type, len);
+
+	return len;
+}
+
+/* Clean out a reservation */
+static int
+__xfs_ag_resv_free(
+	struct xfs_perag		*pag,
+	enum xfs_ag_resv_type		type)
+{
+	struct xfs_ag_resv		*resv;
+	struct xfs_ag_resv		t;
+	int				error;
+
+	trace_xfs_ag_resv_free(pag, type, 0);
+
+	resv = xfs_perag_resv(pag, type);
+	t = *resv;
+	resv->ar_reserved = 0;
+	resv->ar_asked = 0;
+	pag->pag_mount->m_ag_max_usable += t.ar_asked;
+
+	error = xfs_mod_fdblocks(pag->pag_mount, t.ar_reserved, true);
+	if (error)
+		trace_xfs_ag_resv_free_error(pag->pag_mount, pag->pag_agno,
+				error, _RET_IP_);
+	return error;
+}
+
+/* Free a per-AG reservation. */
+int
+xfs_ag_resv_free(
+	struct xfs_perag		*pag)
+{
+	int				error = 0;
+	int				err2;
+
+	err2 = __xfs_ag_resv_free(pag, XFS_AG_RESV_AGFL);
+	if (err2 && !error)
+		error = err2;
+	err2 = __xfs_ag_resv_free(pag, XFS_AG_RESV_METADATA);
+	if (err2 && !error)
+		error = err2;
+	return error;
+}
+
+static int
+__xfs_ag_resv_init(
+	struct xfs_perag		*pag,
+	enum xfs_ag_resv_type		type,
+	xfs_extlen_t			ask,
+	xfs_extlen_t			used)
+{
+	struct xfs_mount		*mp = pag->pag_mount;
+	struct xfs_ag_resv		*resv;
+	int				error;
+
+	resv = xfs_perag_resv(pag, type);
+	if (used > ask)
+		ask = used;
+	resv->ar_asked = ask;
+	resv->ar_reserved = ask - used;
+	mp->m_ag_max_usable -= ask;
+
+	trace_xfs_ag_resv_init(pag, type, ask);
+
+	error = xfs_mod_fdblocks(mp, -(int64_t)resv->ar_reserved, true);
+	if (error)
+		trace_xfs_ag_resv_init_error(pag->pag_mount, pag->pag_agno,
+				error, _RET_IP_);
+
+	return error;
+}
+
+/* Create a per-AG block reservation. */
+int
+xfs_ag_resv_init(
+	struct xfs_perag		*pag)
+{
+	xfs_extlen_t			ask;
+	xfs_extlen_t			used;
+	int				error = 0;
+	int				err2;
+
+	if (pag->pag_meta_resv.ar_asked)
+		goto init_agfl;
+
+	/* Create the metadata reservation. */
+	ask = used = 0;
+
+	err2 = __xfs_ag_resv_init(pag, XFS_AG_RESV_METADATA, ask, used);
+	if (err2 && !error)
+		error = err2;
+
+init_agfl:
+	if (pag->pag_agfl_resv.ar_asked)
+		return error;
+
+	/* Create the AGFL metadata reservation */
+	ask = used = 0;
+
+	err2 = __xfs_ag_resv_init(pag, XFS_AG_RESV_AGFL, ask, used);
+	if (err2 && !error)
+		error = err2;
+
+	return error;
+}
+
+/* Allocate a block from the reservation. */
+void
+xfs_ag_resv_alloc_extent(
+	struct xfs_perag		*pag,
+	enum xfs_ag_resv_type		type,
+	struct xfs_alloc_arg		*args)
+{
+	struct xfs_ag_resv		*resv;
+	xfs_extlen_t			leftover;
+	uint				field;
+
+	trace_xfs_ag_resv_alloc_extent(pag, type, args->len);
+
+	switch (type) {
+	case XFS_AG_RESV_METADATA:
+	case XFS_AG_RESV_AGFL:
+		resv = xfs_perag_resv(pag, type);
+		break;
+	default:
+		ASSERT(0);
+		/* fall through */
+	case XFS_AG_RESV_NONE:
+		field = args->wasdel ? XFS_TRANS_SB_RES_FDBLOCKS :
+				       XFS_TRANS_SB_FDBLOCKS;
+		xfs_trans_mod_sb(args->tp, field, -(int64_t)args->len);
+		return;
+	}
+
+	if (args->len > resv->ar_reserved) {
+		leftover = args->len - resv->ar_reserved;
+		if (type != XFS_AG_RESV_AGFL)
+			xfs_trans_mod_sb(args->tp, XFS_TRANS_SB_FDBLOCKS,
+					-(int64_t)leftover);
+		resv->ar_reserved = 0;
+	} else
+		resv->ar_reserved -= args->len;
+}
+
+/* Free a block to the reservation. */
+void
+xfs_ag_resv_free_extent(
+	struct xfs_perag		*pag,
+	enum xfs_ag_resv_type		type,
+	struct xfs_trans		*tp,
+	xfs_extlen_t			len)
+{
+	xfs_extlen_t			leftover;
+	struct xfs_ag_resv		*resv;
+
+	trace_xfs_ag_resv_free_extent(pag, type, len);
+
+	switch (type) {
+	case XFS_AG_RESV_METADATA:
+	case XFS_AG_RESV_AGFL:
+		resv = xfs_perag_resv(pag, type);
+		break;
+	default:
+		ASSERT(0);
+		/* fall through */
+	case XFS_AG_RESV_NONE:
+		xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, (int64_t)len);
+		return;
+	}
+
+	if (resv->ar_reserved + len > resv->ar_asked) {
+		leftover = resv->ar_reserved + len - resv->ar_asked;
+		if (type != XFS_AG_RESV_AGFL)
+			xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS,
+					(int64_t)leftover);
+		resv->ar_reserved = resv->ar_asked;
+	} else
+		resv->ar_reserved += len;
+}
diff --git a/fs/xfs/libxfs/xfs_ag_resv.h b/fs/xfs/libxfs/xfs_ag_resv.h
new file mode 100644
index 0000000..8d6c687
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_ag_resv.h
@@ -0,0 +1,35 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#ifndef __XFS_AG_RESV_H__
+#define	__XFS_AG_RESV_H__
+
+int xfs_ag_resv_free(struct xfs_perag *pag);
+int xfs_ag_resv_init(struct xfs_perag *pag);
+
+bool xfs_ag_resv_critical(struct xfs_perag *pag, enum xfs_ag_resv_type type);
+xfs_extlen_t xfs_ag_resv_needed(struct xfs_perag *pag,
+		enum xfs_ag_resv_type type);
+
+void xfs_ag_resv_alloc_extent(struct xfs_perag *pag, enum xfs_ag_resv_type type,
+		struct xfs_alloc_arg *args);
+void xfs_ag_resv_free_extent(struct xfs_perag *pag, enum xfs_ag_resv_type type,
+		struct xfs_trans *tp, xfs_extlen_t len);
+
+#endif	/* __XFS_AG_RESV_H__ */
diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 6eabab1..5f05c4e 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -37,6 +37,7 @@
 #include "xfs_trans.h"
 #include "xfs_buf_item.h"
 #include "xfs_log.h"
+#include "xfs_ag_resv.h"
 
 struct workqueue_struct *xfs_alloc_wq;
 
@@ -682,12 +683,29 @@ xfs_alloc_ag_vextent(
 	xfs_alloc_arg_t	*args)	/* argument structure for allocation */
 {
 	int		error=0;
+	xfs_extlen_t	reservation;
+	xfs_extlen_t	oldmax;
 
 	ASSERT(args->minlen > 0);
 	ASSERT(args->maxlen > 0);
 	ASSERT(args->minlen <= args->maxlen);
 	ASSERT(args->mod < args->prod);
 	ASSERT(args->alignment > 0);
+
+	/*
+	 * Clamp maxlen to the amount of free space minus any reservations
+	 * that have been made.
+	 */
+	oldmax = args->maxlen;
+	reservation = xfs_ag_resv_needed(args->pag, args->resv);
+	if (args->maxlen > args->pag->pagf_freeblks - reservation)
+		args->maxlen = args->pag->pagf_freeblks - reservation;
+	if (args->maxlen == 0) {
+		args->agbno = NULLAGBLOCK;
+		args->maxlen = oldmax;
+		return 0;
+	}
+
 	/*
 	 * Branch to correct routine based on the type.
 	 */
@@ -707,12 +725,14 @@ xfs_alloc_ag_vextent(
 		/* NOTREACHED */
 	}
 
+	args->maxlen = oldmax;
+
 	if (error || args->agbno == NULLAGBLOCK)
 		return error;
 
 	ASSERT(args->len >= args->minlen);
 	ASSERT(args->len <= args->maxlen);
-	ASSERT(!args->wasfromfl || !args->isfl);
+	ASSERT(!args->wasfromfl || args->resv != XFS_AG_RESV_AGFL);
 	ASSERT(args->agbno % args->alignment == 0);
 
 	/* if not file data, insert new block into the reverse map btree */
@@ -734,12 +754,7 @@ xfs_alloc_ag_vextent(
 					      args->agbno, args->len));
 	}
 
-	if (!args->isfl) {
-		xfs_trans_mod_sb(args->tp, args->wasdel ?
-				 XFS_TRANS_SB_RES_FDBLOCKS :
-				 XFS_TRANS_SB_FDBLOCKS,
-				 -((long)(args->len)));
-	}
+	xfs_ag_resv_alloc_extent(args->pag, args->resv, args);
 
 	XFS_STATS_INC(args->mp, xs_allocx);
 	XFS_STATS_ADD(args->mp, xs_allocb, args->len);
@@ -1601,7 +1616,8 @@ xfs_alloc_ag_vextent_small(
 	 * to respect minleft even when pulling from the
 	 * freelist.
 	 */
-	else if (args->minlen == 1 && args->alignment == 1 && !args->isfl &&
+	else if (args->minlen == 1 && args->alignment == 1 &&
+		 args->resv != XFS_AG_RESV_AGFL &&
 		 (be32_to_cpu(XFS_BUF_TO_AGF(args->agbp)->agf_flcount)
 		  > args->minleft)) {
 		error = xfs_alloc_get_freelist(args->tp, args->agbp, &fbno, 0);
@@ -1672,7 +1688,7 @@ xfs_free_ag_extent(
 	xfs_agblock_t	bno,	/* starting block number */
 	xfs_extlen_t	len,	/* length of extent */
 	struct xfs_owner_info	*oinfo,	/* extent owner */
-	int		isfl)	/* set if is freelist blocks - no sb acctg */
+	enum xfs_ag_resv_type	type) /* extent reservation type */
 {
 	xfs_btree_cur_t	*bno_cur;	/* cursor for by-block btree */
 	xfs_btree_cur_t	*cnt_cur;	/* cursor for by-size btree */
@@ -1900,21 +1916,22 @@ xfs_free_ag_extent(
 	 */
 	pag = xfs_perag_get(mp, agno);
 	error = xfs_alloc_update_counters(tp, pag, agbp, len);
+	xfs_ag_resv_free_extent(pag, type, tp, len);
 	xfs_perag_put(pag);
 	if (error)
 		goto error0;
 
-	if (!isfl)
-		xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, (long)len);
 	XFS_STATS_INC(mp, xs_freex);
 	XFS_STATS_ADD(mp, xs_freeb, len);
 
-	trace_xfs_free_extent(mp, agno, bno, len, isfl, haveleft, haveright);
+	trace_xfs_free_extent(mp, agno, bno, len, type == XFS_AG_RESV_AGFL,
+			haveleft, haveright);
 
 	return 0;
 
  error0:
-	trace_xfs_free_extent(mp, agno, bno, len, isfl, -1, -1);
+	trace_xfs_free_extent(mp, agno, bno, len, type == XFS_AG_RESV_AGFL,
+			-1, -1);
 	if (bno_cur)
 		xfs_btree_del_cursor(bno_cur, XFS_BTREE_ERROR);
 	if (cnt_cur)
@@ -1939,21 +1956,43 @@ xfs_alloc_compute_maxlevels(
 }
 
 /*
- * Find the length of the longest extent in an AG.
+ * Find the length of the longest extent in an AG.  The 'need' parameter
+ * specifies how much space we're going to need for the AGFL and the
+ * 'reserved' parameter tells us how many blocks in this AG are reserved for
+ * other callers.
  */
 xfs_extlen_t
 xfs_alloc_longest_free_extent(
 	struct xfs_mount	*mp,
 	struct xfs_perag	*pag,
-	xfs_extlen_t		need)
+	xfs_extlen_t		need,
+	xfs_extlen_t		reserved)
 {
 	xfs_extlen_t		delta = 0;
 
+	/*
+	 * If the AGFL needs a recharge, we'll have to subtract that from the
+	 * longest extent.
+	 */
 	if (need > pag->pagf_flcount)
 		delta = need - pag->pagf_flcount;
 
+	/*
+	 * If we cannot maintain others' reservations with space from the
+	 * not-longest freesp extents, we'll have to subtract /that/ from
+	 * the longest extent too.
+	 */
+	if (pag->pagf_freeblks - pag->pagf_longest < reserved)
+		delta += reserved - (pag->pagf_freeblks - pag->pagf_longest);
+
+	/*
+	 * If the longest extent is long enough to satisfy all the
+	 * reservations and AGFL rules in place, we can return this extent.
+	 */
 	if (pag->pagf_longest > delta)
 		return pag->pagf_longest - delta;
+
+	/* Otherwise, let the caller try for 1 block if there's space. */
 	return pag->pagf_flcount > 0 || pag->pagf_longest > 0;
 }
 
@@ -1993,20 +2032,24 @@ xfs_alloc_space_available(
 {
 	struct xfs_perag	*pag = args->pag;
 	xfs_extlen_t		longest;
+	xfs_extlen_t		reservation; /* blocks that are still reserved */
 	int			available;
 
 	if (flags & XFS_ALLOC_FLAG_FREEING)
 		return true;
 
+	reservation = xfs_ag_resv_needed(pag, args->resv);
+
 	/* do we have enough contiguous free space for the allocation? */
-	longest = xfs_alloc_longest_free_extent(args->mp, pag, min_free);
+	longest = xfs_alloc_longest_free_extent(args->mp, pag, min_free,
+			reservation);
 	if ((args->minlen + args->alignment + args->minalignslop - 1) > longest)
 		return false;
 
-	/* do have enough free space remaining for the allocation? */
+	/* do we have enough free space remaining for the allocation? */
 	available = (int)(pag->pagf_freeblks + pag->pagf_flcount -
-			  min_free - args->total);
-	if (available < (int)args->minleft)
+			  reservation - min_free - args->total);
+	if (available < (int)args->minleft || available <= 0)
 		return false;
 
 	return true;
@@ -2112,7 +2155,8 @@ xfs_alloc_fix_freelist(
 			if (error)
 				goto out_agbp_relse;
 			error = xfs_free_ag_extent(tp, agbp, args->agno, bno, 1,
-						   &targs.oinfo, 1);
+						   &targs.oinfo,
+						   XFS_AG_RESV_AGFL);
 			if (error)
 				goto out_agbp_relse;
 			bp = xfs_btree_get_bufs(mp, tp, args->agno, bno, 0);
@@ -2126,7 +2170,7 @@ xfs_alloc_fix_freelist(
 		xfs_rmap_ag_owner(&targs.oinfo, XFS_RMAP_OWN_AG);
 	targs.agbp = agbp;
 	targs.agno = args->agno;
-	targs.alignment = targs.minlen = targs.prod = targs.isfl = 1;
+	targs.alignment = targs.minlen = targs.prod = 1;
 	targs.type = XFS_ALLOCTYPE_THIS_AG;
 	targs.pag = pag;
 	error = xfs_alloc_read_agfl(mp, tp, targs.agno, &agflbp);
@@ -2137,6 +2181,7 @@ xfs_alloc_fix_freelist(
 	while (pag->pagf_flcount < need) {
 		targs.agbno = 0;
 		targs.maxlen = need - pag->pagf_flcount;
+		targs.resv = XFS_AG_RESV_AGFL;
 
 		/* Allocate as many blocks as possible at once. */
 		error = xfs_alloc_ag_vextent(&targs);
@@ -2815,7 +2860,8 @@ xfs_free_extent(
 	struct xfs_trans	*tp,	/* transaction pointer */
 	xfs_fsblock_t		bno,	/* starting block number of extent */
 	xfs_extlen_t		len,	/* length of extent */
-	struct xfs_owner_info	*oinfo)	/* extent owner */
+	struct xfs_owner_info	*oinfo,	/* extent owner */
+	enum xfs_ag_resv_type	type)	/* block reservation type */
 {
 	struct xfs_mount	*mp = tp->t_mountp;
 	struct xfs_buf		*agbp;
@@ -2824,6 +2870,7 @@ xfs_free_extent(
 	int			error;
 
 	ASSERT(len != 0);
+	ASSERT(type != XFS_AG_RESV_AGFL);
 
 	trace_xfs_bmap_free_deferred(mp, agno, 0, agbno, len);
 
@@ -2843,7 +2890,7 @@ xfs_free_extent(
 			agbno + len <= be32_to_cpu(XFS_BUF_TO_AGF(agbp)->agf_length),
 			err);
 
-	error = xfs_free_ag_extent(tp, agbp, agno, agbno, len, oinfo, 0);
+	error = xfs_free_ag_extent(tp, agbp, agno, agbno, len, oinfo, type);
 	if (error)
 		goto err;
 
diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index 7b9e67e..9f6373a4 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -87,10 +87,10 @@ typedef struct xfs_alloc_arg {
 	xfs_alloctype_t	otype;		/* original allocation type */
 	char		wasdel;		/* set if allocation was prev delayed */
 	char		wasfromfl;	/* set if allocation is from freelist */
-	char		isfl;		/* set if is freelist blocks - !acctg */
 	char		userdata;	/* mask defining userdata treatment */
 	xfs_fsblock_t	firstblock;	/* io first block allocated */
 	struct xfs_owner_info	oinfo;	/* owner of blocks being allocated */
+	enum xfs_ag_resv_type	resv;	/* block reservation to use */
 } xfs_alloc_arg_t;
 
 /*
@@ -106,7 +106,8 @@ unsigned int xfs_alloc_set_aside(struct xfs_mount *mp);
 unsigned int xfs_alloc_ag_max_usable(struct xfs_mount *mp);
 
 xfs_extlen_t xfs_alloc_longest_free_extent(struct xfs_mount *mp,
-		struct xfs_perag *pag, xfs_extlen_t need);
+		struct xfs_perag *pag, xfs_extlen_t need,
+		xfs_extlen_t reserved);
 unsigned int xfs_alloc_min_freelist(struct xfs_mount *mp,
 		struct xfs_perag *pag);
 
@@ -184,7 +185,8 @@ xfs_free_extent(
 	struct xfs_trans *tp,	/* transaction pointer */
 	xfs_fsblock_t	bno,	/* starting block number of extent */
 	xfs_extlen_t	len,	/* length of extent */
-	struct xfs_owner_info	*oinfo);	/* extent owner */
+	struct xfs_owner_info	*oinfo,	/* extent owner */
+	enum xfs_ag_resv_type	type);	/* block reservation type */
 
 int				/* error */
 xfs_alloc_lookup_ge(
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 507fd74..972dfc2 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -47,6 +47,7 @@
 #include "xfs_attr_leaf.h"
 #include "xfs_filestream.h"
 #include "xfs_rmap_btree.h"
+#include "xfs_ag_resv.h"
 
 
 kmem_zone_t		*xfs_bmap_free_item_zone;
@@ -3501,7 +3502,8 @@ xfs_bmap_longest_free_extent(
 	}
 
 	longest = xfs_alloc_longest_free_extent(mp, pag,
-					xfs_alloc_min_freelist(mp, pag));
+				xfs_alloc_min_freelist(mp, pag),
+				xfs_ag_resv_needed(pag, XFS_AG_RESV_NONE));
 	if (*blen < longest)
 		*blen = longest;
 
@@ -3780,7 +3782,7 @@ xfs_bmap_btalloc(
 	}
 	args.minleft = ap->minleft;
 	args.wasdel = ap->wasdel;
-	args.isfl = 0;
+	args.resv = XFS_AG_RESV_NONE;
 	args.userdata = ap->userdata;
 	if (ap->userdata & XFS_ALLOC_USERDATA_ZERO)
 		args.ip = ap->ip;
diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
index f9ea86b..fd26550 100644
--- a/fs/xfs/libxfs/xfs_ialloc_btree.c
+++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
@@ -131,7 +131,7 @@ xfs_inobt_free_block(
 	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_INOBT);
 	return xfs_free_extent(cur->bc_tp,
 			XFS_DADDR_TO_FSB(cur->bc_mp, XFS_BUF_ADDR(bp)), 1,
-			&oinfo);
+			&oinfo, XFS_AG_RESV_NONE);
 }
 
 STATIC int
diff --git a/fs/xfs/xfs_filestream.c b/fs/xfs/xfs_filestream.c
index 4a33a33..c8005fd 100644
--- a/fs/xfs/xfs_filestream.c
+++ b/fs/xfs/xfs_filestream.c
@@ -30,6 +30,7 @@
 #include "xfs_mru_cache.h"
 #include "xfs_filestream.h"
 #include "xfs_trace.h"
+#include "xfs_ag_resv.h"
 
 struct xfs_fstrm_item {
 	struct xfs_mru_cache_elem	mru;
@@ -198,7 +199,8 @@ xfs_filestream_pick_ag(
 		}
 
 		longest = xfs_alloc_longest_free_extent(mp, pag,
-					xfs_alloc_min_freelist(mp, pag));
+				xfs_alloc_min_freelist(mp, pag),
+				xfs_ag_resv_needed(pag, XFS_AG_RESV_NONE));
 		if (((minlen && longest >= minlen) ||
 		     (!minlen && pag->pagf_freeblks >= minfree)) &&
 		    (!pag->pagf_metadata || !(flags & XFS_PICK_USERDATA) ||
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 5980d5c..cd4de75 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -551,7 +551,7 @@ xfs_growfs_data_private(
 		error = xfs_free_extent(tp,
 				XFS_AGB_TO_FSB(mp, agno,
 					be32_to_cpu(agf->agf_length) - new),
-				new, &oinfo);
+				new, &oinfo, XFS_AG_RESV_NONE);
 		if (error)
 			goto error0;
 	}
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index b36676c..e18d74e 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -325,6 +325,20 @@ xfs_mp_fail_writes(struct xfs_mount *mp)
 }
 #endif
 
+/* per-AG block reservation data structures */
+enum xfs_ag_resv_type {
+	XFS_AG_RESV_NONE = 0,
+	XFS_AG_RESV_METADATA,
+	XFS_AG_RESV_AGFL,
+};
+
+struct xfs_ag_resv {
+	/* number of blocks reserved here */
+	xfs_extlen_t			ar_reserved;
+	/* number of blocks originally asked for */
+	xfs_extlen_t			ar_asked;
+};
+
 /*
  * Per-ag incore structure, copies of information in agf and agi, to improve the
  * performance of allocation group selection.
@@ -372,8 +386,28 @@ typedef struct xfs_perag {
 	/* for rcu-safe freeing */
 	struct rcu_head	rcu_head;
 	int		pagb_count;	/* pagb slots in use */
+
+	/* Blocks reserved for all kinds of metadata. */
+	struct xfs_ag_resv	pag_meta_resv;
+	/* Blocks reserved for just AGFL-based metadata. */
+	struct xfs_ag_resv	pag_agfl_resv;
 } xfs_perag_t;
 
+static inline struct xfs_ag_resv *
+xfs_perag_resv(
+	struct xfs_perag	*pag,
+	enum xfs_ag_resv_type	type)
+{
+	switch (type) {
+	case XFS_AG_RESV_METADATA:
+		return &pag->pag_meta_resv;
+	case XFS_AG_RESV_AGFL:
+		return &pag->pag_agfl_resv;
+	default:
+		return NULL;
+	}
+}
+
 extern void	xfs_uuid_table_free(void);
 extern int	xfs_log_sbcount(xfs_mount_t *);
 extern __uint64_t xfs_default_resblks(xfs_mount_t *mp);
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index c50479a..b421b28 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -1569,14 +1569,15 @@ TRACE_EVENT(xfs_agf,
 
 TRACE_EVENT(xfs_free_extent,
 	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, xfs_agblock_t agbno,
-		 xfs_extlen_t len, bool isfl, int haveleft, int haveright),
-	TP_ARGS(mp, agno, agbno, len, isfl, haveleft, haveright),
+		 xfs_extlen_t len, enum xfs_ag_resv_type resv, int haveleft,
+		 int haveright),
+	TP_ARGS(mp, agno, agbno, len, resv, haveleft, haveright),
 	TP_STRUCT__entry(
 		__field(dev_t, dev)
 		__field(xfs_agnumber_t, agno)
 		__field(xfs_agblock_t, agbno)
 		__field(xfs_extlen_t, len)
-		__field(int, isfl)
+		__field(int, resv)
 		__field(int, haveleft)
 		__field(int, haveright)
 	),
@@ -1585,16 +1586,16 @@ TRACE_EVENT(xfs_free_extent,
 		__entry->agno = agno;
 		__entry->agbno = agbno;
 		__entry->len = len;
-		__entry->isfl = isfl;
+		__entry->resv = resv;
 		__entry->haveleft = haveleft;
 		__entry->haveright = haveright;
 	),
-	TP_printk("dev %d:%d agno %u agbno %u len %u isfl %d %s",
+	TP_printk("dev %d:%d agno %u agbno %u len %u resv %d %s",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->agno,
 		  __entry->agbno,
 		  __entry->len,
-		  __entry->isfl,
+		  __entry->resv,
 		  __entry->haveleft ?
 			(__entry->haveright ? "both" : "left") :
 			(__entry->haveright ? "right" : "none"))
@@ -1621,7 +1622,7 @@ DECLARE_EVENT_CLASS(xfs_alloc_class,
 		__field(short, otype)
 		__field(char, wasdel)
 		__field(char, wasfromfl)
-		__field(char, isfl)
+		__field(int, resv)
 		__field(char, userdata)
 		__field(xfs_fsblock_t, firstblock)
 	),
@@ -1642,13 +1643,13 @@ DECLARE_EVENT_CLASS(xfs_alloc_class,
 		__entry->otype = args->otype;
 		__entry->wasdel = args->wasdel;
 		__entry->wasfromfl = args->wasfromfl;
-		__entry->isfl = args->isfl;
+		__entry->resv = args->resv;
 		__entry->userdata = args->userdata;
 		__entry->firstblock = args->firstblock;
 	),
 	TP_printk("dev %d:%d agno %u agbno %u minlen %u maxlen %u mod %u "
 		  "prod %u minleft %u total %u alignment %u minalignslop %u "
-		  "len %u type %s otype %s wasdel %d wasfromfl %d isfl %d "
+		  "len %u type %s otype %s wasdel %d wasfromfl %d resv %d "
 		  "userdata %d firstblock 0x%llx",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->agno,
@@ -1666,7 +1667,7 @@ DECLARE_EVENT_CLASS(xfs_alloc_class,
 		  __print_symbolic(__entry->otype, XFS_ALLOC_TYPES),
 		  __entry->wasdel,
 		  __entry->wasfromfl,
-		  __entry->isfl,
+		  __entry->resv,
 		  __entry->userdata,
 		  (unsigned long long)__entry->firstblock)
 )
@@ -2558,21 +2559,6 @@ DEFINE_RMAPBT_EVENT(xfs_rmap_map_gtrec);
 DEFINE_RMAPBT_EVENT(xfs_rmap_convert_gtrec);
 DEFINE_RMAPBT_EVENT(xfs_rmap_find_left_neighbor_result);
 
-/* dummy definitions to avoid breaking bisectability; will be removed later */
-#ifndef XFS_AG_RESV_DUMMY
-#define XFS_AG_RESV_DUMMY
-enum xfs_ag_resv_type {
-	XFS_AG_RESV_NONE = 0,
-	XFS_AG_RESV_METADATA,
-	XFS_AG_RESV_AGFL,
-};
-struct xfs_ag_resv {
-	xfs_extlen_t	ar_reserved;
-	xfs_extlen_t	ar_asked;
-};
-#define xfs_perag_resv(...)	NULL
-#endif
-
 /* per-AG reservation */
 DECLARE_EVENT_CLASS(xfs_ag_resv_class,
 	TP_PROTO(struct xfs_perag *pag, enum xfs_ag_resv_type resv,
diff --git a/fs/xfs/xfs_trans_extfree.c b/fs/xfs/xfs_trans_extfree.c
index d1b8833..ecb9a68 100644
--- a/fs/xfs/xfs_trans_extfree.c
+++ b/fs/xfs/xfs_trans_extfree.c
@@ -125,7 +125,8 @@ xfs_trans_free_extent(
 	struct xfs_extent	*extp;
 	int			error;
 
-	error = xfs_free_extent(tp, start_block, ext_len, oinfo);
+	error = xfs_free_extent(tp, start_block, ext_len, oinfo,
+			XFS_AG_RESV_NONE);
 
 	/*
 	 * Mark the transaction dirty, even on error. This ensures the



* [PATCH 053/119] xfs: define tracepoints for refcount btree activities
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (51 preceding siblings ...)
  2016-06-17  1:23 ` [PATCH 052/119] xfs: set up per-AG free space reservations Darrick J. Wong
@ 2016-06-17  1:23 ` Darrick J. Wong
  2016-06-17  1:23 ` [PATCH 054/119] xfs: introduce refcount btree definitions Darrick J. Wong
                   ` (65 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:23 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Define all the tracepoints we need to inspect the refcount btree
runtime operation.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_trace.h |  302 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 302 insertions(+)


diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index b421b28..6ed7cbf 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -40,6 +40,16 @@ struct xfs_inode_log_format;
 struct xfs_bmbt_irec;
 struct xfs_btree_cur;
 
+#ifndef XFS_REFCOUNT_IREC_PLACEHOLDER
+#define XFS_REFCOUNT_IREC_PLACEHOLDER
+/* Placeholder definition to avoid breaking bisectability. */
+struct xfs_refcount_irec {
+	xfs_agblock_t	rc_startblock;	/* starting block number */
+	xfs_extlen_t	rc_blockcount;	/* count of free blocks */
+	xfs_nlink_t	rc_refcount;	/* number of inodes linked here */
+};
+#endif
+
 DECLARE_EVENT_CLASS(xfs_attr_list_class,
 	TP_PROTO(struct xfs_attr_list_context *ctx),
 	TP_ARGS(ctx),
@@ -2613,6 +2623,298 @@ DEFINE_AG_RESV_EVENT(xfs_ag_resv_needed);
 DEFINE_AG_ERROR_EVENT(xfs_ag_resv_free_error);
 DEFINE_AG_ERROR_EVENT(xfs_ag_resv_init_error);
 
+/* refcount tracepoint classes */
+
+/* reuse the discard trace class for agbno/aglen-based traces */
+#define DEFINE_AG_EXTENT_EVENT(name) DEFINE_DISCARD_EVENT(name)
+
+/* ag btree lookup tracepoint class */
+#define XFS_AG_BTREE_CMP_FORMAT_STR \
+	{ XFS_LOOKUP_EQ,	"eq" }, \
+	{ XFS_LOOKUP_LE,	"le" }, \
+	{ XFS_LOOKUP_GE,	"ge" }
+DECLARE_EVENT_CLASS(xfs_ag_btree_lookup_class,
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
+		 xfs_agblock_t agbno, xfs_lookup_t dir),
+	TP_ARGS(mp, agno, agbno, dir),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_agblock_t, agbno)
+		__field(xfs_lookup_t, dir)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->agno = agno;
+		__entry->agbno = agbno;
+		__entry->dir = dir;
+	),
+	TP_printk("dev %d:%d agno %u agbno %u cmp %s(%d)\n",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->agno,
+		  __entry->agbno,
+		  __print_symbolic(__entry->dir, XFS_AG_BTREE_CMP_FORMAT_STR),
+		  __entry->dir)
+)
+
+#define DEFINE_AG_BTREE_LOOKUP_EVENT(name) \
+DEFINE_EVENT(xfs_ag_btree_lookup_class, name, \
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, \
+		 xfs_agblock_t agbno, xfs_lookup_t dir), \
+	TP_ARGS(mp, agno, agbno, dir))
+
+/* single-rcext tracepoint class */
+DECLARE_EVENT_CLASS(xfs_refcount_extent_class,
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
+		 struct xfs_refcount_irec *irec),
+	TP_ARGS(mp, agno, irec),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_agblock_t, startblock)
+		__field(xfs_extlen_t, blockcount)
+		__field(xfs_nlink_t, refcount)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->agno = agno;
+		__entry->startblock = irec->rc_startblock;
+		__entry->blockcount = irec->rc_blockcount;
+		__entry->refcount = irec->rc_refcount;
+	),
+	TP_printk("dev %d:%d agno %u agbno %u len %u refcount %u\n",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->agno,
+		  __entry->startblock,
+		  __entry->blockcount,
+		  __entry->refcount)
+)
+
+#define DEFINE_REFCOUNT_EXTENT_EVENT(name) \
+DEFINE_EVENT(xfs_refcount_extent_class, name, \
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, \
+		 struct xfs_refcount_irec *irec), \
+	TP_ARGS(mp, agno, irec))
+
+/* single-rcext and an agbno tracepoint class */
+DECLARE_EVENT_CLASS(xfs_refcount_extent_at_class,
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
+		 struct xfs_refcount_irec *irec, xfs_agblock_t agbno),
+	TP_ARGS(mp, agno, irec, agbno),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_agblock_t, startblock)
+		__field(xfs_extlen_t, blockcount)
+		__field(xfs_nlink_t, refcount)
+		__field(xfs_agblock_t, agbno)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->agno = agno;
+		__entry->startblock = irec->rc_startblock;
+		__entry->blockcount = irec->rc_blockcount;
+		__entry->refcount = irec->rc_refcount;
+		__entry->agbno = agbno;
+	),
+	TP_printk("dev %d:%d agno %u agbno %u len %u refcount %u @ agbno %u\n",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->agno,
+		  __entry->startblock,
+		  __entry->blockcount,
+		  __entry->refcount,
+		  __entry->agbno)
+)
+
+#define DEFINE_REFCOUNT_EXTENT_AT_EVENT(name) \
+DEFINE_EVENT(xfs_refcount_extent_at_class, name, \
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, \
+		 struct xfs_refcount_irec *irec, xfs_agblock_t agbno), \
+	TP_ARGS(mp, agno, irec, agbno))
+
+/* double-rcext tracepoint class */
+DECLARE_EVENT_CLASS(xfs_refcount_double_extent_class,
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
+		 struct xfs_refcount_irec *i1, struct xfs_refcount_irec *i2),
+	TP_ARGS(mp, agno, i1, i2),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_agblock_t, i1_startblock)
+		__field(xfs_extlen_t, i1_blockcount)
+		__field(xfs_nlink_t, i1_refcount)
+		__field(xfs_agblock_t, i2_startblock)
+		__field(xfs_extlen_t, i2_blockcount)
+		__field(xfs_nlink_t, i2_refcount)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->agno = agno;
+		__entry->i1_startblock = i1->rc_startblock;
+		__entry->i1_blockcount = i1->rc_blockcount;
+		__entry->i1_refcount = i1->rc_refcount;
+		__entry->i2_startblock = i2->rc_startblock;
+		__entry->i2_blockcount = i2->rc_blockcount;
+		__entry->i2_refcount = i2->rc_refcount;
+	),
+	TP_printk("dev %d:%d agno %u agbno %u len %u refcount %u -- "
+		  "agbno %u len %u refcount %u\n",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->agno,
+		  __entry->i1_startblock,
+		  __entry->i1_blockcount,
+		  __entry->i1_refcount,
+		  __entry->i2_startblock,
+		  __entry->i2_blockcount,
+		  __entry->i2_refcount)
+)
+
+#define DEFINE_REFCOUNT_DOUBLE_EXTENT_EVENT(name) \
+DEFINE_EVENT(xfs_refcount_double_extent_class, name, \
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, \
+		 struct xfs_refcount_irec *i1, struct xfs_refcount_irec *i2), \
+	TP_ARGS(mp, agno, i1, i2))
+
+/* double-rcext and an agbno tracepoint class */
+DECLARE_EVENT_CLASS(xfs_refcount_double_extent_at_class,
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
+		 struct xfs_refcount_irec *i1, struct xfs_refcount_irec *i2,
+		 xfs_agblock_t agbno),
+	TP_ARGS(mp, agno, i1, i2, agbno),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_agblock_t, i1_startblock)
+		__field(xfs_extlen_t, i1_blockcount)
+		__field(xfs_nlink_t, i1_refcount)
+		__field(xfs_agblock_t, i2_startblock)
+		__field(xfs_extlen_t, i2_blockcount)
+		__field(xfs_nlink_t, i2_refcount)
+		__field(xfs_agblock_t, agbno)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->agno = agno;
+		__entry->i1_startblock = i1->rc_startblock;
+		__entry->i1_blockcount = i1->rc_blockcount;
+		__entry->i1_refcount = i1->rc_refcount;
+		__entry->i2_startblock = i2->rc_startblock;
+		__entry->i2_blockcount = i2->rc_blockcount;
+		__entry->i2_refcount = i2->rc_refcount;
+		__entry->agbno = agbno;
+	),
+	TP_printk("dev %d:%d agno %u agbno %u len %u refcount %u -- "
+		  "agbno %u len %u refcount %u @ agbno %u\n",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->agno,
+		  __entry->i1_startblock,
+		  __entry->i1_blockcount,
+		  __entry->i1_refcount,
+		  __entry->i2_startblock,
+		  __entry->i2_blockcount,
+		  __entry->i2_refcount,
+		  __entry->agbno)
+)
+
+#define DEFINE_REFCOUNT_DOUBLE_EXTENT_AT_EVENT(name) \
+DEFINE_EVENT(xfs_refcount_double_extent_at_class, name, \
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, \
+		 struct xfs_refcount_irec *i1, struct xfs_refcount_irec *i2, \
+		 xfs_agblock_t agbno), \
+	TP_ARGS(mp, agno, i1, i2, agbno))
+
+/* triple-rcext tracepoint class */
+DECLARE_EVENT_CLASS(xfs_refcount_triple_extent_class,
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
+		 struct xfs_refcount_irec *i1, struct xfs_refcount_irec *i2,
+		 struct xfs_refcount_irec *i3),
+	TP_ARGS(mp, agno, i1, i2, i3),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_agblock_t, i1_startblock)
+		__field(xfs_extlen_t, i1_blockcount)
+		__field(xfs_nlink_t, i1_refcount)
+		__field(xfs_agblock_t, i2_startblock)
+		__field(xfs_extlen_t, i2_blockcount)
+		__field(xfs_nlink_t, i2_refcount)
+		__field(xfs_agblock_t, i3_startblock)
+		__field(xfs_extlen_t, i3_blockcount)
+		__field(xfs_nlink_t, i3_refcount)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->agno = agno;
+		__entry->i1_startblock = i1->rc_startblock;
+		__entry->i1_blockcount = i1->rc_blockcount;
+		__entry->i1_refcount = i1->rc_refcount;
+		__entry->i2_startblock = i2->rc_startblock;
+		__entry->i2_blockcount = i2->rc_blockcount;
+		__entry->i2_refcount = i2->rc_refcount;
+		__entry->i3_startblock = i3->rc_startblock;
+		__entry->i3_blockcount = i3->rc_blockcount;
+		__entry->i3_refcount = i3->rc_refcount;
+	),
+	TP_printk("dev %d:%d agno %u agbno %u len %u refcount %u -- "
+		  "agbno %u len %u refcount %u -- "
+		  "agbno %u len %u refcount %u\n",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->agno,
+		  __entry->i1_startblock,
+		  __entry->i1_blockcount,
+		  __entry->i1_refcount,
+		  __entry->i2_startblock,
+		  __entry->i2_blockcount,
+		  __entry->i2_refcount,
+		  __entry->i3_startblock,
+		  __entry->i3_blockcount,
+		  __entry->i3_refcount)
+)
+
+#define DEFINE_REFCOUNT_TRIPLE_EXTENT_EVENT(name) \
+DEFINE_EVENT(xfs_refcount_triple_extent_class, name, \
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, \
+		 struct xfs_refcount_irec *i1, struct xfs_refcount_irec *i2, \
+		 struct xfs_refcount_irec *i3), \
+	TP_ARGS(mp, agno, i1, i2, i3))
+
+/* refcount btree tracepoints */
+DEFINE_BUSY_EVENT(xfs_refcountbt_alloc_block);
+DEFINE_BUSY_EVENT(xfs_refcountbt_free_block);
+DEFINE_AG_BTREE_LOOKUP_EVENT(xfs_refcountbt_lookup);
+DEFINE_REFCOUNT_EXTENT_EVENT(xfs_refcountbt_get);
+DEFINE_REFCOUNT_EXTENT_EVENT(xfs_refcountbt_update);
+DEFINE_REFCOUNT_EXTENT_EVENT(xfs_refcountbt_insert);
+DEFINE_REFCOUNT_EXTENT_EVENT(xfs_refcountbt_delete);
+DEFINE_AG_ERROR_EVENT(xfs_refcountbt_insert_error);
+DEFINE_AG_ERROR_EVENT(xfs_refcountbt_delete_error);
+DEFINE_AG_ERROR_EVENT(xfs_refcountbt_update_error);
+
+/* refcount adjustment tracepoints */
+DEFINE_AG_EXTENT_EVENT(xfs_refcount_increase);
+DEFINE_AG_EXTENT_EVENT(xfs_refcount_decrease);
+DEFINE_REFCOUNT_TRIPLE_EXTENT_EVENT(xfs_refcount_merge_center_extents);
+DEFINE_REFCOUNT_EXTENT_EVENT(xfs_refcount_modify_extent);
+DEFINE_REFCOUNT_EXTENT_AT_EVENT(xfs_refcount_split_extent);
+DEFINE_REFCOUNT_DOUBLE_EXTENT_EVENT(xfs_refcount_merge_left_extent);
+DEFINE_REFCOUNT_DOUBLE_EXTENT_EVENT(xfs_refcount_merge_right_extent);
+DEFINE_REFCOUNT_DOUBLE_EXTENT_AT_EVENT(xfs_refcount_find_left_extent);
+DEFINE_REFCOUNT_DOUBLE_EXTENT_AT_EVENT(xfs_refcount_find_right_extent);
+DEFINE_AG_ERROR_EVENT(xfs_refcount_adjust_error);
+DEFINE_AG_ERROR_EVENT(xfs_refcount_merge_center_extents_error);
+DEFINE_AG_ERROR_EVENT(xfs_refcount_modify_extent_error);
+DEFINE_AG_ERROR_EVENT(xfs_refcount_split_extent_error);
+DEFINE_AG_ERROR_EVENT(xfs_refcount_merge_left_extent_error);
+DEFINE_AG_ERROR_EVENT(xfs_refcount_merge_right_extent_error);
+DEFINE_AG_ERROR_EVENT(xfs_refcount_find_left_extent_error);
+DEFINE_AG_ERROR_EVENT(xfs_refcount_find_right_extent_error);
+DEFINE_REFCOUNT_DOUBLE_EXTENT_EVENT(xfs_refcount_rec_order_error);
+
+/* reflink helpers */
+DEFINE_AG_EXTENT_EVENT(xfs_refcount_find_shared);
+DEFINE_AG_EXTENT_EVENT(xfs_refcount_find_shared_result);
+DEFINE_AG_ERROR_EVENT(xfs_refcount_find_shared_error);
+
 #endif /* _TRACE_XFS_H */
 
 #undef TRACE_INCLUDE_PATH


^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 054/119] xfs: introduce refcount btree definitions
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (52 preceding siblings ...)
  2016-06-17  1:23 ` [PATCH 053/119] xfs: define tracepoints for refcount btree activities Darrick J. Wong
@ 2016-06-17  1:23 ` Darrick J. Wong
  2016-06-17  1:23 ` [PATCH 055/119] xfs: add refcount btree stats infrastructure Darrick J. Wong
                   ` (64 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:23 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, Christoph Hellwig, xfs

Add the new refcount btree definitions to the per-AG structures.

v2: Move the reflink inode flag out of the way of the DAX flag, and
add the new cowextsize flag.

v3: Don't allow pNFS to export reflinked files; this restriction will
be lifted once the Linux pNFS server supports them.

[hch: don't allow pNFS export of reflinked files]
[darrick: fix the feature test in hch's patch]

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/libxfs/xfs_alloc.c      |    5 +++++
 fs/xfs/libxfs/xfs_btree.c      |    5 +++--
 fs/xfs/libxfs/xfs_btree.h      |    3 +++
 fs/xfs/libxfs/xfs_format.h     |   29 ++++++++++++++++++++++++++---
 fs/xfs/libxfs/xfs_rmap_btree.c |    7 +++++--
 fs/xfs/libxfs/xfs_types.h      |    2 +-
 fs/xfs/xfs_inode.h             |    5 +++++
 fs/xfs/xfs_mount.h             |    3 +++
 fs/xfs/xfs_pnfs.c              |    7 +++++++
 9 files changed, 58 insertions(+), 8 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 5f05c4e..9009b1f 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2448,6 +2448,10 @@ xfs_agf_verify(
 	    be32_to_cpu(agf->agf_btreeblks) > be32_to_cpu(agf->agf_length))
 		return false;
 
+	if (xfs_sb_version_hasreflink(&mp->m_sb) &&
+	    be32_to_cpu(agf->agf_refcount_level) > XFS_BTREE_MAXLEVELS)
+		return false;
+
 	return true;
 
 }
@@ -2568,6 +2572,7 @@ xfs_alloc_read_agf(
 			be32_to_cpu(agf->agf_levels[XFS_BTNUM_CNTi]);
 		pag->pagf_levels[XFS_BTNUM_RMAPi] =
 			be32_to_cpu(agf->agf_levels[XFS_BTNUM_RMAPi]);
+		pag->pagf_refcount_level = be32_to_cpu(agf->agf_refcount_level);
 		spin_lock_init(&pag->pagb_lock);
 		pag->pagb_count = 0;
 #ifdef __KERNEL__
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 50b2c32..1593239 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -45,9 +45,10 @@ kmem_zone_t	*xfs_btree_cur_zone;
  */
 static const __uint32_t xfs_magics[2][XFS_BTNUM_MAX] = {
 	{ XFS_ABTB_MAGIC, XFS_ABTC_MAGIC, 0, XFS_BMAP_MAGIC, XFS_IBT_MAGIC,
-	  XFS_FIBT_MAGIC },
+	  XFS_FIBT_MAGIC, 0 },
 	{ XFS_ABTB_CRC_MAGIC, XFS_ABTC_CRC_MAGIC, XFS_RMAP_CRC_MAGIC,
-	  XFS_BMAP_CRC_MAGIC, XFS_IBT_CRC_MAGIC, XFS_FIBT_CRC_MAGIC }
+	  XFS_BMAP_CRC_MAGIC, XFS_IBT_CRC_MAGIC, XFS_FIBT_CRC_MAGIC,
+	  XFS_REFC_CRC_MAGIC }
 };
 #define xfs_btree_magic(cur) \
 	xfs_magics[!!((cur)->bc_flags & XFS_BTREE_CRC_BLOCKS)][cur->bc_btnum]
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index 6fa13a9..9b5a921 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -66,6 +66,7 @@ union xfs_btree_rec {
 #define	XFS_BTNUM_INO	((xfs_btnum_t)XFS_BTNUM_INOi)
 #define	XFS_BTNUM_FINO	((xfs_btnum_t)XFS_BTNUM_FINOi)
 #define	XFS_BTNUM_RMAP	((xfs_btnum_t)XFS_BTNUM_RMAPi)
+#define	XFS_BTNUM_REFC	((xfs_btnum_t)XFS_BTNUM_REFCi)
 
 /*
  * For logging record fields.
@@ -99,6 +100,7 @@ do {    \
 	case XFS_BTNUM_INO: __XFS_BTREE_STATS_INC(__mp, ibt, stat); break; \
 	case XFS_BTNUM_FINO: __XFS_BTREE_STATS_INC(__mp, fibt, stat); break; \
 	case XFS_BTNUM_RMAP: __XFS_BTREE_STATS_INC(__mp, rmap, stat); break; \
+	case XFS_BTNUM_REFC: break; \
 	case XFS_BTNUM_MAX: ASSERT(0); __mp = __mp /* fucking gcc */ ; break; \
 	}       \
 } while (0)
@@ -121,6 +123,7 @@ do {    \
 		__XFS_BTREE_STATS_ADD(__mp, fibt, stat, val); break; \
 	case XFS_BTNUM_RMAP:	\
 		__XFS_BTREE_STATS_ADD(__mp, rmap, stat, val); break; \
+	case XFS_BTNUM_REFC: break;	\
 	case XFS_BTNUM_MAX: ASSERT(0); __mp = __mp /* fucking gcc */ ; break; \
 	}       \
 } while (0)
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 1b08237..63a97a9 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -456,6 +456,7 @@ xfs_sb_has_compat_feature(
 
 #define XFS_SB_FEAT_RO_COMPAT_FINOBT   (1 << 0)		/* free inode btree */
 #define XFS_SB_FEAT_RO_COMPAT_RMAPBT   (1 << 1)		/* reverse map btree */
+#define XFS_SB_FEAT_RO_COMPAT_REFLINK  (1 << 2)		/* reflinked files */
 #define XFS_SB_FEAT_RO_COMPAT_ALL \
 		(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
 		 XFS_SB_FEAT_RO_COMPAT_RMAPBT)
@@ -546,6 +547,12 @@ static inline bool xfs_sb_version_hasrmapbt(struct xfs_sb *sbp)
 		(sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_RMAPBT);
 }
 
+static inline bool xfs_sb_version_hasreflink(struct xfs_sb *sbp)
+{
+	return (XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5) &&
+		(sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_REFLINK);
+}
+
 /*
  * end of superblock version macros
  */
@@ -640,12 +647,15 @@ typedef struct xfs_agf {
 	__be32		agf_btreeblks;	/* # of blocks held in AGF btrees */
 	uuid_t		agf_uuid;	/* uuid of filesystem */
 
+	__be32		agf_refcount_root;	/* refcount tree root block */
+	__be32		agf_refcount_level;	/* refcount btree levels */
+
 	/*
 	 * reserve some contiguous space for future logged fields before we add
 	 * the unlogged fields. This makes the range logging via flags and
 	 * structure offsets much simpler.
 	 */
-	__be64		agf_spare64[16];
+	__be64		agf_spare64[15];
 
 	/* unlogged fields, written during buffer writeback. */
 	__be64		agf_lsn;	/* last write sequence */
@@ -1033,9 +1043,14 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
  * 16 bits of the XFS_XFLAG_s range.
  */
 #define XFS_DIFLAG2_DAX_BIT	0	/* use DAX for this inode */
+#define XFS_DIFLAG2_REFLINK_BIT	1	/* file's blocks may be shared */
+#define XFS_DIFLAG2_COWEXTSIZE_BIT   2  /* copy on write extent size hint */
 #define XFS_DIFLAG2_DAX		(1 << XFS_DIFLAG2_DAX_BIT)
+#define XFS_DIFLAG2_REFLINK     (1 << XFS_DIFLAG2_REFLINK_BIT)
+#define XFS_DIFLAG2_COWEXTSIZE  (1 << XFS_DIFLAG2_COWEXTSIZE_BIT)
 
-#define XFS_DIFLAG2_ANY		(XFS_DIFLAG2_DAX)
+#define XFS_DIFLAG2_ANY \
+	(XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE)
 
 /*
  * Inode number format:
@@ -1382,7 +1397,8 @@ xfs_rmap_ino_owner(
 #define XFS_RMAP_OWN_AG		(-5ULL)	/* AG freespace btree blocks */
 #define XFS_RMAP_OWN_INOBT	(-6ULL)	/* Inode btree blocks */
 #define XFS_RMAP_OWN_INODES	(-7ULL)	/* Inode chunk */
-#define XFS_RMAP_OWN_MIN	(-8ULL) /* guard */
+#define XFS_RMAP_OWN_REFC	(-8ULL) /* refcount tree */
+#define XFS_RMAP_OWN_MIN	(-9ULL) /* guard */
 
 #define XFS_RMAP_NON_INODE_OWNER(owner)	(!!((owner) & (1ULL << 63)))
 
@@ -1530,6 +1546,13 @@ xfs_owner_info_pack(
 }
 
 /*
+ * Reference Count Btree format definitions
+ *
+ */
+#define	XFS_REFC_CRC_MAGIC	0x52334643	/* 'R3FC' */
+
+
+/*
  * BMAP Btree format definitions
  *
  * This includes both the root block definition that sits inside an inode fork
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
index 9adb930..090dbbe 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.c
+++ b/fs/xfs/libxfs/xfs_rmap_btree.c
@@ -475,6 +475,9 @@ void
 xfs_rmapbt_compute_maxlevels(
 	struct xfs_mount		*mp)
 {
-	mp->m_rmap_maxlevels = xfs_btree_compute_maxlevels(mp,
-			mp->m_rmap_mnr, mp->m_sb.sb_agblocks);
+	if (xfs_sb_version_hasreflink(&mp->m_sb))
+		mp->m_rmap_maxlevels = XFS_BTREE_MAXLEVELS;
+	else
+		mp->m_rmap_maxlevels = xfs_btree_compute_maxlevels(mp,
+				mp->m_rmap_mnr, mp->m_sb.sb_agblocks);
 }
diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
index da87796..690d616 100644
--- a/fs/xfs/libxfs/xfs_types.h
+++ b/fs/xfs/libxfs/xfs_types.h
@@ -112,7 +112,7 @@ typedef enum {
 
 typedef enum {
 	XFS_BTNUM_BNOi, XFS_BTNUM_CNTi, XFS_BTNUM_RMAPi, XFS_BTNUM_BMAPi,
-	XFS_BTNUM_INOi, XFS_BTNUM_FINOi, XFS_BTNUM_MAX
+	XFS_BTNUM_INOi, XFS_BTNUM_FINOi, XFS_BTNUM_REFCi, XFS_BTNUM_MAX
 } xfs_btnum_t;
 
 struct xfs_name {
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 633f2af..d0ea6ff 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -202,6 +202,11 @@ xfs_get_initial_prid(struct xfs_inode *dp)
 	return XFS_PROJID_DEFAULT;
 }
 
+static inline bool xfs_is_reflink_inode(struct xfs_inode *ip)
+{
+	return ip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK;
+}
+
 /*
  * In-core inode flags.
  */
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index e18d74e..823ee63 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -391,6 +391,9 @@ typedef struct xfs_perag {
 	struct xfs_ag_resv	pag_meta_resv;
 	/* Blocks reserved for just AGFL-based metadata. */
 	struct xfs_ag_resv	pag_agfl_resv;
+
+	/* reference count */
+	__uint8_t	pagf_refcount_level;
 } xfs_perag_t;
 
 static inline struct xfs_ag_resv *
diff --git a/fs/xfs/xfs_pnfs.c b/fs/xfs/xfs_pnfs.c
index d5b7566..b21f532 100644
--- a/fs/xfs/xfs_pnfs.c
+++ b/fs/xfs/xfs_pnfs.c
@@ -139,6 +139,13 @@ xfs_fs_map_blocks(
 		return -ENXIO;
 
 	/*
+	 * The pNFS block layout spec actually supports reflink like
+	 * functionality, but the Linux pNFS server doesn't implement it yet.
+	 */
+	if (xfs_is_reflink_inode(ip))
+		return -ENXIO;
+
+	/*
 	 * Lock out any other I/O before we flush and invalidate the pagecache,
 	 * and then hand out a layout to the remote system.  This is very
 	 * similar to direct I/O, except that the synchronization is much more



* [PATCH 055/119] xfs: add refcount btree stats infrastructure
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (53 preceding siblings ...)
  2016-06-17  1:23 ` [PATCH 054/119] xfs: introduce refcount btree definitions Darrick J. Wong
@ 2016-06-17  1:23 ` Darrick J. Wong
  2016-06-17  1:23 ` [PATCH 056/119] xfs: refcount btree add more reserved blocks Darrick J. Wong
                   ` (63 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:23 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

The refcount btree presents the same stats as the other btrees, so
add all the code for that now.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_btree.h |    5 +++--
 fs/xfs/xfs_stats.c        |    1 +
 fs/xfs/xfs_stats.h        |   18 +++++++++++++++++-
 3 files changed, 21 insertions(+), 3 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index 9b5a921..93e761e 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -100,7 +100,7 @@ do {    \
 	case XFS_BTNUM_INO: __XFS_BTREE_STATS_INC(__mp, ibt, stat); break; \
 	case XFS_BTNUM_FINO: __XFS_BTREE_STATS_INC(__mp, fibt, stat); break; \
 	case XFS_BTNUM_RMAP: __XFS_BTREE_STATS_INC(__mp, rmap, stat); break; \
-	case XFS_BTNUM_REFC: break; \
+	case XFS_BTNUM_REFC: __XFS_BTREE_STATS_INC(__mp, refcbt, stat); break; \
 	case XFS_BTNUM_MAX: ASSERT(0); __mp = __mp /* fucking gcc */ ; break; \
 	}       \
 } while (0)
@@ -123,7 +123,8 @@ do {    \
 		__XFS_BTREE_STATS_ADD(__mp, fibt, stat, val); break; \
 	case XFS_BTNUM_RMAP:	\
 		__XFS_BTREE_STATS_ADD(__mp, rmap, stat, val); break; \
-	case XFS_BTNUM_REFC: break;	\
+	case XFS_BTNUM_REFC:	\
+		__XFS_BTREE_STATS_ADD(__mp, refcbt, stat, val); break; \
 	case XFS_BTNUM_MAX: ASSERT(0); __mp = __mp /* fucking gcc */ ; break; \
 	}       \
 } while (0)
diff --git a/fs/xfs/xfs_stats.c b/fs/xfs/xfs_stats.c
index f04f547..e447deb 100644
--- a/fs/xfs/xfs_stats.c
+++ b/fs/xfs/xfs_stats.c
@@ -62,6 +62,7 @@ int xfs_stats_format(struct xfsstats __percpu *stats, char *buf)
 		{ "ibt2",		XFSSTAT_END_IBT_V2		},
 		{ "fibt2",		XFSSTAT_END_FIBT_V2		},
 		{ "rmapbt",		XFSSTAT_END_RMAP_V2		},
+		{ "refcntbt",		XFSSTAT_END_REFCOUNT		},
 		/* we print both series of quota information together */
 		{ "qm",			XFSSTAT_END_QM			},
 	};
diff --git a/fs/xfs/xfs_stats.h b/fs/xfs/xfs_stats.h
index 657865f..79ad2e6 100644
--- a/fs/xfs/xfs_stats.h
+++ b/fs/xfs/xfs_stats.h
@@ -213,7 +213,23 @@ struct xfsstats {
 	__uint32_t		xs_rmap_2_alloc;
 	__uint32_t		xs_rmap_2_free;
 	__uint32_t		xs_rmap_2_moves;
-#define XFSSTAT_END_XQMSTAT		(XFSSTAT_END_RMAP_V2+6)
+#define XFSSTAT_END_REFCOUNT		(XFSSTAT_END_RMAP_V2 + 15)
+	__uint32_t		xs_refcbt_2_lookup;
+	__uint32_t		xs_refcbt_2_compare;
+	__uint32_t		xs_refcbt_2_insrec;
+	__uint32_t		xs_refcbt_2_delrec;
+	__uint32_t		xs_refcbt_2_newroot;
+	__uint32_t		xs_refcbt_2_killroot;
+	__uint32_t		xs_refcbt_2_increment;
+	__uint32_t		xs_refcbt_2_decrement;
+	__uint32_t		xs_refcbt_2_lshift;
+	__uint32_t		xs_refcbt_2_rshift;
+	__uint32_t		xs_refcbt_2_split;
+	__uint32_t		xs_refcbt_2_join;
+	__uint32_t		xs_refcbt_2_alloc;
+	__uint32_t		xs_refcbt_2_free;
+	__uint32_t		xs_refcbt_2_moves;
+#define XFSSTAT_END_XQMSTAT		(XFSSTAT_END_REFCOUNT + 6)
 	__uint32_t		xs_qm_dqreclaims;
 	__uint32_t		xs_qm_dqreclaim_misses;
 	__uint32_t		xs_qm_dquot_dups;



* [PATCH 056/119] xfs: refcount btree add more reserved blocks
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (54 preceding siblings ...)
  2016-06-17  1:23 ` [PATCH 055/119] xfs: add refcount btree stats infrastructure Darrick J. Wong
@ 2016-06-17  1:23 ` Darrick J. Wong
  2016-06-17  1:23 ` [PATCH 057/119] xfs: define the on-disk refcount btree format Darrick J. Wong
                   ` (62 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:23 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Since XFS reserves a small amount of space in each AG as the minimum
free space needed for an operation, save some more space in case we
touch the refcount btree.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_alloc.c  |   13 +++++++++++++
 fs/xfs/libxfs/xfs_format.h |    2 ++
 2 files changed, 15 insertions(+)


diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 9009b1f..14f8a69d 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -52,10 +52,23 @@ STATIC int xfs_alloc_ag_vextent_size(xfs_alloc_arg_t *);
 STATIC int xfs_alloc_ag_vextent_small(xfs_alloc_arg_t *,
 		xfs_btree_cur_t *, xfs_agblock_t *, xfs_extlen_t *, int *);
 
+unsigned int
+xfs_refc_block(
+	struct xfs_mount	*mp)
+{
+	if (xfs_sb_version_hasrmapbt(&mp->m_sb))
+		return XFS_RMAP_BLOCK(mp) + 1;
+	if (xfs_sb_version_hasfinobt(&mp->m_sb))
+		return XFS_FIBT_BLOCK(mp) + 1;
+	return XFS_IBT_BLOCK(mp) + 1;
+}
+
 xfs_extlen_t
 xfs_prealloc_blocks(
 	struct xfs_mount	*mp)
 {
+	if (xfs_sb_version_hasreflink(&mp->m_sb))
+		return xfs_refc_block(mp) + 1;
 	if (xfs_sb_version_hasrmapbt(&mp->m_sb))
 		return XFS_RMAP_BLOCK(mp) + 1;
 	if (xfs_sb_version_hasfinobt(&mp->m_sb))
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 63a97a9..adeeb08 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -1551,6 +1551,8 @@ xfs_owner_info_pack(
  */
 #define	XFS_REFC_CRC_MAGIC	0x52334643	/* 'R3FC' */
 
+unsigned int xfs_refc_block(struct xfs_mount *mp);
+
 
 /*
  * BMAP Btree format definitions



* [PATCH 057/119] xfs: define the on-disk refcount btree format
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (55 preceding siblings ...)
  2016-06-17  1:23 ` [PATCH 056/119] xfs: refcount btree add more reserved blocks Darrick J. Wong
@ 2016-06-17  1:23 ` Darrick J. Wong
  2016-06-17  1:24 ` [PATCH 058/119] xfs: add refcount btree support to growfs Darrick J. Wong
                   ` (61 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:23 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, Christoph Hellwig, xfs

Start constructing the refcount btree implementation by establishing
the on-disk format and everything needed to read, write, and
manipulate the refcount btree blocks.

v2: Calculate a separate maxlevels for the refcount btree.

v3: Enable the tracking of per-cursor stats for refcount btrees.
The refcount update code will use this to guess if it's time to
split a refcountbt update across two transactions to avoid
exhausting the transaction reservation.

xfs_refcountbt_init_cursor can be called under the ilock, so
use KM_NOFS to prevent fs activity with a lock held.  This
should shut up some of the lockdep warnings.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
[hch: allocate the cursor with KM_NOFS to quiet lockdep]
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/Makefile                    |    1 
 fs/xfs/libxfs/xfs_btree.c          |    3 +
 fs/xfs/libxfs/xfs_btree.h          |   12 ++
 fs/xfs/libxfs/xfs_format.h         |   32 ++++++
 fs/xfs/libxfs/xfs_refcount_btree.c |  178 ++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_refcount_btree.h |   67 ++++++++++++++
 fs/xfs/libxfs/xfs_sb.c             |    9 ++
 fs/xfs/libxfs/xfs_shared.h         |    2 
 fs/xfs/libxfs/xfs_trans_resv.c     |    2 
 fs/xfs/libxfs/xfs_trans_resv.h     |    1 
 fs/xfs/xfs_mount.c                 |    2 
 fs/xfs/xfs_mount.h                 |    3 +
 fs/xfs/xfs_ondisk.h                |    3 +
 fs/xfs/xfs_trace.h                 |   11 --
 14 files changed, 315 insertions(+), 11 deletions(-)
 create mode 100644 fs/xfs/libxfs/xfs_refcount_btree.c
 create mode 100644 fs/xfs/libxfs/xfs_refcount_btree.h


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index c7a864e..3f579af 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -55,6 +55,7 @@ xfs-y				+= $(addprefix libxfs/, \
 				   xfs_ag_resv.o \
 				   xfs_rmap.o \
 				   xfs_rmap_btree.o \
+				   xfs_refcount_btree.o \
 				   xfs_sb.o \
 				   xfs_symlink_remote.o \
 				   xfs_trans_resv.o \
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 1593239..9c84184 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -1214,6 +1214,9 @@ xfs_btree_set_refs(
 	case XFS_BTNUM_RMAP:
 		xfs_buf_set_ref(bp, XFS_RMAP_BTREE_REF);
 		break;
+	case XFS_BTNUM_REFC:
+		xfs_buf_set_ref(bp, XFS_REFC_BTREE_REF);
+		break;
 	default:
 		ASSERT(0);
 	}
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index 93e761e..dbf299f 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -43,6 +43,7 @@ union xfs_btree_key {
 	xfs_alloc_key_t			alloc;
 	struct xfs_inobt_key		inobt;
 	struct xfs_rmap_key		rmap;
+	struct xfs_refcount_key		refc;
 };
 
 union xfs_btree_rec {
@@ -51,6 +52,7 @@ union xfs_btree_rec {
 	struct xfs_alloc_rec		alloc;
 	struct xfs_inobt_rec		inobt;
 	struct xfs_rmap_rec		rmap;
+	struct xfs_refcount_rec		refc;
 };
 
 /*
@@ -221,6 +223,15 @@ union xfs_btree_irec {
 	xfs_bmbt_irec_t			b;
 	xfs_inobt_rec_incore_t		i;
 	struct xfs_rmap_irec		r;
+	struct xfs_refcount_irec	rc;
+};
+
+/* Per-AG btree private information. */
+union xfs_btree_cur_private {
+	struct {
+		unsigned long	nr_ops;		/* # record updates */
+		int		shape_changes;	/* # of extent splits */
+	} refc;
 };
 
 /*
@@ -247,6 +258,7 @@ typedef struct xfs_btree_cur
 			struct xfs_buf	*agbp;	/* agf/agi buffer pointer */
 			struct xfs_defer_ops *dfops;	/* deferred updates */
 			xfs_agnumber_t	agno;	/* ag number */
+			union xfs_btree_cur_private	priv;
 		} a;
 		struct {			/* needed for BMAP */
 			struct xfs_inode *ip;	/* pointer to our inode */
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index adeeb08..3b0cae2 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -1553,6 +1553,38 @@ xfs_owner_info_pack(
 
 unsigned int xfs_refc_block(struct xfs_mount *mp);
 
+/*
+ * Data record/key structure
+ *
+ * Each record associates a range of physical blocks (starting at
+ * rc_startblock and ending rc_blockcount blocks later) with a
+ * reference count (rc_refcount).  A record is only stored in the
+ * btree if the refcount is >= 2.  An entry in the free block btree
+ * means that the refcount is 0, and no entries anywhere means that
+ * the refcount is 1, as was true in XFS before reflinking.
+ */
+struct xfs_refcount_rec {
+	__be32		rc_startblock;	/* starting block number */
+	__be32		rc_blockcount;	/* count of blocks */
+	__be32		rc_refcount;	/* number of inodes linked here */
+};
+
+struct xfs_refcount_key {
+	__be32		rc_startblock;	/* starting block number */
+};
+
+struct xfs_refcount_irec {
+	xfs_agblock_t	rc_startblock;	/* starting block number */
+	xfs_extlen_t	rc_blockcount;	/* count of blocks */
+	xfs_nlink_t	rc_refcount;	/* number of inodes linked here */
+};
+
+#define MAXREFCOUNT	((xfs_nlink_t)~0U)
+#define MAXREFCEXTLEN	((xfs_extlen_t)~0U)
+
+/* btree pointer type */
+typedef __be32 xfs_refcount_ptr_t;
+
 
 /*
  * BMAP Btree format definitions
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
new file mode 100644
index 0000000..359cf0c
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_refcount_btree.c
@@ -0,0 +1,178 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_sb.h"
+#include "xfs_mount.h"
+#include "xfs_btree.h"
+#include "xfs_bmap.h"
+#include "xfs_refcount_btree.h"
+#include "xfs_alloc.h"
+#include "xfs_error.h"
+#include "xfs_trace.h"
+#include "xfs_cksum.h"
+#include "xfs_trans.h"
+#include "xfs_bit.h"
+
+static struct xfs_btree_cur *
+xfs_refcountbt_dup_cursor(
+	struct xfs_btree_cur	*cur)
+{
+	return xfs_refcountbt_init_cursor(cur->bc_mp, cur->bc_tp,
+			cur->bc_private.a.agbp, cur->bc_private.a.agno,
+			cur->bc_private.a.dfops);
+}
+
+STATIC bool
+xfs_refcountbt_verify(
+	struct xfs_buf		*bp)
+{
+	struct xfs_mount	*mp = bp->b_target->bt_mount;
+	struct xfs_btree_block	*block = XFS_BUF_TO_BLOCK(bp);
+	struct xfs_perag	*pag = bp->b_pag;
+	unsigned int		level;
+
+	if (block->bb_magic != cpu_to_be32(XFS_REFC_CRC_MAGIC))
+		return false;
+
+	if (!xfs_sb_version_hasreflink(&mp->m_sb))
+		return false;
+	if (!xfs_btree_sblock_v5hdr_verify(bp))
+		return false;
+
+	level = be16_to_cpu(block->bb_level);
+	if (pag && pag->pagf_init) {
+		if (level >= pag->pagf_refcount_level)
+			return false;
+	} else if (level >= mp->m_refc_maxlevels)
+		return false;
+
+	return xfs_btree_sblock_verify(bp, mp->m_refc_mxr[level != 0]);
+}
+
+STATIC void
+xfs_refcountbt_read_verify(
+	struct xfs_buf	*bp)
+{
+	if (!xfs_btree_sblock_verify_crc(bp))
+		xfs_buf_ioerror(bp, -EFSBADCRC);
+	else if (!xfs_refcountbt_verify(bp))
+		xfs_buf_ioerror(bp, -EFSCORRUPTED);
+
+	if (bp->b_error) {
+		trace_xfs_btree_corrupt(bp, _RET_IP_);
+		xfs_verifier_error(bp);
+	}
+}
+
+STATIC void
+xfs_refcountbt_write_verify(
+	struct xfs_buf	*bp)
+{
+	if (!xfs_refcountbt_verify(bp)) {
+		trace_xfs_btree_corrupt(bp, _RET_IP_);
+		xfs_buf_ioerror(bp, -EFSCORRUPTED);
+		xfs_verifier_error(bp);
+		return;
+	}
+	xfs_btree_sblock_calc_crc(bp);
+}
+
+const struct xfs_buf_ops xfs_refcountbt_buf_ops = {
+	.name			= "xfs_refcountbt",
+	.verify_read		= xfs_refcountbt_read_verify,
+	.verify_write		= xfs_refcountbt_write_verify,
+};
+
+static const struct xfs_btree_ops xfs_refcountbt_ops = {
+	.rec_len		= sizeof(struct xfs_refcount_rec),
+	.key_len		= sizeof(struct xfs_refcount_key),
+
+	.dup_cursor		= xfs_refcountbt_dup_cursor,
+	.buf_ops		= &xfs_refcountbt_buf_ops,
+};
+
+/*
+ * Allocate a new refcount btree cursor.
+ */
+struct xfs_btree_cur *
+xfs_refcountbt_init_cursor(
+	struct xfs_mount	*mp,
+	struct xfs_trans	*tp,
+	struct xfs_buf		*agbp,
+	xfs_agnumber_t		agno,
+	struct xfs_defer_ops	*dfops)
+{
+	struct xfs_agf		*agf = XFS_BUF_TO_AGF(agbp);
+	struct xfs_btree_cur	*cur;
+
+	ASSERT(agno != NULLAGNUMBER);
+	ASSERT(agno < mp->m_sb.sb_agcount);
+	cur = kmem_zone_zalloc(xfs_btree_cur_zone, KM_NOFS);
+
+	cur->bc_tp = tp;
+	cur->bc_mp = mp;
+	cur->bc_btnum = XFS_BTNUM_REFC;
+	cur->bc_blocklog = mp->m_sb.sb_blocklog;
+	cur->bc_ops = &xfs_refcountbt_ops;
+
+	cur->bc_nlevels = be32_to_cpu(agf->agf_refcount_level);
+
+	cur->bc_private.a.agbp = agbp;
+	cur->bc_private.a.agno = agno;
+	cur->bc_private.a.dfops = dfops;
+	cur->bc_flags |= XFS_BTREE_CRC_BLOCKS;
+
+	cur->bc_private.a.priv.refc.nr_ops = 0;
+	cur->bc_private.a.priv.refc.shape_changes = 0;
+
+	return cur;
+}
+
+/*
+ * Calculate the number of records in a refcount btree block.
+ */
+int
+xfs_refcountbt_maxrecs(
+	struct xfs_mount	*mp,
+	int			blocklen,
+	bool			leaf)
+{
+	blocklen -= XFS_REFCOUNT_BLOCK_LEN;
+
+	if (leaf)
+		return blocklen / sizeof(struct xfs_refcount_rec);
+	return blocklen / (sizeof(struct xfs_refcount_key) +
+			   sizeof(xfs_refcount_ptr_t));
+}
+
+/* Compute the maximum height of a refcount btree. */
+void
+xfs_refcountbt_compute_maxlevels(
+	struct xfs_mount		*mp)
+{
+	mp->m_refc_maxlevels = xfs_btree_compute_maxlevels(mp,
+			mp->m_refc_mnr, mp->m_sb.sb_agblocks);
+}
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.h b/fs/xfs/libxfs/xfs_refcount_btree.h
new file mode 100644
index 0000000..9e9ad7c
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_refcount_btree.h
@@ -0,0 +1,67 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#ifndef __XFS_REFCOUNT_BTREE_H__
+#define	__XFS_REFCOUNT_BTREE_H__
+
+/*
+ * Reference Count Btree on-disk structures
+ */
+
+struct xfs_buf;
+struct xfs_btree_cur;
+struct xfs_mount;
+
+/*
+ * Btree block header size
+ */
+#define XFS_REFCOUNT_BLOCK_LEN	XFS_BTREE_SBLOCK_CRC_LEN
+
+/*
+ * Record, key, and pointer address macros for btree blocks.
+ *
+ * (note that some of these may appear unused, but they are used in userspace)
+ */
+#define XFS_REFCOUNT_REC_ADDR(block, index) \
+	((struct xfs_refcount_rec *) \
+		((char *)(block) + \
+		 XFS_REFCOUNT_BLOCK_LEN + \
+		 (((index) - 1) * sizeof(struct xfs_refcount_rec))))
+
+#define XFS_REFCOUNT_KEY_ADDR(block, index) \
+	((struct xfs_refcount_key *) \
+		((char *)(block) + \
+		 XFS_REFCOUNT_BLOCK_LEN + \
+		 ((index) - 1) * sizeof(struct xfs_refcount_key)))
+
+#define XFS_REFCOUNT_PTR_ADDR(block, index, maxrecs) \
+	((xfs_refcount_ptr_t *) \
+		((char *)(block) + \
+		 XFS_REFCOUNT_BLOCK_LEN + \
+		 (maxrecs) * sizeof(struct xfs_refcount_key) + \
+		 ((index) - 1) * sizeof(xfs_refcount_ptr_t)))
+
+extern struct xfs_btree_cur *xfs_refcountbt_init_cursor(struct xfs_mount *mp,
+		struct xfs_trans *tp, struct xfs_buf *agbp, xfs_agnumber_t agno,
+		struct xfs_defer_ops *dfops);
+extern int xfs_refcountbt_maxrecs(struct xfs_mount *mp, int blocklen,
+		bool leaf);
+extern void xfs_refcountbt_compute_maxlevels(struct xfs_mount *mp);
+
+#endif	/* __XFS_REFCOUNT_BTREE_H__ */
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index 59c9f59..a937071 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -38,6 +38,8 @@
 #include "xfs_ialloc_btree.h"
 #include "xfs_log.h"
 #include "xfs_rmap_btree.h"
+#include "xfs_bmap.h"
+#include "xfs_refcount_btree.h"
 
 /*
  * Physical superblock buffer manipulations. Shared with libxfs in userspace.
@@ -740,6 +742,13 @@ xfs_sb_mount_common(
 	mp->m_rmap_mnr[0] = mp->m_rmap_mxr[0] / 2;
 	mp->m_rmap_mnr[1] = mp->m_rmap_mxr[1] / 2;
 
+	mp->m_refc_mxr[0] = xfs_refcountbt_maxrecs(mp, sbp->sb_blocksize,
+			true);
+	mp->m_refc_mxr[1] = xfs_refcountbt_maxrecs(mp, sbp->sb_blocksize,
+			false);
+	mp->m_refc_mnr[0] = mp->m_refc_mxr[0] / 2;
+	mp->m_refc_mnr[1] = mp->m_refc_mxr[1] / 2;
+
 	mp->m_bsize = XFS_FSB_TO_BB(mp, 1);
 	mp->m_ialloc_inos = (int)MAX((__uint16_t)XFS_INODES_PER_CHUNK,
 					sbp->sb_inopblock);
diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
index 0c5b30b..c6f4eb4 100644
--- a/fs/xfs/libxfs/xfs_shared.h
+++ b/fs/xfs/libxfs/xfs_shared.h
@@ -39,6 +39,7 @@ extern const struct xfs_buf_ops xfs_agf_buf_ops;
 extern const struct xfs_buf_ops xfs_agfl_buf_ops;
 extern const struct xfs_buf_ops xfs_allocbt_buf_ops;
 extern const struct xfs_buf_ops xfs_rmapbt_buf_ops;
+extern const struct xfs_buf_ops xfs_refcountbt_buf_ops;
 extern const struct xfs_buf_ops xfs_attr3_leaf_buf_ops;
 extern const struct xfs_buf_ops xfs_attr3_rmt_buf_ops;
 extern const struct xfs_buf_ops xfs_bmbt_buf_ops;
@@ -122,6 +123,7 @@ int	xfs_log_calc_minimum_size(struct xfs_mount *);
 #define	XFS_INO_REF		2
 #define	XFS_ATTR_BTREE_REF	1
 #define	XFS_DQUOT_REF		1
+#define	XFS_REFC_BTREE_REF	1
 
 /*
  * Flags for xfs_trans_ichgtime().
diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
index 301ef2f..7c840e1 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.c
+++ b/fs/xfs/libxfs/xfs_trans_resv.c
@@ -73,7 +73,7 @@ xfs_calc_buf_res(
  *
  * Keep in mind that max depth is calculated separately for each type of tree.
  */
-static uint
+uint
 xfs_allocfree_log_count(
 	struct xfs_mount *mp,
 	uint		num_ops)
diff --git a/fs/xfs/libxfs/xfs_trans_resv.h b/fs/xfs/libxfs/xfs_trans_resv.h
index 0eb46ed..36a1511 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.h
+++ b/fs/xfs/libxfs/xfs_trans_resv.h
@@ -102,5 +102,6 @@ struct xfs_trans_resv {
 #define	XFS_ATTRRM_LOG_COUNT		3
 
 void xfs_trans_resv_calc(struct xfs_mount *mp, struct xfs_trans_resv *resp);
+uint xfs_allocfree_log_count(struct xfs_mount *mp, uint num_ops);
 
 #endif	/* __XFS_TRANS_RESV_H__ */
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 879f3ef..48b8b1e 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -43,6 +43,7 @@
 #include "xfs_icache.h"
 #include "xfs_sysfs.h"
 #include "xfs_rmap_btree.h"
+#include "xfs_refcount_btree.h"
 
 
 static DEFINE_MUTEX(xfs_uuid_table_mutex);
@@ -682,6 +683,7 @@ xfs_mountfs(
 	xfs_bmap_compute_maxlevels(mp, XFS_ATTR_FORK);
 	xfs_ialloc_compute_maxlevels(mp);
 	xfs_rmapbt_compute_maxlevels(mp);
+	xfs_refcountbt_compute_maxlevels(mp);
 
 	xfs_set_maxicount(mp);
 
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 823ee63..a516a1f 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -118,10 +118,13 @@ typedef struct xfs_mount {
 	uint			m_inobt_mnr[2];	/* min inobt btree records */
 	uint			m_rmap_mxr[2];	/* max rmap btree records */
 	uint			m_rmap_mnr[2];	/* min rmap btree records */
+	uint			m_refc_mxr[2];	/* max refc btree records */
+	uint			m_refc_mnr[2];	/* min refc btree records */
 	uint			m_ag_maxlevels;	/* XFS_AG_MAXLEVELS */
 	uint			m_bm_maxlevels[2]; /* XFS_BM_MAXLEVELS */
 	uint			m_in_maxlevels;	/* max inobt btree levels. */
 	uint			m_rmap_maxlevels; /* max rmap btree levels */
+	uint			m_refc_maxlevels; /* max refcount btree level */
 	xfs_extlen_t		m_ag_prealloc_blocks; /* reserved ag blocks */
 	uint			m_alloc_set_aside; /* space we can't use */
 	uint			m_ag_max_usable; /* max space per AG */
diff --git a/fs/xfs/xfs_ondisk.h b/fs/xfs/xfs_ondisk.h
index 48d544f..3742216 100644
--- a/fs/xfs/xfs_ondisk.h
+++ b/fs/xfs/xfs_ondisk.h
@@ -47,6 +47,8 @@ xfs_check_ondisk_structs(void)
 	XFS_CHECK_STRUCT_SIZE(struct xfs_dsymlink_hdr,		56);
 	XFS_CHECK_STRUCT_SIZE(struct xfs_inobt_key,		4);
 	XFS_CHECK_STRUCT_SIZE(struct xfs_inobt_rec,		16);
+	XFS_CHECK_STRUCT_SIZE(struct xfs_refcount_key,		4);
+	XFS_CHECK_STRUCT_SIZE(struct xfs_refcount_rec,		12);
 	XFS_CHECK_STRUCT_SIZE(struct xfs_rmap_key,		20);
 	XFS_CHECK_STRUCT_SIZE(struct xfs_rmap_rec,		24);
 	XFS_CHECK_STRUCT_SIZE(struct xfs_timestamp,		8);
@@ -54,6 +56,7 @@ xfs_check_ondisk_structs(void)
 	XFS_CHECK_STRUCT_SIZE(xfs_alloc_ptr_t,			4);
 	XFS_CHECK_STRUCT_SIZE(xfs_alloc_rec_t,			8);
 	XFS_CHECK_STRUCT_SIZE(xfs_inobt_ptr_t,			4);
+	XFS_CHECK_STRUCT_SIZE(xfs_refcount_ptr_t,		4);
 	XFS_CHECK_STRUCT_SIZE(xfs_rmap_ptr_t,			4);
 
 	/* dir/attr trees */
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 6ed7cbf..67ce2d8 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -39,16 +39,7 @@ struct xfs_buf_log_format;
 struct xfs_inode_log_format;
 struct xfs_bmbt_irec;
 struct xfs_btree_cur;
-
-#ifndef XFS_REFCOUNT_IREC_PLACEHOLDER
-#define XFS_REFCOUNT_IREC_PLACEHOLDER
-/* Placeholder definition to avoid breaking bisectability. */
-struct xfs_refcount_irec {
-	xfs_agblock_t	rc_startblock;	/* starting block number */
-	xfs_extlen_t	rc_blockcount;	/* count of free blocks */
-	xfs_nlink_t	rc_refcount;	/* number of inodes linked here */
-};
-#endif
+struct xfs_refcount_irec;
 
 DECLARE_EVENT_CLASS(xfs_attr_list_class,
 	TP_PROTO(struct xfs_attr_list_context *ctx),


^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 058/119] xfs: add refcount btree support to growfs
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (56 preceding siblings ...)
  2016-06-17  1:23 ` [PATCH 057/119] xfs: define the on-disk refcount btree format Darrick J. Wong
@ 2016-06-17  1:24 ` Darrick J. Wong
  2016-06-17  1:24 ` [PATCH 059/119] xfs: account for the refcount btree in the alloc/free log reservation Darrick J. Wong
                   ` (60 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:24 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Modify the growfs code to initialize new refcount btree blocks.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_fsops.c |   38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)


diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index cd4de75..3c1ded1 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -257,6 +257,11 @@ xfs_growfs_data_private(
 		agf->agf_longest = cpu_to_be32(tmpsize);
 		if (xfs_sb_version_hascrc(&mp->m_sb))
 			uuid_copy(&agf->agf_uuid, &mp->m_sb.sb_meta_uuid);
+		if (xfs_sb_version_hasreflink(&mp->m_sb)) {
+			agf->agf_refcount_root = cpu_to_be32(
+					xfs_refc_block(mp));
+			agf->agf_refcount_level = cpu_to_be32(1);
+		}
 
 		error = xfs_bwrite(bp);
 		xfs_buf_relse(bp);
@@ -448,6 +453,17 @@ xfs_growfs_data_private(
 			rrec->rm_offset = 0;
 			be16_add_cpu(&block->bb_numrecs, 1);
 
+			/* account for refc btree root */
+			if (xfs_sb_version_hasreflink(&mp->m_sb)) {
+				rrec = XFS_RMAP_REC_ADDR(block, 5);
+				rrec->rm_startblock = cpu_to_be32(
+						xfs_refc_block(mp));
+				rrec->rm_blockcount = cpu_to_be32(1);
+				rrec->rm_owner = cpu_to_be64(XFS_RMAP_OWN_REFC);
+				rrec->rm_offset = 0;
+				be16_add_cpu(&block->bb_numrecs, 1);
+			}
+
 			error = xfs_bwrite(bp);
 			xfs_buf_relse(bp);
 			if (error)
@@ -505,6 +521,28 @@ xfs_growfs_data_private(
 				goto error0;
 		}
 
+		/*
+		 * refcount btree root block
+		 */
+		if (xfs_sb_version_hasreflink(&mp->m_sb)) {
+			bp = xfs_growfs_get_hdr_buf(mp,
+				XFS_AGB_TO_DADDR(mp, agno, xfs_refc_block(mp)),
+				BTOBB(mp->m_sb.sb_blocksize), 0,
+				&xfs_refcountbt_buf_ops);
+			if (!bp) {
+				error = -ENOMEM;
+				goto error0;
+			}
+
+			xfs_btree_init_block(mp, bp, XFS_REFC_CRC_MAGIC,
+					     0, 0, agno,
+					     XFS_BTREE_CRC_BLOCKS);
+
+			error = xfs_bwrite(bp);
+			xfs_buf_relse(bp);
+			if (error)
+				goto error0;
+		}
 	}
 	xfs_trans_agblocks_delta(tp, nfree);
 	/*



* [PATCH 059/119] xfs: account for the refcount btree in the alloc/free log reservation
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (57 preceding siblings ...)
  2016-06-17  1:24 ` [PATCH 058/119] xfs: add refcount btree support to growfs Darrick J. Wong
@ 2016-06-17  1:24 ` Darrick J. Wong
  2016-06-17  1:24 ` [PATCH 060/119] xfs: add refcount btree operations Darrick J. Wong
                   ` (59 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:24 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, Christoph Hellwig, xfs

Every time we allocate or free an extent, we might need to split the
refcount btree.  Reserve some blocks in the transaction to handle
this possibility.

(Reproduced by generic/167 over NFS atop XFS)

Signed-off-by: Christoph Hellwig <hch@lst.de>
[darrick.wong@oracle.com: add commit message]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_trans_resv.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
index 7c840e1..a59838f 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.c
+++ b/fs/xfs/libxfs/xfs_trans_resv.c
@@ -67,7 +67,8 @@ xfs_calc_buf_res(
  * Per-extent log reservation for the btree changes involved in freeing or
 * allocating an extent.  In classic XFS there are two trees that will be
  * modified (bnobt + cntbt).  With rmap enabled, there are three trees
- * (rmapbt).  The number of blocks reserved is based on the formula:
+ * (rmapbt).  With reflink, there are four trees (refcountbt).  The number of
+ * blocks reserved is based on the formula:
  *
  * num trees * ((2 blocks/level * max depth) - 1)
  *
@@ -83,6 +84,8 @@ xfs_allocfree_log_count(
 	blocks = num_ops * 2 * (2 * mp->m_ag_maxlevels - 1);
 	if (xfs_sb_version_hasrmapbt(&mp->m_sb))
 		blocks += num_ops * (2 * mp->m_rmap_maxlevels - 1);
+	if (xfs_sb_version_hasreflink(&mp->m_sb))
+		blocks += num_ops * (2 * mp->m_refc_maxlevels - 1);
 
 	return blocks;
 }



* [PATCH 060/119] xfs: add refcount btree operations
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (58 preceding siblings ...)
  2016-06-17  1:24 ` [PATCH 059/119] xfs: account for the refcount btree in the alloc/free log reservation Darrick J. Wong
@ 2016-06-17  1:24 ` Darrick J. Wong
  2016-06-17  1:24 ` [PATCH 061/119] xfs: create refcount update intent log items Darrick J. Wong
                   ` (58 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:24 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, Christoph Hellwig, xfs

Implement the generic btree operations required to manipulate refcount
btree blocks.  The implementation is similar to the bmapbt, though it
will only allocate and free blocks from the AG.

v2: Remove init_rec_from_key since we no longer need it, and add
tracepoints when refcount btree operations fail.

Since the refcount root and level fields are separate from the
existing roots and levels array, they need a separate logging flag.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
[hch: fix logging of AGF refcount btree fields]
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/Makefile                    |    1 
 fs/xfs/libxfs/xfs_alloc.c          |    4 +
 fs/xfs/libxfs/xfs_format.h         |    5 +
 fs/xfs/libxfs/xfs_refcount.c       |  177 ++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_refcount.h       |   30 +++++
 fs/xfs/libxfs/xfs_refcount_btree.c |  197 ++++++++++++++++++++++++++++++++++++
 6 files changed, 413 insertions(+), 1 deletion(-)
 create mode 100644 fs/xfs/libxfs/xfs_refcount.c
 create mode 100644 fs/xfs/libxfs/xfs_refcount.h


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 3f579af..e0c3067 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -55,6 +55,7 @@ xfs-y				+= $(addprefix libxfs/, \
 				   xfs_ag_resv.o \
 				   xfs_rmap.o \
 				   xfs_rmap_btree.o \
+				   xfs_refcount.o \
 				   xfs_refcount_btree.o \
 				   xfs_sb.o \
 				   xfs_symlink_remote.o \
diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 14f8a69d..be5d0df 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2326,6 +2326,10 @@ xfs_alloc_log_agf(
 		offsetof(xfs_agf_t, agf_longest),
 		offsetof(xfs_agf_t, agf_btreeblks),
 		offsetof(xfs_agf_t, agf_uuid),
+		offsetof(xfs_agf_t, agf_refcount_root),
+		offsetof(xfs_agf_t, agf_refcount_level),
+		/* needed so that we don't log the whole rest of the structure: */
+		offsetof(xfs_agf_t, agf_spare64),
 		sizeof(xfs_agf_t)
 	};
 
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 3b0cae2..45bbdad 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -680,7 +680,10 @@ typedef struct xfs_agf {
 #define	XFS_AGF_LONGEST		0x00000400
 #define	XFS_AGF_BTREEBLKS	0x00000800
 #define	XFS_AGF_UUID		0x00001000
-#define	XFS_AGF_NUM_BITS	13
+#define	XFS_AGF_REFCOUNT_ROOT	0x00002000
+#define	XFS_AGF_REFCOUNT_LEVEL	0x00004000
+#define	XFS_AGF_SPARE64		0x00008000
+#define	XFS_AGF_NUM_BITS	16
 #define	XFS_AGF_ALL_BITS	((1 << XFS_AGF_NUM_BITS) - 1)
 
 #define XFS_AGF_FLAGS \
diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
new file mode 100644
index 0000000..4d483b5
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_refcount.c
@@ -0,0 +1,177 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_sb.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bmap.h"
+#include "xfs_refcount_btree.h"
+#include "xfs_alloc.h"
+#include "xfs_error.h"
+#include "xfs_trace.h"
+#include "xfs_cksum.h"
+#include "xfs_trans.h"
+#include "xfs_bit.h"
+#include "xfs_refcount.h"
+
+/*
+ * Look up the first record less than or equal to [bno, len] in the btree
+ * given by cur.
+ */
+int
+xfs_refcountbt_lookup_le(
+	struct xfs_btree_cur	*cur,
+	xfs_agblock_t		bno,
+	int			*stat)
+{
+	trace_xfs_refcountbt_lookup(cur->bc_mp, cur->bc_private.a.agno, bno,
+			XFS_LOOKUP_LE);
+	cur->bc_rec.rc.rc_startblock = bno;
+	cur->bc_rec.rc.rc_blockcount = 0;
+	return xfs_btree_lookup(cur, XFS_LOOKUP_LE, stat);
+}
+
+/*
+ * Look up the first record greater than or equal to [bno, len] in the btree
+ * given by cur.
+ */
+int
+xfs_refcountbt_lookup_ge(
+	struct xfs_btree_cur	*cur,
+	xfs_agblock_t		bno,
+	int			*stat)
+{
+	trace_xfs_refcountbt_lookup(cur->bc_mp, cur->bc_private.a.agno, bno,
+			XFS_LOOKUP_GE);
+	cur->bc_rec.rc.rc_startblock = bno;
+	cur->bc_rec.rc.rc_blockcount = 0;
+	return xfs_btree_lookup(cur, XFS_LOOKUP_GE, stat);
+}
+
+/*
+ * Get the data from the pointed-to record.
+ */
+int
+xfs_refcountbt_get_rec(
+	struct xfs_btree_cur		*cur,
+	struct xfs_refcount_irec	*irec,
+	int				*stat)
+{
+	union xfs_btree_rec	*rec;
+	int			error;
+
+	error = xfs_btree_get_rec(cur, &rec, stat);
+	if (!error && *stat == 1) {
+		irec->rc_startblock = be32_to_cpu(rec->refc.rc_startblock);
+		irec->rc_blockcount = be32_to_cpu(rec->refc.rc_blockcount);
+		irec->rc_refcount = be32_to_cpu(rec->refc.rc_refcount);
+		trace_xfs_refcountbt_get(cur->bc_mp, cur->bc_private.a.agno,
+				irec);
+	}
+	return error;
+}
+
+/*
+ * Update the record referred to by cur to the value given
+ * by [bno, len, refcount].
+ * This either works (return 0) or gets an EFSCORRUPTED error.
+ */
+STATIC int
+xfs_refcountbt_update(
+	struct xfs_btree_cur		*cur,
+	struct xfs_refcount_irec	*irec)
+{
+	union xfs_btree_rec	rec;
+	int			error;
+
+	trace_xfs_refcountbt_update(cur->bc_mp, cur->bc_private.a.agno, irec);
+	rec.refc.rc_startblock = cpu_to_be32(irec->rc_startblock);
+	rec.refc.rc_blockcount = cpu_to_be32(irec->rc_blockcount);
+	rec.refc.rc_refcount = cpu_to_be32(irec->rc_refcount);
+	error = xfs_btree_update(cur, &rec);
+	if (error)
+		trace_xfs_refcountbt_update_error(cur->bc_mp,
+				cur->bc_private.a.agno, error, _RET_IP_);
+	return error;
+}
+
+/*
+ * Insert the record referred to by cur to the value given
+ * by [bno, len, refcount].
+ * This either works (return 0) or gets an EFSCORRUPTED error.
+ */
+STATIC int
+xfs_refcountbt_insert(
+	struct xfs_btree_cur		*cur,
+	struct xfs_refcount_irec	*irec,
+	int				*i)
+{
+	int				error;
+
+	trace_xfs_refcountbt_insert(cur->bc_mp, cur->bc_private.a.agno, irec);
+	cur->bc_rec.rc.rc_startblock = irec->rc_startblock;
+	cur->bc_rec.rc.rc_blockcount = irec->rc_blockcount;
+	cur->bc_rec.rc.rc_refcount = irec->rc_refcount;
+	error = xfs_btree_insert(cur, i);
+	XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, *i == 1, out_error);
+out_error:
+	if (error)
+		trace_xfs_refcountbt_insert_error(cur->bc_mp,
+				cur->bc_private.a.agno, error, _RET_IP_);
+	return error;
+}
+
+/*
+ * Remove the record referred to by cur, then set the pointer to the spot
+ * where the record could be re-inserted, in case we want to increment or
+ * decrement the cursor.
+ * This either works (return 0) or gets an EFSCORRUPTED error.
+ */
+STATIC int
+xfs_refcountbt_delete(
+	struct xfs_btree_cur	*cur,
+	int			*i)
+{
+	struct xfs_refcount_irec	irec;
+	int			found_rec;
+	int			error;
+
+	error = xfs_refcountbt_get_rec(cur, &irec, &found_rec);
+	if (error)
+		goto out_error;
+	XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1, out_error);
+	trace_xfs_refcountbt_delete(cur->bc_mp, cur->bc_private.a.agno, &irec);
+	error = xfs_btree_delete(cur, i);
+	XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, *i == 1, out_error);
+	if (error)
+		goto out_error;
+	error = xfs_refcountbt_lookup_ge(cur, irec.rc_startblock, &found_rec);
+out_error:
+	if (error)
+		trace_xfs_refcountbt_delete_error(cur->bc_mp,
+				cur->bc_private.a.agno, error, _RET_IP_);
+	return error;
+}
diff --git a/fs/xfs/libxfs/xfs_refcount.h b/fs/xfs/libxfs/xfs_refcount.h
new file mode 100644
index 0000000..8ea65c6
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_refcount.h
@@ -0,0 +1,30 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#ifndef __XFS_REFCOUNT_H__
+#define __XFS_REFCOUNT_H__
+
+extern int xfs_refcountbt_lookup_le(struct xfs_btree_cur *cur,
+		xfs_agblock_t bno, int *stat);
+extern int xfs_refcountbt_lookup_ge(struct xfs_btree_cur *cur,
+		xfs_agblock_t bno, int *stat);
+extern int xfs_refcountbt_get_rec(struct xfs_btree_cur *cur,
+		struct xfs_refcount_irec *irec, int *stat);
+
+#endif	/* __XFS_REFCOUNT_H__ */
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
index 359cf0c..7093c71 100644
--- a/fs/xfs/libxfs/xfs_refcount_btree.c
+++ b/fs/xfs/libxfs/xfs_refcount_btree.c
@@ -44,6 +44,153 @@ xfs_refcountbt_dup_cursor(
 			cur->bc_private.a.dfops);
 }
 
+STATIC void
+xfs_refcountbt_set_root(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_ptr	*ptr,
+	int			inc)
+{
+	struct xfs_buf		*agbp = cur->bc_private.a.agbp;
+	struct xfs_agf		*agf = XFS_BUF_TO_AGF(agbp);
+	xfs_agnumber_t		seqno = be32_to_cpu(agf->agf_seqno);
+	struct xfs_perag	*pag = xfs_perag_get(cur->bc_mp, seqno);
+
+	ASSERT(ptr->s != 0);
+
+	agf->agf_refcount_root = ptr->s;
+	be32_add_cpu(&agf->agf_refcount_level, inc);
+	pag->pagf_refcount_level += inc;
+	xfs_perag_put(pag);
+
+	xfs_alloc_log_agf(cur->bc_tp, agbp,
+			XFS_AGF_REFCOUNT_ROOT | XFS_AGF_REFCOUNT_LEVEL);
+}
+
+STATIC int
+xfs_refcountbt_alloc_block(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_ptr	*start,
+	union xfs_btree_ptr	*new,
+	int			*stat)
+{
+	struct xfs_alloc_arg	args;		/* block allocation args */
+	int			error;		/* error return value */
+
+	memset(&args, 0, sizeof(args));
+	args.tp = cur->bc_tp;
+	args.mp = cur->bc_mp;
+	args.type = XFS_ALLOCTYPE_NEAR_BNO;
+	args.fsbno = XFS_AGB_TO_FSB(cur->bc_mp, cur->bc_private.a.agno,
+			xfs_refc_block(args.mp));
+	args.firstblock = args.fsbno;
+	xfs_rmap_ag_owner(&args.oinfo, XFS_RMAP_OWN_REFC);
+	args.minlen = args.maxlen = args.prod = 1;
+
+	error = xfs_alloc_vextent(&args);
+	if (error)
+		goto out_error;
+	trace_xfs_refcountbt_alloc_block(cur->bc_mp, cur->bc_private.a.agno,
+			args.agbno, 1);
+	if (args.fsbno == NULLFSBLOCK) {
+		XFS_BTREE_TRACE_CURSOR(cur, XBT_EXIT);
+		*stat = 0;
+		return 0;
+	}
+	ASSERT(args.agno == cur->bc_private.a.agno);
+	ASSERT(args.len == 1);
+
+	new->s = cpu_to_be32(args.agbno);
+
+	XFS_BTREE_TRACE_CURSOR(cur, XBT_EXIT);
+	*stat = 1;
+	return 0;
+
+out_error:
+	XFS_BTREE_TRACE_CURSOR(cur, XBT_ERROR);
+	return error;
+}
+
+STATIC int
+xfs_refcountbt_free_block(
+	struct xfs_btree_cur	*cur,
+	struct xfs_buf		*bp)
+{
+	struct xfs_mount	*mp = cur->bc_mp;
+	struct xfs_trans	*tp = cur->bc_tp;
+	xfs_fsblock_t		fsbno = XFS_DADDR_TO_FSB(mp, XFS_BUF_ADDR(bp));
+	struct xfs_owner_info	oinfo;
+
+	trace_xfs_refcountbt_free_block(cur->bc_mp, cur->bc_private.a.agno,
+			XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno), 1);
+	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_REFC);
+	xfs_bmap_add_free(mp, cur->bc_private.a.dfops, fsbno, 1,
+			&oinfo);
+	xfs_trans_binval(tp, bp);
+	return 0;
+}
+
+STATIC int
+xfs_refcountbt_get_minrecs(
+	struct xfs_btree_cur	*cur,
+	int			level)
+{
+	return cur->bc_mp->m_refc_mnr[level != 0];
+}
+
+STATIC int
+xfs_refcountbt_get_maxrecs(
+	struct xfs_btree_cur	*cur,
+	int			level)
+{
+	return cur->bc_mp->m_refc_mxr[level != 0];
+}
+
+STATIC void
+xfs_refcountbt_init_key_from_rec(
+	union xfs_btree_key	*key,
+	union xfs_btree_rec	*rec)
+{
+	ASSERT(rec->refc.rc_startblock != 0);
+
+	key->refc.rc_startblock = rec->refc.rc_startblock;
+}
+
+STATIC void
+xfs_refcountbt_init_rec_from_cur(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_rec	*rec)
+{
+	ASSERT(cur->bc_rec.rc.rc_startblock != 0);
+
+	rec->refc.rc_startblock = cpu_to_be32(cur->bc_rec.rc.rc_startblock);
+	rec->refc.rc_blockcount = cpu_to_be32(cur->bc_rec.rc.rc_blockcount);
+	rec->refc.rc_refcount = cpu_to_be32(cur->bc_rec.rc.rc_refcount);
+}
+
+STATIC void
+xfs_refcountbt_init_ptr_from_cur(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_ptr	*ptr)
+{
+	struct xfs_agf		*agf = XFS_BUF_TO_AGF(cur->bc_private.a.agbp);
+
+	ASSERT(cur->bc_private.a.agno == be32_to_cpu(agf->agf_seqno));
+	ASSERT(agf->agf_refcount_root != 0);
+
+	ptr->s = agf->agf_refcount_root;
+}
+
+STATIC __int64_t
+xfs_refcountbt_key_diff(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_key	*key)
+{
+	struct xfs_refcount_irec	*rec = &cur->bc_rec.rc;
+	struct xfs_refcount_key		*kp = &key->refc;
+
+	return (__int64_t)be32_to_cpu(kp->rc_startblock) - rec->rc_startblock;
+}
+
 STATIC bool
 xfs_refcountbt_verify(
 	struct xfs_buf		*bp)
@@ -106,12 +253,62 @@ const struct xfs_buf_ops xfs_refcountbt_buf_ops = {
 	.verify_write		= xfs_refcountbt_write_verify,
 };
 
+#if defined(DEBUG) || defined(XFS_WARN)
+STATIC int
+xfs_refcountbt_keys_inorder(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_key	*k1,
+	union xfs_btree_key	*k2)
+{
+	return be32_to_cpu(k1->refc.rc_startblock) <
+	       be32_to_cpu(k2->refc.rc_startblock);
+}
+
+STATIC int
+xfs_refcountbt_recs_inorder(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_rec	*r1,
+	union xfs_btree_rec	*r2)
+{
+	struct xfs_refcount_irec	a, b;
+
+	int ret = be32_to_cpu(r1->refc.rc_startblock) +
+		be32_to_cpu(r1->refc.rc_blockcount) <=
+		be32_to_cpu(r2->refc.rc_startblock);
+	if (!ret) {
+		a.rc_startblock = be32_to_cpu(r1->refc.rc_startblock);
+		a.rc_blockcount = be32_to_cpu(r1->refc.rc_blockcount);
+		a.rc_refcount = be32_to_cpu(r1->refc.rc_refcount);
+		b.rc_startblock = be32_to_cpu(r2->refc.rc_startblock);
+		b.rc_blockcount = be32_to_cpu(r2->refc.rc_blockcount);
+		b.rc_refcount = be32_to_cpu(r2->refc.rc_refcount);
+		trace_xfs_refcount_rec_order_error(cur->bc_mp,
+				cur->bc_private.a.agno, &a, &b);
+	}
+
+	return ret;
+}
+#endif	/* DEBUG || XFS_WARN */
+
 static const struct xfs_btree_ops xfs_refcountbt_ops = {
 	.rec_len		= sizeof(struct xfs_refcount_rec),
 	.key_len		= sizeof(struct xfs_refcount_key),
 
 	.dup_cursor		= xfs_refcountbt_dup_cursor,
+	.set_root		= xfs_refcountbt_set_root,
+	.alloc_block		= xfs_refcountbt_alloc_block,
+	.free_block		= xfs_refcountbt_free_block,
+	.get_minrecs		= xfs_refcountbt_get_minrecs,
+	.get_maxrecs		= xfs_refcountbt_get_maxrecs,
+	.init_key_from_rec	= xfs_refcountbt_init_key_from_rec,
+	.init_rec_from_cur	= xfs_refcountbt_init_rec_from_cur,
+	.init_ptr_from_cur	= xfs_refcountbt_init_ptr_from_cur,
+	.key_diff		= xfs_refcountbt_key_diff,
 	.buf_ops		= &xfs_refcountbt_buf_ops,
+#if defined(DEBUG) || defined(XFS_WARN)
+	.keys_inorder		= xfs_refcountbt_keys_inorder,
+	.recs_inorder		= xfs_refcountbt_recs_inorder,
+#endif
 };
 
 /*



* [PATCH 061/119] xfs: create refcount update intent log items
@ 2016-06-17  1:24 ` Darrick J. Wong
From: Darrick J. Wong @ 2016-06-17  1:24 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Create refcount update intent/done log items to record redo
information in the log.  Because we need to roll transactions between
updating the bmbt mapping and updating the refcount btree, we also
have to track the status of the metadata updates that will be recorded
in the post-roll transactions, just in case we crash before committing
the final transaction.  This mechanism enables log recovery to finish
what was already started.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile                |    1 
 fs/xfs/libxfs/xfs_log_format.h |   52 ++++-
 fs/xfs/xfs_refcount_item.c     |  459 ++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_refcount_item.h     |  100 +++++++++
 fs/xfs/xfs_super.c             |   21 ++
 5 files changed, 631 insertions(+), 2 deletions(-)
 create mode 100644 fs/xfs/xfs_refcount_item.c
 create mode 100644 fs/xfs/xfs_refcount_item.h


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index e0c3067..322c386 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -107,6 +107,7 @@ xfs-y				+= xfs_log.o \
 				   xfs_extfree_item.o \
 				   xfs_icreate_item.o \
 				   xfs_inode_item.o \
+				   xfs_refcount_item.o \
 				   xfs_rmap_item.o \
 				   xfs_log_recover.o \
 				   xfs_trans_ail.o \
diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
index b9627b7..1dfa02c 100644
--- a/fs/xfs/libxfs/xfs_log_format.h
+++ b/fs/xfs/libxfs/xfs_log_format.h
@@ -112,7 +112,9 @@ static inline uint xlog_get_cycle(char *ptr)
 #define XLOG_REG_TYPE_ICREATE		20
 #define XLOG_REG_TYPE_RUI_FORMAT	21
 #define XLOG_REG_TYPE_RUD_FORMAT	22
-#define XLOG_REG_TYPE_MAX		22
+#define XLOG_REG_TYPE_CUI_FORMAT	23
+#define XLOG_REG_TYPE_CUD_FORMAT	24
+#define XLOG_REG_TYPE_MAX		24
 
 /*
  * Flags to log operation header
@@ -231,6 +233,8 @@ typedef struct xfs_trans_header {
 #define	XFS_LI_ICREATE		0x123f
 #define	XFS_LI_RUI		0x1240	/* rmap update intent */
 #define	XFS_LI_RUD		0x1241
+#define	XFS_LI_CUI		0x1242	/* refcount update intent */
+#define	XFS_LI_CUD		0x1243
 
 #define XFS_LI_TYPE_DESC \
 	{ XFS_LI_EFI,		"XFS_LI_EFI" }, \
@@ -242,7 +246,9 @@ typedef struct xfs_trans_header {
 	{ XFS_LI_QUOTAOFF,	"XFS_LI_QUOTAOFF" }, \
 	{ XFS_LI_ICREATE,	"XFS_LI_ICREATE" }, \
 	{ XFS_LI_RUI,		"XFS_LI_RUI" }, \
-	{ XFS_LI_RUD,		"XFS_LI_RUD" }
+	{ XFS_LI_RUD,		"XFS_LI_RUD" }, \
+	{ XFS_LI_CUI,		"XFS_LI_CUI" }, \
+	{ XFS_LI_CUD,		"XFS_LI_CUD" }
 
 /*
  * Inode Log Item Format definitions.
@@ -667,6 +673,48 @@ struct xfs_rud_log_format {
 };
 
 /*
+ * CUI/CUD (refcount update) log format definitions
+ */
+struct xfs_phys_extent {
+	__uint64_t		pe_startblock;
+	__uint32_t		pe_len;
+	__uint32_t		pe_flags;
+};
+
+/* refcount pe_flags: upper bits are flags, lower byte is type code */
+#define XFS_REFCOUNT_EXTENT_INCREASE	1
+#define XFS_REFCOUNT_EXTENT_DECREASE	2
+#define XFS_REFCOUNT_EXTENT_ALLOC_COW	3
+#define XFS_REFCOUNT_EXTENT_FREE_COW	4
+#define XFS_REFCOUNT_EXTENT_TYPE_MASK	0xFF
+
+/*
+ * This is the structure used to lay out a cui log item in the
+ * log.  The cui_extents field is a variable size array whose
+ * size is given by cui_nextents.
+ */
+struct xfs_cui_log_format {
+	__uint16_t		cui_type;	/* cui log item type */
+	__uint16_t		cui_size;	/* size of this item */
+	__uint32_t		cui_nextents;	/* # extents to update */
+	__uint64_t		cui_id;		/* cui identifier */
+	struct xfs_phys_extent	cui_extents[1];	/* array of extents */
+};
+
+/*
+ * This is the structure used to lay out a cud log item in the
+ * log.  The cud_extents array is a variable size array whose
+ * size is given by cud_nextents.
+ */
+struct xfs_cud_log_format {
+	__uint16_t		cud_type;	/* cud log item type */
+	__uint16_t		cud_size;	/* size of this item */
+	__uint32_t		cud_nextents;	/* # of extents updated */
+	__uint64_t		cud_cui_id;	/* id of corresponding cui */
+	struct xfs_phys_extent	cud_extents[1];	/* array of extents */
+};
+
+/*
  * Dquot Log format definitions.
  *
  * The first two fields must be the type and size fitting into
diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
new file mode 100644
index 0000000..681d068
--- /dev/null
+++ b/fs/xfs/xfs_refcount_item.c
@@ -0,0 +1,459 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_trans.h"
+#include "xfs_trans_priv.h"
+#include "xfs_buf_item.h"
+#include "xfs_refcount_item.h"
+#include "xfs_log.h"
+
+
+kmem_zone_t	*xfs_cui_zone;
+kmem_zone_t	*xfs_cud_zone;
+
+static inline struct xfs_cui_log_item *CUI_ITEM(struct xfs_log_item *lip)
+{
+	return container_of(lip, struct xfs_cui_log_item, cui_item);
+}
+
+void
+xfs_cui_item_free(
+	struct xfs_cui_log_item	*cuip)
+{
+	if (cuip->cui_format.cui_nextents > XFS_CUI_MAX_FAST_EXTENTS)
+		kmem_free(cuip);
+	else
+		kmem_zone_free(xfs_cui_zone, cuip);
+}
+
+/*
+ * This returns the number of iovecs needed to log the given cui item.
+ * We only need 1 iovec for a cui item.  It just logs the cui_log_format
+ * structure.
+ */
+static inline int
+xfs_cui_item_sizeof(
+	struct xfs_cui_log_item *cuip)
+{
+	return sizeof(struct xfs_cui_log_format) +
+			(cuip->cui_format.cui_nextents - 1) *
+			sizeof(struct xfs_phys_extent);
+}
+
+STATIC void
+xfs_cui_item_size(
+	struct xfs_log_item	*lip,
+	int			*nvecs,
+	int			*nbytes)
+{
+	*nvecs += 1;
+	*nbytes += xfs_cui_item_sizeof(CUI_ITEM(lip));
+}
+
+/*
+ * This is called to fill in the vector of log iovecs for the
+ * given cui log item. We use only 1 iovec, and we point that
+ * at the cui_log_format structure embedded in the cui item.
+ * It is at this point that we assert that all of the extent
+ * slots in the cui item have been filled.
+ */
+STATIC void
+xfs_cui_item_format(
+	struct xfs_log_item	*lip,
+	struct xfs_log_vec	*lv)
+{
+	struct xfs_cui_log_item	*cuip = CUI_ITEM(lip);
+	struct xfs_log_iovec	*vecp = NULL;
+
+	ASSERT(atomic_read(&cuip->cui_next_extent) ==
+			cuip->cui_format.cui_nextents);
+
+	cuip->cui_format.cui_type = XFS_LI_CUI;
+	cuip->cui_format.cui_size = 1;
+
+	xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_CUI_FORMAT, &cuip->cui_format,
+			xfs_cui_item_sizeof(cuip));
+}
+
+/*
+ * Pinning has no meaning for a cui item, so just return.
+ */
+STATIC void
+xfs_cui_item_pin(
+	struct xfs_log_item	*lip)
+{
+}
+
+/*
+ * The unpin operation is the last place a CUI is manipulated in the log. It is
+ * either inserted in the AIL or aborted in the event of a log I/O error. In
+ * either case, the CUI transaction has been successfully committed to make it
+ * this far. Therefore, we expect whoever committed the CUI to either construct
+ * and commit the CUD or drop the CUD's reference in the event of error. Simply
+ * drop the log's CUI reference now that the log is done with it.
+ */
+STATIC void
+xfs_cui_item_unpin(
+	struct xfs_log_item	*lip,
+	int			remove)
+{
+	struct xfs_cui_log_item	*cuip = CUI_ITEM(lip);
+
+	xfs_cui_release(cuip);
+}
+
+/*
+ * CUI items have no locking or pushing.  However, since CUIs are pulled from
+ * the AIL when their corresponding CUDs are committed to disk, their situation
+ * is very similar to being pinned.  Return XFS_ITEM_PINNED so that the caller
+ * will eventually flush the log.  This should help in getting the CUI out of
+ * the AIL.
+ */
+STATIC uint
+xfs_cui_item_push(
+	struct xfs_log_item	*lip,
+	struct list_head	*buffer_list)
+{
+	return XFS_ITEM_PINNED;
+}
+
+/*
+ * The CUI has been either committed or aborted if the transaction has been
+ * cancelled. If the transaction was cancelled, a CUD isn't going to be
+ * constructed and thus we free the CUI here directly.
+ */
+STATIC void
+xfs_cui_item_unlock(
+	struct xfs_log_item	*lip)
+{
+	if (lip->li_flags & XFS_LI_ABORTED)
+		xfs_cui_item_free(CUI_ITEM(lip));
+}
+
+/*
+ * The CUI is logged only once and cannot be moved in the log, so simply return
+ * the lsn at which it's been logged.
+ */
+STATIC xfs_lsn_t
+xfs_cui_item_committed(
+	struct xfs_log_item	*lip,
+	xfs_lsn_t		lsn)
+{
+	return lsn;
+}
+
+/*
+ * The CUI dependency tracking op doesn't do squat.  It can't because
+ * it doesn't know where the refcount update is coming from.  The
+ * dependency tracking has to be handled by the "enclosing" metadata
+ * object.  For example, for inodes, the inode is locked throughout the
+ * refcount update so the dependency should be recorded there.
+ */
+STATIC void
+xfs_cui_item_committing(
+	struct xfs_log_item	*lip,
+	xfs_lsn_t		lsn)
+{
+}
+
+/*
+ * This is the ops vector shared by all cui log items.
+ */
+static const struct xfs_item_ops xfs_cui_item_ops = {
+	.iop_size	= xfs_cui_item_size,
+	.iop_format	= xfs_cui_item_format,
+	.iop_pin	= xfs_cui_item_pin,
+	.iop_unpin	= xfs_cui_item_unpin,
+	.iop_unlock	= xfs_cui_item_unlock,
+	.iop_committed	= xfs_cui_item_committed,
+	.iop_push	= xfs_cui_item_push,
+	.iop_committing = xfs_cui_item_committing,
+};
+
+/*
+ * Allocate and initialize a cui item with the given number of extents.
+ */
+struct xfs_cui_log_item *
+xfs_cui_init(
+	struct xfs_mount		*mp,
+	uint				nextents)
+{
+	struct xfs_cui_log_item		*cuip;
+	uint				size;
+
+	ASSERT(nextents > 0);
+	if (nextents > XFS_CUI_MAX_FAST_EXTENTS) {
+		size = (uint)(sizeof(struct xfs_cui_log_item) +
+			((nextents - 1) * sizeof(struct xfs_phys_extent)));
+		cuip = kmem_zalloc(size, KM_SLEEP);
+	} else {
+		cuip = kmem_zone_zalloc(xfs_cui_zone, KM_SLEEP);
+	}
+
+	xfs_log_item_init(mp, &cuip->cui_item, XFS_LI_CUI, &xfs_cui_item_ops);
+	cuip->cui_format.cui_nextents = nextents;
+	cuip->cui_format.cui_id = (uintptr_t)(void *)cuip;
+	atomic_set(&cuip->cui_next_extent, 0);
+	atomic_set(&cuip->cui_refcount, 2);
+
+	return cuip;
+}
+
+/*
+ * Copy a CUI format buffer from the given buf, and into the destination
+ * CUI format structure.  The CUI/CUD items were designed not to need any
+ * special alignment handling.
+ */
+int
+xfs_cui_copy_format(
+	struct xfs_log_iovec		*buf,
+	struct xfs_cui_log_format	*dst_cui_fmt)
+{
+	struct xfs_cui_log_format	*src_cui_fmt;
+	uint				len;
+
+	src_cui_fmt = buf->i_addr;
+	len = sizeof(struct xfs_cui_log_format) +
+			(src_cui_fmt->cui_nextents - 1) *
+			sizeof(struct xfs_phys_extent);
+
+	if (buf->i_len == len) {
+		memcpy((char *)dst_cui_fmt, (char *)src_cui_fmt, len);
+		return 0;
+	}
+	return -EFSCORRUPTED;
+}
+
+/*
+ * Freeing the CUI requires that we remove it from the AIL if it has already
+ * been placed there. However, the CUI may not yet have been placed in the AIL
+ * when called by xfs_cui_release() from CUD processing due to the ordering of
+ * committed vs unpin operations in bulk insert operations. Hence the reference
+ * count to ensure only the last caller frees the CUI.
+ */
+void
+xfs_cui_release(
+	struct xfs_cui_log_item	*cuip)
+{
+	if (atomic_dec_and_test(&cuip->cui_refcount)) {
+		xfs_trans_ail_remove(&cuip->cui_item, SHUTDOWN_LOG_IO_ERROR);
+		xfs_cui_item_free(cuip);
+	}
+}
+
+static inline struct xfs_cud_log_item *CUD_ITEM(struct xfs_log_item *lip)
+{
+	return container_of(lip, struct xfs_cud_log_item, cud_item);
+}
+
+STATIC void
+xfs_cud_item_free(struct xfs_cud_log_item *cudp)
+{
+	if (cudp->cud_format.cud_nextents > XFS_CUD_MAX_FAST_EXTENTS)
+		kmem_free(cudp);
+	else
+		kmem_zone_free(xfs_cud_zone, cudp);
+}
+
+/*
+ * This returns the number of iovecs needed to log the given cud item.
+ * We only need 1 iovec for a cud item.  It just logs the cud_log_format
+ * structure.
+ */
+static inline int
+xfs_cud_item_sizeof(
+	struct xfs_cud_log_item	*cudp)
+{
+	return sizeof(struct xfs_cud_log_format) +
+			(cudp->cud_format.cud_nextents - 1) *
+			sizeof(struct xfs_phys_extent);
+}
+
+STATIC void
+xfs_cud_item_size(
+	struct xfs_log_item	*lip,
+	int			*nvecs,
+	int			*nbytes)
+{
+	*nvecs += 1;
+	*nbytes += xfs_cud_item_sizeof(CUD_ITEM(lip));
+}
+
+/*
+ * This is called to fill in the vector of log iovecs for the
+ * given cud log item. We use only 1 iovec, and we point that
+ * at the cud_log_format structure embedded in the cud item.
+ * It is at this point that we assert that all of the extent
+ * slots in the cud item have been filled.
+ */
+STATIC void
+xfs_cud_item_format(
+	struct xfs_log_item	*lip,
+	struct xfs_log_vec	*lv)
+{
+	struct xfs_cud_log_item	*cudp = CUD_ITEM(lip);
+	struct xfs_log_iovec	*vecp = NULL;
+
+	ASSERT(cudp->cud_next_extent == cudp->cud_format.cud_nextents);
+
+	cudp->cud_format.cud_type = XFS_LI_CUD;
+	cudp->cud_format.cud_size = 1;
+
+	xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_CUD_FORMAT, &cudp->cud_format,
+			xfs_cud_item_sizeof(cudp));
+}
+
+/*
+ * Pinning has no meaning for a cud item, so just return.
+ */
+STATIC void
+xfs_cud_item_pin(
+	struct xfs_log_item	*lip)
+{
+}
+
+/*
+ * Since pinning has no meaning for a cud item, unpinning does
+ * not either.
+ */
+STATIC void
+xfs_cud_item_unpin(
+	struct xfs_log_item	*lip,
+	int			remove)
+{
+}
+
+/*
+ * There isn't much you can do to push on a cud item.  It is simply stuck
+ * waiting for the log to be flushed to disk.
+ */
+STATIC uint
+xfs_cud_item_push(
+	struct xfs_log_item	*lip,
+	struct list_head	*buffer_list)
+{
+	return XFS_ITEM_PINNED;
+}
+
+/*
+ * The CUD is either committed or aborted if the transaction is cancelled. If
+ * the transaction is cancelled, drop our reference to the CUI and free the
+ * CUD.
+ */
+STATIC void
+xfs_cud_item_unlock(
+	struct xfs_log_item	*lip)
+{
+	struct xfs_cud_log_item	*cudp = CUD_ITEM(lip);
+
+	if (lip->li_flags & XFS_LI_ABORTED) {
+		xfs_cui_release(cudp->cud_cuip);
+		xfs_cud_item_free(cudp);
+	}
+}
+
+/*
+ * When the cud item is committed to disk, all we need to do is delete our
+ * reference to our partner cui item and then free ourselves. Since we're
+ * freeing ourselves we must return -1 to keep the transaction code from
+ * further referencing this item.
+ */
+STATIC xfs_lsn_t
+xfs_cud_item_committed(
+	struct xfs_log_item	*lip,
+	xfs_lsn_t		lsn)
+{
+	struct xfs_cud_log_item	*cudp = CUD_ITEM(lip);
+
+	/*
+	 * Drop the CUI reference regardless of whether the CUD has been
+	 * aborted. Once the CUD transaction is constructed, it is the sole
+	 * responsibility of the CUD to release the CUI (even if the CUI is
+	 * aborted due to log I/O error).
+	 */
+	xfs_cui_release(cudp->cud_cuip);
+	xfs_cud_item_free(cudp);
+
+	return (xfs_lsn_t)-1;
+}
+
+/*
+ * The CUD dependency tracking op doesn't do squat.  It can't because
+ * it doesn't know where the refcount update is coming from.  The
+ * dependency tracking has to be handled by the "enclosing" metadata
+ * object.  For example, for inodes, the inode is locked throughout the
+ * refcount update so the dependency should be recorded there.
+ */
+STATIC void
+xfs_cud_item_committing(
+	struct xfs_log_item	*lip,
+	xfs_lsn_t		lsn)
+{
+}
+
+/*
+ * This is the ops vector shared by all cud log items.
+ */
+static const struct xfs_item_ops xfs_cud_item_ops = {
+	.iop_size	= xfs_cud_item_size,
+	.iop_format	= xfs_cud_item_format,
+	.iop_pin	= xfs_cud_item_pin,
+	.iop_unpin	= xfs_cud_item_unpin,
+	.iop_unlock	= xfs_cud_item_unlock,
+	.iop_committed	= xfs_cud_item_committed,
+	.iop_push	= xfs_cud_item_push,
+	.iop_committing = xfs_cud_item_committing,
+};
+
+/*
+ * Allocate and initialize a cud item with the given number of extents.
+ */
+struct xfs_cud_log_item *
+xfs_cud_init(
+	struct xfs_mount		*mp,
+	struct xfs_cui_log_item		*cuip,
+	uint				nextents)
+{
+	struct xfs_cud_log_item	*cudp;
+	uint			size;
+
+	ASSERT(nextents > 0);
+	if (nextents > XFS_CUD_MAX_FAST_EXTENTS) {
+		size = (uint)(sizeof(struct xfs_cud_log_item) +
+			((nextents - 1) * sizeof(struct xfs_phys_extent)));
+		cudp = kmem_zalloc(size, KM_SLEEP);
+	} else {
+		cudp = kmem_zone_zalloc(xfs_cud_zone, KM_SLEEP);
+	}
+
+	xfs_log_item_init(mp, &cudp->cud_item, XFS_LI_CUD, &xfs_cud_item_ops);
+	cudp->cud_cuip = cuip;
+	cudp->cud_format.cud_nextents = nextents;
+	cudp->cud_format.cud_cui_id = cuip->cui_format.cui_id;
+
+	return cudp;
+}
diff --git a/fs/xfs/xfs_refcount_item.h b/fs/xfs/xfs_refcount_item.h
new file mode 100644
index 0000000..9af4c9b
--- /dev/null
+++ b/fs/xfs/xfs_refcount_item.h
@@ -0,0 +1,100 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#ifndef	__XFS_REFCOUNT_ITEM_H__
+#define	__XFS_REFCOUNT_ITEM_H__
+
+/*
+ * There are (currently) four refcount btree redo item types: increase,
+ * decrease, CoW allocation, and CoW free.  The log items for these are
+ * CUI (refcount update intent) and CUD (refcount update done).  The
+ * redo item type is encoded in the flags field of each xfs_phys_extent.
+ *
+ * *I items should be recorded in the *first* of a series of rolled
+ * transactions, and the *D items should be recorded in the same
+ * transaction that records the associated refcountbt updates.
+ *
+ * Should the system crash after the commit of the first transaction
+ * but before the commit of the final transaction in a series, log
+ * recovery will use the redo information recorded by the intent items
+ * to replay the refcountbt metadata updates.
+ */
+
+/* kernel only CUI/CUD definitions */
+
+struct xfs_mount;
+struct kmem_zone;
+
+/*
+ * Max number of extents in fast allocation path.
+ */
+#define	XFS_CUI_MAX_FAST_EXTENTS	16
+
+/*
+ * Define CUI flag bits. Manipulated by set/clear/test_bit operators.
+ */
+#define	XFS_CUI_RECOVERED		1
+
+/*
+ * This is the "refcount update intent" log item.  It is used to log
+ * the fact that some refcount updates need to be made.  It is used in
+ * conjunction with the "refcount update done" log item described
+ * below.
+ *
+ * These log items follow the same rules as struct xfs_efi_log_item;
+ * see the comments about that structure (in xfs_extfree_item.h) for
+ * more details.
+ */
+struct xfs_cui_log_item {
+	struct xfs_log_item		cui_item;
+	atomic_t			cui_refcount;
+	atomic_t			cui_next_extent;
+	unsigned long			cui_flags;	/* misc flags */
+	struct xfs_cui_log_format	cui_format;
+};
+
+/*
+ * This is the "refcount update done" log item.  It is used to log the
+ * fact that some refcountbt updates mentioned in an earlier cui item
+ * have been performed.
+ */
+struct xfs_cud_log_item {
+	struct xfs_log_item		cud_item;
+	struct xfs_cui_log_item		*cud_cuip;
+	uint				cud_next_extent;
+	struct xfs_cud_log_format	cud_format;
+};
+
+/*
+ * Max number of extents in fast allocation path.
+ */
+#define	XFS_CUD_MAX_FAST_EXTENTS	16
+
+extern struct kmem_zone	*xfs_cui_zone;
+extern struct kmem_zone	*xfs_cud_zone;
+
+struct xfs_cui_log_item *xfs_cui_init(struct xfs_mount *, uint);
+struct xfs_cud_log_item *xfs_cud_init(struct xfs_mount *,
+		struct xfs_cui_log_item *, uint);
+int xfs_cui_copy_format(struct xfs_log_iovec *buf,
+		struct xfs_cui_log_format *dst_cui_fmt);
+void xfs_cui_item_free(struct xfs_cui_log_item *);
+void xfs_cui_release(struct xfs_cui_log_item *);
+
+#endif	/* __XFS_REFCOUNT_ITEM_H__ */
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 9328821..a0c7bdc 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -48,6 +48,7 @@
 #include "xfs_ondisk.h"
 #include "xfs_defer.h"
 #include "xfs_rmap_item.h"
+#include "xfs_refcount_item.h"
 
 #include <linux/namei.h>
 #include <linux/init.h>
@@ -1781,8 +1782,26 @@ xfs_init_zones(void)
 	if (!xfs_rui_zone)
 		goto out_destroy_rud_zone;
 
+	xfs_cud_zone = kmem_zone_init((sizeof(struct xfs_cud_log_item) +
+			((XFS_CUD_MAX_FAST_EXTENTS - 1) *
+				 sizeof(struct xfs_phys_extent))),
+			"xfs_cud_item");
+	if (!xfs_cud_zone)
+		goto out_destroy_rui_zone;
+
+	xfs_cui_zone = kmem_zone_init((sizeof(struct xfs_cui_log_item) +
+			((XFS_CUI_MAX_FAST_EXTENTS - 1) *
+				sizeof(struct xfs_phys_extent))),
+			"xfs_cui_item");
+	if (!xfs_cui_zone)
+		goto out_destroy_cud_zone;
+
 	return 0;
 
+ out_destroy_cud_zone:
+	kmem_zone_destroy(xfs_cud_zone);
+ out_destroy_rui_zone:
+	kmem_zone_destroy(xfs_rui_zone);
  out_destroy_rud_zone:
 	kmem_zone_destroy(xfs_rud_zone);
  out_destroy_icreate_zone:
@@ -1825,6 +1844,8 @@ xfs_destroy_zones(void)
 	 * destroy caches.
 	 */
 	rcu_barrier();
+	kmem_zone_destroy(xfs_cui_zone);
+	kmem_zone_destroy(xfs_cud_zone);
 	kmem_zone_destroy(xfs_rui_zone);
 	kmem_zone_destroy(xfs_rud_zone);
 	kmem_zone_destroy(xfs_icreate_zone);



* [PATCH 062/119] xfs: log refcount intent items
@ 2016-06-17  1:24 ` Darrick J. Wong
From: Darrick J. Wong @ 2016-06-17  1:24 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Provide a mechanism for higher levels to create CUI/CUD items, submit
them to the log, and a stub function to deal with recovered CUI items.
These parts will be connected to the refcountbt in a later patch.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile                |    1 
 fs/xfs/libxfs/xfs_log_format.h |    2 
 fs/xfs/libxfs/xfs_refcount.h   |   14 ++
 fs/xfs/xfs_log_recover.c       |  299 ++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_trace.h             |   30 ++++
 fs/xfs/xfs_trans.h             |   15 ++
 fs/xfs/xfs_trans_refcount.c    |  192 ++++++++++++++++++++++++++
 7 files changed, 552 insertions(+), 1 deletion(-)
 create mode 100644 fs/xfs/xfs_trans_refcount.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 322c386..2945270 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -114,6 +114,7 @@ xfs-y				+= xfs_log.o \
 				   xfs_trans_buf.o \
 				   xfs_trans_extfree.o \
 				   xfs_trans_inode.o \
+				   xfs_trans_refcount.o \
 				   xfs_trans_rmap.o \
 
 # optional features
diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
index 1dfa02c..923b08f 100644
--- a/fs/xfs/libxfs/xfs_log_format.h
+++ b/fs/xfs/libxfs/xfs_log_format.h
@@ -688,6 +688,8 @@ struct xfs_phys_extent {
 #define XFS_REFCOUNT_EXTENT_FREE_COW	4
 #define XFS_REFCOUNT_EXTENT_TYPE_MASK	0xFF
 
+#define XFS_REFCOUNT_EXTENT_FLAGS	(XFS_REFCOUNT_EXTENT_TYPE_MASK)
+
 /*
  * This is the structure used to lay out a cui log item in the
  * log.  The cui_extents field is a variable size array whose
diff --git a/fs/xfs/libxfs/xfs_refcount.h b/fs/xfs/libxfs/xfs_refcount.h
index 8ea65c6..0b36c1d 100644
--- a/fs/xfs/libxfs/xfs_refcount.h
+++ b/fs/xfs/libxfs/xfs_refcount.h
@@ -27,4 +27,18 @@ extern int xfs_refcountbt_lookup_ge(struct xfs_btree_cur *cur,
 extern int xfs_refcountbt_get_rec(struct xfs_btree_cur *cur,
 		struct xfs_refcount_irec *irec, int *stat);
 
+enum xfs_refcount_intent_type {
+	XFS_REFCOUNT_INCREASE,
+	XFS_REFCOUNT_DECREASE,
+	XFS_REFCOUNT_ALLOC_COW,
+	XFS_REFCOUNT_FREE_COW,
+};
+
+struct xfs_refcount_intent {
+	struct list_head			ri_list;
+	enum xfs_refcount_intent_type		ri_type;
+	xfs_fsblock_t				ri_startblock;
+	xfs_extlen_t				ri_blockcount;
+};
+
 #endif	/* __XFS_REFCOUNT_H__ */
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index dbfbc26..e0a470a 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -46,6 +46,7 @@
 #include "xfs_dir2.h"
 #include "xfs_rmap_item.h"
 #include "xfs_rmap_btree.h"
+#include "xfs_refcount_item.h"
 
 #define BLK_AVG(blk1, blk2)	((blk1+blk2) >> 1)
 
@@ -1916,6 +1917,8 @@ xlog_recover_reorder_trans(
 		case XFS_LI_EFI:
 		case XFS_LI_RUI:
 		case XFS_LI_RUD:
+		case XFS_LI_CUI:
+		case XFS_LI_CUD:
 			trace_xfs_log_recover_item_reorder_tail(log,
 							trans, item, pass);
 			list_move_tail(&item->ri_list, &inode_list);
@@ -3519,6 +3522,101 @@ xlog_recover_rud_pass2(
 }
 
 /*
+ * This routine is called to create an in-core extent refcount update
+ * item from the cui format structure which was logged on disk.
+ * It allocates an in-core cui, copies the extents from the format
+ * structure into it, and adds the cui to the AIL with the given
+ * LSN.
+ */
+STATIC int
+xlog_recover_cui_pass2(
+	struct xlog			*log,
+	struct xlog_recover_item	*item,
+	xfs_lsn_t			lsn)
+{
+	int				error;
+	struct xfs_mount		*mp = log->l_mp;
+	struct xfs_cui_log_item		*cuip;
+	struct xfs_cui_log_format	*cui_formatp;
+
+	cui_formatp = item->ri_buf[0].i_addr;
+
+	cuip = xfs_cui_init(mp, cui_formatp->cui_nextents);
+	error = xfs_cui_copy_format(&item->ri_buf[0], &cuip->cui_format);
+	if (error) {
+		xfs_cui_item_free(cuip);
+		return error;
+	}
+	atomic_set(&cuip->cui_next_extent, cui_formatp->cui_nextents);
+
+	spin_lock(&log->l_ailp->xa_lock);
+	/*
+	 * The CUI has two references. One for the CUD and one for the CUI to ensure
+	 * it makes it into the AIL. Insert the CUI into the AIL directly and
+	 * drop the CUI reference. Note that xfs_trans_ail_update() drops the
+	 * AIL lock.
+	 */
+	xfs_trans_ail_update(log->l_ailp, &cuip->cui_item, lsn);
+	xfs_cui_release(cuip);
+	return 0;
+}
+
+
+/*
+ * This routine is called when a CUD format structure is found in a committed
+ * transaction in the log. Its purpose is to cancel the corresponding CUI if it
+ * was still in the log. To do this it searches the AIL for the CUI with an id
+ * equal to that in the CUD format structure. If we find it we drop the CUD
+ * reference, which removes the CUI from the AIL and frees it.
+ */
+STATIC int
+xlog_recover_cud_pass2(
+	struct xlog			*log,
+	struct xlog_recover_item	*item)
+{
+	struct xfs_cud_log_format	*cud_formatp;
+	struct xfs_cui_log_item		*cuip = NULL;
+	struct xfs_log_item		*lip;
+	__uint64_t			cui_id;
+	struct xfs_ail_cursor		cur;
+	struct xfs_ail			*ailp = log->l_ailp;
+
+	cud_formatp = item->ri_buf[0].i_addr;
+	ASSERT(item->ri_buf[0].i_len == (sizeof(struct xfs_cud_log_format) +
+			((cud_formatp->cud_nextents - 1) *
+			sizeof(struct xfs_phys_extent))));
+	cui_id = cud_formatp->cud_cui_id;
+
+	/*
+	 * Search for the CUI with the id in the CUD format structure in the
+	 * AIL.
+	 */
+	spin_lock(&ailp->xa_lock);
+	lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
+	while (lip != NULL) {
+		if (lip->li_type == XFS_LI_CUI) {
+			cuip = (struct xfs_cui_log_item *)lip;
+			if (cuip->cui_format.cui_id == cui_id) {
+				/*
+				 * Drop the CUD reference to the CUI. This
+				 * removes the CUI from the AIL and frees it.
+				 */
+				spin_unlock(&ailp->xa_lock);
+				xfs_cui_release(cuip);
+				spin_lock(&ailp->xa_lock);
+				break;
+			}
+		}
+		lip = xfs_trans_ail_cursor_next(ailp, &cur);
+	}
+
+	xfs_trans_ail_cursor_done(&cur);
+	spin_unlock(&ailp->xa_lock);
+
+	return 0;
+}
+
+/*
  * This routine is called when an inode create format structure is found in a
 * committed transaction in the log.  Its purpose is to initialise the inodes
  * being allocated on disk. This requires us to get inode cluster buffers that
@@ -3745,6 +3843,8 @@ xlog_recover_ra_pass2(
 	case XFS_LI_QUOTAOFF:
 	case XFS_LI_RUI:
 	case XFS_LI_RUD:
+	case XFS_LI_CUI:
+	case XFS_LI_CUD:
 	default:
 		break;
 	}
@@ -3770,6 +3870,8 @@ xlog_recover_commit_pass1(
 	case XFS_LI_ICREATE:
 	case XFS_LI_RUI:
 	case XFS_LI_RUD:
+	case XFS_LI_CUI:
+	case XFS_LI_CUD:
 		/* nothing to do in pass 1 */
 		return 0;
 	default:
@@ -3804,6 +3906,10 @@ xlog_recover_commit_pass2(
 		return xlog_recover_rui_pass2(log, item, trans->r_lsn);
 	case XFS_LI_RUD:
 		return xlog_recover_rud_pass2(log, item);
+	case XFS_LI_CUI:
+		return xlog_recover_cui_pass2(log, item, trans->r_lsn);
+	case XFS_LI_CUD:
+		return xlog_recover_cud_pass2(log, item);
 	case XFS_LI_DQUOT:
 		return xlog_recover_dquot_pass2(log, buffer_list, item,
 						trans->r_lsn);
@@ -4282,6 +4388,7 @@ static inline bool xlog_item_is_intent(struct xfs_log_item *lip)
 	switch (lip->li_type) {
 	case XFS_LI_EFI:
 	case XFS_LI_RUI:
+	case XFS_LI_CUI:
 		return true;
 	default:
 		return false;
@@ -4713,6 +4820,186 @@ xlog_recover_cancel_ruis(
 }
 
 /*
+ * Process a refcount update intent item that was recovered from the log.
+ * We need to update the refcountbt.
+ */
+STATIC int
+xlog_recover_process_cui(
+	struct xfs_mount		*mp,
+	struct xfs_cui_log_item		*cuip)
+{
+	int				i;
+	int				error = 0;
+	struct xfs_phys_extent		*refc;
+	xfs_fsblock_t			startblock_fsb;
+	bool				op_ok;
+
+	ASSERT(!test_bit(XFS_CUI_RECOVERED, &cuip->cui_flags));
+
+	/*
+	 * First check the validity of the extents described by the
+	 * CUI.  If any are bad, then assume that all are bad and
+	 * just toss the CUI.
+	 */
+	for (i = 0; i < cuip->cui_format.cui_nextents; i++) {
+		refc = &(cuip->cui_format.cui_extents[i]);
+		startblock_fsb = XFS_BB_TO_FSB(mp,
+				   XFS_FSB_TO_DADDR(mp, refc->pe_startblock));
+		switch (refc->pe_flags & XFS_REFCOUNT_EXTENT_TYPE_MASK) {
+		case XFS_REFCOUNT_EXTENT_INCREASE:
+		case XFS_REFCOUNT_EXTENT_DECREASE:
+		case XFS_REFCOUNT_EXTENT_ALLOC_COW:
+		case XFS_REFCOUNT_EXTENT_FREE_COW:
+			op_ok = true;
+			break;
+		default:
+			op_ok = false;
+			break;
+		}
+		if (!op_ok || (startblock_fsb == 0) ||
+		    (refc->pe_len == 0) ||
+		    (startblock_fsb >= mp->m_sb.sb_dblocks) ||
+		    (refc->pe_len >= mp->m_sb.sb_agblocks) ||
+		    (refc->pe_flags & ~XFS_REFCOUNT_EXTENT_FLAGS)) {
+			/*
+			 * This will pull the CUI from the AIL and
+			 * free the memory associated with it.
+			 */
+			set_bit(XFS_CUI_RECOVERED, &cuip->cui_flags);
+			xfs_cui_release(cuip);
+			return -EIO;
+		}
+	}
+
+	/* XXX: do nothing for now */
+	set_bit(XFS_CUI_RECOVERED, &cuip->cui_flags);
+	xfs_cui_release(cuip);
+	return error;
+}
+
+/*
+ * When this is called, all of the CUIs which did not have
+ * corresponding CUDs should be in the AIL.  What we do now
+ * is update the refcounts associated with each one.
+ *
+ * Since we process the CUIs in normal transactions, they
+ * will be removed at some point after the commit.  This prevents
+ * us from just walking down the list processing each one.
+ * We'll use a flag in the CUI to skip those that we've already
+ * processed and use the AIL iteration mechanism's generation
+ * count to try to speed this up at least a bit.
+ *
+ * When we start, we know that the CUIs are the only things in
+ * the AIL.  As we process them, however, other items are added
+ * to the AIL.  Since everything added to the AIL must come after
+ * everything already in the AIL, we stop processing as soon as
+ * we see something other than a CUI in the AIL.
+ */
+STATIC int
+xlog_recover_process_cuis(
+	struct xlog		*log)
+{
+	struct xfs_log_item	*lip;
+	struct xfs_cui_log_item	*cuip;
+	int			error = 0;
+	struct xfs_ail_cursor	cur;
+	struct xfs_ail		*ailp;
+
+	ailp = log->l_ailp;
+	spin_lock(&ailp->xa_lock);
+	lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
+	while (lip != NULL) {
+		/*
+		 * We're done when we see something other than an intent.
+		 * There should be no intents left in the AIL now.
+		 */
+		if (!xlog_item_is_intent(lip)) {
+#ifdef DEBUG
+			for (; lip; lip = xfs_trans_ail_cursor_next(ailp, &cur))
+				ASSERT(!xlog_item_is_intent(lip));
+#endif
+			break;
+		}
+
+		/* Skip anything that isn't a CUI */
+		if (lip->li_type != XFS_LI_CUI) {
+			lip = xfs_trans_ail_cursor_next(ailp, &cur);
+			continue;
+		}
+
+		/*
+		 * Skip CUIs that we've already processed.
+		 */
+		cuip = container_of(lip, struct xfs_cui_log_item, cui_item);
+		if (test_bit(XFS_CUI_RECOVERED, &cuip->cui_flags)) {
+			lip = xfs_trans_ail_cursor_next(ailp, &cur);
+			continue;
+		}
+
+		spin_unlock(&ailp->xa_lock);
+		error = xlog_recover_process_cui(log->l_mp, cuip);
+		spin_lock(&ailp->xa_lock);
+		if (error)
+			goto out;
+		lip = xfs_trans_ail_cursor_next(ailp, &cur);
+	}
+out:
+	xfs_trans_ail_cursor_done(&cur);
+	spin_unlock(&ailp->xa_lock);
+	return error;
+}
+
+/*
+ * A cancel occurs when the mount has failed and we're bailing out. Release all
+ * pending CUIs so they don't pin the AIL.
+ */
+STATIC int
+xlog_recover_cancel_cuis(
+	struct xlog		*log)
+{
+	struct xfs_log_item	*lip;
+	struct xfs_cui_log_item	*cuip;
+	int			error = 0;
+	struct xfs_ail_cursor	cur;
+	struct xfs_ail		*ailp;
+
+	ailp = log->l_ailp;
+	spin_lock(&ailp->xa_lock);
+	lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
+	while (lip != NULL) {
+		/*
+		 * We're done when we see something other than a CUI.
+		 * There should be no CUIs left in the AIL now.
+		 */
+		if (!xlog_item_is_intent(lip)) {
+#ifdef DEBUG
+			for (; lip; lip = xfs_trans_ail_cursor_next(ailp, &cur))
+				ASSERT(!xlog_item_is_intent(lip));
+#endif
+			break;
+		}
+
+		/* Skip anything that isn't a CUI */
+		if (lip->li_type != XFS_LI_CUI) {
+			lip = xfs_trans_ail_cursor_next(ailp, &cur);
+			continue;
+		}
+
+		cuip = container_of(lip, struct xfs_cui_log_item, cui_item);
+
+		spin_unlock(&ailp->xa_lock);
+		xfs_cui_release(cuip);
+		spin_lock(&ailp->xa_lock);
+
+		lip = xfs_trans_ail_cursor_next(ailp, &cur);
+	}
+
+	xfs_trans_ail_cursor_done(&cur);
+	spin_unlock(&ailp->xa_lock);
+	return error;
+}
+
+/*
  * This routine performs a transaction to null out a bad inode pointer
  * in an agi unlinked inode hash bucket.
  */
@@ -5515,6 +5802,12 @@ xlog_recover_finish(
 	if (log->l_flags & XLOG_RECOVERY_NEEDED) {
 		int	error;
 
+		error = xlog_recover_process_cuis(log);
+		if (error) {
+			xfs_alert(log->l_mp, "Failed to recover CUIs");
+			return error;
+		}
+
 		error = xlog_recover_process_ruis(log);
 		if (error) {
 			xfs_alert(log->l_mp, "Failed to recover RUIs");
@@ -5557,7 +5850,11 @@ xlog_recover_cancel(
 	int		err2;
 
 	if (log->l_flags & XLOG_RECOVERY_NEEDED) {
-		error = xlog_recover_cancel_ruis(log);
+		error = xlog_recover_cancel_cuis(log);
+
+		err2 = xlog_recover_cancel_ruis(log);
+		if (err2 && !error)
+			error = err2;
 
 		err2 = xlog_recover_cancel_efis(log);
 		if (err2 && !error)
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 67ce2d8..1f6cee0 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -2906,6 +2906,36 @@ DEFINE_AG_EXTENT_EVENT(xfs_refcount_find_shared);
 DEFINE_AG_EXTENT_EVENT(xfs_refcount_find_shared_result);
 DEFINE_AG_ERROR_EVENT(xfs_refcount_find_shared_error);
 
+TRACE_EVENT(xfs_refcount_finish_one_leftover,
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
+		 int type, xfs_agblock_t agbno,
+		 xfs_extlen_t len, xfs_extlen_t adjusted),
+	TP_ARGS(mp, agno, type, agbno, len, adjusted),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(int, type)
+		__field(xfs_agblock_t, agbno)
+		__field(xfs_extlen_t, len)
+		__field(xfs_extlen_t, adjusted)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->agno = agno;
+		__entry->type = type;
+		__entry->agbno = agbno;
+		__entry->len = len;
+		__entry->adjusted = adjusted;
+	),
+	TP_printk("dev %d:%d type %d agno %u agbno %u len %u adjusted %u",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->type,
+		  __entry->agno,
+		  __entry->agbno,
+		  __entry->len,
+		  __entry->adjusted)
+);
+
 #endif /* _TRACE_XFS_H */
 
 #undef TRACE_INCLUDE_PATH
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index f59d934..2b197fd 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -253,4 +253,19 @@ int xfs_trans_log_finish_rmap_update(struct xfs_trans *tp,
 		xfs_fsblock_t startblock, xfs_filblks_t blockcount,
 		xfs_exntst_t state, struct xfs_btree_cur **pcur);
 
+enum xfs_refcount_intent_type;
+
+struct xfs_cui_log_item *xfs_trans_get_cui(struct xfs_trans *tp, uint nextents);
+void xfs_trans_log_start_refcount_update(struct xfs_trans *tp,
+		struct xfs_cui_log_item *cuip,
+		enum xfs_refcount_intent_type type, xfs_fsblock_t startblock,
+		xfs_filblks_t blockcount);
+
+struct xfs_cud_log_item *xfs_trans_get_cud(struct xfs_trans *tp,
+		struct xfs_cui_log_item *cuip, uint nextents);
+int xfs_trans_log_finish_refcount_update(struct xfs_trans *tp,
+		struct xfs_cud_log_item *cudp,
+		enum xfs_refcount_intent_type type, xfs_fsblock_t startblock,
+		xfs_extlen_t blockcount, struct xfs_btree_cur **pcur);
+
 #endif	/* __XFS_TRANS_H__ */
diff --git a/fs/xfs/xfs_trans_refcount.c b/fs/xfs/xfs_trans_refcount.c
new file mode 100644
index 0000000..32701e4
--- /dev/null
+++ b/fs/xfs/xfs_trans_refcount.c
@@ -0,0 +1,192 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_trans.h"
+#include "xfs_trans_priv.h"
+#include "xfs_refcount_item.h"
+#include "xfs_alloc.h"
+#include "xfs_refcount.h"
+
+/*
+ * This routine is called to allocate a "refcount update intent"
+ * log item that will hold nextents worth of extents.  The
+ * caller must use all nextents extents, because we are not
+ * flexible about this at all.
+ */
+struct xfs_cui_log_item *
+xfs_trans_get_cui(
+	struct xfs_trans		*tp,
+	uint				nextents)
+{
+	struct xfs_cui_log_item		*cuip;
+
+	ASSERT(tp != NULL);
+	ASSERT(nextents > 0);
+
+	cuip = xfs_cui_init(tp->t_mountp, nextents);
+	ASSERT(cuip != NULL);
+
+	/*
+	 * Get a log_item_desc to point at the new item.
+	 */
+	xfs_trans_add_item(tp, &cuip->cui_item);
+	return cuip;
+}
+
+/*
+ * This routine is called to indicate that the described
+ * extent is to have its reference count updated.  It should
+ * be called once for each extent to be updated.
+ */
+void
+xfs_trans_log_start_refcount_update(
+	struct xfs_trans		*tp,
+	struct xfs_cui_log_item		*cuip,
+	enum xfs_refcount_intent_type	type,
+	xfs_fsblock_t			startblock,
+	xfs_filblks_t			blockcount)
+{
+	uint				next_extent;
+	struct xfs_phys_extent		*refc;
+
+	tp->t_flags |= XFS_TRANS_DIRTY;
+	cuip->cui_item.li_desc->lid_flags |= XFS_LID_DIRTY;
+
+	/*
+	 * atomic_inc_return gives us the value after the increment;
+	 * we want to use it as an array index so we need to subtract 1 from
+	 * it.
+	 */
+	next_extent = atomic_inc_return(&cuip->cui_next_extent) - 1;
+	ASSERT(next_extent < cuip->cui_format.cui_nextents);
+	refc = &(cuip->cui_format.cui_extents[next_extent]);
+	refc->pe_startblock = startblock;
+	refc->pe_len = blockcount;
+	refc->pe_flags = 0;
+	switch (type) {
+	case XFS_REFCOUNT_INCREASE:
+		refc->pe_flags |= XFS_REFCOUNT_EXTENT_INCREASE;
+		break;
+	case XFS_REFCOUNT_DECREASE:
+		refc->pe_flags |= XFS_REFCOUNT_EXTENT_DECREASE;
+		break;
+	case XFS_REFCOUNT_ALLOC_COW:
+		refc->pe_flags |= XFS_REFCOUNT_EXTENT_ALLOC_COW;
+		break;
+	case XFS_REFCOUNT_FREE_COW:
+		refc->pe_flags |= XFS_REFCOUNT_EXTENT_FREE_COW;
+		break;
+	default:
+		ASSERT(0);
+	}
+}
+
+
+/*
+ * This routine is called to allocate a "refcount update done"
+ * log item that will hold nextents worth of extents.  The
+ * caller must use all nextents extents, because we are not
+ * flexible about this at all.
+ */
+struct xfs_cud_log_item *
+xfs_trans_get_cud(
+	struct xfs_trans		*tp,
+	struct xfs_cui_log_item		*cuip,
+	uint				nextents)
+{
+	struct xfs_cud_log_item		*cudp;
+
+	ASSERT(tp != NULL);
+	ASSERT(nextents > 0);
+
+	cudp = xfs_cud_init(tp->t_mountp, cuip, nextents);
+	ASSERT(cudp != NULL);
+
+	/*
+	 * Get a log_item_desc to point at the new item.
+	 */
+	xfs_trans_add_item(tp, &cudp->cud_item);
+	return cudp;
+}
+
+/*
+ * Finish a refcount update and log it to the CUD. Note that the transaction is
+ * marked dirty regardless of whether the refcount update succeeds or fails to
+ * support the CUI/CUD lifecycle rules.
+ */
+int
+xfs_trans_log_finish_refcount_update(
+	struct xfs_trans		*tp,
+	struct xfs_cud_log_item		*cudp,
+	enum xfs_refcount_intent_type	type,
+	xfs_fsblock_t			startblock,
+	xfs_extlen_t			blockcount,
+	struct xfs_btree_cur		**pcur)
+{
+	uint				next_extent;
+	struct xfs_phys_extent		*refc;
+	int				error;
+
+	/* XXX: leave this empty for now */
+	error = -EFSCORRUPTED;
+
+	/*
+	 * Mark the transaction dirty, even on error. This ensures the
+	 * transaction is aborted, which:
+	 *
+	 * 1.) releases the CUI and frees the CUD
+	 * 2.) shuts down the filesystem
+	 */
+	tp->t_flags |= XFS_TRANS_DIRTY;
+	cudp->cud_item.li_desc->lid_flags |= XFS_LID_DIRTY;
+
+	next_extent = cudp->cud_next_extent;
+	ASSERT(next_extent < cudp->cud_format.cud_nextents);
+	refc = &(cudp->cud_format.cud_extents[next_extent]);
+	refc->pe_startblock = startblock;
+	refc->pe_len = blockcount;
+	refc->pe_flags = 0;
+	switch (type) {
+	case XFS_REFCOUNT_INCREASE:
+		refc->pe_flags |= XFS_REFCOUNT_EXTENT_INCREASE;
+		break;
+	case XFS_REFCOUNT_DECREASE:
+		refc->pe_flags |= XFS_REFCOUNT_EXTENT_DECREASE;
+		break;
+	case XFS_REFCOUNT_ALLOC_COW:
+		refc->pe_flags |= XFS_REFCOUNT_EXTENT_ALLOC_COW;
+		break;
+	case XFS_REFCOUNT_FREE_COW:
+		refc->pe_flags |= XFS_REFCOUNT_EXTENT_FREE_COW;
+		break;
+	default:
+		ASSERT(0);
+	}
+	cudp->cud_next_extent++;
+
+	return error;
+}

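The CUI/CUD machinery above follows XFS's general intent/done logging pattern: log an intent item, perform the update, then log a matching done item; after a crash, recovery replays only intents that never received a done. The pattern can be sketched abstractly. This is a hypothetical toy model with invented names (`Log`, `recover`), not the kernel's log format:

```python
class Log:
    """Toy intent/done log.  The real code logs binary CUI/CUD items
    into the XFS journal and tracks unfinished intents in the AIL."""
    def __init__(self):
        self.items = []

    def log_intent(self, op_id, op):
        self.items.append(("intent", op_id, op))

    def log_done(self, op_id):
        self.items.append(("done", op_id, None))

    def unfinished_intents(self):
        # Intents with no matching done item: what recovery must replay.
        done = {oid for kind, oid, _ in self.items if kind == "done"}
        return [(oid, op) for kind, oid, op in self.items
                if kind == "intent" and oid not in done]

def recover(log, apply_fn):
    # Analogous to xlog_recover_process_cuis(): finish every intent
    # that never got its done item, then mark it done.
    for oid, op in log.unfinished_intents():
        apply_fn(op)
        log.log_done(oid)
```

A crash between `log_intent` and `log_done` leaves exactly the unfinished operations visible to recovery, which is why cancelled mounts must release pending CUIs so they do not pin the AIL.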

^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 063/119] xfs: adjust refcount of an extent of blocks in refcount btree
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (61 preceding siblings ...)
  2016-06-17  1:24 ` [PATCH 062/119] xfs: log refcount intent items Darrick J. Wong
@ 2016-06-17  1:24 ` Darrick J. Wong
  2016-06-17  1:24 ` [PATCH 064/119] xfs: connect refcount adjust functions to upper layers Darrick J. Wong
                   ` (55 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:24 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Provide functions to adjust the reference counts for an extent of
physical blocks stored in the refcount btree.

v2: Refactor the left/right split code into a single function.  Track
the number of btree shape changes and record updates during a refcount
update so that we can decide if we need to get a fresh transaction to
continue.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_refcount.c |  783 ++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_error.h           |    4 
 2 files changed, 786 insertions(+), 1 deletion(-)

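The adjustment semantics described in the changelog can be modeled compactly: the refcount btree stores (startblock, blockcount, refcount) records and omits refcount-1 ranges entirely, so raising or lowering a range is equivalent to bumping a per-block count and re-encoding the runs. The following toy model (invented helper names, a flat array standing in for the btree) illustrates only the semantics, not the split/merge mechanics in the patch:

```python
def adjust(refcounts, agbno, aglen, adj):
    """Raise or lower the refcount of every block in [agbno, agbno+aglen)."""
    for b in range(agbno, agbno + aglen):
        refcounts[b] += adj
        assert refcounts[b] >= 0   # 0 means the block is now free

def records(refcounts):
    """Run-length encode blocks with refcount >= 2, the way the refcount
    btree would store them (refcount-1 ranges are implied by gaps)."""
    recs, start = [], None
    for b, rc in enumerate(refcounts + [1]):   # sentinel flushes last run
        if start is not None and rc != refcounts[start]:
            recs.append((start, b - start, refcounts[start]))
            start = None
        if start is None and rc >= 2:
            start = b
    return recs
```

The real implementation gets the same end state by splitting records at the range boundaries, merging with neighbors where the new counts match, and then walking the interior records, so that the btree never needs a per-block representation.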

diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
index 4d483b5..d13393b 100644
--- a/fs/xfs/libxfs/xfs_refcount.c
+++ b/fs/xfs/libxfs/xfs_refcount.c
@@ -37,6 +37,12 @@
 #include "xfs_bit.h"
 #include "xfs_refcount.h"
 
+/* Allowable refcount adjustment amounts. */
+enum xfs_refc_adjust_op {
+	XFS_REFCOUNT_ADJUST_INCREASE	= 1,
+	XFS_REFCOUNT_ADJUST_DECREASE	= -1,
+};
+
 /*
  * Look up the first record less than or equal to [bno, len] in the btree
  * given by cur.
@@ -175,3 +181,780 @@ out_error:
 				cur->bc_private.a.agno, error, _RET_IP_);
 	return error;
 }
+
+/*
+ * Adjusting the Reference Count
+ *
+ * As stated elsewhere, the reference count btree (refcntbt) stores
+ * >1 reference counts for extents of physical blocks.  In this
+ * operation, we're either raising or lowering the reference count of
+ * some subrange stored in the tree:
+ *
+ *      <------ adjustment range ------>
+ * ----+   +---+-----+ +--+--------+---------
+ *  2  |   | 3 |  4  | |17|   55   |   10
+ * ----+   +---+-----+ +--+--------+---------
+ * X axis is physical blocks number;
+ * reference counts are the numbers inside the rectangles
+ *
+ * The first thing we need to do is to ensure that there are no
+ * refcount extents crossing either boundary of the range to be
+ * adjusted.  For any extent that does cross a boundary, split it into
+ * two extents so that we can increment the refcount of one of the
+ * pieces later:
+ *
+ *      <------ adjustment range ------>
+ * ----+   +---+-----+ +--+--------+----+----
+ *  2  |   | 3 |  2  | |17|   55   | 10 | 10
+ * ----+   +---+-----+ +--+--------+----+----
+ *
+ * For this next step, let's assume that all the physical blocks in
+ * the adjustment range are mapped to a file and are therefore in use
+ * at least once.  Therefore, we can infer that any gap in the
+ * refcount tree within the adjustment range represents a physical
+ * extent with refcount == 1:
+ *
+ *      <------ adjustment range ------>
+ * ----+---+---+-----+-+--+--------+----+----
+ *  2  |"1"| 3 |  2  |1|17|   55   | 10 | 10
+ * ----+---+---+-----+-+--+--------+----+----
+ *      ^
+ *
+ * For each extent that falls within the interval range, figure out
+ * which extent is to the left or the right of that extent.  Now we
+ * have a left, current, and right extent.  If the new reference count
+ * of the center extent enables us to merge left, center, and right
+ * into one record covering all three, do so.  If the center extent is
+ * at the left end of the range, abuts the left extent, and its new
+ * reference count matches the left extent's record, then merge them.
+ * If the center extent is at the right end of the range, abuts the
+ * right extent, and the reference counts match, merge those.  In the
+ * example, we can left merge (assuming an increment operation):
+ *
+ *      <------ adjustment range ------>
+ * --------+---+-----+-+--+--------+----+----
+ *    2    | 3 |  2  |1|17|   55   | 10 | 10
+ * --------+---+-----+-+--+--------+----+----
+ *          ^
+ *
+ * For all other extents within the range, adjust the reference count
+ * or delete it if the refcount falls below 2.  If we were
+ * incrementing, the end result looks like this:
+ *
+ *      <------ adjustment range ------>
+ * --------+---+-----+-+--+--------+----+----
+ *    2    | 4 |  3  |2|18|   56   | 11 | 10
+ * --------+---+-----+-+--+--------+----+----
+ *
+ * The result of a decrement operation looks as such:
+ *
+ *      <------ adjustment range ------>
+ * ----+   +---+       +--+--------+----+----
+ *  2  |   | 2 |       |16|   54   |  9 | 10
+ * ----+   +---+       +--+--------+----+----
+ *      DDDD    111111DD
+ *
+ * The blocks marked "D" are freed; the blocks marked "1" are only
+ * referenced once and therefore the record is removed from the
+ * refcount btree.
+ */
+
+#define RCNEXT(rc)	((rc).rc_startblock + (rc).rc_blockcount)
+/*
+ * Split a refcount extent that crosses agbno.
+ */
+STATIC int
+xfs_refcount_split_extent(
+	struct xfs_btree_cur		*cur,
+	xfs_agblock_t			agbno,
+	bool				*shape_changed)
+{
+	struct xfs_refcount_irec	rcext, tmp;
+	int				found_rec;
+	int				error;
+
+	*shape_changed = false;
+	error = xfs_refcountbt_lookup_le(cur, agbno, &found_rec);
+	if (error)
+		goto out_error;
+	if (!found_rec)
+		return 0;
+
+	error = xfs_refcountbt_get_rec(cur, &rcext, &found_rec);
+	if (error)
+		goto out_error;
+	XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1, out_error);
+	if (rcext.rc_startblock == agbno || RCNEXT(rcext) <= agbno)
+		return 0;
+
+	*shape_changed = true;
+	trace_xfs_refcount_split_extent(cur->bc_mp, cur->bc_private.a.agno,
+			&rcext, agbno);
+
+	/* Establish the right extent. */
+	tmp = rcext;
+	tmp.rc_startblock = agbno;
+	tmp.rc_blockcount -= (agbno - rcext.rc_startblock);
+	error = xfs_refcountbt_update(cur, &tmp);
+	if (error)
+		goto out_error;
+
+	/* Insert the left extent. */
+	tmp = rcext;
+	tmp.rc_blockcount = agbno - rcext.rc_startblock;
+	error = xfs_refcountbt_insert(cur, &tmp, &found_rec);
+	if (error)
+		goto out_error;
+	XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1, out_error);
+	return error;
+
+out_error:
+	trace_xfs_refcount_split_extent_error(cur->bc_mp,
+			cur->bc_private.a.agno, error, _RET_IP_);
+	return error;
+}
+
+/*
+ * Merge the left, center, and right extents.
+ */
+STATIC int
+xfs_refcount_merge_center_extent(
+	struct xfs_btree_cur		*cur,
+	struct xfs_refcount_irec	*left,
+	struct xfs_refcount_irec	*center,
+	unsigned long long		extlen,
+	xfs_agblock_t			*agbno,
+	xfs_extlen_t			*aglen)
+{
+	int				error;
+	int				found_rec;
+
+	error = xfs_refcountbt_lookup_ge(cur, center->rc_startblock,
+			&found_rec);
+	if (error)
+		goto out_error;
+	XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1, out_error);
+
+	error = xfs_refcountbt_delete(cur, &found_rec);
+	if (error)
+		goto out_error;
+	XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1, out_error);
+
+	if (center->rc_refcount > 1) {
+		error = xfs_refcountbt_delete(cur, &found_rec);
+		if (error)
+			goto out_error;
+		XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1,
+				out_error);
+	}
+
+	error = xfs_refcountbt_lookup_le(cur, left->rc_startblock,
+			&found_rec);
+	if (error)
+		goto out_error;
+	XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1, out_error);
+
+	left->rc_blockcount = extlen;
+	error = xfs_refcountbt_update(cur, left);
+	if (error)
+		goto out_error;
+
+	*aglen = 0;
+	return error;
+
+out_error:
+	trace_xfs_refcount_merge_center_extents_error(cur->bc_mp,
+			cur->bc_private.a.agno, error, _RET_IP_);
+	return error;
+}
+
+/*
+ * Merge with the left extent.
+ */
+STATIC int
+xfs_refcount_merge_left_extent(
+	struct xfs_btree_cur		*cur,
+	struct xfs_refcount_irec	*left,
+	struct xfs_refcount_irec	*cleft,
+	xfs_agblock_t			*agbno,
+	xfs_extlen_t			*aglen)
+{
+	int				error;
+	int				found_rec;
+
+	if (cleft->rc_refcount > 1) {
+		error = xfs_refcountbt_lookup_le(cur, cleft->rc_startblock,
+				&found_rec);
+		if (error)
+			goto out_error;
+		XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1,
+				out_error);
+
+		error = xfs_refcountbt_delete(cur, &found_rec);
+		if (error)
+			goto out_error;
+		XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1,
+				out_error);
+	}
+
+	error = xfs_refcountbt_lookup_le(cur, left->rc_startblock,
+			&found_rec);
+	if (error)
+		goto out_error;
+	XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1, out_error);
+
+	left->rc_blockcount += cleft->rc_blockcount;
+	error = xfs_refcountbt_update(cur, left);
+	if (error)
+		goto out_error;
+
+	*agbno += cleft->rc_blockcount;
+	*aglen -= cleft->rc_blockcount;
+	return error;
+
+out_error:
+	trace_xfs_refcount_merge_left_extent_error(cur->bc_mp,
+			cur->bc_private.a.agno, error, _RET_IP_);
+	return error;
+}
+
+/*
+ * Merge with the right extent.
+ */
+STATIC int
+xfs_refcount_merge_right_extent(
+	struct xfs_btree_cur		*cur,
+	struct xfs_refcount_irec	*right,
+	struct xfs_refcount_irec	*cright,
+	xfs_agblock_t			*agbno,
+	xfs_extlen_t			*aglen)
+{
+	int				error;
+	int				found_rec;
+
+	if (cright->rc_refcount > 1) {
+		error = xfs_refcountbt_lookup_le(cur, cright->rc_startblock,
+			&found_rec);
+		if (error)
+			goto out_error;
+		XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1,
+				out_error);
+
+		error = xfs_refcountbt_delete(cur, &found_rec);
+		if (error)
+			goto out_error;
+		XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1,
+				out_error);
+	}
+
+	error = xfs_refcountbt_lookup_le(cur, right->rc_startblock,
+			&found_rec);
+	if (error)
+		goto out_error;
+	XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1, out_error);
+
+	right->rc_startblock -= cright->rc_blockcount;
+	right->rc_blockcount += cright->rc_blockcount;
+	error = xfs_refcountbt_update(cur, right);
+	if (error)
+		goto out_error;
+
+	*aglen -= cright->rc_blockcount;
+	return error;
+
+out_error:
+	trace_xfs_refcount_merge_right_extent_error(cur->bc_mp,
+			cur->bc_private.a.agno, error, _RET_IP_);
+	return error;
+}
+
+/*
+ * Find the left extent and the one after it (cleft).  This function assumes
+ * that we've already split any extent crossing agbno.
+ */
+STATIC int
+xfs_refcount_find_left_extents(
+	struct xfs_btree_cur		*cur,
+	struct xfs_refcount_irec	*left,
+	struct xfs_refcount_irec	*cleft,
+	xfs_agblock_t			agbno,
+	xfs_extlen_t			aglen)
+{
+	struct xfs_refcount_irec	tmp;
+	int				error;
+	int				found_rec;
+
+	left->rc_blockcount = cleft->rc_blockcount = 0;
+	error = xfs_refcountbt_lookup_le(cur, agbno - 1, &found_rec);
+	if (error)
+		goto out_error;
+	if (!found_rec)
+		return 0;
+
+	error = xfs_refcountbt_get_rec(cur, &tmp, &found_rec);
+	if (error)
+		goto out_error;
+	XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1, out_error);
+
+	if (RCNEXT(tmp) != agbno)
+		return 0;
+	/* We have a left extent; retrieve (or invent) the next right one */
+	*left = tmp;
+
+	error = xfs_btree_increment(cur, 0, &found_rec);
+	if (error)
+		goto out_error;
+	if (found_rec) {
+		error = xfs_refcountbt_get_rec(cur, &tmp, &found_rec);
+		if (error)
+			goto out_error;
+		XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1,
+				out_error);
+
+		/* if tmp starts at the end of our range, just use that */
+		if (tmp.rc_startblock == agbno)
+			*cleft = tmp;
+		else {
+			/*
+			 * There's a gap in the refcntbt at the start of the
+			 * range we're interested in (refcount == 1) so
+			 * create the implied extent and pass it back.
+			 */
+			cleft->rc_startblock = agbno;
+			cleft->rc_blockcount = min(aglen,
+					tmp.rc_startblock - agbno);
+			cleft->rc_refcount = 1;
+		}
+	} else {
+		/*
+		 * No extents, so pretend that there's one covering the whole
+		 * range.
+		 */
+		cleft->rc_startblock = agbno;
+		cleft->rc_blockcount = aglen;
+		cleft->rc_refcount = 1;
+	}
+	trace_xfs_refcount_find_left_extent(cur->bc_mp, cur->bc_private.a.agno,
+			left, cleft, agbno);
+	return error;
+
+out_error:
+	trace_xfs_refcount_find_left_extent_error(cur->bc_mp,
+			cur->bc_private.a.agno, error, _RET_IP_);
+	return error;
+}
+
+/*
+ * Find the right extent and the one before it (cright).  This function
+ * assumes that we've already split any extents crossing agbno + aglen.
+ */
+STATIC int
+xfs_refcount_find_right_extents(
+	struct xfs_btree_cur		*cur,
+	struct xfs_refcount_irec	*right,
+	struct xfs_refcount_irec	*cright,
+	xfs_agblock_t			agbno,
+	xfs_extlen_t			aglen)
+{
+	struct xfs_refcount_irec	tmp;
+	int				error;
+	int				found_rec;
+
+	right->rc_blockcount = cright->rc_blockcount = 0;
+	error = xfs_refcountbt_lookup_ge(cur, agbno + aglen, &found_rec);
+	if (error)
+		goto out_error;
+	if (!found_rec)
+		return 0;
+
+	error = xfs_refcountbt_get_rec(cur, &tmp, &found_rec);
+	if (error)
+		goto out_error;
+	XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1, out_error);
+
+	if (tmp.rc_startblock != agbno + aglen)
+		return 0;
+	/* We have a right extent; retrieve (or invent) the next left one */
+	*right = tmp;
+
+	error = xfs_btree_decrement(cur, 0, &found_rec);
+	if (error)
+		goto out_error;
+	if (found_rec) {
+		error = xfs_refcountbt_get_rec(cur, &tmp, &found_rec);
+		if (error)
+			goto out_error;
+		XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1,
+				out_error);
+
+		/* if tmp ends at the end of our range, just use that */
+		if (RCNEXT(tmp) == agbno + aglen)
+			*cright = tmp;
+		else {
+			/*
+			 * There's a gap in the refcntbt at the end of the
+			 * range we're interested in (refcount == 1) so
+			 * create the implied extent and pass it back.
+			 */
+			cright->rc_startblock = max(agbno, RCNEXT(tmp));
+			cright->rc_blockcount = right->rc_startblock -
+					cright->rc_startblock;
+			cright->rc_refcount = 1;
+		}
+	} else {
+		/*
+		 * No extents, so pretend that there's one covering the whole
+		 * range.
+		 */
+		cright->rc_startblock = agbno;
+		cright->rc_blockcount = aglen;
+		cright->rc_refcount = 1;
+	}
+	trace_xfs_refcount_find_right_extent(cur->bc_mp, cur->bc_private.a.agno,
+			cright, right, agbno + aglen);
+	return error;
+
+out_error:
+	trace_xfs_refcount_find_right_extent_error(cur->bc_mp,
+			cur->bc_private.a.agno, error, _RET_IP_);
+	return error;
+}
+#undef RCNEXT
+
+/*
+ * Try to merge with any extents on the boundaries of the adjustment range.
+ */
+STATIC int
+xfs_refcount_merge_extents(
+	struct xfs_btree_cur	*cur,
+	xfs_agblock_t		*agbno,
+	xfs_extlen_t		*aglen,
+	enum xfs_refc_adjust_op adjust,
+	bool			*shape_changed)
+{
+	struct xfs_refcount_irec	left = {0}, cleft = {0};
+	struct xfs_refcount_irec	cright = {0}, right = {0};
+	int				error;
+	unsigned long long		ulen;
+	bool				cequal;
+
+	*shape_changed = false;
+	/*
+	 * Find the extent just below agbno [left], just above agbno [cleft],
+	 * just below (agbno + aglen) [cright], and just above (agbno + aglen)
+	 * [right].
+	 */
+	error = xfs_refcount_find_left_extents(cur, &left, &cleft, *agbno,
+			*aglen);
+	if (error)
+		return error;
+	error = xfs_refcount_find_right_extents(cur, &right, &cright, *agbno,
+			*aglen);
+	if (error)
+		return error;
+
+	/* No left or right extent to merge; exit. */
+	if (left.rc_blockcount == 0 && right.rc_blockcount == 0)
+		return 0;
+
+	*shape_changed = true;
+	cequal = (cleft.rc_startblock == cright.rc_startblock) &&
+		 (cleft.rc_blockcount == cright.rc_blockcount);
+
+	/* Try to merge left, cleft, and right.  cleft must == cright. */
+	ulen = (unsigned long long)left.rc_blockcount + cleft.rc_blockcount +
+			right.rc_blockcount;
+	if (left.rc_blockcount != 0 && right.rc_blockcount != 0 &&
+	    cleft.rc_blockcount != 0 && cright.rc_blockcount != 0 &&
+	    cequal &&
+	    left.rc_refcount == cleft.rc_refcount + adjust &&
+	    right.rc_refcount == cleft.rc_refcount + adjust &&
+	    ulen < MAXREFCEXTLEN) {
+		trace_xfs_refcount_merge_center_extents(cur->bc_mp,
+				cur->bc_private.a.agno, &left, &cleft, &right);
+		return xfs_refcount_merge_center_extent(cur, &left, &cleft,
+				ulen, agbno, aglen);
+	}
+
+	/* Try to merge left and cleft. */
+	ulen = (unsigned long long)left.rc_blockcount + cleft.rc_blockcount;
+	if (left.rc_blockcount != 0 && cleft.rc_blockcount != 0 &&
+	    left.rc_refcount == cleft.rc_refcount + adjust &&
+	    ulen < MAXREFCEXTLEN) {
+		trace_xfs_refcount_merge_left_extent(cur->bc_mp,
+				cur->bc_private.a.agno, &left, &cleft);
+		error = xfs_refcount_merge_left_extent(cur, &left, &cleft,
+				agbno, aglen);
+		if (error)
+			return error;
+
+		/*
+		 * If we just merged left + cleft and cleft == cright,
+		 * we no longer have a cright to merge with right.  We're done.
+		 */
+		if (cequal)
+			return 0;
+	}
+
+	/* Try to merge cright and right. */
+	ulen = (unsigned long long)right.rc_blockcount + cright.rc_blockcount;
+	if (right.rc_blockcount != 0 && cright.rc_blockcount != 0 &&
+	    right.rc_refcount == cright.rc_refcount + adjust &&
+	    ulen < MAXREFCEXTLEN) {
+		trace_xfs_refcount_merge_right_extent(cur->bc_mp,
+				cur->bc_private.a.agno, &cright, &right);
+		return xfs_refcount_merge_right_extent(cur, &right, &cright,
+				agbno, aglen);
+	}
+
+	return error;
+}
+
+/*
+ * While we're adjusting the refcounts records of an extent, we have
+ * to keep an eye on the number of extents we're dirtying -- run too
+ * many in a single transaction and we'll exceed the transaction's
+ * reservation and crash the fs.  Each record adds 12 bytes to the
+ * log (plus any key updates) so we'll conservatively assume 24 bytes
+ * per record.  We must also leave space for btree splits on both ends
+ * of the range and space for the CUD and a new CUI.
+ *
+ * XXX: This is a pretty hand-wavy estimate.  The penalty for guessing
+ * true incorrectly is a shutdown FS; the penalty for guessing false
+ * incorrectly is more transaction rolls than might be necessary.
+ * Be conservative here.
+ */
+static bool
+xfs_refcount_still_have_space(
+	struct xfs_btree_cur		*cur)
+{
+	unsigned long			overhead;
+
+	overhead = cur->bc_private.a.priv.refc.shape_changes *
+			xfs_allocfree_log_count(cur->bc_mp, 1);
+	overhead *= cur->bc_mp->m_sb.sb_blocksize;
+
+	/*
+	 * Only allow 2 refcount extent updates per transaction if the
+	 * refcount continue update "error" has been injected.
+	 */
+	if (cur->bc_private.a.priv.refc.nr_ops > 2 &&
+	    XFS_TEST_ERROR(false, cur->bc_mp,
+			XFS_ERRTAG_REFCOUNT_CONTINUE_UPDATE,
+			XFS_RANDOM_REFCOUNT_CONTINUE_UPDATE))
+		return false;
+
+	if (cur->bc_private.a.priv.refc.nr_ops == 0)
+		return true;
+	else if (overhead > cur->bc_tp->t_log_res)
+		return false;
+	return cur->bc_tp->t_log_res - overhead >
+		cur->bc_private.a.priv.refc.nr_ops * 32;
+}
+
+/*
+ * Adjust the refcounts of middle extents.  At this point we should have
+ * split extents that crossed the adjustment range; merged with adjacent
+ * extents; and updated agbno/aglen to reflect the merges.  Therefore,
+ * all we have to do is update the extents inside [agbno, agbno + aglen].
+ */
+STATIC int
+xfs_refcount_adjust_extents(
+	struct xfs_btree_cur	*cur,
+	xfs_agblock_t		agbno,
+	xfs_extlen_t		aglen,
+	xfs_extlen_t		*adjusted,
+	enum xfs_refc_adjust_op	adj,
+	struct xfs_defer_ops	*dfops,
+	struct xfs_owner_info	*oinfo)
+{
+	struct xfs_refcount_irec	ext, tmp;
+	int				error;
+	int				found_rec, found_tmp;
+	xfs_fsblock_t			fsbno;
+
+	/* Merging did all the work already. */
+	if (aglen == 0)
+		return 0;
+
+	error = xfs_refcountbt_lookup_ge(cur, agbno, &found_rec);
+	if (error)
+		goto out_error;
+
+	while (aglen > 0 && xfs_refcount_still_have_space(cur)) {
+		error = xfs_refcountbt_get_rec(cur, &ext, &found_rec);
+		if (error)
+			goto out_error;
+		if (!found_rec) {
+			ext.rc_startblock = cur->bc_mp->m_sb.sb_agblocks;
+			ext.rc_blockcount = 0;
+			ext.rc_refcount = 0;
+		}
+
+		/*
+		 * Deal with a hole in the refcount tree; if a file maps to
+		 * these blocks and there's no refcountbt record, pretend that
+		 * there is one with refcount == 1.
+		 */
+		if (ext.rc_startblock != agbno) {
+			tmp.rc_startblock = agbno;
+			tmp.rc_blockcount = min(aglen,
+					ext.rc_startblock - agbno);
+			tmp.rc_refcount = 1 + adj;
+			trace_xfs_refcount_modify_extent(cur->bc_mp,
+					cur->bc_private.a.agno, &tmp);
+
+			/*
+			 * Either cover the hole (increment) or
+			 * delete the range (decrement).
+			 */
+			if (tmp.rc_refcount) {
+				error = xfs_refcountbt_insert(cur, &tmp,
+						&found_tmp);
+				if (error)
+					goto out_error;
+				XFS_WANT_CORRUPTED_GOTO(cur->bc_mp,
+						found_tmp == 1, out_error);
+				cur->bc_private.a.priv.refc.nr_ops++;
+			} else {
+				fsbno = XFS_AGB_TO_FSB(cur->bc_mp,
+						cur->bc_private.a.agno,
+						tmp.rc_startblock);
+				xfs_bmap_add_free(cur->bc_mp, dfops, fsbno,
+						tmp.rc_blockcount, oinfo);
+			}
+
+			(*adjusted) += tmp.rc_blockcount;
+			agbno += tmp.rc_blockcount;
+			aglen -= tmp.rc_blockcount;
+
+			error = xfs_refcountbt_lookup_ge(cur, agbno,
+					&found_rec);
+			if (error)
+				goto out_error;
+		}
+
+		/* Stop if there's nothing left to modify */
+		if (aglen == 0 || !xfs_refcount_still_have_space(cur))
+			break;
+
+		/*
+		 * Adjust the reference count and either update the tree
+		 * (incr) or free the blocks (decr).
+		 */
+		if (ext.rc_refcount == MAXREFCOUNT)
+			goto skip;
+		ext.rc_refcount += adj;
+		trace_xfs_refcount_modify_extent(cur->bc_mp,
+				cur->bc_private.a.agno, &ext);
+		if (ext.rc_refcount > 1) {
+			error = xfs_refcountbt_update(cur, &ext);
+			if (error)
+				goto out_error;
+			cur->bc_private.a.priv.refc.nr_ops++;
+		} else if (ext.rc_refcount == 1) {
+			error = xfs_refcountbt_delete(cur, &found_rec);
+			if (error)
+				goto out_error;
+			XFS_WANT_CORRUPTED_GOTO(cur->bc_mp,
+					found_rec == 1, out_error);
+			cur->bc_private.a.priv.refc.nr_ops++;
+			goto advloop;
+		} else {
+			fsbno = XFS_AGB_TO_FSB(cur->bc_mp,
+					cur->bc_private.a.agno,
+					ext.rc_startblock);
+			xfs_bmap_add_free(cur->bc_mp, dfops, fsbno,
+					ext.rc_blockcount, oinfo);
+		}
+
+skip:
+		error = xfs_btree_increment(cur, 0, &found_rec);
+		if (error)
+			goto out_error;
+
+advloop:
+		(*adjusted) += ext.rc_blockcount;
+		agbno += ext.rc_blockcount;
+		aglen -= ext.rc_blockcount;
+	}
+
+	return error;
+out_error:
+	trace_xfs_refcount_modify_extent_error(cur->bc_mp,
+			cur->bc_private.a.agno, error, _RET_IP_);
+	return error;
+}
+
+/* Adjust the reference count of a range of AG blocks. */
+STATIC int
+xfs_refcount_adjust(
+	struct xfs_btree_cur	*cur,
+	xfs_agblock_t		agbno,
+	xfs_extlen_t		aglen,
+	xfs_extlen_t		*adjusted,
+	enum xfs_refc_adjust_op	adj,
+	struct xfs_defer_ops	*dfops,
+	struct xfs_owner_info	*oinfo)
+{
+	xfs_extlen_t		orig_aglen;
+	bool			shape_changed;
+	int			shape_changes = 0;
+	int			error;
+
+	*adjusted = 0;
+	switch (adj) {
+	case XFS_REFCOUNT_ADJUST_INCREASE:
+		trace_xfs_refcount_increase(cur->bc_mp, cur->bc_private.a.agno,
+				agbno, aglen);
+		break;
+	case XFS_REFCOUNT_ADJUST_DECREASE:
+		trace_xfs_refcount_decrease(cur->bc_mp, cur->bc_private.a.agno,
+				agbno, aglen);
+		break;
+	default:
+		ASSERT(0);
+	}
+
+	/*
+	 * Ensure that no refcount extents cross the boundary of the
+	 * adjustment range.
+	 */
+	error = xfs_refcount_split_extent(cur, agbno, &shape_changed);
+	if (error)
+		goto out_error;
+	if (shape_changed)
+		shape_changes++;
+
+	error = xfs_refcount_split_extent(cur, agbno + aglen, &shape_changed);
+	if (error)
+		goto out_error;
+	if (shape_changed)
+		shape_changes++;
+
+	/*
+	 * Try to merge with the left or right extents of the range.
+	 */
+	orig_aglen = aglen;
+	error = xfs_refcount_merge_extents(cur, &agbno, &aglen, adj,
+			&shape_changed);
+	if (error)
+		goto out_error;
+	if (shape_changed)
+		shape_changes++;
+	(*adjusted) += orig_aglen - aglen;
+	if (shape_changes)
+		cur->bc_private.a.priv.refc.shape_changes++;
+
+	/* Now that we've taken care of the ends, adjust the middle extents */
+	error = xfs_refcount_adjust_extents(cur, agbno, aglen, adjusted, adj,
+			dfops, oinfo);
+	if (error)
+		goto out_error;
+
+	return 0;
+
+out_error:
+	trace_xfs_refcount_adjust_error(cur->bc_mp, cur->bc_private.a.agno,
+			error, _RET_IP_);
+	return error;
+}
diff --git a/fs/xfs/xfs_error.h b/fs/xfs/xfs_error.h
index 6bc614c..ffeffb5 100644
--- a/fs/xfs/xfs_error.h
+++ b/fs/xfs/xfs_error.h
@@ -92,7 +92,8 @@ extern void xfs_verifier_error(struct xfs_buf *bp);
 #define XFS_ERRTAG_BMAPIFORMAT				21
 #define XFS_ERRTAG_FREE_EXTENT				22
 #define XFS_ERRTAG_RMAP_FINISH_ONE			23
-#define XFS_ERRTAG_MAX					24
+#define XFS_ERRTAG_REFCOUNT_CONTINUE_UPDATE		24
+#define XFS_ERRTAG_MAX					25
 
 /*
  * Random factors for above tags, 1 means always, 2 means 1/2 time, etc.
@@ -121,6 +122,7 @@ extern void xfs_verifier_error(struct xfs_buf *bp);
 #define	XFS_RANDOM_BMAPIFORMAT				XFS_RANDOM_DEFAULT
 #define XFS_RANDOM_FREE_EXTENT				1
 #define XFS_RANDOM_RMAP_FINISH_ONE			1
+#define XFS_RANDOM_REFCOUNT_CONTINUE_UPDATE		1
 
 #ifdef DEBUG
 extern int xfs_error_test_active;


^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 064/119] xfs: connect refcount adjust functions to upper layers
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (62 preceding siblings ...)
  2016-06-17  1:24 ` [PATCH 063/119] xfs: adjust refcount of an extent of blocks in refcount btree Darrick J. Wong
@ 2016-06-17  1:24 ` Darrick J. Wong
  2016-06-17  1:24 ` [PATCH 065/119] xfs: adjust refcount when unmapping file blocks Darrick J. Wong
                   ` (54 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:24 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Plumb in the upper level interface to schedule and finish deferred
refcount operations via the deferred ops mechanism.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_defer.h    |    1 
 fs/xfs/libxfs/xfs_refcount.c |  171 ++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_refcount.h |   12 +++
 fs/xfs/xfs_defer_item.c      |  132 ++++++++++++++++++++++++++++++++
 fs/xfs/xfs_error.h           |    4 +
 fs/xfs/xfs_log_recover.c     |   79 +++++++++++++++++++
 fs/xfs/xfs_trace.h           |    3 +
 fs/xfs/xfs_trans.h           |    6 +
 fs/xfs/xfs_trans_refcount.c  |   10 ++
 9 files changed, 411 insertions(+), 7 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
index 920642e62..4081b00 100644
--- a/fs/xfs/libxfs/xfs_defer.h
+++ b/fs/xfs/libxfs/xfs_defer.h
@@ -51,6 +51,7 @@ struct xfs_defer_pending {
  * find all the space it needs.
  */
 enum xfs_defer_ops_type {
+	XFS_DEFER_OPS_TYPE_REFCOUNT,
 	XFS_DEFER_OPS_TYPE_RMAP,
 	XFS_DEFER_OPS_TYPE_FREE,
 	XFS_DEFER_OPS_TYPE_MAX,
diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
index d13393b..bfbdbad 100644
--- a/fs/xfs/libxfs/xfs_refcount.c
+++ b/fs/xfs/libxfs/xfs_refcount.c
@@ -958,3 +958,174 @@ out_error:
 			error, _RET_IP_);
 	return error;
 }
+
+/* Clean up after calling xfs_refcount_finish_one. */
+void
+xfs_refcount_finish_one_cleanup(
+	struct xfs_trans	*tp,
+	struct xfs_btree_cur	*rcur,
+	int			error)
+{
+	struct xfs_buf		*agbp;
+
+	if (rcur == NULL)
+		return;
+	agbp = rcur->bc_private.a.agbp;
+	xfs_btree_del_cursor(rcur, error ? XFS_BTREE_ERROR : XFS_BTREE_NOERROR);
+	xfs_trans_brelse(tp, agbp);
+}
+
+/*
+ * Process one of the deferred refcount operations.  We pass back the
+ * btree cursor to maintain our lock on the btree between calls.
+ * This saves time and eliminates a buffer deadlock between the
+ * superblock and the AGF because we'll always grab them in the same
+ * order.
+ */
+int
+xfs_refcount_finish_one(
+	struct xfs_trans		*tp,
+	struct xfs_defer_ops		*dfops,
+	enum xfs_refcount_intent_type	type,
+	xfs_fsblock_t			startblock,
+	xfs_extlen_t			blockcount,
+	xfs_extlen_t			*adjusted,
+	struct xfs_btree_cur		**pcur)
+{
+	struct xfs_mount		*mp = tp->t_mountp;
+	struct xfs_btree_cur		*rcur;
+	struct xfs_buf			*agbp = NULL;
+	int				error = 0;
+	xfs_agnumber_t			agno;
+	xfs_agblock_t			bno;
+	unsigned long			nr_ops = 0;
+	int				shape_changes = 0;
+
+	agno = XFS_FSB_TO_AGNO(mp, startblock);
+	ASSERT(agno != NULLAGNUMBER);
+	bno = XFS_FSB_TO_AGBNO(mp, startblock);
+
+	trace_xfs_refcount_deferred(mp, XFS_FSB_TO_AGNO(mp, startblock),
+			type, XFS_FSB_TO_AGBNO(mp, startblock),
+			blockcount);
+
+	if (XFS_TEST_ERROR(false, mp,
+			XFS_ERRTAG_REFCOUNT_FINISH_ONE,
+			XFS_RANDOM_REFCOUNT_FINISH_ONE))
+		return -EIO;
+
+	/*
+	 * If we haven't gotten a cursor or the cursor AG doesn't match
+	 * the startblock, get one now.
+	 */
+	rcur = *pcur;
+	if (rcur != NULL && rcur->bc_private.a.agno != agno) {
+		nr_ops = rcur->bc_private.a.priv.refc.nr_ops;
+		shape_changes = rcur->bc_private.a.priv.refc.shape_changes;
+		xfs_refcount_finish_one_cleanup(tp, rcur, 0);
+		rcur = NULL;
+		*pcur = NULL;
+	}
+	if (rcur == NULL) {
+		error = xfs_alloc_read_agf(tp->t_mountp, tp, agno,
+				XFS_ALLOC_FLAG_FREEING, &agbp);
+		if (error)
+			return error;
+		if (!agbp)
+			return -EFSCORRUPTED;
+
+		rcur = xfs_refcountbt_init_cursor(mp, tp, agbp, agno, dfops);
+		if (!rcur) {
+			error = -ENOMEM;
+			goto out_cur;
+		}
+		rcur->bc_private.a.priv.refc.nr_ops = nr_ops;
+		rcur->bc_private.a.priv.refc.shape_changes = shape_changes;
+	}
+	*pcur = rcur;
+
+	switch (type) {
+	case XFS_REFCOUNT_INCREASE:
+		error = xfs_refcount_adjust(rcur, bno, blockcount, adjusted,
+			XFS_REFCOUNT_ADJUST_INCREASE, dfops, NULL);
+		break;
+	case XFS_REFCOUNT_DECREASE:
+		error = xfs_refcount_adjust(rcur, bno, blockcount, adjusted,
+			XFS_REFCOUNT_ADJUST_DECREASE, dfops, NULL);
+		break;
+	default:
+		ASSERT(0);
+		error = -EFSCORRUPTED;
+	}
+	if (!error && *adjusted != blockcount)
+		trace_xfs_refcount_finish_one_leftover(mp, agno, type,
+				bno, blockcount, *adjusted);
+	return error;
+
+out_cur:
+	xfs_trans_brelse(tp, agbp);
+
+	return error;
+}
+
+/*
+ * Record a refcount intent for later processing.
+ */
+static int
+__xfs_refcount_add(
+	struct xfs_mount		*mp,
+	struct xfs_defer_ops		*dfops,
+	struct xfs_refcount_intent	*ri)
+{
+	struct xfs_refcount_intent	*new;
+
+	if (!xfs_sb_version_hasreflink(&mp->m_sb))
+		return 0;
+
+	trace_xfs_refcount_defer(mp, XFS_FSB_TO_AGNO(mp, ri->ri_startblock),
+			ri->ri_type, XFS_FSB_TO_AGBNO(mp, ri->ri_startblock),
+			ri->ri_blockcount);
+
+	new = kmem_zalloc(sizeof(struct xfs_refcount_intent),
+			KM_SLEEP | KM_NOFS);
+	*new = *ri;
+
+	xfs_defer_add(dfops, XFS_DEFER_OPS_TYPE_REFCOUNT, &new->ri_list);
+	return 0;
+}
+
+/*
+ * Increase the reference count of the blocks backing a file's extent.
+ */
+int
+xfs_refcount_increase_extent(
+	struct xfs_mount		*mp,
+	struct xfs_defer_ops		*dfops,
+	struct xfs_bmbt_irec		*PREV)
+{
+	struct xfs_refcount_intent	ri;
+
+	ri.ri_type = XFS_REFCOUNT_INCREASE;
+	ri.ri_startblock = PREV->br_startblock;
+	ri.ri_blockcount = PREV->br_blockcount;
+
+	return __xfs_refcount_add(mp, dfops, &ri);
+}
+
+/*
+ * Decrease the reference count of the blocks backing a file's extent.
+ */
+int
+xfs_refcount_decrease_extent(
+	struct xfs_mount		*mp,
+	struct xfs_defer_ops		*dfops,
+	struct xfs_bmbt_irec		*PREV)
+{
+	struct xfs_refcount_intent	ri;
+
+	ri.ri_type = XFS_REFCOUNT_DECREASE;
+	ri.ri_startblock = PREV->br_startblock;
+	ri.ri_blockcount = PREV->br_blockcount;
+
+	return __xfs_refcount_add(mp, dfops, &ri);
+}
diff --git a/fs/xfs/libxfs/xfs_refcount.h b/fs/xfs/libxfs/xfs_refcount.h
index 0b36c1d..92c05ea 100644
--- a/fs/xfs/libxfs/xfs_refcount.h
+++ b/fs/xfs/libxfs/xfs_refcount.h
@@ -41,4 +41,16 @@ struct xfs_refcount_intent {
 	xfs_extlen_t				ri_blockcount;
 };
 
+extern int xfs_refcount_increase_extent(struct xfs_mount *mp,
+		struct xfs_defer_ops *dfops, struct xfs_bmbt_irec *irec);
+extern int xfs_refcount_decrease_extent(struct xfs_mount *mp,
+		struct xfs_defer_ops *dfops, struct xfs_bmbt_irec *irec);
+
+extern void xfs_refcount_finish_one_cleanup(struct xfs_trans *tp,
+		struct xfs_btree_cur *rcur, int error);
+extern int xfs_refcount_finish_one(struct xfs_trans *tp,
+		struct xfs_defer_ops *dfops, enum xfs_refcount_intent_type type,
+		xfs_fsblock_t startblock, xfs_extlen_t blockcount,
+		xfs_extlen_t *adjusted, struct xfs_btree_cur **pcur);
+
 #endif	/* __XFS_REFCOUNT_H__ */
diff --git a/fs/xfs/xfs_defer_item.c b/fs/xfs/xfs_defer_item.c
index 9ed060d..2cac94f 100644
--- a/fs/xfs/xfs_defer_item.c
+++ b/fs/xfs/xfs_defer_item.c
@@ -33,6 +33,8 @@
 #include "xfs_extfree_item.h"
 #include "xfs_rmap_btree.h"
 #include "xfs_rmap_item.h"
+#include "xfs_refcount.h"
+#include "xfs_refcount_item.h"
 
 /* Extent Freeing */
 
@@ -263,12 +265,142 @@ const struct xfs_defer_op_type xfs_rmap_update_defer_type = {
 	.cancel_item	= xfs_rmap_update_cancel_item,
 };
 
+/* Reference Counting */
+
+/* Sort refcount intents by AG. */
+static int
+xfs_refcount_update_diff_items(
+	void				*priv,
+	struct list_head		*a,
+	struct list_head		*b)
+{
+	struct xfs_mount		*mp = priv;
+	struct xfs_refcount_intent	*ra;
+	struct xfs_refcount_intent	*rb;
+
+	ra = container_of(a, struct xfs_refcount_intent, ri_list);
+	rb = container_of(b, struct xfs_refcount_intent, ri_list);
+	return  XFS_FSB_TO_AGNO(mp, ra->ri_startblock) -
+		XFS_FSB_TO_AGNO(mp, rb->ri_startblock);
+}
+
+/* Get a CUI. */
+STATIC void *
+xfs_refcount_update_create_intent(
+	struct xfs_trans		*tp,
+	unsigned int			count)
+{
+	return xfs_trans_get_cui(tp, count);
+}
+
+/* Log refcount updates in the intent item. */
+STATIC void
+xfs_refcount_update_log_item(
+	struct xfs_trans		*tp,
+	void				*intent,
+	struct list_head		*item)
+{
+	struct xfs_refcount_intent	*refc;
+
+	refc = container_of(item, struct xfs_refcount_intent, ri_list);
+	xfs_trans_log_start_refcount_update(tp, intent, refc->ri_type,
+			refc->ri_startblock,
+			refc->ri_blockcount);
+}
+
+/* Get a CUD so we can process all the deferred refcount updates. */
+STATIC void *
+xfs_refcount_update_create_done(
+	struct xfs_trans		*tp,
+	void				*intent,
+	unsigned int			count)
+{
+	return xfs_trans_get_cud(tp, intent, count);
+}
+
+/* Process a deferred refcount update. */
+STATIC int
+xfs_refcount_update_finish_item(
+	struct xfs_trans		*tp,
+	struct xfs_defer_ops		*dop,
+	struct list_head		*item,
+	void				*done_item,
+	void				**state)
+{
+	struct xfs_refcount_intent	*refc;
+	xfs_extlen_t			adjusted;
+	int				error;
+
+	refc = container_of(item, struct xfs_refcount_intent, ri_list);
+	error = xfs_trans_log_finish_refcount_update(tp, done_item, dop,
+			refc->ri_type,
+			refc->ri_startblock,
+			refc->ri_blockcount,
+			&adjusted,
+			(struct xfs_btree_cur **)state);
+	/* Did we run out of reservation?  Requeue what we didn't finish. */
+	if (!error && adjusted < refc->ri_blockcount) {
+		ASSERT(refc->ri_type == XFS_REFCOUNT_INCREASE ||
+		       refc->ri_type == XFS_REFCOUNT_DECREASE);
+		refc->ri_startblock += adjusted;
+		refc->ri_blockcount -= adjusted;
+		return -EAGAIN;
+	}
+	kmem_free(refc);
+	return error;
+}
+
+/* Clean up after processing deferred refcounts. */
+STATIC void
+xfs_refcount_update_finish_cleanup(
+	struct xfs_trans	*tp,
+	void			*state,
+	int			error)
+{
+	struct xfs_btree_cur	*rcur = state;
+
+	xfs_refcount_finish_one_cleanup(tp, rcur, error);
+}
+
+/* Abort all pending CUIs. */
+STATIC void
+xfs_refcount_update_abort_intent(
+	void				*intent)
+{
+	xfs_cui_release(intent);
+}
+
+/* Cancel a deferred refcount update. */
+STATIC void
+xfs_refcount_update_cancel_item(
+	struct list_head		*item)
+{
+	struct xfs_refcount_intent	*refc;
+
+	refc = container_of(item, struct xfs_refcount_intent, ri_list);
+	kmem_free(refc);
+}
+
+const struct xfs_defer_op_type xfs_refcount_update_defer_type = {
+	.type		= XFS_DEFER_OPS_TYPE_REFCOUNT,
+	.max_items	= XFS_CUI_MAX_FAST_EXTENTS,
+	.diff_items	= xfs_refcount_update_diff_items,
+	.create_intent	= xfs_refcount_update_create_intent,
+	.abort_intent	= xfs_refcount_update_abort_intent,
+	.log_item	= xfs_refcount_update_log_item,
+	.create_done	= xfs_refcount_update_create_done,
+	.finish_item	= xfs_refcount_update_finish_item,
+	.finish_cleanup = xfs_refcount_update_finish_cleanup,
+	.cancel_item	= xfs_refcount_update_cancel_item,
+};
+
 /* Deferred Item Initialization */
 
 /* Initialize the deferred operation types. */
 void
 xfs_defer_init_types(void)
 {
+	xfs_defer_init_op_type(&xfs_refcount_update_defer_type);
 	xfs_defer_init_op_type(&xfs_rmap_update_defer_type);
 	xfs_defer_init_op_type(&xfs_extent_free_defer_type);
 }
diff --git a/fs/xfs/xfs_error.h b/fs/xfs/xfs_error.h
index ffeffb5..83d7b62 100644
--- a/fs/xfs/xfs_error.h
+++ b/fs/xfs/xfs_error.h
@@ -93,7 +93,8 @@ extern void xfs_verifier_error(struct xfs_buf *bp);
 #define XFS_ERRTAG_FREE_EXTENT				22
 #define XFS_ERRTAG_RMAP_FINISH_ONE			23
 #define XFS_ERRTAG_REFCOUNT_CONTINUE_UPDATE		24
-#define XFS_ERRTAG_MAX					25
+#define XFS_ERRTAG_REFCOUNT_FINISH_ONE			25
+#define XFS_ERRTAG_MAX					26
 
 /*
  * Random factors for above tags, 1 means always, 2 means 1/2 time, etc.
@@ -123,6 +124,7 @@ extern void xfs_verifier_error(struct xfs_buf *bp);
 #define XFS_RANDOM_FREE_EXTENT				1
 #define XFS_RANDOM_RMAP_FINISH_ONE			1
 #define XFS_RANDOM_REFCOUNT_CONTINUE_UPDATE		1
+#define XFS_RANDOM_REFCOUNT_FINISH_ONE			1
 
 #ifdef DEBUG
 extern int xfs_error_test_active;
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index e0a470a..ee37dc5 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -47,6 +47,7 @@
 #include "xfs_rmap_item.h"
 #include "xfs_rmap_btree.h"
 #include "xfs_refcount_item.h"
+#include "xfs_refcount.h"
 
 #define BLK_AVG(blk1, blk2)	((blk1+blk2) >> 1)
 
@@ -4833,6 +4834,15 @@ xlog_recover_process_cui(
 	struct xfs_phys_extent		*refc;
 	xfs_fsblock_t			startblock_fsb;
 	bool				op_ok;
+	struct xfs_cud_log_item		*cudp;
+	struct xfs_trans		*tp;
+	struct xfs_btree_cur		*rcur = NULL;
+	enum xfs_refcount_intent_type	type;
+	xfs_fsblock_t			firstfsb;
+	xfs_extlen_t			adjusted;
+	struct xfs_bmbt_irec		irec;
+	struct xfs_defer_ops		dfops;
+	bool				requeue_only = false;
 
 	ASSERT(!test_bit(XFS_CUI_RECOVERED, &cuip->cui_flags));
 
@@ -4871,9 +4881,74 @@ xlog_recover_process_cui(
 		}
 	}
 
-	/* XXX: do nothing for now */
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
+	if (error)
+		return error;
+	cudp = xfs_trans_get_cud(tp, cuip, cuip->cui_format.cui_nextents);
+
+	xfs_defer_init(&dfops, &firstfsb);
+	for (i = 0; i < cuip->cui_format.cui_nextents; i++) {
+		refc = &(cuip->cui_format.cui_extents[i]);
+		switch (refc->pe_flags & XFS_REFCOUNT_EXTENT_TYPE_MASK) {
+		case XFS_REFCOUNT_EXTENT_INCREASE:
+			type = XFS_REFCOUNT_INCREASE;
+			break;
+		case XFS_REFCOUNT_EXTENT_DECREASE:
+			type = XFS_REFCOUNT_DECREASE;
+			break;
+		case XFS_REFCOUNT_EXTENT_ALLOC_COW:
+			type = XFS_REFCOUNT_ALLOC_COW;
+			break;
+		case XFS_REFCOUNT_EXTENT_FREE_COW:
+			type = XFS_REFCOUNT_FREE_COW;
+			break;
+		default:
+			error = -EFSCORRUPTED;
+			goto abort_error;
+		}
+		if (requeue_only)
+			adjusted = 0;
+		else
+			error = xfs_trans_log_finish_refcount_update(tp, cudp,
+				&dfops, type, refc->pe_startblock, refc->pe_len,
+				&adjusted, &rcur);
+		if (error)
+			goto abort_error;
+
+		/* Requeue what we didn't finish. */
+		if (adjusted < refc->pe_len) {
+			irec.br_startblock = refc->pe_startblock + adjusted;
+			irec.br_blockcount = refc->pe_len - adjusted;
+			switch (type) {
+			case XFS_REFCOUNT_INCREASE:
+				error = xfs_refcount_increase_extent(
+						tp->t_mountp, &dfops, &irec);
+				break;
+			case XFS_REFCOUNT_DECREASE:
+				error = xfs_refcount_decrease_extent(
+						tp->t_mountp, &dfops, &irec);
+				break;
+			default:
+				ASSERT(0);
+			}
+			if (error)
+				goto abort_error;
+			requeue_only = true;
+		}
+	}
+
+	xfs_refcount_finish_one_cleanup(tp, rcur, error);
+	error = xfs_defer_finish(&tp, &dfops, NULL);
+	if (error)
+		goto abort_error;
 	set_bit(XFS_CUI_RECOVERED, &cuip->cui_flags);
-	xfs_cui_release(cuip);
+	error = xfs_trans_commit(tp);
+	return error;
+
+abort_error:
+	xfs_refcount_finish_one_cleanup(tp, rcur, error);
+	xfs_defer_cancel(&dfops);
+	xfs_trans_cancel(tp);
 	return error;
 }
 
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 1f6cee0..8366102 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -2905,6 +2905,9 @@ DEFINE_REFCOUNT_DOUBLE_EXTENT_EVENT(xfs_refcount_rec_order_error);
 DEFINE_AG_EXTENT_EVENT(xfs_refcount_find_shared);
 DEFINE_AG_EXTENT_EVENT(xfs_refcount_find_shared_result);
 DEFINE_AG_ERROR_EVENT(xfs_refcount_find_shared_error);
+#define DEFINE_REFCOUNT_DEFERRED_EVENT DEFINE_PHYS_EXTENT_DEFERRED_EVENT
+DEFINE_REFCOUNT_DEFERRED_EVENT(xfs_refcount_defer);
+DEFINE_REFCOUNT_DEFERRED_EVENT(xfs_refcount_deferred);
 
 TRACE_EVENT(xfs_refcount_finish_one_leftover,
 	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 2b197fd..6b6cb4a 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -33,6 +33,7 @@ struct xfs_trans;
 struct xfs_trans_res;
 struct xfs_dquot_acct;
 struct xfs_busy_extent;
+struct xfs_defer_ops;
 
 typedef struct xfs_log_item {
 	struct list_head		li_ail;		/* AIL pointers */
@@ -264,8 +265,9 @@ void xfs_trans_log_start_refcount_update(struct xfs_trans *tp,
 struct xfs_cud_log_item *xfs_trans_get_cud(struct xfs_trans *tp,
 		struct xfs_cui_log_item *cuip, uint nextents);
 int xfs_trans_log_finish_refcount_update(struct xfs_trans *tp,
-		struct xfs_cud_log_item *cudp,
+		struct xfs_cud_log_item *cudp, struct xfs_defer_ops *dfops,
 		enum xfs_refcount_intent_type type, xfs_fsblock_t startblock,
-		xfs_extlen_t blockcount, struct xfs_btree_cur **pcur);
+		xfs_extlen_t blockcount, xfs_extlen_t *adjusted,
+		struct xfs_btree_cur **pcur);
 
 #endif	/* __XFS_TRANS_H__ */
diff --git a/fs/xfs/xfs_trans_refcount.c b/fs/xfs/xfs_trans_refcount.c
index 32701e4..de30c5a 100644
--- a/fs/xfs/xfs_trans_refcount.c
+++ b/fs/xfs/xfs_trans_refcount.c
@@ -142,17 +142,19 @@ int
 xfs_trans_log_finish_refcount_update(
 	struct xfs_trans		*tp,
 	struct xfs_cud_log_item		*cudp,
+	struct xfs_defer_ops		*dop,
 	enum xfs_refcount_intent_type	type,
 	xfs_fsblock_t			startblock,
 	xfs_extlen_t			blockcount,
+	xfs_extlen_t			*adjusted,
 	struct xfs_btree_cur		**pcur)
 {
 	uint				next_extent;
 	struct xfs_phys_extent		*refc;
 	int				error;
 
-	/* XXX: leave this empty for now */
-	error = -EFSCORRUPTED;
+	error = xfs_refcount_finish_one(tp, dop, type, startblock,
+			blockcount, adjusted, pcur);
 
 	/*
 	 * Mark the transaction dirty, even on error. This ensures the
@@ -187,6 +189,10 @@ xfs_trans_log_finish_refcount_update(
 		ASSERT(0);
 	}
 	cudp->cud_next_extent++;
+	if (!error && *adjusted != blockcount) {
+		refc->pe_len = *adjusted;
+		cudp->cud_format.cud_nextents = cudp->cud_next_extent;
+	}
 
 	return error;
 }



* [PATCH 065/119] xfs: adjust refcount when unmapping file blocks
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (63 preceding siblings ...)
  2016-06-17  1:24 ` [PATCH 064/119] xfs: connect refcount adjust functions to upper layers Darrick J. Wong
@ 2016-06-17  1:24 ` Darrick J. Wong
  2016-06-17  1:24 ` [PATCH 066/119] xfs: add refcount btree block detection to log recovery Darrick J. Wong
                   ` (53 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:24 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

When we're unmapping blocks from a reflinked file, decrease the
refcount of the affected blocks and free the extents that are no
longer in use.

v2: Use deferred ops system to avoid deadlocks and running out of
transaction reservation.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c |   14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 972dfc2..9044e39 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -48,6 +48,7 @@
 #include "xfs_filestream.h"
 #include "xfs_rmap_btree.h"
 #include "xfs_ag_resv.h"
+#include "xfs_refcount.h"
 
 
 kmem_zone_t		*xfs_bmap_free_item_zone;
@@ -5061,9 +5062,16 @@ xfs_bmap_del_extent(
 	/*
 	 * If we need to, add to list of extents to delete.
 	 */
-	if (do_fx)
-		xfs_bmap_add_free(mp, dfops, del->br_startblock,
-				  del->br_blockcount, NULL);
+	if (do_fx) {
+		if (xfs_is_reflink_inode(ip) && whichfork == XFS_DATA_FORK) {
+			error = xfs_refcount_decrease_extent(mp, dfops, del);
+			if (error)
+				goto done;
+		} else
+			xfs_bmap_add_free(mp, dfops, del->br_startblock,
+					  del->br_blockcount, NULL);
+	}
+
 	/*
 	 * Adjust inode # blocks in the file.
 	 */



* [PATCH 066/119] xfs: add refcount btree block detection to log recovery
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (64 preceding siblings ...)
  2016-06-17  1:24 ` [PATCH 065/119] xfs: adjust refcount when unmapping file blocks Darrick J. Wong
@ 2016-06-17  1:24 ` Darrick J. Wong
  2016-06-17  1:25 ` [PATCH 067/119] xfs: refcount btree requires more reserved space Darrick J. Wong
                   ` (52 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:24 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Teach log recovery how to deal with refcount btree blocks.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_log_recover.c |    4 ++++
 1 file changed, 4 insertions(+)


diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index ee37dc5..2235d94 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -2238,6 +2238,7 @@ xlog_recover_get_buf_lsn(
 	case XFS_ABTB_MAGIC:
 	case XFS_ABTC_MAGIC:
 	case XFS_RMAP_CRC_MAGIC:
+	case XFS_REFC_CRC_MAGIC:
 	case XFS_IBT_CRC_MAGIC:
 	case XFS_IBT_MAGIC: {
 		struct xfs_btree_block *btb = blk;
@@ -2409,6 +2410,9 @@ xlog_recover_validate_buf_type(
 		case XFS_RMAP_CRC_MAGIC:
 			bp->b_ops = &xfs_rmapbt_buf_ops;
 			break;
+		case XFS_REFC_CRC_MAGIC:
+			bp->b_ops = &xfs_refcountbt_buf_ops;
+			break;
 		default:
 			xfs_warn(mp, "Bad btree block magic!");
 			ASSERT(0);



* [PATCH 067/119] xfs: refcount btree requires more reserved space
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (65 preceding siblings ...)
  2016-06-17  1:24 ` [PATCH 066/119] xfs: add refcount btree block detection to log recovery Darrick J. Wong
@ 2016-06-17  1:25 ` Darrick J. Wong
  2016-06-17  1:25 ` [PATCH 068/119] xfs: introduce reflink utility functions Darrick J. Wong
                   ` (51 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:25 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

The reference count btree is allocated from the free space, which
means that we have to ensure that an AG can't run out of free space
while performing a refcount operation.  In the pathological case each
AG block has its own refcntbt record, so we have to keep that much
space available.

v2: Calculate the maximum possible size of the rmap and refcount
btrees based on minimally-full btree blocks.  This increases the
per-AG block reservations to handle the worst case btree size.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_alloc.c          |    3 +++
 fs/xfs/libxfs/xfs_refcount_btree.c |   23 +++++++++++++++++++++++
 fs/xfs/libxfs/xfs_refcount_btree.h |    4 ++++
 3 files changed, 30 insertions(+)


diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index be5d0df..c46db76 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -38,6 +38,7 @@
 #include "xfs_buf_item.h"
 #include "xfs_log.h"
 #include "xfs_ag_resv.h"
+#include "xfs_refcount_btree.h"
 
 struct workqueue_struct *xfs_alloc_wq;
 
@@ -138,6 +139,8 @@ xfs_alloc_ag_max_usable(struct xfs_mount *mp)
 		/* rmap root block + full tree split on full AG */
 		blocks += 1 + (2 * mp->m_ag_maxlevels) - 1;
 	}
+	if (xfs_sb_version_hasreflink(&mp->m_sb))
+		blocks += xfs_refcountbt_max_size(mp);
 
 	return mp->m_sb.sb_agblocks - blocks;
 }
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
index 7093c71..a944fca 100644
--- a/fs/xfs/libxfs/xfs_refcount_btree.c
+++ b/fs/xfs/libxfs/xfs_refcount_btree.c
@@ -373,3 +373,26 @@ xfs_refcountbt_compute_maxlevels(
 	mp->m_refc_maxlevels = xfs_btree_compute_maxlevels(mp,
 			mp->m_refc_mnr, mp->m_sb.sb_agblocks);
 }
+
+/* Calculate the refcount btree size for some records. */
+xfs_extlen_t
+xfs_refcountbt_calc_size(
+	struct xfs_mount	*mp,
+	unsigned long long	len)
+{
+	return xfs_btree_calc_size(mp, mp->m_refc_mnr, len);
+}
+
+/*
+ * Calculate the maximum refcount btree size.
+ */
+xfs_extlen_t
+xfs_refcountbt_max_size(
+	struct xfs_mount	*mp)
+{
+	/* Bail out if we're uninitialized, which can happen in mkfs. */
+	if (mp->m_refc_mxr[0] == 0)
+		return 0;
+
+	return xfs_refcountbt_calc_size(mp, mp->m_sb.sb_agblocks);
+}
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.h b/fs/xfs/libxfs/xfs_refcount_btree.h
index 9e9ad7c..780b02f 100644
--- a/fs/xfs/libxfs/xfs_refcount_btree.h
+++ b/fs/xfs/libxfs/xfs_refcount_btree.h
@@ -64,4 +64,8 @@ extern int xfs_refcountbt_maxrecs(struct xfs_mount *mp, int blocklen,
 		bool leaf);
 extern void xfs_refcountbt_compute_maxlevels(struct xfs_mount *mp);
 
+extern xfs_extlen_t xfs_refcountbt_calc_size(struct xfs_mount *mp,
+		unsigned long long len);
+extern xfs_extlen_t xfs_refcountbt_max_size(struct xfs_mount *mp);
+
 #endif	/* __XFS_REFCOUNT_BTREE_H__ */



* [PATCH 068/119] xfs: introduce reflink utility functions
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (66 preceding siblings ...)
  2016-06-17  1:25 ` [PATCH 067/119] xfs: refcount btree requires more reserved space Darrick J. Wong
@ 2016-06-17  1:25 ` Darrick J. Wong
  2016-06-17  1:25 ` [PATCH 069/119] xfs: create bmbt update intent log items Darrick J. Wong
                   ` (50 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:25 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs, Darrick J. Wong

These functions will be used by the other reflink functions to find
the maximum length of a range of shared blocks.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_refcount.c |  109 ++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_refcount.h |    4 ++
 2 files changed, 113 insertions(+)


diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
index bfbdbad..ebbb714 100644
--- a/fs/xfs/libxfs/xfs_refcount.c
+++ b/fs/xfs/libxfs/xfs_refcount.c
@@ -1129,3 +1129,112 @@ xfs_refcount_decrease_extent(
 
 	return __xfs_refcount_add(mp, dfops, &ri);
 }
+
+/*
+ * Given an AG extent, find the lowest-numbered run of shared blocks within
+ * that range and return the range in fbno/flen.  If find_maximal is set,
+ * return the longest extent of shared blocks; if not, just return the first
+ * extent we find.  If no shared blocks are found, flen will be set to zero.
+ */
+int
+xfs_refcount_find_shared(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno,
+	xfs_agblock_t		agbno,
+	xfs_extlen_t		aglen,
+	xfs_agblock_t		*fbno,
+	xfs_extlen_t		*flen,
+	bool			find_maximal)
+{
+	struct xfs_btree_cur	*cur;
+	struct xfs_buf		*agbp;
+	struct xfs_refcount_irec	tmp;
+	int			error;
+	int			i, have;
+	int			bt_error = XFS_BTREE_ERROR;
+
+	trace_xfs_refcount_find_shared(mp, agno, agbno, aglen);
+
+	error = xfs_alloc_read_agf(mp, NULL, agno, 0, &agbp);
+	if (error)
+		goto out;
+	cur = xfs_refcountbt_init_cursor(mp, NULL, agbp, agno, NULL);
+
+	/* By default, skip the whole range */
+	*fbno = agbno + aglen;
+	*flen = 0;
+
+	/* Try to find a refcount extent that crosses the start */
+	error = xfs_refcountbt_lookup_le(cur, agbno, &have);
+	if (error)
+		goto out_error;
+	if (!have) {
+		/* No left extent, look at the next one */
+		error = xfs_btree_increment(cur, 0, &have);
+		if (error)
+			goto out_error;
+		if (!have)
+			goto done;
+	}
+	error = xfs_refcountbt_get_rec(cur, &tmp, &i);
+	if (error)
+		goto out_error;
+	XFS_WANT_CORRUPTED_GOTO(mp, i == 1, out_error);
+
+	/* If the extent ends before the start, look at the next one */
+	if (tmp.rc_startblock + tmp.rc_blockcount <= agbno) {
+		error = xfs_btree_increment(cur, 0, &have);
+		if (error)
+			goto out_error;
+		if (!have)
+			goto done;
+		error = xfs_refcountbt_get_rec(cur, &tmp, &i);
+		if (error)
+			goto out_error;
+		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, out_error);
+	}
+
+	/* If the extent ends after the range we want, bail out */
+	if (tmp.rc_startblock >= agbno + aglen)
+		goto done;
+
+	/* We found the start of a shared extent! */
+	if (tmp.rc_startblock < agbno) {
+		tmp.rc_blockcount -= (agbno - tmp.rc_startblock);
+		tmp.rc_startblock = agbno;
+	}
+
+	*fbno = tmp.rc_startblock;
+	*flen = min(tmp.rc_blockcount, agbno + aglen - *fbno);
+	if (!find_maximal)
+		goto done;
+
+	/* Otherwise, find the end of this shared extent */
+	while (*fbno + *flen < agbno + aglen) {
+		error = xfs_btree_increment(cur, 0, &have);
+		if (error)
+			goto out_error;
+		if (!have)
+			break;
+		error = xfs_refcountbt_get_rec(cur, &tmp, &i);
+		if (error)
+			goto out_error;
+		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, out_error);
+		if (tmp.rc_startblock >= agbno + aglen ||
+		    tmp.rc_startblock != *fbno + *flen)
+			break;
+		*flen = min(*flen + tmp.rc_blockcount, agbno + aglen - *fbno);
+	}
+
+done:
+	bt_error = XFS_BTREE_NOERROR;
+	trace_xfs_refcount_find_shared_result(mp, agno, *fbno, *flen);
+
+out_error:
+	xfs_btree_del_cursor(cur, bt_error);
+	xfs_buf_relse(agbp);
+out:
+	if (error)
+		trace_xfs_refcount_find_shared_error(mp, agno, error, _RET_IP_);
+	return error;
+}
diff --git a/fs/xfs/libxfs/xfs_refcount.h b/fs/xfs/libxfs/xfs_refcount.h
index 92c05ea..b7b83b8 100644
--- a/fs/xfs/libxfs/xfs_refcount.h
+++ b/fs/xfs/libxfs/xfs_refcount.h
@@ -53,4 +53,8 @@ extern int xfs_refcount_finish_one(struct xfs_trans *tp,
 		xfs_fsblock_t startblock, xfs_extlen_t blockcount,
 		xfs_extlen_t *adjusted, struct xfs_btree_cur **pcur);
 
+extern int xfs_refcount_find_shared(struct xfs_mount *mp, xfs_agnumber_t agno,
+		xfs_agblock_t agbno, xfs_extlen_t aglen, xfs_agblock_t *fbno,
+		xfs_extlen_t *flen, bool find_maximal);
+
 #endif	/* __XFS_REFCOUNT_H__ */



* [PATCH 069/119] xfs: create bmbt update intent log items
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (67 preceding siblings ...)
  2016-06-17  1:25 ` [PATCH 068/119] xfs: introduce reflink utility functions Darrick J. Wong
@ 2016-06-17  1:25 ` Darrick J. Wong
  2016-06-17  1:25 ` [PATCH 070/119] xfs: log bmap intent items Darrick J. Wong
                   ` (49 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:25 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Create bmbt update intent/done log items to record redo information in
the log.  Because we roll transactions multiple times for reflink
operations, we also have to track the status of the metadata updates
that will be recorded in the post-roll transactions in case we crash
before committing the final transaction.  This mechanism enables log
recovery to finish what was already started.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile                |    1 
 fs/xfs/libxfs/xfs_log_format.h |   52 ++++-
 fs/xfs/xfs_bmap_item.c         |  459 ++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_bmap_item.h         |   97 ++++++++
 fs/xfs/xfs_super.c             |   21 ++
 5 files changed, 628 insertions(+), 2 deletions(-)
 create mode 100644 fs/xfs/xfs_bmap_item.c
 create mode 100644 fs/xfs/xfs_bmap_item.h


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 2945270..d2ce5db 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -103,6 +103,7 @@ xfs-y				+= xfs_aops.o \
 # low-level transaction/log code
 xfs-y				+= xfs_log.o \
 				   xfs_log_cil.o \
+				   xfs_bmap_item.o \
 				   xfs_buf_item.o \
 				   xfs_extfree_item.o \
 				   xfs_icreate_item.o \
diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
index 923b08f..320a305 100644
--- a/fs/xfs/libxfs/xfs_log_format.h
+++ b/fs/xfs/libxfs/xfs_log_format.h
@@ -114,7 +114,9 @@ static inline uint xlog_get_cycle(char *ptr)
 #define XLOG_REG_TYPE_RUD_FORMAT	22
 #define XLOG_REG_TYPE_CUI_FORMAT	23
 #define XLOG_REG_TYPE_CUD_FORMAT	24
-#define XLOG_REG_TYPE_MAX		24
+#define XLOG_REG_TYPE_BUI_FORMAT	25
+#define XLOG_REG_TYPE_BUD_FORMAT	26
+#define XLOG_REG_TYPE_MAX		26
 
 /*
  * Flags to log operation header
@@ -235,6 +237,8 @@ typedef struct xfs_trans_header {
 #define	XFS_LI_RUD		0x1241
 #define	XFS_LI_CUI		0x1242	/* refcount update intent */
 #define	XFS_LI_CUD		0x1243
+#define	XFS_LI_BUI		0x1244	/* bmbt update intent */
+#define	XFS_LI_BUD		0x1245
 
 #define XFS_LI_TYPE_DESC \
 	{ XFS_LI_EFI,		"XFS_LI_EFI" }, \
@@ -248,7 +252,9 @@ typedef struct xfs_trans_header {
 	{ XFS_LI_RUI,		"XFS_LI_RUI" }, \
 	{ XFS_LI_RUD,		"XFS_LI_RUD" }, \
 	{ XFS_LI_CUI,		"XFS_LI_CUI" }, \
-	{ XFS_LI_CUD,		"XFS_LI_CUD" }
+	{ XFS_LI_CUD,		"XFS_LI_CUD" }, \
+	{ XFS_LI_BUI,		"XFS_LI_BUI" }, \
+	{ XFS_LI_BUD,		"XFS_LI_BUD" }
 
 /*
  * Inode Log Item Format definitions.
@@ -717,6 +723,48 @@ struct xfs_cud_log_format {
 };
 
 /*
+ * BUI/BUD (inode block mapping) log format definitions
+ */
+
+/* bmbt me_flags: upper bits are flags, lower byte is type code */
+#define XFS_BMAP_EXTENT_MAP		1
+#define XFS_BMAP_EXTENT_UNMAP		2
+#define XFS_BMAP_EXTENT_TYPE_MASK	0xFF
+
+#define XFS_BMAP_EXTENT_ATTR_FORK	(1U << 31)
+#define XFS_BMAP_EXTENT_UNWRITTEN	(1U << 30)
+
+#define XFS_BMAP_EXTENT_FLAGS		(XFS_BMAP_EXTENT_TYPE_MASK | \
+					 XFS_BMAP_EXTENT_ATTR_FORK | \
+					 XFS_BMAP_EXTENT_UNWRITTEN)
+
+/*
+ * This is the structure used to lay out a bui log item in the
+ * log.  The bui_extents field is a variable size array whose
+ * size is given by bui_nextents.
+ */
+struct xfs_bui_log_format {
+	__uint16_t		bui_type;	/* bui log item type */
+	__uint16_t		bui_size;	/* size of this item */
+	__uint32_t		bui_nextents;	/* # extents to map */
+	__uint64_t		bui_id;		/* bui identifier */
+	struct xfs_map_extent	bui_extents[1];	/* array of extents to bmap */
+};
+
+/*
+ * This is the structure used to lay out a bud log item in the
+ * log.  The bud_extents array is a variable size array whose
+ * size is given by bud_nextents.
+ */
+struct xfs_bud_log_format {
+	__uint16_t		bud_type;	/* bud log item type */
+	__uint16_t		bud_size;	/* size of this item */
+	__uint32_t		bud_nextents;	/* # of extents mapped */
+	__uint64_t		bud_bui_id;	/* id of corresponding bui */
+	struct xfs_map_extent	bud_extents[1];	/* array of extents bmapped */
+};
+
+/*
  * Dquot Log format definitions.
  *
  * The first two fields must be the type and size fitting into
diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
new file mode 100644
index 0000000..19bd7dd
--- /dev/null
+++ b/fs/xfs/xfs_bmap_item.c
@@ -0,0 +1,459 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_trans.h"
+#include "xfs_trans_priv.h"
+#include "xfs_buf_item.h"
+#include "xfs_bmap_item.h"
+#include "xfs_log.h"
+
+
+kmem_zone_t	*xfs_bui_zone;
+kmem_zone_t	*xfs_bud_zone;
+
+static inline struct xfs_bui_log_item *BUI_ITEM(struct xfs_log_item *lip)
+{
+	return container_of(lip, struct xfs_bui_log_item, bui_item);
+}
+
+void
+xfs_bui_item_free(
+	struct xfs_bui_log_item	*buip)
+{
+	if (buip->bui_format.bui_nextents > XFS_BUI_MAX_FAST_EXTENTS)
+		kmem_free(buip);
+	else
+		kmem_zone_free(xfs_bui_zone, buip);
+}
+
+/*
+ * This returns the number of iovecs needed to log the given bui item.
+ * We only need 1 iovec for a bui item.  It just logs the bui_log_format
+ * structure.
+ */
+static inline int
+xfs_bui_item_sizeof(
+	struct xfs_bui_log_item *buip)
+{
+	return sizeof(struct xfs_bui_log_format) +
+			(buip->bui_format.bui_nextents - 1) *
+			sizeof(struct xfs_map_extent);
+}
+
+STATIC void
+xfs_bui_item_size(
+	struct xfs_log_item	*lip,
+	int			*nvecs,
+	int			*nbytes)
+{
+	*nvecs += 1;
+	*nbytes += xfs_bui_item_sizeof(BUI_ITEM(lip));
+}
+
+/*
+ * This is called to fill in the vector of log iovecs for the
+ * given bui log item. We use only 1 iovec, and we point that
+ * at the bui_log_format structure embedded in the bui item.
+ * It is at this point that we assert that all of the extent
+ * slots in the bui item have been filled.
+ */
+STATIC void
+xfs_bui_item_format(
+	struct xfs_log_item	*lip,
+	struct xfs_log_vec	*lv)
+{
+	struct xfs_bui_log_item	*buip = BUI_ITEM(lip);
+	struct xfs_log_iovec	*vecp = NULL;
+
+	ASSERT(atomic_read(&buip->bui_next_extent) ==
+			buip->bui_format.bui_nextents);
+
+	buip->bui_format.bui_type = XFS_LI_BUI;
+	buip->bui_format.bui_size = 1;
+
+	xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_BUI_FORMAT, &buip->bui_format,
+			xfs_bui_item_sizeof(buip));
+}
+
+/*
+ * Pinning has no meaning for a bui item, so just return.
+ */
+STATIC void
+xfs_bui_item_pin(
+	struct xfs_log_item	*lip)
+{
+}
+
+/*
+ * The unpin operation is the last place a BUI is manipulated in the log. It is
+ * either inserted in the AIL or aborted in the event of a log I/O error. In
+ * either case, the BUI transaction has been successfully committed to make it
+ * this far. Therefore, we expect whoever committed the BUI to either construct
+ * and commit the BUD or drop the BUD's reference in the event of error. Simply
+ * drop the log's BUI reference now that the log is done with it.
+ */
+STATIC void
+xfs_bui_item_unpin(
+	struct xfs_log_item	*lip,
+	int			remove)
+{
+	struct xfs_bui_log_item	*buip = BUI_ITEM(lip);
+
+	xfs_bui_release(buip);
+}
+
+/*
+ * BUI items have no locking or pushing.  However, since BUIs are pulled from
+ * the AIL when their corresponding BUDs are committed to disk, their situation
+ * is very similar to being pinned.  Return XFS_ITEM_PINNED so that the caller
+ * will eventually flush the log.  This should help in getting the BUI out of
+ * the AIL.
+ */
+STATIC uint
+xfs_bui_item_push(
+	struct xfs_log_item	*lip,
+	struct list_head	*buffer_list)
+{
+	return XFS_ITEM_PINNED;
+}
+
+/*
+ * The BUI has been either committed or aborted if the transaction has been
+ * cancelled. If the transaction was cancelled, a BUD isn't going to be
+ * constructed and thus we free the BUI here directly.
+ */
+STATIC void
+xfs_bui_item_unlock(
+	struct xfs_log_item	*lip)
+{
+	if (lip->li_flags & XFS_LI_ABORTED)
+		xfs_bui_item_free(BUI_ITEM(lip));
+}
+
+/*
+ * The BUI is logged only once and cannot be moved in the log, so simply return
+ * the lsn at which it's been logged.
+ */
+STATIC xfs_lsn_t
+xfs_bui_item_committed(
+	struct xfs_log_item	*lip,
+	xfs_lsn_t		lsn)
+{
+	return lsn;
+}
+
+/*
+ * The BUI dependency tracking op doesn't do squat.  It can't because
+ * it doesn't know where the mapping update is coming from.  The dependency
+ * tracking has to be handled by the "enclosing" metadata object.  For
+ * example, for inodes, the inode is locked throughout the mapping update
+ * so the dependency should be recorded there.
+ */
+STATIC void
+xfs_bui_item_committing(
+	struct xfs_log_item	*lip,
+	xfs_lsn_t		lsn)
+{
+}
+
+/*
+ * This is the ops vector shared by all bui log items.
+ */
+static const struct xfs_item_ops xfs_bui_item_ops = {
+	.iop_size	= xfs_bui_item_size,
+	.iop_format	= xfs_bui_item_format,
+	.iop_pin	= xfs_bui_item_pin,
+	.iop_unpin	= xfs_bui_item_unpin,
+	.iop_unlock	= xfs_bui_item_unlock,
+	.iop_committed	= xfs_bui_item_committed,
+	.iop_push	= xfs_bui_item_push,
+	.iop_committing = xfs_bui_item_committing,
+};
+
+/*
+ * Allocate and initialize a bui item with the given number of extents.
+ */
+struct xfs_bui_log_item *
+xfs_bui_init(
+	struct xfs_mount		*mp,
+	uint				nextents)
+
+{
+	struct xfs_bui_log_item		*buip;
+	uint				size;
+
+	ASSERT(nextents > 0);
+	if (nextents > XFS_BUI_MAX_FAST_EXTENTS) {
+		size = (uint)(sizeof(struct xfs_bui_log_item) +
+			((nextents - 1) * sizeof(struct xfs_map_extent)));
+		buip = kmem_zalloc(size, KM_SLEEP);
+	} else {
+		buip = kmem_zone_zalloc(xfs_bui_zone, KM_SLEEP);
+	}
+
+	xfs_log_item_init(mp, &buip->bui_item, XFS_LI_BUI, &xfs_bui_item_ops);
+	buip->bui_format.bui_nextents = nextents;
+	buip->bui_format.bui_id = (uintptr_t)(void *)buip;
+	atomic_set(&buip->bui_next_extent, 0);
+	atomic_set(&buip->bui_refcount, 2);
+
+	return buip;
+}
+
+/*
+ * Copy a BUI format buffer from the given buf into the destination
+ * BUI format structure.  The BUI/BUD items were designed not to need any
+ * special alignment handling.
+ */
+int
+xfs_bui_copy_format(
+	struct xfs_log_iovec		*buf,
+	struct xfs_bui_log_format	*dst_bui_fmt)
+{
+	struct xfs_bui_log_format	*src_bui_fmt;
+	uint				len;
+
+	src_bui_fmt = buf->i_addr;
+	len = sizeof(struct xfs_bui_log_format) +
+			(src_bui_fmt->bui_nextents - 1) *
+			sizeof(struct xfs_map_extent);
+
+	if (buf->i_len == len) {
+		memcpy((char *)dst_bui_fmt, (char *)src_bui_fmt, len);
+		return 0;
+	}
+	return -EFSCORRUPTED;
+}
+
+/*
+ * Freeing the BUI requires that we remove it from the AIL if it has already
+ * been placed there. However, the BUI may not yet have been placed in the AIL
+ * when called by xfs_bui_release() from BUD processing due to the ordering of
+ * committed vs unpin operations in bulk insert operations. Hence the reference
+ * count to ensure only the last caller frees the BUI.
+ */
+void
+xfs_bui_release(
+	struct xfs_bui_log_item	*buip)
+{
+	if (atomic_dec_and_test(&buip->bui_refcount)) {
+		xfs_trans_ail_remove(&buip->bui_item, SHUTDOWN_LOG_IO_ERROR);
+		xfs_bui_item_free(buip);
+	}
+}
+
+static inline struct xfs_bud_log_item *BUD_ITEM(struct xfs_log_item *lip)
+{
+	return container_of(lip, struct xfs_bud_log_item, bud_item);
+}
+
+STATIC void
+xfs_bud_item_free(struct xfs_bud_log_item *budp)
+{
+	if (budp->bud_format.bud_nextents > XFS_BUD_MAX_FAST_EXTENTS)
+		kmem_free(budp);
+	else
+		kmem_zone_free(xfs_bud_zone, budp);
+}
+
+/*
+ * This returns the number of iovecs needed to log the given bud item.
+ * We only need 1 iovec for a bud item.  It just logs the bud_log_format
+ * structure.
+ */
+static inline int
+xfs_bud_item_sizeof(
+	struct xfs_bud_log_item	*budp)
+{
+	return sizeof(struct xfs_bud_log_format) +
+			(budp->bud_format.bud_nextents - 1) *
+			sizeof(struct xfs_map_extent);
+}
+
+STATIC void
+xfs_bud_item_size(
+	struct xfs_log_item	*lip,
+	int			*nvecs,
+	int			*nbytes)
+{
+	*nvecs += 1;
+	*nbytes += xfs_bud_item_sizeof(BUD_ITEM(lip));
+}
+
+/*
+ * This is called to fill in the vector of log iovecs for the
+ * given bud log item. We use only 1 iovec, and we point that
+ * at the bud_log_format structure embedded in the bud item.
+ * It is at this point that we assert that all of the extent
+ * slots in the bud item have been filled.
+ */
+STATIC void
+xfs_bud_item_format(
+	struct xfs_log_item	*lip,
+	struct xfs_log_vec	*lv)
+{
+	struct xfs_bud_log_item	*budp = BUD_ITEM(lip);
+	struct xfs_log_iovec	*vecp = NULL;
+
+	ASSERT(budp->bud_next_extent == budp->bud_format.bud_nextents);
+
+	budp->bud_format.bud_type = XFS_LI_BUD;
+	budp->bud_format.bud_size = 1;
+
+	xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_BUD_FORMAT, &budp->bud_format,
+			xfs_bud_item_sizeof(budp));
+}
+
+/*
+ * Pinning has no meaning for a bud item, so just return.
+ */
+STATIC void
+xfs_bud_item_pin(
+	struct xfs_log_item	*lip)
+{
+}
+
+/*
+ * Since pinning has no meaning for a bud item, unpinning does
+ * not either.
+ */
+STATIC void
+xfs_bud_item_unpin(
+	struct xfs_log_item	*lip,
+	int			remove)
+{
+}
+
+/*
+ * There isn't much you can do to push on a bud item.  It is simply stuck
+ * waiting for the log to be flushed to disk.
+ */
+STATIC uint
+xfs_bud_item_push(
+	struct xfs_log_item	*lip,
+	struct list_head	*buffer_list)
+{
+	return XFS_ITEM_PINNED;
+}
+
+/*
+ * The BUD is either committed or aborted if the transaction is cancelled. If
+ * the transaction is cancelled, drop our reference to the BUI and free the
+ * BUD.
+ */
+STATIC void
+xfs_bud_item_unlock(
+	struct xfs_log_item	*lip)
+{
+	struct xfs_bud_log_item	*budp = BUD_ITEM(lip);
+
+	if (lip->li_flags & XFS_LI_ABORTED) {
+		xfs_bui_release(budp->bud_buip);
+		xfs_bud_item_free(budp);
+	}
+}
+
+/*
+ * When the bud item is committed to disk, all we need to do is delete our
+ * reference to our partner bui item and then free ourselves. Since we're
+ * freeing ourselves we must return -1 to keep the transaction code from
+ * further referencing this item.
+ */
+STATIC xfs_lsn_t
+xfs_bud_item_committed(
+	struct xfs_log_item	*lip,
+	xfs_lsn_t		lsn)
+{
+	struct xfs_bud_log_item	*budp = BUD_ITEM(lip);
+
+	/*
+	 * Drop the BUI reference regardless of whether the BUD has been
+	 * aborted. Once the BUD transaction is constructed, it is the sole
+	 * responsibility of the BUD to release the BUI (even if the BUI is
+	 * aborted due to log I/O error).
+	 */
+	xfs_bui_release(budp->bud_buip);
+	xfs_bud_item_free(budp);
+
+	return (xfs_lsn_t)-1;
+}
+
+/*
+ * The BUD dependency tracking op doesn't do squat.  It can't because
+ * it doesn't know where the mapping update is coming from.  The dependency
+ * tracking has to be handled by the "enclosing" metadata object.  For
+ * example, for inodes, the inode is locked throughout the mapping update
+ * so the dependency should be recorded there.
+ */
+STATIC void
+xfs_bud_item_committing(
+	struct xfs_log_item	*lip,
+	xfs_lsn_t		lsn)
+{
+}
+
+/*
+ * This is the ops vector shared by all bud log items.
+ */
+static const struct xfs_item_ops xfs_bud_item_ops = {
+	.iop_size	= xfs_bud_item_size,
+	.iop_format	= xfs_bud_item_format,
+	.iop_pin	= xfs_bud_item_pin,
+	.iop_unpin	= xfs_bud_item_unpin,
+	.iop_unlock	= xfs_bud_item_unlock,
+	.iop_committed	= xfs_bud_item_committed,
+	.iop_push	= xfs_bud_item_push,
+	.iop_committing = xfs_bud_item_committing,
+};
+
+/*
+ * Allocate and initialize a bud item with the given number of extents.
+ */
+struct xfs_bud_log_item *
+xfs_bud_init(
+	struct xfs_mount		*mp,
+	struct xfs_bui_log_item		*buip,
+	uint				nextents)
+
+{
+	struct xfs_bud_log_item	*budp;
+	uint			size;
+
+	ASSERT(nextents > 0);
+	if (nextents > XFS_BUD_MAX_FAST_EXTENTS) {
+		size = (uint)(sizeof(struct xfs_bud_log_item) +
+			((nextents - 1) * sizeof(struct xfs_map_extent)));
+		budp = kmem_zalloc(size, KM_SLEEP);
+	} else {
+		budp = kmem_zone_zalloc(xfs_bud_zone, KM_SLEEP);
+	}
+
+	xfs_log_item_init(mp, &budp->bud_item, XFS_LI_BUD, &xfs_bud_item_ops);
+	budp->bud_buip = buip;
+	budp->bud_format.bud_nextents = nextents;
+	budp->bud_format.bud_bui_id = buip->bui_format.bui_id;
+
+	return budp;
+}
diff --git a/fs/xfs/xfs_bmap_item.h b/fs/xfs/xfs_bmap_item.h
new file mode 100644
index 0000000..a992c09
--- /dev/null
+++ b/fs/xfs/xfs_bmap_item.h
@@ -0,0 +1,97 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#ifndef	__XFS_BMAP_ITEM_H__
+#define	__XFS_BMAP_ITEM_H__
+
+/*
+ * There are (currently) two pairs of bmap btree redo item types: map & unmap.
+ * The common abbreviations for these are BUI (bmap update intent) and BUD
+ * (bmap update done).  The redo item type is encoded in the flags field of
+ * each xfs_map_extent.
+ *
+ * *I items should be recorded in the *first* of a series of rolled
+ * transactions, and the *D items should be recorded in the same transaction
+ * that records the associated bmbt updates.
+ *
+ * Should the system crash after the commit of the first transaction but
+ * before the commit of the final transaction in a series, log recovery will
+ * use the redo information recorded by the intent items to replay the
+ * bmbt metadata updates in the non-first transaction.
+ */
+
+/* kernel only BUI/BUD definitions */
+
+struct xfs_mount;
+struct kmem_zone;
+
+/*
+ * Max number of extents in fast allocation path.
+ */
+#define	XFS_BUI_MAX_FAST_EXTENTS	16
+
+/*
+ * Define BUI flag bits. Manipulated by set/clear/test_bit operators.
+ */
+#define	XFS_BUI_RECOVERED		1
+
+/*
+ * This is the "bmap update intent" log item.  It is used to log the fact that
+ * some file block mappings need to change.  It is used in conjunction with the
+ * "bmap update done" log item described below.
+ *
+ * These log items follow the same rules as struct xfs_efi_log_item; see the
+ * comments about that structure (in xfs_extfree_item.h) for more details.
+ */
+struct xfs_bui_log_item {
+	struct xfs_log_item		bui_item;
+	atomic_t			bui_refcount;
+	atomic_t			bui_next_extent;
+	unsigned long			bui_flags;	/* misc flags */
+	struct xfs_bui_log_format	bui_format;
+};
+
+/*
+ * This is the "bmap update done" log item.  It is used to log the fact that
+ * some bmbt updates mentioned in an earlier bui item have been performed.
+ */
+struct xfs_bud_log_item {
+	struct xfs_log_item		bud_item;
+	struct xfs_bui_log_item		*bud_buip;
+	uint				bud_next_extent;
+	struct xfs_bud_log_format	bud_format;
+};
+
+/*
+ * Max number of extents in fast allocation path.
+ */
+#define	XFS_BUD_MAX_FAST_EXTENTS	16
+
+extern struct kmem_zone	*xfs_bui_zone;
+extern struct kmem_zone	*xfs_bud_zone;
+
+struct xfs_bui_log_item *xfs_bui_init(struct xfs_mount *, uint);
+struct xfs_bud_log_item *xfs_bud_init(struct xfs_mount *,
+		struct xfs_bui_log_item *, uint);
+int xfs_bui_copy_format(struct xfs_log_iovec *buf,
+		struct xfs_bui_log_format *dst_bui_fmt);
+void xfs_bui_item_free(struct xfs_bui_log_item *);
+void xfs_bui_release(struct xfs_bui_log_item *);
+
+#endif	/* __XFS_BMAP_ITEM_H__ */
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index a0c7bdc..18f74b3 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -49,6 +49,7 @@
 #include "xfs_defer.h"
 #include "xfs_rmap_item.h"
 #include "xfs_refcount_item.h"
+#include "xfs_bmap_item.h"
 
 #include <linux/namei.h>
 #include <linux/init.h>
@@ -1796,8 +1797,26 @@ xfs_init_zones(void)
 	if (!xfs_cui_zone)
 		goto out_destroy_cud_zone;
 
+	xfs_bud_zone = kmem_zone_init((sizeof(struct xfs_bud_log_item) +
+			((XFS_BUD_MAX_FAST_EXTENTS - 1) *
+				 sizeof(struct xfs_map_extent))),
+			"xfs_bud_item");
+	if (!xfs_bud_zone)
+		goto out_destroy_cui_zone;
+
+	xfs_bui_zone = kmem_zone_init((sizeof(struct xfs_bui_log_item) +
+			((XFS_BUI_MAX_FAST_EXTENTS - 1) *
+				sizeof(struct xfs_map_extent))),
+			"xfs_bui_item");
+	if (!xfs_bui_zone)
+		goto out_destroy_bud_zone;
+
 	return 0;
 
+ out_destroy_bud_zone:
+	kmem_zone_destroy(xfs_bud_zone);
+ out_destroy_cui_zone:
+	kmem_zone_destroy(xfs_cui_zone);
  out_destroy_cud_zone:
 	kmem_zone_destroy(xfs_cud_zone);
  out_destroy_rui_zone:
@@ -1844,6 +1863,8 @@ xfs_destroy_zones(void)
 	 * destroy caches.
 	 */
 	rcu_barrier();
+	kmem_zone_destroy(xfs_bui_zone);
+	kmem_zone_destroy(xfs_bud_zone);
 	kmem_zone_destroy(xfs_cui_zone);
 	kmem_zone_destroy(xfs_cud_zone);
 	kmem_zone_destroy(xfs_rui_zone);



* [PATCH 070/119] xfs: log bmap intent items
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (68 preceding siblings ...)
  2016-06-17  1:25 ` [PATCH 069/119] xfs: create bmbt update intent log items Darrick J. Wong
@ 2016-06-17  1:25 ` Darrick J. Wong
  2016-06-17  1:25 ` [PATCH 071/119] xfs: map an inode's offset to an exact physical block Darrick J. Wong
                   ` (48 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:25 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Provide a mechanism for higher levels to create BUI/BUD items and
submit them to the log, along with a stub function to deal with
recovered BUI items.
These parts will be connected to the rmapbt in a later patch.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile          |    1 
 fs/xfs/libxfs/xfs_bmap.h |   13 ++
 fs/xfs/xfs_log_recover.c |  301 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_trans.h       |   18 +++
 fs/xfs/xfs_trans_bmap.c  |  201 +++++++++++++++++++++++++++++++
 5 files changed, 533 insertions(+), 1 deletion(-)
 create mode 100644 fs/xfs/xfs_trans_bmap.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index d2ce5db..941afe6 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -112,6 +112,7 @@ xfs-y				+= xfs_log.o \
 				   xfs_rmap_item.o \
 				   xfs_log_recover.o \
 				   xfs_trans_ail.o \
+				   xfs_trans_bmap.o \
 				   xfs_trans_buf.o \
 				   xfs_trans_extfree.o \
 				   xfs_trans_inode.o \
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 862ea464..62a66d0 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -209,4 +209,17 @@ struct xfs_bmbt_rec_host *
 				struct xfs_bmbt_irec *gotp,
 				struct xfs_bmbt_irec *prevp);
 
+enum xfs_bmap_intent_type {
+	XFS_BMAP_MAP,
+	XFS_BMAP_UNMAP,
+};
+
+struct xfs_bmap_intent {
+	struct list_head			bi_list;
+	enum xfs_bmap_intent_type		bi_type;
+	struct xfs_inode			*bi_owner;
+	int					bi_whichfork;
+	struct xfs_bmbt_irec			bi_bmap;
+};
+
 #endif	/* __XFS_BMAP_H__ */
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 2235d94..42000f4 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -48,6 +48,8 @@
 #include "xfs_rmap_btree.h"
 #include "xfs_refcount_item.h"
 #include "xfs_refcount.h"
+#include "xfs_bmap_item.h"
+#include "xfs_bmap.h"
 
 #define BLK_AVG(blk1, blk2)	((blk1+blk2) >> 1)
 
@@ -1920,6 +1922,8 @@ xlog_recover_reorder_trans(
 		case XFS_LI_RUD:
 		case XFS_LI_CUI:
 		case XFS_LI_CUD:
+		case XFS_LI_BUI:
+		case XFS_LI_BUD:
 			trace_xfs_log_recover_item_reorder_tail(log,
 							trans, item, pass);
 			list_move_tail(&item->ri_list, &inode_list);
@@ -3622,6 +3626,101 @@ xlog_recover_cud_pass2(
 }
 
 /*
+ * This routine is called to create an in-core extent bmap update
+ * item from the bui format structure which was logged on disk.
+ * It allocates an in-core bui, copies the extents from the format
+ * structure into it, and adds the bui to the AIL with the given
+ * LSN.
+ */
+STATIC int
+xlog_recover_bui_pass2(
+	struct xlog			*log,
+	struct xlog_recover_item	*item,
+	xfs_lsn_t			lsn)
+{
+	int				error;
+	struct xfs_mount		*mp = log->l_mp;
+	struct xfs_bui_log_item		*buip;
+	struct xfs_bui_log_format	*bui_formatp;
+
+	bui_formatp = item->ri_buf[0].i_addr;
+
+	buip = xfs_bui_init(mp, bui_formatp->bui_nextents);
+	error = xfs_bui_copy_format(&item->ri_buf[0], &buip->bui_format);
+	if (error) {
+		xfs_bui_item_free(buip);
+		return error;
+	}
+	atomic_set(&buip->bui_next_extent, bui_formatp->bui_nextents);
+
+	spin_lock(&log->l_ailp->xa_lock);
+	/*
+	 * The BUI has two references. One for the BUD and one for the BUI to
+	 * ensure it makes it into the AIL. Insert the BUI into the AIL directly
+	 * and drop the BUI reference. Note that xfs_trans_ail_update() drops the
+	 * AIL lock.
+	 */
+	xfs_trans_ail_update(log->l_ailp, &buip->bui_item, lsn);
+	xfs_bui_release(buip);
+	return 0;
+}
+
+
+/*
+ * This routine is called when a BUD format structure is found in a committed
+ * transaction in the log. Its purpose is to cancel the corresponding BUI if it
+ * was still in the log. To do this it searches the AIL for the BUI with an id
+ * equal to that in the BUD format structure. If we find it we drop the BUD
+ * reference, which removes the BUI from the AIL and frees it.
+ */
+STATIC int
+xlog_recover_bud_pass2(
+	struct xlog			*log,
+	struct xlog_recover_item	*item)
+{
+	struct xfs_bud_log_format	*bud_formatp;
+	struct xfs_bui_log_item		*buip = NULL;
+	struct xfs_log_item		*lip;
+	__uint64_t			bui_id;
+	struct xfs_ail_cursor		cur;
+	struct xfs_ail			*ailp = log->l_ailp;
+
+	bud_formatp = item->ri_buf[0].i_addr;
+	ASSERT(item->ri_buf[0].i_len == (sizeof(struct xfs_bud_log_format) +
+			((bud_formatp->bud_nextents - 1) *
+			sizeof(struct xfs_map_extent))));
+	bui_id = bud_formatp->bud_bui_id;
+
+	/*
+	 * Search for the BUI with the id in the BUD format structure in the
+	 * AIL.
+	 */
+	spin_lock(&ailp->xa_lock);
+	lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
+	while (lip != NULL) {
+		if (lip->li_type == XFS_LI_BUI) {
+			buip = (struct xfs_bui_log_item *)lip;
+			if (buip->bui_format.bui_id == bui_id) {
+				/*
+				 * Drop the BUD reference to the BUI. This
+				 * removes the BUI from the AIL and frees it.
+				 */
+				spin_unlock(&ailp->xa_lock);
+				xfs_bui_release(buip);
+				spin_lock(&ailp->xa_lock);
+				break;
+			}
+		}
+		lip = xfs_trans_ail_cursor_next(ailp, &cur);
+	}
+
+	xfs_trans_ail_cursor_done(&cur);
+	spin_unlock(&ailp->xa_lock);
+
+	return 0;
+}
+
+/*
  * This routine is called when an inode create format structure is found in a
 * committed transaction in the log.  Its purpose is to initialise the inodes
  * being allocated on disk. This requires us to get inode cluster buffers that
@@ -3850,6 +3949,8 @@ xlog_recover_ra_pass2(
 	case XFS_LI_RUD:
 	case XFS_LI_CUI:
 	case XFS_LI_CUD:
+	case XFS_LI_BUI:
+	case XFS_LI_BUD:
 	default:
 		break;
 	}
@@ -3877,6 +3978,8 @@ xlog_recover_commit_pass1(
 	case XFS_LI_RUD:
 	case XFS_LI_CUI:
 	case XFS_LI_CUD:
+	case XFS_LI_BUI:
+	case XFS_LI_BUD:
 		/* nothing to do in pass 1 */
 		return 0;
 	default:
@@ -3915,6 +4018,10 @@ xlog_recover_commit_pass2(
 		return xlog_recover_cui_pass2(log, item, trans->r_lsn);
 	case XFS_LI_CUD:
 		return xlog_recover_cud_pass2(log, item);
+	case XFS_LI_BUI:
+		return xlog_recover_bui_pass2(log, item, trans->r_lsn);
+	case XFS_LI_BUD:
+		return xlog_recover_bud_pass2(log, item);
 	case XFS_LI_DQUOT:
 		return xlog_recover_dquot_pass2(log, buffer_list, item,
 						trans->r_lsn);
@@ -4394,6 +4501,7 @@ static inline bool xlog_item_is_intent(struct xfs_log_item *lip)
 	case XFS_LI_EFI:
 	case XFS_LI_RUI:
 	case XFS_LI_CUI:
+	case XFS_LI_BUI:
 		return true;
 	default:
 		return false;
@@ -5079,6 +5187,187 @@ xlog_recover_cancel_cuis(
 }
 
 /*
+ * Process a bmap update intent item that was recovered from the log.
+ * We need to update some inode's bmbt.
+ */
+STATIC int
+xlog_recover_process_bui(
+	struct xfs_mount		*mp,
+	struct xfs_bui_log_item		*buip)
+{
+	int				i;
+	int				error = 0;
+	struct xfs_map_extent		*bmap;
+	xfs_fsblock_t			startblock_fsb;
+	xfs_fsblock_t			inode_fsb;
+	bool				op_ok;
+
+	ASSERT(!test_bit(XFS_BUI_RECOVERED, &buip->bui_flags));
+
+	/*
+	 * First check the validity of the extents described by the
+	 * BUI.  If any are bad, then assume that all are bad and
+	 * just toss the BUI.
+	 */
+	for (i = 0; i < buip->bui_format.bui_nextents; i++) {
+		bmap = &(buip->bui_format.bui_extents[i]);
+		startblock_fsb = XFS_BB_TO_FSB(mp,
+				   XFS_FSB_TO_DADDR(mp, bmap->me_startblock));
+		inode_fsb = XFS_BB_TO_FSB(mp, XFS_FSB_TO_DADDR(mp,
+				XFS_INO_TO_FSB(mp, bmap->me_owner)));
+		switch (bmap->me_flags & XFS_BMAP_EXTENT_TYPE_MASK) {
+		case XFS_BMAP_EXTENT_MAP:
+		case XFS_BMAP_EXTENT_UNMAP:
+			op_ok = true;
+			break;
+		default:
+			op_ok = false;
+			break;
+		}
+		if (!op_ok || (startblock_fsb == 0) ||
+		    (bmap->me_len == 0) ||
+		    (inode_fsb == 0) ||
+		    (startblock_fsb >= mp->m_sb.sb_dblocks) ||
+		    (bmap->me_len >= mp->m_sb.sb_agblocks) ||
+		    (inode_fsb >= mp->m_sb.sb_agblocks) ||
+		    (bmap->me_flags & ~XFS_BMAP_EXTENT_FLAGS)) {
+			/*
+			 * This will pull the BUI from the AIL and
+			 * free the memory associated with it.
+			 */
+			set_bit(XFS_BUI_RECOVERED, &buip->bui_flags);
+			xfs_bui_release(buip);
+			return -EIO;
+		}
+	}
+
+	/* XXX: do nothing for now */
+	set_bit(XFS_BUI_RECOVERED, &buip->bui_flags);
+	xfs_bui_release(buip);
+	return error;
+}
+
+/*
+ * When this is called, all of the BUIs which did not have
+ * corresponding BUDs should be in the AIL.  What we do now
+ * is update the bmbts associated with each one.
+ *
+ * Since we process the BUIs in normal transactions, they
+ * will be removed at some point after the commit.  This prevents
+ * us from just walking down the list processing each one.
+ * We'll use a flag in the BUI to skip those that we've already
+ * processed and use the AIL iteration mechanism's generation
+ * count to try to speed this up at least a bit.
+ *
+ * When we start, we know that the BUIs are the only things in
+ * the AIL.  As we process them, however, other items are added
+ * to the AIL.
+ */
+STATIC int
+xlog_recover_process_buis(
+	struct xlog		*log)
+{
+	struct xfs_log_item	*lip;
+	struct xfs_bui_log_item	*buip;
+	int			error = 0;
+	struct xfs_ail_cursor	cur;
+	struct xfs_ail		*ailp;
+
+	ailp = log->l_ailp;
+	spin_lock(&ailp->xa_lock);
+	lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
+	while (lip != NULL) {
+		/*
+		 * We're done when we see something other than an intent.
+		 * There should be no intents left in the AIL now.
+		 */
+		if (!xlog_item_is_intent(lip)) {
+#ifdef DEBUG
+			for (; lip; lip = xfs_trans_ail_cursor_next(ailp, &cur))
+				ASSERT(!xlog_item_is_intent(lip));
+#endif
+			break;
+		}
+
+		/* Skip anything that isn't a BUI */
+		if (lip->li_type != XFS_LI_BUI) {
+			lip = xfs_trans_ail_cursor_next(ailp, &cur);
+			continue;
+		}
+
+		/*
+		 * Skip BUIs that we've already processed.
+		 */
+		buip = container_of(lip, struct xfs_bui_log_item, bui_item);
+		if (test_bit(XFS_BUI_RECOVERED, &buip->bui_flags)) {
+			lip = xfs_trans_ail_cursor_next(ailp, &cur);
+			continue;
+		}
+
+		spin_unlock(&ailp->xa_lock);
+		error = xlog_recover_process_bui(log->l_mp, buip);
+		spin_lock(&ailp->xa_lock);
+		if (error)
+			goto out;
+		lip = xfs_trans_ail_cursor_next(ailp, &cur);
+	}
+out:
+	xfs_trans_ail_cursor_done(&cur);
+	spin_unlock(&ailp->xa_lock);
+	return error;
+}
+
+/*
+ * A cancel occurs when the mount has failed and we're bailing out. Release all
+ * pending BUIs so they don't pin the AIL.
+ */
+STATIC int
+xlog_recover_cancel_buis(
+	struct xlog		*log)
+{
+	struct xfs_log_item	*lip;
+	struct xfs_bui_log_item	*buip;
+	int			error = 0;
+	struct xfs_ail_cursor	cur;
+	struct xfs_ail		*ailp;
+
+	ailp = log->l_ailp;
+	spin_lock(&ailp->xa_lock);
+	lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
+	while (lip != NULL) {
+		/*
+		 * We're done when we see something other than an intent.
+		 * There should be no intents left in the AIL now.
+		 */
+		if (!xlog_item_is_intent(lip)) {
+#ifdef DEBUG
+			for (; lip; lip = xfs_trans_ail_cursor_next(ailp, &cur))
+				ASSERT(!xlog_item_is_intent(lip));
+#endif
+			break;
+		}
+
+		/* Skip anything that isn't a BUI */
+		if (lip->li_type != XFS_LI_BUI) {
+			lip = xfs_trans_ail_cursor_next(ailp, &cur);
+			continue;
+		}
+
+		buip = container_of(lip, struct xfs_bui_log_item, bui_item);
+
+		spin_unlock(&ailp->xa_lock);
+		xfs_bui_release(buip);
+		spin_lock(&ailp->xa_lock);
+
+		lip = xfs_trans_ail_cursor_next(ailp, &cur);
+	}
+
+	xfs_trans_ail_cursor_done(&cur);
+	spin_unlock(&ailp->xa_lock);
+	return error;
+}
+
+/*
  * This routine performs a transaction to null out a bad inode pointer
  * in an agi unlinked inode hash bucket.
  */
@@ -5881,6 +6170,12 @@ xlog_recover_finish(
 	if (log->l_flags & XLOG_RECOVERY_NEEDED) {
 		int	error;
 
+		error = xlog_recover_process_buis(log);
+		if (error) {
+			xfs_alert(log->l_mp, "Failed to recover BUIs");
+			return error;
+		}
+
 		error = xlog_recover_process_cuis(log);
 		if (error) {
 			xfs_alert(log->l_mp, "Failed to recover CUIs");
@@ -5929,7 +6224,11 @@ xlog_recover_cancel(
 	int		err2;
 
 	if (log->l_flags & XLOG_RECOVERY_NEEDED) {
-		error = xlog_recover_cancel_cuis(log);
+		error = xlog_recover_cancel_buis(log);
+
+		err2 = xlog_recover_cancel_cuis(log);
+		if (err2 && !error)
+			error = err2;
 
 		err2 = xlog_recover_cancel_ruis(log);
 		if (err2 && !error)
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 6b6cb4a..cda7d92 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -270,4 +270,22 @@ int xfs_trans_log_finish_refcount_update(struct xfs_trans *tp,
 		xfs_extlen_t blockcount, xfs_extlen_t *adjusted,
 		struct xfs_btree_cur **pcur);
 
+enum xfs_bmap_intent_type;
+
+struct xfs_bui_log_item *xfs_trans_get_bui(struct xfs_trans *tp, uint nextents);
+void xfs_trans_log_start_bmap_update(struct xfs_trans *tp,
+		struct xfs_bui_log_item *buip, enum xfs_bmap_intent_type type,
+		__uint64_t owner, int whichfork, xfs_fileoff_t startoff,
+		xfs_fsblock_t startblock, xfs_filblks_t blockcount,
+		xfs_exntst_t state);
+
+struct xfs_bud_log_item *xfs_trans_get_bud(struct xfs_trans *tp,
+		struct xfs_bui_log_item *buip, uint nextents);
+int xfs_trans_log_finish_bmap_update(struct xfs_trans *tp,
+		struct xfs_bud_log_item *rudp, struct xfs_defer_ops *dfops,
+		enum xfs_bmap_intent_type type, struct xfs_inode *ip,
+		int whichfork, xfs_fileoff_t startoff, xfs_fsblock_t startblock,
+		xfs_filblks_t blockcount, xfs_exntst_t state,
+		struct xfs_btree_cur **pcur);
+
 #endif	/* __XFS_TRANS_H__ */
diff --git a/fs/xfs/xfs_trans_bmap.c b/fs/xfs/xfs_trans_bmap.c
new file mode 100644
index 0000000..1517c83
--- /dev/null
+++ b/fs/xfs/xfs_trans_bmap.c
@@ -0,0 +1,201 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_trans.h"
+#include "xfs_trans_priv.h"
+#include "xfs_bmap_item.h"
+#include "xfs_alloc.h"
+#include "xfs_bmap.h"
+#include "xfs_inode.h"
+
+/*
+ * This routine is called to allocate a "bmap update intent"
+ * log item that will hold nextents worth of extents.  The
+ * caller must use all nextents extents, because we are not
+ * flexible about this at all.
+ */
+struct xfs_bui_log_item *
+xfs_trans_get_bui(
+	struct xfs_trans		*tp,
+	uint				nextents)
+{
+	struct xfs_bui_log_item		*buip;
+
+	ASSERT(tp != NULL);
+	ASSERT(nextents > 0);
+
+	buip = xfs_bui_init(tp->t_mountp, nextents);
+	ASSERT(buip != NULL);
+
+	/*
+	 * Get a log_item_desc to point at the new item.
+	 */
+	xfs_trans_add_item(tp, &buip->bui_item);
+	return buip;
+}
+
+/*
+ * This routine is called to log the described bmap update as an
+ * intent item.  It should be called once for each mapping to be
+ * updated.
+ */
+void
+xfs_trans_log_start_bmap_update(
+	struct xfs_trans		*tp,
+	struct xfs_bui_log_item		*buip,
+	enum xfs_bmap_intent_type	type,
+	__uint64_t			owner,
+	int				whichfork,
+	xfs_fileoff_t			startoff,
+	xfs_fsblock_t			startblock,
+	xfs_filblks_t			blockcount,
+	xfs_exntst_t			state)
+{
+	uint				next_extent;
+	struct xfs_map_extent		*bmap;
+
+	tp->t_flags |= XFS_TRANS_DIRTY;
+	buip->bui_item.li_desc->lid_flags |= XFS_LID_DIRTY;
+
+	/*
+	 * atomic_inc_return gives us the value after the increment;
+	 * we want to use it as an array index so we need to subtract 1 from
+	 * it.
+	 */
+	next_extent = atomic_inc_return(&buip->bui_next_extent) - 1;
+	ASSERT(next_extent < buip->bui_format.bui_nextents);
+	bmap = &(buip->bui_format.bui_extents[next_extent]);
+	bmap->me_owner = owner;
+	bmap->me_startblock = startblock;
+	bmap->me_startoff = startoff;
+	bmap->me_len = blockcount;
+	bmap->me_flags = 0;
+	if (state == XFS_EXT_UNWRITTEN)
+		bmap->me_flags |= XFS_BMAP_EXTENT_UNWRITTEN;
+	if (whichfork == XFS_ATTR_FORK)
+		bmap->me_flags |= XFS_BMAP_EXTENT_ATTR_FORK;
+	switch (type) {
+	case XFS_BMAP_MAP:
+		bmap->me_flags |= XFS_BMAP_EXTENT_MAP;
+		break;
+	case XFS_BMAP_UNMAP:
+		bmap->me_flags |= XFS_BMAP_EXTENT_UNMAP;
+		break;
+	default:
+		ASSERT(0);
+	}
+}
+
+
+/*
+ * This routine is called to allocate a "bmap update done"
+ * log item that will hold nextents worth of extents.  The
+ * caller must use all nextents extents, because we are not
+ * flexible about this at all.
+ */
+struct xfs_bud_log_item *
+xfs_trans_get_bud(
+	struct xfs_trans		*tp,
+	struct xfs_bui_log_item		*buip,
+	uint				nextents)
+{
+	struct xfs_bud_log_item		*budp;
+
+	ASSERT(tp != NULL);
+	ASSERT(nextents > 0);
+
+	budp = xfs_bud_init(tp->t_mountp, buip, nextents);
+	ASSERT(budp != NULL);
+
+	/*
+	 * Get a log_item_desc to point at the new item.
+	 */
+	xfs_trans_add_item(tp, &budp->bud_item);
+	return budp;
+}
+
+/*
+ * Finish a bmap update and log it to the BUD. Note that the transaction is
+ * marked dirty regardless of whether the bmap update succeeds or fails to
+ * support the BUI/BUD lifecycle rules.
+ */
+int
+xfs_trans_log_finish_bmap_update(
+	struct xfs_trans		*tp,
+	struct xfs_bud_log_item		*budp,
+	struct xfs_defer_ops		*dop,
+	enum xfs_bmap_intent_type	type,
+	struct xfs_inode		*ip,
+	int				whichfork,
+	xfs_fileoff_t			startoff,
+	xfs_fsblock_t			startblock,
+	xfs_filblks_t			blockcount,
+	xfs_exntst_t			state,
+	struct xfs_btree_cur		**pcur)
+{
+	uint				next_extent;
+	struct xfs_map_extent		*bmap;
+	int				error;
+
+	error = -EFSCORRUPTED;
+
+	/*
+	 * Mark the transaction dirty, even on error. This ensures the
+	 * transaction is aborted, which:
+	 *
+	 * 1.) releases the BUI and frees the BUD
+	 * 2.) shuts down the filesystem
+	 */
+	tp->t_flags |= XFS_TRANS_DIRTY;
+	budp->bud_item.li_desc->lid_flags |= XFS_LID_DIRTY;
+
+	next_extent = budp->bud_next_extent;
+	ASSERT(next_extent < budp->bud_format.bud_nextents);
+	bmap = &(budp->bud_format.bud_extents[next_extent]);
+	bmap->me_owner = ip->i_ino;
+	bmap->me_startblock = startblock;
+	bmap->me_startoff = startoff;
+	bmap->me_len = blockcount;
+	bmap->me_flags = 0;
+	if (state == XFS_EXT_UNWRITTEN)
+		bmap->me_flags |= XFS_BMAP_EXTENT_UNWRITTEN;
+	if (whichfork == XFS_ATTR_FORK)
+		bmap->me_flags |= XFS_BMAP_EXTENT_ATTR_FORK;
+	switch (type) {
+	case XFS_BMAP_MAP:
+		bmap->me_flags |= XFS_BMAP_EXTENT_MAP;
+		break;
+	case XFS_BMAP_UNMAP:
+		bmap->me_flags |= XFS_BMAP_EXTENT_UNMAP;
+		break;
+	default:
+		ASSERT(0);
+	}
+	budp->bud_next_extent++;
+
+	return error;
+}



* [PATCH 071/119] xfs: map an inode's offset to an exact physical block
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (69 preceding siblings ...)
  2016-06-17  1:25 ` [PATCH 070/119] xfs: log bmap intent items Darrick J. Wong
@ 2016-06-17  1:25 ` Darrick J. Wong
  2016-06-17  1:25 ` [PATCH 072/119] xfs: implement deferred bmbt map/unmap operations Darrick J. Wong
                   ` (47 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:25 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Teach the bmap routine to know how to map a range of file blocks to a
specific range of physical blocks, instead of simply allocating fresh
blocks.  This enables reflink to map a file to blocks that are already
in use.
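
The new XFS_BMAPI_REMAP flag comes with strict compatibility rules,
asserted in xfs_bmapi_write(): it cannot be combined with
XFS_BMAPI_PREALLOC or XFS_BMAPI_CONVERT, and it is invalid on the attr
fork.  A minimal userspace C sketch of those rules; the PREALLOC and
CONVERT values below are stand-ins (only ZERO 0x080 and REMAP 0x100
appear in this patch):

```c
#include <assert.h>

/*
 * Illustrative flag values.  Only XFS_BMAPI_ZERO (0x080) and
 * XFS_BMAPI_REMAP (0x100) are taken from the patch; the rest are
 * placeholders for this sketch, not the real xfs_bmap.h values.
 */
#define XFS_BMAPI_PREALLOC	0x002
#define XFS_BMAPI_CONVERT	0x040
#define XFS_BMAPI_ZERO		0x080
#define XFS_BMAPI_REMAP		0x100

#define XFS_DATA_FORK	0
#define XFS_ATTR_FORK	1

/*
 * Mirror of the assertions added to xfs_bmapi_write(): REMAP is only
 * valid on the data fork and excludes PREALLOC and CONVERT.
 */
int remap_flags_valid(int flags, int whichfork)
{
	if (!(flags & XFS_BMAPI_REMAP))
		return 1;	/* rules only constrain REMAP callers */
	if (whichfork == XFS_ATTR_FORK)
		return 0;	/* no remapping into the attr fork */
	if (flags & (XFS_BMAPI_PREALLOC | XFS_BMAPI_CONVERT))
		return 0;	/* mutually exclusive with REMAP */
	return 1;
}
```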

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c |   63 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_bmap.h |   10 +++++++
 fs/xfs/xfs_trace.h       |   54 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 126 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 9044e39..c29dcdb 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3872,6 +3872,55 @@ xfs_bmap_btalloc(
 }
 
 /*
+ * For a remap operation, just "allocate" an extent at the address that the
+ * caller passed in, and ensure that the AGFL is the right size.  The caller
+ * will then map the "allocated" extent into the file somewhere.
+ */
+STATIC int
+xfs_bmap_remap_alloc(
+	struct xfs_bmalloca	*ap)
+{
+	struct xfs_trans	*tp = ap->tp;
+	struct xfs_mount	*mp = tp->t_mountp;
+	xfs_agblock_t		bno;
+	struct xfs_alloc_arg	args;
+	int			error;
+
+	/*
+	 * Validate that the block number is legal - this enables us to detect
+	 * and handle a silent filesystem corruption rather than crashing.
+	 */
+	memset(&args, 0, sizeof(struct xfs_alloc_arg));
+	args.tp = ap->tp;
+	args.mp = ap->tp->t_mountp;
+	bno = *ap->firstblock;
+	args.agno = XFS_FSB_TO_AGNO(mp, bno);
+	ASSERT(args.agno < mp->m_sb.sb_agcount);
+	args.agbno = XFS_FSB_TO_AGBNO(mp, bno);
+	ASSERT(args.agbno < mp->m_sb.sb_agblocks);
+
+	/* "Allocate" the extent from the range we passed in. */
+	trace_xfs_bmap_remap_alloc(ap->ip, *ap->firstblock, ap->length);
+	ap->blkno = bno;
+	ap->ip->i_d.di_nblocks += ap->length;
+	xfs_trans_log_inode(ap->tp, ap->ip, XFS_ILOG_CORE);
+
+	/* Fix the freelist, like a real allocator does. */
+	args.userdata = 1;
+	args.pag = xfs_perag_get(args.mp, args.agno);
+	ASSERT(args.pag);
+
+	error = xfs_alloc_fix_freelist(&args, XFS_ALLOC_FLAG_FREEING);
+	if (error)
+		goto error0;
+error0:
+	xfs_perag_put(args.pag);
+	if (error)
+		trace_xfs_bmap_remap_alloc_error(ap->ip, error, _RET_IP_);
+	return error;
+}
+
+/*
  * xfs_bmap_alloc is called by xfs_bmapi to allocate an extent for a file.
  * It figures out where to ask the underlying allocator to put the new extent.
  */
@@ -3879,6 +3928,8 @@ STATIC int
 xfs_bmap_alloc(
 	struct xfs_bmalloca	*ap)	/* bmap alloc argument struct */
 {
+	if (ap->flags & XFS_BMAPI_REMAP)
+		return xfs_bmap_remap_alloc(ap);
 	if (XFS_IS_REALTIME_INODE(ap->ip) && ap->userdata)
 		return xfs_bmap_rtalloc(ap);
 	return xfs_bmap_btalloc(ap);
@@ -4515,6 +4566,12 @@ xfs_bmapi_write(
 	ASSERT(len > 0);
 	ASSERT(XFS_IFORK_FORMAT(ip, whichfork) != XFS_DINODE_FMT_LOCAL);
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
+	if (whichfork == XFS_ATTR_FORK)
+		ASSERT(!(flags & XFS_BMAPI_REMAP));
+	if (flags & XFS_BMAPI_REMAP) {
+		ASSERT(!(flags & XFS_BMAPI_PREALLOC));
+		ASSERT(!(flags & XFS_BMAPI_CONVERT));
+	}
 
 	/* zeroing is for currently only for data extents, not metadata */
 	ASSERT((flags & (XFS_BMAPI_METADATA | XFS_BMAPI_ZERO)) !=
@@ -4576,6 +4633,12 @@ xfs_bmapi_write(
 		wasdelay = !inhole && isnullstartblock(bma.got.br_startblock);
 
 		/*
+		 * Make sure we only reflink into a hole.
+		 */
+		if (flags & XFS_BMAPI_REMAP)
+			ASSERT(inhole);
+
+		/*
 		 * First, deal with the hole before the allocated space
 		 * that we found, if any.
 		 */
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 62a66d0..fb2fd4c 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -97,6 +97,13 @@ struct xfs_bmap_free_item
  */
 #define XFS_BMAPI_ZERO		0x080
 
+/*
+ * Map the inode offset to the block given in ap->firstblock.  Primarily
+ * used for reflink.  The range must be in a hole, and this flag cannot be
+ * turned on with PREALLOC or CONVERT, and cannot be used on the attr fork.
+ */
+#define XFS_BMAPI_REMAP		0x100
+
 #define XFS_BMAPI_FLAGS \
 	{ XFS_BMAPI_ENTIRE,	"ENTIRE" }, \
 	{ XFS_BMAPI_METADATA,	"METADATA" }, \
@@ -105,7 +112,8 @@ struct xfs_bmap_free_item
 	{ XFS_BMAPI_IGSTATE,	"IGSTATE" }, \
 	{ XFS_BMAPI_CONTIG,	"CONTIG" }, \
 	{ XFS_BMAPI_CONVERT,	"CONVERT" }, \
-	{ XFS_BMAPI_ZERO,	"ZERO" }
+	{ XFS_BMAPI_ZERO,	"ZERO" }, \
+	{ XFS_BMAPI_REMAP,	"REMAP" }
 
 
 static inline int xfs_bmapi_aflag(int w)
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 8366102..8844c9f 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -2939,6 +2939,60 @@ TRACE_EVENT(xfs_refcount_finish_one_leftover,
 		  __entry->adjusted)
 );
 
+/* simple inode-based error/%ip tracepoint class */
+DECLARE_EVENT_CLASS(xfs_inode_error_class,
+	TP_PROTO(struct xfs_inode *ip, int error, unsigned long caller_ip),
+	TP_ARGS(ip, error, caller_ip),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino)
+		__field(int, error)
+		__field(unsigned long, caller_ip)
+	),
+	TP_fast_assign(
+		__entry->dev = VFS_I(ip)->i_sb->s_dev;
+		__entry->ino = ip->i_ino;
+		__entry->error = error;
+		__entry->caller_ip = caller_ip;
+	),
+	TP_printk("dev %d:%d ino %llx error %d caller %ps",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino,
+		  __entry->error,
+		  (char *)__entry->caller_ip)
+);
+
+#define DEFINE_INODE_ERROR_EVENT(name) \
+DEFINE_EVENT(xfs_inode_error_class, name, \
+	TP_PROTO(struct xfs_inode *ip, int error, \
+		 unsigned long caller_ip), \
+	TP_ARGS(ip, error, caller_ip))
+
+/* reflink allocator */
+TRACE_EVENT(xfs_bmap_remap_alloc,
+	TP_PROTO(struct xfs_inode *ip, xfs_fsblock_t fsbno,
+		 xfs_extlen_t len),
+	TP_ARGS(ip, fsbno, len),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino)
+		__field(xfs_fsblock_t, fsbno)
+		__field(xfs_extlen_t, len)
+	),
+	TP_fast_assign(
+		__entry->dev = VFS_I(ip)->i_sb->s_dev;
+		__entry->ino = ip->i_ino;
+		__entry->fsbno = fsbno;
+		__entry->len = len;
+	),
+	TP_printk("dev %d:%d ino 0x%llx fsbno 0x%llx len %x",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino,
+		  __entry->fsbno,
+		  __entry->len)
+);
+DEFINE_INODE_ERROR_EVENT(xfs_bmap_remap_alloc_error);
+
 #endif /* _TRACE_XFS_H */
 
 #undef TRACE_INCLUDE_PATH



* [PATCH 072/119] xfs: implement deferred bmbt map/unmap operations
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (70 preceding siblings ...)
  2016-06-17  1:25 ` [PATCH 071/119] xfs: map an inode's offset to an exact physical block Darrick J. Wong
@ 2016-06-17  1:25 ` Darrick J. Wong
  2016-06-17  1:25 ` [PATCH 073/119] xfs: return work remaining at the end of a bunmapi operation Darrick J. Wong
                   ` (46 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:25 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Implement deferred versions of the inode block map/unmap functions.
These will be used in subsequent patches to make reflink operations
atomic.
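
Each deferred operation packs the mapping type, fork, and extent state
into a single me_flags word when the BUI is logged, and log recovery
decodes it again, treating unknown type bits as corruption.  A
standalone C sketch of that round trip; the flag values are
illustrative stand-ins, since the real xfs_log_format.h definitions are
not part of this mail:

```c
#include <assert.h>

/* Stand-in values; the real on-disk definitions differ. */
#define XFS_BMAP_EXTENT_TYPE_MASK	0x3
#define XFS_BMAP_EXTENT_MAP		0x1
#define XFS_BMAP_EXTENT_UNMAP		0x2
#define XFS_BMAP_EXTENT_ATTR_FORK	0x4
#define XFS_BMAP_EXTENT_UNWRITTEN	0x8

enum bmap_intent_type { BMAP_MAP, BMAP_UNMAP };

/* Encode, as xfs_trans_log_start_bmap_update() fills a BUI record. */
unsigned int encode_flags(enum bmap_intent_type type, int attr_fork,
			  int unwritten)
{
	unsigned int flags = 0;

	if (unwritten)
		flags |= XFS_BMAP_EXTENT_UNWRITTEN;
	if (attr_fork)
		flags |= XFS_BMAP_EXTENT_ATTR_FORK;
	flags |= (type == BMAP_MAP) ? XFS_BMAP_EXTENT_MAP :
				      XFS_BMAP_EXTENT_UNMAP;
	return flags;
}

/*
 * Decode, as xlog_recover_process_bui() does; returns -1 for an
 * unrecognized type bit pattern, mirroring the -EFSCORRUPTED path.
 */
int decode_type(unsigned int flags)
{
	switch (flags & XFS_BMAP_EXTENT_TYPE_MASK) {
	case XFS_BMAP_EXTENT_MAP:
		return BMAP_MAP;
	case XFS_BMAP_EXTENT_UNMAP:
		return BMAP_UNMAP;
	default:
		return -1;
	}
}
```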

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c  |  124 +++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_bmap.h  |   11 ++++
 fs/xfs/libxfs/xfs_defer.h |    1 
 fs/xfs/xfs_defer_item.c   |  113 +++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_error.h        |    4 +
 fs/xfs/xfs_log_recover.c  |   77 +++++++++++++++++++++++++++-
 fs/xfs/xfs_trace.h        |    5 ++
 fs/xfs/xfs_trans.h        |    3 -
 fs/xfs/xfs_trans_bmap.c   |    6 +-
 9 files changed, 336 insertions(+), 8 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index c29dcdb..63cfb1c 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -6127,3 +6127,127 @@ out:
 	xfs_trans_cancel(tp);
 	return error;
 }
+
+/* Record a bmap intent. */
+static int
+__xfs_bmap_add(
+	struct xfs_mount	*mp,
+	struct xfs_defer_ops	*dfops,
+	struct xfs_bmap_intent	*bi)
+{
+	int			error;
+	struct xfs_bmap_intent	*new;
+
+	ASSERT(bi->bi_whichfork == XFS_DATA_FORK);
+
+	trace_xfs_bmap_defer(mp, XFS_FSB_TO_AGNO(mp, bi->bi_bmap.br_startblock),
+			bi->bi_type,
+			XFS_FSB_TO_AGBNO(mp, bi->bi_bmap.br_startblock),
+			bi->bi_owner->i_ino, bi->bi_whichfork,
+			bi->bi_bmap.br_startoff,
+			bi->bi_bmap.br_blockcount,
+			bi->bi_bmap.br_state);
+
+	new = kmem_zalloc(sizeof(struct xfs_bmap_intent), KM_SLEEP | KM_NOFS);
+	*new = *bi;
+
+	error = xfs_defer_join(dfops, bi->bi_owner);
+	if (error)
+		return error;
+
+	xfs_defer_add(dfops, XFS_DEFER_OPS_TYPE_BMAP, &new->bi_list);
+	return 0;
+}
+
+/* Map an extent into a file. */
+int
+xfs_bmap_map_extent(
+	struct xfs_mount	*mp,
+	struct xfs_defer_ops	*dfops,
+	struct xfs_inode	*ip,
+	int			whichfork,
+	struct xfs_bmbt_irec	*PREV)
+{
+	struct xfs_bmap_intent	bi;
+
+	bi.bi_type = XFS_BMAP_MAP;
+	bi.bi_owner = ip;
+	bi.bi_whichfork = whichfork;
+	bi.bi_bmap = *PREV;
+
+	return __xfs_bmap_add(mp, dfops, &bi);
+}
+
+/* Unmap an extent out of a file. */
+int
+xfs_bmap_unmap_extent(
+	struct xfs_mount	*mp,
+	struct xfs_defer_ops	*dfops,
+	struct xfs_inode	*ip,
+	int			whichfork,
+	struct xfs_bmbt_irec	*PREV)
+{
+	struct xfs_bmap_intent	bi;
+
+	bi.bi_type = XFS_BMAP_UNMAP;
+	bi.bi_owner = ip;
+	bi.bi_whichfork = whichfork;
+	bi.bi_bmap = *PREV;
+
+	return __xfs_bmap_add(mp, dfops, &bi);
+}
+
+/*
+ * Process one of the deferred bmap operations: map the described
+ * extent into the file (unmap is not implemented yet).
+ */
+int
+xfs_bmap_finish_one(
+	struct xfs_trans		*tp,
+	struct xfs_defer_ops		*dfops,
+	struct xfs_inode		*ip,
+	enum xfs_bmap_intent_type	type,
+	int				whichfork,
+	xfs_fileoff_t			startoff,
+	xfs_fsblock_t			startblock,
+	xfs_filblks_t			blockcount,
+	xfs_exntst_t			state)
+{
+	struct xfs_bmbt_irec		bmap;
+	int				nimaps = 1;
+	xfs_fsblock_t			firstfsb;
+	int				error = 0;
+
+	bmap.br_startblock = startblock;
+	bmap.br_startoff = startoff;
+	bmap.br_blockcount = blockcount;
+	bmap.br_state = state;
+
+	trace_xfs_bmap_deferred(tp->t_mountp,
+			XFS_FSB_TO_AGNO(tp->t_mountp, startblock), type,
+			XFS_FSB_TO_AGBNO(tp->t_mountp, startblock),
+			ip->i_ino, whichfork, startoff, blockcount, state);
+
+	if (XFS_TEST_ERROR(false, tp->t_mountp,
+			XFS_ERRTAG_BMAP_FINISH_ONE,
+			XFS_RANDOM_BMAP_FINISH_ONE))
+		return -EIO;
+
+	switch (type) {
+	case XFS_BMAP_MAP:
+		firstfsb = bmap.br_startblock;
+		error = xfs_bmapi_write(tp, ip, bmap.br_startoff,
+					bmap.br_blockcount,
+					XFS_BMAPI_REMAP, &firstfsb,
+					bmap.br_blockcount, &bmap, &nimaps,
+					dfops);
+		break;
+	case XFS_BMAP_UNMAP:
+		/* not implemented for now */
+	default:
+		ASSERT(0);
+		error = -EFSCORRUPTED;
+	}
+
+	return error;
+}
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index fb2fd4c..394a22c 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -230,4 +230,15 @@ struct xfs_bmap_intent {
 	struct xfs_bmbt_irec			bi_bmap;
 };
 
+int	xfs_bmap_finish_one(struct xfs_trans *tp, struct xfs_defer_ops *dfops,
+		struct xfs_inode *ip, enum xfs_bmap_intent_type type,
+		int whichfork, xfs_fileoff_t startoff, xfs_fsblock_t startblock,
+		xfs_filblks_t blockcount, xfs_exntst_t state);
+int	xfs_bmap_map_extent(struct xfs_mount *mp, struct xfs_defer_ops *dfops,
+		struct xfs_inode *ip, int whichfork,
+		struct xfs_bmbt_irec *imap);
+int	xfs_bmap_unmap_extent(struct xfs_mount *mp, struct xfs_defer_ops *dfops,
+		struct xfs_inode *ip, int whichfork,
+		struct xfs_bmbt_irec *imap);
+
 #endif	/* __XFS_BMAP_H__ */
diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
index 4081b00..47aa048 100644
--- a/fs/xfs/libxfs/xfs_defer.h
+++ b/fs/xfs/libxfs/xfs_defer.h
@@ -51,6 +51,7 @@ struct xfs_defer_pending {
  * find all the space it needs.
  */
 enum xfs_defer_ops_type {
+	XFS_DEFER_OPS_TYPE_BMAP,
 	XFS_DEFER_OPS_TYPE_REFCOUNT,
 	XFS_DEFER_OPS_TYPE_RMAP,
 	XFS_DEFER_OPS_TYPE_FREE,
diff --git a/fs/xfs/xfs_defer_item.c b/fs/xfs/xfs_defer_item.c
index 2cac94f..c9ebddc 100644
--- a/fs/xfs/xfs_defer_item.c
+++ b/fs/xfs/xfs_defer_item.c
@@ -35,6 +35,9 @@
 #include "xfs_rmap_item.h"
 #include "xfs_refcount.h"
 #include "xfs_refcount_item.h"
+#include "xfs_bmap.h"
+#include "xfs_bmap_item.h"
+#include "xfs_inode.h"
 
 /* Extent Freeing */
 
@@ -394,12 +397,122 @@ const struct xfs_defer_op_type xfs_refcount_update_defer_type = {
 	.cancel_item	= xfs_refcount_update_cancel_item,
 };
 
+/* Inode Block Mapping */
+
+/* Sort bmap intents by inode. */
+static int
+xfs_bmap_update_diff_items(
+	void				*priv,
+	struct list_head		*a,
+	struct list_head		*b)
+{
+	struct xfs_bmap_intent		*ba;
+	struct xfs_bmap_intent		*bb;
+
+	ba = container_of(a, struct xfs_bmap_intent, bi_list);
+	bb = container_of(b, struct xfs_bmap_intent, bi_list);
+	return ba->bi_owner->i_ino - bb->bi_owner->i_ino;
+}
+
+/* Get a BUI. */
+STATIC void *
+xfs_bmap_update_create_intent(
+	struct xfs_trans		*tp,
+	unsigned int			count)
+{
+	return xfs_trans_get_bui(tp, count);
+}
+
+/* Log bmap updates in the intent item. */
+STATIC void
+xfs_bmap_update_log_item(
+	struct xfs_trans		*tp,
+	void				*intent,
+	struct list_head		*item)
+{
+	struct xfs_bmap_intent		*bmap;
+
+	bmap = container_of(item, struct xfs_bmap_intent, bi_list);
+	xfs_trans_log_start_bmap_update(tp, intent, bmap->bi_type,
+			bmap->bi_owner->i_ino, bmap->bi_whichfork,
+			bmap->bi_bmap.br_startoff,
+			bmap->bi_bmap.br_startblock,
+			bmap->bi_bmap.br_blockcount,
+			bmap->bi_bmap.br_state);
+}
+
+/* Get a BUD so we can process all the deferred bmap updates. */
+STATIC void *
+xfs_bmap_update_create_done(
+	struct xfs_trans		*tp,
+	void				*intent,
+	unsigned int			count)
+{
+	return xfs_trans_get_bud(tp, intent, count);
+}
+
+/* Process a deferred bmap update. */
+STATIC int
+xfs_bmap_update_finish_item(
+	struct xfs_trans		*tp,
+	struct xfs_defer_ops		*dop,
+	struct list_head		*item,
+	void				*done_item,
+	void				**state)
+{
+	struct xfs_bmap_intent		*bmap;
+	int				error;
+
+	bmap = container_of(item, struct xfs_bmap_intent, bi_list);
+	error = xfs_trans_log_finish_bmap_update(tp, done_item, dop,
+			bmap->bi_type,
+			bmap->bi_owner, bmap->bi_whichfork,
+			bmap->bi_bmap.br_startoff,
+			bmap->bi_bmap.br_startblock,
+			bmap->bi_bmap.br_blockcount,
+			bmap->bi_bmap.br_state);
+	kmem_free(bmap);
+	return error;
+}
+
+/* Abort all pending BUIs. */
+STATIC void
+xfs_bmap_update_abort_intent(
+	void				*intent)
+{
+	xfs_bui_release(intent);
+}
+
+/* Cancel a deferred bmap update. */
+STATIC void
+xfs_bmap_update_cancel_item(
+	struct list_head		*item)
+{
+	struct xfs_bmap_intent		*bmap;
+
+	bmap = container_of(item, struct xfs_bmap_intent, bi_list);
+	kmem_free(bmap);
+}
+
+const struct xfs_defer_op_type xfs_bmap_update_defer_type = {
+	.type		= XFS_DEFER_OPS_TYPE_BMAP,
+	.max_items	= XFS_BUI_MAX_FAST_EXTENTS,
+	.diff_items	= xfs_bmap_update_diff_items,
+	.create_intent	= xfs_bmap_update_create_intent,
+	.abort_intent	= xfs_bmap_update_abort_intent,
+	.log_item	= xfs_bmap_update_log_item,
+	.create_done	= xfs_bmap_update_create_done,
+	.finish_item	= xfs_bmap_update_finish_item,
+	.cancel_item	= xfs_bmap_update_cancel_item,
+};
+
 /* Deferred Item Initialization */
 
 /* Initialize the deferred operation types. */
 void
 xfs_defer_init_types(void)
 {
+	xfs_defer_init_op_type(&xfs_bmap_update_defer_type);
 	xfs_defer_init_op_type(&xfs_refcount_update_defer_type);
 	xfs_defer_init_op_type(&xfs_rmap_update_defer_type);
 	xfs_defer_init_op_type(&xfs_extent_free_defer_type);
diff --git a/fs/xfs/xfs_error.h b/fs/xfs/xfs_error.h
index 83d7b62..16a60de 100644
--- a/fs/xfs/xfs_error.h
+++ b/fs/xfs/xfs_error.h
@@ -94,7 +94,8 @@ extern void xfs_verifier_error(struct xfs_buf *bp);
 #define XFS_ERRTAG_RMAP_FINISH_ONE			23
 #define XFS_ERRTAG_REFCOUNT_CONTINUE_UPDATE		24
 #define XFS_ERRTAG_REFCOUNT_FINISH_ONE			25
-#define XFS_ERRTAG_MAX					26
+#define XFS_ERRTAG_BMAP_FINISH_ONE			26
+#define XFS_ERRTAG_MAX					27
 
 /*
  * Random factors for above tags, 1 means always, 2 means 1/2 time, etc.
@@ -125,6 +126,7 @@ extern void xfs_verifier_error(struct xfs_buf *bp);
 #define XFS_RANDOM_RMAP_FINISH_ONE			1
 #define XFS_RANDOM_REFCOUNT_CONTINUE_UPDATE		1
 #define XFS_RANDOM_REFCOUNT_FINISH_ONE			1
+#define XFS_RANDOM_BMAP_FINISH_ONE			1
 
 #ifdef DEBUG
 extern int xfs_error_test_active;
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 42000f4..3faaf10 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -5201,6 +5201,14 @@ xlog_recover_process_bui(
 	xfs_fsblock_t			startblock_fsb;
 	xfs_fsblock_t			inode_fsb;
 	bool				op_ok;
+	struct xfs_bud_log_item		*budp;
+	enum xfs_bmap_intent_type	type;
+	int				whichfork;
+	xfs_exntst_t			state;
+	struct xfs_trans		*tp;
+	struct xfs_inode		**ips;
+	struct xfs_defer_ops		dfops;
+	xfs_fsblock_t			firstfsb;
 
 	ASSERT(!test_bit(XFS_BUI_RECOVERED, &buip->bui_flags));
 
@@ -5241,9 +5249,77 @@
 		}
 	}
 
-	/* XXX: do nothing for now */
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
+	if (error)
+		return error;
+	budp = xfs_trans_get_bud(tp, buip, buip->bui_format.bui_nextents);
+
+	xfs_defer_init(&dfops, &firstfsb);
+
+	/* Grab all the inodes we'll need. */
+	ips = kmem_zalloc(sizeof(struct xfs_inode *) *
+				buip->bui_format.bui_nextents, KM_SLEEP);
+	for (i = 0; i < buip->bui_format.bui_nextents; i++) {
+		bmap = &(buip->bui_format.bui_extents[i]);
+		error = xfs_iget(mp, tp, bmap->me_owner, 0, XFS_ILOCK_EXCL,
+				&ips[i]);
+		if (error)
+			goto err_inodes;
+	}
+
+	/* Process deferred bmap items. */
+	for (i = 0; i < buip->bui_format.bui_nextents; i++) {
+		bmap = &(buip->bui_format.bui_extents[i]);
+		state = (bmap->me_flags & XFS_BMAP_EXTENT_UNWRITTEN) ?
+				XFS_EXT_UNWRITTEN : XFS_EXT_NORM;
+		whichfork = (bmap->me_flags & XFS_BMAP_EXTENT_ATTR_FORK) ?
+				XFS_ATTR_FORK : XFS_DATA_FORK;
+		switch (bmap->me_flags & XFS_BMAP_EXTENT_TYPE_MASK) {
+		case XFS_BMAP_EXTENT_MAP:
+			type = XFS_BMAP_MAP;
+			break;
+		case XFS_BMAP_EXTENT_UNMAP:
+			type = XFS_BMAP_UNMAP;
+			break;
+		default:
+			error = -EFSCORRUPTED;
+			goto err_dfops;
+		}
+		xfs_trans_ijoin(tp, ips[i], 0);
+
+		error = xfs_trans_log_finish_bmap_update(tp, budp, &dfops, type,
+				ips[i], whichfork, bmap->me_startoff,
+				bmap->me_startblock, bmap->me_len,
+				state);
+		if (error)
+			goto err_dfops;
+	}
+
+	/* Finish transaction, free inodes. */
+	error = xfs_defer_finish(&tp, &dfops, NULL);
+	if (error)
+		goto err_dfops;
 	set_bit(XFS_BUI_RECOVERED, &buip->bui_flags);
-	xfs_bui_release(buip);
+	error = xfs_trans_commit(tp);
+	for (i = 0; i < buip->bui_format.bui_nextents; i++) {
+		xfs_iunlock(ips[i], XFS_ILOCK_EXCL);
+		IRELE(ips[i]);
+	}
+	kmem_free(ips);
+
+	return error;
+
+err_dfops:
+	xfs_defer_cancel(&dfops);
+err_inodes:
+	for (i = 0; i < buip->bui_format.bui_nextents; i++) {
+		if (!ips[i])
+			continue;
+		xfs_iunlock(ips[i], XFS_ILOCK_EXCL);
+		IRELE(ips[i]);
+	}
+	kmem_free(ips);
+	xfs_trans_cancel(tp);
 	return error;
 }
 
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 8844c9f..a18e321 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -2560,6 +2560,11 @@ DEFINE_RMAPBT_EVENT(xfs_rmap_map_gtrec);
 DEFINE_RMAPBT_EVENT(xfs_rmap_convert_gtrec);
 DEFINE_RMAPBT_EVENT(xfs_rmap_find_left_neighbor_result);
 
+/* deferred bmbt updates */
+#define DEFINE_BMAP_DEFERRED_EVENT	DEFINE_RMAP_DEFERRED_EVENT
+DEFINE_BMAP_DEFERRED_EVENT(xfs_bmap_defer);
+DEFINE_BMAP_DEFERRED_EVENT(xfs_bmap_deferred);
+
 /* per-AG reservation */
 DECLARE_EVENT_CLASS(xfs_ag_resv_class,
 	TP_PROTO(struct xfs_perag *pag, enum xfs_ag_resv_type resv,
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index cda7d92..6e890bc 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -285,7 +285,6 @@ int xfs_trans_log_finish_bmap_update(struct xfs_trans *tp,
 		struct xfs_bud_log_item *rudp, struct xfs_defer_ops *dfops,
 		enum xfs_bmap_intent_type type, struct xfs_inode *ip,
 		int whichfork, xfs_fileoff_t startoff, xfs_fsblock_t startblock,
-		xfs_filblks_t blockcount, xfs_exntst_t state,
-		struct xfs_btree_cur **pcur);
+		xfs_filblks_t blockcount, xfs_exntst_t state);
 
 #endif	/* __XFS_TRANS_H__ */
diff --git a/fs/xfs/xfs_trans_bmap.c b/fs/xfs/xfs_trans_bmap.c
index 1517c83..97f395a 100644
--- a/fs/xfs/xfs_trans_bmap.c
+++ b/fs/xfs/xfs_trans_bmap.c
@@ -154,14 +154,14 @@ xfs_trans_log_finish_bmap_update(
 	xfs_fileoff_t			startoff,
 	xfs_fsblock_t			startblock,
 	xfs_filblks_t			blockcount,
-	xfs_exntst_t			state,
-	struct xfs_btree_cur		**pcur)
+	xfs_exntst_t			state)
 {
 	uint				next_extent;
 	struct xfs_map_extent		*bmap;
 	int				error;
 
-	error = -EFSCORRUPTED;
+	error = xfs_bmap_finish_one(tp, dop, ip, type, whichfork, startoff,
+			startblock, blockcount, state);
 
 	/*
 	 * Mark the transaction dirty, even on error. This ensures the


^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 073/119] xfs: return work remaining at the end of a bunmapi operation
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (71 preceding siblings ...)
  2016-06-17  1:25 ` [PATCH 072/119] xfs: implement deferred bmbt map/unmap operations Darrick J. Wong
@ 2016-06-17  1:25 ` Darrick J. Wong
  2016-06-17  1:25 ` [PATCH 074/119] xfs: define tracepoints for reflink activities Darrick J. Wong
                   ` (45 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:25 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Return the range of file blocks that bunmapi didn't free.  This hint
is used by CoW and reflink to figure out what part of an extent
actually got freed so that they can set up the appropriate atomic
remapping of just the freed range.
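For readers skimming the API change below: the old boolean "done"
interface is kept as a thin wrapper over the new remaining-length
interface.  A tiny userspace sketch of that wrapper pattern (the names
and the 8-block limit are illustrative, not the kernel code):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t filblks_t;	/* stand-in for xfs_filblks_t */

/*
 * Core routine: takes the length by pointer and writes back the number
 * of blocks it could NOT unmap, mirroring __xfs_bunmapi()'s *rlen
 * contract.  Here we pretend only 8 blocks can be unmapped per call.
 */
static int unmap_range(filblks_t start, filblks_t *rlen)
{
	filblks_t done = *rlen < 8 ? *rlen : 8;

	(void)start;
	*rlen -= done;
	return 0;
}

/*
 * Compatibility wrapper: preserves the old "done" boolean interface,
 * the way xfs_bunmapi() wraps __xfs_bunmapi() in the patch.
 */
static int unmap_range_done(filblks_t start, filblks_t len, int *done)
{
	int error = unmap_range(start, &len);

	*done = (len == 0);
	return error;
}
```

Callers that only care whether the whole range went away keep using the
wrapper; CoW/reflink call the core routine and look at the leftover.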

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c |   36 ++++++++++++++++++++++++++++++------
 fs/xfs/libxfs/xfs_bmap.h |    4 ++++
 2 files changed, 34 insertions(+), 6 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 63cfb1c..5af4593 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -5165,17 +5165,16 @@ done:
  * *done is set.
  */
 int						/* error */
-xfs_bunmapi(
+__xfs_bunmapi(
 	xfs_trans_t		*tp,		/* transaction pointer */
 	struct xfs_inode	*ip,		/* incore inode */
 	xfs_fileoff_t		bno,		/* starting offset to unmap */
-	xfs_filblks_t		len,		/* length to unmap in file */
+	xfs_filblks_t		*rlen,		/* i/o: amount remaining */
 	int			flags,		/* misc flags */
 	xfs_extnum_t		nexts,		/* number of extents max */
 	xfs_fsblock_t		*firstblock,	/* first allocated block
 						   controls a.g. for allocs */
-	struct xfs_defer_ops	*dfops,		/* i/o: list extents to free */
-	int			*done)		/* set if not done yet */
+	struct xfs_defer_ops	*dfops)		/* i/o: deferred updates */
 {
 	xfs_btree_cur_t		*cur;		/* bmap btree cursor */
 	xfs_bmbt_irec_t		del;		/* extent being deleted */
@@ -5197,6 +5196,7 @@ xfs_bunmapi(
 	int			wasdel;		/* was a delayed alloc extent */
 	int			whichfork;	/* data or attribute fork */
 	xfs_fsblock_t		sum;
+	xfs_filblks_t		len = *rlen;	/* length to unmap in file */
 
 	trace_xfs_bunmap(ip, bno, len, flags, _RET_IP_);
 
@@ -5223,7 +5223,7 @@ xfs_bunmapi(
 		return error;
 	nextents = ifp->if_bytes / (uint)sizeof(xfs_bmbt_rec_t);
 	if (nextents == 0) {
-		*done = 1;
+		*rlen = 0;
 		return 0;
 	}
 	XFS_STATS_INC(mp, xs_blk_unmap);
@@ -5492,7 +5492,10 @@ nodelete:
 			extno++;
 		}
 	}
-	*done = bno == (xfs_fileoff_t)-1 || bno < start || lastx < 0;
+	if (bno == (xfs_fileoff_t)-1 || bno < start || lastx < 0)
+		*rlen = 0;
+	else
+		*rlen = bno - start + 1;
 
 	/*
 	 * Convert to a btree if necessary.
@@ -5548,6 +5551,27 @@ error0:
 	return error;
 }
 
+/* Unmap a range of a file. */
+int
+xfs_bunmapi(
+	xfs_trans_t		*tp,
+	struct xfs_inode	*ip,
+	xfs_fileoff_t		bno,
+	xfs_filblks_t		len,
+	int			flags,
+	xfs_extnum_t		nexts,
+	xfs_fsblock_t		*firstblock,
+	struct xfs_defer_ops	*dfops,
+	int			*done)
+{
+	int			error;
+
+	error = __xfs_bunmapi(tp, ip, bno, &len, flags, nexts, firstblock,
+			dfops);
+	*done = (len == 0);
+	return error;
+}
+
 /*
  * Determine whether an extent shift can be accomplished by a merge with the
  * extent that precedes the target hole of the shift.
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 394a22c..97828c5 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -197,6 +197,10 @@ int	xfs_bmapi_write(struct xfs_trans *tp, struct xfs_inode *ip,
 		xfs_fsblock_t *firstblock, xfs_extlen_t total,
 		struct xfs_bmbt_irec *mval, int *nmap,
 		struct xfs_defer_ops *dfops);
+int	__xfs_bunmapi(struct xfs_trans *tp, struct xfs_inode *ip,
+		xfs_fileoff_t bno, xfs_filblks_t *rlen, int flags,
+		xfs_extnum_t nexts, xfs_fsblock_t *firstblock,
+		struct xfs_defer_ops *dfops);
 int	xfs_bunmapi(struct xfs_trans *tp, struct xfs_inode *ip,
 		xfs_fileoff_t bno, xfs_filblks_t len, int flags,
 		xfs_extnum_t nexts, xfs_fsblock_t *firstblock,



* [PATCH 074/119] xfs: define tracepoints for reflink activities
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (72 preceding siblings ...)
  2016-06-17  1:25 ` [PATCH 073/119] xfs: return work remaining at the end of a bunmapi operation Darrick J. Wong
@ 2016-06-17  1:25 ` Darrick J. Wong
  2016-06-17  1:25 ` [PATCH 075/119] xfs: add reflink feature flag to geometry Darrick J. Wong
                   ` (44 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:25 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Define all the tracepoints we need to inspect the runtime operation
of reflink/dedupe/copy-on-write.
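Most of the patch below is DECLARE_EVENT_CLASS/DEFINE_EVENT
boilerplate: one class carries the field layout and print format, and
each DEFINE_EVENT stamps out a named tracepoint.  A rough userspace
approximation of that stamping pattern (plain C macros and a string
buffer, not the real tracepoint machinery):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

static char last_event[80];

/* The "class" holds the formatting logic once, like DECLARE_EVENT_CLASS. */
static void double_io_emit(const char *name, long soff, long len, long doff)
{
	snprintf(last_event, sizeof(last_event),
		 "%s: soff=%ld len=%ld -> doff=%ld", name, soff, len, doff);
}

/* Each DEFINE_EVENT-style use stamps out a thin named wrapper. */
#define DEFINE_DOUBLE_IO_EVENT(name)				\
static void trace_##name(long soff, long len, long doff)	\
{								\
	double_io_emit(#name, soff, len, doff);			\
}

DEFINE_DOUBLE_IO_EVENT(xfs_reflink_remap_range)
DEFINE_DOUBLE_IO_EVENT(xfs_reflink_compare_extents)
```

The payoff is the same as in the patch: adding a tracepoint to an
existing class costs one line, not another copy of the field list.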

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_trace.h |  333 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 333 insertions(+)


diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index a18e321..fe9c6f8 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -2998,6 +2998,339 @@ TRACE_EVENT(xfs_bmap_remap_alloc,
 );
 DEFINE_INODE_ERROR_EVENT(xfs_bmap_remap_alloc_error);
 
+/* reflink tracepoint classes */
+
+/* two-file io tracepoint class */
+DECLARE_EVENT_CLASS(xfs_double_io_class,
+	TP_PROTO(struct xfs_inode *src, xfs_off_t soffset, xfs_off_t len,
+		 struct xfs_inode *dest, xfs_off_t doffset),
+	TP_ARGS(src, soffset, len, dest, doffset),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, src_ino)
+		__field(loff_t, src_isize)
+		__field(loff_t, src_disize)
+		__field(loff_t, src_offset)
+		__field(size_t, len)
+		__field(xfs_ino_t, dest_ino)
+		__field(loff_t, dest_isize)
+		__field(loff_t, dest_disize)
+		__field(loff_t, dest_offset)
+	),
+	TP_fast_assign(
+		__entry->dev = VFS_I(src)->i_sb->s_dev;
+		__entry->src_ino = src->i_ino;
+		__entry->src_isize = VFS_I(src)->i_size;
+		__entry->src_disize = src->i_d.di_size;
+		__entry->src_offset = soffset;
+		__entry->len = len;
+		__entry->dest_ino = dest->i_ino;
+		__entry->dest_isize = VFS_I(dest)->i_size;
+		__entry->dest_disize = dest->i_d.di_size;
+		__entry->dest_offset = doffset;
+	),
+	TP_printk("dev %d:%d count %zd "
+		  "ino 0x%llx isize 0x%llx disize 0x%llx offset 0x%llx -> "
+		  "ino 0x%llx isize 0x%llx disize 0x%llx offset 0x%llx",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->len,
+		  __entry->src_ino,
+		  __entry->src_isize,
+		  __entry->src_disize,
+		  __entry->src_offset,
+		  __entry->dest_ino,
+		  __entry->dest_isize,
+		  __entry->dest_disize,
+		  __entry->dest_offset)
+)
+
+#define DEFINE_DOUBLE_IO_EVENT(name)	\
+DEFINE_EVENT(xfs_double_io_class, name,	\
+	TP_PROTO(struct xfs_inode *src, xfs_off_t soffset, xfs_off_t len, \
+		 struct xfs_inode *dest, xfs_off_t doffset), \
+	TP_ARGS(src, soffset, len, dest, doffset))
+
+/* two-file vfs io tracepoint class */
+DECLARE_EVENT_CLASS(xfs_double_vfs_io_class,
+	TP_PROTO(struct inode *src, u64 soffset, u64 len,
+		 struct inode *dest, u64 doffset),
+	TP_ARGS(src, soffset, len, dest, doffset),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(unsigned long, src_ino)
+		__field(loff_t, src_isize)
+		__field(loff_t, src_offset)
+		__field(size_t, len)
+		__field(unsigned long, dest_ino)
+		__field(loff_t, dest_isize)
+		__field(loff_t, dest_offset)
+	),
+	TP_fast_assign(
+		__entry->dev = src->i_sb->s_dev;
+		__entry->src_ino = src->i_ino;
+		__entry->src_isize = i_size_read(src);
+		__entry->src_offset = soffset;
+		__entry->len = len;
+		__entry->dest_ino = dest->i_ino;
+		__entry->dest_isize = i_size_read(dest);
+		__entry->dest_offset = doffset;
+	),
+	TP_printk("dev %d:%d count %zd "
+		  "ino 0x%lx isize 0x%llx offset 0x%llx -> "
+		  "ino 0x%lx isize 0x%llx offset 0x%llx",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->len,
+		  __entry->src_ino,
+		  __entry->src_isize,
+		  __entry->src_offset,
+		  __entry->dest_ino,
+		  __entry->dest_isize,
+		  __entry->dest_offset)
+)
+
+#define DEFINE_DOUBLE_VFS_IO_EVENT(name)	\
+DEFINE_EVENT(xfs_double_vfs_io_class, name,	\
+	TP_PROTO(struct inode *src, u64 soffset, u64 len, \
+		 struct inode *dest, u64 doffset), \
+	TP_ARGS(src, soffset, len, dest, doffset))
+
+/* CoW write tracepoint */
+DECLARE_EVENT_CLASS(xfs_copy_on_write_class,
+	TP_PROTO(struct xfs_inode *ip, xfs_fileoff_t lblk, xfs_fsblock_t pblk,
+		 xfs_extlen_t len, xfs_fsblock_t new_pblk),
+	TP_ARGS(ip, lblk, pblk, len, new_pblk),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino)
+		__field(xfs_fileoff_t, lblk)
+		__field(xfs_fsblock_t, pblk)
+		__field(xfs_extlen_t, len)
+		__field(xfs_fsblock_t, new_pblk)
+	),
+	TP_fast_assign(
+		__entry->dev = VFS_I(ip)->i_sb->s_dev;
+		__entry->ino = ip->i_ino;
+		__entry->lblk = lblk;
+		__entry->pblk = pblk;
+		__entry->len = len;
+		__entry->new_pblk = new_pblk;
+	),
+	TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx pblk 0x%llx "
+		  "len 0x%x new_pblk %llu",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino,
+		  __entry->lblk,
+		  __entry->pblk,
+		  __entry->len,
+		  __entry->new_pblk)
+)
+
+#define DEFINE_COW_EVENT(name)	\
+DEFINE_EVENT(xfs_copy_on_write_class, name,	\
+	TP_PROTO(struct xfs_inode *ip, xfs_fileoff_t lblk, xfs_fsblock_t pblk, \
+		 xfs_extlen_t len, xfs_fsblock_t new_pblk), \
+	TP_ARGS(ip, lblk, pblk, len, new_pblk))
+
+/* inode/irec events */
+DECLARE_EVENT_CLASS(xfs_inode_irec_class,
+	TP_PROTO(struct xfs_inode *ip, struct xfs_bmbt_irec *irec),
+	TP_ARGS(ip, irec),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino)
+		__field(xfs_fileoff_t, lblk)
+		__field(xfs_extlen_t, len)
+		__field(xfs_fsblock_t, pblk)
+	),
+	TP_fast_assign(
+		__entry->dev = VFS_I(ip)->i_sb->s_dev;
+		__entry->ino = ip->i_ino;
+		__entry->lblk = irec->br_startoff;
+		__entry->len = irec->br_blockcount;
+		__entry->pblk = irec->br_startblock;
+	),
+	TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx len 0x%x pblk %llu",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino,
+		  __entry->lblk,
+		  __entry->len,
+		  __entry->pblk)
+);
+#define DEFINE_INODE_IREC_EVENT(name) \
+DEFINE_EVENT(xfs_inode_irec_class, name, \
+	TP_PROTO(struct xfs_inode *ip, struct xfs_bmbt_irec *irec), \
+	TP_ARGS(ip, irec))
+
+/* refcount/reflink tracepoint definitions */
+
+/* reflink tracepoints */
+DEFINE_INODE_EVENT(xfs_reflink_set_inode_flag);
+DEFINE_INODE_EVENT(xfs_reflink_unset_inode_flag);
+DEFINE_ITRUNC_EVENT(xfs_reflink_update_inode_size);
+DEFINE_IOMAP_EVENT(xfs_reflink_remap_imap);
+TRACE_EVENT(xfs_reflink_remap_blocks_loop,
+	TP_PROTO(struct xfs_inode *src, xfs_fileoff_t soffset,
+		 xfs_filblks_t len, struct xfs_inode *dest,
+		 xfs_fileoff_t doffset),
+	TP_ARGS(src, soffset, len, dest, doffset),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, src_ino)
+		__field(xfs_fileoff_t, src_lblk)
+		__field(xfs_filblks_t, len)
+		__field(xfs_ino_t, dest_ino)
+		__field(xfs_fileoff_t, dest_lblk)
+	),
+	TP_fast_assign(
+		__entry->dev = VFS_I(src)->i_sb->s_dev;
+		__entry->src_ino = src->i_ino;
+		__entry->src_lblk = soffset;
+		__entry->len = len;
+		__entry->dest_ino = dest->i_ino;
+		__entry->dest_lblk = doffset;
+	),
+	TP_printk("dev %d:%d len 0x%llx "
+		  "ino 0x%llx offset 0x%llx blocks -> "
+		  "ino 0x%llx offset 0x%llx blocks",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->len,
+		  __entry->src_ino,
+		  __entry->src_lblk,
+		  __entry->dest_ino,
+		  __entry->dest_lblk)
+);
+TRACE_EVENT(xfs_reflink_punch_range,
+	TP_PROTO(struct xfs_inode *ip, xfs_fileoff_t lblk,
+		 xfs_extlen_t len),
+	TP_ARGS(ip, lblk, len),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino)
+		__field(xfs_fileoff_t, lblk)
+		__field(xfs_extlen_t, len)
+	),
+	TP_fast_assign(
+		__entry->dev = VFS_I(ip)->i_sb->s_dev;
+		__entry->ino = ip->i_ino;
+		__entry->lblk = lblk;
+		__entry->len = len;
+	),
+	TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx len 0x%x",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino,
+		  __entry->lblk,
+		  __entry->len)
+);
+TRACE_EVENT(xfs_reflink_remap,
+	TP_PROTO(struct xfs_inode *ip, xfs_fileoff_t lblk,
+		 xfs_extlen_t len, xfs_fsblock_t new_pblk),
+	TP_ARGS(ip, lblk, len, new_pblk),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino)
+		__field(xfs_fileoff_t, lblk)
+		__field(xfs_extlen_t, len)
+		__field(xfs_fsblock_t, new_pblk)
+	),
+	TP_fast_assign(
+		__entry->dev = VFS_I(ip)->i_sb->s_dev;
+		__entry->ino = ip->i_ino;
+		__entry->lblk = lblk;
+		__entry->len = len;
+		__entry->new_pblk = new_pblk;
+	),
+	TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx len 0x%x new_pblk %llu",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino,
+		  __entry->lblk,
+		  __entry->len,
+		  __entry->new_pblk)
+);
+DEFINE_DOUBLE_IO_EVENT(xfs_reflink_remap_range);
+DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_range_error);
+DEFINE_INODE_ERROR_EVENT(xfs_reflink_set_inode_flag_error);
+DEFINE_INODE_ERROR_EVENT(xfs_reflink_update_inode_size_error);
+DEFINE_INODE_ERROR_EVENT(xfs_reflink_reflink_main_loop_error);
+DEFINE_INODE_ERROR_EVENT(xfs_reflink_read_iomap_error);
+DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_blocks_error);
+DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_extent_error);
+
+/* dedupe tracepoints */
+DEFINE_DOUBLE_IO_EVENT(xfs_reflink_compare_extents);
+DEFINE_INODE_ERROR_EVENT(xfs_reflink_compare_extents_error);
+
+/* ioctl tracepoints */
+DEFINE_DOUBLE_VFS_IO_EVENT(xfs_ioctl_reflink);
+DEFINE_DOUBLE_VFS_IO_EVENT(xfs_ioctl_clone_range);
+DEFINE_DOUBLE_VFS_IO_EVENT(xfs_ioctl_file_extent_same);
+TRACE_EVENT(xfs_ioctl_clone,
+	TP_PROTO(struct inode *src, struct inode *dest),
+	TP_ARGS(src, dest),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(unsigned long, src_ino)
+		__field(loff_t, src_isize)
+		__field(unsigned long, dest_ino)
+		__field(loff_t, dest_isize)
+	),
+	TP_fast_assign(
+		__entry->dev = src->i_sb->s_dev;
+		__entry->src_ino = src->i_ino;
+		__entry->src_isize = i_size_read(src);
+		__entry->dest_ino = dest->i_ino;
+		__entry->dest_isize = i_size_read(dest);
+	),
+	TP_printk("dev %d:%d "
+		  "ino 0x%lx isize 0x%llx -> "
+		  "ino 0x%lx isize 0x%llx",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->src_ino,
+		  __entry->src_isize,
+		  __entry->dest_ino,
+		  __entry->dest_isize)
+);
+
+/* unshare tracepoints */
+DEFINE_SIMPLE_IO_EVENT(xfs_reflink_unshare);
+DEFINE_SIMPLE_IO_EVENT(xfs_reflink_cow_eof_block);
+DEFINE_PAGE_EVENT(xfs_reflink_unshare_page);
+DEFINE_INODE_ERROR_EVENT(xfs_reflink_unshare_error);
+DEFINE_INODE_ERROR_EVENT(xfs_reflink_cow_eof_block_error);
+DEFINE_INODE_ERROR_EVENT(xfs_reflink_dirty_page_error);
+
+/* copy on write */
+DEFINE_INODE_IREC_EVENT(xfs_reflink_irec_is_shared);
+
+DEFINE_RW_EVENT(xfs_reflink_reserve_cow_range);
+DEFINE_INODE_IREC_EVENT(xfs_reflink_reserve_cow_extent);
+DEFINE_RW_EVENT(xfs_reflink_allocate_cow_range);
+DEFINE_INODE_IREC_EVENT(xfs_reflink_allocate_cow_extent);
+
+DEFINE_INODE_IREC_EVENT(xfs_reflink_bounce_dio_write);
+DEFINE_IOMAP_EVENT(xfs_reflink_find_cow_mapping);
+DEFINE_INODE_IREC_EVENT(xfs_reflink_trim_irec);
+DEFINE_SIMPLE_IO_EVENT(xfs_iomap_cow_delay);
+
+DEFINE_SIMPLE_IO_EVENT(xfs_reflink_cancel_cow_range);
+DEFINE_SIMPLE_IO_EVENT(xfs_reflink_end_cow);
+DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_remap);
+DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_remap_piece);
+
+DEFINE_INODE_ERROR_EVENT(xfs_reflink_reserve_cow_range_error);
+DEFINE_INODE_ERROR_EVENT(xfs_reflink_reserve_cow_extent_error);
+DEFINE_INODE_ERROR_EVENT(xfs_reflink_allocate_cow_range_error);
+DEFINE_INODE_ERROR_EVENT(xfs_reflink_cancel_cow_range_error);
+DEFINE_INODE_ERROR_EVENT(xfs_reflink_end_cow_error);
+
+DEFINE_COW_EVENT(xfs_reflink_fork_buf);
+DEFINE_COW_EVENT(xfs_reflink_finish_fork_buf);
+DEFINE_INODE_ERROR_EVENT(xfs_reflink_fork_buf_error);
+DEFINE_INODE_ERROR_EVENT(xfs_reflink_finish_fork_buf_error);
+
+DEFINE_INODE_EVENT(xfs_reflink_cancel_pending_cow);
+DEFINE_INODE_IREC_EVENT(xfs_reflink_cancel_cow);
+DEFINE_INODE_ERROR_EVENT(xfs_reflink_cancel_pending_cow_error);
+
 #endif /* _TRACE_XFS_H */
 
 #undef TRACE_INCLUDE_PATH



* [PATCH 075/119] xfs: add reflink feature flag to geometry
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (73 preceding siblings ...)
  2016-06-17  1:25 ` [PATCH 074/119] xfs: define tracepoints for reflink activities Darrick J. Wong
@ 2016-06-17  1:25 ` Darrick J. Wong
  2016-06-17  1:25 ` [PATCH 076/119] xfs: don't allow reflinked dir/dev/fifo/socket/pipe files Darrick J. Wong
                   ` (43 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:25 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Report the reflink feature in the XFS geometry so that xfs_info and
friends know the filesystem has this feature.
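The geometry flags are just a bitmask built up from the superblock
feature predicates.  An illustrative userspace sketch of composing and
testing the mask (flag values copied from the patch; the helper name is
made up):

```c
#include <assert.h>
#include <stdint.h>

/* Values from xfs_fs.h in the patch; shown here as plain constants. */
#define GEOM_FLAGS_RMAPBT	0x80000		/* reverse mapping btree */
#define GEOM_FLAGS_REFLINK	0x100000	/* files can share blocks */

/* Build the flags word from feature booleans, as xfs_fs_geometry() does. */
static uint32_t geometry_flags(int has_rmapbt, int has_reflink)
{
	return (has_rmapbt ? GEOM_FLAGS_RMAPBT : 0) |
	       (has_reflink ? GEOM_FLAGS_REFLINK : 0);
}
```

Userspace tools like xfs_info then test individual bits of the returned
word to decide which features to report.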

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_fs.h |    3 ++-
 fs/xfs/xfs_fsops.c     |    4 +++-
 2 files changed, 5 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 7945505..6f4f2c3 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -206,7 +206,8 @@ typedef struct xfs_fsop_resblks {
 #define XFS_FSOP_GEOM_FLAGS_FTYPE	0x10000	/* inode directory types */
 #define XFS_FSOP_GEOM_FLAGS_FINOBT	0x20000	/* free inode btree */
 #define XFS_FSOP_GEOM_FLAGS_SPINODES	0x40000	/* sparse inode chunks	*/
-#define XFS_FSOP_GEOM_FLAGS_RMAPBT	0x80000	/* Reverse mapping btree */
+#define XFS_FSOP_GEOM_FLAGS_RMAPBT	0x80000	/* reverse mapping btree */
+#define XFS_FSOP_GEOM_FLAGS_REFLINK	0x100000 /* files can share blocks */
 
 /*
  * Minimum and maximum sizes need for growth checks.
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 3c1ded1..84e7ba3 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -107,7 +107,9 @@ xfs_fs_geometry(
 			(xfs_sb_version_hassparseinodes(&mp->m_sb) ?
 				XFS_FSOP_GEOM_FLAGS_SPINODES : 0) |
 			(xfs_sb_version_hasrmapbt(&mp->m_sb) ?
-				XFS_FSOP_GEOM_FLAGS_RMAPBT : 0);
+				XFS_FSOP_GEOM_FLAGS_RMAPBT : 0) |
+			(xfs_sb_version_hasreflink(&mp->m_sb) ?
+				XFS_FSOP_GEOM_FLAGS_REFLINK : 0);
 		geo->logsectsize = xfs_sb_version_hassector(&mp->m_sb) ?
 				mp->m_sb.sb_logsectsize : BBSIZE;
 		geo->rtsectsize = mp->m_sb.sb_blocksize;



* [PATCH 076/119] xfs: don't allow reflinked dir/dev/fifo/socket/pipe files
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (74 preceding siblings ...)
  2016-06-17  1:25 ` [PATCH 075/119] xfs: add reflink feature flag to geometry Darrick J. Wong
@ 2016-06-17  1:25 ` Darrick J. Wong
  2016-06-17  1:26 ` [PATCH 077/119] xfs: introduce the CoW fork Darrick J. Wong
                   ` (42 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:25 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Only regular files that aren't realtime files can be reflinked, so
check both conditions when we load an inode.  Also, don't leak the
attr fork if there's a failure.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_inode_fork.c |   23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
index bbcc8c7..7699a03 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.c
+++ b/fs/xfs/libxfs/xfs_inode_fork.c
@@ -121,6 +121,26 @@ xfs_iformat_fork(
 		return -EFSCORRUPTED;
 	}
 
+	if (unlikely(xfs_is_reflink_inode(ip) &&
+	    (VFS_I(ip)->i_mode & S_IFMT) != S_IFREG)) {
+		xfs_warn(ip->i_mount,
+			"corrupt dinode %llu, wrong file type for reflink.",
+			ip->i_ino);
+		XFS_CORRUPTION_ERROR("xfs_iformat(reflink)",
+				     XFS_ERRLEVEL_LOW, ip->i_mount, dip);
+		return -EFSCORRUPTED;
+	}
+
+	if (unlikely(xfs_is_reflink_inode(ip) &&
+	    (ip->i_d.di_flags & XFS_DIFLAG_REALTIME))) {
+		xfs_warn(ip->i_mount,
+			"corrupt dinode %llu, has reflink+realtime flag set.",
+			ip->i_ino);
+		XFS_CORRUPTION_ERROR("xfs_iformat(reflink)",
+				     XFS_ERRLEVEL_LOW, ip->i_mount, dip);
+		return -EFSCORRUPTED;
+	}
+
 	switch (VFS_I(ip)->i_mode & S_IFMT) {
 	case S_IFIFO:
 	case S_IFCHR:
@@ -208,7 +228,8 @@ xfs_iformat_fork(
 			XFS_CORRUPTION_ERROR("xfs_iformat(8)",
 					     XFS_ERRLEVEL_LOW,
 					     ip->i_mount, dip);
-			return -EFSCORRUPTED;
+			error = -EFSCORRUPTED;
+			break;
 		}
 
 		error = xfs_iformat_local(ip, dip, XFS_ATTR_FORK, size);



* [PATCH 077/119] xfs: introduce the CoW fork
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (75 preceding siblings ...)
  2016-06-17  1:25 ` [PATCH 076/119] xfs: don't allow reflinked dir/dev/fifo/socket/pipe files Darrick J. Wong
@ 2016-06-17  1:26 ` Darrick J. Wong
  2016-06-17  1:26 ` [PATCH 078/119] xfs: support bmapping delalloc extents in " Darrick J. Wong
                   ` (41 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:26 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Introduce a new in-core fork for storing copy-on-write delalloc
reservations and allocated extents that are in the process of being
written out.

v2: fix up bmapi_read so that we can query the CoW fork, and have it
return a "hole" extent if there's no CoW fork.
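Block mapping calls select the fork via BMAPI flags; the patch adds a
COWFORK flag and an xfs_bmapi_whichfork() helper that decodes it.  A
userspace sketch of the same decision order (constants illustrative,
though the bmapi flag values match the patch):

```c
#include <assert.h>

/* Fork selectors and bmapi flags, mirroring the patch. */
#define DATA_FORK	0
#define ATTR_FORK	1
#define COW_FORK	2

#define BMAPI_ATTRFORK	0x001
#define BMAPI_COWFORK	0x200

/*
 * Same decision order as xfs_bmapi_whichfork(): COW wins over ATTR,
 * and no fork flag at all means the data fork.
 */
static int bmapi_whichfork(int flags)
{
	if (flags & BMAPI_COWFORK)
		return COW_FORK;
	if (flags & BMAPI_ATTRFORK)
		return ATTR_FORK;
	return DATA_FORK;
}
```

This is why the patch can convert the open-coded ternaries in
xfs_bmapi_read(), xfs_bmapi_convert_unwritten(), and __xfs_bunmapi()
to one helper call.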

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile                |    1 
 fs/xfs/libxfs/xfs_bmap.c       |   27 +++++++--
 fs/xfs/libxfs/xfs_bmap.h       |   22 +++++++-
 fs/xfs/libxfs/xfs_bmap_btree.c |    1 
 fs/xfs/libxfs/xfs_inode_fork.c |   47 +++++++++++++++-
 fs/xfs/libxfs/xfs_inode_fork.h |   28 ++++++++--
 fs/xfs/libxfs/xfs_rmap.c       |    2 +
 fs/xfs/libxfs/xfs_types.h      |    1 
 fs/xfs/xfs_icache.c            |    5 ++
 fs/xfs/xfs_inode.h             |    4 +
 fs/xfs/xfs_reflink.c           |  114 ++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_reflink.h           |   23 ++++++++
 fs/xfs/xfs_trace.h             |    4 +
 13 files changed, 258 insertions(+), 21 deletions(-)
 create mode 100644 fs/xfs/xfs_reflink.c
 create mode 100644 fs/xfs/xfs_reflink.h


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 941afe6..56c384b 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -91,6 +91,7 @@ xfs-y				+= xfs_aops.o \
 				   xfs_message.o \
 				   xfs_mount.o \
 				   xfs_mru_cache.o \
+				   xfs_reflink.o \
 				   xfs_stats.o \
 				   xfs_super.o \
 				   xfs_symlink.o \
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 5af4593..ccfaf60 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -2924,6 +2924,7 @@ xfs_bmap_add_extent_hole_real(
 	ASSERT(!isnullstartblock(new->br_startblock));
 	ASSERT(!bma->cur ||
 	       !(bma->cur->bc_private.b.flags & XFS_BTCUR_BPRV_WASDEL));
+	ASSERT(whichfork != XFS_COW_FORK);
 
 	XFS_STATS_INC(mp, xs_add_exlist);
 
@@ -4058,12 +4059,11 @@ xfs_bmapi_read(
 	int			error;
 	int			eof;
 	int			n = 0;
-	int			whichfork = (flags & XFS_BMAPI_ATTRFORK) ?
-						XFS_ATTR_FORK : XFS_DATA_FORK;
+	int			whichfork = xfs_bmapi_whichfork(flags);
 
 	ASSERT(*nmap >= 1);
 	ASSERT(!(flags & ~(XFS_BMAPI_ATTRFORK|XFS_BMAPI_ENTIRE|
-			   XFS_BMAPI_IGSTATE)));
+			   XFS_BMAPI_IGSTATE|XFS_BMAPI_COWFORK)));
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_SHARED|XFS_ILOCK_EXCL));
 
 	if (unlikely(XFS_TEST_ERROR(
@@ -4081,6 +4081,16 @@ xfs_bmapi_read(
 
 	ifp = XFS_IFORK_PTR(ip, whichfork);
 
+	/* No CoW fork?  Return a hole. */
+	if (whichfork == XFS_COW_FORK && !ifp) {
+		mval->br_startoff = bno;
+		mval->br_startblock = HOLESTARTBLOCK;
+		mval->br_blockcount = len;
+		mval->br_state = XFS_EXT_NORM;
+		*nmap = 1;
+		return 0;
+	}
+
 	if (!(ifp->if_flags & XFS_IFEXTENTS)) {
 		error = xfs_iread_extents(NULL, ip, whichfork);
 		if (error)
@@ -4433,8 +4443,7 @@ xfs_bmapi_convert_unwritten(
 	xfs_filblks_t		len,
 	int			flags)
 {
-	int			whichfork = (flags & XFS_BMAPI_ATTRFORK) ?
-						XFS_ATTR_FORK : XFS_DATA_FORK;
+	int			whichfork = xfs_bmapi_whichfork(flags);
 	struct xfs_ifork	*ifp = XFS_IFORK_PTR(bma->ip, whichfork);
 	int			tmp_logflags = 0;
 	int			error;
@@ -4450,6 +4459,8 @@ xfs_bmapi_convert_unwritten(
 			(XFS_BMAPI_PREALLOC | XFS_BMAPI_CONVERT))
 		return 0;
 
+	ASSERT(whichfork != XFS_COW_FORK);
+
 	/*
 	 * Modify (by adding) the state flag, if writing.
 	 */
@@ -4862,6 +4873,8 @@ xfs_bmap_del_extent(
 
 	if (whichfork == XFS_ATTR_FORK)
 		state |= BMAP_ATTRFORK;
+	else if (whichfork == XFS_COW_FORK)
+		state |= BMAP_COWFORK;
 
 	ifp = XFS_IFORK_PTR(ip, whichfork);
 	ASSERT((*idx >= 0) && (*idx < ifp->if_bytes /
@@ -5200,8 +5213,8 @@ __xfs_bunmapi(
 
 	trace_xfs_bunmap(ip, bno, len, flags, _RET_IP_);
 
-	whichfork = (flags & XFS_BMAPI_ATTRFORK) ?
-		XFS_ATTR_FORK : XFS_DATA_FORK;
+	whichfork = xfs_bmapi_whichfork(flags);
+	ASSERT(whichfork != XFS_COW_FORK);
 	ifp = XFS_IFORK_PTR(ip, whichfork);
 	if (unlikely(
 	    XFS_IFORK_FORMAT(ip, whichfork) != XFS_DINODE_FMT_EXTENTS &&
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 97828c5..a8ef1c6 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -104,6 +104,9 @@ struct xfs_bmap_free_item
  */
 #define XFS_BMAPI_REMAP		0x100
 
+/* Map something in the CoW fork. */
+#define XFS_BMAPI_COWFORK	0x200
+
 #define XFS_BMAPI_FLAGS \
 	{ XFS_BMAPI_ENTIRE,	"ENTIRE" }, \
 	{ XFS_BMAPI_METADATA,	"METADATA" }, \
@@ -113,12 +116,23 @@ struct xfs_bmap_free_item
 	{ XFS_BMAPI_CONTIG,	"CONTIG" }, \
 	{ XFS_BMAPI_CONVERT,	"CONVERT" }, \
 	{ XFS_BMAPI_ZERO,	"ZERO" }, \
-	{ XFS_BMAPI_REMAP,	"REMAP" }
+	{ XFS_BMAPI_REMAP,	"REMAP" }, \
+	{ XFS_BMAPI_COWFORK,	"COWFORK" }
 
 
 static inline int xfs_bmapi_aflag(int w)
 {
-	return (w == XFS_ATTR_FORK ? XFS_BMAPI_ATTRFORK : 0);
+	return (w == XFS_ATTR_FORK ? XFS_BMAPI_ATTRFORK :
+	       (w == XFS_COW_FORK ? XFS_BMAPI_COWFORK : 0));
+}
+
+static inline int xfs_bmapi_whichfork(int bmapi_flags)
+{
+	if (bmapi_flags & XFS_BMAPI_COWFORK)
+		return XFS_COW_FORK;
+	else if (bmapi_flags & XFS_BMAPI_ATTRFORK)
+		return XFS_ATTR_FORK;
+	return XFS_DATA_FORK;
 }
 
 /*
@@ -139,13 +153,15 @@ static inline int xfs_bmapi_aflag(int w)
 #define BMAP_LEFT_VALID		(1 << 6)
 #define BMAP_RIGHT_VALID	(1 << 7)
 #define BMAP_ATTRFORK		(1 << 8)
+#define BMAP_COWFORK		(1 << 9)
 
 #define XFS_BMAP_EXT_FLAGS \
 	{ BMAP_LEFT_CONTIG,	"LC" }, \
 	{ BMAP_RIGHT_CONTIG,	"RC" }, \
 	{ BMAP_LEFT_FILLING,	"LF" }, \
 	{ BMAP_RIGHT_FILLING,	"RF" }, \
-	{ BMAP_ATTRFORK,	"ATTR" }
+	{ BMAP_ATTRFORK,	"ATTR" }, \
+	{ BMAP_COWFORK,		"COW" }
 
 
 /*
diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
index 3e68f9a..a5a8d37 100644
--- a/fs/xfs/libxfs/xfs_bmap_btree.c
+++ b/fs/xfs/libxfs/xfs_bmap_btree.c
@@ -776,6 +776,7 @@ xfs_bmbt_init_cursor(
 {
 	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
 	struct xfs_btree_cur	*cur;
+	ASSERT(whichfork != XFS_COW_FORK);
 
 	cur = kmem_zone_zalloc(xfs_btree_cur_zone, KM_SLEEP);
 
diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
index 7699a03..d29954a 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.c
+++ b/fs/xfs/libxfs/xfs_inode_fork.c
@@ -206,9 +206,14 @@ xfs_iformat_fork(
 		XFS_ERROR_REPORT("xfs_iformat(7)", XFS_ERRLEVEL_LOW, ip->i_mount);
 		return -EFSCORRUPTED;
 	}
-	if (error) {
+	if (error)
 		return error;
+
+	if (xfs_is_reflink_inode(ip)) {
+		ASSERT(ip->i_cowfp == NULL);
+		xfs_ifork_init_cow(ip);
 	}
+
 	if (!XFS_DFORK_Q(dip))
 		return 0;
 
@@ -247,6 +252,9 @@ xfs_iformat_fork(
 	if (error) {
 		kmem_zone_free(xfs_ifork_zone, ip->i_afp);
 		ip->i_afp = NULL;
+		if (ip->i_cowfp)
+			kmem_zone_free(xfs_ifork_zone, ip->i_cowfp);
+		ip->i_cowfp = NULL;
 		xfs_idestroy_fork(ip, XFS_DATA_FORK);
 	}
 	return error;
@@ -761,6 +769,9 @@ xfs_idestroy_fork(
 	if (whichfork == XFS_ATTR_FORK) {
 		kmem_zone_free(xfs_ifork_zone, ip->i_afp);
 		ip->i_afp = NULL;
+	} else if (whichfork == XFS_COW_FORK) {
+		kmem_zone_free(xfs_ifork_zone, ip->i_cowfp);
+		ip->i_cowfp = NULL;
 	}
 }
 
@@ -948,6 +959,19 @@ xfs_iext_get_ext(
 	}
 }
 
+/* XFS_IEXT_STATE_TO_FORK() -- Convert BMAP state flags to an inode fork. */
+xfs_ifork_t *
+XFS_IEXT_STATE_TO_FORK(
+	struct xfs_inode	*ip,
+	int			state)
+{
+	if (state & BMAP_COWFORK)
+		return ip->i_cowfp;
+	else if (state & BMAP_ATTRFORK)
+		return ip->i_afp;
+	return &ip->i_df;
+}
+
 /*
  * Insert new item(s) into the extent records for incore inode
  * fork 'ifp'.  'count' new items are inserted at index 'idx'.
@@ -960,7 +984,7 @@ xfs_iext_insert(
 	xfs_bmbt_irec_t	*new,		/* items to insert */
 	int		state)		/* type of extent conversion */
 {
-	xfs_ifork_t	*ifp = (state & BMAP_ATTRFORK) ? ip->i_afp : &ip->i_df;
+	xfs_ifork_t	*ifp = XFS_IEXT_STATE_TO_FORK(ip, state);
 	xfs_extnum_t	i;		/* extent record index */
 
 	trace_xfs_iext_insert(ip, idx, new, state, _RET_IP_);
@@ -1210,7 +1234,7 @@ xfs_iext_remove(
 	int		ext_diff,	/* number of extents to remove */
 	int		state)		/* type of extent conversion */
 {
-	xfs_ifork_t	*ifp = (state & BMAP_ATTRFORK) ? ip->i_afp : &ip->i_df;
+	xfs_ifork_t	*ifp = XFS_IEXT_STATE_TO_FORK(ip, state);
 	xfs_extnum_t	nextents;	/* number of extents in file */
 	int		new_size;	/* size of extents after removal */
 
@@ -1955,3 +1979,20 @@ xfs_iext_irec_update_extoffs(
 		ifp->if_u1.if_ext_irec[i].er_extoff += ext_diff;
 	}
 }
+
+/*
+ * Initialize an inode's copy-on-write fork.
+ */
+void
+xfs_ifork_init_cow(
+	struct xfs_inode	*ip)
+{
+	if (ip->i_cowfp)
+		return;
+
+	ip->i_cowfp = kmem_zone_zalloc(xfs_ifork_zone,
+				       KM_SLEEP | KM_NOFS);
+	ip->i_cowfp->if_flags = XFS_IFEXTENTS;
+	ip->i_cformat = XFS_DINODE_FMT_EXTENTS;
+	ip->i_cnextents = 0;
+}
diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h
index f95e072..44d38eb 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.h
+++ b/fs/xfs/libxfs/xfs_inode_fork.h
@@ -92,7 +92,9 @@ typedef struct xfs_ifork {
 #define XFS_IFORK_PTR(ip,w)		\
 	((w) == XFS_DATA_FORK ? \
 		&(ip)->i_df : \
-		(ip)->i_afp)
+		((w) == XFS_ATTR_FORK ? \
+			(ip)->i_afp : \
+			(ip)->i_cowfp))
 #define XFS_IFORK_DSIZE(ip) \
 	(XFS_IFORK_Q(ip) ? \
 		XFS_IFORK_BOFF(ip) : \
@@ -105,26 +107,38 @@ typedef struct xfs_ifork {
 #define XFS_IFORK_SIZE(ip,w) \
 	((w) == XFS_DATA_FORK ? \
 		XFS_IFORK_DSIZE(ip) : \
-		XFS_IFORK_ASIZE(ip))
+		((w) == XFS_ATTR_FORK ? \
+			XFS_IFORK_ASIZE(ip) : \
+			0))
 #define XFS_IFORK_FORMAT(ip,w) \
 	((w) == XFS_DATA_FORK ? \
 		(ip)->i_d.di_format : \
-		(ip)->i_d.di_aformat)
+		((w) == XFS_ATTR_FORK ? \
+			(ip)->i_d.di_aformat : \
+			(ip)->i_cformat))
 #define XFS_IFORK_FMT_SET(ip,w,n) \
 	((w) == XFS_DATA_FORK ? \
 		((ip)->i_d.di_format = (n)) : \
-		((ip)->i_d.di_aformat = (n)))
+		((w) == XFS_ATTR_FORK ? \
+			((ip)->i_d.di_aformat = (n)) : \
+			((ip)->i_cformat = (n))))
 #define XFS_IFORK_NEXTENTS(ip,w) \
 	((w) == XFS_DATA_FORK ? \
 		(ip)->i_d.di_nextents : \
-		(ip)->i_d.di_anextents)
+		((w) == XFS_ATTR_FORK ? \
+			(ip)->i_d.di_anextents : \
+			(ip)->i_cnextents))
 #define XFS_IFORK_NEXT_SET(ip,w,n) \
 	((w) == XFS_DATA_FORK ? \
 		((ip)->i_d.di_nextents = (n)) : \
-		((ip)->i_d.di_anextents = (n)))
+		((w) == XFS_ATTR_FORK ? \
+			((ip)->i_d.di_anextents = (n)) : \
+			((ip)->i_cnextents = (n))))
 #define XFS_IFORK_MAXEXT(ip, w) \
 	(XFS_IFORK_SIZE(ip, w) / sizeof(xfs_bmbt_rec_t))
 
+xfs_ifork_t	*XFS_IEXT_STATE_TO_FORK(struct xfs_inode *ip, int state);
+
 int		xfs_iformat_fork(struct xfs_inode *, struct xfs_dinode *);
 void		xfs_iflush_fork(struct xfs_inode *, struct xfs_dinode *,
 				struct xfs_inode_log_item *, int);
@@ -169,4 +183,6 @@ void		xfs_iext_irec_update_extoffs(struct xfs_ifork *, int, int);
 
 extern struct kmem_zone	*xfs_ifork_zone;
 
+extern void xfs_ifork_init_cow(struct xfs_inode *ip);
+
 #endif	/* __XFS_INODE_FORK_H__ */
diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
index f179ea4..611107c 100644
--- a/fs/xfs/libxfs/xfs_rmap.c
+++ b/fs/xfs/libxfs/xfs_rmap.c
@@ -1346,6 +1346,8 @@ __xfs_rmap_add(
 
 	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
 		return 0;
+	if (ri->ri_whichfork == XFS_COW_FORK)
+		return 0;
 
 	trace_xfs_rmap_defer(mp, XFS_FSB_TO_AGNO(mp, ri->ri_bmap.br_startblock),
 			ri->ri_type,
diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
index 690d616..cf044c0 100644
--- a/fs/xfs/libxfs/xfs_types.h
+++ b/fs/xfs/libxfs/xfs_types.h
@@ -93,6 +93,7 @@ typedef __int64_t	xfs_sfiloff_t;	/* signed block number in a file */
  */
 #define	XFS_DATA_FORK	0
 #define	XFS_ATTR_FORK	1
+#define	XFS_COW_FORK	2
 
 /*
  * Min numbers of data/attr fork btree root pointers.
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 99ee6eee..06f3b8c 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -76,6 +76,9 @@ xfs_inode_alloc(
 	ip->i_mount = mp;
 	memset(&ip->i_imap, 0, sizeof(struct xfs_imap));
 	ip->i_afp = NULL;
+	ip->i_cowfp = NULL;
+	ip->i_cnextents = 0;
+	ip->i_cformat = XFS_DINODE_FMT_EXTENTS;
 	memset(&ip->i_df, 0, sizeof(xfs_ifork_t));
 	ip->i_flags = 0;
 	ip->i_delayed_blks = 0;
@@ -101,6 +104,8 @@ xfs_inode_free_callback(
 
 	if (ip->i_afp)
 		xfs_idestroy_fork(ip, XFS_ATTR_FORK);
+	if (ip->i_cowfp)
+		xfs_idestroy_fork(ip, XFS_COW_FORK);
 
 	if (ip->i_itemp) {
 		ASSERT(!(ip->i_itemp->ili_item.li_flags & XFS_LI_IN_AIL));
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index d0ea6ff..797fcc7 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -47,6 +47,7 @@ typedef struct xfs_inode {
 
 	/* Extent information. */
 	xfs_ifork_t		*i_afp;		/* attribute fork pointer */
+	xfs_ifork_t		*i_cowfp;	/* copy on write extents */
 	xfs_ifork_t		i_df;		/* data fork */
 
 	/* operations vectors */
@@ -65,6 +66,9 @@ typedef struct xfs_inode {
 
 	struct xfs_icdinode	i_d;		/* most of ondisk inode */
 
+	xfs_extnum_t		i_cnextents;	/* # of extents in cow fork */
+	unsigned int		i_cformat;	/* format of cow fork */
+
 	/* VFS inode */
 	struct inode		i_vnode;	/* embedded VFS inode */
 } xfs_inode_t;
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
new file mode 100644
index 0000000..7adbb83
--- /dev/null
+++ b/fs/xfs/xfs_reflink.c
@@ -0,0 +1,114 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_da_format.h"
+#include "xfs_da_btree.h"
+#include "xfs_inode.h"
+#include "xfs_trans.h"
+#include "xfs_inode_item.h"
+#include "xfs_bmap.h"
+#include "xfs_bmap_util.h"
+#include "xfs_error.h"
+#include "xfs_dir2.h"
+#include "xfs_dir2_priv.h"
+#include "xfs_ioctl.h"
+#include "xfs_trace.h"
+#include "xfs_log.h"
+#include "xfs_icache.h"
+#include "xfs_pnfs.h"
+#include "xfs_refcount_btree.h"
+#include "xfs_refcount.h"
+#include "xfs_bmap_btree.h"
+#include "xfs_trans_space.h"
+#include "xfs_bit.h"
+#include "xfs_alloc.h"
+#include "xfs_quota_defs.h"
+#include "xfs_quota.h"
+#include "xfs_btree.h"
+#include "xfs_bmap_btree.h"
+#include "xfs_reflink.h"
+
+/*
+ * Copy on Write of Shared Blocks
+ *
+ * XFS must preserve "the usual" file semantics even when two files share
+ * the same physical blocks.  This means that a write to one file must not
+ * alter the blocks in a different file; the way that we'll do that is
+ * through the use of a copy-on-write mechanism.  At a high level, that
+ * means that when we want to write to a shared block, we allocate a new
+ * block, write the data to the new block, and if that succeeds we map the
+ * new block into the file.
+ *
+ * XFS provides a "delayed allocation" mechanism that defers the allocation
+ * of disk blocks to dirty-but-not-yet-mapped file blocks as long as
+ * possible.  This reduces fragmentation by enabling the filesystem to ask
+ * for bigger chunks less often, which is exactly what we want for CoW.
+ *
+ * The delalloc mechanism begins when the kernel wants to make a block
+ * writable (write_begin or page_mkwrite).  If the offset is not mapped, we
+ * create a delalloc mapping, which is a regular in-core extent, but without
+ * a real startblock.  (For delalloc mappings, the startblock encodes both
+ * a flag that this is a delalloc mapping, and a worst-case estimate of how
+ * many blocks might be required to put the mapping into the BMBT.)  delalloc
+ * mappings are a reservation against the free space in the filesystem;
+ * adjacent mappings can also be combined into fewer larger mappings.
+ *
+ * When dirty pages are being written out (typically in writepage), the
+ * delalloc reservations are converted into real mappings by allocating
+ * blocks and replacing the delalloc mapping with real ones.  A delalloc
+ * mapping can be replaced by several real ones if the free space is
+ * fragmented.
+ *
+ * We want to adapt the delalloc mechanism for copy-on-write, since the
+ * write paths are similar.  The first two steps (creating the reservation
+ * and allocating the blocks) are exactly the same as delalloc except that
+ * the mappings must be stored in a separate CoW fork because we do not want
+ * to disturb the mapping in the data fork until we're sure that the write
+ * succeeded.  IO completion in this case is the process of removing the old
+ * mapping from the data fork and moving the new mapping from the CoW fork to
+ * the data fork.  This will be discussed shortly.
+ *
+ * For now, unaligned directio writes will be bounced back to the page cache.
+ * Block-aligned directio writes will use the same mechanism as buffered
+ * writes.
+ *
+ * CoW remapping must be done after the data block write completes,
+ * because we don't want to destroy the old data fork map until we're sure
+ * the new block has been written.  Since the new mappings are kept in a
+ * separate fork, we can simply iterate these mappings to find the ones
+ * that cover the file blocks that we just CoW'd.  For each extent, simply
+ * unmap the corresponding range in the data fork, map the new range into
+ * the data fork, and remove the extent from the CoW fork.
+ *
+ * Since the remapping operation can be applied to an arbitrary file
+ * range, we record the need for the remap step as a flag in the ioend
+ * instead of declaring a new IO type.  This is required for direct io
+ * because we only have one ioend for the whole dio, and we have to be able to
+ * remember the presence of unwritten blocks and CoW blocks with a single
+ * ioend structure.  Better yet, the more ground we can cover with one
+ * ioend, the better.
+ */
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
new file mode 100644
index 0000000..820b151
--- /dev/null
+++ b/fs/xfs/xfs_reflink.h
@@ -0,0 +1,23 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#ifndef __XFS_REFLINK_H
+#define __XFS_REFLINK_H 1
+
+#endif /* __XFS_REFLINK_H */
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index fe9c6f8..079075f 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -269,10 +269,10 @@ DECLARE_EVENT_CLASS(xfs_bmap_class,
 		__field(unsigned long, caller_ip)
 	),
 	TP_fast_assign(
-		struct xfs_ifork	*ifp = (state & BMAP_ATTRFORK) ?
-						ip->i_afp : &ip->i_df;
+		struct xfs_ifork	*ifp;
 		struct xfs_bmbt_irec	r;
 
+		ifp = XFS_IEXT_STATE_TO_FORK(ip, state);
 		xfs_bmbt_get_all(xfs_iext_get_ext(ifp, idx), &r);
 		__entry->dev = VFS_I(ip)->i_sb->s_dev;
 		__entry->ino = ip->i_ino;


^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 078/119] xfs: support bmapping delalloc extents in the CoW fork
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (76 preceding siblings ...)
  2016-06-17  1:26 ` [PATCH 077/119] xfs: introduce the CoW fork Darrick J. Wong
@ 2016-06-17  1:26 ` Darrick J. Wong
  2016-06-17  1:26 ` [PATCH 079/119] xfs: create delalloc extents in " Darrick J. Wong
                   ` (40 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:26 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Allow the creation of delayed allocation extents in the CoW fork.
In a subsequent patch we'll wire up write_begin and page_mkwrite to
actually do this.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c |   29 +++++++++++++++++-----------
 fs/xfs/libxfs/xfs_bmap.h |    2 +-
 fs/xfs/xfs_iomap.c       |   48 +++++++++++++++++++++++++++++++++++++---------
 fs/xfs/xfs_iomap.h       |    2 ++
 4 files changed, 60 insertions(+), 21 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index ccfaf60..18b94e6 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -2760,6 +2760,7 @@ done:
 STATIC void
 xfs_bmap_add_extent_hole_delay(
 	xfs_inode_t		*ip,	/* incore inode pointer */
+	int			whichfork,
 	xfs_extnum_t		*idx,	/* extent number to update/insert */
 	xfs_bmbt_irec_t		*new)	/* new data to add to file extents */
 {
@@ -2771,8 +2772,10 @@ xfs_bmap_add_extent_hole_delay(
 	int			state;  /* state bits, accessed thru macros */
 	xfs_filblks_t		temp=0;	/* temp for indirect calculations */
 
-	ifp = XFS_IFORK_PTR(ip, XFS_DATA_FORK);
+	ifp = XFS_IFORK_PTR(ip, whichfork);
 	state = 0;
+	if (whichfork == XFS_COW_FORK)
+		state |= BMAP_COWFORK;
 	ASSERT(isnullstartblock(new->br_startblock));
 
 	/*
@@ -2790,7 +2793,7 @@ xfs_bmap_add_extent_hole_delay(
 	 * Check and set flags if the current (right) segment exists.
 	 * If it doesn't exist, we're converting the hole at end-of-file.
 	 */
-	if (*idx < ip->i_df.if_bytes / (uint)sizeof(xfs_bmbt_rec_t)) {
+	if (*idx < ifp->if_bytes / (uint)sizeof(xfs_bmbt_rec_t)) {
 		state |= BMAP_RIGHT_VALID;
 		xfs_bmbt_get_all(xfs_iext_get_ext(ifp, *idx), &right);
 
@@ -4140,6 +4143,7 @@ xfs_bmapi_read(
 STATIC int
 xfs_bmapi_reserve_delalloc(
 	struct xfs_inode	*ip,
+	int			whichfork,
 	xfs_fileoff_t		aoff,
 	xfs_filblks_t		len,
 	struct xfs_bmbt_irec	*got,
@@ -4148,7 +4152,7 @@ xfs_bmapi_reserve_delalloc(
 	int			eof)
 {
 	struct xfs_mount	*mp = ip->i_mount;
-	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, XFS_DATA_FORK);
+	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
 	xfs_extlen_t		alen;
 	xfs_extlen_t		indlen;
 	char			rt = XFS_IS_REALTIME_INODE(ip);
@@ -4207,7 +4211,7 @@ xfs_bmapi_reserve_delalloc(
 	got->br_startblock = nullstartblock(indlen);
 	got->br_blockcount = alen;
 	got->br_state = XFS_EXT_NORM;
-	xfs_bmap_add_extent_hole_delay(ip, lastx, got);
+	xfs_bmap_add_extent_hole_delay(ip, whichfork, lastx, got);
 
 	/*
 	 * Update our extent pointer, given that xfs_bmap_add_extent_hole_delay
@@ -4239,6 +4243,7 @@ out_unreserve_quota:
 int
 xfs_bmapi_delay(
 	struct xfs_inode	*ip,	/* incore inode */
+	int			whichfork, /* data or cow fork? */
 	xfs_fileoff_t		bno,	/* starting file offs. mapped */
 	xfs_filblks_t		len,	/* length to map in file */
 	struct xfs_bmbt_irec	*mval,	/* output: map values */
@@ -4246,7 +4251,7 @@ xfs_bmapi_delay(
 	int			flags)	/* XFS_BMAPI_... */
 {
 	struct xfs_mount	*mp = ip->i_mount;
-	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, XFS_DATA_FORK);
+	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
 	struct xfs_bmbt_irec	got;	/* current file extent record */
 	struct xfs_bmbt_irec	prev;	/* previous file extent record */
 	xfs_fileoff_t		obno;	/* old block number (offset) */
@@ -4256,14 +4261,15 @@ xfs_bmapi_delay(
 	int			n = 0;	/* current extent index */
 	int			error = 0;
 
+	ASSERT(whichfork == XFS_DATA_FORK || whichfork == XFS_COW_FORK);
 	ASSERT(*nmap >= 1);
 	ASSERT(*nmap <= XFS_BMAP_MAX_NMAP);
 	ASSERT(!(flags & ~XFS_BMAPI_ENTIRE));
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
 
 	if (unlikely(XFS_TEST_ERROR(
-	    (XFS_IFORK_FORMAT(ip, XFS_DATA_FORK) != XFS_DINODE_FMT_EXTENTS &&
-	     XFS_IFORK_FORMAT(ip, XFS_DATA_FORK) != XFS_DINODE_FMT_BTREE),
+	    (XFS_IFORK_FORMAT(ip, whichfork) != XFS_DINODE_FMT_EXTENTS &&
+	     XFS_IFORK_FORMAT(ip, whichfork) != XFS_DINODE_FMT_BTREE),
 	     mp, XFS_ERRTAG_BMAPIFORMAT, XFS_RANDOM_BMAPIFORMAT))) {
 		XFS_ERROR_REPORT("xfs_bmapi_delay", XFS_ERRLEVEL_LOW, mp);
 		return -EFSCORRUPTED;
@@ -4274,19 +4280,20 @@ xfs_bmapi_delay(
 
 	XFS_STATS_INC(mp, xs_blk_mapw);
 
-	if (!(ifp->if_flags & XFS_IFEXTENTS)) {
-		error = xfs_iread_extents(NULL, ip, XFS_DATA_FORK);
+	if (whichfork == XFS_DATA_FORK && !(ifp->if_flags & XFS_IFEXTENTS)) {
+		error = xfs_iread_extents(NULL, ip, whichfork);
 		if (error)
 			return error;
 	}
 
-	xfs_bmap_search_extents(ip, bno, XFS_DATA_FORK, &eof, &lastx, &got, &prev);
+	xfs_bmap_search_extents(ip, bno, whichfork, &eof, &lastx, &got, &prev);
 	end = bno + len;
 	obno = bno;
 
 	while (bno < end && n < *nmap) {
 		if (eof || got.br_startoff > bno) {
-			error = xfs_bmapi_reserve_delalloc(ip, bno, len, &got,
+			error = xfs_bmapi_reserve_delalloc(ip, whichfork,
+							   bno, len, &got,
 							   &prev, &lastx, eof);
 			if (error) {
 				if (n == 0) {
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index a8ef1c6..d90f88e 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -205,7 +205,7 @@ int	xfs_bmap_read_extents(struct xfs_trans *tp, struct xfs_inode *ip,
 int	xfs_bmapi_read(struct xfs_inode *ip, xfs_fileoff_t bno,
 		xfs_filblks_t len, struct xfs_bmbt_irec *mval,
 		int *nmap, int flags);
-int	xfs_bmapi_delay(struct xfs_inode *ip, xfs_fileoff_t bno,
+int	xfs_bmapi_delay(struct xfs_inode *ip, int whichfork, xfs_fileoff_t bno,
 		xfs_filblks_t len, struct xfs_bmbt_irec *mval,
 		int *nmap, int flags);
 int	xfs_bmapi_write(struct xfs_trans *tp, struct xfs_inode *ip,
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 61b61f51..82c4697 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -560,12 +560,13 @@ check_writeio:
 	return alloc_blocks;
 }
 
-int
-xfs_iomap_write_delay(
+STATIC int
+__xfs_iomap_write_delay(
 	xfs_inode_t	*ip,
 	xfs_off_t	offset,
 	size_t		count,
-	xfs_bmbt_irec_t *ret_imap)
+	xfs_bmbt_irec_t *ret_imap,
+	int		whichfork)
 {
 	xfs_mount_t	*mp = ip->i_mount;
 	xfs_fileoff_t	offset_fsb;
@@ -591,10 +592,14 @@ xfs_iomap_write_delay(
 	extsz = xfs_get_extsz_hint(ip);
 	offset_fsb = XFS_B_TO_FSBT(mp, offset);
 
-	error = xfs_iomap_eof_want_preallocate(mp, ip, offset, count,
-				imap, XFS_WRITE_IMAPS, &prealloc);
-	if (error)
-		return error;
+	if (whichfork == XFS_DATA_FORK) {
+		error = xfs_iomap_eof_want_preallocate(mp, ip, offset, count,
+					imap, XFS_WRITE_IMAPS, &prealloc);
+		if (error)
+			return error;
+	} else {
+		prealloc = 0;
+	}
 
 retry:
 	if (prealloc) {
@@ -626,8 +631,8 @@ retry:
 	ASSERT(last_fsb > offset_fsb);
 
 	nimaps = XFS_WRITE_IMAPS;
-	error = xfs_bmapi_delay(ip, offset_fsb, last_fsb - offset_fsb,
-				imap, &nimaps, XFS_BMAPI_ENTIRE);
+	error = xfs_bmapi_delay(ip, whichfork, offset_fsb,
+			last_fsb - offset_fsb, imap, &nimaps, XFS_BMAPI_ENTIRE);
 	switch (error) {
 	case 0:
 	case -ENOSPC:
@@ -665,6 +670,31 @@ retry:
 	return 0;
 }
 
+int
+xfs_iomap_write_delay(
+	xfs_inode_t	*ip,
+	xfs_off_t	offset,
+	size_t		count,
+	xfs_bmbt_irec_t *ret_imap)
+{
+	return __xfs_iomap_write_delay(ip, offset, count, ret_imap,
+				       XFS_DATA_FORK);
+}
+
+int
+xfs_iomap_cow_delay(
+	xfs_inode_t	*ip,
+	xfs_off_t	offset,
+	size_t		count,
+	xfs_bmbt_irec_t *ret_imap)
+{
+	ASSERT(XFS_IFORK_PTR(ip, XFS_COW_FORK) != NULL);
+	trace_xfs_iomap_cow_delay(ip, offset, count);
+
+	return __xfs_iomap_write_delay(ip, offset, count, ret_imap,
+				       XFS_COW_FORK);
+}
+
 /*
  * Pass in a delayed allocate extent, convert it to real extents;
  * return to the caller the extent we create which maps on top of
diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
index 8688e66..f6a9adf 100644
--- a/fs/xfs/xfs_iomap.h
+++ b/fs/xfs/xfs_iomap.h
@@ -28,5 +28,7 @@ int xfs_iomap_write_delay(struct xfs_inode *, xfs_off_t, size_t,
 int xfs_iomap_write_allocate(struct xfs_inode *, xfs_off_t,
 			struct xfs_bmbt_irec *);
 int xfs_iomap_write_unwritten(struct xfs_inode *, xfs_off_t, xfs_off_t);
+int xfs_iomap_cow_delay(struct xfs_inode *, xfs_off_t, size_t,
+			struct xfs_bmbt_irec *);
 
 #endif /* __XFS_IOMAP_H__*/




* [PATCH 079/119] xfs: create delalloc extents in CoW fork
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (77 preceding siblings ...)
  2016-06-17  1:26 ` [PATCH 078/119] xfs: support bmapping delalloc extents in " Darrick J. Wong
@ 2016-06-17  1:26 ` Darrick J. Wong
  2016-06-17  1:26 ` [PATCH 080/119] xfs: support allocating delayed " Darrick J. Wong
                   ` (39 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:26 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Wire up write_begin and page_mkwrite to detect shared extents and
create delayed allocation extents in the CoW fork.

v2: Make trim_extent better at constraining the extent to just
the range passed in.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_aops.c    |   10 +++
 fs/xfs/xfs_file.c    |   10 +++
 fs/xfs/xfs_reflink.c |  158 ++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_reflink.h |    3 +
 4 files changed, 181 insertions(+)


diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 4c463b9..66a2a9b 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -31,6 +31,7 @@
 #include "xfs_bmap.h"
 #include "xfs_bmap_util.h"
 #include "xfs_bmap_btree.h"
+#include "xfs_reflink.h"
 #include <linux/gfp.h>
 #include <linux/mpage.h>
 #include <linux/pagevec.h>
@@ -1567,6 +1568,15 @@ xfs_vm_write_begin(
 	if (!page)
 		return -ENOMEM;
 
+	/* Reserve delalloc blocks for CoW. */
+	status = xfs_reflink_reserve_cow_range(XFS_I(mapping->host), pos, len);
+	if (status) {
+		unlock_page(page);
+		put_page(page);
+		*pagep = NULL;
+		return status;
+	}
+
 	status = __block_write_begin(page, pos, len, xfs_get_blocks);
 	if (xfs_mp_fail_writes(mp))
 		status = -EIO;
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 47fc632..148d0b3 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -37,6 +37,7 @@
 #include "xfs_log.h"
 #include "xfs_icache.h"
 #include "xfs_pnfs.h"
+#include "xfs_reflink.h"
 
 #include <linux/dcache.h>
 #include <linux/falloc.h>
@@ -1550,6 +1551,14 @@ xfs_filemap_page_mkwrite(
 	file_update_time(vma->vm_file);
 	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 
+	/* Reserve delalloc blocks for CoW. */
+	ret = xfs_reflink_reserve_cow_range(XFS_I(inode),
+			vmf->page->index << PAGE_SHIFT, PAGE_SIZE);
+	if (ret) {
+		ret = block_page_mkwrite_return(ret);
+		goto out;
+	}
+
 	if (IS_DAX(inode)) {
 		ret = __dax_mkwrite(vma, vmf, xfs_get_blocks_dax_fault);
 	} else {
@@ -1557,6 +1566,7 @@ xfs_filemap_page_mkwrite(
 		ret = block_page_mkwrite_return(ret);
 	}
 
+out:
 	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 	sb_end_pagefault(inode->i_sb);
 
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 7adbb83..112d86b 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -51,6 +51,7 @@
 #include "xfs_btree.h"
 #include "xfs_bmap_btree.h"
 #include "xfs_reflink.h"
+#include "xfs_iomap.h"
 
 /*
  * Copy on Write of Shared Blocks
@@ -112,3 +113,160 @@
  * ioend structure.  Better yet, the more ground we can cover with one
  * ioend, the better.
  */
+
+/* Trim extent to fit a logical block range. */
+static void
+xfs_trim_extent(
+	struct xfs_bmbt_irec	*irec,
+	xfs_fileoff_t		bno,
+	xfs_filblks_t		len)
+{
+	xfs_fileoff_t		distance;
+	xfs_fileoff_t		end = bno + len;
+
+	if (irec->br_startoff + irec->br_blockcount <= bno ||
+	    irec->br_startoff >= end) {
+		irec->br_blockcount = 0;
+		return;
+	}
+
+	if (irec->br_startoff < bno) {
+		distance = bno - irec->br_startoff;
+		if (irec->br_startblock != DELAYSTARTBLOCK &&
+		    irec->br_startblock != HOLESTARTBLOCK)
+			irec->br_startblock += distance;
+		irec->br_startoff += distance;
+		irec->br_blockcount -= distance;
+	}
+
+	if (end < irec->br_startoff + irec->br_blockcount) {
+		distance = irec->br_startoff + irec->br_blockcount - end;
+		irec->br_blockcount -= distance;
+	}
+}
+
+/* Find the shared ranges under an irec, and set up delalloc extents. */
+static int
+xfs_reflink_reserve_cow_extent(
+	struct xfs_inode	*ip,
+	struct xfs_bmbt_irec	*irec)
+{
+	struct xfs_bmbt_irec	rec;
+	xfs_agnumber_t		agno;
+	xfs_agblock_t		agbno;
+	xfs_extlen_t		aglen;
+	xfs_agblock_t		fbno;
+	xfs_extlen_t		flen;
+	xfs_fileoff_t		lblk;
+	xfs_off_t		foffset;
+	xfs_extlen_t		distance;
+	size_t			fsize;
+	int			error = 0;
+
+	/* Holes, unwritten, and delalloc extents cannot be shared */
+	if (ISUNWRITTEN(irec) ||
+	    irec->br_startblock == HOLESTARTBLOCK ||
+	    irec->br_startblock == DELAYSTARTBLOCK)
+		return 0;
+
+	trace_xfs_reflink_reserve_cow_extent(ip, irec);
+	agno = XFS_FSB_TO_AGNO(ip->i_mount, irec->br_startblock);
+	agbno = XFS_FSB_TO_AGBNO(ip->i_mount, irec->br_startblock);
+	lblk = irec->br_startoff;
+	aglen = irec->br_blockcount;
+
+	while (aglen > 0) {
+		/* Find maximal fork range within this extent */
+		error = xfs_refcount_find_shared(ip->i_mount, agno, agbno,
+				aglen, &fbno, &flen, true);
+		if (error)
+			break;
+		if (flen == 0) {
+			distance = fbno - agbno;
+			goto advloop;
+		}
+
+		/* Add as much as we can to the cow fork */
+		foffset = XFS_FSB_TO_B(ip->i_mount, lblk + fbno - agbno);
+		fsize = XFS_FSB_TO_B(ip->i_mount, flen);
+		error = xfs_iomap_cow_delay(ip, foffset, fsize, &rec);
+		if (error)
+			break;
+
+		distance = (rec.br_startoff - lblk) + rec.br_blockcount;
+advloop:
+		if (aglen < distance)
+			break;
+		aglen -= distance;
+		agbno += distance;
+		lblk += distance;
+	}
+
+	if (error)
+		trace_xfs_reflink_reserve_cow_extent_error(ip, error, _RET_IP_);
+	return error;
+}
+
+/*
+ * Create CoW reservations for all shared blocks within a byte range of
+ * a file.
+ */
+int
+xfs_reflink_reserve_cow_range(
+	struct xfs_inode	*ip,
+	xfs_off_t		pos,
+	xfs_off_t		len)
+{
+	struct xfs_bmbt_irec	imap;
+	int			nimaps;
+	int			error = 0;
+	xfs_fileoff_t		lblk;
+	xfs_fileoff_t		next_lblk;
+	struct xfs_ifork	*ifp;
+	struct xfs_bmbt_rec_host	*gotp;
+	xfs_extnum_t		idx;
+
+	if (!xfs_is_reflink_inode(ip))
+		return 0;
+
+	trace_xfs_reflink_reserve_cow_range(ip, len, pos, 0);
+
+	lblk = XFS_B_TO_FSBT(ip->i_mount, pos);
+	next_lblk = XFS_B_TO_FSB(ip->i_mount, pos + len);
+	ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	while (lblk < next_lblk) {
+		/* Already reserved?  Skip the refcount btree access. */
+		gotp = xfs_iext_bno_to_ext(ifp, lblk, &idx);
+		if (gotp) {
+			xfs_bmbt_get_all(gotp, &imap);
+			if (imap.br_startoff <= lblk &&
+			    imap.br_startoff + imap.br_blockcount > lblk) {
+				lblk = imap.br_startoff + imap.br_blockcount;
+				continue;
+			}
+		}
+
+		/* Read extent from the source file. */
+		nimaps = 1;
+		error = xfs_bmapi_read(ip, lblk, next_lblk - lblk, &imap,
+				&nimaps, 0);
+		if (error)
+			break;
+
+		if (nimaps == 0)
+			break;
+
+		/* Fork all the shared blocks in this extent. */
+		error = xfs_reflink_reserve_cow_extent(ip, &imap);
+		if (error)
+			break;
+
+		lblk += imap.br_blockcount;
+	}
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+
+	if (error)
+		trace_xfs_reflink_reserve_cow_range_error(ip, error, _RET_IP_);
+	return error;
+}
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index 820b151..7b0a215 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -20,4 +20,7 @@
 #ifndef __XFS_REFLINK_H
 #define __XFS_REFLINK_H 1
 
+extern int xfs_reflink_reserve_cow_range(struct xfs_inode *ip, xfs_off_t pos,
+		xfs_off_t len);
+
 #endif /* __XFS_REFLINK_H */



* [PATCH 080/119] xfs: support allocating delayed extents in CoW fork
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (78 preceding siblings ...)
  2016-06-17  1:26 ` [PATCH 079/119] xfs: create delalloc extents in " Darrick J. Wong
@ 2016-06-17  1:26 ` Darrick J. Wong
  2016-06-17  1:26 ` [PATCH 081/119] xfs: allocate " Darrick J. Wong
                   ` (38 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:26 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Modify xfs_bmap_add_extent_delay_real() so that we can convert delayed
allocation extents in the CoW fork to real allocations, and wire this
up all the way back to xfs_iomap_write_allocate().  In a subsequent
patch, we'll modify the writepage handler to call this.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c |   51 ++++++++++++++++++++++++++++++++--------------
 fs/xfs/xfs_aops.c        |    6 ++++-
 fs/xfs/xfs_iomap.c       |    7 +++++-
 fs/xfs/xfs_iomap.h       |    2 +-
 4 files changed, 46 insertions(+), 20 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 18b94e6..8b419b3 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -141,7 +141,8 @@ xfs_bmbt_lookup_ge(
  */
 static inline bool xfs_bmap_needs_btree(struct xfs_inode *ip, int whichfork)
 {
-	return XFS_IFORK_FORMAT(ip, whichfork) == XFS_DINODE_FMT_EXTENTS &&
+	return whichfork != XFS_COW_FORK &&
+		XFS_IFORK_FORMAT(ip, whichfork) == XFS_DINODE_FMT_EXTENTS &&
 		XFS_IFORK_NEXTENTS(ip, whichfork) >
 			XFS_IFORK_MAXEXT(ip, whichfork);
 }
@@ -151,7 +152,8 @@ static inline bool xfs_bmap_needs_btree(struct xfs_inode *ip, int whichfork)
  */
 static inline bool xfs_bmap_wants_extents(struct xfs_inode *ip, int whichfork)
 {
-	return XFS_IFORK_FORMAT(ip, whichfork) == XFS_DINODE_FMT_BTREE &&
+	return whichfork != XFS_COW_FORK &&
+		XFS_IFORK_FORMAT(ip, whichfork) == XFS_DINODE_FMT_BTREE &&
 		XFS_IFORK_NEXTENTS(ip, whichfork) <=
 			XFS_IFORK_MAXEXT(ip, whichfork);
 }
@@ -641,6 +643,7 @@ xfs_bmap_btree_to_extents(
 
 	mp = ip->i_mount;
 	ifp = XFS_IFORK_PTR(ip, whichfork);
+	ASSERT(whichfork != XFS_COW_FORK);
 	ASSERT(ifp->if_flags & XFS_IFEXTENTS);
 	ASSERT(XFS_IFORK_FORMAT(ip, whichfork) == XFS_DINODE_FMT_BTREE);
 	rblock = ifp->if_broot;
@@ -707,6 +710,7 @@ xfs_bmap_extents_to_btree(
 	xfs_bmbt_ptr_t		*pp;		/* root block address pointer */
 
 	mp = ip->i_mount;
+	ASSERT(whichfork != XFS_COW_FORK);
 	ifp = XFS_IFORK_PTR(ip, whichfork);
 	ASSERT(XFS_IFORK_FORMAT(ip, whichfork) == XFS_DINODE_FMT_EXTENTS);
 
@@ -838,6 +842,7 @@ xfs_bmap_local_to_extents_empty(
 {
 	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
 
+	ASSERT(whichfork != XFS_COW_FORK);
 	ASSERT(XFS_IFORK_FORMAT(ip, whichfork) == XFS_DINODE_FMT_LOCAL);
 	ASSERT(ifp->if_bytes == 0);
 	ASSERT(XFS_IFORK_NEXTENTS(ip, whichfork) == 0);
@@ -1671,7 +1676,8 @@ xfs_bmap_one_block(
  */
 STATIC int				/* error */
 xfs_bmap_add_extent_delay_real(
-	struct xfs_bmalloca	*bma)
+	struct xfs_bmalloca	*bma,
+	int			whichfork)
 {
 	struct xfs_bmbt_irec	*new = &bma->got;
 	int			diff;	/* temp value */
@@ -1689,11 +1695,14 @@ xfs_bmap_add_extent_delay_real(
 	xfs_filblks_t		temp=0;	/* value for da_new calculations */
 	xfs_filblks_t		temp2=0;/* value for da_new calculations */
 	int			tmp_rval;	/* partial logging flags */
-	int			whichfork = XFS_DATA_FORK;
 	struct xfs_mount	*mp;
+	xfs_extnum_t		*nextents;
 
 	mp = bma->ip->i_mount;
 	ifp = XFS_IFORK_PTR(bma->ip, whichfork);
+	ASSERT(whichfork != XFS_ATTR_FORK);
+	nextents = (whichfork == XFS_COW_FORK ? &bma->ip->i_cnextents :
+						&bma->ip->i_d.di_nextents);
 
 	ASSERT(bma->idx >= 0);
 	ASSERT(bma->idx <= ifp->if_bytes / sizeof(struct xfs_bmbt_rec));
@@ -1707,6 +1716,9 @@ xfs_bmap_add_extent_delay_real(
 #define	RIGHT		r[1]
 #define	PREV		r[2]
 
+	if (whichfork == XFS_COW_FORK)
+		state |= BMAP_COWFORK;
+
 	/*
 	 * Set up a bunch of variables to make the tests simpler.
 	 */
@@ -1793,7 +1805,7 @@ xfs_bmap_add_extent_delay_real(
 		trace_xfs_bmap_post_update(bma->ip, bma->idx, state, _THIS_IP_);
 
 		xfs_iext_remove(bma->ip, bma->idx + 1, 2, state);
-		bma->ip->i_d.di_nextents--;
+		(*nextents)--;
 		if (bma->cur == NULL)
 			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
 		else {
@@ -1895,7 +1907,7 @@ xfs_bmap_add_extent_delay_real(
 		xfs_bmbt_set_startblock(ep, new->br_startblock);
 		trace_xfs_bmap_post_update(bma->ip, bma->idx, state, _THIS_IP_);
 
-		bma->ip->i_d.di_nextents++;
+		(*nextents)++;
 		if (bma->cur == NULL)
 			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
 		else {
@@ -1965,7 +1977,7 @@ xfs_bmap_add_extent_delay_real(
 		temp = PREV.br_blockcount - new->br_blockcount;
 		xfs_bmbt_set_blockcount(ep, temp);
 		xfs_iext_insert(bma->ip, bma->idx, 1, new, state);
-		bma->ip->i_d.di_nextents++;
+		(*nextents)++;
 		if (bma->cur == NULL)
 			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
 		else {
@@ -2049,7 +2061,7 @@ xfs_bmap_add_extent_delay_real(
 		trace_xfs_bmap_pre_update(bma->ip, bma->idx, state, _THIS_IP_);
 		xfs_bmbt_set_blockcount(ep, temp);
 		xfs_iext_insert(bma->ip, bma->idx + 1, 1, new, state);
-		bma->ip->i_d.di_nextents++;
+		(*nextents)++;
 		if (bma->cur == NULL)
 			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
 		else {
@@ -2118,7 +2130,7 @@ xfs_bmap_add_extent_delay_real(
 		RIGHT.br_blockcount = temp2;
 		/* insert LEFT (r[0]) and RIGHT (r[1]) at the same time */
 		xfs_iext_insert(bma->ip, bma->idx + 1, 2, &LEFT, state);
-		bma->ip->i_d.di_nextents++;
+		(*nextents)++;
 		if (bma->cur == NULL)
 			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
 		else {
@@ -2216,7 +2228,8 @@ xfs_bmap_add_extent_delay_real(
 
 	xfs_bmap_check_leaf_extents(bma->cur, bma->ip, whichfork);
 done:
-	bma->logflags |= rval;
+	if (whichfork != XFS_COW_FORK)
+		bma->logflags |= rval;
 	return error;
 #undef	LEFT
 #undef	RIGHT
@@ -3856,7 +3869,8 @@ xfs_bmap_btalloc(
 		ASSERT(nullfb || fb_agno == args.agno ||
 		       (ap->dfops->dop_low && fb_agno < args.agno));
 		ap->length = args.len;
-		ap->ip->i_d.di_nblocks += args.len;
+		if (!(ap->flags & XFS_BMAPI_COWFORK))
+			ap->ip->i_d.di_nblocks += args.len;
 		xfs_trans_log_inode(ap->tp, ap->ip, XFS_ILOG_CORE);
 		if (ap->wasdel)
 			ap->ip->i_delayed_blks -= args.len;
@@ -4330,8 +4344,7 @@ xfs_bmapi_allocate(
 	struct xfs_bmalloca	*bma)
 {
 	struct xfs_mount	*mp = bma->ip->i_mount;
-	int			whichfork = (bma->flags & XFS_BMAPI_ATTRFORK) ?
-						XFS_ATTR_FORK : XFS_DATA_FORK;
+	int			whichfork = xfs_bmapi_whichfork(bma->flags);
 	struct xfs_ifork	*ifp = XFS_IFORK_PTR(bma->ip, whichfork);
 	int			tmp_logflags = 0;
 	int			error;
@@ -4420,7 +4433,7 @@ xfs_bmapi_allocate(
 		bma->got.br_state = XFS_EXT_UNWRITTEN;
 
 	if (bma->wasdel)
-		error = xfs_bmap_add_extent_delay_real(bma);
+		error = xfs_bmap_add_extent_delay_real(bma, whichfork);
 	else
 		error = xfs_bmap_add_extent_hole_real(bma, whichfork);
 
@@ -4574,8 +4587,7 @@ xfs_bmapi_write(
 	orig_mval = mval;
 	orig_nmap = *nmap;
 #endif
-	whichfork = (flags & XFS_BMAPI_ATTRFORK) ?
-		XFS_ATTR_FORK : XFS_DATA_FORK;
+	whichfork = xfs_bmapi_whichfork(flags);
 
 	ASSERT(*nmap >= 1);
 	ASSERT(*nmap <= XFS_BMAP_MAX_NMAP);
@@ -4586,6 +4598,11 @@ xfs_bmapi_write(
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
 	if (whichfork == XFS_ATTR_FORK)
 		ASSERT(!(flags & XFS_BMAPI_REMAP));
+	if (whichfork == XFS_COW_FORK) {
+		ASSERT(!(flags & XFS_BMAPI_REMAP));
+		ASSERT(!(flags & XFS_BMAPI_PREALLOC));
+		ASSERT(!(flags & XFS_BMAPI_CONVERT));
+	}
 	if (flags & XFS_BMAPI_REMAP) {
 		ASSERT(!(flags & XFS_BMAPI_PREALLOC));
 		ASSERT(!(flags & XFS_BMAPI_CONVERT));
@@ -4655,6 +4672,8 @@ xfs_bmapi_write(
 		 */
 		if (flags & XFS_BMAPI_REMAP)
 			ASSERT(inhole);
+		if (flags & XFS_BMAPI_COWFORK)
+			ASSERT(!inhole);
 
 		/*
 		 * First, deal with the hole before the allocated space
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 66a2a9b..50c4bf11 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -337,9 +337,11 @@ xfs_map_blocks(
 
 	if (type == XFS_IO_DELALLOC &&
 	    (!nimaps || isnullstartblock(imap->br_startblock))) {
-		error = xfs_iomap_write_allocate(ip, offset, imap);
+		error = xfs_iomap_write_allocate(ip, XFS_DATA_FORK, offset,
+				imap);
 		if (!error)
-			trace_xfs_map_blocks_alloc(ip, offset, count, type, imap);
+			trace_xfs_map_blocks_alloc(ip, offset, count, type,
+					imap);
 		return error;
 	}
 
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 82c4697..e7e1346 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -708,6 +708,7 @@ xfs_iomap_cow_delay(
 int
 xfs_iomap_write_allocate(
 	xfs_inode_t	*ip,
+	int		whichfork,
 	xfs_off_t	offset,
 	xfs_bmbt_irec_t *imap)
 {
@@ -720,8 +721,12 @@ xfs_iomap_write_allocate(
 	xfs_trans_t	*tp;
 	int		nimaps;
 	int		error = 0;
+	int		flags = 0;
 	int		nres;
 
+	if (whichfork == XFS_COW_FORK)
+		flags |= XFS_BMAPI_COWFORK;
+
 	/*
 	 * Make sure that the dquots are there.
 	 */
@@ -811,7 +816,7 @@ xfs_iomap_write_allocate(
 			 * pointer that the caller gave to us.
 			 */
 			error = xfs_bmapi_write(tp, ip, map_start_fsb,
-						count_fsb, 0, &first_block,
+						count_fsb, flags, &first_block,
 						nres, imap, &nimaps,
 						&dfops);
 			if (error)
diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
index f6a9adf..ba037d6 100644
--- a/fs/xfs/xfs_iomap.h
+++ b/fs/xfs/xfs_iomap.h
@@ -25,7 +25,7 @@ int xfs_iomap_write_direct(struct xfs_inode *, xfs_off_t, size_t,
 			struct xfs_bmbt_irec *, int);
 int xfs_iomap_write_delay(struct xfs_inode *, xfs_off_t, size_t,
 			struct xfs_bmbt_irec *);
-int xfs_iomap_write_allocate(struct xfs_inode *, xfs_off_t,
+int xfs_iomap_write_allocate(struct xfs_inode *, int, xfs_off_t,
 			struct xfs_bmbt_irec *);
 int xfs_iomap_write_unwritten(struct xfs_inode *, xfs_off_t, xfs_off_t);
 int xfs_iomap_cow_delay(struct xfs_inode *, xfs_off_t, size_t,


^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 081/119] xfs: allocate delayed extents in CoW fork
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (79 preceding siblings ...)
  2016-06-17  1:26 ` [PATCH 080/119] xfs: support allocating delayed " Darrick J. Wong
@ 2016-06-17  1:26 ` Darrick J. Wong
  2016-06-17  1:26 ` [PATCH 082/119] xfs: support removing extents from " Darrick J. Wong
                   ` (37 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:26 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Modify the writepage handler to find and convert pending delalloc
extents to real allocations.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_aops.c    |   57 ++++++++++++++++++++++++---
 fs/xfs/xfs_aops.h    |    4 +-
 fs/xfs/xfs_reflink.c |  106 ++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_reflink.h |    5 ++
 4 files changed, 164 insertions(+), 8 deletions(-)


diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 50c4bf11..802d432 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -312,10 +312,15 @@ xfs_map_blocks(
 	int			error = 0;
 	int			bmapi_flags = XFS_BMAPI_ENTIRE;
 	int			nimaps = 1;
+	int			whichfork;
+	bool			need_alloc;
 
 	if (XFS_FORCED_SHUTDOWN(mp))
 		return -EIO;
 
+	whichfork = (type == XFS_IO_COW ? XFS_COW_FORK : XFS_DATA_FORK);
+	need_alloc = (type == XFS_IO_DELALLOC);
+
 	if (type == XFS_IO_UNWRITTEN)
 		bmapi_flags |= XFS_BMAPI_IGSTATE;
 
@@ -328,16 +333,29 @@ xfs_map_blocks(
 		count = mp->m_super->s_maxbytes - offset;
 	end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + count);
 	offset_fsb = XFS_B_TO_FSBT(mp, offset);
-	error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb,
-				imap, &nimaps, bmapi_flags);
+
+	if (type == XFS_IO_COW)
+		error = xfs_reflink_find_cow_mapping(ip, offset, imap,
+						     &need_alloc);
+	else {
+		error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb,
+				       imap, &nimaps, bmapi_flags);
+		/*
+		 * Truncate an overwrite extent if there's a pending CoW
+		 * reservation before the end of this extent.  This forces us
+		 * to come back to writepage to take care of the CoW.
+		 */
+		if (nimaps && type == XFS_IO_OVERWRITE)
+			xfs_reflink_trim_irec_to_next_cow(ip, offset_fsb, imap);
+	}
 	xfs_iunlock(ip, XFS_ILOCK_SHARED);
 
 	if (error)
 		return error;
 
-	if (type == XFS_IO_DELALLOC &&
+	if (need_alloc &&
 	    (!nimaps || isnullstartblock(imap->br_startblock))) {
-		error = xfs_iomap_write_allocate(ip, XFS_DATA_FORK, offset,
+		error = xfs_iomap_write_allocate(ip, whichfork, offset,
 				imap);
 		if (!error)
 			trace_xfs_map_blocks_alloc(ip, offset, count, type,
@@ -625,7 +643,8 @@ xfs_check_page_type(
 			if (type == XFS_IO_DELALLOC)
 				return true;
 		} else if (buffer_dirty(bh) && buffer_mapped(bh)) {
-			if (type == XFS_IO_OVERWRITE)
+			if (type == XFS_IO_OVERWRITE ||
+			    type == XFS_IO_COW)
 				return true;
 		}
 
@@ -637,6 +656,26 @@ xfs_check_page_type(
 	return false;
 }
 
+/*
+ * Figure out if CoW is pending at this offset.
+ */
+static bool
+xfs_is_cow_io(
+	struct xfs_inode	*ip,
+	xfs_off_t		offset)
+{
+	bool			is_cow;
+
+	if (!xfs_sb_version_hasreflink(&ip->i_mount->m_sb))
+		return false;
+
+	xfs_ilock(ip, XFS_ILOCK_SHARED);
+	is_cow = xfs_reflink_is_cow_pending(ip, offset);
+	xfs_iunlock(ip, XFS_ILOCK_SHARED);
+
+	return is_cow;
+}
+
 STATIC void
 xfs_vm_invalidatepage(
 	struct page		*page,
@@ -745,6 +784,7 @@ xfs_writepage_map(
 	int			error = 0;
 	int			count = 0;
 	int			uptodate = 1;
+	unsigned int		new_type;
 
 	bh = head = page_buffers(page);
 	offset = page_offset(page);
@@ -776,8 +816,11 @@ xfs_writepage_map(
 				wpc->imap_valid = false;
 			}
 		} else if (buffer_uptodate(bh)) {
-			if (wpc->io_type != XFS_IO_OVERWRITE) {
-				wpc->io_type = XFS_IO_OVERWRITE;
+			new_type = xfs_is_cow_io(XFS_I(inode), offset) ?
+					XFS_IO_COW : XFS_IO_OVERWRITE;
+
+			if (wpc->io_type != new_type) {
+				wpc->io_type = new_type;
 				wpc->imap_valid = false;
 			}
 		} else {
diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
index 814aab7..ee64d57 100644
--- a/fs/xfs/xfs_aops.h
+++ b/fs/xfs/xfs_aops.h
@@ -28,13 +28,15 @@ enum {
 	XFS_IO_DELALLOC,	/* covers delalloc region */
 	XFS_IO_UNWRITTEN,	/* covers allocated but uninitialized data */
 	XFS_IO_OVERWRITE,	/* covers already allocated extent */
+	XFS_IO_COW,		/* covers copy-on-write extent */
 };
 
 #define XFS_IO_TYPES \
 	{ XFS_IO_INVALID,		"invalid" }, \
 	{ XFS_IO_DELALLOC,		"delalloc" }, \
 	{ XFS_IO_UNWRITTEN,		"unwritten" }, \
-	{ XFS_IO_OVERWRITE,		"overwrite" }
+	{ XFS_IO_OVERWRITE,		"overwrite" }, \
+	{ XFS_IO_COW,			"CoW" }
 
 /*
  * Structure for buffered I/O completions.
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 112d86b..f0a9e42 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -270,3 +270,109 @@ xfs_reflink_reserve_cow_range(
 		trace_xfs_reflink_reserve_cow_range_error(ip, error, _RET_IP_);
 	return error;
 }
+
+/*
+ * Determine if there's a CoW reservation at a byte offset of an inode.
+ */
+bool
+xfs_reflink_is_cow_pending(
+	struct xfs_inode		*ip,
+	xfs_off_t			offset)
+{
+	struct xfs_ifork		*ifp;
+	struct xfs_bmbt_rec_host	*gotp;
+	struct xfs_bmbt_irec		irec;
+	xfs_fileoff_t			bno;
+	xfs_extnum_t			idx;
+
+	if (!xfs_is_reflink_inode(ip))
+		return false;
+
+	ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
+	bno = XFS_B_TO_FSBT(ip->i_mount, offset);
+	gotp = xfs_iext_bno_to_ext(ifp, bno, &idx);
+
+	if (!gotp)
+		return false;
+
+	xfs_bmbt_get_all(gotp, &irec);
+	if (bno >= irec.br_startoff + irec.br_blockcount ||
+	    bno < irec.br_startoff)
+		return false;
+	return true;
+}
+
+/*
+ * Find the CoW reservation (and whether or not it needs block allocation)
+ * for a given byte offset of a file.
+ */
+int
+xfs_reflink_find_cow_mapping(
+	struct xfs_inode		*ip,
+	xfs_off_t			offset,
+	struct xfs_bmbt_irec		*imap,
+	bool				*need_alloc)
+{
+	struct xfs_bmbt_irec		irec;
+	struct xfs_ifork		*ifp;
+	struct xfs_bmbt_rec_host	*gotp;
+	xfs_fileoff_t			bno;
+	xfs_extnum_t			idx;
+
+	/* Find the extent in the CoW fork. */
+	ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
+	bno = XFS_B_TO_FSBT(ip->i_mount, offset);
+	gotp = xfs_iext_bno_to_ext(ifp, bno, &idx);
+	xfs_bmbt_get_all(gotp, &irec);
+
+	trace_xfs_reflink_find_cow_mapping(ip, offset, 1, XFS_IO_OVERWRITE,
+			&irec);
+
+	/* If it's still delalloc, we must allocate later. */
+	*imap = irec;
+	*need_alloc = !!(isnullstartblock(irec.br_startblock));
+
+	return 0;
+}
+
+/*
+ * Trim an extent to end at the next CoW reservation past offset_fsb.
+ */
+int
+xfs_reflink_trim_irec_to_next_cow(
+	struct xfs_inode		*ip,
+	xfs_fileoff_t			offset_fsb,
+	struct xfs_bmbt_irec		*imap)
+{
+	struct xfs_bmbt_irec		irec;
+	struct xfs_ifork		*ifp;
+	struct xfs_bmbt_rec_host	*gotp;
+	xfs_extnum_t			idx;
+
+	if (!xfs_is_reflink_inode(ip))
+		return 0;
+
+	/* Find the extent in the CoW fork. */
+	ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
+	gotp = xfs_iext_bno_to_ext(ifp, offset_fsb, &idx);
+	if (!gotp)
+		return 0;
+	xfs_bmbt_get_all(gotp, &irec);
+
+	/* This is the extent before; try sliding up one. */
+	if (irec.br_startoff < offset_fsb) {
+		idx++;
+		if (idx >= ifp->if_bytes / sizeof(xfs_bmbt_rec_t))
+			return 0;
+		gotp = xfs_iext_get_ext(ifp, idx);
+		xfs_bmbt_get_all(gotp, &irec);
+	}
+
+	if (irec.br_startoff >= imap->br_startoff + imap->br_blockcount)
+		return 0;
+
+	imap->br_blockcount = irec.br_startoff - imap->br_startoff;
+	trace_xfs_reflink_trim_irec(ip, imap);
+
+	return 0;
+}
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index 7b0a215..a2a23f5 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -22,5 +22,10 @@
 
 extern int xfs_reflink_reserve_cow_range(struct xfs_inode *ip, xfs_off_t pos,
 		xfs_off_t len);
+extern bool xfs_reflink_is_cow_pending(struct xfs_inode *ip, xfs_off_t offset);
+extern int xfs_reflink_find_cow_mapping(struct xfs_inode *ip, xfs_off_t offset,
+		struct xfs_bmbt_irec *imap, bool *need_alloc);
+extern int xfs_reflink_trim_irec_to_next_cow(struct xfs_inode *ip,
+		xfs_fileoff_t offset_fsb, struct xfs_bmbt_irec *imap);
 
 #endif /* __XFS_REFLINK_H */



* [PATCH 082/119] xfs: support removing extents from CoW fork
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (80 preceding siblings ...)
  2016-06-17  1:26 ` [PATCH 081/119] xfs: allocate " Darrick J. Wong
@ 2016-06-17  1:26 ` Darrick J. Wong
  2016-06-17  1:26 ` [PATCH 083/119] xfs: move mappings from cow fork to data fork after copy-write Darrick J. Wong
                   ` (36 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:26 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Create a helper method to remove extents from the CoW fork without
any of the side effects (rmapbt/bmbt updates) of the regular extent
deletion routine.  We'll eventually use this to clear out the CoW fork
during ioend processing.

v2: Use bmapi_read to iterate and trim the CoW extents instead of
reading them raw via the iext code.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c |  176 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_bmap.h |    1 
 2 files changed, 177 insertions(+)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 8b419b3..2ba513b 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -4981,6 +4981,7 @@ xfs_bmap_del_extent(
 		/*
 		 * Matches the whole extent.  Delete the entry.
 		 */
+		trace_xfs_bmap_pre_update(ip, *idx, state, _THIS_IP_);
 		xfs_iext_remove(ip, *idx, 1,
 				whichfork == XFS_ATTR_FORK ? BMAP_ATTRFORK : 0);
 		--*idx;
@@ -5198,6 +5199,181 @@ done:
 }
 
 /*
+ * xfs_bunmapi_cow() -- Remove the relevant parts of the CoW fork.
+ *			See xfs_bmap_del_extent.
+ * @ip: XFS inode.
+ * @del: Extent to remove.
+ */
+int
+xfs_bunmapi_cow(
+	xfs_inode_t		*ip,
+	xfs_bmbt_irec_t		*del)
+{
+	xfs_filblks_t		da_new;	/* new delay-alloc indirect blocks */
+	xfs_filblks_t		da_old;	/* old delay-alloc indirect blocks */
+	xfs_fsblock_t		del_endblock = 0;/* first block past del */
+	xfs_fileoff_t		del_endoff;	/* first offset past del */
+	int			delay;	/* current block is delayed allocated */
+	xfs_bmbt_rec_host_t	*ep;	/* current extent entry pointer */
+	int			error;	/* error return value */
+	xfs_bmbt_irec_t		got;	/* current extent entry */
+	xfs_fileoff_t		got_endoff;	/* first offset past got */
+	xfs_ifork_t		*ifp;	/* inode fork pointer */
+	xfs_mount_t		*mp;	/* mount structure */
+	xfs_filblks_t		nblks;	/* quota/sb block count */
+	xfs_bmbt_irec_t		new;	/* new record to be inserted */
+	/* REFERENCED */
+	uint			qfield;	/* quota field to update */
+	xfs_filblks_t		temp;	/* for indirect length calculations */
+	xfs_filblks_t		temp2;	/* for indirect length calculations */
+	int			state = BMAP_COWFORK;
+	int			eof;
+	xfs_extnum_t		eidx;
+
+	mp = ip->i_mount;
+	XFS_STATS_INC(mp, xs_del_exlist);
+
+	ep = xfs_bmap_search_extents(ip, del->br_startoff, XFS_COW_FORK, &eof,
+			&eidx, &got, &new);
+
+	ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
+	ASSERT((eidx >= 0) && (eidx < ifp->if_bytes /
+		(uint)sizeof(xfs_bmbt_rec_t)));
+	ASSERT(del->br_blockcount > 0);
+	ASSERT(got.br_startoff <= del->br_startoff);
+	del_endoff = del->br_startoff + del->br_blockcount;
+	got_endoff = got.br_startoff + got.br_blockcount;
+	ASSERT(got_endoff >= del_endoff);
+	delay = isnullstartblock(got.br_startblock);
+	ASSERT(isnullstartblock(del->br_startblock) == delay);
+	qfield = 0;
+	error = 0;
+	/*
+	 * If deleting a real allocation, must free up the disk space.
+	 */
+	if (!delay) {
+		nblks = del->br_blockcount;
+		qfield = XFS_TRANS_DQ_BCOUNT;
+		/*
+		 * Set up del_endblock and cur for later.
+		 */
+		del_endblock = del->br_startblock + del->br_blockcount;
+		da_old = da_new = 0;
+	} else {
+		da_old = startblockval(got.br_startblock);
+		da_new = 0;
+		nblks = 0;
+	}
+	qfield = qfield;
+	nblks = nblks;
+
+	/*
+	 * Set flag value to use in switch statement.
+	 * Left-contig is 2, right-contig is 1.
+	 */
+	switch (((got.br_startoff == del->br_startoff) << 1) |
+		(got_endoff == del_endoff)) {
+	case 3:
+		/*
+		 * Matches the whole extent.  Delete the entry.
+		 */
+		xfs_iext_remove(ip, eidx, 1, BMAP_COWFORK);
+		--eidx;
+		break;
+
+	case 2:
+		/*
+		 * Deleting the first part of the extent.
+		 */
+		trace_xfs_bmap_pre_update(ip, eidx, state, _THIS_IP_);
+		xfs_bmbt_set_startoff(ep, del_endoff);
+		temp = got.br_blockcount - del->br_blockcount;
+		xfs_bmbt_set_blockcount(ep, temp);
+		if (delay) {
+			temp = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(ip, temp),
+				da_old);
+			xfs_bmbt_set_startblock(ep, nullstartblock((int)temp));
+			trace_xfs_bmap_post_update(ip, eidx, state, _THIS_IP_);
+			da_new = temp;
+			break;
+		}
+		xfs_bmbt_set_startblock(ep, del_endblock);
+		trace_xfs_bmap_post_update(ip, eidx, state, _THIS_IP_);
+		break;
+
+	case 1:
+		/*
+		 * Deleting the last part of the extent.
+		 */
+		temp = got.br_blockcount - del->br_blockcount;
+		trace_xfs_bmap_pre_update(ip, eidx, state, _THIS_IP_);
+		xfs_bmbt_set_blockcount(ep, temp);
+		if (delay) {
+			temp = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(ip, temp),
+				da_old);
+			xfs_bmbt_set_startblock(ep, nullstartblock((int)temp));
+			trace_xfs_bmap_post_update(ip, eidx, state, _THIS_IP_);
+			da_new = temp;
+			break;
+		}
+		trace_xfs_bmap_post_update(ip, eidx, state, _THIS_IP_);
+		break;
+
+	case 0:
+		/*
+		 * Deleting the middle of the extent.
+		 */
+		temp = del->br_startoff - got.br_startoff;
+		trace_xfs_bmap_pre_update(ip, eidx, state, _THIS_IP_);
+		xfs_bmbt_set_blockcount(ep, temp);
+		new.br_startoff = del_endoff;
+		temp2 = got_endoff - del_endoff;
+		new.br_blockcount = temp2;
+		new.br_state = got.br_state;
+		if (!delay) {
+			new.br_startblock = del_endblock;
+		} else {
+			temp = xfs_bmap_worst_indlen(ip, temp);
+			xfs_bmbt_set_startblock(ep, nullstartblock((int)temp));
+			temp2 = xfs_bmap_worst_indlen(ip, temp2);
+			new.br_startblock = nullstartblock((int)temp2);
+			da_new = temp + temp2;
+			while (da_new > da_old) {
+				if (temp) {
+					temp--;
+					da_new--;
+					xfs_bmbt_set_startblock(ep,
+						nullstartblock((int)temp));
+				}
+				if (da_new == da_old)
+					break;
+				if (temp2) {
+					temp2--;
+					da_new--;
+					new.br_startblock =
+						nullstartblock((int)temp2);
+				}
+			}
+		}
+		trace_xfs_bmap_post_update(ip, eidx, state, _THIS_IP_);
+		xfs_iext_insert(ip, eidx + 1, 1, &new, state);
+		++eidx;
+		break;
+	}
+
+	/*
+	 * Account for change in delayed indirect blocks.
+	 * Nothing to do for disk quota accounting here.
+	 */
+	ASSERT(da_old >= da_new);
+	if (da_old > da_new)
+		xfs_mod_fdblocks(mp, (int64_t)(da_old - da_new), false);
+
+	return error;
+}
+
+/*
  * Unmap (remove) blocks from a file.
  * If nexts is nonzero then the number of extents to remove is limited to
  * that value.  If not all extents in the block range can be removed then
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index d90f88e..1c7ab70 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -221,6 +221,7 @@ int	xfs_bunmapi(struct xfs_trans *tp, struct xfs_inode *ip,
 		xfs_fileoff_t bno, xfs_filblks_t len, int flags,
 		xfs_extnum_t nexts, xfs_fsblock_t *firstblock,
 		struct xfs_defer_ops *dfops, int *done);
+int	xfs_bunmapi_cow(struct xfs_inode *ip, struct xfs_bmbt_irec *del);
 int	xfs_check_nostate_extents(struct xfs_ifork *ifp, xfs_extnum_t idx,
 		xfs_extnum_t num);
 uint	xfs_default_attroffset(struct xfs_inode *ip);



* [PATCH 083/119] xfs: move mappings from cow fork to data fork after copy-write
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (81 preceding siblings ...)
  2016-06-17  1:26 ` [PATCH 082/119] xfs: support removing extents from " Darrick J. Wong
@ 2016-06-17  1:26 ` Darrick J. Wong
  2016-06-17  1:26 ` [PATCH 084/119] xfs: implement CoW for directio writes Darrick J. Wong
                   ` (35 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:26 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, Christoph Hellwig, xfs

After the write component of a copy-write operation finishes, clean up
the bookkeeping left behind.  On error, we simply free the new blocks
and pass the error up.  If we succeed, however, then we must remove
the old data fork mapping and move the cow fork mapping to the data
fork.

v2: If CoW fails, we need to remove the CoW fork mapping and free the
blocks.  Furthermore, if xfs_cancel_ioend happens, we also need to
clean out all the CoW record keeping.

v3: When we're removing CoW extents, only free one extent per
transaction to avoid running out of reservation.  Also,
xfs_cancel_ioend mustn't clean out the CoW fork because it is called
when async writeback can't get an inode lock and will try again.

v4: Use bmapi_read to iterate the CoW fork instead of calling the
iext functions directly, and make the CoW remapping atomic by
using the deferred ops mechanism which takes care of logging redo
items for us.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
[hch: Call the CoW failure function during xfs_cancel_ioend]
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_aops.c    |   22 ++++
 fs/xfs/xfs_reflink.c |  250 ++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_reflink.h |    8 ++
 3 files changed, 278 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 802d432..232039c 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -262,6 +262,23 @@ xfs_end_io(
 		error = -EIO;
 
 	/*
+	 * For a CoW extent, we need to move the mapping from the CoW fork
+	 * to the data fork.  If instead an error happened, just dump the
+	 * new blocks.
+	 */
+	if (ioend->io_type == XFS_IO_COW) {
+		if (ioend->io_bio->bi_error) {
+			error = xfs_reflink_cancel_cow_range(ip,
+					ioend->io_offset, ioend->io_size);
+			goto done;
+		}
+		error = xfs_reflink_end_cow(ip, ioend->io_offset,
+				ioend->io_size);
+		if (error)
+			goto done;
+	}
+
+	/*
 	 * For unwritten extents we need to issue transactions to convert a
 	 * range to normal written extens after the data I/O has finished.
 	 * Detecting and handling completion IO errors is done individually
@@ -276,7 +293,8 @@ xfs_end_io(
 	} else if (ioend->io_append_trans) {
 		error = xfs_setfilesize_ioend(ioend, error);
 	} else {
-		ASSERT(!xfs_ioend_is_append(ioend));
+		ASSERT(!xfs_ioend_is_append(ioend) ||
+		       ioend->io_type == XFS_IO_COW);
 	}
 
 done:
@@ -290,7 +308,7 @@ xfs_end_bio(
 	struct xfs_ioend	*ioend = bio->bi_private;
 	struct xfs_mount	*mp = XFS_I(ioend->io_inode)->i_mount;
 
-	if (ioend->io_type == XFS_IO_UNWRITTEN)
+	if (ioend->io_type == XFS_IO_UNWRITTEN || ioend->io_type == XFS_IO_COW)
 		queue_work(mp->m_unwritten_workqueue, &ioend->io_work);
 	else if (ioend->io_append_trans)
 		queue_work(mp->m_data_workqueue, &ioend->io_work);
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index f0a9e42..59c8e86 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -52,6 +52,7 @@
 #include "xfs_bmap_btree.h"
 #include "xfs_reflink.h"
 #include "xfs_iomap.h"
+#include "xfs_rmap_btree.h"
 
 /*
  * Copy on Write of Shared Blocks
@@ -376,3 +377,252 @@ xfs_reflink_trim_irec_to_next_cow(
 
 	return 0;
 }
+
+/*
+ * Cancel all pending CoW reservations for some block range of an inode.
+ */
+int
+xfs_reflink_cancel_cow_blocks(
+	struct xfs_inode		*ip,
+	struct xfs_trans		**tpp,
+	xfs_fileoff_t			offset_fsb,
+	xfs_fileoff_t			end_fsb)
+{
+	struct xfs_bmbt_irec		irec;
+	struct xfs_ifork		*ifp;
+	xfs_filblks_t			count_fsb;
+	xfs_fsblock_t			firstfsb;
+	struct xfs_defer_ops		dfops;
+	int				error = 0;
+	int				nimaps;
+
+	if (!xfs_is_reflink_inode(ip))
+		return 0;
+
+	/* Go find the old extent in the CoW fork. */
+	count_fsb = (xfs_filblks_t)(end_fsb - offset_fsb);
+	ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
+	while (count_fsb) {
+		nimaps = 1;
+		error = xfs_bmapi_read(ip, offset_fsb, count_fsb, &irec,
+				&nimaps, XFS_BMAPI_COWFORK);
+		if (error)
+			break;
+		ASSERT(nimaps == 1);
+
+		xfs_trim_extent(&irec, offset_fsb, count_fsb);
+		trace_xfs_reflink_cancel_cow(ip, &irec);
+
+		if (irec.br_startblock == DELAYSTARTBLOCK) {
+			/* Free a delayed allocation. */
+			xfs_mod_fdblocks(ip->i_mount, irec.br_blockcount,
+					false);
+			ip->i_delayed_blks -= irec.br_blockcount;
+
+			/* Remove the mapping from the CoW fork. */
+			error = xfs_bunmapi_cow(ip, &irec);
+			if (error)
+				break;
+		} else if (irec.br_startblock == HOLESTARTBLOCK) {
+			/* empty */
+		} else {
+			xfs_trans_ijoin(*tpp, ip, 0);
+			xfs_defer_init(&dfops, &firstfsb);
+
+			xfs_bmap_add_free(ip->i_mount, &dfops,
+					irec.br_startblock, irec.br_blockcount,
+					NULL);
+
+			/* Update quota accounting */
+			xfs_trans_mod_dquot_byino(*tpp, ip, XFS_TRANS_DQ_BCOUNT,
+					-(long)irec.br_blockcount);
+
+			/* Roll the transaction */
+			error = xfs_defer_finish(tpp, &dfops, ip);
+			if (error) {
+				xfs_defer_cancel(&dfops);
+				break;
+			}
+
+			/* Remove the mapping from the CoW fork. */
+			error = xfs_bunmapi_cow(ip, &irec);
+			if (error)
+				break;
+		}
+
+		/* Roll on... */
+		count_fsb -= irec.br_startoff + irec.br_blockcount - offset_fsb;
+		offset_fsb = irec.br_startoff + irec.br_blockcount;
+	}
+
+	return error;
+}
+
+/*
+ * Cancel all pending CoW reservations for some byte range of an inode.
+ */
+int
+xfs_reflink_cancel_cow_range(
+	struct xfs_inode	*ip,
+	xfs_off_t		offset,
+	xfs_off_t		count)
+{
+	struct xfs_trans	*tp;
+	xfs_fileoff_t		offset_fsb;
+	xfs_fileoff_t		end_fsb;
+	int			error;
+
+	trace_xfs_reflink_cancel_cow_range(ip, offset, count);
+
+	offset_fsb = XFS_B_TO_FSBT(ip->i_mount, offset);
+	if (count == NULLFILEOFF)
+		end_fsb = NULLFILEOFF;
+	else
+		end_fsb = XFS_B_TO_FSB(ip->i_mount, offset + count);
+
+	/* Start a rolling transaction to remove the mappings */
+	error = xfs_trans_alloc(ip->i_mount, &M_RES(ip->i_mount)->tr_write,
+			0, 0, 0, &tp);
+	if (error)
+		goto out;
+
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	xfs_trans_ijoin(tp, ip, 0);
+
+	/* Scrape out the old CoW reservations */
+	error = xfs_reflink_cancel_cow_blocks(ip, &tp, offset_fsb, end_fsb);
+	if (error)
+		goto out_defer;
+
+	error = xfs_trans_commit(tp);
+	if (error)
+		goto out;
+
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	return 0;
+
+out_defer:
+	xfs_trans_cancel(tp);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+out:
+	trace_xfs_reflink_cancel_cow_range_error(ip, error, _RET_IP_);
+	return error;
+}
+
+/*
+ * Remap parts of a file's data fork after a successful CoW.
+ */
+int
+xfs_reflink_end_cow(
+	struct xfs_inode		*ip,
+	xfs_off_t			offset,
+	xfs_off_t			count)
+{
+	struct xfs_bmbt_irec		irec;
+	struct xfs_bmbt_irec		uirec;
+	struct xfs_trans		*tp;
+	struct xfs_ifork		*ifp;
+	xfs_fileoff_t			offset_fsb;
+	xfs_fileoff_t			end_fsb;
+	xfs_filblks_t			count_fsb;
+	xfs_fsblock_t			firstfsb;
+	struct xfs_defer_ops		dfops;
+	int				done;
+	int				error;
+	unsigned int			resblks;
+	xfs_filblks_t			ilen;
+	xfs_filblks_t			rlen;
+	int				nimaps;
+
+	trace_xfs_reflink_end_cow(ip, offset, count);
+
+	offset_fsb = XFS_B_TO_FSBT(ip->i_mount, offset);
+	end_fsb = XFS_B_TO_FSB(ip->i_mount, offset + count);
+	count_fsb = (xfs_filblks_t)(end_fsb - offset_fsb);
+
+	/* Start a rolling transaction to switch the mappings */
+	resblks = XFS_EXTENTADD_SPACE_RES(ip->i_mount, XFS_DATA_FORK);
+	error = xfs_trans_alloc(ip->i_mount, &M_RES(ip->i_mount)->tr_write,
+			resblks, 0, 0, &tp);
+	if (error)
+		goto out;
+
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	xfs_trans_ijoin(tp, ip, 0);
+
+	/* Go find the old extent in the CoW fork. */
+	ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
+	while (count_fsb) {
+		/* Read extent from the source file */
+		nimaps = 1;
+		error = xfs_bmapi_read(ip, offset_fsb, count_fsb, &irec,
+				&nimaps, XFS_BMAPI_COWFORK);
+		if (error)
+			goto out_cancel;
+		ASSERT(nimaps == 1);
+
+		ASSERT(irec.br_startblock != DELAYSTARTBLOCK);
+		xfs_trim_extent(&irec, offset_fsb, count_fsb);
+		trace_xfs_reflink_cow_remap(ip, &irec);
+
+		/*
+		 * We can have a hole in the CoW fork if part of a directio
+		 * write is CoW but part of it isn't.
+		 */
+		rlen = ilen = irec.br_blockcount;
+		if (irec.br_startblock == HOLESTARTBLOCK)
+			goto next_extent;
+
+		/* Unmap the old blocks in the data fork. */
+		done = false;
+		while (rlen) {
+			xfs_defer_init(&dfops, &firstfsb);
+			error = __xfs_bunmapi(tp, ip, irec.br_startoff,
+					&rlen, 0, 1, &firstfsb, &dfops);
+			if (error)
+				goto out_defer;
+
+			/* Trim the extent to whatever got unmapped. */
+			uirec = irec;
+			xfs_trim_extent(&uirec, irec.br_startoff + rlen,
+					irec.br_blockcount - rlen);
+			irec.br_blockcount = rlen;
+			trace_xfs_reflink_cow_remap_piece(ip, &uirec);
+
+			/* Map the new blocks into the data fork. */
+			error = xfs_bmap_map_extent(tp->t_mountp, &dfops,
+					ip, XFS_DATA_FORK, &uirec);
+			if (error)
+				goto out_defer;
+
+			/* Remove the mapping from the CoW fork. */
+			error = xfs_bunmapi_cow(ip, &uirec);
+			if (error)
+				goto out_defer;
+
+			error = xfs_defer_finish(&tp, &dfops, ip);
+			if (error)
+				goto out_defer;
+		}
+
+next_extent:
+		/* Roll on... */
+		count_fsb -= irec.br_startoff + ilen - offset_fsb;
+		offset_fsb = irec.br_startoff + ilen;
+	}
+
+	error = xfs_trans_commit(tp);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	if (error)
+		goto out;
+	return 0;
+
+out_defer:
+	xfs_defer_cancel(&dfops);
+out_cancel:
+	xfs_trans_cancel(tp);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+out:
+	trace_xfs_reflink_end_cow_error(ip, error, _RET_IP_);
+	return error;
+}
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index a2a23f5..27ae6c0 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -28,4 +28,12 @@ extern int xfs_reflink_find_cow_mapping(struct xfs_inode *ip, xfs_off_t offset,
 extern int xfs_reflink_trim_irec_to_next_cow(struct xfs_inode *ip,
 		xfs_fileoff_t offset_fsb, struct xfs_bmbt_irec *imap);
 
+extern int xfs_reflink_cancel_cow_blocks(struct xfs_inode *ip,
+		struct xfs_trans **tpp, xfs_fileoff_t offset_fsb,
+		xfs_fileoff_t end_fsb);
+extern int xfs_reflink_cancel_cow_range(struct xfs_inode *ip, xfs_off_t offset,
+		xfs_off_t count);
+extern int xfs_reflink_end_cow(struct xfs_inode *ip, xfs_off_t offset,
+		xfs_off_t count);
+
 #endif /* __XFS_REFLINK_H */


^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 084/119] xfs: implement CoW for directio writes
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (82 preceding siblings ...)
  2016-06-17  1:26 ` [PATCH 083/119] xfs: move mappings from cow fork to data fork after copy-write Darrick J. Wong
@ 2016-06-17  1:26 ` Darrick J. Wong
  2016-06-17  1:26 ` [PATCH 085/119] xfs: copy-on-write reflinked blocks when zeroing ranges of blocks Darrick J. Wong
                   ` (34 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:26 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

For O_DIRECT writes to shared blocks, we have to CoW them just like
we would with buffered writes.  For writes that are not block-aligned,
just bounce them to the page cache.

For block-aligned writes, however, we can do better than that.  Use
the same mechanisms that we employ for buffered CoW to set up a
delalloc reservation, allocate all the blocks at once, issue the
writes against the new blocks and use the same ioend functions to
remap the blocks after the write.  This should be fairly performant.
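
The "block-aligned" test this patch adds in xfs_vm_direct_IO() can be sketched in isolation as a tiny userspace helper (the name and types here are illustrative, not the kernel's): a directio write may take the CoW path only if both its start and end offsets land on filesystem block boundaries; anything else bounces to the page cache.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the alignment check in xfs_vm_direct_IO():
 * mirrors !((iocb->ki_pos | end) & block_mask), where block_mask is
 * (1 << inode->i_blkbits) - 1.
 */
static bool dio_write_is_block_aligned(uint64_t pos, uint64_t len,
				       unsigned int blkbits)
{
	uint64_t end = pos + len;
	uint64_t block_mask = (1ULL << blkbits) - 1;

	/* both endpoints must sit on a block boundary */
	return ((pos | end) & block_mask) == 0;
}
```

ORing the start and end offsets before masking is a common kernel idiom: one branch checks both endpoints at once.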

v2: Turns out that there's no way for xfs_end_io_direct_write to know
if the write completed successfully.  Therefore, do /not/ use the
ioend for dio cow post-processing; instead, move it to xfs_vm_do_dio
where we *can* tell if the write succeeded or not.

v3: Update the file size if we do a directio CoW across EOF.  This
can happen if the last block is shared, the cowextsize hint is set,
and we do a dio write past the end of the file.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_aops.c    |  112 +++++++++++++++++++++++++++++++++++++++++++++++---
 fs/xfs/xfs_file.c    |   12 ++++-
 fs/xfs/xfs_reflink.c |  105 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_reflink.h |    5 ++
 4 files changed, 225 insertions(+), 9 deletions(-)


diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 232039c..31318b3 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -40,6 +40,7 @@
 /* flags for direct write completions */
 #define XFS_DIO_FLAG_UNWRITTEN	(1 << 0)
 #define XFS_DIO_FLAG_APPEND	(1 << 1)
+#define XFS_DIO_FLAG_COW	(1 << 2)
 
 /*
  * structure owned by writepages passed to individual writepage calls
@@ -1130,18 +1131,24 @@ xfs_map_direct(
 	struct inode		*inode,
 	struct buffer_head	*bh_result,
 	struct xfs_bmbt_irec	*imap,
-	xfs_off_t		offset)
+	xfs_off_t		offset,
+	bool			is_cow)
 {
 	uintptr_t		*flags = (uintptr_t *)&bh_result->b_private;
 	xfs_off_t		size = bh_result->b_size;
 
 	trace_xfs_get_blocks_map_direct(XFS_I(inode), offset, size,
-		ISUNWRITTEN(imap) ? XFS_IO_UNWRITTEN : XFS_IO_OVERWRITE, imap);
+		ISUNWRITTEN(imap) ? XFS_IO_UNWRITTEN : is_cow ? XFS_IO_COW :
+		XFS_IO_OVERWRITE, imap);
 
 	if (ISUNWRITTEN(imap)) {
 		*flags |= XFS_DIO_FLAG_UNWRITTEN;
 		set_buffer_defer_completion(bh_result);
-	} else if (offset + size > i_size_read(inode) || offset + size < 0) {
+	} else if (is_cow) {
+		*flags |= XFS_DIO_FLAG_COW;
+		set_buffer_defer_completion(bh_result);
+	}
+	if (offset + size > i_size_read(inode) || offset + size < 0) {
 		*flags |= XFS_DIO_FLAG_APPEND;
 		set_buffer_defer_completion(bh_result);
 	}
@@ -1187,6 +1194,43 @@ xfs_map_trim_size(
 	bh_result->b_size = mapping_size;
 }
 
+/* Bounce unaligned directio writes to the page cache. */
+static int
+xfs_bounce_unaligned_dio_write(
+	struct xfs_inode	*ip,
+	xfs_fileoff_t		offset_fsb,
+	struct xfs_bmbt_irec	*imap)
+{
+	bool			shared;
+	struct xfs_bmbt_irec	irec;
+	xfs_fileoff_t		delta;
+	int			error;
+
+	irec = *imap;
+	if (offset_fsb > irec.br_startoff) {
+		delta = offset_fsb - irec.br_startoff;
+		irec.br_blockcount -= delta;
+		irec.br_startblock += delta;
+		irec.br_startoff = offset_fsb;
+	}
+	error = xfs_reflink_irec_is_shared(ip, &irec, &shared);
+	if (error)
+		return error;
+	/*
+	 * Are we doing a DIO write to a shared block?  In
+	 * the ideal world we at least would fork full blocks,
+	 * but for now just fall back to buffered mode.  Yuck.
+	 * Use -EREMCHG ("remote address changed") to signal
+	 * this, since in general XFS doesn't do this sort of
+	 * fallback.
+	 */
+	if (shared) {
+		trace_xfs_reflink_bounce_dio_write(ip, imap);
+		return -EREMCHG;
+	}
+	return 0;
+}
+
 STATIC int
 __xfs_get_blocks(
 	struct inode		*inode,
@@ -1206,6 +1250,8 @@ __xfs_get_blocks(
 	xfs_off_t		offset;
 	ssize_t			size;
 	int			new = 0;
+	bool			is_cow = false;
+	bool			need_alloc = false;
 
 	if (XFS_FORCED_SHUTDOWN(mp))
 		return -EIO;
@@ -1237,8 +1283,27 @@ __xfs_get_blocks(
 	end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + size);
 	offset_fsb = XFS_B_TO_FSBT(mp, offset);
 
-	error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb,
-				&imap, &nimaps, XFS_BMAPI_ENTIRE);
+	if (create && direct)
+		is_cow = xfs_reflink_is_cow_pending(ip, offset);
+	if (is_cow)
+		error = xfs_reflink_find_cow_mapping(ip, offset, &imap,
+						     &need_alloc);
+	else {
+		error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb,
+					&imap, &nimaps, XFS_BMAPI_ENTIRE);
+		/*
+		 * Truncate an overwrite extent if there's a pending CoW
+		 * reservation before the end of this extent.  This forces us
+		 * to come back to writepage to take care of the CoW.
+		 */
+		if (create && direct && nimaps &&
+		    imap.br_startblock != HOLESTARTBLOCK &&
+		    imap.br_startblock != DELAYSTARTBLOCK &&
+		    !ISUNWRITTEN(&imap))
+			xfs_reflink_trim_irec_to_next_cow(ip, offset_fsb,
+					&imap);
+	}
+	ASSERT(!need_alloc);
 	if (error)
 		goto out_unlock;
 
@@ -1310,6 +1375,13 @@ __xfs_get_blocks(
 	if (imap.br_startblock != HOLESTARTBLOCK &&
 	    imap.br_startblock != DELAYSTARTBLOCK &&
 	    (create || !ISUNWRITTEN(&imap))) {
+		if (create && direct && !is_cow) {
+			error = xfs_bounce_unaligned_dio_write(ip, offset_fsb,
+					&imap);
+			if (error)
+				return error;
+		}
+
 		xfs_map_buffer(inode, bh_result, &imap, offset);
 		if (ISUNWRITTEN(&imap))
 			set_buffer_unwritten(bh_result);
@@ -1318,7 +1390,8 @@ __xfs_get_blocks(
 			if (dax_fault)
 				ASSERT(!ISUNWRITTEN(&imap));
 			else
-				xfs_map_direct(inode, bh_result, &imap, offset);
+				xfs_map_direct(inode, bh_result, &imap, offset,
+						is_cow);
 		}
 	}
 
@@ -1452,7 +1525,11 @@ xfs_end_io_direct_write(
 		trace_xfs_end_io_direct_write_unwritten(ip, offset, size);
 
 		error = xfs_iomap_write_unwritten(ip, offset, size);
-	} else if (flags & XFS_DIO_FLAG_APPEND) {
+	}
+	if (flags & XFS_DIO_FLAG_COW) {
+		error = xfs_reflink_end_cow(ip, offset, size);
+	}
+	if (flags & XFS_DIO_FLAG_APPEND) {
 		struct xfs_trans *tp;
 
 		trace_xfs_end_io_direct_write_append(ip, offset, size);
@@ -1475,6 +1552,27 @@ xfs_vm_direct_IO(
 	dio_iodone_t		*endio = NULL;
 	int			flags = 0;
 	struct block_device	*bdev;
+	loff_t			end;
+	loff_t			block_mask;
+	bool			dio_cow = false;
+	int			error;
+
+	/* If this is a block-aligned directio CoW, remap immediately. */
+	end = iocb->ki_pos + iov_iter_count(iter);
+	block_mask = (1 << inode->i_blkbits) - 1;
+	if (iov_iter_rw(iter) == WRITE &&
+	    xfs_is_reflink_inode(XFS_I(inode)) &&
+	    !((iocb->ki_pos | end) & block_mask)) {
+		dio_cow = true;
+		error = xfs_reflink_reserve_cow_range(XFS_I(inode),
+				iocb->ki_pos, iov_iter_count(iter));
+		if (error)
+			return error;
+		error = xfs_reflink_allocate_cow_range(XFS_I(inode),
+				iocb->ki_pos, iov_iter_count(iter));
+		if (error)
+			return error;
+	}
 
 	if (iov_iter_rw(iter) == WRITE) {
 		endio = xfs_end_io_direct_write;
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 148d0b3..b979f01 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -896,10 +896,18 @@ xfs_file_write_iter(
 	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
 		return -EIO;
 
-	if ((iocb->ki_flags & IOCB_DIRECT) || IS_DAX(inode))
+	/*
+	 * Allow DIO to fall back to buffered *only* in the case that we're
+	 * doing a reflink CoW.
+	 */
+	if ((iocb->ki_flags & IOCB_DIRECT) || IS_DAX(inode)) {
 		ret = xfs_file_dio_aio_write(iocb, from);
-	else
+		if (ret == -EREMCHG)
+			goto buffered;
+	} else {
+buffered:
 		ret = xfs_file_buffered_aio_write(iocb, from);
+	}
 
 	if (ret > 0) {
 		XFS_STATS_ADD(ip->i_mount, xs_write_bytes, ret);
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 59c8e86..113f333 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -146,6 +146,51 @@ xfs_trim_extent(
 	}
 }
 
+/*
+ * Determine if any of the blocks in this mapping are shared.
+ */
+int
+xfs_reflink_irec_is_shared(
+	struct xfs_inode	*ip,
+	struct xfs_bmbt_irec	*irec,
+	bool			*shared)
+{
+	xfs_agnumber_t		agno;
+	xfs_agblock_t		agbno;
+	xfs_extlen_t		aglen;
+	xfs_agblock_t		fbno;
+	xfs_extlen_t		flen;
+	int			error = 0;
+
+	/* Holes, unwritten, and delalloc extents cannot be shared */
+	if (!xfs_is_reflink_inode(ip) ||
+	    ISUNWRITTEN(irec) ||
+	    irec->br_startblock == HOLESTARTBLOCK ||
+	    irec->br_startblock == DELAYSTARTBLOCK) {
+		*shared = false;
+		return 0;
+	}
+
+	trace_xfs_reflink_irec_is_shared(ip, irec);
+
+	agno = XFS_FSB_TO_AGNO(ip->i_mount, irec->br_startblock);
+	agbno = XFS_FSB_TO_AGBNO(ip->i_mount, irec->br_startblock);
+	aglen = irec->br_blockcount;
+
+	/* Are there any shared blocks here? */
+	error = xfs_refcount_find_shared(ip->i_mount, agno, agbno,
+			aglen, &fbno, &flen, false);
+	if (error)
+		return error;
+	if (flen == 0) {
+		*shared = false;
+		return 0;
+	}
+
+	*shared = true;
+	return 0;
+}
+
 /* Find the shared ranges under an irec, and set up delalloc extents. */
 static int
 xfs_reflink_reserve_cow_extent(
@@ -273,6 +318,66 @@ xfs_reflink_reserve_cow_range(
 }
 
 /*
+ * Allocate blocks to all CoW reservations within a byte range of a file.
+ */
+int
+xfs_reflink_allocate_cow_range(
+	struct xfs_inode	*ip,
+	xfs_off_t		pos,
+	xfs_off_t		len)
+{
+	struct xfs_ifork	*ifp;
+	struct xfs_bmbt_rec_host	*gotp;
+	struct xfs_bmbt_irec	imap;
+	int			error = 0;
+	xfs_fileoff_t		start_lblk;
+	xfs_fileoff_t		end_lblk;
+	xfs_extnum_t		idx;
+
+	if (!xfs_is_reflink_inode(ip))
+		return 0;
+
+	trace_xfs_reflink_allocate_cow_range(ip, len, pos, 0);
+
+	start_lblk = XFS_B_TO_FSBT(ip->i_mount, pos);
+	end_lblk = XFS_B_TO_FSB(ip->i_mount, pos + len);
+	ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+
+	gotp = xfs_iext_bno_to_ext(ifp, start_lblk, &idx);
+	while (gotp) {
+		xfs_bmbt_get_all(gotp, &imap);
+
+		if (imap.br_startoff >= end_lblk)
+			break;
+		if (!isnullstartblock(imap.br_startblock))
+			goto advloop;
+		xfs_trim_extent(&imap, start_lblk, end_lblk - start_lblk);
+		trace_xfs_reflink_allocate_cow_extent(ip, &imap);
+
+		xfs_iunlock(ip, XFS_ILOCK_EXCL);
+		error = xfs_iomap_write_allocate(ip, XFS_COW_FORK,
+				XFS_FSB_TO_B(ip->i_mount, imap.br_startoff +
+						imap.br_blockcount - 1), &imap);
+		xfs_ilock(ip, XFS_ILOCK_EXCL);
+		if (error)
+			break;
+advloop:
+		/* Roll on... */
+		idx++;
+		if (idx >= ifp->if_bytes / sizeof(xfs_bmbt_rec_t))
+			break;
+		gotp = xfs_iext_get_ext(ifp, idx);
+	}
+
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+
+	if (error)
+		trace_xfs_reflink_allocate_cow_range_error(ip, error, _RET_IP_);
+	return error;
+}
+
+/*
  * Determine if there's a CoW reservation at a byte offset of an inode.
  */
 bool
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index 27ae6c0..fb128dd 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -20,8 +20,13 @@
 #ifndef __XFS_REFLINK_H
 #define __XFS_REFLINK_H 1
 
+extern int xfs_reflink_irec_is_shared(struct xfs_inode *ip,
+		struct xfs_bmbt_irec *imap, bool *shared);
+
 extern int xfs_reflink_reserve_cow_range(struct xfs_inode *ip, xfs_off_t pos,
 		xfs_off_t len);
+extern int xfs_reflink_allocate_cow_range(struct xfs_inode *ip, xfs_off_t pos,
+		xfs_off_t len);
 extern bool xfs_reflink_is_cow_pending(struct xfs_inode *ip, xfs_off_t offset);
 extern int xfs_reflink_find_cow_mapping(struct xfs_inode *ip, xfs_off_t offset,
 		struct xfs_bmbt_irec *imap, bool *need_alloc);



* [PATCH 085/119] xfs: copy-on-write reflinked blocks when zeroing ranges of blocks
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (83 preceding siblings ...)
  2016-06-17  1:26 ` [PATCH 084/119] xfs: implement CoW for directio writes Darrick J. Wong
@ 2016-06-17  1:26 ` Darrick J. Wong
  2016-06-17  1:27 ` [PATCH 086/119] xfs: cancel CoW reservations and clear inode reflink flag when freeing blocks Darrick J. Wong
                   ` (33 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:26 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, Christoph Hellwig, xfs

When we're writing zeroes to a reflinked block (such as when we're
punching a reflinked range), we need to fork the block and write
to the new copy; otherwise we can corrupt the other reflinks.
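
The redirection at the heart of this patch is a one-line address computation: once a CoW mapping has been allocated, the buffer about to be zeroed and written is re-aimed at the replacement block instead of the shared one. A minimal sketch of that arithmetic (names are illustrative, not the kernel's):

```c
#include <stdint.h>

/*
 * Hypothetical helper mirroring the new_fsbno computation in
 * xfs_zero_remaining_bytes(): the target block is the CoW extent's
 * physical start plus this block's offset within the extent.
 */
static uint64_t cow_target_block(uint64_t cow_startblock,
				 uint64_t cow_startoff,
				 uint64_t offset_fsb)
{
	return cow_startblock + (offset_fsb - cow_startoff);
}
```

In the patch the result is fed to XFS_FSB_TO_DADDR() and stored in the buffer before xfs_bwrite(), so the zeroing I/O lands on the forked copy.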

v2: Only call the end_cow functions if we had set should_fork, and
release the buffer if xfs_map_cow_blocks fails.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
[hch@lst.de: Only call the end_cow functions if we had set should_fork]
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_aops.c      |   12 ++++++++++++
 fs/xfs/xfs_bmap_util.c |   38 ++++++++++++++++++++++++++++++++++++--
 fs/xfs/xfs_reflink.h   |    4 ++++
 3 files changed, 52 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 31318b3..812bae5 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -394,6 +394,18 @@ xfs_map_blocks(
 	return 0;
 }
 
+/*
+ * Find a CoW mapping and ensure that blocks have been allocated to it.
+ */
+int
+xfs_map_cow_blocks(
+	struct inode		*inode,
+	xfs_off_t		offset,
+	struct xfs_bmbt_irec	*imap)
+{
+	return xfs_map_blocks(inode, offset, imap, XFS_IO_COW);
+}
+
 STATIC bool
 xfs_imap_valid(
 	struct inode		*inode,
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 8666873..79225fb 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -42,6 +42,8 @@
 #include "xfs_icache.h"
 #include "xfs_log.h"
 #include "xfs_rmap_btree.h"
+#include "xfs_iomap.h"
+#include "xfs_reflink.h"
 
 /* Kernel only BMAP related definitions and functions */
 
@@ -1042,7 +1044,8 @@ xfs_zero_remaining_bytes(
 	xfs_buf_t		*bp;
 	xfs_mount_t		*mp = ip->i_mount;
 	int			nimap;
-	int			error = 0;
+	int			error = 0, err2;
+	bool			should_fork = false;
 
 	/*
 	 * Avoid doing I/O beyond eof - it's not necessary
@@ -1055,6 +1058,11 @@ xfs_zero_remaining_bytes(
 	if (endoff > XFS_ISIZE(ip))
 		endoff = XFS_ISIZE(ip);
 
+	error = xfs_reflink_reserve_cow_range(ip, startoff,
+			endoff - startoff + 1);
+	if (error)
+		return error;
+
 	for (offset = startoff; offset <= endoff; offset = lastoffset + 1) {
 		uint lock_mode;
 
@@ -1063,6 +1071,10 @@ xfs_zero_remaining_bytes(
 
 		lock_mode = xfs_ilock_data_map_shared(ip);
 		error = xfs_bmapi_read(ip, offset_fsb, 1, &imap, &nimap, 0);
+
+		/* Do we need to CoW this block? */
+		if (error == 0 && nimap == 1)
+			should_fork = xfs_reflink_is_cow_pending(ip, offset);
 		xfs_iunlock(ip, lock_mode);
 
 		if (error || nimap < 1)
@@ -1084,7 +1096,7 @@ xfs_zero_remaining_bytes(
 			lastoffset = endoff;
 
 		/* DAX can just zero the backing device directly */
-		if (IS_DAX(VFS_I(ip))) {
+		if (IS_DAX(VFS_I(ip)) && !should_fork) {
 			error = dax_zero_page_range(VFS_I(ip), offset,
 						    lastoffset - offset + 1,
 						    xfs_get_blocks_direct);
@@ -1105,8 +1117,30 @@ xfs_zero_remaining_bytes(
 				(offset - XFS_FSB_TO_B(mp, imap.br_startoff)),
 		       0, lastoffset - offset + 1);
 
+		if (should_fork) {
+			xfs_fsblock_t	new_fsbno;
+
+			error = xfs_map_cow_blocks(VFS_I(ip), offset, &imap);
+			if (error) {
+				xfs_buf_relse(bp);
+				return error;
+			}
+			new_fsbno = imap.br_startblock +
+					(offset_fsb - imap.br_startoff);
+			XFS_BUF_SET_ADDR(bp, XFS_FSB_TO_DADDR(mp, new_fsbno));
+		}
+
 		error = xfs_bwrite(bp);
 		xfs_buf_relse(bp);
+		if (should_fork) {
+			if (error) {
+				err2 = xfs_reflink_cancel_cow_range(ip, offset,
+						lastoffset - offset + 1);
+				return error;
+			}
+			error = xfs_reflink_end_cow(ip, offset,
+					lastoffset - offset + 1);
+		}
 		if (error)
 			return error;
 	}
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index fb128dd..2f3c829 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -41,4 +41,8 @@ extern int xfs_reflink_cancel_cow_range(struct xfs_inode *ip, xfs_off_t offset,
 extern int xfs_reflink_end_cow(struct xfs_inode *ip, xfs_off_t offset,
 		xfs_off_t count);
 
+/* xfs_aops.c */
+extern int xfs_map_cow_blocks(struct inode *inode, xfs_off_t offset,
+		struct xfs_bmbt_irec *imap);
+
 #endif /* __XFS_REFLINK_H */



* [PATCH 086/119] xfs: cancel CoW reservations and clear inode reflink flag when freeing blocks
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (84 preceding siblings ...)
  2016-06-17  1:26 ` [PATCH 085/119] xfs: copy-on-write reflinked blocks when zeroing ranges of blocks Darrick J. Wong
@ 2016-06-17  1:27 ` Darrick J. Wong
  2016-06-17  1:27 ` [PATCH 087/119] xfs: cancel pending CoW reservations when destroying inodes Darrick J. Wong
                   ` (32 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:27 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

When we're freeing blocks (truncate, punch, etc.), clear all CoW
reservations in the range being freed.  If the file block count
drops to zero, also clear the inode reflink flag.
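
The flag-clearing rule above is simple enough to state as a standalone sketch (struct and flag names here are stand-ins, not the kernel's xfs_inode or XFS_DIFLAG2_REFLINK): once the file owns no blocks at all, it cannot be sharing any, so the reflink flag can be dropped.

```c
#include <stdint.h>

#define FAKE_DIFLAG2_REFLINK	(1ULL << 1)	/* illustrative bit */

struct fake_inode {
	uint64_t	nblocks;	/* stand-in for i_d.di_nblocks */
	uint64_t	flags2;		/* stand-in for i_d.di_flags2 */
};

/*
 * Sketch of the check added to xfs_free_cow_space() and
 * xfs_itruncate_extents(): clear the reflink flag only when the
 * block count has dropped to zero.
 */
static void maybe_clear_reflink(struct fake_inode *ip)
{
	if (ip->nblocks == 0)
		ip->flags2 &= ~FAKE_DIFLAG2_REFLINK;
}
```

In the real code this happens inside a transaction, with the inode relogged so the cleared flag reaches the log.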

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_bmap_util.c |   34 ++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_inode.c     |   13 +++++++++++++
 2 files changed, 47 insertions(+)


diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 79225fb..9285111 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1147,6 +1147,32 @@ xfs_zero_remaining_bytes(
 	return error;
 }
 
+STATIC int
+xfs_free_cow_space(
+	struct xfs_inode	*ip,
+	struct xfs_trans	**tpp,
+	xfs_fileoff_t		startoffset_fsb,
+	xfs_fileoff_t		endoffset_fsb)
+{
+	int			error;
+
+	/* Remove any pending CoW reservations. */
+	error = xfs_reflink_cancel_cow_blocks(ip, tpp, startoffset_fsb,
+			endoffset_fsb);
+	if (error)
+		goto out;
+
+	/*
+	 * Clear the reflink flag if we freed everything.
+	 */
+	if (ip->i_d.di_nblocks == 0) {
+		ip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
+		xfs_trans_log_inode(*tpp, ip, XFS_ILOG_CORE);
+	}
+out:
+	return error;
+}
+
 int
 xfs_free_file_space(
 	struct xfs_inode	*ip,
@@ -1294,6 +1320,14 @@ xfs_free_file_space(
 		if (error)
 			goto error0;
 
+		/* Remove CoW reservations and inode flag if applicable. */
+		if (done && xfs_is_reflink_inode(ip)) {
+			error = xfs_free_cow_space(ip, &tp, startoffset_fsb,
+					endoffset_fsb);
+			if (error)
+				goto error0;
+		}
+
 		error = xfs_trans_commit(tp);
 		xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	}
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index e08eaea..b8d3c4f 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -49,6 +49,7 @@
 #include "xfs_trans_priv.h"
 #include "xfs_log.h"
 #include "xfs_bmap_btree.h"
+#include "xfs_reflink.h"
 
 kmem_zone_t *xfs_inode_zone;
 
@@ -1586,6 +1587,18 @@ xfs_itruncate_extents(
 			goto out;
 	}
 
+	/* Remove all pending CoW reservations. */
+	error = xfs_reflink_cancel_cow_blocks(ip, &tp, first_unmap_block,
+			last_block);
+	if (error)
+		goto out;
+
+	/*
+	 * Clear the reflink flag if we truncated everything.
+	 */
+	if (ip->i_d.di_nblocks == 0 && xfs_is_reflink_inode(ip))
+		ip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
+
 	/*
 	 * Always re-log the inode so that our permanent transaction can keep
 	 * on rolling it forward in the log.



* [PATCH 087/119] xfs: cancel pending CoW reservations when destroying inodes
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (85 preceding siblings ...)
  2016-06-17  1:27 ` [PATCH 086/119] xfs: cancel CoW reservations and clear inode reflink flag when freeing blocks Darrick J. Wong
@ 2016-06-17  1:27 ` Darrick J. Wong
  2016-06-17  1:27 ` [PATCH 088/119] xfs: store in-progress CoW allocations in the refcount btree Darrick J. Wong
                   ` (31 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:27 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

When destroying the inode, cancel all pending reservations in the CoW
fork so that all the reserved blocks go back to the free pile.  In
theory this sort of cleanup is only needed to clean up after write
errors.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_super.c |    8 ++++++++
 1 file changed, 8 insertions(+)


diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 18f74b3..09f9af7 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -50,6 +50,7 @@
 #include "xfs_rmap_item.h"
 #include "xfs_refcount_item.h"
 #include "xfs_bmap_item.h"
+#include "xfs_reflink.h"
 
 #include <linux/namei.h>
 #include <linux/init.h>
@@ -939,6 +940,7 @@ xfs_fs_destroy_inode(
 	struct inode		*inode)
 {
 	struct xfs_inode	*ip = XFS_I(inode);
+	int			error;
 
 	trace_xfs_destroy_inode(ip);
 
@@ -946,6 +948,12 @@ xfs_fs_destroy_inode(
 	XFS_STATS_INC(ip->i_mount, vn_rele);
 	XFS_STATS_INC(ip->i_mount, vn_remove);
 
+	error = xfs_reflink_cancel_cow_range(ip, 0, NULLFILEOFF);
+	if (error && !XFS_FORCED_SHUTDOWN(ip->i_mount))
+		xfs_warn(ip->i_mount, "Error %d while evicting CoW blocks "
+				"for inode %llu.",
+				error, ip->i_ino);
+
 	xfs_inactive(ip);
 
 	ASSERT(XFS_FORCED_SHUTDOWN(ip->i_mount) || ip->i_delayed_blks == 0);



* [PATCH 088/119] xfs: store in-progress CoW allocations in the refcount btree
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (86 preceding siblings ...)
  2016-06-17  1:27 ` [PATCH 087/119] xfs: cancel pending CoW reservations when destroying inodes Darrick J. Wong
@ 2016-06-17  1:27 ` Darrick J. Wong
  2016-06-17  1:27 ` [PATCH 089/119] xfs: reflink extents from one file to another Darrick J. Wong
                   ` (30 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:27 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Due to the way the CoW algorithm in XFS works, there's an interval
during which blocks allocated to handle a CoW can be lost -- if the FS
goes down after the blocks are allocated but before the block
remapping takes place.  This is exacerbated by the cowextsz hint --
allocated reservations can sit around for a while, waiting to get
used.

Since the refcount btree doesn't normally store records with refcount
of 1, we can use it to record these in-progress extents.  In-progress
blocks cannot be shared because they're not user-visible, so there
shouldn't be any conflicts with other programs.  This is a better
solution than holding EFIs during writeback because (a) EFIs can't be
relogged currently, (b) even if they could, EFIs are bound by
available log space, which puts an unnecessary upper bound on how much
CoW we can have in flight, and (c) we already have a mechanism to
track blocks.

At mount time, read the refcount records and free anything we find
with a refcount of 1 because those were in-progress when the FS went
down.

v2: Use the deferred operations system to avoid deadlocks and blowing
out the transaction reservation.  This allows us to unmap a CoW
extent from the refcountbt and into a file atomically.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c     |   11 +
 fs/xfs/libxfs/xfs_format.h   |    3 
 fs/xfs/libxfs/xfs_refcount.c |  321 +++++++++++++++++++++++++++++++++++++++++-
 fs/xfs/libxfs/xfs_refcount.h |    7 +
 fs/xfs/xfs_log_recover.c     |   12 ++
 fs/xfs/xfs_mount.c           |    7 +
 fs/xfs/xfs_reflink.c         |  144 +++++++++++++++++++
 fs/xfs/xfs_reflink.h         |    2 
 fs/xfs/xfs_super.c           |    6 +
 fs/xfs/xfs_trace.h           |    4 +
 10 files changed, 511 insertions(+), 6 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 2ba513b..0909532 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -4705,6 +4705,17 @@ xfs_bmapi_write(
 				goto error0;
 			if (bma.blkno == NULLFSBLOCK)
 				break;
+
+			/*
+			 * If this is a CoW allocation, record the data in
+			 * the refcount btree for orphan recovery.
+			 */
+			if (whichfork == XFS_COW_FORK) {
+				error = xfs_refcount_alloc_cow_extent(mp, dfops,
+						bma.blkno, bma.length);
+				if (error)
+					goto error0;
+			}
 		}
 
 		/* Deal with the allocated space we found.  */
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 45bbdad..3d336e9 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -1401,7 +1401,8 @@ xfs_rmap_ino_owner(
 #define XFS_RMAP_OWN_INOBT	(-6ULL)	/* Inode btree blocks */
 #define XFS_RMAP_OWN_INODES	(-7ULL)	/* Inode chunk */
 #define XFS_RMAP_OWN_REFC	(-8ULL) /* refcount tree */
-#define XFS_RMAP_OWN_MIN	(-9ULL) /* guard */
+#define XFS_RMAP_OWN_COW	(-9ULL) /* cow allocations */
+#define XFS_RMAP_OWN_MIN	(-10ULL) /* guard */
 
 #define XFS_RMAP_NON_INODE_OWNER(owner)	(!!((owner) & (1ULL << 63)))
 
diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
index ebbb714..88f91d5 100644
--- a/fs/xfs/libxfs/xfs_refcount.c
+++ b/fs/xfs/libxfs/xfs_refcount.c
@@ -36,13 +36,23 @@
 #include "xfs_trans.h"
 #include "xfs_bit.h"
 #include "xfs_refcount.h"
+#include "xfs_rmap_btree.h"
 
 /* Allowable refcount adjustment amounts. */
 enum xfs_refc_adjust_op {
 	XFS_REFCOUNT_ADJUST_INCREASE	= 1,
 	XFS_REFCOUNT_ADJUST_DECREASE	= -1,
+	XFS_REFCOUNT_ADJUST_COW_ALLOC	= 0,
+	XFS_REFCOUNT_ADJUST_COW_FREE	= -1,
 };
 
+STATIC int __xfs_refcount_cow_alloc(struct xfs_btree_cur *rcur,
+		xfs_agblock_t agbno, xfs_extlen_t aglen,
+		struct xfs_defer_ops *dfops);
+STATIC int __xfs_refcount_cow_free(struct xfs_btree_cur *rcur,
+		xfs_agblock_t agbno, xfs_extlen_t aglen,
+		struct xfs_defer_ops *dfops);
+
 /*
  * Look up the first record less than or equal to [bno, len] in the btree
  * given by cur.
@@ -468,6 +478,8 @@ out_error:
 	return error;
 }
 
+#define XFS_FIND_RCEXT_SHARED	1
+#define XFS_FIND_RCEXT_COW	2
 /*
  * Find the left extent and the one after it (cleft).  This function assumes
  * that we've already split any extent crossing agbno.
@@ -478,7 +490,8 @@ xfs_refcount_find_left_extents(
 	struct xfs_refcount_irec	*left,
 	struct xfs_refcount_irec	*cleft,
 	xfs_agblock_t			agbno,
-	xfs_extlen_t			aglen)
+	xfs_extlen_t			aglen,
+	int				flags)
 {
 	struct xfs_refcount_irec	tmp;
 	int				error;
@@ -498,6 +511,10 @@ xfs_refcount_find_left_extents(
 
 	if (RCNEXT(tmp) != agbno)
 		return 0;
+	if ((flags & XFS_FIND_RCEXT_SHARED) && tmp.rc_refcount < 2)
+		return 0;
+	if ((flags & XFS_FIND_RCEXT_COW) && tmp.rc_refcount > 1)
+		return 0;
 	/* We have a left extent; retrieve (or invent) the next right one */
 	*left = tmp;
 
@@ -554,7 +571,8 @@ xfs_refcount_find_right_extents(
 	struct xfs_refcount_irec	*right,
 	struct xfs_refcount_irec	*cright,
 	xfs_agblock_t			agbno,
-	xfs_extlen_t			aglen)
+	xfs_extlen_t			aglen,
+	int				flags)
 {
 	struct xfs_refcount_irec	tmp;
 	int				error;
@@ -574,6 +592,10 @@ xfs_refcount_find_right_extents(
 
 	if (tmp.rc_startblock != agbno + aglen)
 		return 0;
+	if ((flags & XFS_FIND_RCEXT_SHARED) && tmp.rc_refcount < 2)
+		return 0;
+	if ((flags & XFS_FIND_RCEXT_COW) && tmp.rc_refcount > 1)
+		return 0;
 	/* We have a right extent; retrieve (or invent) the next left one */
 	*right = tmp;
 
@@ -630,6 +652,7 @@ xfs_refcount_merge_extents(
 	xfs_agblock_t		*agbno,
 	xfs_extlen_t		*aglen,
 	enum xfs_refc_adjust_op adjust,
+	int			flags,
 	bool			*shape_changed)
 {
 	struct xfs_refcount_irec	left = {0}, cleft = {0};
@@ -645,11 +668,11 @@ xfs_refcount_merge_extents(
 	 * [right].
 	 */
 	error = xfs_refcount_find_left_extents(cur, &left, &cleft, *agbno,
-			*aglen);
+			*aglen, flags);
 	if (error)
 		return error;
 	error = xfs_refcount_find_right_extents(cur, &right, &cright, *agbno,
-			*aglen);
+			*aglen, flags);
 	if (error)
 		return error;
 
@@ -936,7 +959,7 @@ xfs_refcount_adjust(
 	 */
 	orig_aglen = aglen;
 	error = xfs_refcount_merge_extents(cur, &agbno, &aglen, adj,
-			&shape_changed);
+			XFS_FIND_RCEXT_SHARED, &shape_changed);
 	if (error)
 		goto out_error;
 	if (shape_changed)
@@ -1053,6 +1076,18 @@ xfs_refcount_finish_one(
 		error = xfs_refcount_adjust(rcur, bno, blockcount, adjusted,
 			XFS_REFCOUNT_ADJUST_DECREASE, dfops, NULL);
 		break;
+	case XFS_REFCOUNT_ALLOC_COW:
+		*adjusted = 0;
+		error = __xfs_refcount_cow_alloc(rcur, bno, blockcount, dfops);
+		if (!error)
+			*adjusted = blockcount;
+		break;
+	case XFS_REFCOUNT_FREE_COW:
+		*adjusted = 0;
+		error = __xfs_refcount_cow_free(rcur, bno, blockcount, dfops);
+		if (!error)
+			*adjusted = blockcount;
+		break;
 	default:
 		ASSERT(0);
 		error = -EFSCORRUPTED;
@@ -1238,3 +1273,279 @@ out:
 		trace_xfs_refcount_find_shared_error(mp, agno, error, _RET_IP_);
 	return error;
 }
+
+/*
+ * Recovering CoW Blocks After a Crash
+ *
+ * Due to the way that the copy on write mechanism works, there's a window of
+ * opportunity in which we can lose track of allocated blocks during a crash.
+ * Because CoW uses delayed allocation in the in-core CoW fork, writeback
+ * causes blocks to be allocated and stored in the CoW fork.  The blocks are
+ * no longer in the free space btree but are not otherwise recorded anywhere
+ * until the write completes and the blocks are mapped into the file.  A crash
+ * in between allocation and remapping results in the replacement blocks being
+ * lost.  This situation is exacerbated by the CoW extent size hint because
+ * allocations can hang around for a long time.
+ *
+ * However, there is a place where we can record these allocations before they
+ * become mappings -- the reference count btree.  The btree does not record
+ * extents with refcount == 1, so we can record allocations with a refcount of
+ * 1.  Blocks being used for CoW writeout cannot be shared, so there should be
+ * no conflict with shared block records.  These mappings should be created
+ * when we allocate blocks to the CoW fork and deleted when they're removed
+ * from the CoW fork.
+ *
+ * Minor nit: records for in-progress CoW allocations and records for shared
+ * extents must never be merged, to preserve the property that (except for CoW
+ * allocations) there are no refcount btree entries with refcount == 1.  The
+ * only time this could potentially happen is when unsharing a block that's
+ * adjacent to CoW allocations, so we must be careful to avoid this.
+ *
+ * At mount time we recover lost CoW allocations by searching the refcount
+ * btree for these refcount == 1 mappings.  These represent CoW allocations
+ * that were in progress at the time the filesystem went down, so we can free
+ * them to get the space back.
+ *
+ * This mechanism is superior to creating EFIs for unmapped CoW extents for
+ * several reasons -- first, EFIs pin the tail of the log and would have to be
+ * periodically relogged to avoid filling up the log.  Second, CoW completions
+ * will have to file an EFD and create new EFIs for whatever remains in the
+ * CoW fork; this partially takes care of the first problem, but extent-size
+ * will have to periodically relog even if there's no writeout in progress.
+ * This can happen if the CoW extent size hint is set, which you really want.
+ * Third, EFIs cannot currently be automatically relogged into newer
+ * transactions to advance the log tail.  Fourth, stuffing the log full of
+ * EFIs places an upper bound on the number of CoW allocations that can be
+ * held filesystem-wide at any given time.  Recording them in the refcount
+ * btree doesn't require us to maintain any state in memory and doesn't pin
+ * the log.
+ */
+/*
+ * Adjust the refcounts of CoW allocations.  These allocations are "magic"
+ * in that they're not referenced anywhere else in the filesystem, so we
+ * stash them in the refcount btree with a refcount of 1 until either file
+ * remapping (or CoW cancellation) happens.
+ */
+STATIC int
+xfs_refcount_adjust_cow_extents(
+	struct xfs_btree_cur	*cur,
+	xfs_agblock_t		agbno,
+	xfs_extlen_t		aglen,
+	enum xfs_refc_adjust_op	adj,
+	struct xfs_defer_ops	*dfops,
+	struct xfs_owner_info	*oinfo)
+{
+	struct xfs_refcount_irec	ext, tmp;
+	int				error;
+	int				found_rec, found_tmp;
+
+	if (aglen == 0)
+		return 0;
+
+	/* Find any overlapping refcount records */
+	error = xfs_refcountbt_lookup_ge(cur, agbno, &found_rec);
+	if (error)
+		goto out_error;
+	error = xfs_refcountbt_get_rec(cur, &ext, &found_rec);
+	if (error)
+		goto out_error;
+	if (!found_rec) {
+		ext.rc_startblock = cur->bc_mp->m_sb.sb_agblocks;
+		ext.rc_blockcount = 0;
+		ext.rc_refcount = 0;
+	}
+
+	switch (adj) {
+	case XFS_REFCOUNT_ADJUST_COW_ALLOC:
+		/* Adding a CoW reservation; there should be nothing here. */
+		XFS_WANT_CORRUPTED_GOTO(cur->bc_mp,
+				ext.rc_startblock >= agbno + aglen, out_error);
+
+		tmp.rc_startblock = agbno;
+		tmp.rc_blockcount = aglen;
+		tmp.rc_refcount = 1;
+		trace_xfs_refcount_modify_extent(cur->bc_mp,
+				cur->bc_private.a.agno, &tmp);
+
+		error = xfs_refcountbt_insert(cur, &tmp,
+				&found_tmp);
+		if (error)
+			goto out_error;
+		XFS_WANT_CORRUPTED_GOTO(cur->bc_mp,
+				found_tmp == 1, out_error);
+		break;
+	case XFS_REFCOUNT_ADJUST_COW_FREE:
+		/* Removing a CoW reservation; there should be one extent. */
+		XFS_WANT_CORRUPTED_GOTO(cur->bc_mp,
+			ext.rc_startblock == agbno, out_error);
+		XFS_WANT_CORRUPTED_GOTO(cur->bc_mp,
+			ext.rc_blockcount == aglen, out_error);
+		XFS_WANT_CORRUPTED_GOTO(cur->bc_mp,
+			ext.rc_refcount == 1, out_error);
+
+		ext.rc_refcount = 0;
+		trace_xfs_refcount_modify_extent(cur->bc_mp,
+				cur->bc_private.a.agno, &ext);
+		error = xfs_refcountbt_delete(cur, &found_rec);
+		if (error)
+			goto out_error;
+		XFS_WANT_CORRUPTED_GOTO(cur->bc_mp,
+				found_rec == 1, out_error);
+		break;
+	default:
+		ASSERT(0);
+	}
+
+	return error;
+out_error:
+	trace_xfs_refcount_modify_extent_error(cur->bc_mp,
+			cur->bc_private.a.agno, error, _RET_IP_);
+	return error;
+}
+
+/*
+ * Add or remove refcount btree entries for CoW reservations.
+ */
+STATIC int
+xfs_refcount_adjust_cow(
+	struct xfs_btree_cur	*cur,
+	xfs_agblock_t		agbno,
+	xfs_extlen_t		aglen,
+	enum xfs_refc_adjust_op	adj,
+	struct xfs_defer_ops	*dfops)
+{
+	bool			shape_changed;
+	int			error;
+
+	/*
+	 * Ensure that no rcextents cross the boundary of the adjustment range.
+	 */
+	error = xfs_refcount_split_extent(cur, agbno, &shape_changed);
+	if (error)
+		goto out_error;
+
+	error = xfs_refcount_split_extent(cur, agbno + aglen, &shape_changed);
+	if (error)
+		goto out_error;
+
+	/*
+	 * Try to merge with the left or right extents of the range.
+	 */
+	error = xfs_refcount_merge_extents(cur, &agbno, &aglen, adj,
+			XFS_FIND_RCEXT_COW, &shape_changed);
+	if (error)
+		goto out_error;
+
+	/* Now that we've taken care of the ends, adjust the middle extents */
+	error = xfs_refcount_adjust_cow_extents(cur, agbno, aglen, adj,
+			dfops, NULL);
+	if (error)
+		goto out_error;
+
+	return 0;
+
+out_error:
+	trace_xfs_refcount_adjust_cow_error(cur->bc_mp, cur->bc_private.a.agno,
+			error, _RET_IP_);
+	return error;
+}
+
+/*
+ * Record a CoW allocation in the refcount btree.
+ */
+STATIC int
+__xfs_refcount_cow_alloc(
+	struct xfs_btree_cur	*rcur,
+	xfs_agblock_t		agbno,
+	xfs_extlen_t		aglen,
+	struct xfs_defer_ops	*dfops)
+{
+	int			error;
+
+	trace_xfs_refcount_cow_increase(rcur->bc_mp, rcur->bc_private.a.agno,
+			agbno, aglen);
+
+	/* Add refcount btree reservation */
+	error = xfs_refcount_adjust_cow(rcur, agbno, aglen,
+			XFS_REFCOUNT_ADJUST_COW_ALLOC, dfops);
+	if (error)
+		return error;
+
+	/* Add rmap entry */
+	if (xfs_sb_version_hasrmapbt(&rcur->bc_mp->m_sb)) {
+		error = xfs_rmap_alloc_defer(rcur->bc_mp, dfops,
+				rcur->bc_private.a.agno,
+				agbno, aglen, XFS_RMAP_OWN_COW);
+		if (error)
+			return error;
+	}
+
+	return error;
+}
+
+/*
+ * Remove a CoW allocation from the refcount btree.
+ */
+STATIC int
+__xfs_refcount_cow_free(
+	struct xfs_btree_cur	*rcur,
+	xfs_agblock_t		agbno,
+	xfs_extlen_t		aglen,
+	struct xfs_defer_ops	*dfops)
+{
+	int			error;
+
+	trace_xfs_refcount_cow_decrease(rcur->bc_mp, rcur->bc_private.a.agno,
+			agbno, aglen);
+
+	/* Remove refcount btree reservation */
+	error = xfs_refcount_adjust_cow(rcur, agbno, aglen,
+			XFS_REFCOUNT_ADJUST_COW_FREE, dfops);
+	if (error)
+		return error;
+
+	/* Remove rmap entry */
+	if (xfs_sb_version_hasrmapbt(&rcur->bc_mp->m_sb)) {
+		error = xfs_rmap_free_defer(rcur->bc_mp, dfops,
+				rcur->bc_private.a.agno,
+				agbno, aglen, XFS_RMAP_OWN_COW);
+		if (error)
+			return error;
+	}
+
+	return error;
+}
+
+/* Record a CoW staging extent in the refcount btree. */
+int
+xfs_refcount_alloc_cow_extent(
+	struct xfs_mount		*mp,
+	struct xfs_defer_ops		*dfops,
+	xfs_fsblock_t			fsb,
+	xfs_extlen_t			len)
+{
+	struct xfs_refcount_intent	ri;
+
+	ri.ri_type = XFS_REFCOUNT_ALLOC_COW;
+	ri.ri_startblock = fsb;
+	ri.ri_blockcount = len;
+
+	return __xfs_refcount_add(mp, dfops, &ri);
+}
+
+/* Forget a CoW staging event in the refcount btree. */
+int
+xfs_refcount_free_cow_extent(
+	struct xfs_mount		*mp,
+	struct xfs_defer_ops		*dfops,
+	xfs_fsblock_t			fsb,
+	xfs_extlen_t			len)
+{
+	struct xfs_refcount_intent	ri;
+
+	ri.ri_type = XFS_REFCOUNT_FREE_COW;
+	ri.ri_startblock = fsb;
+	ri.ri_blockcount = len;
+
+	return __xfs_refcount_add(mp, dfops, &ri);
+}
diff --git a/fs/xfs/libxfs/xfs_refcount.h b/fs/xfs/libxfs/xfs_refcount.h
index b7b83b8..6665eeb 100644
--- a/fs/xfs/libxfs/xfs_refcount.h
+++ b/fs/xfs/libxfs/xfs_refcount.h
@@ -57,4 +57,11 @@ extern int xfs_refcount_find_shared(struct xfs_mount *mp, xfs_agnumber_t agno,
 		xfs_agblock_t agbno, xfs_extlen_t aglen, xfs_agblock_t *fbno,
 		xfs_extlen_t *flen, bool find_maximal);
 
+extern int xfs_refcount_alloc_cow_extent(struct xfs_mount *mp,
+		struct xfs_defer_ops *dfops, xfs_fsblock_t fsb,
+		xfs_extlen_t len);
+extern int xfs_refcount_free_cow_extent(struct xfs_mount *mp,
+		struct xfs_defer_ops *dfops, xfs_fsblock_t fsb,
+		xfs_extlen_t len);
+
 #endif	/* __XFS_REFCOUNT_H__ */
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 3faaf10..58a700b 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -5040,6 +5040,18 @@ xlog_recover_process_cui(
 				error = xfs_refcount_decrease_extent(
 						tp->t_mountp, &dfops, &irec);
 				break;
+			case XFS_REFCOUNT_ALLOC_COW:
+				error = xfs_refcount_alloc_cow_extent(
+						tp->t_mountp, &dfops,
+						irec.br_startblock,
+						irec.br_blockcount);
+				break;
+			case XFS_REFCOUNT_FREE_COW:
+				error = xfs_refcount_free_cow_extent(
+						tp->t_mountp, &dfops,
+						irec.br_startblock,
+						irec.br_blockcount);
+				break;
 			default:
 				ASSERT(0);
 			}
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 48b8b1e..6351dce 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -44,6 +44,7 @@
 #include "xfs_sysfs.h"
 #include "xfs_rmap_btree.h"
 #include "xfs_refcount_btree.h"
+#include "xfs_reflink.h"
 
 
 static DEFINE_MUTEX(xfs_uuid_table_mutex);
@@ -960,6 +961,12 @@ xfs_mountfs(
 		if (error)
 			xfs_warn(mp,
 	"Unable to allocate reserve blocks. Continuing without reserve pool.");
+
+		/* Recover any CoW blocks that never got remapped. */
+		error = xfs_reflink_recover_cow(mp);
+		if (error && !XFS_FORCED_SHUTDOWN(mp))
+			xfs_err(mp,
+	"Error %d recovering leftover CoW allocations.", error);
 	}
 
 	return 0;
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 113f333..0f9cc82 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -534,6 +534,13 @@ xfs_reflink_cancel_cow_blocks(
 			xfs_trans_ijoin(*tpp, ip, 0);
 			xfs_defer_init(&dfops, &firstfsb);
 
+			/* Free the CoW orphan record. */
+			error = xfs_refcount_free_cow_extent(ip->i_mount,
+					&dfops, irec.br_startblock,
+					irec.br_blockcount);
+			if (error)
+				break;
+
 			xfs_bmap_add_free(ip->i_mount, &dfops,
 					irec.br_startblock, irec.br_blockcount,
 					NULL);
@@ -694,6 +701,13 @@ xfs_reflink_end_cow(
 			irec.br_blockcount = rlen;
 			trace_xfs_reflink_cow_remap_piece(ip, &uirec);
 
+			/* Free the CoW orphan record. */
+			error = xfs_refcount_free_cow_extent(tp->t_mountp,
+					&dfops, uirec.br_startblock,
+					uirec.br_blockcount);
+			if (error)
+				goto out_defer;
+
 			/* Map the new blocks into the data fork. */
 			error = xfs_bmap_map_extent(tp->t_mountp, &dfops,
 					ip, XFS_DATA_FORK, &uirec);
@@ -731,3 +745,133 @@ out:
 	trace_xfs_reflink_end_cow_error(ip, error, _RET_IP_);
 	return error;
 }
+
+struct xfs_reflink_recovery {
+	struct list_head		rr_list;
+	struct xfs_refcount_irec	rr_rrec;
+};
+
+/*
+ * Find and remove leftover CoW reservations.
+ */
+STATIC int
+xfs_reflink_recover_cow_ag(
+	struct xfs_mount		*mp,
+	xfs_agnumber_t			agno)
+{
+	struct list_head		debris;
+	struct xfs_trans		*tp;
+	struct xfs_btree_cur		*cur;
+	struct xfs_buf			*agbp;
+	struct xfs_refcount_irec	tmp;
+	struct xfs_reflink_recovery	*rr, *n;
+	struct xfs_defer_ops		dfops;
+	xfs_fsblock_t			fsb;
+	int				i, have;
+	int				error;
+
+	error = xfs_alloc_read_agf(mp, NULL, agno, 0, &agbp);
+	if (error)
+		return error;
+	cur = xfs_refcountbt_init_cursor(mp, NULL, agbp, agno, NULL);
+
+	/* Start iterating btree entries. */
+	INIT_LIST_HEAD(&debris);
+	error = xfs_refcountbt_lookup_ge(cur, 0, &have);
+	if (error)
+		goto out_error;
+	while (have) {
+		/* If refcount == 1, save the stashed entry for later. */
+		error = xfs_refcountbt_get_rec(cur, &tmp, &i);
+		if (error)
+			goto out_error;
+		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, out_error);
+		if (tmp.rc_refcount != 1)
+			goto advloop;
+
+		rr = kmem_alloc(sizeof(struct xfs_reflink_recovery), KM_SLEEP);
+		rr->rr_rrec = tmp;
+		list_add_tail(&rr->rr_list, &debris);
+
+advloop:
+		/* Look at the next one */
+		error = xfs_btree_increment(cur, 0, &have);
+		if (error)
+			goto out_error;
+	}
+
+	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+	xfs_buf_relse(agbp);
+
+	/* Now iterate the list to free the leftovers */
+	list_for_each_entry(rr, &debris, rr_list) {
+		/* Set up transaction. */
+		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, 0, &tp);
+		if (error)
+			goto out_free;
+
+		trace_xfs_reflink_recover_extent(mp, agno, &rr->rr_rrec);
+
+		/* Free the orphan record */
+		xfs_defer_init(&dfops, &fsb);
+		fsb = XFS_AGB_TO_FSB(mp, agno, rr->rr_rrec.rc_startblock);
+		error = xfs_refcount_free_cow_extent(mp, &dfops, fsb,
+				rr->rr_rrec.rc_blockcount);
+		if (error)
+			goto out_defer;
+
+		/* Free the block. */
+		xfs_bmap_add_free(mp, &dfops, fsb,
+				rr->rr_rrec.rc_blockcount, NULL);
+
+		error = xfs_defer_finish(&tp, &dfops, NULL);
+		if (error)
+			goto out_defer;
+
+		error = xfs_trans_commit(tp);
+		if (error)
+			goto out_cancel;
+	}
+	goto out_free;
+
+out_defer:
+	xfs_defer_cancel(&dfops);
+out_cancel:
+	xfs_trans_cancel(tp);
+
+out_free:
+	/* Free the leftover list */
+	list_for_each_entry_safe(rr, n, &debris, rr_list) {
+		list_del(&rr->rr_list);
+		kmem_free(rr);
+	}
+
+	return error;
+
+out_error:
+	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+	xfs_buf_relse(agbp);
+	return error;
+}
+
+/*
+ * Free leftover CoW reservations that didn't get cleaned out.
+ */
+int
+xfs_reflink_recover_cow(
+	struct xfs_mount	*mp)
+{
+	xfs_agnumber_t		agno;
+	int			error = 0;
+
+	if (!xfs_sb_version_hasreflink(&mp->m_sb))
+		return 0;
+
+	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
+		error = xfs_reflink_recover_cow_ag(mp, agno);
+		if (error)
+			break;
+	}
+
+	return error;
+}
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index 2f3c829..0f5fd0c 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -41,6 +41,8 @@ extern int xfs_reflink_cancel_cow_range(struct xfs_inode *ip, xfs_off_t offset,
 extern int xfs_reflink_end_cow(struct xfs_inode *ip, xfs_off_t offset,
 		xfs_off_t count);
 
+extern int xfs_reflink_recover_cow(struct xfs_mount *mp);
+
 /* xfs_aops.c */
 extern int xfs_map_cow_blocks(struct inode *inode, xfs_off_t offset,
 		struct xfs_bmbt_irec *imap);
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 09f9af7..87e997a 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1306,6 +1306,12 @@ xfs_fs_remount(
 		 */
 		xfs_restore_resvblks(mp);
 		xfs_log_work_queue(mp);
+
+		/* Recover any CoW blocks that never got remapped. */
+		error = xfs_reflink_recover_cow(mp);
+		if (error && !XFS_FORCED_SHUTDOWN(mp))
+			xfs_err(mp,
+	"Error %d recovering leftover CoW allocations.", error);
 	}
 
 	/* rw -> ro */
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 079075f..fe4a5be 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -2889,14 +2889,18 @@ DEFINE_AG_ERROR_EVENT(xfs_refcountbt_update_error);
 /* refcount adjustment tracepoints */
 DEFINE_AG_EXTENT_EVENT(xfs_refcount_increase);
 DEFINE_AG_EXTENT_EVENT(xfs_refcount_decrease);
+DEFINE_AG_EXTENT_EVENT(xfs_refcount_cow_increase);
+DEFINE_AG_EXTENT_EVENT(xfs_refcount_cow_decrease);
 DEFINE_REFCOUNT_TRIPLE_EXTENT_EVENT(xfs_refcount_merge_center_extents);
 DEFINE_REFCOUNT_EXTENT_EVENT(xfs_refcount_modify_extent);
+DEFINE_REFCOUNT_EXTENT_EVENT(xfs_reflink_recover_extent);
 DEFINE_REFCOUNT_EXTENT_AT_EVENT(xfs_refcount_split_extent);
 DEFINE_REFCOUNT_DOUBLE_EXTENT_EVENT(xfs_refcount_merge_left_extent);
 DEFINE_REFCOUNT_DOUBLE_EXTENT_EVENT(xfs_refcount_merge_right_extent);
 DEFINE_REFCOUNT_DOUBLE_EXTENT_AT_EVENT(xfs_refcount_find_left_extent);
 DEFINE_REFCOUNT_DOUBLE_EXTENT_AT_EVENT(xfs_refcount_find_right_extent);
 DEFINE_AG_ERROR_EVENT(xfs_refcount_adjust_error);
+DEFINE_AG_ERROR_EVENT(xfs_refcount_adjust_cow_error);
 DEFINE_AG_ERROR_EVENT(xfs_refcount_merge_center_extents_error);
 DEFINE_AG_ERROR_EVENT(xfs_refcount_modify_extent_error);
 DEFINE_AG_ERROR_EVENT(xfs_refcount_split_extent_error);


^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 089/119] xfs: reflink extents from one file to another
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (87 preceding siblings ...)
  2016-06-17  1:27 ` [PATCH 088/119] xfs: store in-progress CoW allocations in the refcount btree Darrick J. Wong
@ 2016-06-17  1:27 ` Darrick J. Wong
  2016-06-17  1:27 ` [PATCH 090/119] xfs: add clone file and clone range vfs functions Darrick J. Wong
                   ` (29 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:27 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Reflink extents from one file to another; that is to say, iteratively
remove the mappings from the destination file, copy the mappings from
the source file to the destination file, and increment the reference
count of all the blocks that got remapped.

v2: Call xfs_defer_cancel before cancelling the transaction if the
remap operation fails.  Use the deferred operations system to avoid
deadlocks or blowing out the transaction reservation, and make the
entire reflink operation atomic for each extent being remapped.  The
destination file's i_size will be updated if necessary to avoid
violating the assumption that there are no shared blocks past the EOF
block.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_reflink.c |  426 ++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_reflink.h |    3 
 2 files changed, 429 insertions(+)


diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 0f9cc82..c01c0c7 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -875,3 +875,429 @@ xfs_reflink_recover_cow(
 
 	return error;
 }
+
+/*
+ * Reflinking (Block) Ranges of Two Files Together
+ *
+ * First, ensure that the reflink flag is set on both inodes.  The flag is an
+ * optimization to avoid unnecessary refcount btree lookups in the write path.
+ *
+ * Now we can iteratively remap the range of extents (and holes) in src to the
+ * corresponding ranges in dest.  Let drange and srange denote the ranges of
+ * logical blocks in dest and src touched by the reflink operation.
+ *
+ * While the length of drange is greater than zero,
+ *    - Read src's bmbt at the start of srange ("imap")
+ *    - If imap doesn't exist, make imap appear to start at the end of srange
+ *      with zero length.
+ *    - If imap starts before srange, advance imap to start at srange.
+ *    - If imap goes beyond srange, truncate imap to end at the end of srange.
+ *    - Punch (imap start - srange start + imap len) blocks from dest at
+ *      offset (drange start).
+ *    - If imap points to a real range of pblks,
+ *         > Increase the refcount of the imap's pblks
+ *         > Map imap's pblks into dest at the offset
+ *           (drange start + imap start - srange start)
+ *    - Advance drange and srange by (imap start - srange start + imap len)
+ *
+ * Finally, if the reflink made dest longer, update both the in-core and
+ * on-disk file sizes.
+ *
+ * ASCII Art Demonstration:
+ *
+ * Let's say we want to reflink this source file:
+ *
+ * ----SSSSSSS-SSSSS----SSSSSS (src file)
+ *   <-------------------->
+ *
+ * into this destination file:
+ *
+ * --DDDDDDDDDDDDDDDDDDD--DDD (dest file)
+ *        <-------------------->
+ * '-' means a hole, and 'S' and 'D' are written blocks in the src and dest.
+ * Observe that the range has different logical offsets in either file.
+ *
+ * Consider that the first extent in the source file doesn't line up with our
+ * reflink range.  Unmapping and remapping are separate operations, so we can
+ * unmap more blocks from the destination file than we remap.
+ *
+ * ----SSSSSSS-SSSSS----SSSSSS
+ *   <------->
+ * --DDDDD---------DDDDD--DDD
+ *        <------->
+ *
+ * Now remap the source extent into the destination file:
+ *
+ * ----SSSSSSS-SSSSS----SSSSSS
+ *   <------->
+ * --DDDDD--SSSSSSSDDDDD--DDD
+ *        <------->
+ *
+ * Do likewise with the second hole and extent in our range.  Holes in the
+ * unmap range don't affect our operation.
+ *
+ * ----SSSSSSS-SSSSS----SSSSSS
+ *            <---->
+ * --DDDDD--SSSSSSS-SSSSS-DDD
+ *                 <---->
+ *
+ * Finally, unmap and remap part of the third extent.  This will increase the
+ * size of the destination file.
+ *
+ * ----SSSSSSS-SSSSS----SSSSSS
+ *                  <----->
+ * --DDDDD--SSSSSSS-SSSSS----SSS
+ *                       <----->
+ *
+ * Once we update the destination file's i_size, we're done.
+ */
+
+/*
+ * Ensure the reflink bit is set in both inodes.
+ */
+STATIC int
+xfs_reflink_set_inode_flag(
+	struct xfs_inode	*src,
+	struct xfs_inode	*dest)
+{
+	struct xfs_mount	*mp = src->i_mount;
+	int			error;
+	struct xfs_trans	*tp;
+
+	if (xfs_is_reflink_inode(src) && xfs_is_reflink_inode(dest))
+		return 0;
+
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_ichange, 0, 0, 0, &tp);
+	if (error)
+		goto out_error;
+
+	/* Lock both files against IO */
+	if (src->i_ino == dest->i_ino)
+		xfs_ilock(src, XFS_ILOCK_EXCL);
+	else
+		xfs_lock_two_inodes(src, dest, XFS_ILOCK_EXCL);
+
+	if (!xfs_is_reflink_inode(src)) {
+		trace_xfs_reflink_set_inode_flag(src);
+		xfs_trans_ijoin(tp, src, XFS_ILOCK_EXCL);
+		src->i_d.di_flags2 |= XFS_DIFLAG2_REFLINK;
+		xfs_trans_log_inode(tp, src, XFS_ILOG_CORE);
+		xfs_ifork_init_cow(src);
+	} else
+		xfs_iunlock(src, XFS_ILOCK_EXCL);
+
+	if (src->i_ino == dest->i_ino)
+		goto commit_flags;
+
+	if (!xfs_is_reflink_inode(dest)) {
+		trace_xfs_reflink_set_inode_flag(dest);
+		xfs_trans_ijoin(tp, dest, XFS_ILOCK_EXCL);
+		dest->i_d.di_flags2 |= XFS_DIFLAG2_REFLINK;
+		xfs_trans_log_inode(tp, dest, XFS_ILOG_CORE);
+		xfs_ifork_init_cow(dest);
+	} else
+		xfs_iunlock(dest, XFS_ILOCK_EXCL);
+
+commit_flags:
+	error = xfs_trans_commit(tp);
+	if (error)
+		goto out_error;
+	return error;
+
+out_error:
+	trace_xfs_reflink_set_inode_flag_error(dest, error, _RET_IP_);
+	return error;
+}
+
+/*
+ * Update destination inode size, if necessary.
+ */
+STATIC int
+xfs_reflink_update_dest(
+	struct xfs_inode	*dest,
+	xfs_off_t		newlen)
+{
+	struct xfs_mount	*mp = dest->i_mount;
+	struct xfs_trans	*tp;
+	int			error;
+
+	if (newlen <= i_size_read(VFS_I(dest)))
+		return 0;
+
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_ichange, 0, 0, 0, &tp);
+	if (error)
+		goto out_error;
+
+	xfs_ilock(dest, XFS_ILOCK_EXCL);
+	xfs_trans_ijoin(tp, dest, XFS_ILOCK_EXCL);
+
+	trace_xfs_reflink_update_inode_size(dest, newlen);
+	i_size_write(VFS_I(dest), newlen);
+	dest->i_d.di_size = newlen;
+	xfs_trans_log_inode(tp, dest, XFS_ILOG_CORE);
+
+	error = xfs_trans_commit(tp);
+	if (error)
+		goto out_error;
+	return error;
+
+out_error:
+	trace_xfs_reflink_update_inode_size_error(dest, error, _RET_IP_);
+	return error;
+}
+
+/*
+ * Unmap a range of blocks from a file, then map other blocks into the hole.
+ * The range to unmap is (destoff : destoff + srcioff + irec->br_blockcount).
+ * The extent irec is mapped into dest at irec->br_startoff.
+ */
+STATIC int
+xfs_reflink_remap_extent(
+	struct xfs_inode	*ip,
+	struct xfs_bmbt_irec	*irec,
+	xfs_fileoff_t		destoff,
+	xfs_off_t		new_isize)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_trans	*tp;
+	xfs_fsblock_t		firstfsb;
+	unsigned int		resblks;
+	struct xfs_defer_ops	dfops;
+	struct xfs_bmbt_irec	uirec;
+	bool			real_extent;
+	xfs_filblks_t		rlen;
+	xfs_filblks_t		unmap_len;
+	xfs_off_t		newlen;
+	int			error;
+
+	unmap_len = irec->br_startoff + irec->br_blockcount - destoff;
+	trace_xfs_reflink_punch_range(ip, destoff, unmap_len);
+
+	/* Only remap normal extents. */
+	real_extent =  (irec->br_startblock != HOLESTARTBLOCK &&
+			irec->br_startblock != DELAYSTARTBLOCK &&
+			!ISUNWRITTEN(irec));
+
+	/* Start a rolling transaction to switch the mappings */
+	resblks = XFS_EXTENTADD_SPACE_RES(ip->i_mount, XFS_DATA_FORK);
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0, &tp);
+	if (error)
+		goto out;
+
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	xfs_trans_ijoin(tp, ip, 0);
+
+	/* If we're not just clearing space, then do we have enough quota? */
+	if (real_extent) {
+		error = xfs_trans_reserve_quota_nblks(tp, ip,
+				irec->br_blockcount, 0, XFS_QMOPT_RES_REGBLKS);
+		if (error)
+			goto out_cancel;
+	}
+
+	trace_xfs_reflink_remap(ip, irec->br_startoff,
+				irec->br_blockcount, irec->br_startblock);
+
+	/* Unmap the old blocks in the data fork. */
+	rlen = unmap_len;
+	while (rlen) {
+		xfs_defer_init(&dfops, &firstfsb);
+		error = __xfs_bunmapi(tp, ip, destoff, &rlen, 0, 1,
+				&firstfsb, &dfops);
+		if (error)
+			goto out_defer;
+
+		/* Trim the extent to whatever got unmapped. */
+		uirec = *irec;
+		xfs_trim_extent(&uirec, destoff + rlen, unmap_len - rlen);
+		unmap_len = rlen;
+
+		/* If this isn't a real mapping, we're done. */
+		if (!real_extent || uirec.br_blockcount == 0)
+			goto next_extent;
+
+		trace_xfs_reflink_remap(ip, uirec.br_startoff,
+				uirec.br_blockcount, uirec.br_startblock);
+
+		/* Update the refcount tree */
+		error = xfs_refcount_increase_extent(mp, &dfops, &uirec);
+		if (error)
+			goto out_defer;
+
+		/* Map the new blocks into the data fork. */
+		error = xfs_bmap_map_extent(mp, &dfops, ip, XFS_DATA_FORK,
+				&uirec);
+		if (error)
+			goto out_defer;
+
+		/* Update quota accounting. */
+		xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT,
+				uirec.br_blockcount);
+
+		/* Update dest isize if needed. */
+		newlen = XFS_FSB_TO_B(mp,
+				uirec.br_startoff + uirec.br_blockcount);
+		newlen = min_t(xfs_off_t, newlen, new_isize);
+		if (newlen > i_size_read(VFS_I(ip))) {
+			trace_xfs_reflink_update_inode_size(ip, newlen);
+			i_size_write(VFS_I(ip), newlen);
+			ip->i_d.di_size = newlen;
+			xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+		}
+
+next_extent:
+		/* Process all the deferred stuff. */
+		error = xfs_defer_finish(&tp, &dfops, ip);
+		if (error)
+			goto out_defer;
+	}
+
+	error = xfs_trans_commit(tp);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	if (error)
+		goto out;
+	return 0;
+
+out_defer:
+	xfs_defer_cancel(&dfops);
+out_cancel:
+	xfs_trans_cancel(tp);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+out:
+	trace_xfs_reflink_remap_extent_error(ip, error, _RET_IP_);
+	return error;
+}
+
+/*
+ * Iteratively remap one file's extents (and holes) to another's.
+ */
+STATIC int
+xfs_reflink_remap_blocks(
+	struct xfs_inode	*src,
+	xfs_fileoff_t		srcoff,
+	struct xfs_inode	*dest,
+	xfs_fileoff_t		destoff,
+	xfs_filblks_t		len,
+	xfs_off_t		new_isize)
+{
+	struct xfs_bmbt_irec	imap;
+	int			nimaps;
+	int			error = 0;
+	xfs_filblks_t		range_len;
+
+	/* drange = (destoff, destoff + len); srange = (srcoff, srcoff + len) */
+	while (len) {
+		trace_xfs_reflink_remap_blocks_loop(src, srcoff, len,
+				dest, destoff);
+		/* Read extent from the source file */
+		nimaps = 1;
+		xfs_ilock(src, XFS_ILOCK_EXCL);
+		error = xfs_bmapi_read(src, srcoff, len, &imap, &nimaps, 0);
+		xfs_iunlock(src, XFS_ILOCK_EXCL);
+		if (error)
+			goto err;
+		ASSERT(nimaps == 1);
+		xfs_trim_extent(&imap, srcoff, len);
+
+		trace_xfs_reflink_remap_imap(src, srcoff, len, XFS_IO_OVERWRITE,
+				&imap);
+
+		/* Translate imap into the destination file. */
+		range_len = imap.br_startoff + imap.br_blockcount - srcoff;
+		imap.br_startoff += destoff - srcoff;
+
+		/* Clear dest from destoff to the end of imap and map it in. */
+		error = xfs_reflink_remap_extent(dest, &imap, destoff,
+				new_isize);
+		if (error)
+			goto err;
+
+		if (fatal_signal_pending(current)) {
+			error = -EINTR;
+			goto err;
+		}
+
+		/* Advance drange/srange */
+		srcoff += range_len;
+		destoff += range_len;
+		len -= range_len;
+	}
+
+	return 0;
+
+err:
+	trace_xfs_reflink_remap_blocks_error(dest, error, _RET_IP_);
+	return error;
+}
+
+/*
+ * Link a range of blocks from one file to another.
+ */
+int
+xfs_reflink_remap_range(
+	struct xfs_inode	*src,
+	xfs_off_t		srcoff,
+	struct xfs_inode	*dest,
+	xfs_off_t		destoff,
+	xfs_off_t		len)
+{
+	struct xfs_mount	*mp = src->i_mount;
+	xfs_fileoff_t		sfsbno, dfsbno;
+	xfs_filblks_t		fsblen;
+	int			error;
+
+	if (!xfs_sb_version_hasreflink(&mp->m_sb))
+		return -EOPNOTSUPP;
+
+	if (XFS_FORCED_SHUTDOWN(mp))
+		return -EIO;
+
+	/* Don't reflink realtime inodes */
+	if (XFS_IS_REALTIME_INODE(src) || XFS_IS_REALTIME_INODE(dest))
+		return -EINVAL;
+
+	trace_xfs_reflink_remap_range(src, srcoff, len, dest, destoff);
+
+	/* Lock both files against IO */
+	if (src->i_ino == dest->i_ino) {
+		xfs_ilock(src, XFS_IOLOCK_EXCL);
+		xfs_ilock(src, XFS_MMAPLOCK_EXCL);
+	} else {
+		xfs_lock_two_inodes(src, dest, XFS_IOLOCK_EXCL);
+		xfs_lock_two_inodes(src, dest, XFS_MMAPLOCK_EXCL);
+	}
+
+	error = xfs_reflink_set_inode_flag(src, dest);
+	if (error)
+		goto out_error;
+
+	/*
+	 * Invalidate the page cache so that we can clear any CoW mappings
+	 * in the destination file.
+	 */
+	truncate_inode_pages_range(&VFS_I(dest)->i_data, destoff,
+				   PAGE_ALIGN(destoff + len) - 1);
+
+	dfsbno = XFS_B_TO_FSBT(mp, destoff);
+	sfsbno = XFS_B_TO_FSBT(mp, srcoff);
+	fsblen = XFS_B_TO_FSB(mp, len);
+	error = xfs_reflink_remap_blocks(src, sfsbno, dest, dfsbno, fsblen,
+			destoff + len);
+	if (error)
+		goto out_error;
+
+	error = xfs_reflink_update_dest(dest, destoff + len);
+	if (error)
+		goto out_error;
+
+out_error:
+	xfs_iunlock(src, XFS_MMAPLOCK_EXCL);
+	xfs_iunlock(src, XFS_IOLOCK_EXCL);
+	if (src->i_ino != dest->i_ino) {
+		xfs_iunlock(dest, XFS_MMAPLOCK_EXCL);
+		xfs_iunlock(dest, XFS_IOLOCK_EXCL);
+	}
+	if (error)
+		trace_xfs_reflink_remap_range_error(dest, error, _RET_IP_);
+	return error;
+}
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index 0f5fd0c..92c0ebd 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -43,6 +43,9 @@ extern int xfs_reflink_end_cow(struct xfs_inode *ip, xfs_off_t offset,
 
 extern int xfs_reflink_recover_cow(struct xfs_mount *mp);
 
+extern int xfs_reflink_remap_range(struct xfs_inode *src, xfs_off_t srcoff,
+		struct xfs_inode *dest, xfs_off_t destoff, xfs_off_t len);
+
 /* xfs_aops.c */
 extern int xfs_map_cow_blocks(struct inode *inode, xfs_off_t offset,
 		struct xfs_bmbt_irec *imap);



* [PATCH 090/119] xfs: add clone file and clone range vfs functions
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (88 preceding siblings ...)
  2016-06-17  1:27 ` [PATCH 089/119] xfs: reflink extents from one file to another Darrick J. Wong
@ 2016-06-17  1:27 ` Darrick J. Wong
  2016-06-17  1:27 ` [PATCH 091/119] xfs: add dedupe range vfs function Darrick J. Wong
                   ` (28 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:27 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Define two VFS functions which allow userspace to reflink a range of
blocks between two files or to reflink one file's contents to another.
These functions fit the new VFS ioctls that standardize the checking
for the btrfs CLONE and CLONE RANGE ioctls.

v2: Plug into the VFS function pointers instead of handling ioctls
directly.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_fs.h |   11 ++++
 fs/xfs/xfs_file.c      |  142 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_ioctl.c     |    2 +
 3 files changed, 155 insertions(+)


diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 6f4f2c3..788e006 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -533,6 +533,17 @@ typedef struct xfs_swapext
 #define XFS_IOC_GOINGDOWN	     _IOR ('X', 125, __uint32_t)
 /*	XFS_IOC_GETFSUUID ---------- deprecated 140	 */
 
+/* reflink ioctls; these MUST match the btrfs ioctl definitions */
+/* from struct btrfs_ioctl_clone_range_args */
+struct xfs_clone_args {
+	__s64 src_fd;
+	__u64 src_offset;
+	__u64 src_length;
+	__u64 dest_offset;
+};
+
+#define XFS_IOC_CLONE		 _IOW (0x94, 9, int)
+#define XFS_IOC_CLONE_RANGE	 _IOW (0x94, 13, struct xfs_clone_args)
 
 #ifndef HAVE_BBMACROS
 /*
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index b979f01..c2953bd 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1050,6 +1050,146 @@ out_unlock:
 	return error;
 }
 
+/*
+ * Flush all file writes out to disk.
+ */
+static int
+xfs_file_wait_for_io(
+	struct inode	*inode,
+	loff_t		offset,
+	size_t		len)
+{
+	loff_t		rounding;
+	loff_t		ioffset;
+	loff_t		iendoffset;
+	loff_t		bs;
+	int		ret;
+
+	bs = inode->i_sb->s_blocksize;
+	inode_dio_wait(inode);
+
+	rounding = max_t(xfs_off_t, bs, PAGE_SIZE);
+	ioffset = round_down(offset, rounding);
+	iendoffset = round_up(offset + len, rounding) - 1;
+	ret = filemap_write_and_wait_range(inode->i_mapping, ioffset,
+					   iendoffset);
+	return ret;
+}
+
+/* Hook up to the VFS reflink function */
+STATIC int
+xfs_file_share_range(
+	struct file	*file_in,
+	loff_t		pos_in,
+	struct file	*file_out,
+	loff_t		pos_out,
+	u64		len)
+{
+	struct inode	*inode_in;
+	struct inode	*inode_out;
+	ssize_t		ret;
+	loff_t		bs;
+	loff_t		isize;
+	int		same_inode;
+	loff_t		blen;
+
+	inode_in = file_inode(file_in);
+	inode_out = file_inode(file_out);
+	bs = inode_out->i_sb->s_blocksize;
+
+	/* Don't touch certain kinds of inodes */
+	if (IS_IMMUTABLE(inode_out))
+		return -EPERM;
+	if (IS_SWAPFILE(inode_in) ||
+	    IS_SWAPFILE(inode_out))
+		return -ETXTBSY;
+
+	/* Reflink only works within this filesystem. */
+	if (inode_in->i_sb != inode_out->i_sb)
+		return -EXDEV;
+	same_inode = (inode_in->i_ino == inode_out->i_ino);
+
+	/* Don't reflink dirs, pipes, sockets... */
+	if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
+		return -EISDIR;
+	if (S_ISFIFO(inode_in->i_mode) || S_ISFIFO(inode_out->i_mode))
+		return -EINVAL;
+	if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
+		return -EINVAL;
+
+	/* Are we going all the way to the end? */
+	isize = i_size_read(inode_in);
+	if (isize == 0)
+		return 0;
+	if (len == 0)
+		len = isize - pos_in;
+
+	/* Ensure offsets don't wrap and the input is inside i_size */
+	if (pos_in + len < pos_in || pos_out + len < pos_out ||
+	    pos_in + len > isize)
+		return -EINVAL;
+
+	/* If we're linking to EOF, continue to the block boundary. */
+	if (pos_in + len == isize)
+		blen = ALIGN(isize, bs) - pos_in;
+	else
+		blen = len;
+
+	/* Only reflink if we're aligned to block boundaries */
+	if (!IS_ALIGNED(pos_in, bs) || !IS_ALIGNED(pos_in + blen, bs) ||
+	    !IS_ALIGNED(pos_out, bs) || !IS_ALIGNED(pos_out + blen, bs))
+		return -EINVAL;
+
+	/* Don't allow overlapped reflink within the same file */
+	if (same_inode && pos_out + blen > pos_in && pos_out < pos_in + blen)
+		return -EINVAL;
+
+	/* Wait for the completion of any pending IOs on srcfile */
+	ret = xfs_file_wait_for_io(inode_in, pos_in, len);
+	if (ret)
+		goto out_unlock;
+	ret = xfs_file_wait_for_io(inode_out, pos_out, len);
+	if (ret)
+		goto out_unlock;
+
+	ret = xfs_reflink_remap_range(XFS_I(inode_in), pos_in, XFS_I(inode_out),
+			pos_out, len);
+	if (ret < 0)
+		goto out_unlock;
+
+out_unlock:
+	return ret;
+}
+
+STATIC ssize_t
+xfs_file_copy_range(
+	struct file	*file_in,
+	loff_t		pos_in,
+	struct file	*file_out,
+	loff_t		pos_out,
+	size_t		len,
+	unsigned int	flags)
+{
+	int		error;
+
+	error = xfs_file_share_range(file_in, pos_in, file_out, pos_out,
+				     len);
+	if (error)
+		return error;
+	return len;
+}
+
+STATIC int
+xfs_file_clone_range(
+	struct file	*file_in,
+	loff_t		pos_in,
+	struct file	*file_out,
+	loff_t		pos_out,
+	u64		len)
+{
+	return xfs_file_share_range(file_in, pos_in, file_out, pos_out,
+				     len);
+}
 
 STATIC int
 xfs_file_open(
@@ -1719,6 +1859,8 @@ const struct file_operations xfs_file_operations = {
 	.release	= xfs_file_release,
 	.fsync		= xfs_file_fsync,
 	.fallocate	= xfs_file_fallocate,
+	.copy_file_range = xfs_file_copy_range,
+	.clone_file_range = xfs_file_clone_range,
 };
 
 const struct file_operations xfs_dir_file_operations = {
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index dbca737..0e06a82 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -41,6 +41,7 @@
 #include "xfs_trans.h"
 #include "xfs_pnfs.h"
 #include "xfs_acl.h"
+#include "xfs_reflink.h"
 
 #include <linux/capability.h>
 #include <linux/dcache.h>
@@ -49,6 +50,7 @@
 #include <linux/pagemap.h>
 #include <linux/slab.h>
 #include <linux/exportfs.h>
+#include <linux/fsnotify.h>
 
 /*
  * xfs_find_handle maps from userspace xfs_fsop_handlereq structure to



* [PATCH 091/119] xfs: add dedupe range vfs function
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (89 preceding siblings ...)
  2016-06-17  1:27 ` [PATCH 090/119] xfs: add clone file and clone range vfs functions Darrick J. Wong
@ 2016-06-17  1:27 ` Darrick J. Wong
  2016-06-17  1:27 ` [PATCH 092/119] xfs: teach get_bmapx and fiemap about shared extents and the CoW fork Darrick J. Wong
                   ` (27 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:27 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Define a VFS function that allows userspace to request that the
kernel reflink a range of blocks between two files, but only if the
ranges' contents match.  The function plugs into the new VFS hook that
standardizes the checking for the btrfs EXTENT_SAME ioctl.

v2: Plug into the VFS function pointers instead of handling ioctls
directly, and lock the pages so they don't disappear while we're
trying to compare them.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_fs.h |   30 +++++++++++
 fs/xfs/xfs_file.c      |   48 +++++++++++++++++-
 fs/xfs/xfs_reflink.c   |  127 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_reflink.h   |    5 ++
 4 files changed, 204 insertions(+), 6 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 788e006..6230230 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -542,8 +542,38 @@ struct xfs_clone_args {
 	__u64 dest_offset;
 };
 
+/* extent-same (dedupe) ioctls; these MUST match the btrfs ioctl definitions */
+#define XFS_EXTENT_DATA_SAME	0
+#define XFS_EXTENT_DATA_DIFFERS	1
+
+/* from struct btrfs_ioctl_file_extent_same_info */
+struct xfs_extent_data_info {
+	__s64 fd;		/* in - destination file */
+	__u64 logical_offset;	/* in - start of extent in destination */
+	__u64 bytes_deduped;	/* out - total # of bytes we were able
+				 * to dedupe from this file */
+	/* status of this dedupe operation:
+	 * < 0 for error
+	 * == XFS_EXTENT_DATA_SAME if dedupe succeeds
+	 * == XFS_EXTENT_DATA_DIFFERS if data differs
+	 */
+	__s32 status;		/* out - see above description */
+	__u32 reserved;
+};
+
+/* from struct btrfs_ioctl_file_extent_same_args */
+struct xfs_extent_data {
+	__u64 logical_offset;	/* in - start of extent in source */
+	__u64 length;		/* in - length of extent */
+	__u16 dest_count;	/* in - total elements in info array */
+	__u16 reserved1;
+	__u32 reserved2;
+	struct xfs_extent_data_info info[0];
+};
+
 #define XFS_IOC_CLONE		 _IOW (0x94, 9, int)
 #define XFS_IOC_CLONE_RANGE	 _IOW (0x94, 13, struct xfs_clone_args)
+#define XFS_IOC_FILE_EXTENT_SAME _IOWR(0x94, 54, struct xfs_extent_data)
 
 #ifndef HAVE_BBMACROS
 /*
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index c2953bd..02e9139 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1083,7 +1083,8 @@ xfs_file_share_range(
 	loff_t		pos_in,
 	struct file	*file_out,
 	loff_t		pos_out,
-	u64		len)
+	u64		len,
+	bool		is_dedupe)
 {
 	struct inode	*inode_in;
 	struct inode	*inode_out;
@@ -1092,6 +1093,7 @@ xfs_file_share_range(
 	loff_t		isize;
 	int		same_inode;
 	loff_t		blen;
+	unsigned int	flags = 0;
 
 	inode_in = file_inode(file_in);
 	inode_out = file_inode(file_out);
@@ -1129,6 +1131,15 @@ xfs_file_share_range(
 	    pos_in + len > isize)
 		return -EINVAL;
 
+	/* Don't allow dedupe past EOF in the dest file */
+	if (is_dedupe) {
+		loff_t	disize;
+
+		disize = i_size_read(inode_out);
+		if (pos_out >= disize || pos_out + len > disize)
+			return -EINVAL;
+	}
+
 	/* If we're linking to EOF, continue to the block boundary. */
 	if (pos_in + len == isize)
 		blen = ALIGN(isize, bs) - pos_in;
@@ -1152,8 +1163,10 @@ xfs_file_share_range(
 	if (ret)
 		goto out_unlock;
 
+	if (is_dedupe)
+		flags |= XFS_REFLINK_DEDUPE;
 	ret = xfs_reflink_remap_range(XFS_I(inode_in), pos_in, XFS_I(inode_out),
-			pos_out, len);
+			pos_out, len, flags);
 	if (ret < 0)
 		goto out_unlock;
 
@@ -1173,7 +1186,7 @@ xfs_file_copy_range(
 	int		error;
 
 	error = xfs_file_share_range(file_in, pos_in, file_out, pos_out,
-				     len);
+				     len, false);
 	if (error)
 		return error;
 	return len;
@@ -1188,7 +1201,33 @@ xfs_file_clone_range(
 	u64		len)
 {
 	return xfs_file_share_range(file_in, pos_in, file_out, pos_out,
-				     len);
+				     len, false);
+}
+
+#define XFS_MAX_DEDUPE_LEN	(16 * 1024 * 1024)
+STATIC ssize_t
+xfs_file_dedupe_range(
+	struct file	*src_file,
+	u64		loff,
+	u64		len,
+	struct file	*dst_file,
+	u64		dst_loff)
+{
+	int		error;
+
+	/*
+	 * Limit the total length we will dedupe for each operation.
+	 * This is intended to bound the total time spent in this
+	 * ioctl to something sane.
+	 */
+	if (len > XFS_MAX_DEDUPE_LEN)
+		len = XFS_MAX_DEDUPE_LEN;
+
+	error = xfs_file_share_range(src_file, loff, dst_file, dst_loff,
+				     len, true);
+	if (error)
+		return error;
+	return len;
 }
 
 STATIC int
@@ -1861,6 +1900,7 @@ const struct file_operations xfs_file_operations = {
 	.fallocate	= xfs_file_fallocate,
 	.copy_file_range = xfs_file_copy_range,
 	.clone_file_range = xfs_file_clone_range,
+	.dedupe_file_range = xfs_file_dedupe_range,
 };
 
 const struct file_operations xfs_dir_file_operations = {
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index c01c0c7..c42a6e1 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1231,6 +1231,111 @@ err:
 }
 
 /*
+ * Read a page's worth of file data into the page cache.  Return the page
+ * locked.
+ */
+static struct page *
+xfs_get_page(
+	struct inode	*inode,
+	xfs_off_t	offset)
+{
+	struct address_space	*mapping;
+	struct page		*page;
+	pgoff_t			n;
+
+	n = offset >> PAGE_SHIFT;
+	mapping = inode->i_mapping;
+	page = read_mapping_page(mapping, n, NULL);
+	if (IS_ERR(page))
+		return page;
+	if (!PageUptodate(page)) {
+		put_page(page);
+		return ERR_PTR(-EIO);
+	}
+	lock_page(page);
+	return page;
+}
+
+/*
+ * Compare extents of two files to see if they are the same.
+ */
+static int
+xfs_compare_extents(
+	struct inode	*src,
+	xfs_off_t	srcoff,
+	struct inode	*dest,
+	xfs_off_t	destoff,
+	xfs_off_t	len,
+	bool		*is_same)
+{
+	xfs_off_t	src_poff;
+	xfs_off_t	dest_poff;
+	void		*src_addr;
+	void		*dest_addr;
+	struct page	*src_page;
+	struct page	*dest_page;
+	xfs_off_t	cmp_len;
+	bool		same;
+	int		error;
+
+	error = -EINVAL;
+	same = true;
+	while (len) {
+		src_poff = srcoff & (PAGE_SIZE - 1);
+		dest_poff = destoff & (PAGE_SIZE - 1);
+		cmp_len = min(PAGE_SIZE - src_poff,
+			      PAGE_SIZE - dest_poff);
+		cmp_len = min(cmp_len, len);
+		ASSERT(cmp_len > 0);
+
+		trace_xfs_reflink_compare_extents(XFS_I(src), srcoff, cmp_len,
+				XFS_I(dest), destoff);
+
+		src_page = xfs_get_page(src, srcoff);
+		if (IS_ERR(src_page)) {
+			error = PTR_ERR(src_page);
+			goto out_error;
+		}
+		dest_page = xfs_get_page(dest, destoff);
+		if (IS_ERR(dest_page)) {
+			error = PTR_ERR(dest_page);
+			unlock_page(src_page);
+			put_page(src_page);
+			goto out_error;
+		}
+		src_addr = kmap_atomic(src_page);
+		dest_addr = kmap_atomic(dest_page);
+
+		flush_dcache_page(src_page);
+		flush_dcache_page(dest_page);
+
+		if (memcmp(src_addr + src_poff, dest_addr + dest_poff, cmp_len))
+			same = false;
+
+		kunmap_atomic(dest_addr);
+		kunmap_atomic(src_addr);
+		unlock_page(dest_page);
+		unlock_page(src_page);
+		put_page(dest_page);
+		put_page(src_page);
+
+		if (!same)
+			break;
+
+		srcoff += cmp_len;
+		destoff += cmp_len;
+		len -= cmp_len;
+	}
+
+	*is_same = same;
+	return 0;
+
+out_error:
+	trace_xfs_reflink_compare_extents_error(XFS_I(dest), error, _RET_IP_);
+	return error;
+}
+
+/*
  * Link a range of blocks from one file to another.
  */
 int
@@ -1239,12 +1344,14 @@ xfs_reflink_remap_range(
 	xfs_off_t		srcoff,
 	struct xfs_inode	*dest,
 	xfs_off_t		destoff,
-	xfs_off_t		len)
+	xfs_off_t		len,
+	unsigned int		flags)
 {
 	struct xfs_mount	*mp = src->i_mount;
 	xfs_fileoff_t		sfsbno, dfsbno;
 	xfs_filblks_t		fsblen;
 	int			error;
+	bool			is_same;
 
 	if (!xfs_sb_version_hasreflink(&mp->m_sb))
 		return -EOPNOTSUPP;
@@ -1256,6 +1363,9 @@ xfs_reflink_remap_range(
 	if (XFS_IS_REALTIME_INODE(src) || XFS_IS_REALTIME_INODE(dest))
 		return -EINVAL;
 
+	if (flags & ~XFS_REFLINK_ALL)
+		return -EINVAL;
+
 	trace_xfs_reflink_remap_range(src, srcoff, len, dest, destoff);
 
 	/* Lock both files against IO */
@@ -1267,6 +1377,21 @@ xfs_reflink_remap_range(
 		xfs_lock_two_inodes(src, dest, XFS_MMAPLOCK_EXCL);
 	}
 
+	/*
+	 * Check that the extents are the same.
+	 */
+	if (flags & XFS_REFLINK_DEDUPE) {
+		is_same = false;
+		error = xfs_compare_extents(VFS_I(src), srcoff, VFS_I(dest),
+				destoff, len, &is_same);
+		if (error)
+			goto out_error;
+		if (!is_same) {
+			error = -EBADE;
+			goto out_error;
+		}
+	}
+
 	error = xfs_reflink_set_inode_flag(src, dest);
 	if (error)
 		goto out_error;
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index 92c0ebd..1d38b97 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -43,8 +43,11 @@ extern int xfs_reflink_end_cow(struct xfs_inode *ip, xfs_off_t offset,
 
 extern int xfs_reflink_recover_cow(struct xfs_mount *mp);
 
+#define XFS_REFLINK_DEDUPE	1	/* only reflink if contents match */
+#define XFS_REFLINK_ALL		(XFS_REFLINK_DEDUPE)
 extern int xfs_reflink_remap_range(struct xfs_inode *src, xfs_off_t srcoff,
-		struct xfs_inode *dest, xfs_off_t destoff, xfs_off_t len);
+		struct xfs_inode *dest, xfs_off_t destoff, xfs_off_t len,
+		unsigned int flags);
 
 /* xfs_aops.c */
 extern int xfs_map_cow_blocks(struct inode *inode, xfs_off_t offset,



* [PATCH 092/119] xfs: teach get_bmapx and fiemap about shared extents and the CoW fork
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (90 preceding siblings ...)
  2016-06-17  1:27 ` [PATCH 091/119] xfs: add dedupe range vfs function Darrick J. Wong
@ 2016-06-17  1:27 ` Darrick J. Wong
  2016-06-17  1:27 ` [PATCH 093/119] xfs: swap inode reflink flags when swapping inode extents Darrick J. Wong
                   ` (26 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:27 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Teach xfs_getbmapx how to report shared extents and CoW fork contents,
then modify the FIEMAP formatters to set the appropriate flags.  A
previous version of this patch only modified the fiemap formatter,
which is insufficient.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_fs.h |    4 ++-
 fs/xfs/xfs_bmap_util.c |   66 ++++++++++++++++++++++++++++++++++++++++++------
 fs/xfs/xfs_iops.c      |    2 +
 3 files changed, 63 insertions(+), 9 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 6230230..b1af423 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -81,14 +81,16 @@ struct getbmapx {
 #define BMV_IF_PREALLOC		0x4	/* rtn status BMV_OF_PREALLOC if req */
 #define BMV_IF_DELALLOC		0x8	/* rtn status BMV_OF_DELALLOC if req */
 #define BMV_IF_NO_HOLES		0x10	/* Do not return holes */
+#define BMV_IF_COWFORK		0x20	/* return CoW fork rather than data */
 #define BMV_IF_VALID	\
 	(BMV_IF_ATTRFORK|BMV_IF_NO_DMAPI_READ|BMV_IF_PREALLOC|	\
-	 BMV_IF_DELALLOC|BMV_IF_NO_HOLES)
+	 BMV_IF_DELALLOC|BMV_IF_NO_HOLES|BMV_IF_COWFORK)
 
 /*	bmv_oflags values - returned for each non-header segment */
 #define BMV_OF_PREALLOC		0x1	/* segment = unwritten pre-allocation */
 #define BMV_OF_DELALLOC		0x2	/* segment = delayed allocation */
 #define BMV_OF_LAST		0x4	/* segment is the last in the file */
+#define BMV_OF_SHARED		0x8	/* segment shared with another file */
 
 /*
  * Structure for XFS_IOC_FSSETDM.
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 9285111..3d71a17 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -44,6 +44,7 @@
 #include "xfs_rmap_btree.h"
 #include "xfs_iomap.h"
 #include "xfs_reflink.h"
+#include "xfs_refcount.h"
 
 /* Kernel only BMAP related definitions and functions */
 
@@ -391,6 +392,7 @@ xfs_bmap_count_blocks(
 STATIC int
 xfs_getbmapx_fix_eof_hole(
 	xfs_inode_t		*ip,		/* xfs incore inode pointer */
+	int			whichfork,
 	struct getbmapx		*out,		/* output structure */
 	int			prealloced,	/* this is a file with
 						 * preallocated data space */
@@ -420,7 +422,7 @@ xfs_getbmapx_fix_eof_hole(
 		else
 			out->bmv_block = xfs_fsb_to_db(ip, startblock);
 		fileblock = XFS_BB_TO_FSB(ip->i_mount, out->bmv_offset);
-		ifp = XFS_IFORK_PTR(ip, XFS_DATA_FORK);
+		ifp = XFS_IFORK_PTR(ip, whichfork);
 		if (xfs_iext_bno_to_ext(ifp, fileblock, &lastx) &&
 		   (lastx == (ifp->if_bytes / (uint)sizeof(xfs_bmbt_rec_t))-1))
 			out->bmv_oflags |= BMV_OF_LAST;
@@ -464,9 +466,19 @@ xfs_getbmap(
 
 	mp = ip->i_mount;
 	iflags = bmv->bmv_iflags;
-	whichfork = iflags & BMV_IF_ATTRFORK ? XFS_ATTR_FORK : XFS_DATA_FORK;
 
-	if (whichfork == XFS_ATTR_FORK) {
+	if ((iflags & BMV_IF_ATTRFORK) && (iflags & BMV_IF_COWFORK))
+		return -EINVAL;
+
+	if (iflags & BMV_IF_ATTRFORK)
+		whichfork = XFS_ATTR_FORK;
+	else if (iflags & BMV_IF_COWFORK)
+		whichfork = XFS_COW_FORK;
+	else
+		whichfork = XFS_DATA_FORK;
+
+	switch (whichfork) {
+	case XFS_ATTR_FORK:
 		if (XFS_IFORK_Q(ip)) {
 			if (ip->i_d.di_aformat != XFS_DINODE_FMT_EXTENTS &&
 			    ip->i_d.di_aformat != XFS_DINODE_FMT_BTREE &&
@@ -482,7 +494,15 @@ xfs_getbmap(
 
 		prealloced = 0;
 		fixlen = 1LL << 32;
-	} else {
+		break;
+	case XFS_COW_FORK:
+		if (ip->i_cformat != XFS_DINODE_FMT_EXTENTS)
+			return -EINVAL;
+
+		prealloced = 0;
+		fixlen = XFS_ISIZE(ip);
+		break;
+	default:
 		if (ip->i_d.di_format != XFS_DINODE_FMT_EXTENTS &&
 		    ip->i_d.di_format != XFS_DINODE_FMT_BTREE &&
 		    ip->i_d.di_format != XFS_DINODE_FMT_LOCAL)
@@ -496,6 +516,7 @@ xfs_getbmap(
 			prealloced = 0;
 			fixlen = XFS_ISIZE(ip);
 		}
+		break;
 	}
 
 	if (bmv->bmv_length == -1) {
@@ -522,7 +543,8 @@ xfs_getbmap(
 		return -ENOMEM;
 
 	xfs_ilock(ip, XFS_IOLOCK_SHARED);
-	if (whichfork == XFS_DATA_FORK) {
+	switch (whichfork) {
+	case XFS_DATA_FORK:
 		if (!(iflags & BMV_IF_DELALLOC) &&
 		    (ip->i_delayed_blks || XFS_ISIZE(ip) > ip->i_d.di_size)) {
 			error = filemap_write_and_wait(VFS_I(ip)->i_mapping);
@@ -540,8 +562,14 @@ xfs_getbmap(
 		}
 
 		lock = xfs_ilock_data_map_shared(ip);
-	} else {
+		break;
+	case XFS_COW_FORK:
+		lock = XFS_ILOCK_SHARED;
+		xfs_ilock(ip, lock);
+		break;
+	case XFS_ATTR_FORK:
 		lock = xfs_ilock_attr_map_shared(ip);
+		break;
 	}
 
 	/*
@@ -616,8 +644,30 @@ xfs_getbmap(
 				goto out_free_map;
 			}
 
-			if (!xfs_getbmapx_fix_eof_hole(ip, &out[cur_ext],
-					prealloced, bmvend,
+			/* Is this a shared block? */
+			if (whichfork == XFS_DATA_FORK &&
+			    map[i].br_startblock != DELAYSTARTBLOCK &&
+			    map[i].br_startblock != HOLESTARTBLOCK &&
+			    !ISUNWRITTEN(&map[i]) && xfs_is_reflink_inode(ip)) {
+				xfs_agblock_t	ebno;
+				xfs_extlen_t	elen;
+
+				error = xfs_refcount_find_shared(mp,
+						XFS_FSB_TO_AGNO(mp,
+							map[i].br_startblock),
+						XFS_FSB_TO_AGBNO(mp,
+							map[i].br_startblock),
+						map[i].br_blockcount,
+						&ebno, &elen, true);
+				if (error)
+					goto out_free_map;
+				if (elen)
+					out[cur_ext].bmv_oflags |=
+							BMV_OF_SHARED;
+			}
+
+			if (!xfs_getbmapx_fix_eof_hole(ip, whichfork,
+					&out[cur_ext], prealloced, bmvend,
 					map[i].br_startblock))
 				goto out_free_map;
 
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index c5d4eba..e57bfe8 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1025,6 +1025,8 @@ xfs_fiemap_format(
 
 	if (bmv->bmv_oflags & BMV_OF_PREALLOC)
 		fiemap_flags |= FIEMAP_EXTENT_UNWRITTEN;
+	else if (bmv->bmv_oflags & BMV_OF_SHARED)
+		fiemap_flags |= FIEMAP_EXTENT_SHARED;
 	else if (bmv->bmv_oflags & BMV_OF_DELALLOC) {
 		fiemap_flags |= (FIEMAP_EXTENT_DELALLOC |
 				 FIEMAP_EXTENT_UNKNOWN);



* [PATCH 093/119] xfs: swap inode reflink flags when swapping inode extents
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (91 preceding siblings ...)
  2016-06-17  1:27 ` [PATCH 092/119] xfs: teach get_bmapx and fiemap about shared extents and the CoW fork Darrick J. Wong
@ 2016-06-17  1:27 ` Darrick J. Wong
  2016-06-17  1:27 ` [PATCH 094/119] xfs: unshare a range of blocks via fallocate Darrick J. Wong
                   ` (25 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:27 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

When we're swapping the extents of two inodes, be sure to swap the
reflink inode flags (and the CoW forks that go with them) too.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_bmap_util.c |   15 +++++++++++++++
 1 file changed, 15 insertions(+)


diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 3d71a17..a5f5515 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1735,6 +1735,8 @@ xfs_swap_extents(
 	int		taforkblks = 0;
 	__uint64_t	tmp;
 	int		lock_flags;
+	struct xfs_ifork	*cowfp;
+	__uint64_t	f;
 
 	/* XXX: we can't do this with rmap, will fix later */
 	if (xfs_sb_version_hasrmapbt(&mp->m_sb))
@@ -1948,6 +1950,19 @@ xfs_swap_extents(
 		break;
 	}
 
+	/* Do we have to swap reflink flags? */
+	if ((ip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK) ^
+	    (tip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK)) {
+		f = ip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK;
+		ip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
+		ip->i_d.di_flags2 |= tip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK;
+		tip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
+		tip->i_d.di_flags2 |= f & XFS_DIFLAG2_REFLINK;
+		cowfp = ip->i_cowfp;
+		ip->i_cowfp = tip->i_cowfp;
+		tip->i_cowfp = cowfp;
+	}
+
 	xfs_trans_log_inode(tp, ip,  src_log_flags);
 	xfs_trans_log_inode(tp, tip, target_log_flags);
 



* [PATCH 094/119] xfs: unshare a range of blocks via fallocate
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (92 preceding siblings ...)
  2016-06-17  1:27 ` [PATCH 093/119] xfs: swap inode reflink flags when swapping inode extents Darrick J. Wong
@ 2016-06-17  1:27 ` Darrick J. Wong
  2016-06-17  1:28 ` [PATCH 095/119] xfs: CoW shared EOF block when truncating file Darrick J. Wong
                   ` (24 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:27 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, Christoph Hellwig, xfs

Now that we have an fallocate flag to unshare a range of blocks, make
XFS actually implement it.

v2: NFS doesn't pass around struct file pointers, which means that our
unshare functions all crash when filp == NULL.  We don't need filp
anyway, so remove all the parts where we pass filp around.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
[hch@lst.de: pass inode instead of file to xfs_reflink_dirty_range]
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_file.c    |    6 +
 fs/xfs/xfs_reflink.c |  314 ++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_reflink.h |    2 
 3 files changed, 321 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 02e9139..618bd12 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1011,9 +1011,13 @@ xfs_file_fallocate(
 
 		if (mode & FALLOC_FL_ZERO_RANGE)
 			error = xfs_zero_file_space(ip, offset, len);
-		else
+		else {
+			error = xfs_reflink_unshare(ip, offset, len);
+			if (error)
+				goto out_unlock;
 			error = xfs_alloc_file_space(ip, offset, len,
 						     XFS_BMAPI_PREALLOC);
+		}
 		if (error)
 			goto out_unlock;
 	}
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index c42a6e1..78f24c3 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1426,3 +1426,317 @@ out_error:
 		trace_xfs_reflink_remap_range_error(dest, error, _RET_IP_);
 	return error;
 }
+
+/*
+ * Dirty all the shared blocks within a byte range of a file so that they're
+ * rewritten elsewhere.  Similar to generic_perform_write().
+ */
+static int
+xfs_reflink_dirty_range(
+	struct inode		*inode,
+	xfs_off_t		pos,
+	xfs_off_t		len)
+{
+	struct address_space	*mapping;
+	const struct address_space_operations *a_ops;
+	int			error;
+	unsigned int		flags;
+	struct page		*page;
+	struct page		*rpage;
+	unsigned long		offset;	/* Offset into pagecache page */
+	unsigned long		bytes;	/* Bytes to write to page */
+	void			*fsdata;
+
+	mapping = inode->i_mapping;
+	a_ops = mapping->a_ops;
+	flags = AOP_FLAG_UNINTERRUPTIBLE;
+	do {
+
+		offset = (pos & (PAGE_SIZE - 1));
+		bytes = min_t(unsigned long, PAGE_SIZE - offset, len);
+		rpage = xfs_get_page(inode, pos);
+		if (IS_ERR(rpage)) {
+			error = PTR_ERR(rpage);
+			break;
+		}
+
+		unlock_page(rpage);
+		error = a_ops->write_begin(NULL, mapping, pos, bytes, flags,
+					   &page, &fsdata);
+		put_page(rpage);
+		if (error < 0)
+			break;
+
+		trace_xfs_reflink_unshare_page(inode, page, pos, bytes);
+
+		if (!PageUptodate(page)) {
+			xfs_err(XFS_I(inode)->i_mount,
+					"%s: STALE? ino=%llu pos=%llu\n",
+					__func__, XFS_I(inode)->i_ino, pos);
+			WARN_ON(1);
+		}
+		if (mapping_writably_mapped(mapping))
+			flush_dcache_page(page);
+
+		error = a_ops->write_end(NULL, mapping, pos, bytes, bytes,
+					 page, fsdata);
+		if (error < 0)
+			break;
+		else if (error == 0) {
+			error = -EIO;
+			break;
+		} else {
+			bytes = error;
+			error = 0;
+		}
+
+		cond_resched();
+
+		pos += bytes;
+		len -= bytes;
+
+		balance_dirty_pages_ratelimited(mapping);
+		if (fatal_signal_pending(current)) {
+			error = -EINTR;
+			break;
+		}
+	} while (len > 0);
+
+	return error;
+}
+
+/*
+ * The user wants to preemptively CoW all shared blocks in this file,
+ * which enables us to turn off the reflink flag.  Iterate all
+ * extents which are not prealloc/delalloc to see which ranges are
+ * mentioned in the refcount tree, then read those blocks into the
+ * pagecache, dirty them, fsync them back out, and then we can update
+ * the inode flag.  What happens if we run out of memory? :)
+ */
+STATIC int
+xfs_reflink_dirty_extents(
+	struct xfs_inode	*ip,
+	xfs_fileoff_t		fbno,
+	xfs_filblks_t		end,
+	xfs_off_t		isize)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	xfs_agnumber_t		agno;
+	xfs_agblock_t		agbno;
+	xfs_extlen_t		aglen;
+	xfs_agblock_t		rbno;
+	xfs_extlen_t		rlen;
+	xfs_off_t		fpos;
+	xfs_off_t		flen;
+	struct xfs_bmbt_irec	map[2];
+	int			nmaps;
+	int			error;
+
+	while (end - fbno > 0) {
+		nmaps = 1;
+		/*
+		 * Look for extents in the file.  Skip holes, delalloc, or
+		 * unwritten extents; they can't be reflinked.
+		 */
+		error = xfs_bmapi_read(ip, fbno, end - fbno, map, &nmaps, 0);
+		if (error)
+			goto out;
+		if (nmaps == 0)
+			break;
+		if (map[0].br_startblock == HOLESTARTBLOCK ||
+		    map[0].br_startblock == DELAYSTARTBLOCK ||
+		    ISUNWRITTEN(&map[0]))
+			goto next;
+
+		map[1] = map[0];
+		while (map[1].br_blockcount) {
+			agno = XFS_FSB_TO_AGNO(mp, map[1].br_startblock);
+			agbno = XFS_FSB_TO_AGBNO(mp, map[1].br_startblock);
+			aglen = map[1].br_blockcount;
+
+			error = xfs_refcount_find_shared(mp, agno, agbno, aglen,
+							 &rbno, &rlen, true);
+			if (error)
+				goto out;
+			if (rlen == 0)
+				goto skip_copy;
+
+			/* Dirty the pages */
+			xfs_iunlock(ip, XFS_ILOCK_EXCL);
+			fpos = XFS_FSB_TO_B(mp, map[1].br_startoff +
+					(rbno - agbno));
+			flen = XFS_FSB_TO_B(mp, rlen);
+			if (fpos + flen > isize)
+				flen = isize - fpos;
+			error = xfs_reflink_dirty_range(VFS_I(ip), fpos, flen);
+			xfs_ilock(ip, XFS_ILOCK_EXCL);
+			if (error)
+				goto out;
+skip_copy:
+			map[1].br_blockcount -= (rbno - agbno + rlen);
+			map[1].br_startoff += (rbno - agbno + rlen);
+			map[1].br_startblock += (rbno - agbno + rlen);
+		}
+
+next:
+		fbno = map[0].br_startoff + map[0].br_blockcount;
+	}
+out:
+	return error;
+}
+
+/* Iterate the extents; if there are no reflinked blocks, clear the flag. */
+STATIC int
+xfs_reflink_try_clear_inode_flag(
+	struct xfs_inode	*ip,
+	xfs_off_t		old_isize)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_trans	*tp;
+	xfs_fileoff_t		fbno;
+	xfs_filblks_t		end;
+	xfs_agnumber_t		agno;
+	xfs_agblock_t		agbno;
+	xfs_extlen_t		aglen;
+	xfs_agblock_t		rbno;
+	xfs_extlen_t		rlen;
+	struct xfs_bmbt_irec	map[2];
+	int			nmaps;
+	int			error = 0;
+
+	/* Start a rolling transaction to remove the mappings */
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, 0, &tp);
+	if (error)
+		return error;
+
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	xfs_trans_ijoin(tp, ip, 0);
+
+	if (old_isize != i_size_read(VFS_I(ip)))
+		goto cancel;
+	if (!(ip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK))
+		goto cancel;
+
+	fbno = 0;
+	end = XFS_B_TO_FSB(mp, old_isize);
+	while (end - fbno > 0) {
+		nmaps = 1;
+		/*
+		 * Look for extents in the file.  Skip holes, delalloc, or
+		 * unwritten extents; they can't be reflinked.
+		 */
+		error = xfs_bmapi_read(ip, fbno, end - fbno, map, &nmaps, 0);
+		if (error)
+			goto cancel;
+		if (nmaps == 0)
+			break;
+		if (map[0].br_startblock == HOLESTARTBLOCK ||
+		    map[0].br_startblock == DELAYSTARTBLOCK ||
+		    ISUNWRITTEN(&map[0]))
+			goto next;
+
+		map[1] = map[0];
+		while (map[1].br_blockcount) {
+			agno = XFS_FSB_TO_AGNO(mp, map[1].br_startblock);
+			agbno = XFS_FSB_TO_AGBNO(mp, map[1].br_startblock);
+			aglen = map[1].br_blockcount;
+
+			error = xfs_refcount_find_shared(mp, agno, agbno, aglen,
+							 &rbno, &rlen, false);
+			if (error)
+				goto cancel;
+			/* Is there still a shared block here? */
+			if (rlen > 0) {
+				error = 0;
+				goto cancel;
+			}
+
+			map[1].br_blockcount -= aglen;
+			map[1].br_startoff += aglen;
+			map[1].br_startblock += aglen;
+		}
+
+next:
+		fbno = map[0].br_startoff + map[0].br_blockcount;
+	}
+
+	/*
+	 * We didn't find any shared blocks so turn off the reflink flag.
+	 * First, get rid of any leftover CoW mappings.
+	 */
+	error = xfs_reflink_cancel_cow_blocks(ip, &tp, 0, NULLFILEOFF);
+	if (error)
+		goto cancel;
+
+	/* Clear the inode flag. */
+	trace_xfs_reflink_unset_inode_flag(ip);
+	ip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
+	xfs_trans_ijoin(tp, ip, 0);
+	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+
+	error = xfs_trans_commit(tp);
+	if (error)
+		goto out;
+
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	return 0;
+cancel:
+	xfs_trans_cancel(tp);
+out:
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	return error;
+}
+
+/*
+ * Pre-COW all shared blocks within a given byte range of a file and turn off
+ * the reflink flag if we unshare all of the file's blocks.
+ */
+int
+xfs_reflink_unshare(
+	struct xfs_inode	*ip,
+	xfs_off_t		offset,
+	xfs_off_t		len)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	xfs_fileoff_t		fbno;
+	xfs_filblks_t		end;
+	xfs_off_t		old_isize, isize;
+	int			error;
+
+	if (!xfs_is_reflink_inode(ip))
+		return 0;
+
+	trace_xfs_reflink_unshare(ip, offset, len);
+
+	inode_dio_wait(VFS_I(ip));
+
+	/* Try to CoW the selected ranges */
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	fbno = XFS_B_TO_FSB(mp, offset);
+	old_isize = isize = i_size_read(VFS_I(ip));
+	end = XFS_B_TO_FSB(mp, offset + len);
+	error = xfs_reflink_dirty_extents(ip, fbno, end, isize);
+	if (error)
+		goto out_unlock;
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+
+	/* Wait for the IO to finish */
+	error = filemap_write_and_wait(VFS_I(ip)->i_mapping);
+	if (error)
+		goto out;
+
+	/* Turn off the reflink flag if we unshared the whole file */
+	if (offset == 0 && len == isize) {
+		error = xfs_reflink_try_clear_inode_flag(ip, old_isize);
+		if (error)
+			goto out;
+	}
+
+	return 0;
+
+out_unlock:
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+out:
+	trace_xfs_reflink_unshare_error(ip, error, _RET_IP_);
+	return error;
+}
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index 1d38b97..a369b2a 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -48,6 +48,8 @@ extern int xfs_reflink_recover_cow(struct xfs_mount *mp);
 extern int xfs_reflink_remap_range(struct xfs_inode *src, xfs_off_t srcoff,
 		struct xfs_inode *dest, xfs_off_t destoff, xfs_off_t len,
 		unsigned int flags);
+extern int xfs_reflink_unshare(struct xfs_inode *ip, xfs_off_t offset,
+		xfs_off_t len);
 
 /* xfs_aops.c */
 extern int xfs_map_cow_blocks(struct inode *inode, xfs_off_t offset,


^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 095/119] xfs: CoW shared EOF block when truncating file
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (93 preceding siblings ...)
  2016-06-17  1:27 ` [PATCH 094/119] xfs: unshare a range of blocks via fallocate Darrick J. Wong
@ 2016-06-17  1:28 ` Darrick J. Wong
  2016-06-17  1:28 ` [PATCH 096/119] xfs: support FS_XFLAG_REFLINK on reflink filesystems Darrick J. Wong
                   ` (23 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:28 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

When shrinking a file, the VFS zeroes everything in the page containing
the new EOF, between the new EOF and the previous EOF, to avoid leaking
stale data.  If that EOF block is shared, we need to CoW it before the
VFS does its zeroing so that the zeroing doesn't corrupt the other
files sharing the block.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_iops.c    |    9 +++++++++
 fs/xfs/xfs_reflink.c |   42 ++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_reflink.h |    1 +
 3 files changed, 52 insertions(+)


diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index e57bfe8..0fa86bd 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -38,6 +38,7 @@
 #include "xfs_dir2.h"
 #include "xfs_trans_space.h"
 #include "xfs_pnfs.h"
+#include "xfs_reflink.h"
 
 #include <linux/capability.h>
 #include <linux/xattr.h>
@@ -816,6 +817,14 @@ xfs_setattr_size(
 	}
 
 	/*
+	 * CoW the EOF block of the file if it's necessary to avoid
+	 * corrupting other files.
+	 */
+	error = xfs_reflink_cow_eof_block(ip, newsize);
+	if (error)
+		return error;
+
+	/*
 	 * We are going to log the inode size change in this transaction so
 	 * any previous writes that are beyond the on disk EOF and the new
 	 * EOF that have not been written out need to be written here.  If we
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 78f24c3..b42ffb0 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1740,3 +1740,45 @@ out:
 	trace_xfs_reflink_unshare_error(ip, error, _RET_IP_);
 	return error;
 }
+
+/*
+ * If we're trying to truncate a file whose last block is shared and the new
+ * size isn't aligned to a block boundary, we need to dirty that last block
+ * ahead of the VFS zeroing the page.
+ */
+int
+xfs_reflink_cow_eof_block(
+	struct xfs_inode	*ip,
+	xfs_off_t		newsize)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	xfs_fileoff_t		fbno;
+	xfs_off_t		isize;
+	int			error;
+
+	if (!xfs_is_reflink_inode(ip) ||
+	    (newsize & ((1 << VFS_I(ip)->i_blkbits) - 1)) == 0)
+		return 0;
+
+	/* Try to CoW the shared last block */
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	fbno = XFS_B_TO_FSBT(mp, newsize);
+	isize = i_size_read(VFS_I(ip));
+
+	if (newsize > isize)
+		trace_xfs_reflink_cow_eof_block(ip, isize, newsize - isize);
+	else
+		trace_xfs_reflink_cow_eof_block(ip, newsize, isize - newsize);
+
+	error = xfs_reflink_dirty_extents(ip, fbno, fbno + 1, isize);
+	if (error)
+		goto out_unlock;
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+
+	return 0;
+
+out_unlock:
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	trace_xfs_reflink_cow_eof_block_error(ip, error, _RET_IP_);
+	return error;
+}
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index a369b2a..437087c5 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -50,6 +50,7 @@ extern int xfs_reflink_remap_range(struct xfs_inode *src, xfs_off_t srcoff,
 		unsigned int flags);
 extern int xfs_reflink_unshare(struct xfs_inode *ip, xfs_off_t offset,
 		xfs_off_t len);
+extern int xfs_reflink_cow_eof_block(struct xfs_inode *ip, xfs_off_t newsize);
 
 /* xfs_aops.c */
 extern int xfs_map_cow_blocks(struct inode *inode, xfs_off_t offset,



* [PATCH 096/119] xfs: support FS_XFLAG_REFLINK on reflink filesystems
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (94 preceding siblings ...)
  2016-06-17  1:28 ` [PATCH 095/119] xfs: CoW shared EOF block when truncating file Darrick J. Wong
@ 2016-06-17  1:28 ` Darrick J. Wong
  2016-06-17  1:28 ` [PATCH 097/119] xfs: create a separate cow extent size hint for the allocator Darrick J. Wong
                   ` (22 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:28 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Add support for reporting the "reflink" inode flag in the XFS-specific
getxflags ioctl, and allow the user to clear the flag only if the file
size is zero.

v2: Move the reflink flag out of the way of the DAX flag, and add the
new cowextsize flag.

v3: do not report (or allow changes to) FL_NOCOW_FL, since we don't
support a flag to prevent CoWing and the reflink flag is a poor
proxy.  We'll try to design away the need for the NOCOW flag.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_inode.c   |    2 ++
 fs/xfs/xfs_ioctl.c   |    4 ++++
 fs/xfs/xfs_reflink.c |   26 ++++++++++++++++++++++++++
 fs/xfs/xfs_reflink.h |    4 ++++
 4 files changed, 36 insertions(+)


diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index b8d3c4f..127bf54 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -652,6 +652,8 @@ _xfs_dic2xflags(
 	if (di_flags2 & XFS_DIFLAG2_ANY) {
 		if (di_flags2 & XFS_DIFLAG2_DAX)
 			flags |= FS_XFLAG_DAX;
+		if (di_flags2 & XFS_DIFLAG2_REFLINK)
+			flags |= FS_XFLAG_REFLINK;
 	}
 
 	if (has_attr)
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 0e06a82..b8eceee 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1258,6 +1258,10 @@ xfs_ioctl_setattr(
 
 	trace_xfs_ioctl_setattr(ip);
 
+	code = xfs_reflink_check_flag_adjust(ip, &fa->fsx_xflags);
+	if (code)
+		return code;
+
 	code = xfs_ioctl_setattr_check_projid(ip, fa);
 	if (code)
 		return code;
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index b42ffb0..7c64104 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1782,3 +1782,29 @@ out_unlock:
 	trace_xfs_reflink_cow_eof_block_error(ip, error, _RET_IP_);
 	return error;
 }
+
+/*
+ * Ensure that the only change we allow to the inode reflink flag is to clear
+ * it when the fs supports reflink and the size is zero.
+ */
+int
+xfs_reflink_check_flag_adjust(
+	struct xfs_inode	*ip,
+	unsigned int		*xflags)
+{
+	unsigned int		chg;
+
+	chg = !!(*xflags & FS_XFLAG_REFLINK) ^ !!xfs_is_reflink_inode(ip);
+
+	if (!chg)
+		return 0;
+	if (!xfs_sb_version_hasreflink(&ip->i_mount->m_sb))
+		return -EOPNOTSUPP;
+	if (i_size_read(VFS_I(ip)) != 0)
+		return -EINVAL;
+	if (*xflags & FS_XFLAG_REFLINK) {
+		*xflags &= ~FS_XFLAG_REFLINK;
+		return 0;
+	}
+	return 0;
+}
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index 437087c5..97e8705 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -52,6 +52,10 @@ extern int xfs_reflink_unshare(struct xfs_inode *ip, xfs_off_t offset,
 		xfs_off_t len);
 extern int xfs_reflink_cow_eof_block(struct xfs_inode *ip, xfs_off_t newsize);
 
+extern void xfs_reflink_get_lxflags(struct xfs_inode *ip, unsigned int *flags);
+extern int xfs_reflink_check_flag_adjust(struct xfs_inode *ip,
+		unsigned int *xflags);
+
 /* xfs_aops.c */
 extern int xfs_map_cow_blocks(struct inode *inode, xfs_off_t offset,
 		struct xfs_bmbt_irec *imap);



* [PATCH 097/119] xfs: create a separate cow extent size hint for the allocator
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (95 preceding siblings ...)
  2016-06-17  1:28 ` [PATCH 096/119] xfs: support FS_XFLAG_REFLINK on reflink filesystems Darrick J. Wong
@ 2016-06-17  1:28 ` Darrick J. Wong
  2016-06-17  1:28 ` [PATCH 098/119] xfs: preallocate blocks for worst-case btree expansion Darrick J. Wong
                   ` (21 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:28 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Create a per-inode extent size allocator hint for copy-on-write.  This
hint is separate from the existing extent size hint so that CoW can
take advantage of the fragmentation-reducing properties of extent size
hints without disabling delalloc for regular writes.

The extent size hint that's fed to the allocator during a copy on
write operation is the greater of the cowextsize and regular extsize
hint.

During reflink, if we're sharing the entire source file to the entire
destination file and the destination file doesn't already have a
cowextsize hint, propagate the source file's cowextsize hint to the
destination file.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c       |   13 +++++++-
 fs/xfs/libxfs/xfs_format.h     |    3 +-
 fs/xfs/libxfs/xfs_fs.h         |    3 +-
 fs/xfs/libxfs/xfs_inode_buf.c  |    4 ++
 fs/xfs/libxfs/xfs_inode_buf.h  |    1 +
 fs/xfs/libxfs/xfs_log_format.h |    3 +-
 fs/xfs/xfs_bmap_util.c         |    9 ++++-
 fs/xfs/xfs_inode.c             |   33 ++++++++++++++++++++
 fs/xfs/xfs_inode.h             |    1 +
 fs/xfs/xfs_inode_item.c        |    2 +
 fs/xfs/xfs_ioctl.c             |   67 +++++++++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_iomap.c             |    5 ++-
 fs/xfs/xfs_itable.c            |    5 +++
 fs/xfs/xfs_reflink.c           |   36 +++++++++++++++++----
 14 files changed, 166 insertions(+), 19 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 0909532..a6c08bf 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3665,7 +3665,13 @@ xfs_bmap_btalloc(
 	else if (mp->m_dalign)
 		stripe_align = mp->m_dalign;
 
-	align = ap->userdata ? xfs_get_extsz_hint(ap->ip) : 0;
+	if (ap->userdata) {
+		if (ap->flags & XFS_BMAPI_COWFORK)
+			align = xfs_get_cowextsz_hint(ap->ip);
+		else
+			align = xfs_get_extsz_hint(ap->ip);
+	} else
+		align = 0;
 	if (unlikely(align)) {
 		error = xfs_bmap_extsize_align(mp, &ap->got, &ap->prev,
 						align, 0, ap->eof, 0, ap->conv,
@@ -4178,7 +4184,10 @@ xfs_bmapi_reserve_delalloc(
 		alen = XFS_FILBLKS_MIN(alen, got->br_startoff - aoff);
 
 	/* Figure out the extent size, adjust alen */
-	extsz = xfs_get_extsz_hint(ip);
+	if (whichfork == XFS_COW_FORK)
+		extsz = xfs_get_cowextsz_hint(ip);
+	else
+		extsz = xfs_get_extsz_hint(ip);
 	if (extsz) {
 		error = xfs_bmap_extsize_align(mp, got, prev, extsz, rt, eof,
 					       1, 0, &aoff, &alen);
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 3d336e9..a35f4e5 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -890,7 +890,8 @@ typedef struct xfs_dinode {
 	__be64		di_changecount;	/* number of attribute changes */
 	__be64		di_lsn;		/* flush sequence */
 	__be64		di_flags2;	/* more random flags */
-	__u8		di_pad2[16];	/* more padding for future expansion */
+	__be32		di_cowextsize;	/* basic cow extent size for file */
+	__u8		di_pad2[12];	/* more padding for future expansion */
 
 	/* fields only written to during inode creation */
 	xfs_timestamp_t	di_crtime;	/* time created */
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index b1af423..10ebf99 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -278,7 +278,8 @@ typedef struct xfs_bstat {
 #define	bs_projid	bs_projid_lo	/* (previously just bs_projid)	*/
 	__u16		bs_forkoff;	/* inode fork offset in bytes	*/
 	__u16		bs_projid_hi;	/* higher part of project id	*/
-	unsigned char	bs_pad[10];	/* pad space, unused		*/
+	unsigned char	bs_pad[6];	/* pad space, unused		*/
+	__u32		bs_cowextsize;	/* cow extent size		*/
 	__u32		bs_dmevmask;	/* DMIG event mask		*/
 	__u16		bs_dmstate;	/* DMIG state info		*/
 	__u16		bs_aextents;	/* attribute number of extents	*/
diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index 44f325c..2efa42c 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -267,6 +267,7 @@ xfs_inode_from_disk(
 		to->di_crtime.t_sec = be32_to_cpu(from->di_crtime.t_sec);
 		to->di_crtime.t_nsec = be32_to_cpu(from->di_crtime.t_nsec);
 		to->di_flags2 = be64_to_cpu(from->di_flags2);
+		to->di_cowextsize = be32_to_cpu(from->di_cowextsize);
 	}
 }
 
@@ -316,7 +317,7 @@ xfs_inode_to_disk(
 		to->di_crtime.t_sec = cpu_to_be32(from->di_crtime.t_sec);
 		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
 		to->di_flags2 = cpu_to_be64(from->di_flags2);
-
+		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
 		to->di_ino = cpu_to_be64(ip->i_ino);
 		to->di_lsn = cpu_to_be64(lsn);
 		memset(to->di_pad2, 0, sizeof(to->di_pad2));
@@ -368,6 +369,7 @@ xfs_log_dinode_to_disk(
 		to->di_crtime.t_sec = cpu_to_be32(from->di_crtime.t_sec);
 		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
 		to->di_flags2 = cpu_to_be64(from->di_flags2);
+		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
 		to->di_ino = cpu_to_be64(from->di_ino);
 		to->di_lsn = cpu_to_be64(from->di_lsn);
 		memcpy(to->di_pad2, from->di_pad2, sizeof(to->di_pad2));
diff --git a/fs/xfs/libxfs/xfs_inode_buf.h b/fs/xfs/libxfs/xfs_inode_buf.h
index 958c543..6848a0a 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.h
+++ b/fs/xfs/libxfs/xfs_inode_buf.h
@@ -47,6 +47,7 @@ struct xfs_icdinode {
 	__uint16_t	di_flags;	/* random flags, XFS_DIFLAG_... */
 
 	__uint64_t	di_flags2;	/* more random flags */
+	__uint32_t	di_cowextsize;	/* basic cow extent size for file */
 
 	xfs_ictimestamp_t di_crtime;	/* time created */
 };
diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
index 320a305..9cab67f 100644
--- a/fs/xfs/libxfs/xfs_log_format.h
+++ b/fs/xfs/libxfs/xfs_log_format.h
@@ -423,7 +423,8 @@ struct xfs_log_dinode {
 	__uint64_t	di_changecount;	/* number of attribute changes */
 	xfs_lsn_t	di_lsn;		/* flush sequence */
 	__uint64_t	di_flags2;	/* more random flags */
-	__uint8_t	di_pad2[16];	/* more padding for future expansion */
+	__uint32_t	di_cowextsize;	/* basic cow extent size for file */
+	__uint8_t	di_pad2[12];	/* more padding for future expansion */
 
 	/* fields only written to during inode creation */
 	xfs_ictimestamp_t di_crtime;	/* time created */
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index a5f5515..b0c2c6d5 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -499,8 +499,13 @@ xfs_getbmap(
 		if (ip->i_cformat != XFS_DINODE_FMT_EXTENTS)
 			return -EINVAL;
 
-		prealloced = 0;
-		fixlen = XFS_ISIZE(ip);
+		if (xfs_get_cowextsz_hint(ip)) {
+			prealloced = 1;
+			fixlen = mp->m_super->s_maxbytes;
+		} else {
+			prealloced = 0;
+			fixlen = XFS_ISIZE(ip);
+		}
 		break;
 	default:
 		if (ip->i_d.di_format != XFS_DINODE_FMT_EXTENTS &&
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 127bf54..480e48a 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -78,6 +78,27 @@ xfs_get_extsz_hint(
 }
 
 /*
+ * Helper function to extract CoW extent size hint from inode.
+ * Between the extent size hint and the CoW extent size hint, we
+ * return the greater of the two.
+ */
+xfs_extlen_t
+xfs_get_cowextsz_hint(
+	struct xfs_inode	*ip)
+{
+	xfs_extlen_t		a, b;
+
+	a = 0;
+	if (ip->i_d.di_flags2 & XFS_DIFLAG2_COWEXTSIZE)
+		a = ip->i_d.di_cowextsize;
+	b = xfs_get_extsz_hint(ip);
+
+	if (a > b)
+		return a;
+	return b;
+}
+
+/*
  * These two are wrapper routines around the xfs_ilock() routine used to
  * centralize some grungy code.  They are used in places that wish to lock the
  * inode solely for reading the extents.  The reason these places can't just
@@ -654,6 +675,8 @@ _xfs_dic2xflags(
 			flags |= FS_XFLAG_DAX;
 		if (di_flags2 & XFS_DIFLAG2_REFLINK)
 			flags |= FS_XFLAG_REFLINK;
+		if (di_flags2 & XFS_DIFLAG2_COWEXTSIZE)
+			flags |= FS_XFLAG_COWEXTSIZE;
 	}
 
 	if (has_attr)
@@ -837,6 +860,7 @@ xfs_ialloc(
 	if (ip->i_d.di_version == 3) {
 		inode->i_version = 1;
 		ip->i_d.di_flags2 = 0;
+		ip->i_d.di_cowextsize = 0;
 		ip->i_d.di_crtime.t_sec = (__int32_t)tv.tv_sec;
 		ip->i_d.di_crtime.t_nsec = (__int32_t)tv.tv_nsec;
 	}
@@ -899,6 +923,15 @@ xfs_ialloc(
 			ip->i_d.di_flags |= di_flags;
 			ip->i_d.di_flags2 |= di_flags2;
 		}
+		if (pip &&
+		    (pip->i_d.di_flags2 & XFS_DIFLAG2_ANY) &&
+		    pip->i_d.di_version == 3 &&
+		    ip->i_d.di_version == 3) {
+			if (pip->i_d.di_flags2 & XFS_DIFLAG2_COWEXTSIZE) {
+				ip->i_d.di_flags2 |= XFS_DIFLAG2_COWEXTSIZE;
+				ip->i_d.di_cowextsize = pip->i_d.di_cowextsize;
+			}
+		}
 		/* FALLTHROUGH */
 	case S_IFLNK:
 		ip->i_d.di_format = XFS_DINODE_FMT_EXTENTS;
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 797fcc7..2c1fb3f 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -419,6 +419,7 @@ int		xfs_iflush(struct xfs_inode *, struct xfs_buf **);
 void		xfs_lock_two_inodes(xfs_inode_t *, xfs_inode_t *, uint);
 
 xfs_extlen_t	xfs_get_extsz_hint(struct xfs_inode *ip);
+xfs_extlen_t	xfs_get_cowextsz_hint(struct xfs_inode *ip);
 
 int		xfs_dir_ialloc(struct xfs_trans **, struct xfs_inode *, umode_t,
 			       xfs_nlink_t, xfs_dev_t, prid_t, int,
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index a1b0761..9a1d62b 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -368,7 +368,7 @@ xfs_inode_to_log_dinode(
 		to->di_crtime.t_sec = from->di_crtime.t_sec;
 		to->di_crtime.t_nsec = from->di_crtime.t_nsec;
 		to->di_flags2 = from->di_flags2;
-
+		to->di_cowextsize = from->di_cowextsize;
 		to->di_ino = ip->i_ino;
 		to->di_lsn = lsn;
 		memset(to->di_pad2, 0, sizeof(to->di_pad2));
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index b8eceee..d2b4e81 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -900,6 +900,8 @@ xfs_ioc_fsgetxattr(
 	xfs_ilock(ip, XFS_ILOCK_SHARED);
 	fa.fsx_xflags = xfs_ip2xflags(ip);
 	fa.fsx_extsize = ip->i_d.di_extsize << ip->i_mount->m_sb.sb_blocklog;
+	fa.fsx_cowextsize = ip->i_d.di_cowextsize <<
+			ip->i_mount->m_sb.sb_blocklog;
 	fa.fsx_projid = xfs_get_projid(ip);
 
 	if (attr) {
@@ -970,12 +972,13 @@ xfs_set_diflags(
 	if (ip->i_d.di_version < 3)
 		return;
 
-	di_flags2 = 0;
+	di_flags2 = (ip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK);
 	if (xflags & FS_XFLAG_DAX)
 		di_flags2 |= XFS_DIFLAG2_DAX;
+	if (xflags & FS_XFLAG_COWEXTSIZE)
+		di_flags2 |= XFS_DIFLAG2_COWEXTSIZE;
 
 	ip->i_d.di_flags2 = di_flags2;
-
 }
 
 STATIC void
@@ -1216,6 +1219,56 @@ xfs_ioctl_setattr_check_extsize(
 	return 0;
 }
 
+/*
+ * CoW extent size hint validation rules are:
+ *
+ * 1. CoW extent size hint can only be set if reflink is enabled on the fs.
+ *    The inode does not have to have any shared blocks, but it must be a v3.
+ * 2. FS_XFLAG_COWEXTSIZE is only valid for directories and regular files;
+ *    for a directory, the hint is propagated to new files.
+ * 3. Can be changed on files & directories at any time.
+ * 4. CoW extsize hint of 0 turns off hints, clears inode flags.
+ * 5. Extent size must be a multiple of the appropriate block size.
+ * 6. The extent size hint must be limited to half the AG size to avoid
+ *    alignment extending the extent beyond the limits of the AG.
+ */
+static int
+xfs_ioctl_setattr_check_cowextsize(
+	struct xfs_inode	*ip,
+	struct fsxattr		*fa)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+
+	if (!(fa->fsx_xflags & FS_XFLAG_COWEXTSIZE))
+		return 0;
+
+	if (!xfs_sb_version_hasreflink(&ip->i_mount->m_sb) ||
+	    ip->i_d.di_version != 3)
+		return -EINVAL;
+
+	if (!S_ISREG(VFS_I(ip)->i_mode) && !S_ISDIR(VFS_I(ip)->i_mode))
+		return -EINVAL;
+
+	if (fa->fsx_cowextsize != 0) {
+		xfs_extlen_t    size;
+		xfs_fsblock_t   cowextsize_fsb;
+
+		cowextsize_fsb = XFS_B_TO_FSB(mp, fa->fsx_cowextsize);
+		if (cowextsize_fsb > MAXEXTLEN)
+			return -EINVAL;
+
+		size = mp->m_sb.sb_blocksize;
+		if (cowextsize_fsb > mp->m_sb.sb_agblocks / 2)
+			return -EINVAL;
+
+		if (fa->fsx_cowextsize % size)
+			return -EINVAL;
+	} else
+		fa->fsx_xflags &= ~FS_XFLAG_COWEXTSIZE;
+
+	return 0;
+}
+
 static int
 xfs_ioctl_setattr_check_projid(
 	struct xfs_inode	*ip,
@@ -1312,6 +1365,10 @@ xfs_ioctl_setattr(
 	if (code)
 		goto error_trans_cancel;
 
+	code = xfs_ioctl_setattr_check_cowextsize(ip, fa);
+	if (code)
+		goto error_trans_cancel;
+
 	code = xfs_ioctl_setattr_xflags(tp, ip, fa);
 	if (code)
 		goto error_trans_cancel;
@@ -1347,6 +1404,12 @@ xfs_ioctl_setattr(
 		ip->i_d.di_extsize = fa->fsx_extsize >> mp->m_sb.sb_blocklog;
 	else
 		ip->i_d.di_extsize = 0;
+	if (ip->i_d.di_version == 3 &&
+	    (ip->i_d.di_flags2 & XFS_DIFLAG2_COWEXTSIZE))
+		ip->i_d.di_cowextsize = fa->fsx_cowextsize >>
+				mp->m_sb.sb_blocklog;
+	else
+		ip->i_d.di_cowextsize = 0;
 
 	code = xfs_trans_commit(tp);
 
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index e7e1346..3914f0f 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -589,7 +589,10 @@ __xfs_iomap_write_delay(
 	if (error)
 		return error;
 
-	extsz = xfs_get_extsz_hint(ip);
+	if (whichfork == XFS_COW_FORK)
+		extsz = xfs_get_cowextsz_hint(ip);
+	else
+		extsz = xfs_get_extsz_hint(ip);
 	offset_fsb = XFS_B_TO_FSBT(mp, offset);
 
 	if (whichfork == XFS_DATA_FORK) {
diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
index ce73eb3..6da964a 100644
--- a/fs/xfs/xfs_itable.c
+++ b/fs/xfs/xfs_itable.c
@@ -111,6 +111,11 @@ xfs_bulkstat_one_int(
 	buf->bs_aextents = dic->di_anextents;
 	buf->bs_forkoff = XFS_IFORK_BOFF(ip);
 
+	if (dic->di_version == 3) {
+		if (dic->di_flags2 & XFS_DIFLAG2_COWEXTSIZE)
+			buf->bs_cowextsize = dic->di_cowextsize;
+	}
+
 	switch (dic->di_format) {
 	case XFS_DINODE_FMT_DEV:
 		buf->bs_rdev = ip->i_df.if_u2.if_rdev;
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 7c64104..d2c1547 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1010,18 +1010,19 @@ out_error:
 }
 
 /*
- * Update destination inode size, if necessary.
+ * Update destination inode size & cowextsize hint, if necessary.
  */
 STATIC int
 xfs_reflink_update_dest(
 	struct xfs_inode	*dest,
-	xfs_off_t		newlen)
+	xfs_off_t		newlen,
+	xfs_extlen_t		cowextsize)
 {
 	struct xfs_mount	*mp = dest->i_mount;
 	struct xfs_trans	*tp;
 	int			error;
 
-	if (newlen <= i_size_read(VFS_I(dest)))
+	if (newlen <= i_size_read(VFS_I(dest)) && cowextsize == 0)
 		return 0;
 
 	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_ichange, 0, 0, 0, &tp);
@@ -1031,9 +1032,17 @@ xfs_reflink_update_dest(
 	xfs_ilock(dest, XFS_ILOCK_EXCL);
 	xfs_trans_ijoin(tp, dest, XFS_ILOCK_EXCL);
 
-	trace_xfs_reflink_update_inode_size(dest, newlen);
-	i_size_write(VFS_I(dest), newlen);
-	dest->i_d.di_size = newlen;
+	if (newlen > i_size_read(VFS_I(dest))) {
+		trace_xfs_reflink_update_inode_size(dest, newlen);
+		i_size_write(VFS_I(dest), newlen);
+		dest->i_d.di_size = newlen;
+	}
+
+	if (cowextsize) {
+		dest->i_d.di_cowextsize = cowextsize;
+		dest->i_d.di_flags2 |= XFS_DIFLAG2_COWEXTSIZE;
+	}
+
 	xfs_trans_log_inode(tp, dest, XFS_ILOG_CORE);
 
 	error = xfs_trans_commit(tp);
@@ -1351,6 +1360,7 @@ xfs_reflink_remap_range(
 	xfs_fileoff_t		sfsbno, dfsbno;
 	xfs_filblks_t		fsblen;
 	int			error;
+	xfs_extlen_t		cowextsize;
 	bool			is_same;
 
 	if (!xfs_sb_version_hasreflink(&mp->m_sb))
@@ -1411,7 +1421,19 @@ xfs_reflink_remap_range(
 	if (error)
 		goto out_error;
 
-	error = xfs_reflink_update_dest(dest, destoff + len);
+	/*
+	 * Carry the cowextsize hint from src to dest if we're sharing the
+	 * entire source file to the entire destination file, the source file
+	 * has a cowextsize hint, and the destination file does not.
+	 */
+	cowextsize = 0;
+	if (srcoff == 0 && len == i_size_read(VFS_I(src)) &&
+	    (src->i_d.di_flags2 & XFS_DIFLAG2_COWEXTSIZE) &&
+	    destoff == 0 && len >= i_size_read(VFS_I(dest)) &&
+	    !(dest->i_d.di_flags2 & XFS_DIFLAG2_COWEXTSIZE))
+		cowextsize = src->i_d.di_cowextsize;
+
+	error = xfs_reflink_update_dest(dest, destoff + len, cowextsize);
 	if (error)
 		goto out_error;
 



* [PATCH 098/119] xfs: preallocate blocks for worst-case btree expansion
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (96 preceding siblings ...)
  2016-06-17  1:28 ` [PATCH 097/119] xfs: create a separate cow extent size hint for the allocator Darrick J. Wong
@ 2016-06-17  1:28 ` Darrick J. Wong
  2016-06-17  1:28 ` [PATCH 099/119] xfs: don't allow reflink when the AG is low on space Darrick J. Wong
                   ` (20 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:28 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, Christoph Hellwig, xfs

To gracefully handle the situation where a CoW operation turns a
single refcount extent into a lot of tiny ones and then runs out of
space when a tree split has to happen, use the per-AG reserved block
pool to preallocate all the space we'll ever need for a maximal
refcount btree.  For a 4K block size this reservation only costs an
overhead of about 0.3% of the available disk space.

When reflink is enabled, we have an unfortunate problem with rmap --
since we can share a block billions of times, this means that the
reverse mapping btree can expand basically infinitely.  When an AG is
so full that there are no free blocks with which to expand the rmapbt,
the filesystem will shut down hard.

This is rather annoying to the user, so use the AG reservation code to
reserve a "reasonable" amount of space for the rmapbt.  Rather than
shutting down when an AG's free space gets close to exhaustion, we'll
refuse further reflink and CoW operations; this permanent reservation
should be enough for "most" users.  Hopefully.

v2: Simplify the return value from xfs_perag_pool_free_block to a bool
so that we can easily call xfs_trans_binval for both the per-AG pool
and the real freeing case.  Without this we fail to invalidate the
btree buffer and will trip over the write verifier on a shrinking
refcount btree.

v3: Convert to the new per-AG reservation code.

v4: Combine this patch with the one that adds the rmapbt reservation,
since the rmapbt reservation is only needed for reflink filesystems.

v5: If we detect errors while counting the refcount or rmap btrees,
shut down the filesystem to avoid the scenario where the fs shuts down
mid-transaction due to btree corruption, repair refuses to run until
the log is clean, and the log cannot be cleaned because replay hits
btree corruption and shuts down.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
[hch@lst.de: ensure that we invalidate the freed btree buffer]
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/libxfs/xfs_ag_resv.c        |   11 +++++
 fs/xfs/libxfs/xfs_alloc.c          |    2 -
 fs/xfs/libxfs/xfs_refcount_btree.c |   61 +++++++++++++++++++++++++--
 fs/xfs/libxfs/xfs_refcount_btree.h |    3 +
 fs/xfs/libxfs/xfs_rmap_btree.c     |   80 ++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_rmap_btree.h     |    7 +++
 fs/xfs/xfs_fsops.c                 |   56 +++++++++++++++++++++++++
 fs/xfs/xfs_fsops.h                 |    3 +
 fs/xfs/xfs_mount.c                 |    4 ++
 fs/xfs/xfs_super.c                 |    5 ++
 10 files changed, 225 insertions(+), 7 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c
index 4d390b7..9cfe132 100644
--- a/fs/xfs/libxfs/xfs_ag_resv.c
+++ b/fs/xfs/libxfs/xfs_ag_resv.c
@@ -38,6 +38,7 @@
 #include "xfs_trans_space.h"
 #include "xfs_rmap_btree.h"
 #include "xfs_btree.h"
+#include "xfs_refcount_btree.h"
 
 /*
  * Per-AG Block Reservations
@@ -225,6 +226,11 @@ xfs_ag_resv_init(
 	/* Create the metadata reservation. */
 	ask = used = 0;
 
+	err2 = xfs_refcountbt_calc_reserves(pag->pag_mount, pag->pag_agno,
+			&ask, &used);
+	if (err2 && !error)
+		error = err2;
+
 	err2 = __xfs_ag_resv_init(pag, XFS_AG_RESV_METADATA, ask, used);
 	if (err2 && !error)
 		error = err2;
@@ -236,6 +242,11 @@ init_agfl:
 	/* Create the AGFL metadata reservation */
 	ask = used = 0;
 
+	err2 = xfs_rmapbt_calc_reserves(pag->pag_mount, pag->pag_agno,
+			&ask, &used);
+	if (err2 && !error)
+		error = err2;
+
 	err2 = __xfs_ag_resv_init(pag, XFS_AG_RESV_AGFL, ask, used);
 	if (err2 && !error)
 		error = err2;
diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index c46db76..188c359a 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -139,8 +139,6 @@ xfs_alloc_ag_max_usable(struct xfs_mount *mp)
 		/* rmap root block + full tree split on full AG */
 		blocks += 1 + (2 * mp->m_ag_maxlevels) - 1;
 	}
-	if (xfs_sb_version_hasreflink(&mp->m_sb))
-		blocks += xfs_refcountbt_max_size(mp);
 
 	return mp->m_sb.sb_agblocks - blocks;
 }
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
index a944fca..abf1ebf 100644
--- a/fs/xfs/libxfs/xfs_refcount_btree.c
+++ b/fs/xfs/libxfs/xfs_refcount_btree.c
@@ -76,6 +76,8 @@ xfs_refcountbt_alloc_block(
 	struct xfs_alloc_arg	args;		/* block allocation args */
 	int			error;		/* error return value */
 
+	XFS_BTREE_TRACE_CURSOR(cur, XBT_ENTRY);
+
 	memset(&args, 0, sizeof(args));
 	args.tp = cur->bc_tp;
 	args.mp = cur->bc_mp;
@@ -85,6 +87,7 @@ xfs_refcountbt_alloc_block(
 	args.firstblock = args.fsbno;
 	xfs_rmap_ag_owner(&args.oinfo, XFS_RMAP_OWN_REFC);
 	args.minlen = args.maxlen = args.prod = 1;
+	args.resv = XFS_AG_RESV_METADATA;
 
 	error = xfs_alloc_vextent(&args);
 	if (error)
@@ -116,17 +119,20 @@ xfs_refcountbt_free_block(
 	struct xfs_buf		*bp)
 {
 	struct xfs_mount	*mp = cur->bc_mp;
-	struct xfs_trans	*tp = cur->bc_tp;
 	xfs_fsblock_t		fsbno = XFS_DADDR_TO_FSB(mp, XFS_BUF_ADDR(bp));
 	struct xfs_owner_info	oinfo;
+	int			error;
 
 	trace_xfs_refcountbt_free_block(cur->bc_mp, cur->bc_private.a.agno,
 			XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno), 1);
 	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_REFC);
-	xfs_bmap_add_free(mp, cur->bc_private.a.dfops, fsbno, 1,
-			&oinfo);
-	xfs_trans_binval(tp, bp);
-	return 0;
+	error = xfs_free_extent(cur->bc_tp, fsbno, 1, &oinfo,
+			XFS_AG_RESV_METADATA);
+	if (error)
+		return error;
+
+	xfs_trans_binval(cur->bc_tp, bp);
+	return error;
 }
 
 STATIC int
@@ -396,3 +402,48 @@ xfs_refcountbt_max_size(
 
 	return xfs_refcountbt_calc_size(mp, mp->m_sb.sb_agblocks);
 }
+
+/* Count the blocks in the reference count tree. */
+static int
+xfs_refcountbt_count_blocks(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno,
+	xfs_extlen_t		*tree_blocks)
+{
+	struct xfs_buf		*agbp;
+	struct xfs_btree_cur	*cur;
+	int			error;
+
+	error = xfs_alloc_read_agf(mp, NULL, agno, 0, &agbp);
+	if (error)
+		return error;
+	cur = xfs_refcountbt_init_cursor(mp, NULL, agbp, agno, NULL);
+	error = xfs_btree_count_blocks(cur, tree_blocks);
+	xfs_btree_del_cursor(cur, error ? XFS_BTREE_ERROR : XFS_BTREE_NOERROR);
+	xfs_trans_brelse(NULL, agbp);
+
+	return error;
+}
+
+/*
+ * Figure out how many blocks to reserve and how many are used by this btree.
+ */
+int
+xfs_refcountbt_calc_reserves(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno,
+	xfs_extlen_t		*ask,
+	xfs_extlen_t		*used)
+{
+	xfs_extlen_t		tree_len = 0;
+	int			error;
+
+	if (!xfs_sb_version_hasreflink(&mp->m_sb))
+		return 0;
+
+	*ask += xfs_refcountbt_max_size(mp);
+	error = xfs_refcountbt_count_blocks(mp, agno, &tree_len);
+	*used += tree_len;
+
+	return error;
+}
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.h b/fs/xfs/libxfs/xfs_refcount_btree.h
index 780b02f..3be7768 100644
--- a/fs/xfs/libxfs/xfs_refcount_btree.h
+++ b/fs/xfs/libxfs/xfs_refcount_btree.h
@@ -68,4 +68,7 @@ extern xfs_extlen_t xfs_refcountbt_calc_size(struct xfs_mount *mp,
 		unsigned long long len);
 extern xfs_extlen_t xfs_refcountbt_max_size(struct xfs_mount *mp);
 
+extern int xfs_refcountbt_calc_reserves(struct xfs_mount *mp,
+		xfs_agnumber_t agno, xfs_extlen_t *ask, xfs_extlen_t *used);
+
 #endif	/* __XFS_REFCOUNT_BTREE_H__ */
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
index 090dbbe..0b045a6 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.c
+++ b/fs/xfs/libxfs/xfs_rmap_btree.c
@@ -34,6 +34,7 @@
 #include "xfs_cksum.h"
 #include "xfs_error.h"
 #include "xfs_extent_busy.h"
+#include "xfs_ag_resv.h"
 
 /*
  * Reverse map btree.
@@ -60,6 +61,14 @@
  * try to recover tree and file data from corrupt primary metadata.
  */
 
+static bool
+xfs_rmapbt_need_reserve(
+	struct xfs_mount	*mp)
+{
+	return  xfs_sb_version_hasrmapbt(&mp->m_sb) &&
+		xfs_sb_version_hasreflink(&mp->m_sb);
+}
+
 static struct xfs_btree_cur *
 xfs_rmapbt_dup_cursor(
 	struct xfs_btree_cur	*cur)
@@ -481,3 +490,74 @@ xfs_rmapbt_compute_maxlevels(
 		mp->m_rmap_maxlevels = xfs_btree_compute_maxlevels(mp,
 				mp->m_rmap_mnr, mp->m_sb.sb_agblocks);
 }
+
+/* Calculate the rmap btree size for some records. */
+xfs_extlen_t
+xfs_rmapbt_calc_size(
+	struct xfs_mount	*mp,
+	unsigned long long	len)
+{
+	return xfs_btree_calc_size(mp, mp->m_rmap_mnr, len);
+}
+
+/*
+ * Calculate the maximum rmap btree size.
+ */
+xfs_extlen_t
+xfs_rmapbt_max_size(
+	struct xfs_mount	*mp)
+{
+	/* Bail out if we're uninitialized, which can happen in mkfs. */
+	if (mp->m_rmap_mxr[0] == 0)
+		return 0;
+
+	return xfs_rmapbt_calc_size(mp, mp->m_sb.sb_agblocks);
+}
+
+/* Count the blocks in the reverse mapping tree. */
+static int
+xfs_rmapbt_count_blocks(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno,
+	xfs_extlen_t		*tree_blocks)
+{
+	struct xfs_buf		*agbp;
+	struct xfs_btree_cur	*cur;
+	int			error;
+
+	error = xfs_alloc_read_agf(mp, NULL, agno, 0, &agbp);
+	if (error)
+		return error;
+	cur = xfs_rmapbt_init_cursor(mp, NULL, agbp, agno);
+	error = xfs_btree_count_blocks(cur, tree_blocks);
+	xfs_btree_del_cursor(cur, error ? XFS_BTREE_ERROR : XFS_BTREE_NOERROR);
+	xfs_trans_brelse(NULL, agbp);
+
+	return error;
+}
+
+/*
+ * Figure out how many blocks to reserve and how many are used by this btree.
+ */
+int
+xfs_rmapbt_calc_reserves(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno,
+	xfs_extlen_t		*ask,
+	xfs_extlen_t		*used)
+{
+	xfs_extlen_t		pool_len;
+	xfs_extlen_t		tree_len = 0;
+	int			error;
+
+	if (!xfs_rmapbt_need_reserve(mp))
+		return 0;
+
+	/* Reserve 1% of the AG or enough for 1 block per record. */
+	pool_len = max(mp->m_sb.sb_agblocks / 100, xfs_rmapbt_max_size(mp));
+	*ask += pool_len;
+	error = xfs_rmapbt_count_blocks(mp, agno, &tree_len);
+	*used += tree_len;
+
+	return error;
+}
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
index 5df406e..f398e8b 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.h
+++ b/fs/xfs/libxfs/xfs_rmap_btree.h
@@ -130,4 +130,11 @@ int xfs_rmap_finish_one(struct xfs_trans *tp, enum xfs_rmap_intent_type type,
 		xfs_fsblock_t startblock, xfs_filblks_t blockcount,
 		xfs_exntst_t state, struct xfs_btree_cur **pcur);
 
+extern xfs_extlen_t xfs_rmapbt_calc_size(struct xfs_mount *mp,
+		unsigned long long len);
+extern xfs_extlen_t xfs_rmapbt_max_size(struct xfs_mount *mp);
+
+extern int xfs_rmapbt_calc_reserves(struct xfs_mount *mp,
+		xfs_agnumber_t agno, xfs_extlen_t *ask, xfs_extlen_t *used);
+
 #endif	/* __XFS_RMAP_BTREE_H__ */
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 84e7ba3..e76aefc 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -42,6 +42,8 @@
 #include "xfs_trace.h"
 #include "xfs_log.h"
 #include "xfs_filestream.h"
+#include "xfs_refcount_btree.h"
+#include "xfs_ag_resv.h"
 
 /*
  * File system operations
@@ -677,6 +679,9 @@ xfs_growfs_data_private(
 			continue;
 		}
 	}
+
+	xfs_fs_reserve_ag_blocks(mp);
+
 	return saved_error ? saved_error : error;
 
  error0:
@@ -971,3 +976,54 @@ xfs_do_force_shutdown(
 	"Please umount the filesystem and rectify the problem(s)");
 	}
 }
+
+/*
+ * Reserve free space for per-AG metadata.
+ */
+void
+xfs_fs_reserve_ag_blocks(
+	struct xfs_mount	*mp)
+{
+	xfs_agnumber_t		agno;
+	struct xfs_perag	*pag;
+	int			error = 0;
+	int			err2;
+
+	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
+		pag = xfs_perag_get(mp, agno);
+		err2 = xfs_ag_resv_init(pag);
+		xfs_perag_put(pag);
+		if (err2 && !error)
+			error = err2;
+	}
+
+	if (error) {
+		xfs_warn(mp, "Error %d reserving metadata blocks.", error);
+		xfs_force_shutdown(mp, (error == -EFSCORRUPTED) ?
+			SHUTDOWN_CORRUPT_INCORE : SHUTDOWN_META_IO_ERROR);
+	}
+}
+
+/*
+ * Free space reserved for per-AG metadata.
+ */
+void
+xfs_fs_unreserve_ag_blocks(
+	struct xfs_mount	*mp)
+{
+	xfs_agnumber_t		agno;
+	struct xfs_perag	*pag;
+	int			error = 0;
+	int			err2;
+
+	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
+		pag = xfs_perag_get(mp, agno);
+		err2 = xfs_ag_resv_free(pag);
+		xfs_perag_put(pag);
+		if (err2 && !error)
+			error = err2;
+	}
+
+	if (error)
+		xfs_warn(mp, "Error %d unreserving metadata blocks.", error);
+}
diff --git a/fs/xfs/xfs_fsops.h b/fs/xfs/xfs_fsops.h
index f32713f..71e3248 100644
--- a/fs/xfs/xfs_fsops.h
+++ b/fs/xfs/xfs_fsops.h
@@ -26,4 +26,7 @@ extern int xfs_reserve_blocks(xfs_mount_t *mp, __uint64_t *inval,
 				xfs_fsop_resblks_t *outval);
 extern int xfs_fs_goingdown(xfs_mount_t *mp, __uint32_t inflags);
 
+extern void xfs_fs_reserve_ag_blocks(struct xfs_mount *mp);
+extern void xfs_fs_unreserve_ag_blocks(struct xfs_mount *mp);
+
 #endif	/* __XFS_FSOPS_H__ */
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 6351dce..db80832 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -45,6 +45,7 @@
 #include "xfs_rmap_btree.h"
 #include "xfs_refcount_btree.h"
 #include "xfs_reflink.h"
+#include "xfs_refcount_btree.h"
 
 
 static DEFINE_MUTEX(xfs_uuid_table_mutex);
@@ -962,6 +963,8 @@ xfs_mountfs(
 			xfs_warn(mp,
 	"Unable to allocate reserve blocks. Continuing without reserve pool.");
 
+		xfs_fs_reserve_ag_blocks(mp);
+
 		/* Recover any CoW blocks that never got remapped. */
 		error = xfs_reflink_recover_cow(mp);
 		if (error && !XFS_FORCED_SHUTDOWN(mp))
@@ -1013,6 +1016,7 @@ xfs_unmountfs(
 
 	cancel_delayed_work_sync(&mp->m_eofblocks_work);
 
+	xfs_fs_unreserve_ag_blocks(mp);
 	xfs_qm_unmount_quotas(mp);
 	xfs_rtunmount_inodes(mp);
 	IRELE(mp->m_rootip);
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 87e997a..10a0f721 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -51,6 +51,7 @@
 #include "xfs_refcount_item.h"
 #include "xfs_bmap_item.h"
 #include "xfs_reflink.h"
+#include "xfs_refcount_btree.h"
 
 #include <linux/namei.h>
 #include <linux/init.h>
@@ -1306,6 +1307,7 @@ xfs_fs_remount(
 		 */
 		xfs_restore_resvblks(mp);
 		xfs_log_work_queue(mp);
+		xfs_fs_reserve_ag_blocks(mp);
 
 		/* Recover any CoW blocks that never got remapped. */
 		error = xfs_reflink_recover_cow(mp);
@@ -1323,6 +1325,9 @@ xfs_fs_remount(
 		 * reserve pool size so that if we get remounted rw, we can
 		 * return it to the same size.
 		 */
+
+		xfs_fs_unreserve_ag_blocks(mp);
+
 		xfs_save_resvblks(mp);
 		xfs_quiesce_attr(mp);
 		mp->m_flags |= XFS_MOUNT_RDONLY;


^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 099/119] xfs: don't allow reflink when the AG is low on space
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (97 preceding siblings ...)
  2016-06-17  1:28 ` [PATCH 098/119] xfs: preallocate blocks for worst-case btree expansion Darrick J. Wong
@ 2016-06-17  1:28 ` Darrick J. Wong
  2016-06-17  1:28 ` [PATCH 100/119] xfs: try other AGs to allocate a BMBT block Darrick J. Wong
                   ` (19 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:28 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

If the AG free space is down to the reserves, refuse to reflink our
way out of space.  Hopefully userspace will make a real copy and/or go
elsewhere.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_reflink.c |   34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)


diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index d2c1547..f5195b7 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -53,6 +53,8 @@
 #include "xfs_reflink.h"
 #include "xfs_iomap.h"
 #include "xfs_rmap_btree.h"
+#include "xfs_sb.h"
+#include "xfs_ag_resv.h"
 
 /*
  * Copy on Write of Shared Blocks
@@ -1056,6 +1058,30 @@ out_error:
 }
 
 /*
+ * Do we have enough reserve in this AG to handle a reflink?  The refcount
+ * btree already reserved all the space it needs, but the rmap btree can grow
+ * infinitely, so we won't allow more reflinks when the AG is down to the
+ * btree reserves.
+ */
+static int
+xfs_reflink_ag_has_free_space(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno)
+{
+	struct xfs_perag	*pag;
+	int			error = 0;
+
+	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
+		return 0;
+
+	pag = xfs_perag_get(mp, agno);
+	if (xfs_ag_resv_critical(pag, XFS_AG_RESV_AGFL))
+		error = -ENOSPC;
+	xfs_perag_put(pag);
+	return error;
+}
+
+/*
  * Unmap a range of blocks from a file, then map other blocks into the hole.
  * The range to unmap is (destoff : destoff + srcioff + irec->br_blockcount).
  * The extent irec is mapped into dest at irec->br_startoff.
@@ -1087,6 +1113,14 @@ xfs_reflink_remap_extent(
 			irec->br_startblock != DELAYSTARTBLOCK &&
 			!ISUNWRITTEN(irec));
 
+	/* No reflinking if we're low on space */
+	if (real_extent) {
+		error = xfs_reflink_ag_has_free_space(mp,
+				XFS_FSB_TO_AGNO(mp, irec->br_startblock));
+		if (error)
+			goto out;
+	}
+
 	/* Start a rolling transaction to switch the mappings */
 	resblks = XFS_EXTENTADD_SPACE_RES(ip->i_mount, XFS_DATA_FORK);
 	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0, &tp);


^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 100/119] xfs: try other AGs to allocate a BMBT block
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (98 preceding siblings ...)
  2016-06-17  1:28 ` [PATCH 099/119] xfs: don't allow reflink when the AG is low on space Darrick J. Wong
@ 2016-06-17  1:28 ` Darrick J. Wong
  2016-06-17  1:28 ` [PATCH 101/119] xfs: promote buffered writes to CoW when cowextsz is set Darrick J. Wong
                   ` (18 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:28 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Prior to the introduction of reflink, allocating a block and mapping
it into a file was performed in a single transaction with a single
block reservation, and the allocator was supposed to find enough
blocks to allocate the extent and any BMBT blocks that might be
necessary (unless we're low on space).

However, due to the way copy on write works, allocation and mapping
have been split into two transactions, which means that we must be
able to handle the case where we allocate an extent for CoW but that
AG runs out of free space before the blocks can be mapped into a file,
and the mapping requires a new BMBT block.  When this happens, look in
one of the other AGs for a BMBT block instead of taking the FS down.

The same applies to the functions that convert a data fork to extents
and later btree format.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c       |   30 ++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_bmap_btree.c |   17 +++++++++++++++++
 2 files changed, 47 insertions(+)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index a6c08bf..62ac322 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -753,6 +753,7 @@ xfs_bmap_extents_to_btree(
 		args.type = XFS_ALLOCTYPE_START_BNO;
 		args.fsbno = XFS_INO_TO_FSB(mp, ip->i_ino);
 	} else if (dfops->dop_low) {
+try_another_ag:
 		args.type = XFS_ALLOCTYPE_START_BNO;
 		args.fsbno = *firstblock;
 	} else {
@@ -767,6 +768,21 @@ xfs_bmap_extents_to_btree(
 		xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
 		return error;
 	}
+
+	/*
+	 * During a CoW operation, the allocation and bmbt updates occur in
+	 * different transactions.  The mapping code tries to put new bmbt
+	 * blocks near extents being mapped, but the only way to guarantee this
+	 * is if the alloc and the mapping happen in a single transaction that
+	 * has a block reservation.  That isn't the case here, so if we run out
+	 * of space we'll try again with another AG.
+	 */
+	if (xfs_sb_version_hasreflink(&cur->bc_mp->m_sb) &&
+	    args.fsbno == NULLFSBLOCK &&
+	    args.type == XFS_ALLOCTYPE_NEAR_BNO) {
+		dfops->dop_low = true;
+		goto try_another_ag;
+	}
 	/*
 	 * Allocation can't fail, the space was reserved.
 	 */
@@ -902,6 +918,7 @@ xfs_bmap_local_to_extents(
 	 * file currently fits in an inode.
 	 */
 	if (*firstblock == NULLFSBLOCK) {
+try_another_ag:
 		args.fsbno = XFS_INO_TO_FSB(args.mp, ip->i_ino);
 		args.type = XFS_ALLOCTYPE_START_BNO;
 	} else {
@@ -914,6 +931,19 @@ xfs_bmap_local_to_extents(
 	if (error)
 		goto done;
 
+	/*
+	 * During a CoW operation, the allocation and bmbt updates occur in
+	 * different transactions.  The mapping code tries to put new bmbt
+	 * blocks near extents being mapped, but the only way to guarantee this
+	 * is if the alloc and the mapping happen in a single transaction that
+	 * has a block reservation.  That isn't the case here, so if we run out
+	 * of space we'll try again with another AG.
+	 */
+	if (xfs_sb_version_hasreflink(&ip->i_mount->m_sb) &&
+	    args.fsbno == NULLFSBLOCK &&
+	    args.type == XFS_ALLOCTYPE_NEAR_BNO) {
+		goto try_another_ag;
+	}
 	/* Can't fail, the space was reserved. */
 	ASSERT(args.fsbno != NULLFSBLOCK);
 	ASSERT(args.len == 1);
diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
index a5a8d37..9e7df9d 100644
--- a/fs/xfs/libxfs/xfs_bmap_btree.c
+++ b/fs/xfs/libxfs/xfs_bmap_btree.c
@@ -452,6 +452,7 @@ xfs_bmbt_alloc_block(
 
 	if (args.fsbno == NULLFSBLOCK) {
 		args.fsbno = be64_to_cpu(start->l);
+try_another_ag:
 		args.type = XFS_ALLOCTYPE_START_BNO;
 		/*
 		 * Make sure there is sufficient room left in the AG to
@@ -481,6 +482,22 @@ xfs_bmbt_alloc_block(
 	if (error)
 		goto error0;
 
+	/*
+	 * During a CoW operation, the allocation and bmbt updates occur in
+	 * different transactions.  The mapping code tries to put new bmbt
+	 * blocks near extents being mapped, but the only way to guarantee this
+	 * is if the alloc and the mapping happen in a single transaction that
+	 * has a block reservation.  That isn't the case here, so if we run out
+	 * of space we'll try again with another AG.
+	 */
+	if (xfs_sb_version_hasreflink(&cur->bc_mp->m_sb) &&
+	    args.fsbno == NULLFSBLOCK &&
+	    args.type == XFS_ALLOCTYPE_NEAR_BNO) {
+		cur->bc_private.b.dfops->dop_low = true;
+		args.fsbno = cur->bc_private.b.firstblock;
+		goto try_another_ag;
+	}
+
 	if (args.fsbno == NULLFSBLOCK && args.minleft) {
 		/*
 		 * Could not find an AG with enough free space to satisfy


^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 101/119] xfs: promote buffered writes to CoW when cowextsz is set
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (99 preceding siblings ...)
  2016-06-17  1:28 ` [PATCH 100/119] xfs: try other AGs to allocate a BMBT block Darrick J. Wong
@ 2016-06-17  1:28 ` Darrick J. Wong
  2016-06-17  1:28 ` [PATCH 102/119] xfs: garbage collect old cowextsz reservations Darrick J. Wong
                   ` (17 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:28 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

When we're doing non-cow writes to a part of a file that already has a
CoW reservation by virtue of cowextsz being set, promote the write to
copy-on-write so that the entire extent can get written out as a single
extent, thereby reducing post-CoW fragmentation.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_aops.c |   39 ++++++++++++++++++---------------------
 1 file changed, 18 insertions(+), 21 deletions(-)


diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 812bae5..31205fa 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -668,10 +668,12 @@ xfs_check_page_type(
 	bh = head = page_buffers(page);
 	do {
 		if (buffer_unwritten(bh)) {
-			if (type == XFS_IO_UNWRITTEN)
+			if (type == XFS_IO_UNWRITTEN ||
+			    type == XFS_IO_COW)
 				return true;
 		} else if (buffer_delay(bh)) {
-			if (type == XFS_IO_DELALLOC)
+			if (type == XFS_IO_DELALLOC ||
+			    type == XFS_IO_COW)
 				return true;
 		} else if (buffer_dirty(bh) && buffer_mapped(bh)) {
 			if (type == XFS_IO_OVERWRITE ||
@@ -836,25 +838,13 @@ xfs_writepage_map(
 			continue;
 		}
 
-		if (buffer_unwritten(bh)) {
-			if (wpc->io_type != XFS_IO_UNWRITTEN) {
-				wpc->io_type = XFS_IO_UNWRITTEN;
-				wpc->imap_valid = false;
-			}
-		} else if (buffer_delay(bh)) {
-			if (wpc->io_type != XFS_IO_DELALLOC) {
-				wpc->io_type = XFS_IO_DELALLOC;
-				wpc->imap_valid = false;
-			}
-		} else if (buffer_uptodate(bh)) {
-			new_type = xfs_is_cow_io(XFS_I(inode), offset) ?
-					XFS_IO_COW : XFS_IO_OVERWRITE;
-
-			if (wpc->io_type != new_type) {
-				wpc->io_type = new_type;
-				wpc->imap_valid = false;
-			}
-		} else {
+		if (buffer_unwritten(bh))
+			new_type = XFS_IO_UNWRITTEN;
+		else if (buffer_delay(bh))
+			new_type = XFS_IO_DELALLOC;
+		else if (buffer_uptodate(bh))
+			new_type = XFS_IO_OVERWRITE;
+		else {
 			if (PageUptodate(page))
 				ASSERT(buffer_mapped(bh));
 			/*
@@ -867,6 +857,13 @@ xfs_writepage_map(
 			continue;
 		}
 
+		if (xfs_is_cow_io(XFS_I(inode), offset))
+			new_type = XFS_IO_COW;
+		if (wpc->io_type != new_type) {
+			wpc->io_type = new_type;
+			wpc->imap_valid = false;
+		}
+
 		if (wpc->imap_valid)
 			wpc->imap_valid = xfs_imap_valid(inode, &wpc->imap,
 							 offset);


^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 102/119] xfs: garbage collect old cowextsz reservations
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (100 preceding siblings ...)
  2016-06-17  1:28 ` [PATCH 101/119] xfs: promote buffered writes to CoW when cowextsz is set Darrick J. Wong
@ 2016-06-17  1:28 ` Darrick J. Wong
  2016-06-17  1:28 ` [PATCH 103/119] xfs: provide switch to force filesystem to copy-on-write all the time Darrick J. Wong
                   ` (16 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:28 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Trim CoW reservations made on behalf of a cowextsz hint if they get too
old or we run low on quota, so long as we don't have dirty data awaiting
writeback or directio operations in progress.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_bmap_util.c |    3 +
 fs/xfs/xfs_file.c      |    3 +
 fs/xfs/xfs_globals.c   |    5 +
 fs/xfs/xfs_icache.c    |  238 ++++++++++++++++++++++++++++++++++++++++++------
 fs/xfs/xfs_icache.h    |    7 +
 fs/xfs/xfs_inode.c     |    4 +
 fs/xfs/xfs_iomap.c     |    2 
 fs/xfs/xfs_linux.h     |    1 
 fs/xfs/xfs_mount.c     |    1 
 fs/xfs/xfs_mount.h     |    2 
 fs/xfs/xfs_reflink.c   |   35 +++++++
 fs/xfs/xfs_reflink.h   |    2 
 fs/xfs/xfs_super.c     |    1 
 fs/xfs/xfs_sysctl.c    |    9 ++
 fs/xfs/xfs_sysctl.h    |    1 
 fs/xfs/xfs_trace.h     |    5 +
 16 files changed, 287 insertions(+), 32 deletions(-)


diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index b0c2c6d5..584f276 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1222,6 +1222,7 @@ xfs_free_cow_space(
 	 */
 	if (ip->i_d.di_nblocks == 0) {
 		ip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
+		xfs_inode_clear_cowblocks_tag(ip);
 		xfs_trans_log_inode(*tpp, ip, XFS_ILOG_CORE);
 	}
 out:
@@ -1966,6 +1967,8 @@ xfs_swap_extents(
 		cowfp = ip->i_cowfp;
 		ip->i_cowfp = tip->i_cowfp;
 		tip->i_cowfp = cowfp;
+		xfs_inode_set_cowblocks_tag(ip);
+		xfs_inode_set_cowblocks_tag(tip);
 	}
 
 	xfs_trans_log_inode(tp, ip,  src_log_flags);
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 618bd12..ad6a467 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -859,6 +859,9 @@ write_retry:
 		enospc = xfs_inode_free_quota_eofblocks(ip);
 		if (enospc)
 			goto write_retry;
+		enospc = xfs_inode_free_quota_cowblocks(ip);
+		if (enospc)
+			goto write_retry;
 	} else if (ret == -ENOSPC && !enospc) {
 		struct xfs_eofblocks eofb = {0};
 
diff --git a/fs/xfs/xfs_globals.c b/fs/xfs/xfs_globals.c
index 4d41b24..f3f6aa9 100644
--- a/fs/xfs/xfs_globals.c
+++ b/fs/xfs/xfs_globals.c
@@ -21,8 +21,8 @@
 /*
  * Tunable XFS parameters.  xfs_params is required even when CONFIG_SYSCTL=n,
  * other XFS code uses these values.  Times are measured in centisecs (i.e.
- * 100ths of a second) with the exception of eofb_timer, which is measured in
- * seconds.
+ * 100ths of a second) with the exception of eofb_timer and cowb_timer, which
+ * are measured in seconds.
  */
 xfs_param_t xfs_params = {
 			  /*	MIN		DFLT		MAX	*/
@@ -42,6 +42,7 @@ xfs_param_t xfs_params = {
 	.inherit_nodfrg	= {	0,		1,		1	},
 	.fstrm_timer	= {	1,		30*100,		3600*100},
 	.eofb_timer	= {	1,		300,		3600*24},
+	.cowb_timer	= {	1,		300,		3600*24},
 };
 
 struct xfs_globals xfs_globals = {
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 06f3b8c..884b570 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -33,6 +33,7 @@
 #include "xfs_bmap_util.h"
 #include "xfs_dquot_item.h"
 #include "xfs_dquot.h"
+#include "xfs_reflink.h"
 
 #include <linux/kthread.h>
 #include <linux/freezer.h>
@@ -792,6 +793,33 @@ xfs_eofblocks_worker(
 	xfs_queue_eofblocks(mp);
 }
 
+/*
+ * Background scanning to trim preallocated CoW space. This is queued
+ * based on the 'speculative_cow_prealloc_lifetime' tunable (5m by default).
+ * (We'll just piggyback on the post-EOF prealloc space workqueue.)
+ */
+STATIC void
+xfs_queue_cowblocks(
+	struct xfs_mount *mp)
+{
+	rcu_read_lock();
+	if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_COWBLOCKS_TAG))
+		queue_delayed_work(mp->m_eofblocks_workqueue,
+				   &mp->m_cowblocks_work,
+				   msecs_to_jiffies(xfs_cowb_secs * 1000));
+	rcu_read_unlock();
+}
+
+void
+xfs_cowblocks_worker(
+	struct work_struct *work)
+{
+	struct xfs_mount *mp = container_of(to_delayed_work(work),
+				struct xfs_mount, m_cowblocks_work);
+	xfs_icache_free_cowblocks(mp, NULL);
+	xfs_queue_cowblocks(mp);
+}
+
 int
 xfs_inode_ag_iterator(
 	struct xfs_mount	*mp,
@@ -1348,18 +1376,30 @@ xfs_inode_free_eofblocks(
 	return ret;
 }
 
-int
-xfs_icache_free_eofblocks(
+static int
+__xfs_icache_free_eofblocks(
 	struct xfs_mount	*mp,
-	struct xfs_eofblocks	*eofb)
+	struct xfs_eofblocks	*eofb,
+	int			(*execute)(struct xfs_inode *ip, int flags,
+					   void *args),
+	int			tag)
 {
 	int flags = SYNC_TRYLOCK;
 
 	if (eofb && (eofb->eof_flags & XFS_EOF_FLAGS_SYNC))
 		flags = SYNC_WAIT;
 
-	return xfs_inode_ag_iterator_tag(mp, xfs_inode_free_eofblocks, flags,
-					 eofb, XFS_ICI_EOFBLOCKS_TAG);
+	return xfs_inode_ag_iterator_tag(mp, execute, flags,
+					 eofb, tag);
+}
+
+int
+xfs_icache_free_eofblocks(
+	struct xfs_mount	*mp,
+	struct xfs_eofblocks	*eofb)
+{
+	return __xfs_icache_free_eofblocks(mp, eofb, xfs_inode_free_eofblocks,
+			XFS_ICI_EOFBLOCKS_TAG);
 }
 
 /*
@@ -1368,9 +1408,11 @@ xfs_icache_free_eofblocks(
  * failure. We make a best effort by including each quota under low free space
  * conditions (less than 1% free space) in the scan.
  */
-int
-xfs_inode_free_quota_eofblocks(
-	struct xfs_inode *ip)
+static int
+__xfs_inode_free_quota_eofblocks(
+	struct xfs_inode	*ip,
+	int			(*execute)(struct xfs_mount *mp,
+					   struct xfs_eofblocks	*eofb))
 {
 	int scan = 0;
 	struct xfs_eofblocks eofb = {0};
@@ -1406,14 +1448,25 @@ xfs_inode_free_quota_eofblocks(
 	}
 
 	if (scan)
-		xfs_icache_free_eofblocks(ip->i_mount, &eofb);
+		execute(ip->i_mount, &eofb);
 
 	return scan;
 }
 
-void
-xfs_inode_set_eofblocks_tag(
-	xfs_inode_t	*ip)
+int
+xfs_inode_free_quota_eofblocks(
+	struct xfs_inode *ip)
+{
+	return __xfs_inode_free_quota_eofblocks(ip, xfs_icache_free_eofblocks);
+}
+
+static void
+__xfs_inode_set_eofblocks_tag(
+	xfs_inode_t	*ip,
+	void		(*execute)(struct xfs_mount *mp),
+	void		(*set_tp)(struct xfs_mount *mp, xfs_agnumber_t agno,
+				  int error, unsigned long caller_ip),
+	int		tag)
 {
 	struct xfs_mount *mp = ip->i_mount;
 	struct xfs_perag *pag;
@@ -1421,26 +1474,22 @@ xfs_inode_set_eofblocks_tag(
 
 	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
 	spin_lock(&pag->pag_ici_lock);
-	trace_xfs_inode_set_eofblocks_tag(ip);
 
-	tagged = radix_tree_tagged(&pag->pag_ici_root,
-				   XFS_ICI_EOFBLOCKS_TAG);
+	tagged = radix_tree_tagged(&pag->pag_ici_root, tag);
 	radix_tree_tag_set(&pag->pag_ici_root,
-			   XFS_INO_TO_AGINO(ip->i_mount, ip->i_ino),
-			   XFS_ICI_EOFBLOCKS_TAG);
+			   XFS_INO_TO_AGINO(ip->i_mount, ip->i_ino), tag);
 	if (!tagged) {
 		/* propagate the eofblocks tag up into the perag radix tree */
 		spin_lock(&ip->i_mount->m_perag_lock);
 		radix_tree_tag_set(&ip->i_mount->m_perag_tree,
 				   XFS_INO_TO_AGNO(ip->i_mount, ip->i_ino),
-				   XFS_ICI_EOFBLOCKS_TAG);
+				   tag);
 		spin_unlock(&ip->i_mount->m_perag_lock);
 
 		/* kick off background trimming */
-		xfs_queue_eofblocks(ip->i_mount);
+		execute(ip->i_mount);
 
-		trace_xfs_perag_set_eofblocks(ip->i_mount, pag->pag_agno,
-					      -1, _RET_IP_);
+		set_tp(ip->i_mount, pag->pag_agno, -1, _RET_IP_);
 	}
 
 	spin_unlock(&pag->pag_ici_lock);
@@ -1448,31 +1497,162 @@ xfs_inode_set_eofblocks_tag(
 }
 
 void
-xfs_inode_clear_eofblocks_tag(
+xfs_inode_set_eofblocks_tag(
 	xfs_inode_t	*ip)
 {
+	trace_xfs_inode_set_eofblocks_tag(ip);
+	return __xfs_inode_set_eofblocks_tag(ip, xfs_queue_eofblocks,
+			trace_xfs_perag_set_eofblocks,
+			XFS_ICI_EOFBLOCKS_TAG);
+}
+
+static void
+__xfs_inode_clear_eofblocks_tag(
+	xfs_inode_t	*ip,
+	void		(*clear_tp)(struct xfs_mount *mp, xfs_agnumber_t agno,
+				    int error, unsigned long caller_ip),
+	int		tag)
+{
 	struct xfs_mount *mp = ip->i_mount;
 	struct xfs_perag *pag;
 
 	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
 	spin_lock(&pag->pag_ici_lock);
-	trace_xfs_inode_clear_eofblocks_tag(ip);
 
 	radix_tree_tag_clear(&pag->pag_ici_root,
-			     XFS_INO_TO_AGINO(ip->i_mount, ip->i_ino),
-			     XFS_ICI_EOFBLOCKS_TAG);
-	if (!radix_tree_tagged(&pag->pag_ici_root, XFS_ICI_EOFBLOCKS_TAG)) {
+			     XFS_INO_TO_AGINO(ip->i_mount, ip->i_ino), tag);
+	if (!radix_tree_tagged(&pag->pag_ici_root, tag)) {
 		/* clear the eofblocks tag from the perag radix tree */
 		spin_lock(&ip->i_mount->m_perag_lock);
 		radix_tree_tag_clear(&ip->i_mount->m_perag_tree,
 				     XFS_INO_TO_AGNO(ip->i_mount, ip->i_ino),
-				     XFS_ICI_EOFBLOCKS_TAG);
+				     tag);
 		spin_unlock(&ip->i_mount->m_perag_lock);
-		trace_xfs_perag_clear_eofblocks(ip->i_mount, pag->pag_agno,
-					       -1, _RET_IP_);
+		clear_tp(ip->i_mount, pag->pag_agno, -1, _RET_IP_);
 	}
 
 	spin_unlock(&pag->pag_ici_lock);
 	xfs_perag_put(pag);
 }
 
+void
+xfs_inode_clear_eofblocks_tag(
+	xfs_inode_t	*ip)
+{
+	trace_xfs_inode_clear_eofblocks_tag(ip);
+	return __xfs_inode_clear_eofblocks_tag(ip,
+			trace_xfs_perag_clear_eofblocks, XFS_ICI_EOFBLOCKS_TAG);
+}
+
+/*
+ * Automatic CoW Reservation Freeing
+ *
+ * These functions automatically garbage collect leftover CoW reservations
+ * that were made on behalf of a cowextsize hint when we start to run out
+ * of quota or when the reservations sit around for too long.  If the file
+ * has dirty pages or is undergoing writeback, its CoW reservations will
+ * be retained.
+ *
+ * The actual garbage collection piggybacks off the same code that runs
+ * the speculative EOF preallocation garbage collector.
+ */
+STATIC int
+xfs_inode_free_cowblocks(
+	struct xfs_inode	*ip,
+	int			flags,
+	void			*args)
+{
+	int ret;
+	struct xfs_eofblocks *eofb = args;
+	bool need_iolock = true;
+	int match;
+
+	ASSERT(!eofb || (eofb && eofb->eof_scan_owner != 0));
+
+	if (!xfs_reflink_has_real_cow_blocks(ip)) {
+		trace_xfs_inode_free_cowblocks_invalid(ip);
+		xfs_inode_clear_cowblocks_tag(ip);
+		return 0;
+	}
+
+	/*
+	 * If the mapping is dirty or under writeback we cannot touch the
+	 * CoW fork.  Leave it alone if we're in the midst of a directio.
+	 */
+	if (mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_DIRTY) ||
+	    mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_WRITEBACK) ||
+	    atomic_read(&VFS_I(ip)->i_dio_count))
+		return 0;
+
+	if (eofb) {
+		if (eofb->eof_flags & XFS_EOF_FLAGS_UNION)
+			match = xfs_inode_match_id_union(ip, eofb);
+		else
+			match = xfs_inode_match_id(ip, eofb);
+		if (!match)
+			return 0;
+
+		/* skip the inode if the file size is too small */
+		if (eofb->eof_flags & XFS_EOF_FLAGS_MINFILESIZE &&
+		    XFS_ISIZE(ip) < eofb->eof_min_file_size)
+			return 0;
+
+		/*
+		 * A scan owner implies we already hold the iolock. Skip it in
+		 * xfs_free_eofblocks() to avoid deadlock. This also eliminates
+		 * the possibility of EAGAIN being returned.
+		 */
+		if (eofb->eof_scan_owner == ip->i_ino)
+			need_iolock = false;
+	}
+
+	/* Free the CoW blocks */
+	if (need_iolock) {
+		xfs_ilock(ip, XFS_IOLOCK_EXCL);
+		xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
+	}
+
+	ret = xfs_reflink_cancel_cow_range(ip, 0, NULLFILEOFF);
+
+	if (need_iolock) {
+		xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
+		xfs_iunlock(ip, XFS_IOLOCK_EXCL);
+	}
+
+	return ret;
+}
+
+int
+xfs_icache_free_cowblocks(
+	struct xfs_mount	*mp,
+	struct xfs_eofblocks	*eofb)
+{
+	return __xfs_icache_free_eofblocks(mp, eofb, xfs_inode_free_cowblocks,
+			XFS_ICI_COWBLOCKS_TAG);
+}
+
+int
+xfs_inode_free_quota_cowblocks(
+	struct xfs_inode *ip)
+{
+	return __xfs_inode_free_quota_eofblocks(ip, xfs_icache_free_cowblocks);
+}
+
+void
+xfs_inode_set_cowblocks_tag(
+	xfs_inode_t	*ip)
+{
+	trace_xfs_inode_set_cowblocks_tag(ip);
+	return __xfs_inode_set_eofblocks_tag(ip, xfs_queue_cowblocks,
+			trace_xfs_perag_set_cowblocks,
+			XFS_ICI_COWBLOCKS_TAG);
+}
+
+void
+xfs_inode_clear_cowblocks_tag(
+	xfs_inode_t	*ip)
+{
+	trace_xfs_inode_clear_cowblocks_tag(ip);
+	return __xfs_inode_clear_eofblocks_tag(ip,
+			trace_xfs_perag_clear_cowblocks, XFS_ICI_COWBLOCKS_TAG);
+}
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index 62f1f91..a22ac92 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -40,6 +40,7 @@ struct xfs_eofblocks {
 					   in xfs_inode_ag_iterator */
 #define XFS_ICI_RECLAIM_TAG	0	/* inode is to be reclaimed */
 #define XFS_ICI_EOFBLOCKS_TAG	1	/* inode has blocks beyond EOF */
+#define XFS_ICI_COWBLOCKS_TAG	2	/* inode can have cow blocks to gc */
 
 /*
  * Flags for xfs_iget()
@@ -69,6 +70,12 @@ int xfs_icache_free_eofblocks(struct xfs_mount *, struct xfs_eofblocks *);
 int xfs_inode_free_quota_eofblocks(struct xfs_inode *ip);
 void xfs_eofblocks_worker(struct work_struct *);
 
+void xfs_inode_set_cowblocks_tag(struct xfs_inode *ip);
+void xfs_inode_clear_cowblocks_tag(struct xfs_inode *ip);
+int xfs_icache_free_cowblocks(struct xfs_mount *, struct xfs_eofblocks *);
+int xfs_inode_free_quota_cowblocks(struct xfs_inode *ip);
+void xfs_cowblocks_worker(struct work_struct *);
+
 int xfs_inode_ag_iterator(struct xfs_mount *mp,
 	int (*execute)(struct xfs_inode *ip, int flags, void *args),
 	int flags, void *args);
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 480e48a..fb9c2d7 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1631,8 +1631,10 @@ xfs_itruncate_extents(
 	/*
 	 * Clear the reflink flag if we truncated everything.
 	 */
-	if (ip->i_d.di_nblocks == 0 && xfs_is_reflink_inode(ip))
+	if (ip->i_d.di_nblocks == 0 && xfs_is_reflink_inode(ip)) {
 		ip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
+		xfs_inode_clear_cowblocks_tag(ip);
+	}
 
 	/*
 	 * Always re-log the inode so that our permanent transaction can keep
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 3914f0f..58240b5 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -668,6 +668,8 @@ retry:
 	 */
 	if (prealloc)
 		xfs_inode_set_eofblocks_tag(ip);
+	if (whichfork == XFS_COW_FORK && extsz > 0)
+		xfs_inode_set_cowblocks_tag(ip);
 
 	*ret_imap = imap[0];
 	return 0;
diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
index a8192dc..f05e2cf5 100644
--- a/fs/xfs/xfs_linux.h
+++ b/fs/xfs/xfs_linux.h
@@ -116,6 +116,7 @@ typedef __u32			xfs_nlink_t;
 #define xfs_inherit_nodefrag	xfs_params.inherit_nodfrg.val
 #define xfs_fstrm_centisecs	xfs_params.fstrm_timer.val
 #define xfs_eofb_secs		xfs_params.eofb_timer.val
+#define xfs_cowb_secs		xfs_params.cowb_timer.val
 
 #define current_cpu()		(raw_smp_processor_id())
 #define current_pid()		(current->pid)
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index db80832..e53853d 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -1015,6 +1015,7 @@ xfs_unmountfs(
 	int			error;
 
 	cancel_delayed_work_sync(&mp->m_eofblocks_work);
+	cancel_delayed_work_sync(&mp->m_cowblocks_work);
 
 	xfs_fs_unreserve_ag_blocks(mp);
 	xfs_qm_unmount_quotas(mp);
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index a516a1f..6b06d24 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -158,6 +158,8 @@ typedef struct xfs_mount {
 	struct delayed_work	m_reclaim_work;	/* background inode reclaim */
 	struct delayed_work	m_eofblocks_work; /* background eof blocks
 						     trimming */
+	struct delayed_work	m_cowblocks_work; /* background cow blocks
+						     trimming */
 	bool			m_update_sb;	/* sb needs update in mount */
 	int64_t			m_low_space[XFS_LOWSP_MAX];
 						/* low free space thresholds */
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index f5195b7..9b14eb5 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1727,6 +1727,7 @@ next:
 	/* Clear the inode flag. */
 	trace_xfs_reflink_unset_inode_flag(ip);
 	ip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
+	xfs_inode_clear_cowblocks_tag(ip);
 	xfs_trans_ijoin(tp, ip, 0);
 	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
 
@@ -1864,3 +1865,37 @@ xfs_reflink_check_flag_adjust(
 	}
 	return 0;
 }
+
+/*
+ * Does this inode have any real CoW reservations?
+ */
+bool
+xfs_reflink_has_real_cow_blocks(
+	struct xfs_inode		*ip)
+{
+	struct xfs_bmbt_irec		irec;
+	struct xfs_ifork		*ifp;
+	struct xfs_bmbt_rec_host	*gotp;
+	xfs_extnum_t			idx;
+
+	if (!xfs_is_reflink_inode(ip))
+		return false;
+
+	/* Go find the old extent in the CoW fork. */
+	ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
+	gotp = xfs_iext_bno_to_ext(ifp, 0, &idx);
+	while (gotp) {
+		xfs_bmbt_get_all(gotp, &irec);
+
+		if (!isnullstartblock(irec.br_startblock))
+			return true;
+
+		/* Roll on... */
+		idx++;
+		if (idx >= ifp->if_bytes / sizeof(xfs_bmbt_rec_t))
+			break;
+		gotp = xfs_iext_get_ext(ifp, idx);
+	}
+
+	return false;
+}
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index 97e8705..12c2bc6 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -56,6 +56,8 @@ extern void xfs_reflink_get_lxflags(struct xfs_inode *ip, unsigned int *flags);
 extern int xfs_reflink_check_flag_adjust(struct xfs_inode *ip,
 		unsigned int *xflags);
 
+extern bool xfs_reflink_has_real_cow_blocks(struct xfs_inode *ip);
+
 /* xfs_aops.c */
 extern int xfs_map_cow_blocks(struct inode *inode, xfs_off_t offset,
 		struct xfs_bmbt_irec *imap);
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 10a0f721..93d159a 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1504,6 +1504,7 @@ xfs_fs_fill_super(
 	atomic_set(&mp->m_active_trans, 0);
 	INIT_DELAYED_WORK(&mp->m_reclaim_work, xfs_reclaim_worker);
 	INIT_DELAYED_WORK(&mp->m_eofblocks_work, xfs_eofblocks_worker);
+	INIT_DELAYED_WORK(&mp->m_cowblocks_work, xfs_cowblocks_worker);
 	mp->m_kobj.kobject.kset = xfs_kset;
 
 	mp->m_super = sb;
diff --git a/fs/xfs/xfs_sysctl.c b/fs/xfs/xfs_sysctl.c
index aed74d3..afe1f66 100644
--- a/fs/xfs/xfs_sysctl.c
+++ b/fs/xfs/xfs_sysctl.c
@@ -184,6 +184,15 @@ static struct ctl_table xfs_table[] = {
 		.extra1		= &xfs_params.eofb_timer.min,
 		.extra2		= &xfs_params.eofb_timer.max,
 	},
+	{
+		.procname	= "speculative_cow_prealloc_lifetime",
+		.data		= &xfs_params.cowb_timer.val,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &xfs_params.cowb_timer.min,
+		.extra2		= &xfs_params.cowb_timer.max,
+	},
 	/* please keep this the last entry */
 #ifdef CONFIG_PROC_FS
 	{
diff --git a/fs/xfs/xfs_sysctl.h b/fs/xfs/xfs_sysctl.h
index ffef453..984a349 100644
--- a/fs/xfs/xfs_sysctl.h
+++ b/fs/xfs/xfs_sysctl.h
@@ -48,6 +48,7 @@ typedef struct xfs_param {
 	xfs_sysctl_val_t inherit_nodfrg;/* Inherit the "nodefrag" inode flag. */
 	xfs_sysctl_val_t fstrm_timer;	/* Filestream dir-AG assoc'n timeout. */
 	xfs_sysctl_val_t eofb_timer;	/* Interval between eofb scan wakeups */
+	xfs_sysctl_val_t cowb_timer;	/* Interval between cowb scan wakeups */
 } xfs_param_t;
 
 /*
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index fe4a5be..1d89f8f 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -136,6 +136,8 @@ DEFINE_PERAG_REF_EVENT(xfs_perag_set_reclaim);
 DEFINE_PERAG_REF_EVENT(xfs_perag_clear_reclaim);
 DEFINE_PERAG_REF_EVENT(xfs_perag_set_eofblocks);
 DEFINE_PERAG_REF_EVENT(xfs_perag_clear_eofblocks);
+DEFINE_PERAG_REF_EVENT(xfs_perag_set_cowblocks);
+DEFINE_PERAG_REF_EVENT(xfs_perag_clear_cowblocks);
 
 DECLARE_EVENT_CLASS(xfs_ag_class,
 	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno),
@@ -687,6 +689,9 @@ DEFINE_INODE_EVENT(xfs_dquot_dqdetach);
 DEFINE_INODE_EVENT(xfs_inode_set_eofblocks_tag);
 DEFINE_INODE_EVENT(xfs_inode_clear_eofblocks_tag);
 DEFINE_INODE_EVENT(xfs_inode_free_eofblocks_invalid);
+DEFINE_INODE_EVENT(xfs_inode_set_cowblocks_tag);
+DEFINE_INODE_EVENT(xfs_inode_clear_cowblocks_tag);
+DEFINE_INODE_EVENT(xfs_inode_free_cowblocks_invalid);
 
 DEFINE_INODE_EVENT(xfs_filemap_fault);
 DEFINE_INODE_EVENT(xfs_filemap_pmd_fault);


^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 103/119] xfs: provide switch to force filesystem to copy-on-write all the time
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (101 preceding siblings ...)
  2016-06-17  1:28 ` [PATCH 102/119] xfs: garbage collect old cowextsz reservations Darrick J. Wong
@ 2016-06-17  1:28 ` Darrick J. Wong
  2016-06-17  1:29 ` [PATCH 104/119] xfs: increase log reservations for reflink Darrick J. Wong
                   ` (15 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:28 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Add a DEBUG-only sysctl to force XFS to use copy-on-write all the
time, at least when reflink is enabled.
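
The heart of the switch is a short-circuit in xfs_refcount_find_shared():
when the tunable is set, the whole queried range is reported as shared so
every overwrite takes the CoW path.  As an illustrative sketch (types and
names simplified here, not the kernel's actual signatures), the behavior
can be modeled outside the kernel like this:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for the kernel's AG block/length types. */
typedef unsigned int agblock_t;
typedef unsigned int extlen_t;

/*
 * Toy model of the always_cow short-circuit: if the switch is on,
 * claim the entire range [agbno, agbno + aglen) is shared.  The real
 * function would otherwise consult the refcount btree; that lookup is
 * omitted here and modeled as "nothing shared".
 */
static void find_shared(bool always_cow, agblock_t agbno, extlen_t aglen,
			agblock_t *fbno, extlen_t *flen)
{
	if (always_cow) {
		*fbno = agbno;
		*flen = aglen;
		return;
	}
	/* Normal path: refcount btree lookup (omitted). */
	*fbno = 0;
	*flen = 0;
}
```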

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_refcount.c |    6 ++++++
 fs/xfs/xfs_globals.c         |    1 +
 fs/xfs/xfs_linux.h           |    1 +
 fs/xfs/xfs_sysctl.c          |   11 +++++++++++
 fs/xfs/xfs_sysctl.h          |    1 +
 5 files changed, 20 insertions(+)


diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
index 88f91d5..fd4369f 100644
--- a/fs/xfs/libxfs/xfs_refcount.c
+++ b/fs/xfs/libxfs/xfs_refcount.c
@@ -1190,6 +1190,12 @@ xfs_refcount_find_shared(
 
 	trace_xfs_refcount_find_shared(mp, agno, agbno, aglen);
 
+	if (xfs_always_cow) {
+		*fbno = agbno;
+		*flen = aglen;
+		return 0;
+	}
+
 	error = xfs_alloc_read_agf(mp, NULL, agno, 0, &agbp);
 	if (error)
 		goto out;
diff --git a/fs/xfs/xfs_globals.c b/fs/xfs/xfs_globals.c
index f3f6aa9..9a55966 100644
--- a/fs/xfs/xfs_globals.c
+++ b/fs/xfs/xfs_globals.c
@@ -43,6 +43,7 @@ xfs_param_t xfs_params = {
 	.fstrm_timer	= {	1,		30*100,		3600*100},
 	.eofb_timer	= {	1,		300,		3600*24},
 	.cowb_timer	= {	1,		300,		3600*24},
+	.always_cow	= {	0,		0,		1	},
 };
 
 struct xfs_globals xfs_globals = {
diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
index f05e2cf5..b70abad 100644
--- a/fs/xfs/xfs_linux.h
+++ b/fs/xfs/xfs_linux.h
@@ -117,6 +117,7 @@ typedef __u32			xfs_nlink_t;
 #define xfs_fstrm_centisecs	xfs_params.fstrm_timer.val
 #define xfs_eofb_secs		xfs_params.eofb_timer.val
 #define xfs_cowb_secs		xfs_params.cowb_timer.val
+#define xfs_always_cow		xfs_params.always_cow.val
 
 #define current_cpu()		(raw_smp_processor_id())
 #define current_pid()		(current->pid)
diff --git a/fs/xfs/xfs_sysctl.c b/fs/xfs/xfs_sysctl.c
index afe1f66..650b8d5 100644
--- a/fs/xfs/xfs_sysctl.c
+++ b/fs/xfs/xfs_sysctl.c
@@ -193,6 +193,17 @@ static struct ctl_table xfs_table[] = {
 		.extra1		= &xfs_params.cowb_timer.min,
 		.extra2		= &xfs_params.cowb_timer.max,
 	},
+#ifdef DEBUG
+	{
+		.procname	= "always_cow",
+		.data		= &xfs_params.always_cow.val,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &xfs_params.always_cow.min,
+		.extra2		= &xfs_params.always_cow.max,
+	},
+#endif
 	/* please keep this the last entry */
 #ifdef CONFIG_PROC_FS
 	{
diff --git a/fs/xfs/xfs_sysctl.h b/fs/xfs/xfs_sysctl.h
index 984a349..16099dc 100644
--- a/fs/xfs/xfs_sysctl.h
+++ b/fs/xfs/xfs_sysctl.h
@@ -49,6 +49,7 @@ typedef struct xfs_param {
 	xfs_sysctl_val_t fstrm_timer;	/* Filestream dir-AG assoc'n timeout. */
 	xfs_sysctl_val_t eofb_timer;	/* Interval between eofb scan wakeups */
 	xfs_sysctl_val_t cowb_timer;	/* Interval between cowb scan wakeups */
+	xfs_sysctl_val_t always_cow;	/* Always copy on write? */
 } xfs_param_t;
 
 /*



* [PATCH 104/119] xfs: increase log reservations for reflink
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (102 preceding siblings ...)
  2016-06-17  1:28 ` [PATCH 103/119] xfs: provide switch to force filesystem to copy-on-write all the time Darrick J. Wong
@ 2016-06-17  1:29 ` Darrick J. Wong
  2016-06-17  1:29 ` [PATCH 105/119] xfs: use interval query for rmap alloc operations on shared files Darrick J. Wong
                   ` (14 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:29 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Increase the log reservations to handle the additional transaction
rolling that happens at the end of copy-on-write operations.
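
The worst-case log space a permanent transaction can consume is roughly
its per-roll reservation times the reserved roll count, so raising the
write log count from 2 to 8 quadruples the headroom when reflink is
enabled.  A toy calculation (units and exact accounting simplified; only
the two constants come from this patch) shows the scaling:

```c
#include <assert.h>
#include <stdbool.h>

/* Log count constants from this patch. */
#define XFS_WRITE_LOG_COUNT		2
#define XFS_WRITE_LOG_COUNT_REFLINK	8

/*
 * A permanent transaction reserves logres bytes per roll, times the
 * number of rolls (logcount) it may make before re-reserving.  This
 * helper is illustrative only, not the kernel's reservation code.
 */
static unsigned int write_log_reservation(unsigned int logres,
					  bool has_reflink)
{
	unsigned int logcount = has_reflink ? XFS_WRITE_LOG_COUNT_REFLINK
					    : XFS_WRITE_LOG_COUNT;
	return logres * logcount;
}
```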

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_trans_resv.c |   16 +++++++++++++---
 fs/xfs/libxfs/xfs_trans_resv.h |    2 ++
 2 files changed, 15 insertions(+), 3 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
index a59838f..b456cca 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.c
+++ b/fs/xfs/libxfs/xfs_trans_resv.c
@@ -812,11 +812,18 @@ xfs_trans_resv_calc(
 	 * require a permanent reservation on space.
 	 */
 	resp->tr_write.tr_logres = xfs_calc_write_reservation(mp);
-	resp->tr_write.tr_logcount = XFS_WRITE_LOG_COUNT;
+	if (xfs_sb_version_hasreflink(&mp->m_sb))
+		resp->tr_write.tr_logcount = XFS_WRITE_LOG_COUNT_REFLINK;
+	else
+		resp->tr_write.tr_logcount = XFS_WRITE_LOG_COUNT;
 	resp->tr_write.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
 
 	resp->tr_itruncate.tr_logres = xfs_calc_itruncate_reservation(mp);
-	resp->tr_itruncate.tr_logcount = XFS_ITRUNCATE_LOG_COUNT;
+	if (xfs_sb_version_hasreflink(&mp->m_sb))
+		resp->tr_itruncate.tr_logcount =
+				XFS_ITRUNCATE_LOG_COUNT_REFLINK;
+	else
+		resp->tr_itruncate.tr_logcount = XFS_ITRUNCATE_LOG_COUNT;
 	resp->tr_itruncate.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
 
 	resp->tr_rename.tr_logres = xfs_calc_rename_reservation(mp);
@@ -873,7 +880,10 @@ xfs_trans_resv_calc(
 	resp->tr_growrtalloc.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
 
 	resp->tr_qm_dqalloc.tr_logres = xfs_calc_qm_dqalloc_reservation(mp);
-	resp->tr_qm_dqalloc.tr_logcount = XFS_WRITE_LOG_COUNT;
+	if (xfs_sb_version_hasreflink(&mp->m_sb))
+		resp->tr_qm_dqalloc.tr_logcount = XFS_WRITE_LOG_COUNT_REFLINK;
+	else
+		resp->tr_qm_dqalloc.tr_logcount = XFS_WRITE_LOG_COUNT;
 	resp->tr_qm_dqalloc.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
 
 	/*
diff --git a/fs/xfs/libxfs/xfs_trans_resv.h b/fs/xfs/libxfs/xfs_trans_resv.h
index 36a1511..b7e5357 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.h
+++ b/fs/xfs/libxfs/xfs_trans_resv.h
@@ -87,6 +87,7 @@ struct xfs_trans_resv {
 #define	XFS_DEFAULT_LOG_COUNT		1
 #define	XFS_DEFAULT_PERM_LOG_COUNT	2
 #define	XFS_ITRUNCATE_LOG_COUNT		2
+#define	XFS_ITRUNCATE_LOG_COUNT_REFLINK	8
 #define XFS_INACTIVE_LOG_COUNT		2
 #define	XFS_CREATE_LOG_COUNT		2
 #define	XFS_CREATE_TMPFILE_LOG_COUNT	2
@@ -96,6 +97,7 @@ struct xfs_trans_resv {
 #define	XFS_LINK_LOG_COUNT		2
 #define	XFS_RENAME_LOG_COUNT		2
 #define	XFS_WRITE_LOG_COUNT		2
+#define	XFS_WRITE_LOG_COUNT_REFLINK	8
 #define	XFS_ADDAFORK_LOG_COUNT		2
 #define	XFS_ATTRINVAL_LOG_COUNT		1
 #define	XFS_ATTRSET_LOG_COUNT		3



* [PATCH 105/119] xfs: use interval query for rmap alloc operations on shared files
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (103 preceding siblings ...)
  2016-06-17  1:29 ` [PATCH 104/119] xfs: increase log reservations for reflink Darrick J. Wong
@ 2016-06-17  1:29 ` Darrick J. Wong
  2016-06-17  1:29 ` [PATCH 106/119] xfs: convert unwritten status of reverse mappings for " Darrick J. Wong
                   ` (13 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:29 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

When it's possible for reverse mappings to overlap (data fork extents
of files on reflink filesystems), use the interval query function to
find the left neighbor of an extent we're trying to add, and be
careful to use the lookup functions to update the neighbors and/or
add new extents.

v2: xfs_rmap_find_left_neighbor() needs to calculate the high key of a
query range correctly.  We can also add a few shortcuts -- there are
no left neighbors of a query at offset zero.
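
For reference, the left-neighbor lookup can be modeled outside the btree
as a scan of sorted records for one that ends exactly at the query block
and has a matching owner, with the offset-zero shortcut noted above.  The
record layout and names below are illustrative, not the on-disk format:

```c
#include <assert.h>
#include <stddef.h>

/* Toy rmap record: physical start block, length, and owner. */
struct rec {
	unsigned int start;
	unsigned int len;
	unsigned int owner;
};

/*
 * Toy model of xfs_rmap_find_left_neighbor(): return the record that
 * is physically adjacent on the left (it ends at bno) and belongs to
 * the same owner, or NULL if there is none.  A query at block 0 can
 * have no left neighbor, so short-circuit that case.
 */
static const struct rec *find_left_neighbor(const struct rec *tbl,
					    size_t n, unsigned int bno,
					    unsigned int owner)
{
	size_t i;

	if (bno == 0)
		return NULL;
	for (i = 0; i < n; i++) {
		if (tbl[i].owner == owner &&
		    tbl[i].start + tbl[i].len == bno)
			return &tbl[i];
	}
	return NULL;
}
```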

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_rmap.c       |  483 ++++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_rmap_btree.h |    7 +
 fs/xfs/xfs_log_recover.c       |    6 
 fs/xfs/xfs_trace.h             |   14 +
 4 files changed, 507 insertions(+), 3 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
index 611107c..a9bd522 100644
--- a/fs/xfs/libxfs/xfs_rmap.c
+++ b/fs/xfs/libxfs/xfs_rmap.c
@@ -211,6 +211,160 @@ xfs_rmap_get_rec(
 	return xfs_rmapbt_btrec_to_irec(rec, irec);
 }
 
+struct xfs_find_left_neighbor_info {
+	struct xfs_rmap_irec	high;
+	struct xfs_rmap_irec	*irec;
+	int			*stat;
+};
+
+/* For each rmap given, figure out if it matches the key we want. */
+STATIC int
+xfs_rmap_find_left_neighbor_helper(
+	struct xfs_btree_cur	*cur,
+	struct xfs_rmap_irec	*rec,
+	void			*priv)
+{
+	struct xfs_find_left_neighbor_info	*info = priv;
+
+	trace_xfs_rmap_find_left_neighbor_candidate(cur->bc_mp,
+			cur->bc_private.a.agno, rec->rm_startblock,
+			rec->rm_blockcount, rec->rm_owner, rec->rm_offset,
+			rec->rm_flags);
+
+	if (rec->rm_owner != info->high.rm_owner)
+		return XFS_BTREE_QUERY_RANGE_CONTINUE;
+	if (!XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) &&
+	    !(rec->rm_flags & XFS_RMAP_BMBT_BLOCK) &&
+	    rec->rm_offset + rec->rm_blockcount - 1 != info->high.rm_offset)
+		return XFS_BTREE_QUERY_RANGE_CONTINUE;
+
+	*info->irec = *rec;
+	*info->stat = 1;
+	return XFS_BTREE_QUERY_RANGE_ABORT;
+}
+
+/*
+ * Find the record to the left of the given extent, being careful only to
+ * return a match with the same owner and adjacent physical and logical
+ * block ranges.
+ */
+int
+xfs_rmap_find_left_neighbor(
+	struct xfs_btree_cur	*cur,
+	xfs_agblock_t		bno,
+	uint64_t		owner,
+	uint64_t		offset,
+	unsigned int		flags,
+	struct xfs_rmap_irec	*irec,
+	int			*stat)
+{
+	struct xfs_find_left_neighbor_info	info;
+	int			error;
+
+	*stat = 0;
+	if (bno == 0)
+		return 0;
+	info.high.rm_startblock = bno - 1;
+	info.high.rm_owner = owner;
+	if (!XFS_RMAP_NON_INODE_OWNER(owner) &&
+	    !(flags & XFS_RMAP_BMBT_BLOCK)) {
+		if (offset == 0)
+			return 0;
+		info.high.rm_offset = offset - 1;
+	} else
+		info.high.rm_offset = 0;
+	info.high.rm_flags = flags;
+	info.high.rm_blockcount = 0;
+	info.irec = irec;
+	info.stat = stat;
+
+	trace_xfs_rmap_find_left_neighbor_query(cur->bc_mp,
+			cur->bc_private.a.agno, bno, 0, owner, offset, flags);
+
+	error = xfs_rmapbt_query_range(cur, &info.high, &info.high,
+			xfs_rmap_find_left_neighbor_helper, &info);
+	if (error == XFS_BTREE_QUERY_RANGE_ABORT)
+		error = 0;
+	if (*stat)
+		trace_xfs_rmap_find_left_neighbor_result(cur->bc_mp,
+				cur->bc_private.a.agno, irec->rm_startblock,
+				irec->rm_blockcount, irec->rm_owner,
+				irec->rm_offset, irec->rm_flags);
+	return error;
+}
+
+/* For each rmap given, figure out if it matches the key we want. */
+STATIC int
+xfs_rmap_lookup_le_range_helper(
+	struct xfs_btree_cur	*cur,
+	struct xfs_rmap_irec	*rec,
+	void			*priv)
+{
+	struct xfs_find_left_neighbor_info	*info = priv;
+
+	trace_xfs_rmap_lookup_le_range_candidate(cur->bc_mp,
+			cur->bc_private.a.agno, rec->rm_startblock,
+			rec->rm_blockcount, rec->rm_owner, rec->rm_offset,
+			rec->rm_flags);
+
+	if (rec->rm_owner != info->high.rm_owner)
+		return XFS_BTREE_QUERY_RANGE_CONTINUE;
+	if (!XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) &&
+	    !(rec->rm_flags & XFS_RMAP_BMBT_BLOCK) &&
+	    (rec->rm_offset > info->high.rm_offset ||
+	     rec->rm_offset + rec->rm_blockcount <= info->high.rm_offset))
+		return XFS_BTREE_QUERY_RANGE_CONTINUE;
+
+	*info->irec = *rec;
+	*info->stat = 1;
+	return XFS_BTREE_QUERY_RANGE_ABORT;
+}
+
+/*
+ * Find the record to the left of the given extent, being careful only to
+ * return a match with the same owner and overlapping physical and logical
+ * block ranges.  This is the overlapping-interval version of
+ * xfs_rmap_lookup_le.
+ */
+int
+xfs_rmap_lookup_le_range(
+	struct xfs_btree_cur	*cur,
+	xfs_agblock_t		bno,
+	uint64_t		owner,
+	uint64_t		offset,
+	unsigned int		flags,
+	struct xfs_rmap_irec	*irec,
+	int			*stat)
+{
+	struct xfs_find_left_neighbor_info	info;
+	int			error;
+
+	info.high.rm_startblock = bno;
+	info.high.rm_owner = owner;
+	if (!XFS_RMAP_NON_INODE_OWNER(owner) && !(flags & XFS_RMAP_BMBT_BLOCK))
+		info.high.rm_offset = offset;
+	else
+		info.high.rm_offset = 0;
+	info.high.rm_flags = flags;
+	info.high.rm_blockcount = 0;
+	*stat = 0;
+	info.irec = irec;
+	info.stat = stat;
+
+	trace_xfs_rmap_lookup_le_range(cur->bc_mp,
+			cur->bc_private.a.agno, bno, 0, owner, offset, flags);
+	error = xfs_rmapbt_query_range(cur, &info.high, &info.high,
+			xfs_rmap_lookup_le_range_helper, &info);
+	if (error == XFS_BTREE_QUERY_RANGE_ABORT)
+		error = 0;
+	if (*stat)
+		trace_xfs_rmap_lookup_le_range_result(cur->bc_mp,
+				cur->bc_private.a.agno, irec->rm_startblock,
+				irec->rm_blockcount, irec->rm_owner,
+				irec->rm_offset, irec->rm_flags);
+	return error;
+}
+
 /*
  * Find the extent in the rmap btree and remove it.
  *
@@ -1159,6 +1313,168 @@ xfs_rmap_unmap(
 }
 
 /*
+ * Find an extent in the rmap btree and unmap it.  For rmap extent types that
+ * can overlap (data fork rmaps on reflink filesystems) we must be careful
+ * that the prev/next records in the btree might belong to another owner.
+ * Therefore we must use delete+insert to alter any of the key fields.
+ *
+ * For every other situation there can only be one owner for a given extent,
+ * so we can call the regular _free function.
+ */
+STATIC int
+xfs_rmap_unmap_shared(
+	struct xfs_btree_cur	*cur,
+	xfs_agblock_t		bno,
+	xfs_extlen_t		len,
+	bool			unwritten,
+	struct xfs_owner_info	*oinfo)
+{
+	struct xfs_mount	*mp = cur->bc_mp;
+	struct xfs_rmap_irec	ltrec;
+	uint64_t		ltoff;
+	int			error = 0;
+	int			i;
+	uint64_t		owner;
+	uint64_t		offset;
+	unsigned int		flags;
+
+	xfs_owner_info_unpack(oinfo, &owner, &offset, &flags);
+	if (unwritten)
+		flags |= XFS_RMAP_UNWRITTEN;
+	trace_xfs_rmap_unmap(mp, cur->bc_private.a.agno, bno, len,
+			unwritten, oinfo);
+
+	/*
+	 * We should always have a left record because there's a static record
+	 * for the AG headers at rm_startblock == 0 created by mkfs/growfs that
+	 * will not ever be removed from the tree.
+	 */
+	error = xfs_rmap_lookup_le_range(cur, bno, owner, offset, flags,
+			&ltrec, &i);
+	if (error)
+		goto out_error;
+	XFS_WANT_CORRUPTED_GOTO(mp, i == 1, out_error);
+	ltoff = ltrec.rm_offset;
+
+	/* Make sure the extent we found covers the entire freeing range. */
+	XFS_WANT_CORRUPTED_GOTO(mp, ltrec.rm_startblock <= bno &&
+		ltrec.rm_startblock + ltrec.rm_blockcount >=
+		bno + len, out_error);
+
+	/* Make sure the owner matches what we expect to find in the tree. */
+	XFS_WANT_CORRUPTED_GOTO(mp, owner == ltrec.rm_owner, out_error);
+
+	/* Make sure the unwritten flag matches. */
+	XFS_WANT_CORRUPTED_GOTO(mp, (flags & XFS_RMAP_UNWRITTEN) ==
+			(ltrec.rm_flags & XFS_RMAP_UNWRITTEN), out_error);
+
+	/* Check the offset. */
+	XFS_WANT_CORRUPTED_GOTO(mp, ltrec.rm_offset <= offset, out_error);
+	XFS_WANT_CORRUPTED_GOTO(mp, offset <= ltoff + ltrec.rm_blockcount,
+			out_error);
+
+	if (ltrec.rm_startblock == bno && ltrec.rm_blockcount == len) {
+		/* Exact match, simply remove the record from rmap tree. */
+		error = xfs_rmapbt_delete(cur, ltrec.rm_startblock,
+				ltrec.rm_blockcount, ltrec.rm_owner,
+				ltrec.rm_offset, ltrec.rm_flags);
+		if (error)
+			goto out_error;
+	} else if (ltrec.rm_startblock == bno) {
+		/*
+		 * Overlap left hand side of extent: move the start, trim the
+		 * length and update the current record.
+		 *
+		 *       ltbno                ltlen
+		 * Orig:    |oooooooooooooooooooo|
+		 * Freeing: |fffffffff|
+		 * Result:            |rrrrrrrrrr|
+		 *         bno       len
+		 */
+
+		/* Delete prev rmap. */
+		error = xfs_rmapbt_delete(cur, ltrec.rm_startblock,
+				ltrec.rm_blockcount, ltrec.rm_owner,
+				ltrec.rm_offset, ltrec.rm_flags);
+		if (error)
+			goto out_error;
+
+		/* Add an rmap at the new offset. */
+		ltrec.rm_startblock += len;
+		ltrec.rm_blockcount -= len;
+		ltrec.rm_offset += len;
+		error = xfs_rmapbt_insert(cur, ltrec.rm_startblock,
+				ltrec.rm_blockcount, ltrec.rm_owner,
+				ltrec.rm_offset, ltrec.rm_flags);
+		if (error)
+			goto out_error;
+	} else if (ltrec.rm_startblock + ltrec.rm_blockcount == bno + len) {
+		/*
+		 * Overlap right hand side of extent: trim the length and
+		 * update the current record.
+		 *
+		 *       ltbno                ltlen
+		 * Orig:    |oooooooooooooooooooo|
+		 * Freeing:            |fffffffff|
+		 * Result:  |rrrrrrrrrr|
+		 *                    bno       len
+		 */
+		error = xfs_rmap_lookup_eq(cur, ltrec.rm_startblock,
+				ltrec.rm_blockcount, ltrec.rm_owner,
+				ltrec.rm_offset, ltrec.rm_flags, &i);
+		if (error)
+			goto out_error;
+		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, out_error);
+		ltrec.rm_blockcount -= len;
+		error = xfs_rmap_update(cur, &ltrec);
+		if (error)
+			goto out_error;
+	} else {
+		/*
+		 * Overlap middle of extent: trim the length of the existing
+		 * record to the length of the new left-extent size, increment
+		 * the insertion position so we can insert a new record
+		 * containing the remaining right-extent space.
+		 *
+		 *       ltbno                ltlen
+		 * Orig:    |oooooooooooooooooooo|
+		 * Freeing:       |fffffffff|
+		 * Result:  |rrrrr|         |rrrr|
+		 *               bno       len
+		 */
+		xfs_extlen_t	orig_len = ltrec.rm_blockcount;
+
+		/* Shrink the left side of the rmap */
+		error = xfs_rmap_lookup_eq(cur, ltrec.rm_startblock,
+				ltrec.rm_blockcount, ltrec.rm_owner,
+				ltrec.rm_offset, ltrec.rm_flags, &i);
+		if (error)
+			goto out_error;
+		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, out_error);
+		ltrec.rm_blockcount = bno - ltrec.rm_startblock;
+		error = xfs_rmap_update(cur, &ltrec);
+		if (error)
+			goto out_error;
+
+		/* Add an rmap at the new offset */
+		error = xfs_rmapbt_insert(cur, bno + len,
+				orig_len - len - ltrec.rm_blockcount,
+				ltrec.rm_owner, offset + len,
+				ltrec.rm_flags);
+		if (error)
+			goto out_error;
+	}
+
+	trace_xfs_rmap_unmap_done(mp, cur->bc_private.a.agno, bno, len,
+			unwritten, oinfo);
+out_error:
+	if (error)
+		trace_xfs_rmap_unmap_error(cur->bc_mp,
+				cur->bc_private.a.agno, error, _RET_IP_);
+	return error;
+}
+
+/*
  * Find an extent in the rmap btree and map it.
  */
 STATIC int
@@ -1172,6 +1488,159 @@ xfs_rmap_map(
 	return __xfs_rmap_alloc(cur, bno, len, unwritten, oinfo);
 }
 
+/*
+ * Find an extent in the rmap btree and map it.  For rmap extent types that
+ * can overlap (data fork rmaps on reflink filesystems) we must be careful
+ * that the prev/next records in the btree might belong to another owner.
+ * Therefore we must use delete+insert to alter any of the key fields.
+ *
+ * For every other situation there can only be one owner for a given extent,
+ * so we can call the regular _alloc function.
+ */
+STATIC int
+xfs_rmap_map_shared(
+	struct xfs_btree_cur	*cur,
+	xfs_agblock_t		bno,
+	xfs_extlen_t		len,
+	bool			unwritten,
+	struct xfs_owner_info	*oinfo)
+{
+	struct xfs_mount	*mp = cur->bc_mp;
+	struct xfs_rmap_irec	ltrec;
+	struct xfs_rmap_irec	gtrec;
+	int			have_gt;
+	int			have_lt;
+	int			error = 0;
+	int			i;
+	uint64_t		owner;
+	uint64_t		offset;
+	unsigned int		flags = 0;
+
+	xfs_owner_info_unpack(oinfo, &owner, &offset, &flags);
+	if (unwritten)
+		flags |= XFS_RMAP_UNWRITTEN;
+	trace_xfs_rmap_map(mp, cur->bc_private.a.agno, bno, len,
+			unwritten, oinfo);
+
+	/* Is there a left record that abuts our range? */
+	error = xfs_rmap_find_left_neighbor(cur, bno, owner, offset, flags,
+			&ltrec, &have_lt);
+	if (error)
+		goto out_error;
+	if (have_lt &&
+	    !xfs_rmap_is_mergeable(&ltrec, owner, offset, len, flags))
+		have_lt = 0;
+
+	/* Is there a right record that abuts our range? */
+	error = xfs_rmap_lookup_eq(cur, bno + len, len, owner, offset + len,
+			flags, &have_gt);
+	if (error)
+		goto out_error;
+	if (have_gt) {
+		error = xfs_rmap_get_rec(cur, &gtrec, &have_gt);
+		if (error)
+			goto out_error;
+		XFS_WANT_CORRUPTED_GOTO(mp, have_gt == 1, out_error);
+		trace_xfs_rmap_map_gtrec(cur->bc_mp,
+			cur->bc_private.a.agno, gtrec.rm_startblock,
+			gtrec.rm_blockcount, gtrec.rm_owner,
+			gtrec.rm_offset, gtrec.rm_flags);
+
+		if (!xfs_rmap_is_mergeable(&gtrec, owner, offset, len, flags))
+			have_gt = 0;
+	}
+
+	if (have_lt &&
+	    ltrec.rm_startblock + ltrec.rm_blockcount == bno &&
+	    ltrec.rm_offset + ltrec.rm_blockcount == offset) {
+		/*
+		 * Left edge contiguous, merge into left record.
+		 *
+		 *       ltbno     ltlen
+		 * orig:   |ooooooooo|
+		 * adding:           |aaaaaaaaa|
+		 * result: |rrrrrrrrrrrrrrrrrrr|
+		 *                  bno       len
+		 */
+		ltrec.rm_blockcount += len;
+		if (have_gt &&
+		    bno + len == gtrec.rm_startblock &&
+		    offset + len == gtrec.rm_offset) {
+			/*
+			 * Right edge also contiguous, delete right record
+			 * and merge into left record.
+			 *
+			 *       ltbno     ltlen    gtbno     gtlen
+			 * orig:   |ooooooooo|         |ooooooooo|
+			 * adding:           |aaaaaaaaa|
+			 * result: |rrrrrrrrrrrrrrrrrrrrrrrrrrrrr|
+			 */
+			ltrec.rm_blockcount += gtrec.rm_blockcount;
+			error = xfs_rmapbt_delete(cur, gtrec.rm_startblock,
+					gtrec.rm_blockcount, gtrec.rm_owner,
+					gtrec.rm_offset, gtrec.rm_flags);
+			if (error)
+				goto out_error;
+		}
+
+		/* Point the cursor back to the left record and update. */
+		error = xfs_rmap_lookup_eq(cur, ltrec.rm_startblock,
+				ltrec.rm_blockcount, ltrec.rm_owner,
+				ltrec.rm_offset, ltrec.rm_flags, &i);
+		if (error)
+			goto out_error;
+		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, out_error);
+
+		error = xfs_rmap_update(cur, &ltrec);
+		if (error)
+			goto out_error;
+	} else if (have_gt &&
+		   bno + len == gtrec.rm_startblock &&
+		   offset + len == gtrec.rm_offset) {
+		/*
+		 * Right edge contiguous, merge into right record.
+		 *
+		 *                 gtbno     gtlen
+		 * Orig:             |ooooooooo|
+		 * adding: |aaaaaaaaa|
+		 * Result: |rrrrrrrrrrrrrrrrrrr|
+		 *        bno       len
+		 */
+		/* Delete the old record. */
+		error = xfs_rmapbt_delete(cur, gtrec.rm_startblock,
+				gtrec.rm_blockcount, gtrec.rm_owner,
+				gtrec.rm_offset, gtrec.rm_flags);
+		if (error)
+			goto out_error;
+
+		/* Move the start and re-add it. */
+		gtrec.rm_startblock = bno;
+		gtrec.rm_blockcount += len;
+		gtrec.rm_offset = offset;
+		error = xfs_rmapbt_insert(cur, gtrec.rm_startblock,
+				gtrec.rm_blockcount, gtrec.rm_owner,
+				gtrec.rm_offset, gtrec.rm_flags);
+		if (error)
+			goto out_error;
+	} else {
+		/*
+		 * No contiguous edge with identical owner, insert
+		 * new record at current cursor position.
+		 */
+		error = xfs_rmapbt_insert(cur, bno, len, owner, offset, flags);
+		if (error)
+			goto out_error;
+	}
+
+	trace_xfs_rmap_map_done(mp, cur->bc_private.a.agno, bno, len,
+			unwritten, oinfo);
+out_error:
+	if (error)
+		trace_xfs_rmap_map_error(cur->bc_mp,
+				cur->bc_private.a.agno, error, _RET_IP_);
+	return error;
+}
+
 struct xfs_rmapbt_query_range_info {
 	xfs_rmapbt_query_range_fn	fn;
 	void				*priv;
@@ -1304,10 +1773,18 @@ xfs_rmap_finish_one(
 	case XFS_RMAP_MAP:
 		error = xfs_rmap_map(rcur, bno, blockcount, unwritten, &oinfo);
 		break;
+	case XFS_RMAP_MAP_SHARED:
+		error = xfs_rmap_map_shared(rcur, bno, blockcount, unwritten,
+				&oinfo);
+		break;
 	case XFS_RMAP_UNMAP:
 		error = xfs_rmap_unmap(rcur, bno, blockcount, unwritten,
 				&oinfo);
 		break;
+	case XFS_RMAP_UNMAP_SHARED:
+		error = xfs_rmap_unmap_shared(rcur, bno, blockcount, unwritten,
+				&oinfo);
+		break;
 	case XFS_RMAP_CONVERT:
 		error = xfs_rmap_convert(rcur, bno, blockcount, !unwritten,
 				&oinfo);
@@ -1375,7 +1852,8 @@ xfs_rmap_map_extent(
 {
 	struct xfs_rmap_intent	ri;
 
-	ri.ri_type = XFS_RMAP_MAP;
+	ri.ri_type = xfs_is_reflink_inode(ip) ? XFS_RMAP_MAP_SHARED :
+			XFS_RMAP_MAP;
 	ri.ri_owner = ip->i_ino;
 	ri.ri_whichfork = whichfork;
 	ri.ri_bmap = *PREV;
@@ -1394,7 +1872,8 @@ xfs_rmap_unmap_extent(
 {
 	struct xfs_rmap_intent	ri;
 
-	ri.ri_type = XFS_RMAP_UNMAP;
+	ri.ri_type = xfs_is_reflink_inode(ip) ? XFS_RMAP_UNMAP_SHARED :
+			XFS_RMAP_UNMAP;
 	ri.ri_owner = ip->i_ino;
 	ri.ri_whichfork = whichfork;
 	ri.ri_bmap = *PREV;
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
index f398e8b..5baa81f 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.h
+++ b/fs/xfs/libxfs/xfs_rmap_btree.h
@@ -70,6 +70,13 @@ int xfs_rmapbt_insert(struct xfs_btree_cur *rcur, xfs_agblock_t agbno,
 int xfs_rmap_get_rec(struct xfs_btree_cur *cur, struct xfs_rmap_irec *irec,
 		int *stat);
 
+int xfs_rmap_find_left_neighbor(struct xfs_btree_cur *cur, xfs_agblock_t bno,
+		uint64_t owner, uint64_t offset, unsigned int flags,
+		struct xfs_rmap_irec *irec, int	*stat);
+int xfs_rmap_lookup_le_range(struct xfs_btree_cur *cur, xfs_agblock_t bno,
+		uint64_t owner, uint64_t offset, unsigned int flags,
+		struct xfs_rmap_irec *irec, int	*stat);
+
 /* functions for updating the rmapbt for bmbt blocks and AG btree blocks */
 int xfs_rmap_alloc(struct xfs_trans *tp, struct xfs_buf *agbp,
 		   xfs_agnumber_t agno, xfs_agblock_t bno, xfs_extlen_t len,
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 58a700b..b2d2e0a 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -4774,9 +4774,15 @@ xlog_recover_process_rui(
 		case XFS_RMAP_EXTENT_MAP:
 			type = XFS_RMAP_MAP;
 			break;
+		case XFS_RMAP_EXTENT_MAP_SHARED:
+			type = XFS_RMAP_MAP_SHARED;
+			break;
 		case XFS_RMAP_EXTENT_UNMAP:
 			type = XFS_RMAP_UNMAP;
 			break;
+		case XFS_RMAP_EXTENT_UNMAP_SHARED:
+			type = XFS_RMAP_UNMAP_SHARED;
+			break;
 		case XFS_RMAP_EXTENT_CONVERT:
 			type = XFS_RMAP_CONVERT;
 			break;
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 1d89f8f..d64bab7 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -2509,6 +2509,13 @@ DEFINE_RMAP_EVENT(xfs_rmap_convert_done);
 DEFINE_AG_ERROR_EVENT(xfs_rmap_convert_error);
 DEFINE_AG_ERROR_EVENT(xfs_rmap_convert_state);
 
+DEFINE_RMAP_EVENT(xfs_rmap_unmap);
+DEFINE_RMAP_EVENT(xfs_rmap_unmap_done);
+DEFINE_AG_ERROR_EVENT(xfs_rmap_unmap_error);
+DEFINE_RMAP_EVENT(xfs_rmap_map);
+DEFINE_RMAP_EVENT(xfs_rmap_map_done);
+DEFINE_AG_ERROR_EVENT(xfs_rmap_map_error);
+
 DECLARE_EVENT_CLASS(xfs_rmapbt_class,
 	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
 		 xfs_agblock_t agbno, xfs_extlen_t len,
@@ -2560,10 +2567,15 @@ DEFINE_RMAPBT_EVENT(xfs_rmapbt_delete);
 DEFINE_AG_ERROR_EVENT(xfs_rmapbt_insert_error);
 DEFINE_AG_ERROR_EVENT(xfs_rmapbt_delete_error);
 DEFINE_AG_ERROR_EVENT(xfs_rmapbt_update_error);
+
+DEFINE_RMAPBT_EVENT(xfs_rmap_find_left_neighbor_candidate);
+DEFINE_RMAPBT_EVENT(xfs_rmap_find_left_neighbor_result);
+DEFINE_RMAPBT_EVENT(xfs_rmap_find_left_neighbor_query);
+DEFINE_RMAPBT_EVENT(xfs_rmap_lookup_le_range_candidate);
 DEFINE_RMAPBT_EVENT(xfs_rmap_lookup_le_range_result);
+DEFINE_RMAPBT_EVENT(xfs_rmap_lookup_le_range);
 DEFINE_RMAPBT_EVENT(xfs_rmap_map_gtrec);
 DEFINE_RMAPBT_EVENT(xfs_rmap_convert_gtrec);
-DEFINE_RMAPBT_EVENT(xfs_rmap_find_left_neighbor_result);
 
 /* deferred bmbt updates */
 #define DEFINE_BMAP_DEFERRED_EVENT	DEFINE_RMAP_DEFERRED_EVENT
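The left/right contiguity tests in the rmap map code above (the ltrec/gtrec merge cases) reduce to a simple interval-merge predicate: a record can absorb a new mapping only when the physical range, the logical offset range, and the owner all line up. A minimal standalone sketch, using simplified stand-in types rather than the kernel's struct xfs_rmap_irec:

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for struct xfs_rmap_irec (illustrative only). */
struct rmap_rec {
	uint64_t rm_startblock;
	uint64_t rm_blockcount;
	uint64_t rm_owner;
	uint64_t rm_offset;
};

/*
 * A new mapping [bno, bno+len) at file offset 'offset' merges with the
 * record to its left when the owner matches and both the physical and
 * logical ranges are contiguous -- mirroring the ltrec checks above.
 */
static bool merges_left(const struct rmap_rec *lt, uint64_t bno,
			uint64_t offset, uint64_t owner)
{
	return lt->rm_owner == owner &&
	       lt->rm_startblock + lt->rm_blockcount == bno &&
	       lt->rm_offset + lt->rm_blockcount == offset;
}

/* Symmetric check against the record to the right (gtrec). */
static bool merges_right(const struct rmap_rec *gt, uint64_t bno,
			 uint64_t len, uint64_t offset, uint64_t owner)
{
	return gt->rm_owner == owner &&
	       bno + len == gt->rm_startblock &&
	       offset + len == gt->rm_offset;
}
```

When both predicates hold, the code above deletes the right record and extends the left one over all three ranges; when only one holds, the surviving record is stretched toward the new mapping.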


^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 106/119] xfs: convert unwritten status of reverse mappings for shared files
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (104 preceding siblings ...)
  2016-06-17  1:29 ` [PATCH 105/119] xfs: use interval query for rmap alloc operations on shared files Darrick J. Wong
@ 2016-06-17  1:29 ` Darrick J. Wong
  2016-06-17  1:29 ` [PATCH 107/119] xfs: set a default CoW extent size of 32 blocks Darrick J. Wong
                   ` (12 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:29 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Provide a function to convert an unwritten extent to a real one and
vice versa when shared extents are possible.

v2: Move rmap unwritten bit to rm_offset.
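The switch in the conversion code keys off which edges of the old-extent record the converted range reaches. That edge computation can be sketched on its own; the names and bit values here are illustrative, not the kernel's RMAP_* definitions:

```c
#include <stdint.h>

#define SK_LEFT_FILLING  (1 << 0)	/* conversion starts at record start */
#define SK_RIGHT_FILLING (1 << 1)	/* conversion ends at record end */

/*
 * Given an existing record covering logical range
 * [rm_offset, rm_offset + rm_blockcount) and a conversion of
 * [offset, offset + len), report which edges of the old record the
 * conversion touches -- the same tests the FILLING bits encode.
 */
static int convert_state(uint64_t rm_offset, uint64_t rm_blockcount,
			 uint64_t offset, uint64_t len)
{
	int state = 0;

	if (rm_offset == offset)
		state |= SK_LEFT_FILLING;
	if (rm_offset + rm_blockcount == offset + len)
		state |= SK_RIGHT_FILLING;
	return state;
}
```

Both bits set means the whole record changes state (and may merge with neighbors); one bit means a record edge moves; neither bit means the record is split into three pieces, as in the `case 0:` branch.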

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_rmap.c |  385 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_log_recover.c |    3 
 2 files changed, 387 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
index a9bd522..29d08fc 100644
--- a/fs/xfs/libxfs/xfs_rmap.c
+++ b/fs/xfs/libxfs/xfs_rmap.c
@@ -1293,6 +1293,384 @@ xfs_rmap_convert(
 	return __xfs_rmap_convert(cur, bno, len, unwritten, oinfo);
 }
 
+/*
+ * Convert an unwritten extent to a real extent or vice versa.  If there is no
+ * possibility of overlapping extents, delegate to the simpler convert
+ * function.
+ */
+STATIC int
+xfs_rmap_convert_shared(
+	struct xfs_btree_cur	*cur,
+	xfs_agblock_t		bno,
+	xfs_extlen_t		len,
+	bool			unwritten,
+	struct xfs_owner_info	*oinfo)
+{
+	struct xfs_mount	*mp = cur->bc_mp;
+	struct xfs_rmap_irec	r[4];	/* neighbor extent entries */
+					/* left is 0, right is 1, prev is 2 */
+					/* new is 3 */
+	uint64_t		owner;
+	uint64_t		offset;
+	uint64_t		new_endoff;
+	unsigned int		oldext;
+	unsigned int		newext;
+	unsigned int		flags = 0;
+	int			i;
+	int			state = 0;
+	int			error;
+
+	xfs_owner_info_unpack(oinfo, &owner, &offset, &flags);
+	ASSERT(!(XFS_RMAP_NON_INODE_OWNER(owner) ||
+			(flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))));
+	oldext = unwritten ? XFS_RMAP_UNWRITTEN : 0;
+	new_endoff = offset + len;
+	trace_xfs_rmap_convert(mp, cur->bc_private.a.agno, bno, len,
+			unwritten, oinfo);
+
+	/*
+	 * For the initial lookup, look for an exact match or the left-adjacent

+	 * record for our insertion point. This will also give us the record for
+	 * start block contiguity tests.
+	 */
+	error = xfs_rmap_lookup_le_range(cur, bno, owner, offset, flags,
+			&PREV, &i);
+	if (error)
+		goto done;
+	XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
+
+	ASSERT(PREV.rm_offset <= offset);
+	ASSERT(PREV.rm_offset + PREV.rm_blockcount >= new_endoff);
+	ASSERT((PREV.rm_flags & XFS_RMAP_UNWRITTEN) == oldext);
+	newext = ~oldext & XFS_RMAP_UNWRITTEN;
+
+	/*
+	 * Set flags determining what part of the previous oldext allocation
+	 * extent is being replaced by a newext allocation.
+	 */
+	if (PREV.rm_offset == offset)
+		state |= RMAP_LEFT_FILLING;
+	if (PREV.rm_offset + PREV.rm_blockcount == new_endoff)
+		state |= RMAP_RIGHT_FILLING;
+
+	/* Is there a left record that abuts our range? */
+	error = xfs_rmap_find_left_neighbor(cur, bno, owner, offset, newext,
+			&LEFT, &i);
+	if (error)
+		goto done;
+	if (i) {
+		state |= RMAP_LEFT_VALID;
+		XFS_WANT_CORRUPTED_GOTO(mp,
+				LEFT.rm_startblock + LEFT.rm_blockcount <= bno,
+				done);
+		if (xfs_rmap_is_mergeable(&LEFT, owner, offset, len, newext))
+			state |= RMAP_LEFT_CONTIG;
+	}
+
+	/* Is there a right record that abuts our range? */
+	error = xfs_rmap_lookup_eq(cur, bno + len, len, owner, offset + len,
+			newext, &i);
+	if (error)
+		goto done;
+	if (i) {
+		state |= RMAP_RIGHT_VALID;
+		error = xfs_rmap_get_rec(cur, &RIGHT, &i);
+		if (error)
+			goto done;
+		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
+		XFS_WANT_CORRUPTED_GOTO(mp, bno + len <= RIGHT.rm_startblock,
+				done);
+		trace_xfs_rmap_convert_gtrec(cur->bc_mp,
+				cur->bc_private.a.agno, RIGHT.rm_startblock,
+				RIGHT.rm_blockcount, RIGHT.rm_owner,
+				RIGHT.rm_offset, RIGHT.rm_flags);
+		if (xfs_rmap_is_mergeable(&RIGHT, owner, offset, len, newext))
+			state |= RMAP_RIGHT_CONTIG;
+	}
+
+	/* check that left + prev + right is not too long */
+	if ((state & (RMAP_LEFT_FILLING | RMAP_LEFT_CONTIG |
+			 RMAP_RIGHT_FILLING | RMAP_RIGHT_CONTIG)) ==
+	    (RMAP_LEFT_FILLING | RMAP_LEFT_CONTIG |
+	     RMAP_RIGHT_FILLING | RMAP_RIGHT_CONTIG) &&
+	    (unsigned long)LEFT.rm_blockcount + len +
+	     RIGHT.rm_blockcount > XFS_RMAP_LEN_MAX)
+		state &= ~RMAP_RIGHT_CONTIG;
+
+	trace_xfs_rmap_convert_state(mp, cur->bc_private.a.agno, state,
+			_RET_IP_);
+	/*
+	 * Switch out based on the FILLING and CONTIG state bits.
+	 */
+	switch (state & (RMAP_LEFT_FILLING | RMAP_LEFT_CONTIG |
+			 RMAP_RIGHT_FILLING | RMAP_RIGHT_CONTIG)) {
+	case RMAP_LEFT_FILLING | RMAP_LEFT_CONTIG |
+	     RMAP_RIGHT_FILLING | RMAP_RIGHT_CONTIG:
+		/*
+		 * Setting all of a previous oldext extent to newext.
+		 * The left and right neighbors are both contiguous with new.
+		 */
+		error = xfs_rmapbt_delete(cur, RIGHT.rm_startblock,
+				RIGHT.rm_blockcount, RIGHT.rm_owner,
+				RIGHT.rm_offset, RIGHT.rm_flags);
+		if (error)
+			goto done;
+		error = xfs_rmapbt_delete(cur, PREV.rm_startblock,
+				PREV.rm_blockcount, PREV.rm_owner,
+				PREV.rm_offset, PREV.rm_flags);
+		if (error)
+			goto done;
+		NEW = LEFT;
+		error = xfs_rmap_lookup_eq(cur, NEW.rm_startblock,
+				NEW.rm_blockcount, NEW.rm_owner,
+				NEW.rm_offset, NEW.rm_flags, &i);
+		if (error)
+			goto done;
+		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
+		NEW.rm_blockcount += PREV.rm_blockcount + RIGHT.rm_blockcount;
+		error = xfs_rmap_update(cur, &NEW);
+		if (error)
+			goto done;
+		break;
+
+	case RMAP_LEFT_FILLING | RMAP_RIGHT_FILLING | RMAP_LEFT_CONTIG:
+		/*
+		 * Setting all of a previous oldext extent to newext.
+		 * The left neighbor is contiguous, the right is not.
+		 */
+		error = xfs_rmapbt_delete(cur, PREV.rm_startblock,
+				PREV.rm_blockcount, PREV.rm_owner,
+				PREV.rm_offset, PREV.rm_flags);
+		if (error)
+			goto done;
+		NEW = LEFT;
+		error = xfs_rmap_lookup_eq(cur, NEW.rm_startblock,
+				NEW.rm_blockcount, NEW.rm_owner,
+				NEW.rm_offset, NEW.rm_flags, &i);
+		if (error)
+			goto done;
+		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
+		NEW.rm_blockcount += PREV.rm_blockcount;
+		error = xfs_rmap_update(cur, &NEW);
+		if (error)
+			goto done;
+		break;
+
+	case RMAP_LEFT_FILLING | RMAP_RIGHT_FILLING | RMAP_RIGHT_CONTIG:
+		/*
+		 * Setting all of a previous oldext extent to newext.
+		 * The right neighbor is contiguous, the left is not.
+		 */
+		error = xfs_rmapbt_delete(cur, RIGHT.rm_startblock,
+				RIGHT.rm_blockcount, RIGHT.rm_owner,
+				RIGHT.rm_offset, RIGHT.rm_flags);
+		if (error)
+			goto done;
+		NEW = PREV;
+		error = xfs_rmap_lookup_eq(cur, NEW.rm_startblock,
+				NEW.rm_blockcount, NEW.rm_owner,
+				NEW.rm_offset, NEW.rm_flags, &i);
+		if (error)
+			goto done;
+		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
+		NEW.rm_blockcount += RIGHT.rm_blockcount;
+		NEW.rm_flags = RIGHT.rm_flags;
+		error = xfs_rmap_update(cur, &NEW);
+		if (error)
+			goto done;
+		break;
+
+	case RMAP_LEFT_FILLING | RMAP_RIGHT_FILLING:
+		/*
+		 * Setting all of a previous oldext extent to newext.
+		 * Neither the left nor right neighbors are contiguous with
+		 * the new one.
+		 */
+		NEW = PREV;
+		error = xfs_rmap_lookup_eq(cur, NEW.rm_startblock,
+				NEW.rm_blockcount, NEW.rm_owner,
+				NEW.rm_offset, NEW.rm_flags, &i);
+		if (error)
+			goto done;
+		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
+		NEW.rm_flags = newext;
+		error = xfs_rmap_update(cur, &NEW);
+		if (error)
+			goto done;
+		break;
+
+	case RMAP_LEFT_FILLING | RMAP_LEFT_CONTIG:
+		/*
+		 * Setting the first part of a previous oldext extent to newext.
+		 * The left neighbor is contiguous.
+		 */
+		NEW = PREV;
+		error = xfs_rmapbt_delete(cur, NEW.rm_startblock,
+				NEW.rm_blockcount, NEW.rm_owner,
+				NEW.rm_offset, NEW.rm_flags);
+		if (error)
+			goto done;
+		NEW.rm_offset += len;
+		NEW.rm_startblock += len;
+		NEW.rm_blockcount -= len;
+		error = xfs_rmapbt_insert(cur, NEW.rm_startblock,
+				NEW.rm_blockcount, NEW.rm_owner,
+				NEW.rm_offset, NEW.rm_flags);
+		if (error)
+			goto done;
+		NEW = LEFT;
+		error = xfs_rmap_lookup_eq(cur, NEW.rm_startblock,
+				NEW.rm_blockcount, NEW.rm_owner,
+				NEW.rm_offset, NEW.rm_flags, &i);
+		if (error)
+			goto done;
+		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
+		NEW.rm_blockcount += len;
+		error = xfs_rmap_update(cur, &NEW);
+		if (error)
+			goto done;
+		break;
+
+	case RMAP_LEFT_FILLING:
+		/*
+		 * Setting the first part of a previous oldext extent to newext.
+		 * The left neighbor is not contiguous.
+		 */
+		NEW = PREV;
+		error = xfs_rmapbt_delete(cur, NEW.rm_startblock,
+				NEW.rm_blockcount, NEW.rm_owner,
+				NEW.rm_offset, NEW.rm_flags);
+		if (error)
+			goto done;
+		NEW.rm_offset += len;
+		NEW.rm_startblock += len;
+		NEW.rm_blockcount -= len;
+		error = xfs_rmapbt_insert(cur, NEW.rm_startblock,
+				NEW.rm_blockcount, NEW.rm_owner,
+				NEW.rm_offset, NEW.rm_flags);
+		if (error)
+			goto done;
+		error = xfs_rmapbt_insert(cur, bno, len, owner, offset, newext);
+		if (error)
+			goto done;
+		break;
+
+	case RMAP_RIGHT_FILLING | RMAP_RIGHT_CONTIG:
+		/*
+		 * Setting the last part of a previous oldext extent to newext.
+		 * The right neighbor is contiguous with the new allocation.
+		 */
+		NEW = PREV;
+		error = xfs_rmap_lookup_eq(cur, NEW.rm_startblock,
+				NEW.rm_blockcount, NEW.rm_owner,
+				NEW.rm_offset, NEW.rm_flags, &i);
+		if (error)
+			goto done;
+		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
+		NEW.rm_blockcount = offset - NEW.rm_offset;
+		error = xfs_rmap_update(cur, &NEW);
+		if (error)
+			goto done;
+		NEW = RIGHT;
+		error = xfs_rmapbt_delete(cur, NEW.rm_startblock,
+				NEW.rm_blockcount, NEW.rm_owner,
+				NEW.rm_offset, NEW.rm_flags);
+		if (error)
+			goto done;
+		NEW.rm_offset = offset;
+		NEW.rm_startblock = bno;
+		NEW.rm_blockcount += len;
+		error = xfs_rmapbt_insert(cur, NEW.rm_startblock,
+				NEW.rm_blockcount, NEW.rm_owner,
+				NEW.rm_offset, NEW.rm_flags);
+		if (error)
+			goto done;
+		break;
+
+	case RMAP_RIGHT_FILLING:
+		/*
+		 * Setting the last part of a previous oldext extent to newext.
+		 * The right neighbor is not contiguous.
+		 */
+		NEW = PREV;
+		error = xfs_rmap_lookup_eq(cur, NEW.rm_startblock,
+				NEW.rm_blockcount, NEW.rm_owner,
+				NEW.rm_offset, NEW.rm_flags, &i);
+		if (error)
+			goto done;
+		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
+		NEW.rm_blockcount -= len;
+		error = xfs_rmap_update(cur, &NEW);
+		if (error)
+			goto done;
+		error = xfs_rmapbt_insert(cur, bno, len, owner, offset, newext);
+		if (error)
+			goto done;
+		break;
+
+	case 0:
+		/*
+		 * Setting the middle part of a previous oldext extent to
+		 * newext.  Contiguity is impossible here.
+		 * One extent becomes three extents.
+		 */
+		/* new right extent - oldext */
+		NEW.rm_startblock = bno + len;
+		NEW.rm_owner = owner;
+		NEW.rm_offset = new_endoff;
+		NEW.rm_blockcount = PREV.rm_offset + PREV.rm_blockcount -
+				new_endoff;
+		NEW.rm_flags = PREV.rm_flags;
+		error = xfs_rmapbt_insert(cur, NEW.rm_startblock,
+				NEW.rm_blockcount, NEW.rm_owner, NEW.rm_offset,
+				NEW.rm_flags);
+		if (error)
+			goto done;
+		/* new left extent - oldext */
+		NEW = PREV;
+		error = xfs_rmap_lookup_eq(cur, NEW.rm_startblock,
+				NEW.rm_blockcount, NEW.rm_owner,
+				NEW.rm_offset, NEW.rm_flags, &i);
+		if (error)
+			goto done;
+		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
+		NEW.rm_blockcount = offset - NEW.rm_offset;
+		error = xfs_rmap_update(cur, &NEW);
+		if (error)
+			goto done;
+		/* new middle extent - newext */
+		NEW.rm_startblock = bno;
+		NEW.rm_blockcount = len;
+		NEW.rm_owner = owner;
+		NEW.rm_offset = offset;
+		NEW.rm_flags = newext;
+		error = xfs_rmapbt_insert(cur, NEW.rm_startblock,
+				NEW.rm_blockcount, NEW.rm_owner, NEW.rm_offset,
+				NEW.rm_flags);
+		if (error)
+			goto done;
+		break;
+
+	case RMAP_LEFT_FILLING | RMAP_LEFT_CONTIG | RMAP_RIGHT_CONTIG:
+	case RMAP_RIGHT_FILLING | RMAP_LEFT_CONTIG | RMAP_RIGHT_CONTIG:
+	case RMAP_LEFT_FILLING | RMAP_RIGHT_CONTIG:
+	case RMAP_RIGHT_FILLING | RMAP_LEFT_CONTIG:
+	case RMAP_LEFT_CONTIG | RMAP_RIGHT_CONTIG:
+	case RMAP_LEFT_CONTIG:
+	case RMAP_RIGHT_CONTIG:
+		/*
+		 * These cases are all impossible.
+		 */
+		ASSERT(0);
+	}
+
+	trace_xfs_rmap_convert_done(mp, cur->bc_private.a.agno, bno, len,
+			unwritten, oinfo);
+done:
+	if (error)
+		trace_xfs_rmap_convert_error(cur->bc_mp,
+				cur->bc_private.a.agno, error, _RET_IP_);
+	return error;
+}
+
 #undef	NEW
 #undef	LEFT
 #undef	RIGHT
@@ -1789,6 +2167,10 @@ xfs_rmap_finish_one(
 		error = xfs_rmap_convert(rcur, bno, blockcount, !unwritten,
 				&oinfo);
 		break;
+	case XFS_RMAP_CONVERT_SHARED:
+		error = xfs_rmap_convert_shared(rcur, bno, blockcount,
+				!unwritten, &oinfo);
+		break;
 	case XFS_RMAP_ALLOC:
 		error = __xfs_rmap_alloc(rcur, bno, blockcount, unwritten,
 				&oinfo);
@@ -1892,7 +2274,8 @@ xfs_rmap_convert_extent(
 {
 	struct xfs_rmap_intent	ri;
 
-	ri.ri_type = XFS_RMAP_CONVERT;
+	ri.ri_type = xfs_is_reflink_inode(ip) ? XFS_RMAP_CONVERT_SHARED :
+			XFS_RMAP_CONVERT;
 	ri.ri_owner = ip->i_ino;
 	ri.ri_whichfork = whichfork;
 	ri.ri_bmap = *PREV;
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index b2d2e0a..9372fc5 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -4786,6 +4786,9 @@ xlog_recover_process_rui(
 		case XFS_RMAP_EXTENT_CONVERT:
 			type = XFS_RMAP_CONVERT;
 			break;
+		case XFS_RMAP_EXTENT_CONVERT_SHARED:
+			type = XFS_RMAP_CONVERT_SHARED;
+			break;
 		case XFS_RMAP_EXTENT_ALLOC:
 			type = XFS_RMAP_ALLOC;
 			break;



* [PATCH 107/119] xfs: set a default CoW extent size of 32 blocks
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (105 preceding siblings ...)
  2016-06-17  1:29 ` [PATCH 106/119] xfs: convert unwritten status of reverse mappings for " Darrick J. Wong
@ 2016-06-17  1:29 ` Darrick J. Wong
  2016-06-17  1:29 ` [PATCH 108/119] xfs: don't allow realtime and reflinked files to mix Darrick J. Wong
                   ` (11 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:29 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

If the admin doesn't set a CoW extent size or a regular extent size
hint, default to creating CoW reservations 32 blocks long to reduce
fragmentation.
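The selection logic amounts to "take the larger of the two hints, else fall back to 32 blocks". A minimal userspace sketch of that rule (the helper name and the constant are taken from this patch's description, everything else is illustrative):

```c
#include <stdint.h>

#define DEFAULT_COW_EXTSZ_BLOCKS 32	/* default proposed by this patch */

/*
 * Mirror of the selection in xfs_get_cowextsz_hint(): prefer the larger
 * of the CoW extent size hint and the regular extent size hint, and
 * fall back to 32 blocks when neither is set.
 */
static uint32_t cowextsz_hint(uint32_t cow_hint, uint32_t extsz_hint)
{
	uint32_t a = cow_hint > extsz_hint ? cow_hint : extsz_hint;

	return a ? a : DEFAULT_COW_EXTSZ_BLOCKS;
}
```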

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_inode.c |   10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)


diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index fb9c2d7..f2971e2 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -80,7 +80,8 @@ xfs_get_extsz_hint(
 /*
  * Helper function to extract CoW extent size hint from inode.
  * Between the extent size hint and the CoW extent size hint, we
- * return the greater of the two.
+ * return the greater of the two.  If the value is zero (automatic),
+ * default to 32 blocks.
  */
 xfs_extlen_t
 xfs_get_cowextsz_hint(
@@ -93,9 +94,10 @@ xfs_get_cowextsz_hint(
 		a = ip->i_d.di_cowextsize;
 	b = xfs_get_extsz_hint(ip);
 
-	if (a > b)
-		return a;
-	return b;
+	a = max(a, b);
+	if (a == 0)
+		return 32;
+	return a;
 }
 
 /*



* [PATCH 108/119] xfs: don't allow realtime and reflinked files to mix
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (106 preceding siblings ...)
  2016-06-17  1:29 ` [PATCH 107/119] xfs: set a default CoW extent size of 32 blocks Darrick J. Wong
@ 2016-06-17  1:29 ` Darrick J. Wong
  2016-06-17  1:29 ` [PATCH 109/119] xfs: don't mix reflink and DAX mode for now Darrick J. Wong
                   ` (10 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:29 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

We don't support sharing blocks on the realtime device.
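The new verifier check boils down to treating REALTIME and REFLINK as mutually exclusive inode flags. As a standalone sketch (flag bit values here are hypothetical, not the on-disk XFS_DIFLAG* encodings):

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins for the on-disk flag bits. */
#define DIFLAG_REALTIME  (1u << 0)
#define DIFLAG2_REFLINK  (1ull << 0)

/*
 * The check added to xfs_dinode_verify(): an inode may not carry both
 * the realtime flag (di_flags) and the reflink flag (di_flags2).
 */
static bool dinode_flags_valid(uint16_t flags, uint64_t flags2)
{
	if ((flags2 & DIFLAG2_REFLINK) && (flags & DIFLAG_REALTIME))
		return false;
	return true;
}
```

The ioctl-side hunk enforces the same invariant at flag-setting time, so on-disk inodes violating it can only come from corruption.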

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_inode_buf.c |   10 ++++++++++
 fs/xfs/xfs_ioctl.c            |    4 ++++
 2 files changed, 14 insertions(+)


diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index 2efa42c..c4cbd2b 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -386,6 +386,9 @@ xfs_dinode_verify(
 	xfs_ino_t		ino,
 	struct xfs_dinode	*dip)
 {
+	uint16_t		flags;
+	uint64_t		flags2;
+
 	if (dip->di_magic != cpu_to_be16(XFS_DINODE_MAGIC))
 		return false;
 
@@ -402,6 +405,13 @@ xfs_dinode_verify(
 		return false;
 	if (!uuid_equal(&dip->di_uuid, &mp->m_sb.sb_meta_uuid))
 		return false;
+
+	/* don't let reflink and realtime mix */
+	flags = be16_to_cpu(dip->di_flags);
+	flags2 = be64_to_cpu(dip->di_flags2);
+	if ((flags2 & XFS_DIFLAG2_REFLINK) && (flags & XFS_DIFLAG_REALTIME))
+		return false;
+
 	return true;
 }
 
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index d2b4e81..f103b15 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1031,6 +1031,10 @@ xfs_ioctl_setattr_xflags(
 			return -EINVAL;
 	}
 
+	/* Don't allow us to set realtime mode for a reflinked file. */
+	if ((fa->fsx_xflags & FS_XFLAG_REALTIME) && xfs_is_reflink_inode(ip))
+		return -EINVAL;
+
 	/*
 	 * Can't modify an immutable/append-only file unless
 	 * we have appropriate permission.



* [PATCH 109/119] xfs: don't mix reflink and DAX mode for now
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (107 preceding siblings ...)
  2016-06-17  1:29 ` [PATCH 108/119] xfs: don't allow realtime and reflinked files to mix Darrick J. Wong
@ 2016-06-17  1:29 ` Darrick J. Wong
  2016-06-17  1:29 ` [PATCH 110/119] xfs: fail ->bmap for reflink inodes Darrick J. Wong
                   ` (9 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:29 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Since we don't have a strategy for handling both DAX and reflink,
for now we'll just prohibit both being set at the same time.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_inode_buf.c |    4 ++++
 fs/xfs/xfs_file.c             |    4 ++++
 fs/xfs/xfs_ioctl.c            |    4 ++++
 fs/xfs/xfs_iops.c             |    1 +
 4 files changed, 13 insertions(+)


diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index c4cbd2b..3f7053a 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -412,6 +412,10 @@ xfs_dinode_verify(
 	if ((flags2 & XFS_DIFLAG2_REFLINK) && (flags & XFS_DIFLAG_REALTIME))
 		return false;
 
+	/* don't let reflink and dax mix */
+	if ((flags2 & XFS_DIFLAG2_REFLINK) && (flags2 & XFS_DIFLAG2_DAX))
+		return false;
+
 	return true;
 }
 
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index ad6a467..e8e93f8 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1126,6 +1126,10 @@ xfs_file_share_range(
 	if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
 		return -EINVAL;
 
+	/* Don't share DAX file data for now. */
+	if (IS_DAX(inode_in) || IS_DAX(inode_out))
+		return -EINVAL;
+
 	/* Are we going all the way to the end? */
 	isize = i_size_read(inode_in);
 	if (isize == 0)
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index f103b15..aa9645c 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1035,6 +1035,10 @@ xfs_ioctl_setattr_xflags(
 	if ((fa->fsx_xflags & FS_XFLAG_REALTIME) && xfs_is_reflink_inode(ip))
 		return -EINVAL;
 
+	/* Don't allow us to set DAX mode for a reflinked file for now. */
+	if ((fa->fsx_xflags & FS_XFLAG_DAX) && xfs_is_reflink_inode(ip))
+		return -EINVAL;
+
 	/*
 	 * Can't modify an immutable/append-only file unless
 	 * we have appropriate permission.
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 0fa86bd..95073db 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1217,6 +1217,7 @@ xfs_diflags_to_iflags(
 		inode->i_flags |= S_NOATIME;
 	if (S_ISREG(inode->i_mode) &&
 	    ip->i_mount->m_sb.sb_blocksize == PAGE_SIZE &&
+	    !xfs_is_reflink_inode(ip) &&
 	    (ip->i_mount->m_flags & XFS_MOUNT_DAX ||
 	     ip->i_d.di_flags2 & XFS_DIFLAG2_DAX))
 		inode->i_flags |= S_DAX;



* [PATCH 110/119] xfs: fail ->bmap for reflink inodes
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (108 preceding siblings ...)
  2016-06-17  1:29 ` [PATCH 109/119] xfs: don't mix reflink and DAX mode for now Darrick J. Wong
@ 2016-06-17  1:29 ` Darrick J. Wong
  2016-06-17  1:29 ` [PATCH 111/119] xfs: recognize the reflink feature bit Darrick J. Wong
                   ` (8 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:29 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, Christoph Hellwig, xfs

From: Christoph Hellwig <hch@lst.de>

Have xfs_vm_bmap return zero for reflinked files.  This hack prevents
using a file with shared blocks as a swap file, because we don't want
to deal with CoW when we're (probably) low on memory.

Signed-off-by: Christoph Hellwig <hch@lst.de>
[darrick.wong@oracle.com: add a more descriptive changelog]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_aops.c |   11 +++++++++++
 1 file changed, 11 insertions(+)


diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 31205fa..83fd028 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1827,6 +1827,17 @@ xfs_vm_bmap(
 
 	trace_xfs_vm_bmap(XFS_I(inode));
 	xfs_ilock(ip, XFS_IOLOCK_SHARED);
+
+	/*
+	 * The swap code (ab-)uses ->bmap to get a block mapping and then
+	 * bypasses the file system for actual I/O.  We really can't allow
+	 * that on reflink inodes, so we have to skip out here.  And yes,
+	 * 0 is the magic code for a bmap error.
+	 */
+	if (xfs_is_reflink_inode(ip)) {
+		xfs_iunlock(ip, XFS_IOLOCK_SHARED);
+		return 0;
+	}
 	filemap_write_and_wait(mapping);
 	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
 	return generic_block_bmap(mapping, block, xfs_get_blocks);



* [PATCH 111/119] xfs: recognize the reflink feature bit
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (109 preceding siblings ...)
  2016-06-17  1:29 ` [PATCH 110/119] xfs: fail ->bmap for reflink inodes Darrick J. Wong
@ 2016-06-17  1:29 ` Darrick J. Wong
  2016-06-17  1:29 ` [PATCH 112/119] xfs: introduce the XFS_IOC_GETFSMAPX ioctl Darrick J. Wong
                   ` (7 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:29 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Add the reflink feature flag to the set of recognized feature flags.
This enables users to write to reflink filesystems.
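Before this patch, the REFLINK ro-compat bit falls outside XFS_SB_FEAT_RO_COMPAT_ALL, so a reflink filesystem could only be mounted read-only; adding the bit to the mask is what enables read-write mounts. The mechanism can be sketched as (bit assignments mirror the patch; the helper is illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

#define RO_COMPAT_FINOBT  (1u << 0)
#define RO_COMPAT_RMAPBT  (1u << 1)
#define RO_COMPAT_REFLINK (1u << 2)
#define RO_COMPAT_ALL \
	(RO_COMPAT_FINOBT | RO_COMPAT_RMAPBT | RO_COMPAT_REFLINK)
#define RO_COMPAT_UNKNOWN (~RO_COMPAT_ALL)

/*
 * A superblock carrying ro-compat feature bits outside the recognized
 * set may only be mounted read-only; writing could corrupt metadata
 * this kernel does not understand.
 */
static bool has_unknown_ro_compat(uint32_t ro_compat_features)
{
	return (ro_compat_features & RO_COMPAT_UNKNOWN) != 0;
}
```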

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_format.h |    3 ++-
 fs/xfs/xfs_super.c         |    7 +++++++
 2 files changed, 9 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index a35f4e5..211a8b5 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -459,7 +459,8 @@ xfs_sb_has_compat_feature(
 #define XFS_SB_FEAT_RO_COMPAT_REFLINK  (1 << 2)		/* reflinked files */
 #define XFS_SB_FEAT_RO_COMPAT_ALL \
 		(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
-		 XFS_SB_FEAT_RO_COMPAT_RMAPBT)
+		 XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
+		 XFS_SB_FEAT_RO_COMPAT_REFLINK)
 #define XFS_SB_FEAT_RO_COMPAT_UNKNOWN	~XFS_SB_FEAT_RO_COMPAT_ALL
 static inline bool
 xfs_sb_has_ro_compat_feature(
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 93d159a..5fdf2e7 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1587,6 +1587,9 @@ xfs_fs_fill_super(
 			"DAX unsupported by block device. Turning off DAX.");
 			mp->m_flags &= ~XFS_MOUNT_DAX;
 		}
+		if (xfs_sb_version_hasreflink(&mp->m_sb))
+			xfs_alert(mp,
+		"DAX and reflink have not been tested together!");
 	}
 
 	if (xfs_sb_version_hassparseinodes(&mp->m_sb))
@@ -1597,6 +1600,10 @@ xfs_fs_fill_super(
 		xfs_alert(mp,
 	"EXPERIMENTAL reverse mapping btree feature enabled. Use at your own risk!");
 
+	if (xfs_sb_version_hasreflink(&mp->m_sb))
+		xfs_alert(mp,
+	"EXPERIMENTAL reflink feature enabled. Use at your own risk!");
+
 	error = xfs_mountfs(mp);
 	if (error)
 		goto out_filestream_unmount;



* [PATCH 112/119] xfs: introduce the XFS_IOC_GETFSMAPX ioctl
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (110 preceding siblings ...)
  2016-06-17  1:29 ` [PATCH 111/119] xfs: recognize the reflink feature bit Darrick J. Wong
@ 2016-06-17  1:29 ` Darrick J. Wong
  2016-06-17  1:30 ` [PATCH 113/119] xfs: scrub btree records and pointers while querying Darrick J. Wong
                   ` (6 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:29 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Introduce a new ioctl that uses the reverse mapping btree to return
information about the physical layout of the filesystem.

v2: shorten the device field to u32 since that's all we need for
dev_t.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_fs.h       |   62 ++++++++
 fs/xfs/libxfs/xfs_refcount.c |   51 +++++-
 fs/xfs/libxfs/xfs_refcount.h |    4 
 fs/xfs/xfs_fsops.c           |  338 ++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_fsops.h           |    6 +
 fs/xfs/xfs_ioctl.c           |   76 +++++++++
 fs/xfs/xfs_ioctl32.c         |    1 
 fs/xfs/xfs_trace.h           |   76 +++++++++
 8 files changed, 600 insertions(+), 14 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 10ebf99..8a1b96a 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -93,6 +93,67 @@ struct getbmapx {
 #define BMV_OF_SHARED		0x8	/* segment shared with another file */
 
 /*
+ *	Structure for XFS_IOC_GETFSMAPX.
+ *
+ *	Similar to XFS_IOC_GETBMAPX, the first two elements in the array are
+ *	used to constrain the output.  The first element in the array should
+ *	represent the lowest disk address that the user wants to learn about.
+ *	The second element in the array should represent the highest disk
+ *	address to query.  Subsequent array elements will be filled out by the
+ *	command.
+ *
+ *	The fmv_iflags field is only used in the first structure.  The
+ *	fmv_oflags field is filled in for each returned structure after the
+ *	second structure.  The fmv_unused1 fields in the first two array
+ *	elements must be zero.
+ *
+ *	The fmv_count, fmv_entries, fmv_iflags, and fmv_length fields in the
+ *	second array element must be zero.
+ *
+ *	fmv_block, fmv_offset, and fmv_length are expressed in units of 512
+ *	byte sectors.
+ */
+#ifndef HAVE_GETFSMAPX
+struct getfsmapx {
+	__u32		fmv_device;	/* device id */
+	__u32		fmv_unused1;	/* future use, must be zero */
+	__u64		fmv_block;	/* starting block */
+	__u64		fmv_owner;	/* owner id */
+	__u64		fmv_offset;	/* file offset of segment */
+	__u64		fmv_length;	/* length of segment, blocks */
+	__u32		fmv_oflags;	/* mapping flags */
+	__u32		fmv_iflags;	/* control flags (1st structure) */
+	__u32		fmv_count;	/* # of entries in array incl. input */
+	__u32		fmv_entries;	/* # of entries filled in (output). */
+	__u64		fmv_unused2;	/* future use, must be zero */
+};
+#endif
+
+/*	fmv_iflags values - set by the XFS_IOC_GETFSMAPX caller.	*/
+/* no flags defined yet */
+#define FMV_IF_VALID	0
+
+/*	fmv_oflags values - returned for each non-header segment */
+#define FMV_OF_PREALLOC		0x1	/* segment = unwritten pre-allocation */
+#define FMV_OF_ATTR_FORK	0x2	/* segment = attribute fork */
+#define FMV_OF_EXTENT_MAP	0x4	/* segment = extent map */
+#define FMV_OF_SHARED		0x8	/* segment = shared with another file */
+#define FMV_OF_SPECIAL_OWNER	0x10	/* owner is a special value */
+#define FMV_OF_LAST		0x20	/* segment is the last in the FS */
+
+/*	fmv_owner special values */
+#define	FMV_OWN_FREE		(-1ULL)	/* free space */
+#define FMV_OWN_UNKNOWN		(-2ULL)	/* unknown owner */
+#define FMV_OWN_FS		(-3ULL)	/* static fs metadata */
+#define FMV_OWN_LOG		(-4ULL)	/* journalling log */
+#define FMV_OWN_AG		(-5ULL)	/* per-AG metadata */
+#define FMV_OWN_INOBT		(-6ULL)	/* inode btree blocks */
+#define FMV_OWN_INODES		(-7ULL)	/* inodes */
+#define FMV_OWN_REFC		(-8ULL) /* refcount tree */
+#define FMV_OWN_COW		(-9ULL) /* cow allocations */
+#define FMV_OWN_DEFECTIVE	(-10ULL) /* bad blocks */
+
+/*
  * Structure for XFS_IOC_FSSETDM.
  * For use by backup and restore programs to set the XFS on-disk inode
  * fields di_dmevmask and di_dmstate.  These must be set to exactly and
@@ -502,6 +563,7 @@ typedef struct xfs_swapext
 #define XFS_IOC_GETBMAPX	_IOWR('X', 56, struct getbmap)
 #define XFS_IOC_ZERO_RANGE	_IOW ('X', 57, struct xfs_flock64)
 #define XFS_IOC_FREE_EOFBLOCKS	_IOR ('X', 58, struct xfs_fs_eofblocks)
+#define XFS_IOC_GETFSMAPX	_IOWR('X', 59, struct getfsmapx)
 
 /*
  * ioctl commands that replace IRIX syssgi()'s
diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
index fd4369f..e8d8702 100644
--- a/fs/xfs/libxfs/xfs_refcount.c
+++ b/fs/xfs/libxfs/xfs_refcount.c
@@ -1172,8 +1172,9 @@ xfs_refcount_decrease_extent(
  * extent we find.  If no shared blocks are found, flen will be set to zero.
  */
 int
-xfs_refcount_find_shared(
+__xfs_refcount_find_shared(
 	struct xfs_mount	*mp,
+	struct xfs_buf		*agbp,
 	xfs_agnumber_t		agno,
 	xfs_agblock_t		agbno,
 	xfs_extlen_t		aglen,
@@ -1182,23 +1183,13 @@ xfs_refcount_find_shared(
 	bool			find_maximal)
 {
 	struct xfs_btree_cur	*cur;
-	struct xfs_buf		*agbp;
 	struct xfs_refcount_irec	tmp;
-	int			error;
 	int			i, have;
 	int			bt_error = XFS_BTREE_ERROR;
+	int			error;
 
 	trace_xfs_refcount_find_shared(mp, agno, agbno, aglen);
 
-	if (xfs_always_cow) {
-		*fbno = agbno;
-		*flen = aglen;
-		return 0;
-	}
-
-	error = xfs_alloc_read_agf(mp, NULL, agno, 0, &agbp);
-	if (error)
-		goto out;
 	cur = xfs_refcountbt_init_cursor(mp, NULL, agbp, agno, NULL);
 
 	/* By default, skip the whole range */
@@ -1273,14 +1264,46 @@ done:
 
 out_error:
 	xfs_btree_del_cursor(cur, bt_error);
-	xfs_buf_relse(agbp);
-out:
 	if (error)
 		trace_xfs_refcount_find_shared_error(mp, agno, error, _RET_IP_);
 	return error;
 }
 
 /*
+ * Given an AG extent, find the lowest-numbered run of shared blocks within
+ * that range and return the range in fbno/flen.
+ */
+int
+xfs_refcount_find_shared(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno,
+	xfs_agblock_t		agbno,
+	xfs_extlen_t		aglen,
+	xfs_agblock_t		*fbno,
+	xfs_extlen_t		*flen,
+	bool			find_maximal)
+{
+	struct xfs_buf		*agbp;
+	int			error;
+
+	if (xfs_always_cow) {
+		*fbno = agbno;
+		*flen = aglen;
+		return 0;
+	}
+
+	error = xfs_alloc_read_agf(mp, NULL, agno, 0, &agbp);
+	if (error)
+		return error;
+
+	error = __xfs_refcount_find_shared(mp, agbp, agno, agbno, aglen,
+			fbno, flen, find_maximal);
+
+	xfs_buf_relse(agbp);
+	return error;
+}
+
+/*
  * Recovering CoW Blocks After a Crash
  *
  * Due to the way that the copy on write mechanism works, there's a window of
diff --git a/fs/xfs/libxfs/xfs_refcount.h b/fs/xfs/libxfs/xfs_refcount.h
index 6665eeb..44b0346 100644
--- a/fs/xfs/libxfs/xfs_refcount.h
+++ b/fs/xfs/libxfs/xfs_refcount.h
@@ -53,6 +53,10 @@ extern int xfs_refcount_finish_one(struct xfs_trans *tp,
 		xfs_fsblock_t startblock, xfs_extlen_t blockcount,
 		xfs_extlen_t *adjusted, struct xfs_btree_cur **pcur);
 
+extern int __xfs_refcount_find_shared(struct xfs_mount *mp,
+		struct xfs_buf *agbp, xfs_agnumber_t agno, xfs_agblock_t agbno,
+		xfs_extlen_t aglen, xfs_agblock_t *fbno, xfs_extlen_t *flen,
+		bool find_maximal);
 extern int xfs_refcount_find_shared(struct xfs_mount *mp, xfs_agnumber_t agno,
 		xfs_agblock_t agbno, xfs_extlen_t aglen, xfs_agblock_t *fbno,
 		xfs_extlen_t *flen, bool find_maximal);
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index e76aefc..e69d9cf 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -44,6 +44,8 @@
 #include "xfs_filestream.h"
 #include "xfs_refcount_btree.h"
 #include "xfs_ag_resv.h"
+#include "xfs_bit.h"
+#include "xfs_refcount.h"
 
 /*
  * File system operations
@@ -1027,3 +1029,339 @@ xfs_fs_unreserve_ag_blocks(
 	if (error)
 		xfs_warn(mp, "Error %d unreserving metadata blocks.", error);
 }
+
+struct xfs_getfsmap_info {
+	struct getfsmapx	*fmv;
+	xfs_fsmap_format_t	formatter;
+	void			*format_arg;
+	xfs_daddr_t		next_daddr;
+	bool			last;
+	xfs_agnumber_t		start_ag;
+	struct xfs_rmap_irec	low;
+};
+
+/* Compare a record against our starting point */
+static bool
+xfs_getfsmap_compare(
+	xfs_agnumber_t			agno,
+	struct xfs_getfsmap_info	*info,
+	struct xfs_rmap_irec		*rec)
+{
+	uint64_t			x, y;
+
+	if (rec->rm_startblock < info->low.rm_startblock)
+		return true;
+	if (rec->rm_startblock > info->low.rm_startblock)
+		return false;
+
+	if (rec->rm_owner < info->low.rm_owner)
+		return true;
+	if (rec->rm_owner > info->low.rm_owner)
+		return false;
+
+	x = xfs_rmap_irec_offset_pack(rec);
+	y = xfs_rmap_irec_offset_pack(&info->low);
+	if (x < y)
+		return true;
+	return false;
+}
+
+STATIC bool
+xfs_getfsmap_is_shared(
+	struct xfs_btree_cur	*cur,
+	struct xfs_rmap_irec	*rec)
+{
+	xfs_agblock_t		fbno;
+	xfs_extlen_t		flen;
+	int			error;
+
+	if (!xfs_sb_version_hasreflink(&cur->bc_mp->m_sb))
+		return false;
+
+	/* Are there any shared blocks here? */
+	flen = 0;
+	error = __xfs_refcount_find_shared(cur->bc_mp, cur->bc_private.a.agbp,
+			cur->bc_private.a.agno, rec->rm_startblock,
+			rec->rm_blockcount, &fbno, &flen, false);
+	return error == 0 && flen > 0;
+}
+
+/* Transform a rmap irec into a fsmapx */
+STATIC int
+xfs_getfsmap_helper(
+	struct xfs_btree_cur		*cur,
+	struct xfs_rmap_irec		*rec,
+	void				*priv)
+{
+	struct xfs_mount		*mp = cur->bc_mp;
+	struct xfs_getfsmap_info	*info = priv;
+	xfs_fsblock_t			fsb;
+	struct getfsmapx		fmv;
+	xfs_daddr_t			rec_daddr;
+	int				error;
+
+	fsb = XFS_AGB_TO_FSB(mp, cur->bc_private.a.agno,
+			rec->rm_startblock);
+	rec_daddr = XFS_FSB_TO_DADDR(mp, fsb);
+
+	/*
+	 * Filter out records that start before our startpoint, if the caller
+	 * requested that.
+	 */
+	if (info->fmv->fmv_length &&
+	    xfs_getfsmap_compare(cur->bc_private.a.agno, info, rec)) {
+		rec_daddr = XFS_FSB_TO_DADDR(mp, fsb +
+				rec->rm_blockcount);
+		if (info->next_daddr < rec_daddr)
+			info->next_daddr = rec_daddr;
+		return XFS_BTREE_QUERY_RANGE_CONTINUE;
+	}
+
+	/* We're just counting mappings */
+	if (info->fmv->fmv_count == 2) {
+		if (rec_daddr > info->next_daddr)
+			info->fmv->fmv_entries++;
+
+		if (info->last)
+			return XFS_BTREE_QUERY_RANGE_CONTINUE;
+
+		info->fmv->fmv_entries++;
+
+		rec_daddr = XFS_FSB_TO_DADDR(mp, fsb +
+				rec->rm_blockcount);
+		if (info->next_daddr < rec_daddr)
+			info->next_daddr = rec_daddr;
+		return XFS_BTREE_QUERY_RANGE_CONTINUE;
+	}
+
+	/* Did we find some free space? */
+	if (rec_daddr > info->next_daddr) {
+		if (info->fmv->fmv_entries >= info->fmv->fmv_count - 2)
+			return XFS_BTREE_QUERY_RANGE_ABORT;
+
+		trace_xfs_fsmap_mapping(mp, cur->bc_private.a.agno,
+				XFS_DADDR_TO_FSB(mp, info->next_daddr),
+				XFS_DADDR_TO_FSB(mp, rec_daddr -
+						info->next_daddr),
+				FMV_OWN_FREE, 0);
+
+		fmv.fmv_device = new_encode_dev(mp->m_ddev_targp->bt_dev);
+		fmv.fmv_block = info->next_daddr;
+		fmv.fmv_owner = FMV_OWN_FREE;
+		fmv.fmv_offset = 0;
+		fmv.fmv_length = rec_daddr - info->next_daddr;
+		fmv.fmv_oflags = FMV_OF_SPECIAL_OWNER;
+		fmv.fmv_count = 0;
+		fmv.fmv_entries = 0;
+		fmv.fmv_unused1 = 0;
+		fmv.fmv_unused2 = 0;
+		error = info->formatter(&fmv, info->format_arg);
+		if (error)
+			return error;
+		info->fmv->fmv_entries++;
+	}
+
+	if (info->last)
+		goto out;
+
+	/* Fill out the extent we found */
+	if (info->fmv->fmv_entries >= info->fmv->fmv_count - 2)
+		return XFS_BTREE_QUERY_RANGE_ABORT;
+
+	trace_xfs_fsmap_mapping(mp, cur->bc_private.a.agno,
+			rec->rm_startblock, rec->rm_blockcount, rec->rm_owner,
+			rec->rm_offset);
+
+	fmv.fmv_device = new_encode_dev(mp->m_ddev_targp->bt_dev);
+	fmv.fmv_block = rec_daddr;
+	fmv.fmv_owner = rec->rm_owner;
+	fmv.fmv_offset = XFS_FSB_TO_BB(mp, rec->rm_offset);
+	fmv.fmv_length = XFS_FSB_TO_BB(mp, rec->rm_blockcount);
+	fmv.fmv_oflags = 0;
+	fmv.fmv_count = 0;
+	fmv.fmv_entries = 0;
+	fmv.fmv_unused1 = 0;
+	fmv.fmv_unused2 = 0;
+	if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner))
+		fmv.fmv_oflags |= FMV_OF_SPECIAL_OWNER;
+	if (rec->rm_flags & XFS_RMAP_UNWRITTEN)
+		fmv.fmv_oflags |= FMV_OF_PREALLOC;
+	if (rec->rm_flags & XFS_RMAP_ATTR_FORK)
+		fmv.fmv_oflags |= FMV_OF_ATTR_FORK;
+	if (rec->rm_flags & XFS_RMAP_BMBT_BLOCK)
+		fmv.fmv_oflags |= FMV_OF_EXTENT_MAP;
+	if (fmv.fmv_oflags == 0 && xfs_getfsmap_is_shared(cur, rec))
+		fmv.fmv_oflags |= FMV_OF_SHARED;
+	error = info->formatter(&fmv, info->format_arg);
+	if (error)
+		return error;
+	info->fmv->fmv_entries++;
+
+out:
+	rec_daddr = XFS_FSB_TO_DADDR(mp, fsb + rec->rm_blockcount);
+	if (info->next_daddr < rec_daddr)
+		info->next_daddr = rec_daddr;
+	return XFS_BTREE_QUERY_RANGE_CONTINUE;
+}
+
+/* Do we recognize the device? */
+STATIC bool
+xfs_getfsmap_is_valid_device(
+	struct xfs_mount	*mp,
+	struct getfsmapx	*fmv)
+{
+	return fmv->fmv_device == 0 || fmv->fmv_device == UINT_MAX ||
+	       fmv->fmv_device == new_encode_dev(mp->m_ddev_targp->bt_dev);
+}
+
+/*
+ * Get the filesystem's extents described by fmv and format them for
+ * output.  Calls the formatter to fill the user's buffer until all
+ * extents have been mapped, until the passed-in fmv->fmv_count slots
+ * have been filled, or until the formatter short-circuits the loop
+ * (if it is tracking filled-in extents on its own).
+ */
+int
+xfs_getfsmap(
+	struct xfs_mount	*mp,
+	struct getfsmapx	*fmv,
+	xfs_fsmap_format_t	formatter,
+	void			*arg)
+{
+	struct xfs_getfsmap_info	info;
+	struct xfs_buf		*agbp = NULL;
+	struct xfs_btree_cur	*bt_cur = NULL;
+	struct getfsmapx	*fmv_low;
+	struct getfsmapx	*fmv_high;
+	struct xfs_rmap_irec	high;
+	xfs_fsblock_t		start_fsb;
+	xfs_fsblock_t		end_fsb;
+	xfs_agnumber_t		end_ag;
+	xfs_agnumber_t		agno;
+	xfs_daddr_t		eofs;
+	xfs_extlen_t		extlen;
+	int			error = 0;
+
+	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
+		return -EOPNOTSUPP;
+	if (fmv->fmv_count < 2)
+		return -EINVAL;
+	if (fmv->fmv_iflags & (~FMV_IF_VALID))
+		return -EINVAL;
+	fmv_low = fmv;
+	fmv_high = fmv + 1;
+	if (!xfs_getfsmap_is_valid_device(mp, fmv) ||
+	    !xfs_getfsmap_is_valid_device(mp, fmv_high) ||
+	    fmv_high->fmv_iflags || fmv_high->fmv_count ||
+	    fmv_high->fmv_length || fmv_high->fmv_entries ||
+	    fmv_high->fmv_unused1 || fmv->fmv_unused1 ||
+	    fmv_high->fmv_unused2 || fmv->fmv_unused2)
+		return -EINVAL;
+
+	fmv->fmv_entries = 0;
+	eofs = XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks);
+	if (fmv->fmv_block >= eofs)
+		return 0;
+	if (fmv_high->fmv_block >= eofs)
+		fmv_high->fmv_block = eofs - 1;
+	start_fsb = XFS_DADDR_TO_FSB(mp, fmv->fmv_block);
+	end_fsb = XFS_DADDR_TO_FSB(mp, fmv_high->fmv_block);
+
+	/* Set up search keys */
+	info.low.rm_startblock = XFS_FSB_TO_AGBNO(mp, start_fsb);
+	info.low.rm_offset = XFS_DADDR_TO_FSB(mp, fmv->fmv_offset);
+	info.low.rm_owner = fmv->fmv_owner;
+	info.low.rm_blockcount = 0;
+	extlen = XFS_DADDR_TO_FSB(mp, fmv->fmv_length);
+	if (fmv->fmv_oflags & (FMV_OF_SPECIAL_OWNER | FMV_OF_EXTENT_MAP)) {
+		info.low.rm_startblock += extlen;
+		info.low.rm_owner = 0;
+		info.low.rm_offset = 0;
+	} else
+		info.low.rm_offset += extlen;
+	if (fmv->fmv_oflags & FMV_OF_ATTR_FORK)
+		info.low.rm_flags |= XFS_RMAP_ATTR_FORK;
+	if (fmv->fmv_oflags & FMV_OF_EXTENT_MAP)
+		info.low.rm_flags |= XFS_RMAP_BMBT_BLOCK;
+	if (fmv->fmv_oflags & FMV_OF_PREALLOC)
+		info.low.rm_flags |= XFS_RMAP_UNWRITTEN;
+
+	high.rm_startblock = -1U;
+	high.rm_owner = ULLONG_MAX;
+	high.rm_offset = ULLONG_MAX;
+	high.rm_blockcount = 0;
+	high.rm_flags = XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK |
+			XFS_RMAP_UNWRITTEN;
+	info.fmv = fmv;
+	info.formatter = formatter;
+	info.format_arg = arg;
+	info.last = false;
+
+	info.start_ag = XFS_FSB_TO_AGNO(mp, start_fsb);
+	end_ag = XFS_FSB_TO_AGNO(mp, end_fsb);
+	info.next_daddr = XFS_FSB_TO_DADDR(mp, XFS_AGB_TO_FSB(mp, info.start_ag,
+			info.low.rm_startblock));
+
+	/* Query each AG */
+	for (agno = info.start_ag; agno <= end_ag; agno++) {
+		if (agno == end_ag) {
+			high.rm_startblock = XFS_FSB_TO_AGBNO(mp, end_fsb);
+			high.rm_offset = XFS_DADDR_TO_FSB(mp,
+					fmv_high->fmv_offset);
+			high.rm_owner = fmv_high->fmv_owner;
+			if (fmv_high->fmv_oflags & FMV_OF_ATTR_FORK)
+				high.rm_flags |= XFS_RMAP_ATTR_FORK;
+			if (fmv_high->fmv_oflags & FMV_OF_EXTENT_MAP)
+				high.rm_flags |= XFS_RMAP_BMBT_BLOCK;
+			if (fmv_high->fmv_oflags & FMV_OF_PREALLOC)
+				high.rm_flags |= XFS_RMAP_UNWRITTEN;
+		}
+
+		if (bt_cur) {
+			xfs_btree_del_cursor(bt_cur, XFS_BTREE_NOERROR);
+			xfs_trans_brelse(NULL, agbp);
+			bt_cur = NULL;
+			agbp = NULL;
+		}
+
+		error = xfs_alloc_read_agf(mp, NULL, agno, 0, &agbp);
+		if (error)
+			goto err;
+
+		trace_xfs_fsmap_low_key(mp, agno, info.low.rm_startblock,
+				info.low.rm_blockcount, info.low.rm_owner,
+				info.low.rm_offset);
+
+		trace_xfs_fsmap_high_key(mp, agno, high.rm_startblock,
+				high.rm_blockcount, high.rm_owner,
+				high.rm_offset);
+
+		bt_cur = xfs_rmapbt_init_cursor(mp, NULL, agbp, agno);
+		error = xfs_rmapbt_query_range(bt_cur, &info.low, &high,
+				xfs_getfsmap_helper, &info);
+		if (error)
+			goto err;
+
+		if (agno == info.start_ag) {
+			info.low.rm_startblock = 0;
+			info.low.rm_owner = 0;
+			info.low.rm_offset = 0;
+			info.low.rm_flags = 0;
+		}
+	}
+
+	/* Report any free space at the end of the AG */
+	info.last = true;
+	error = xfs_getfsmap_helper(bt_cur, &high, &info);
+	if (error)
+		goto err;
+
+err:
+	if (bt_cur)
+		xfs_btree_del_cursor(bt_cur, error < 0 ? XFS_BTREE_ERROR :
+							 XFS_BTREE_NOERROR);
+	if (agbp)
+		xfs_trans_brelse(NULL, agbp);
+
+	return error;
+}
diff --git a/fs/xfs/xfs_fsops.h b/fs/xfs/xfs_fsops.h
index 71e3248..8101fb9 100644
--- a/fs/xfs/xfs_fsops.h
+++ b/fs/xfs/xfs_fsops.h
@@ -29,4 +29,10 @@ extern int xfs_fs_goingdown(xfs_mount_t *mp, __uint32_t inflags);
 extern void xfs_fs_reserve_ag_blocks(struct xfs_mount *mp);
 extern void xfs_fs_unreserve_ag_blocks(struct xfs_mount *mp);
 
+/* fsmap to userspace formatter - copy to user & advance pointer */
+typedef int (*xfs_fsmap_format_t)(struct getfsmapx *, void *);
+
+int	xfs_getfsmap(struct xfs_mount *mp, struct getfsmapx *fmv,
+		xfs_fsmap_format_t formatter, void *arg);
+
 #endif	/* __XFS_FSOPS_H__ */
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index aa9645c..736e747 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -42,6 +42,7 @@
 #include "xfs_pnfs.h"
 #include "xfs_acl.h"
 #include "xfs_reflink.h"
+#include "xfs_btree.h"
 
 #include <linux/capability.h>
 #include <linux/dcache.h>
@@ -1611,6 +1612,76 @@ xfs_ioc_getbmapx(
 	return 0;
 }
 
+struct getfsmapx_info {
+	struct xfs_mount	*mp;
+	struct getfsmapx __user	*data;
+	__s64			last_flags;
+};
+
+STATIC int
+xfs_getfsmapx_format(struct getfsmapx *fmv, void *priv)
+{
+	struct getfsmapx_info	*info = priv;
+
+	trace_xfs_getfsmap_mapping(info->mp, fmv->fmv_block,
+			fmv->fmv_length, fmv->fmv_owner,
+			fmv->fmv_offset, fmv->fmv_oflags);
+
+	info->last_flags = fmv->fmv_oflags;
+	if (copy_to_user(info->data, fmv, sizeof(struct getfsmapx)))
+		return -EFAULT;
+
+	info->data++;
+	return 0;
+}
+
+STATIC int
+xfs_ioc_getfsmapx(
+	struct xfs_inode	*ip,
+	void			__user *arg)
+{
+	struct getfsmapx_info	info;
+	struct getfsmapx	fmx[2];
+	bool			aborted = false;
+	int			error;
+
+	if (copy_from_user(&fmx, arg, 2 * sizeof(struct getfsmapx)))
+		return -EFAULT;
+
+	trace_xfs_getfsmap_low_key(ip->i_mount, fmx[0].fmv_block,
+			fmx[0].fmv_length, fmx[0].fmv_owner,
+			fmx[0].fmv_offset, fmx[0].fmv_oflags);
+
+	trace_xfs_getfsmap_high_key(ip->i_mount, fmx[1].fmv_block,
+			fmx[1].fmv_length, fmx[1].fmv_owner,
+			fmx[1].fmv_offset, fmx[1].fmv_oflags);
+
+	info.mp = ip->i_mount;
+	info.data = (__force struct getfsmapx *)arg + 2;
+	error = xfs_getfsmap(ip->i_mount, fmx, xfs_getfsmapx_format, &info);
+	if (error == XFS_BTREE_QUERY_RANGE_ABORT) {
+		error = 0;
+		aborted = true;
+	}
+	if (error)
+		return error;
+
+	/* If we didn't abort, set the "last" flag in the last fmx */
+	if (!aborted && fmx[0].fmv_entries) {
+		info.data--;
+		info.last_flags |= FMV_OF_LAST;
+		if (copy_to_user(&info.data->fmv_oflags, &info.last_flags,
+				sizeof(info.last_flags)))
+			return -EFAULT;
+	}
+
+	/* copy back header */
+	if (copy_to_user(arg, fmx, 2 * sizeof(struct getfsmapx)))
+		return -EFAULT;
+
+	return 0;
+}
+
 int
 xfs_ioc_swapext(
 	xfs_swapext_t	*sxp)
@@ -1784,6 +1855,11 @@ xfs_file_ioctl(
 	case XFS_IOC_GETBMAPX:
 		return xfs_ioc_getbmapx(ip, arg);
 
+	case XFS_IOC_GETFSMAPX:
+		if (!capable(CAP_SYS_ADMIN))
+			return -EPERM;
+		return xfs_ioc_getfsmapx(ip, arg);
+
 	case XFS_IOC_FD_TO_HANDLE:
 	case XFS_IOC_PATH_TO_HANDLE:
 	case XFS_IOC_PATH_TO_FSHANDLE: {
diff --git a/fs/xfs/xfs_ioctl32.c b/fs/xfs/xfs_ioctl32.c
index 1a05d8a..337e436 100644
--- a/fs/xfs/xfs_ioctl32.c
+++ b/fs/xfs/xfs_ioctl32.c
@@ -558,6 +558,7 @@ xfs_file_compat_ioctl(
 	case XFS_IOC_GOINGDOWN:
 	case XFS_IOC_ERROR_INJECTION:
 	case XFS_IOC_ERROR_CLEARALL:
+	case XFS_IOC_GETFSMAPX:
 		return xfs_file_ioctl(filp, cmd, p);
 #ifndef BROKEN_X86_ALIGNMENT
 	/* These are handled fine if no alignment issues */
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index d64bab7..9fe812f 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -3352,6 +3352,82 @@ DEFINE_INODE_EVENT(xfs_reflink_cancel_pending_cow);
 DEFINE_INODE_IREC_EVENT(xfs_reflink_cancel_cow);
 DEFINE_INODE_ERROR_EVENT(xfs_reflink_cancel_pending_cow_error);
 
+/* fsmap traces */
+DECLARE_EVENT_CLASS(xfs_fsmap_class,
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, xfs_agblock_t agbno,
+		 xfs_extlen_t len, __uint64_t owner, __uint64_t offset),
+	TP_ARGS(mp, agno, agbno, len, owner, offset),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_agblock_t, agbno)
+		__field(xfs_extlen_t, len)
+		__field(__uint64_t, owner)
+		__field(__uint64_t, offset)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->agno = agno;
+		__entry->agbno = agbno;
+		__entry->len = len;
+		__entry->owner = owner;
+		__entry->offset = offset;
+	),
+	TP_printk("dev %d:%d agno %u agbno %u len %u owner %lld offset 0x%llx",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->agno,
+		  __entry->agbno,
+		  __entry->len,
+		  __entry->owner,
+		  __entry->offset)
+)
+#define DEFINE_FSMAP_EVENT(name) \
+DEFINE_EVENT(xfs_fsmap_class, name, \
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, \
+		 xfs_agblock_t agbno, xfs_extlen_t len, __uint64_t owner, \
+		 __uint64_t offset), \
+	TP_ARGS(mp, agno, agbno, len, owner, offset))
+DEFINE_FSMAP_EVENT(xfs_fsmap_low_key);
+DEFINE_FSMAP_EVENT(xfs_fsmap_high_key);
+DEFINE_FSMAP_EVENT(xfs_fsmap_mapping);
+
+DECLARE_EVENT_CLASS(xfs_getfsmap_class,
+	TP_PROTO(struct xfs_mount *mp, xfs_daddr_t block, xfs_daddr_t len,
+		 __uint64_t owner, __uint64_t offset, __uint64_t flags),
+	TP_ARGS(mp, block, len, owner, offset, flags),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_daddr_t, block)
+		__field(xfs_daddr_t, len)
+		__field(__uint64_t, owner)
+		__field(__uint64_t, offset)
+		__field(__uint64_t, flags)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->block = block;
+		__entry->len = len;
+		__entry->owner = owner;
+		__entry->offset = offset;
+		__entry->flags = flags;
+	),
+	TP_printk("dev %d:%d block %llu len %llu owner %lld offset %llu flags 0x%llx",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->block,
+		  __entry->len,
+		  __entry->owner,
+		  __entry->offset,
+		  __entry->flags)
+)
+#define DEFINE_GETFSMAP_EVENT(name) \
+DEFINE_EVENT(xfs_getfsmap_class, name, \
+	TP_PROTO(struct xfs_mount *mp, xfs_daddr_t block, xfs_daddr_t len, \
+		 __uint64_t owner, __uint64_t offset, __uint64_t flags), \
+	TP_ARGS(mp, block, len, owner, offset, flags))
+DEFINE_GETFSMAP_EVENT(xfs_getfsmap_low_key);
+DEFINE_GETFSMAP_EVENT(xfs_getfsmap_high_key);
+DEFINE_GETFSMAP_EVENT(xfs_getfsmap_mapping);
+
 #endif /* _TRACE_XFS_H */
 
 #undef TRACE_INCLUDE_PATH



* [PATCH 113/119] xfs: scrub btree records and pointers while querying
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (111 preceding siblings ...)
  2016-06-17  1:29 ` [PATCH 112/119] xfs: introduce the XFS_IOC_GETFSMAPX ioctl Darrick J. Wong
@ 2016-06-17  1:30 ` Darrick J. Wong
  2016-06-17  1:30 ` [PATCH 114/119] xfs: create sysfs hooks to scrub various files Darrick J. Wong
                   ` (5 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:30 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Create a function that walks a btree, checking the integrity of each
btree block (headers, keys, records) and invoking a caller-supplied
callback to perform further checks on each record.

v2: Prefix function names with xfs_

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile                |    1 
 fs/xfs/libxfs/xfs_alloc.c      |   33 +++
 fs/xfs/libxfs/xfs_alloc.h      |    3 
 fs/xfs/libxfs/xfs_btree.c      |   12 +
 fs/xfs/libxfs/xfs_btree.h      |   15 +-
 fs/xfs/libxfs/xfs_format.h     |    2 
 fs/xfs/libxfs/xfs_rmap.c       |   39 ++++
 fs/xfs/libxfs/xfs_rmap_btree.h |    3 
 fs/xfs/libxfs/xfs_scrub.c      |  396 ++++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_scrub.h      |   76 ++++++++
 10 files changed, 571 insertions(+), 9 deletions(-)
 create mode 100644 fs/xfs/libxfs/xfs_scrub.c
 create mode 100644 fs/xfs/libxfs/xfs_scrub.h


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 56c384b..8942390 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -58,6 +58,7 @@ xfs-y				+= $(addprefix libxfs/, \
 				   xfs_refcount.o \
 				   xfs_refcount_btree.o \
 				   xfs_sb.o \
+				   xfs_scrub.o \
 				   xfs_symlink_remote.o \
 				   xfs_trans_resv.o \
 				   )
diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 188c359a..6fc1981 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2924,3 +2924,36 @@ err:
 	xfs_trans_brelse(tp, agbp);
 	return error;
 }
+
+/* Is there a record covering a given extent? */
+int
+xfs_alloc_record_exists(
+	struct xfs_btree_cur	*cur,
+	xfs_agblock_t		bno,
+	xfs_extlen_t		len,
+	bool			*is_freesp)
+{
+	int			stat;
+	xfs_agblock_t		fbno;
+	xfs_extlen_t		flen;
+	int			error;
+
+	error = xfs_alloc_lookup_le(cur, bno, len, &stat);
+	if (error)
+		return error;
+	if (!stat) {
+		*is_freesp = false;
+		return 0;
+	}
+
+	error = xfs_alloc_get_rec(cur, &fbno, &flen, &stat);
+	if (error)
+		return error;
+	if (!stat) {
+		*is_freesp = false;
+		return 0;
+	}
+
+	*is_freesp = (fbno <= bno && fbno + flen >= bno + len);
+	return 0;
+}
diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index 9f6373a4..4f2ce38 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -210,4 +210,7 @@ int xfs_free_extent_fix_freelist(struct xfs_trans *tp, xfs_agnumber_t agno,
 
 xfs_extlen_t xfs_prealloc_blocks(struct xfs_mount *mp);
 
+int xfs_alloc_record_exists(struct xfs_btree_cur *cur, xfs_agblock_t bno,
+		xfs_extlen_t len, bool *is_freesp);
+
 #endif	/* __XFS_ALLOC_H__ */
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 9c84184..5260085 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -549,7 +549,7 @@ xfs_btree_ptr_offset(
 /*
  * Return a pointer to the n-th record in the btree block.
  */
-STATIC union xfs_btree_rec *
+union xfs_btree_rec *
 xfs_btree_rec_addr(
 	struct xfs_btree_cur	*cur,
 	int			n,
@@ -562,7 +562,7 @@ xfs_btree_rec_addr(
 /*
  * Return a pointer to the n-th key in the btree block.
  */
-STATIC union xfs_btree_key *
+union xfs_btree_key *
 xfs_btree_key_addr(
 	struct xfs_btree_cur	*cur,
 	int			n,
@@ -575,7 +575,7 @@ xfs_btree_key_addr(
 /*
  * Return a pointer to the n-th high key in the btree block.
  */
-STATIC union xfs_btree_key *
+union xfs_btree_key *
 xfs_btree_high_key_addr(
 	struct xfs_btree_cur	*cur,
 	int			n,
@@ -588,7 +588,7 @@ xfs_btree_high_key_addr(
 /*
  * Return a pointer to the n-th block pointer in the btree block.
  */
-STATIC union xfs_btree_ptr *
+union xfs_btree_ptr *
 xfs_btree_ptr_addr(
 	struct xfs_btree_cur	*cur,
 	int			n,
@@ -622,7 +622,7 @@ xfs_btree_get_iroot(
  * Retrieve the block pointer from the cursor at the given level.
  * This may be an inode btree root or from a buffer.
  */
-STATIC struct xfs_btree_block *		/* generic btree block pointer */
+struct xfs_btree_block *		/* generic btree block pointer */
 xfs_btree_get_block(
 	struct xfs_btree_cur	*cur,	/* btree cursor */
 	int			level,	/* level in btree */
@@ -1733,7 +1733,7 @@ error0:
 	return error;
 }
 
-STATIC int
+int
 xfs_btree_lookup_get_block(
 	struct xfs_btree_cur	*cur,	/* btree cursor */
 	int			level,	/* level in the btree */
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index dbf299f..6f22cb0 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -194,7 +194,6 @@ struct xfs_btree_ops {
 
 	const struct xfs_buf_ops	*buf_ops;
 
-#if defined(DEBUG) || defined(XFS_WARN)
 	/* check that k1 is lower than k2 */
 	int	(*keys_inorder)(struct xfs_btree_cur *cur,
 				union xfs_btree_key *k1,
@@ -204,7 +203,6 @@ struct xfs_btree_ops {
 	int	(*recs_inorder)(struct xfs_btree_cur *cur,
 				union xfs_btree_rec *r1,
 				union xfs_btree_rec *r2);
-#endif
 };
 
 /* btree ops flags */
@@ -537,4 +535,17 @@ int xfs_btree_visit_blocks(struct xfs_btree_cur *cur,
 
 int xfs_btree_count_blocks(struct xfs_btree_cur *cur, xfs_extlen_t *blocks);
 
+union xfs_btree_rec *xfs_btree_rec_addr(struct xfs_btree_cur *cur, int n,
+		struct xfs_btree_block *block);
+union xfs_btree_key *xfs_btree_key_addr(struct xfs_btree_cur *cur, int n,
+		struct xfs_btree_block *block);
+union xfs_btree_key *xfs_btree_high_key_addr(struct xfs_btree_cur *cur, int n,
+		struct xfs_btree_block *block);
+union xfs_btree_ptr *xfs_btree_ptr_addr(struct xfs_btree_cur *cur, int n,
+		struct xfs_btree_block *block);
+int xfs_btree_lookup_get_block(struct xfs_btree_cur *cur, int level,
+		union xfs_btree_ptr *pp, struct xfs_btree_block **blkp);
+struct xfs_btree_block *xfs_btree_get_block(struct xfs_btree_cur *cur,
+		int level, struct xfs_buf **bpp);
+
 #endif	/* __XFS_BTREE_H__ */
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 211a8b5..6ea8a84 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -518,7 +518,7 @@ static inline int xfs_sb_version_hasftype(struct xfs_sb *sbp)
 		 (sbp->sb_features2 & XFS_SB_VERSION2_FTYPE));
 }
 
-static inline int xfs_sb_version_hasfinobt(xfs_sb_t *sbp)
+static inline bool xfs_sb_version_hasfinobt(xfs_sb_t *sbp)
 {
 	return (XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5) &&
 		(sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_FINOBT);
diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
index 29d08fc..e7673ec 100644
--- a/fs/xfs/libxfs/xfs_rmap.c
+++ b/fs/xfs/libxfs/xfs_rmap.c
@@ -2328,3 +2328,42 @@ xfs_rmap_free_defer(
 
 	return __xfs_rmap_add(mp, dfops, &ri);
 }
+
+/* Is there a record covering a given extent? */
+int
+xfs_rmap_record_exists(
+	struct xfs_btree_cur	*cur,
+	xfs_agblock_t		bno,
+	xfs_extlen_t		len,
+	struct xfs_owner_info	*oinfo,
+	bool			*has_rmap)
+{
+	uint64_t		owner;
+	uint64_t		offset;
+	unsigned int		flags;
+	int			stat;
+	struct xfs_rmap_irec	irec;
+	int			error;
+
+	xfs_owner_info_unpack(oinfo, &owner, &offset, &flags);
+
+	error = xfs_rmap_lookup_le(cur, bno, len, owner, offset, flags, &stat);
+	if (error)
+		return error;
+	if (!stat) {
+		*has_rmap = false;
+		return 0;
+	}
+
+	error = xfs_rmap_get_rec(cur, &irec, &stat);
+	if (error)
+		return error;
+	if (!stat) {
+		*has_rmap = false;
+		return 0;
+	}
+
+	*has_rmap = (irec.rm_startblock <= bno &&
+		     irec.rm_startblock + irec.rm_blockcount >= bno + len);
+	return 0;
+}
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
index 5baa81f..2f072c8 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.h
+++ b/fs/xfs/libxfs/xfs_rmap_btree.h
@@ -144,4 +144,7 @@ extern xfs_extlen_t xfs_rmapbt_max_size(struct xfs_mount *mp);
 extern int xfs_rmapbt_calc_reserves(struct xfs_mount *mp,
 		xfs_agnumber_t agno, xfs_extlen_t *ask, xfs_extlen_t *used);
 
+extern int xfs_rmap_record_exists(struct xfs_btree_cur *cur, xfs_agblock_t bno,
+		xfs_extlen_t len, struct xfs_owner_info *oinfo, bool *has_rmap);
+
 #endif	/* __XFS_RMAP_BTREE_H__ */
diff --git a/fs/xfs/libxfs/xfs_scrub.c b/fs/xfs/libxfs/xfs_scrub.c
new file mode 100644
index 0000000..d43d5c5
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_scrub.c
@@ -0,0 +1,396 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_alloc.h"
+#include "xfs_bmap.h"
+#include "xfs_ialloc.h"
+#include "xfs_refcount.h"
+#include "xfs_alloc_btree.h"
+#include "xfs_rmap_btree.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_scrub.h"
+
+static const char * const btree_types[] = {
+	[XFS_BTNUM_BNO]		= "bnobt",
+	[XFS_BTNUM_CNT]		= "cntbt",
+	[XFS_BTNUM_RMAP]	= "rmapbt",
+	[XFS_BTNUM_BMAP]	= "bmapbt",
+	[XFS_BTNUM_INO]		= "inobt",
+	[XFS_BTNUM_FINO]	= "finobt",
+	[XFS_BTNUM_REFC]	= "refcountbt",
+};
+
+/* Report a scrub corruption in dmesg. */
+void
+xfs_btree_scrub_error(
+	struct xfs_btree_cur		*cur,
+	int				level,
+	const char			*file,
+	int				line,
+	const char			*check)
+{
+	char				buf[16];
+	xfs_fsblock_t			fsbno;
+
+	if (cur->bc_ptrs[level] >= 1)
+		snprintf(buf, sizeof(buf), " ptr %d", cur->bc_ptrs[level]);
+	else
+		buf[0] = 0;
+
+	fsbno = XFS_DADDR_TO_FSB(cur->bc_mp, cur->bc_bufs[level]->b_bn);
+	xfs_alert(cur->bc_mp, "scrub: %s btree corruption in block %u/%u%s: %s, file: %s, line: %d",
+			btree_types[cur->bc_btnum],
+			XFS_FSB_TO_AGNO(cur->bc_mp, fsbno),
+			XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno),
+			buf, check, file, line);
+}
+
+/* AG metadata scrubbing */
+
+/*
+ * Make sure this record is in order and doesn't stray outside of the parent
+ * keys.
+ */
+static int
+xfs_btree_scrub_rec(
+	struct xfs_btree_scrub	*bs)
+{
+	struct xfs_btree_cur	*cur = bs->cur;
+	union xfs_btree_rec	*rec;
+	union xfs_btree_key	key;
+	union xfs_btree_key	*keyp;
+	struct xfs_btree_block	*block;
+	struct xfs_btree_block	*keyblock;
+
+	block = XFS_BUF_TO_BLOCK(cur->bc_bufs[0]);
+	rec = xfs_btree_rec_addr(cur, cur->bc_ptrs[0], block);
+
+	/* If this isn't the first record, are they in order? */
+	XFS_BTREC_SCRUB_CHECK(bs, bs->firstrec ||
+			cur->bc_ops->recs_inorder(cur, &bs->lastrec, rec));
+	bs->firstrec = false;
+	bs->lastrec = *rec;
+
+	if (cur->bc_nlevels == 1)
+		return 0;
+
+	/* Is this at least as large as the parent low key? */
+	cur->bc_ops->init_key_from_rec(&key, rec);
+	keyblock = XFS_BUF_TO_BLOCK(cur->bc_bufs[1]);
+	keyp = xfs_btree_key_addr(cur, cur->bc_ptrs[1], keyblock);
+
+	XFS_BTKEY_SCRUB_CHECK(bs, 0,
+			cur->bc_ops->diff_two_keys(cur, keyp, &key) >= 0);
+
+	if (!(cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING))
+		return 0;
+
+	/* Is this no larger than the parent high key? */
+	keyp = xfs_btree_high_key_addr(cur, cur->bc_ptrs[1], keyblock);
+
+	XFS_BTKEY_SCRUB_CHECK(bs, 0,
+			cur->bc_ops->diff_two_keys(cur, &key, keyp) >= 0);
+
+	return 0;
+}
+
+/*
+ * Make sure this key is in order and doesn't stray outside of the parent
+ * keys.
+ */
+static int
+xfs_btree_scrub_key(
+	struct xfs_btree_scrub	*bs,
+	int			level)
+{
+	struct xfs_btree_cur	*cur = bs->cur;
+	union xfs_btree_key	*key;
+	union xfs_btree_key	*keyp;
+	struct xfs_btree_block	*block;
+	struct xfs_btree_block	*keyblock;
+
+	block = XFS_BUF_TO_BLOCK(cur->bc_bufs[level]);
+	key = xfs_btree_key_addr(cur, cur->bc_ptrs[level], block);
+
+	/* If this isn't the first key, are they in order? */
+	XFS_BTKEY_SCRUB_CHECK(bs, level, bs->firstkey[level] ||
+			cur->bc_ops->keys_inorder(cur, &bs->lastkey[level],
+					key));
+	bs->firstkey[level] = false;
+	bs->lastkey[level] = *key;
+
+	if (level + 1 >= cur->bc_nlevels)
+		return 0;
+
+	/* Is this at least as large as the parent low key? */
+	keyblock = XFS_BUF_TO_BLOCK(cur->bc_bufs[level + 1]);
+	keyp = xfs_btree_key_addr(cur, cur->bc_ptrs[level + 1], keyblock);
+
+	XFS_BTKEY_SCRUB_CHECK(bs, level,
+			cur->bc_ops->diff_two_keys(cur, keyp, key) >= 0);
+
+	if (!(cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING))
+		return 0;
+
+	/* Is this no larger than the parent high key? */
+	key = xfs_btree_high_key_addr(cur, cur->bc_ptrs[level], block);
+	keyp = xfs_btree_high_key_addr(cur, cur->bc_ptrs[level + 1], keyblock);
+
+	XFS_BTKEY_SCRUB_CHECK(bs, level,
+			cur->bc_ops->diff_two_keys(cur, key, keyp) >= 0);
+
+	return 0;
+}
+
+struct check_owner {
+	struct list_head	list;
+	xfs_agblock_t		bno;
+};
+
+/*
+ * Make sure this btree block isn't in the free list and that there's
+ * an rmap record for it.
+ */
+static int
+xfs_btree_block_check_owner(
+	struct xfs_btree_scrub		*bs,
+	xfs_agblock_t			bno)
+{
+	bool				has_rmap;
+	bool				is_freesp;
+	int				error;
+
+	/* Check that this block isn't free */
+	error = xfs_alloc_record_exists(bs->bno_cur, bno, 1, &is_freesp);
+	if (error)
+		goto err;
+	XFS_BTREC_SCRUB_CHECK(bs, !is_freesp);
+
+	if (!bs->rmap_cur)
+		return 0;
+
+	/* Check that there's an rmap record for this */
+	error = xfs_rmap_record_exists(bs->rmap_cur, bno, 1, &bs->oinfo,
+			&has_rmap);
+	if (error)
+		goto err;
+	XFS_BTREC_SCRUB_CHECK(bs, has_rmap);
+err:
+	return error;
+}
+
+/* Check the owner of a btree block. */
+static int
+xfs_btree_scrub_check_owner(
+	struct xfs_btree_scrub		*bs,
+	struct xfs_buf			*bp)
+{
+	struct xfs_btree_cur		*cur = bs->cur;
+	xfs_agblock_t			bno;
+	xfs_fsblock_t			fsbno;
+	struct check_owner		*co;
+
+	fsbno = XFS_DADDR_TO_FSB(cur->bc_mp, bp->b_bn);
+	bno = XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno);
+
+	/* Do we need to defer this one? */
+	if ((!bs->rmap_cur && xfs_sb_version_hasrmapbt(&cur->bc_mp->m_sb)) ||
+	    !bs->bno_cur) {
+		co = kmem_alloc(sizeof(struct check_owner), KM_SLEEP | KM_NOFS);
+		co->bno = bno;
+		list_add_tail(&co->list, &bs->to_check);
+		return 0;
+	}
+
+	return xfs_btree_block_check_owner(bs, bno);
+}
+
+/*
+ * Visit all nodes and leaves of a btree.  Check that all pointers and
+ * records are in order, that the keys reflect the records, and use a callback
+ * so that the caller can verify individual records.  The callback is the same
+ * as the one for xfs_btree_query_range, so this function also returns
+ * XFS_BTREE_QUERY_RANGE_ABORT, zero, or a negative error code.
+ */
+int
+xfs_btree_scrub(
+	struct xfs_btree_scrub		*bs)
+{
+	struct xfs_btree_cur		*cur = bs->cur;
+	union xfs_btree_ptr		ptr;
+	union xfs_btree_ptr		*pp;
+	union xfs_btree_rec		*recp;
+	struct xfs_btree_block		*block;
+	int				level;
+	struct xfs_buf			*bp;
+	int				i;
+	struct check_owner		*co, *n;
+	int				error;
+
+	/* Finish filling out the scrub state */
+	bs->error = 0;
+	bs->firstrec = true;
+	for (i = 0; i < XFS_BTREE_MAXLEVELS; i++)
+		bs->firstkey[i] = true;
+	bs->bno_cur = bs->rmap_cur = NULL;
+	INIT_LIST_HEAD(&bs->to_check);
+	if (bs->cur->bc_btnum != XFS_BTNUM_BNO)
+		bs->bno_cur = xfs_allocbt_init_cursor(cur->bc_mp, NULL,
+				bs->agf_bp, bs->cur->bc_private.a.agno,
+				XFS_BTNUM_BNO);
+	if (bs->cur->bc_btnum != XFS_BTNUM_RMAP &&
+	    xfs_sb_version_hasrmapbt(&cur->bc_mp->m_sb))
+		bs->rmap_cur = xfs_rmapbt_init_cursor(cur->bc_mp, NULL,
+				bs->agf_bp, bs->cur->bc_private.a.agno);
+
+	/* Load the root of the btree. */
+	level = cur->bc_nlevels - 1;
+	cur->bc_ops->init_ptr_from_cur(cur, &ptr);
+	error = xfs_btree_lookup_get_block(cur, level, &ptr, &block);
+	if (error)
+		goto out;
+
+	xfs_btree_get_block(cur, level, &bp);
+	error = xfs_btree_check_block(cur, block, level, bp);
+	if (error)
+		goto out;
+	error = xfs_btree_scrub_check_owner(bs, bp);
+	if (error)
+		goto out;
+
+	cur->bc_ptrs[level] = 1;
+
+	while (level < cur->bc_nlevels) {
+		block = XFS_BUF_TO_BLOCK(cur->bc_bufs[level]);
+
+		if (level == 0) {
+			/* End of leaf, pop back towards the root. */
+			if (cur->bc_ptrs[level] >
+			    be16_to_cpu(block->bb_numrecs)) {
+				if (level < cur->bc_nlevels - 1)
+					cur->bc_ptrs[level + 1]++;
+				level++;
+				continue;
+			}
+
+			/* Records in order for scrub? */
+			error = xfs_btree_scrub_rec(bs);
+			if (error)
+				goto out;
+
+			recp = xfs_btree_rec_addr(cur, cur->bc_ptrs[0], block);
+			error = bs->scrub_rec(bs, recp);
+			if (error < 0 ||
+			    error == XFS_BTREE_QUERY_RANGE_ABORT)
+				break;
+
+			cur->bc_ptrs[level]++;
+			continue;
+		}
+
+		/* End of node, pop back towards the root. */
+		if (cur->bc_ptrs[level] > be16_to_cpu(block->bb_numrecs)) {
+			if (level < cur->bc_nlevels - 1)
+				cur->bc_ptrs[level + 1]++;
+			level++;
+			continue;
+		}
+
+		/* Keys in order for scrub? */
+		error = xfs_btree_scrub_key(bs, level);
+		if (error)
+			goto out;
+
+		/* Drill another level deeper. */
+		pp = xfs_btree_ptr_addr(cur, cur->bc_ptrs[level], block);
+		level--;
+		error = xfs_btree_lookup_get_block(cur, level, pp,
+				&block);
+		if (error)
+			goto out;
+
+		xfs_btree_get_block(cur, level, &bp);
+		error = xfs_btree_check_block(cur, block, level, bp);
+		if (error)
+			goto out;
+
+		error = xfs_btree_scrub_check_owner(bs, bp);
+		if (error)
+			goto out;
+
+		cur->bc_ptrs[level] = 1;
+	}
+
+out:
+	/*
+	 * If we don't end this function with the cursor pointing at a record
+	 * block, a subsequent non-error cursor deletion will not release
+	 * node-level buffers, causing a buffer leak.  This is quite possible
+	 * with a zero-results range query, so release the buffers if we
+	 * failed to return any results.
+	 */
+	if (cur->bc_bufs[0] == NULL) {
+		for (i = 0; i < cur->bc_nlevels; i++) {
+			if (cur->bc_bufs[i]) {
+				xfs_trans_brelse(cur->bc_tp, cur->bc_bufs[i]);
+				cur->bc_bufs[i] = NULL;
+				cur->bc_ptrs[i] = 0;
+				cur->bc_ra[i] = 0;
+			}
+		}
+	}
+
+	/* Check the deferred stuff */
+	if (!error) {
+		if (bs->cur->bc_btnum == XFS_BTNUM_BNO)
+			bs->bno_cur = bs->cur;
+		else if (bs->cur->bc_btnum == XFS_BTNUM_RMAP)
+			bs->rmap_cur = bs->cur;
+		list_for_each_entry(co, &bs->to_check, list) {
+			error = xfs_btree_block_check_owner(bs, co->bno);
+			if (error)
+				break;
+		}
+	}
+	list_for_each_entry_safe(co, n, &bs->to_check, list) {
+		list_del(&co->list);
+		kmem_free(co);
+	}
+
+	if (bs->bno_cur && bs->bno_cur != bs->cur)
+		xfs_btree_del_cursor(bs->bno_cur, XFS_BTREE_ERROR);
+	if (bs->rmap_cur && bs->rmap_cur != bs->cur)
+		xfs_btree_del_cursor(bs->rmap_cur, XFS_BTREE_ERROR);
+
+	if (error || bs->error)
+		xfs_alert(cur->bc_mp,
+			"Corruption detected. Unmount and run xfs_repair.");
+
+	return error;
+}
diff --git a/fs/xfs/libxfs/xfs_scrub.h b/fs/xfs/libxfs/xfs_scrub.h
new file mode 100644
index 0000000..af80a9d
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_scrub.h
@@ -0,0 +1,76 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#ifndef __XFS_SCRUB_H__
+#define	__XFS_SCRUB_H__
+
+/* btree scrub */
+struct xfs_btree_scrub;
+
+typedef int (*xfs_btree_scrub_rec_fn)(
+	struct xfs_btree_scrub	*bs,
+	union xfs_btree_rec	*rec);
+
+struct xfs_btree_scrub {
+	/* caller-provided scrub state */
+	struct xfs_btree_cur		*cur;
+	xfs_btree_scrub_rec_fn		scrub_rec;
+	struct xfs_buf			*agi_bp;
+	struct xfs_buf			*agf_bp;
+	struct xfs_buf			*agfl_bp;
+	struct xfs_owner_info		oinfo;
+
+	/* internal scrub state */
+	union xfs_btree_rec		lastrec;
+	bool				firstrec;
+	union xfs_btree_key		lastkey[XFS_BTREE_MAXLEVELS];
+	bool				firstkey[XFS_BTREE_MAXLEVELS];
+	struct xfs_btree_cur		*rmap_cur;
+	struct xfs_btree_cur		*bno_cur;
+	struct list_head		to_check;
+	int				error;
+};
+
+int xfs_btree_scrub(struct xfs_btree_scrub *bs);
+void xfs_btree_scrub_error(struct xfs_btree_cur *cur, int level,
+		const char *file, int line, const char *check);
+#define XFS_BTREC_SCRUB_CHECK(bs, fs_ok) \
+	if (!(fs_ok)) { \
+		xfs_btree_scrub_error((bs)->cur, 0, __FILE__, __LINE__, #fs_ok); \
+		(bs)->error = -EFSCORRUPTED; \
+	}
+#define XFS_BTREC_SCRUB_GOTO(bs, fs_ok, label) \
+	if (!(fs_ok)) { \
+		xfs_btree_scrub_error((bs)->cur, 0, __FILE__, __LINE__, #fs_ok); \
+		(bs)->error = -EFSCORRUPTED; \
+		goto label; \
+	}
+#define XFS_BTKEY_SCRUB_CHECK(bs, level, fs_ok) \
+	if (!(fs_ok)) { \
+		xfs_btree_scrub_error((bs)->cur, (level), __FILE__, __LINE__, #fs_ok); \
+		(bs)->error = -EFSCORRUPTED; \
+	}
+#define XFS_BTKEY_SCRUB_GOTO(bs, level, fs_ok, label) \
+	if (!(fs_ok)) { \
+		xfs_btree_scrub_error((bs)->cur, (level), __FILE__, __LINE__, #fs_ok); \
+		(bs)->error = -EFSCORRUPTED; \
+		goto label; \
+	}
+
+#endif	/* __XFS_SCRUB_H__ */


^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 114/119] xfs: create sysfs hooks to scrub various files
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (112 preceding siblings ...)
  2016-06-17  1:30 ` [PATCH 113/119] xfs: scrub btree records and pointers while querying Darrick J. Wong
@ 2016-06-17  1:30 ` Darrick J. Wong
  2016-06-17  1:30 ` [PATCH 115/119] xfs: support scrubbing free space btrees Darrick J. Wong
                   ` (4 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:30 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Create some sysfs files so that we can scrub various AG metadata
structures.  The interface will be as follows:

# cat /sys/fs/xfs/$dev/check/rmapbt
0:3
# echo 3 > /sys/fs/xfs/$dev/check/rmapbt
-bash: echo: write error: <some error code>

(or the write simply succeeds if the metadata checks out)

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile          |    1 
 fs/xfs/xfs_mount.c       |    9 ++
 fs/xfs/xfs_mount.h       |    1 
 fs/xfs/xfs_scrub_sysfs.c |  214 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_scrub_sysfs.h |   26 ++++++
 5 files changed, 250 insertions(+), 1 deletion(-)
 create mode 100644 fs/xfs/xfs_scrub_sysfs.c
 create mode 100644 fs/xfs/xfs_scrub_sysfs.h


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 8942390..7d93af2 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -93,6 +93,7 @@ xfs-y				+= xfs_aops.o \
 				   xfs_mount.o \
 				   xfs_mru_cache.o \
 				   xfs_reflink.o \
+				   xfs_scrub_sysfs.o \
 				   xfs_stats.o \
 				   xfs_super.o \
 				   xfs_symlink.o \
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index e53853d..1f74f72 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -46,6 +46,7 @@
 #include "xfs_refcount_btree.h"
 #include "xfs_reflink.h"
 #include "xfs_refcount_btree.h"
+#include "xfs_scrub_sysfs.h"
 
 
 static DEFINE_MUTEX(xfs_uuid_table_mutex);
@@ -705,10 +706,13 @@ xfs_mountfs(
 	if (error)
 		goto out_del_stats;
 
+	error = xfs_scrub_init(mp);
+	if (error)
+		goto out_remove_error_sysfs;
 
 	error = xfs_uuid_mount(mp);
 	if (error)
-		goto out_remove_error_sysfs;
+		goto out_remove_scrub;
 
 	/*
 	 * Set the minimum read and write sizes
@@ -993,6 +997,8 @@ xfs_mountfs(
 	xfs_da_unmount(mp);
  out_remove_uuid:
 	xfs_uuid_unmount(mp);
+ out_remove_scrub:
+	xfs_scrub_free(mp);
  out_remove_error_sysfs:
 	xfs_error_sysfs_del(mp);
  out_del_stats:
@@ -1093,6 +1099,7 @@ xfs_unmountfs(
 #endif
 	xfs_free_perag(mp);
 
+	xfs_scrub_free(mp);
 	xfs_error_sysfs_del(mp);
 	xfs_sysfs_del(&mp->m_stats.xs_kobj);
 	xfs_sysfs_del(&mp->m_kobj);
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 6b06d24..0e222d2 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -167,6 +167,7 @@ typedef struct xfs_mount {
 	struct xfs_kobj		m_error_kobj;
 	struct xfs_kobj		m_error_meta_kobj;
 	struct xfs_error_cfg	m_error_cfg[XFS_ERR_CLASS_MAX][XFS_ERR_ERRNO_MAX];
+	struct xfs_kobj		m_scrub_kobj;
 	struct xstats		m_stats;	/* per-fs stats */
 
 	struct workqueue_struct *m_buf_workqueue;
diff --git a/fs/xfs/xfs_scrub_sysfs.c b/fs/xfs/xfs_scrub_sysfs.c
new file mode 100644
index 0000000..9942d55
--- /dev/null
+++ b/fs/xfs/xfs_scrub_sysfs.c
@@ -0,0 +1,214 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bmap.h"
+#include "xfs_refcount.h"
+#include "xfs_rmap_btree.h"
+#include "xfs_alloc.h"
+#include "xfs_ialloc.h"
+#include "xfs_sysfs.h"
+#include <linux/kernel.h>
+
+/* general scrub attributes */
+struct xfs_scrub_attr {
+	struct attribute attr;
+	bool (*is_visible)(struct xfs_mount *mp, struct xfs_scrub_attr *attr);
+	ssize_t (*show)(struct xfs_mount *mp, struct xfs_scrub_attr *attr,
+			char *buf);
+	ssize_t (*store)(struct xfs_mount *mp, struct xfs_scrub_attr *attr,
+			const char *buf, size_t count);
+};
+
+static inline struct xfs_scrub_attr *
+to_scrub_attr(struct attribute *attr)
+{
+	return container_of(attr, struct xfs_scrub_attr, attr);
+}
+
+static inline struct xfs_mount *to_mount(struct kobject *kobj)
+{
+	struct xfs_kobj *k = container_of(kobj, struct xfs_kobj, kobject);
+
+	return container_of(k, struct xfs_mount, m_scrub_kobj);
+}
+
+STATIC ssize_t
+xfs_scrub_attr_show(
+	struct kobject		*kobject,
+	struct attribute	*attr,
+	char			*buf)
+{
+	struct xfs_scrub_attr	*sa = to_scrub_attr(attr);
+	struct xfs_mount	*mp = to_mount(kobject);
+
+	return sa->show ? sa->show(mp, sa, buf) : 0;
+}
+
+STATIC ssize_t
+xfs_scrub_attr_store(
+	struct kobject		*kobject,
+	struct attribute	*attr,
+	const char		*buf,
+	size_t			count)
+{
+	struct xfs_scrub_attr	*sa = to_scrub_attr(attr);
+	struct xfs_mount	*mp = to_mount(kobject);
+
+	return sa->store ? sa->store(mp, sa, buf, count) : 0;
+}
+
+STATIC umode_t
+xfs_scrub_attr_visible(
+	struct kobject		*kobject,
+	struct attribute	*attr,
+	int			unused)
+{
+	struct xfs_scrub_attr	*sa = to_scrub_attr(attr);
+	struct xfs_mount	*mp = to_mount(kobject);
+
+	if (!sa->is_visible || sa->is_visible(mp, sa))
+		return attr->mode;
+	return 0;
+}
+
+static const struct sysfs_ops xfs_scrub_ops = {
+	.show = xfs_scrub_attr_show,
+	.store = xfs_scrub_attr_store,
+};
+
+static struct kobj_type xfs_scrub_ktype = {
+	.release = xfs_sysfs_release,
+	.sysfs_ops = &xfs_scrub_ops,
+};
+
+/* per-AG scrub attributes */
+struct xfs_agdata_scrub_attr {
+	struct xfs_scrub_attr sa;
+	bool (*has_feature)(struct xfs_sb *);
+	int (*scrub)(struct xfs_mount *mp, xfs_agnumber_t agno);
+};
+
+static inline struct xfs_agdata_scrub_attr *
+to_agdata_scrub_attr(struct xfs_scrub_attr *sa)
+{
+	return container_of(sa, struct xfs_agdata_scrub_attr, sa);
+}
+
+STATIC bool
+xfs_agdata_scrub_visible(
+	struct xfs_mount		*mp,
+	struct xfs_scrub_attr		*sa)
+{
+	struct xfs_agdata_scrub_attr	*asa = to_agdata_scrub_attr(sa);
+
+	return (!asa->has_feature || asa->has_feature(&mp->m_sb));
+}
+
+STATIC ssize_t
+xfs_agdata_scrub_show(
+	struct xfs_mount		*mp,
+	struct xfs_scrub_attr		*sa,
+	char				*buf)
+{
+	return snprintf(buf, PAGE_SIZE, "0:%u\n", mp->m_sb.sb_agcount - 1);
+}
+
+STATIC ssize_t
+xfs_agdata_scrub_store(
+	struct xfs_mount		*mp,
+	struct xfs_scrub_attr		*sa,
+	const char			*buf,
+	size_t				count)
+{
+	unsigned long			val;
+	xfs_agnumber_t			agno;
+	struct xfs_agdata_scrub_attr	*asa = to_agdata_scrub_attr(sa);
+	int				error;
+
+	error = kstrtoul(buf, 0, &val);
+	if (error)
+		return error;
+	agno = val;
+	if (agno >= mp->m_sb.sb_agcount)
+		return -EINVAL;
+	error = asa->scrub(mp, agno);
+	if (error)
+		return error;
+	return count;
+}
+
+#define XFS_AGDATA_SCRUB_ATTR(_name, _fn)	     \
+static struct xfs_agdata_scrub_attr xfs_agdata_scrub_attr_##_name = {	     \
+	.sa = {								     \
+		.attr = {.name = __stringify(_name), .mode = 0600 },	     \
+		.is_visible = xfs_agdata_scrub_visible,			     \
+		.show = xfs_agdata_scrub_show,				     \
+		.store = xfs_agdata_scrub_store,			     \
+	},								     \
+	.has_feature = _fn,						     \
+	.scrub = xfs_##_name##_scrub,				     \
+}
+#define XFS_AGDATA_SCRUB_LIST(name)	&xfs_agdata_scrub_attr_##name.sa.attr
+
+static struct attribute *xfs_agdata_scrub_attrs[] = {
+	NULL,
+};
+
+static const struct attribute_group xfs_agdata_scrub_attr_group = {
+	.is_visible = xfs_scrub_attr_visible,
+	.attrs = xfs_agdata_scrub_attrs,
+};
+
+int
+xfs_scrub_init(
+	struct xfs_mount	*mp)
+{
+	int			error;
+
+	error = xfs_sysfs_init(&mp->m_scrub_kobj, &xfs_scrub_ktype,
+			&mp->m_kobj, "check");
+	if (error)
+		return error;
+
+	error = sysfs_create_group(&mp->m_scrub_kobj.kobject,
+			&xfs_agdata_scrub_attr_group);
+	if (error)
+		goto err;
+	return error;
+err:
+	xfs_sysfs_del(&mp->m_scrub_kobj);
+	return error;
+}
+
+void
+xfs_scrub_free(
+	struct xfs_mount	*mp)
+{
+	sysfs_remove_group(&mp->m_scrub_kobj.kobject,
+			&xfs_agdata_scrub_attr_group);
+	xfs_sysfs_del(&mp->m_scrub_kobj);
+}
diff --git a/fs/xfs/xfs_scrub_sysfs.h b/fs/xfs/xfs_scrub_sysfs.h
new file mode 100644
index 0000000..d9a58f5
--- /dev/null
+++ b/fs/xfs/xfs_scrub_sysfs.h
@@ -0,0 +1,26 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#ifndef __XFS_SCRUB_SYSFS_H__
+#define __XFS_SCRUB_SYSFS_H__
+
+int xfs_scrub_init(struct xfs_mount *mp);
+void xfs_scrub_free(struct xfs_mount *mp);
+
+#endif /* __XFS_SCRUB_SYSFS_H__ */



* [PATCH 115/119] xfs: support scrubbing free space btrees
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (113 preceding siblings ...)
  2016-06-17  1:30 ` [PATCH 114/119] xfs: create sysfs hooks to scrub various files Darrick J. Wong
@ 2016-06-17  1:30 ` Darrick J. Wong
  2016-06-17  1:30 ` [PATCH 116/119] xfs: support scrubbing inode btrees Darrick J. Wong
                   ` (3 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:30 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Plumb in the pieces necessary to check the free space btrees.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_alloc.c       |   98 +++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_alloc.h       |    3 +
 fs/xfs/libxfs/xfs_alloc_btree.c |   51 ++++++++++++++++++--
 fs/xfs/xfs_scrub_sysfs.c        |    5 ++
 4 files changed, 151 insertions(+), 6 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 6fc1981..bc2a1b1 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -39,6 +39,7 @@
 #include "xfs_log.h"
 #include "xfs_ag_resv.h"
 #include "xfs_refcount_btree.h"
+#include "xfs_scrub.h"
 
 struct workqueue_struct *xfs_alloc_wq;
 
@@ -2957,3 +2958,100 @@ xfs_alloc_record_exists(
 	*is_freesp = (fbno <= bno && fbno + flen >= bno + len);
 	return 0;
 }
+
+STATIC int
+xfs_allocbt_scrub_rmap_check(
+	struct xfs_btree_cur		*cur,
+	struct xfs_rmap_irec		*rec,
+	void				*priv)
+{
+	xfs_err(cur->bc_mp, "%s: freespace in rmapbt! %u/%u %u %lld %lld %x",
+			__func__, cur->bc_private.a.agno, rec->rm_startblock,
+			rec->rm_blockcount, rec->rm_owner, rec->rm_offset,
+			rec->rm_flags);
+	return XFS_BTREE_QUERY_RANGE_ABORT;
+}
+
+STATIC int
+xfs_allocbt_scrub_helper(
+	struct xfs_btree_scrub		*bs,
+	union xfs_btree_rec		*rec)
+{
+	struct xfs_mount		*mp = bs->cur->bc_mp;
+	xfs_agblock_t			bno;
+	xfs_extlen_t			len;
+	struct xfs_rmap_irec		low;
+	struct xfs_rmap_irec		high;
+	bool				no_rmap;
+	int				error;
+
+	bno = be32_to_cpu(rec->alloc.ar_startblock);
+	len = be32_to_cpu(rec->alloc.ar_blockcount);
+
+	XFS_BTREC_SCRUB_CHECK(bs, bno < mp->m_sb.sb_agblocks);
+	XFS_BTREC_SCRUB_CHECK(bs, bno < bno + len);
+	XFS_BTREC_SCRUB_CHECK(bs, (unsigned long long)bno + len <=
+			mp->m_sb.sb_agblocks);
+
+	/* if rmapbt, make sure there's no record */
+	if (!bs->rmap_cur)
+		return 0;
+
+	memset(&low, 0, sizeof(low));
+	low.rm_startblock = bno;
+	memset(&high, 0xFF, sizeof(high));
+	high.rm_startblock = bno + len - 1;
+
+	error = xfs_rmapbt_query_range(bs->rmap_cur, &low, &high,
+			&xfs_allocbt_scrub_rmap_check, NULL);
+	if (error && error != XFS_BTREE_QUERY_RANGE_ABORT)
+		goto err;
+	no_rmap = error == 0;
+	error = 0;
+	XFS_BTREC_SCRUB_CHECK(bs, no_rmap);
+err:
+	return error;
+}
+
+/* Scrub the freespace btrees for some AG. */
+STATIC int
+xfs_allocbt_scrub(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno,
+	int			which)
+{
+	struct xfs_btree_scrub	bs;
+	int			error;
+
+	error = xfs_alloc_read_agf(mp, NULL, agno, 0, &bs.agf_bp);
+	if (error)
+		return error;
+
+	bs.cur = xfs_allocbt_init_cursor(mp, NULL, bs.agf_bp, agno, which);
+	bs.scrub_rec = xfs_allocbt_scrub_helper;
+	xfs_rmap_ag_owner(&bs.oinfo, XFS_RMAP_OWN_AG);
+	error = xfs_btree_scrub(&bs);
+	xfs_btree_del_cursor(bs.cur,
+			error ? XFS_BTREE_ERROR : XFS_BTREE_NOERROR);
+	xfs_trans_brelse(NULL, bs.agf_bp);
+
+	if (!error && bs.error)
+		error = bs.error;
+
+	return error;
+}
+
+int
+xfs_bnobt_scrub(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno)
+{
+	return xfs_allocbt_scrub(mp, agno, XFS_BTNUM_BNO);
+}
+
+int
+xfs_cntbt_scrub(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno)
+{
+	return xfs_allocbt_scrub(mp, agno, XFS_BTNUM_CNT);
+}
diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index 4f2ce38..f1fcc7e 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -213,4 +213,7 @@ xfs_extlen_t xfs_prealloc_blocks(struct xfs_mount *mp);
 int xfs_alloc_record_exists(struct xfs_btree_cur *cur, xfs_agblock_t bno,
 		xfs_extlen_t len, bool *is_freesp);
 
+int xfs_bnobt_scrub(struct xfs_mount *mp, xfs_agnumber_t agno);
+int xfs_cntbt_scrub(struct xfs_mount *mp, xfs_agnumber_t agno);
+
 #endif	/* __XFS_ALLOC_H__ */
diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c
index 5ba2dac..f9859e8 100644
--- a/fs/xfs/libxfs/xfs_alloc_btree.c
+++ b/fs/xfs/libxfs/xfs_alloc_btree.c
@@ -256,6 +256,26 @@ xfs_allocbt_key_diff(
 	return (__int64_t)be32_to_cpu(kp->ar_startblock) - rec->ar_startblock;
 }
 
+STATIC __int64_t
+xfs_bnobt_diff_two_keys(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_key	*k1,
+	union xfs_btree_key	*k2)
+{
+	return (__int64_t)be32_to_cpu(k2->alloc.ar_startblock) -
+			  be32_to_cpu(k1->alloc.ar_startblock);
+}
+
+STATIC __int64_t
+xfs_cntbt_diff_two_keys(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_key	*k1,
+	union xfs_btree_key	*k2)
+{
+	return (__int64_t)be32_to_cpu(k2->alloc.ar_blockcount) -
+			  be32_to_cpu(k1->alloc.ar_blockcount);
+}
+
 static bool
 xfs_allocbt_verify(
 	struct xfs_buf		*bp)
@@ -344,7 +364,6 @@ const struct xfs_buf_ops xfs_allocbt_buf_ops = {
 };
 
 
-#if defined(DEBUG) || defined(XFS_WARN)
 STATIC int
 xfs_allocbt_keys_inorder(
 	struct xfs_btree_cur	*cur,
@@ -381,9 +400,29 @@ xfs_allocbt_recs_inorder(
 			 be32_to_cpu(r2->alloc.ar_startblock));
 	}
 }
-#endif	/* DEBUG */
 
-static const struct xfs_btree_ops xfs_allocbt_ops = {
+static const struct xfs_btree_ops xfs_bnobt_ops = {
+	.rec_len		= sizeof(xfs_alloc_rec_t),
+	.key_len		= sizeof(xfs_alloc_key_t),
+
+	.dup_cursor		= xfs_allocbt_dup_cursor,
+	.set_root		= xfs_allocbt_set_root,
+	.alloc_block		= xfs_allocbt_alloc_block,
+	.free_block		= xfs_allocbt_free_block,
+	.update_lastrec		= xfs_allocbt_update_lastrec,
+	.get_minrecs		= xfs_allocbt_get_minrecs,
+	.get_maxrecs		= xfs_allocbt_get_maxrecs,
+	.init_key_from_rec	= xfs_allocbt_init_key_from_rec,
+	.init_rec_from_cur	= xfs_allocbt_init_rec_from_cur,
+	.init_ptr_from_cur	= xfs_allocbt_init_ptr_from_cur,
+	.key_diff		= xfs_allocbt_key_diff,
+	.buf_ops		= &xfs_allocbt_buf_ops,
+	.diff_two_keys		= xfs_bnobt_diff_two_keys,
+	.keys_inorder		= xfs_allocbt_keys_inorder,
+	.recs_inorder		= xfs_allocbt_recs_inorder,
+};
+
+static const struct xfs_btree_ops xfs_cntbt_ops = {
 	.rec_len		= sizeof(xfs_alloc_rec_t),
 	.key_len		= sizeof(xfs_alloc_key_t),
 
@@ -399,10 +438,9 @@ static const struct xfs_btree_ops xfs_allocbt_ops = {
 	.init_ptr_from_cur	= xfs_allocbt_init_ptr_from_cur,
 	.key_diff		= xfs_allocbt_key_diff,
 	.buf_ops		= &xfs_allocbt_buf_ops,
-#if defined(DEBUG) || defined(XFS_WARN)
+	.diff_two_keys		= xfs_cntbt_diff_two_keys,
 	.keys_inorder		= xfs_allocbt_keys_inorder,
 	.recs_inorder		= xfs_allocbt_recs_inorder,
-#endif
 };
 
 /*
@@ -427,12 +465,13 @@ xfs_allocbt_init_cursor(
 	cur->bc_mp = mp;
 	cur->bc_btnum = btnum;
 	cur->bc_blocklog = mp->m_sb.sb_blocklog;
-	cur->bc_ops = &xfs_allocbt_ops;
 
 	if (btnum == XFS_BTNUM_CNT) {
+		cur->bc_ops = &xfs_cntbt_ops;
 		cur->bc_nlevels = be32_to_cpu(agf->agf_levels[XFS_BTNUM_CNT]);
 		cur->bc_flags = XFS_BTREE_LASTREC_UPDATE;
 	} else {
+		cur->bc_ops = &xfs_bnobt_ops;
 		cur->bc_nlevels = be32_to_cpu(agf->agf_levels[XFS_BTNUM_BNO]);
 	}
 
diff --git a/fs/xfs/xfs_scrub_sysfs.c b/fs/xfs/xfs_scrub_sysfs.c
index 9942d55..efaa635 100644
--- a/fs/xfs/xfs_scrub_sysfs.c
+++ b/fs/xfs/xfs_scrub_sysfs.c
@@ -174,7 +174,12 @@ static struct xfs_agdata_scrub_attr xfs_agdata_scrub_attr_##_name = {	     \
 }
 #define XFS_AGDATA_SCRUB_LIST(name)	&xfs_agdata_scrub_attr_##name.sa.attr
 
+XFS_AGDATA_SCRUB_ATTR(bnobt, NULL);
+XFS_AGDATA_SCRUB_ATTR(cntbt, NULL);
+
 static struct attribute *xfs_agdata_scrub_attrs[] = {
+	XFS_AGDATA_SCRUB_LIST(bnobt),
+	XFS_AGDATA_SCRUB_LIST(cntbt),
 	NULL,
 };
 



* [PATCH 116/119] xfs: support scrubbing inode btrees
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (114 preceding siblings ...)
  2016-06-17  1:30 ` [PATCH 115/119] xfs: support scrubbing free space btrees Darrick J. Wong
@ 2016-06-17  1:30 ` Darrick J. Wong
  2016-06-17  1:30 ` [PATCH 117/119] xfs: support scrubbing rmap btree Darrick J. Wong
                   ` (2 subsequent siblings)
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:30 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Plumb in the pieces necessary to check the inode btrees.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_ialloc.c       |  178 +++++++++++++++++++++++++++++++++++---
 fs/xfs/libxfs/xfs_ialloc.h       |    2 
 fs/xfs/libxfs/xfs_ialloc_btree.c |   18 +++-
 fs/xfs/xfs_scrub_sysfs.c         |    4 +
 4 files changed, 180 insertions(+), 22 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 1982561..c496dd4 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -40,6 +40,8 @@
 #include "xfs_icache.h"
 #include "xfs_trace.h"
 #include "xfs_log.h"
+#include "xfs_rmap_btree.h"
+#include "xfs_scrub.h"
 
 
 /*
@@ -98,24 +100,14 @@ xfs_inobt_update(
 	return xfs_btree_update(cur, &rec);
 }
 
-/*
- * Get the data from the pointed-to record.
- */
-int					/* error */
-xfs_inobt_get_rec(
-	struct xfs_btree_cur	*cur,	/* btree cursor */
-	xfs_inobt_rec_incore_t	*irec,	/* btree record */
-	int			*stat)	/* output: success/failure */
+STATIC void
+xfs_inobt_btrec_to_irec(
+	struct xfs_mount		*mp,
+	union xfs_btree_rec		*rec,
+	struct xfs_inobt_rec_incore	*irec)
 {
-	union xfs_btree_rec	*rec;
-	int			error;
-
-	error = xfs_btree_get_rec(cur, &rec, stat);
-	if (error || *stat == 0)
-		return error;
-
 	irec->ir_startino = be32_to_cpu(rec->inobt.ir_startino);
-	if (xfs_sb_version_hassparseinodes(&cur->bc_mp->m_sb)) {
+	if (xfs_sb_version_hassparseinodes(&mp->m_sb)) {
 		irec->ir_holemask = be16_to_cpu(rec->inobt.ir_u.sp.ir_holemask);
 		irec->ir_count = rec->inobt.ir_u.sp.ir_count;
 		irec->ir_freecount = rec->inobt.ir_u.sp.ir_freecount;
@@ -130,6 +122,25 @@ xfs_inobt_get_rec(
 				be32_to_cpu(rec->inobt.ir_u.f.ir_freecount);
 	}
 	irec->ir_free = be64_to_cpu(rec->inobt.ir_free);
+}
+
+/*
+ * Get the data from the pointed-to record.
+ */
+int					/* error */
+xfs_inobt_get_rec(
+	struct xfs_btree_cur	*cur,	/* btree cursor */
+	xfs_inobt_rec_incore_t	*irec,	/* btree record */
+	int			*stat)	/* output: success/failure */
+{
+	union xfs_btree_rec	*rec;
+	int			error;
+
+	error = xfs_btree_get_rec(cur, &rec, stat);
+	if (error || *stat == 0)
+		return error;
+
+	xfs_inobt_btrec_to_irec(cur->bc_mp, rec, irec);
 
 	return 0;
 }
@@ -2650,3 +2661,138 @@ xfs_ialloc_pagi_init(
 		xfs_trans_brelse(tp, bp);
 	return 0;
 }
+
+STATIC int
+xfs_iallocbt_scrub_helper(
+	struct xfs_btree_scrub		*bs,
+	union xfs_btree_rec		*rec)
+{
+	struct xfs_mount		*mp = bs->cur->bc_mp;
+	struct xfs_inobt_rec_incore	irec;
+	__uint16_t			holemask;
+	xfs_agino_t			agino;
+	xfs_agblock_t			bno;
+	xfs_extlen_t			len;
+	int				holecount;
+	int				i;
+	bool				has_rmap = false;
+	struct xfs_owner_info		oinfo;
+	int				error = 0;
+	uint64_t			holes;
+
+	xfs_inobt_btrec_to_irec(mp, rec, &irec);
+
+	XFS_BTREC_SCRUB_CHECK(bs, irec.ir_count <= XFS_INODES_PER_CHUNK);
+	XFS_BTREC_SCRUB_CHECK(bs, irec.ir_freecount <= XFS_INODES_PER_CHUNK);
+	agino = irec.ir_startino;
+	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_INODES);
+
+	/* Handle non-sparse inodes */
+	if (!xfs_inobt_issparse(irec.ir_holemask)) {
+		len = XFS_B_TO_FSB(mp,
+				XFS_INODES_PER_CHUNK * mp->m_sb.sb_inodesize);
+		bno = XFS_AGINO_TO_AGBNO(mp, agino);
+
+		XFS_BTREC_SCRUB_CHECK(bs, bno < mp->m_sb.sb_agblocks);
+		XFS_BTREC_SCRUB_CHECK(bs, bno < bno + len);
+		XFS_BTREC_SCRUB_CHECK(bs, (unsigned long long)bno + len <=
+				mp->m_sb.sb_agblocks);
+
+		if (!bs->rmap_cur)
+			return error;
+		error = xfs_rmap_record_exists(bs->rmap_cur, bno, len, &oinfo,
+				&has_rmap);
+		if (error)
+			return error;
+		XFS_BTREC_SCRUB_CHECK(bs, has_rmap);
+		return 0;
+	}
+
+	/* Check each chunk of a sparse inode cluster. */
+	holemask = irec.ir_holemask;
+	holecount = 0;
+	len = XFS_B_TO_FSB(mp,
+			XFS_INODES_PER_HOLEMASK_BIT * mp->m_sb.sb_inodesize);
+	holes = ~xfs_inobt_irec_to_allocmask(&irec);
+	XFS_BTREC_SCRUB_CHECK(bs, (holes & irec.ir_free) == holes);
+	XFS_BTREC_SCRUB_CHECK(bs, irec.ir_freecount <= irec.ir_count);
+
+	for (i = 0; i < XFS_INOBT_HOLEMASK_BITS; holemask >>= 1,
+			i++, agino += XFS_INODES_PER_HOLEMASK_BIT) {
+		if (holemask & 1) {
+			holecount += XFS_INODES_PER_HOLEMASK_BIT;
+			continue;
+		}
+		bno = XFS_AGINO_TO_AGBNO(mp, agino);
+
+		XFS_BTREC_SCRUB_CHECK(bs, bno < mp->m_sb.sb_agblocks);
+		XFS_BTREC_SCRUB_CHECK(bs, bno < bno + len);
+		XFS_BTREC_SCRUB_CHECK(bs, (unsigned long long)bno + len <=
+				mp->m_sb.sb_agblocks);
+
+		if (!bs->rmap_cur)
+			continue;
+		error = xfs_rmap_record_exists(bs->rmap_cur, bno, len, &oinfo,
+				&has_rmap);
+		if (error)
+			break;
+		XFS_BTREC_SCRUB_CHECK(bs, has_rmap);
+	}
+
+	XFS_BTREC_SCRUB_CHECK(bs, holecount <= XFS_INODES_PER_CHUNK);
+	XFS_BTREC_SCRUB_CHECK(bs, holecount + irec.ir_count ==
+			XFS_INODES_PER_CHUNK);
+
+	return error;
+}
+
+/* Scrub the inode btrees for some AG. */
+STATIC int
+xfs_iallocbt_scrub(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno,
+	xfs_btnum_t		which)
+{
+	struct xfs_btree_scrub	bs;
+	int			error;
+
+	error = xfs_ialloc_read_agi(mp, NULL, agno, &bs.agi_bp);
+	if (error)
+		return error;
+
+	error = xfs_alloc_read_agf(mp, NULL, agno, 0, &bs.agf_bp);
+	if (error) {
+		xfs_trans_brelse(NULL, bs.agi_bp);
+		return error;
+	}
+
+	bs.cur = xfs_inobt_init_cursor(mp, NULL, bs.agi_bp, agno, which);
+	bs.scrub_rec = xfs_iallocbt_scrub_helper;
+	xfs_rmap_ag_owner(&bs.oinfo, XFS_RMAP_OWN_INOBT);
+	error = xfs_btree_scrub(&bs);
+	xfs_btree_del_cursor(bs.cur,
+			error ? XFS_BTREE_ERROR : XFS_BTREE_NOERROR);
+	xfs_trans_brelse(NULL, bs.agf_bp);
+	xfs_trans_brelse(NULL, bs.agi_bp);
+
+	if (!error && bs.error)
+		error = bs.error;
+
+	return error;
+}
+
+int
+xfs_inobt_scrub(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno)
+{
+	return xfs_iallocbt_scrub(mp, agno, XFS_BTNUM_INO);
+}
+
+int
+xfs_finobt_scrub(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno)
+{
+	return xfs_iallocbt_scrub(mp, agno, XFS_BTNUM_FINO);
+}
diff --git a/fs/xfs/libxfs/xfs_ialloc.h b/fs/xfs/libxfs/xfs_ialloc.h
index 0bb8966..7ea6ff3 100644
--- a/fs/xfs/libxfs/xfs_ialloc.h
+++ b/fs/xfs/libxfs/xfs_ialloc.h
@@ -168,5 +168,7 @@ int xfs_ialloc_inode_init(struct xfs_mount *mp, struct xfs_trans *tp,
 int xfs_read_agi(struct xfs_mount *mp, struct xfs_trans *tp,
 		xfs_agnumber_t agno, struct xfs_buf **bpp);
 
+extern int xfs_inobt_scrub(struct xfs_mount *mp, xfs_agnumber_t agno);
+extern int xfs_finobt_scrub(struct xfs_mount *mp, xfs_agnumber_t agno);
 
 #endif	/* __XFS_IALLOC_H__ */
diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
index fd26550..81c673c 100644
--- a/fs/xfs/libxfs/xfs_ialloc_btree.c
+++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
@@ -204,6 +204,16 @@ xfs_inobt_key_diff(
 			  cur->bc_rec.i.ir_startino;
 }
 
+STATIC __int64_t
+xfs_inobt_diff_two_keys(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_key	*k1,
+	union xfs_btree_key	*k2)
+{
+	return (__int64_t)be32_to_cpu(k2->inobt.ir_startino) -
+			  be32_to_cpu(k1->inobt.ir_startino);
+}
+
 static int
 xfs_inobt_verify(
 	struct xfs_buf		*bp)
@@ -278,7 +288,6 @@ const struct xfs_buf_ops xfs_inobt_buf_ops = {
 	.verify_write = xfs_inobt_write_verify,
 };
 
-#if defined(DEBUG) || defined(XFS_WARN)
 STATIC int
 xfs_inobt_keys_inorder(
 	struct xfs_btree_cur	*cur,
@@ -298,7 +307,6 @@ xfs_inobt_recs_inorder(
 	return be32_to_cpu(r1->inobt.ir_startino) + XFS_INODES_PER_CHUNK <=
 		be32_to_cpu(r2->inobt.ir_startino);
 }
-#endif	/* DEBUG */
 
 static const struct xfs_btree_ops xfs_inobt_ops = {
 	.rec_len		= sizeof(xfs_inobt_rec_t),
@@ -315,10 +323,9 @@ static const struct xfs_btree_ops xfs_inobt_ops = {
 	.init_ptr_from_cur	= xfs_inobt_init_ptr_from_cur,
 	.key_diff		= xfs_inobt_key_diff,
 	.buf_ops		= &xfs_inobt_buf_ops,
-#if defined(DEBUG) || defined(XFS_WARN)
+	.diff_two_keys		= xfs_inobt_diff_two_keys,
 	.keys_inorder		= xfs_inobt_keys_inorder,
 	.recs_inorder		= xfs_inobt_recs_inorder,
-#endif
 };
 
 static const struct xfs_btree_ops xfs_finobt_ops = {
@@ -336,10 +343,9 @@ static const struct xfs_btree_ops xfs_finobt_ops = {
 	.init_ptr_from_cur	= xfs_finobt_init_ptr_from_cur,
 	.key_diff		= xfs_inobt_key_diff,
 	.buf_ops		= &xfs_inobt_buf_ops,
-#if defined(DEBUG) || defined(XFS_WARN)
+	.diff_two_keys		= xfs_inobt_diff_two_keys,
 	.keys_inorder		= xfs_inobt_keys_inorder,
 	.recs_inorder		= xfs_inobt_recs_inorder,
-#endif
 };
 
 /*
diff --git a/fs/xfs/xfs_scrub_sysfs.c b/fs/xfs/xfs_scrub_sysfs.c
index efaa635..cb7812f 100644
--- a/fs/xfs/xfs_scrub_sysfs.c
+++ b/fs/xfs/xfs_scrub_sysfs.c
@@ -176,10 +176,14 @@ static struct xfs_agdata_scrub_attr xfs_agdata_scrub_attr_##_name = {	     \
 
 XFS_AGDATA_SCRUB_ATTR(bnobt, NULL);
 XFS_AGDATA_SCRUB_ATTR(cntbt, NULL);
+XFS_AGDATA_SCRUB_ATTR(inobt, NULL);
+XFS_AGDATA_SCRUB_ATTR(finobt, xfs_sb_version_hasfinobt);
 
 static struct attribute *xfs_agdata_scrub_attrs[] = {
 	XFS_AGDATA_SCRUB_LIST(bnobt),
 	XFS_AGDATA_SCRUB_LIST(cntbt),
+	XFS_AGDATA_SCRUB_LIST(inobt),
+	XFS_AGDATA_SCRUB_LIST(finobt),
 	NULL,
 };
 



* [PATCH 117/119] xfs: support scrubbing rmap btree
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (115 preceding siblings ...)
  2016-06-17  1:30 ` [PATCH 116/119] xfs: support scrubbing inode btrees Darrick J. Wong
@ 2016-06-17  1:30 ` Darrick J. Wong
  2016-06-17  1:30 ` [PATCH 118/119] xfs: support scrubbing refcount btree Darrick J. Wong
  2016-06-17  1:30 ` [PATCH 119/119] xfs: add btree scrub tracepoints Darrick J. Wong
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:30 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Plumb in the pieces necessary to check the rmap btree.
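[Editor's note: the flag-exclusivity checks the helper below applies to each
rmap record can be modeled as a small predicate.  This is an illustrative
sketch only; the struct and field names are stand-ins for the kernel's
XFS_RMAP_* record flags.]

```c
#include <stdbool.h>

struct rmap_rec {
	unsigned long long	owner;
	unsigned long long	offset;
	bool			non_inode;	/* special owner, not a file */
	bool			is_bmbt;	/* maps a bmbt block */
	bool			is_attr;	/* attr fork extent */
	bool			is_unwritten;	/* unwritten data extent */
};

static bool
rmap_flags_valid(const struct rmap_rec *r)
{
	/* bmbt blocks and non-inode owners carry no logical offset */
	if ((r->is_bmbt || r->non_inode) && r->offset != 0)
		return false;
	/* unwritten state only applies to regular data-fork extents */
	if (r->is_unwritten && (r->is_bmbt || r->non_inode || r->is_attr))
		return false;
	/* non-inode owners cannot carry any fork or state flags */
	if (r->non_inode && (r->is_bmbt || r->is_attr || r->is_unwritten))
		return false;
	return true;
}
```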

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_rmap.c       |   77 ++++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_rmap_btree.c |    4 --
 fs/xfs/libxfs/xfs_rmap_btree.h |    2 +
 fs/xfs/xfs_scrub_sysfs.c       |    2 +
 4 files changed, 81 insertions(+), 4 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
index e7673ec..0c8a236 100644
--- a/fs/xfs/libxfs/xfs_rmap.c
+++ b/fs/xfs/libxfs/xfs_rmap.c
@@ -38,6 +38,7 @@
 #include "xfs_extent_busy.h"
 #include "xfs_bmap.h"
 #include "xfs_inode.h"
+#include "xfs_scrub.h"
 
 /*
  * Lookup the first record less than or equal to [bno, len, owner, offset]
@@ -2367,3 +2368,79 @@ xfs_rmap_record_exists(
 		     irec.rm_startblock + irec.rm_blockcount >= bno + len);
 	return 0;
 }
+
+STATIC int
+xfs_rmapbt_scrub_helper(
+	struct xfs_btree_scrub		*bs,
+	union xfs_btree_rec		*rec)
+{
+	struct xfs_mount		*mp = bs->cur->bc_mp;
+	struct xfs_rmap_irec		irec;
+	bool				is_freesp;
+	bool				non_inode;
+	bool				is_unwritten;
+	bool				is_bmbt;
+	bool				is_attr;
+	int				error;
+
+	error = xfs_rmapbt_btrec_to_irec(rec, &irec);
+	if (error)
+		return error;
+
+	XFS_BTREC_SCRUB_CHECK(bs, irec.rm_startblock < mp->m_sb.sb_agblocks);
+	XFS_BTREC_SCRUB_CHECK(bs, irec.rm_startblock < irec.rm_startblock +
+			irec.rm_blockcount);
+	XFS_BTREC_SCRUB_CHECK(bs, (unsigned long long)irec.rm_startblock +
+			irec.rm_blockcount <= mp->m_sb.sb_agblocks);
+
+	non_inode = XFS_RMAP_NON_INODE_OWNER(irec.rm_owner);
+	is_bmbt = irec.rm_flags & XFS_RMAP_BMBT_BLOCK;
+	is_attr = irec.rm_flags & XFS_RMAP_ATTR_FORK;
+	is_unwritten = irec.rm_flags & XFS_RMAP_UNWRITTEN;
+
+	XFS_BTREC_SCRUB_CHECK(bs, !is_bmbt || irec.rm_offset == 0);
+	XFS_BTREC_SCRUB_CHECK(bs, !non_inode || irec.rm_offset == 0);
+	XFS_BTREC_SCRUB_CHECK(bs, !is_unwritten || !(is_bmbt || non_inode ||
+			is_attr));
+	XFS_BTREC_SCRUB_CHECK(bs, !non_inode || !(is_bmbt || is_unwritten ||
+			is_attr));
+
+	/* check there's no record in freesp btrees */
+	error = xfs_alloc_record_exists(bs->bno_cur, irec.rm_startblock,
+			irec.rm_blockcount, &is_freesp);
+	if (error)
+		goto err;
+	XFS_BTREC_SCRUB_CHECK(bs, !is_freesp);
+
+	/* XXX: check with the owner */
+
+err:
+	return error;
+}
+
+/* Scrub the rmap btree for some AG. */
+int
+xfs_rmapbt_scrub(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno)
+{
+	struct xfs_btree_scrub	bs;
+	int			error;
+
+	error = xfs_alloc_read_agf(mp, NULL, agno, 0, &bs.agf_bp);
+	if (error)
+		return error;
+
+	bs.cur = xfs_rmapbt_init_cursor(mp, NULL, bs.agf_bp, agno);
+	bs.scrub_rec = xfs_rmapbt_scrub_helper;
+	xfs_rmap_ag_owner(&bs.oinfo, XFS_RMAP_OWN_AG);
+	error = xfs_btree_scrub(&bs);
+	xfs_btree_del_cursor(bs.cur,
+			error ? XFS_BTREE_ERROR : XFS_BTREE_NOERROR);
+	xfs_trans_brelse(NULL, bs.agf_bp);
+
+	if (!error && bs.error)
+		error = bs.error;
+
+	return error;
+}
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
index 0b045a6..9861e49 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.c
+++ b/fs/xfs/libxfs/xfs_rmap_btree.c
@@ -372,7 +372,6 @@ const struct xfs_buf_ops xfs_rmapbt_buf_ops = {
 	.verify_write		= xfs_rmapbt_write_verify,
 };
 
-#if defined(DEBUG) || defined(XFS_WARN)
 STATIC int
 xfs_rmapbt_keys_inorder(
 	struct xfs_btree_cur	*cur,
@@ -408,7 +407,6 @@ xfs_rmapbt_recs_inorder(
 		return 1;
 	return 0;
 }
-#endif	/* DEBUG */
 
 static const struct xfs_btree_ops xfs_rmapbt_ops = {
 	.rec_len		= sizeof(struct xfs_rmap_rec),
@@ -428,10 +426,8 @@ static const struct xfs_btree_ops xfs_rmapbt_ops = {
 	.key_diff		= xfs_rmapbt_key_diff,
 	.buf_ops		= &xfs_rmapbt_buf_ops,
 	.diff_two_keys		= xfs_rmapbt_diff_two_keys,
-#if defined(DEBUG) || defined(XFS_WARN)
 	.keys_inorder		= xfs_rmapbt_keys_inorder,
 	.recs_inorder		= xfs_rmapbt_recs_inorder,
-#endif
 };
 
 /*
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
index 2f072c8..3f8742d 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.h
+++ b/fs/xfs/libxfs/xfs_rmap_btree.h
@@ -147,4 +147,6 @@ extern int xfs_rmapbt_calc_reserves(struct xfs_mount *mp,
 extern int xfs_rmap_record_exists(struct xfs_btree_cur *cur, xfs_agblock_t bno,
 		xfs_extlen_t len, struct xfs_owner_info *oinfo, bool *has_rmap);
 
+int xfs_rmapbt_scrub(struct xfs_mount *mp, xfs_agnumber_t agno);
+
 #endif	/* __XFS_RMAP_BTREE_H__ */
diff --git a/fs/xfs/xfs_scrub_sysfs.c b/fs/xfs/xfs_scrub_sysfs.c
index cb7812f..c2256f3 100644
--- a/fs/xfs/xfs_scrub_sysfs.c
+++ b/fs/xfs/xfs_scrub_sysfs.c
@@ -178,12 +178,14 @@ XFS_AGDATA_SCRUB_ATTR(bnobt, NULL);
 XFS_AGDATA_SCRUB_ATTR(cntbt, NULL);
 XFS_AGDATA_SCRUB_ATTR(inobt, NULL);
 XFS_AGDATA_SCRUB_ATTR(finobt, xfs_sb_version_hasfinobt);
+XFS_AGDATA_SCRUB_ATTR(rmapbt, xfs_sb_version_hasrmapbt);
 
 static struct attribute *xfs_agdata_scrub_attrs[] = {
 	XFS_AGDATA_SCRUB_LIST(bnobt),
 	XFS_AGDATA_SCRUB_LIST(cntbt),
 	XFS_AGDATA_SCRUB_LIST(inobt),
 	XFS_AGDATA_SCRUB_LIST(finobt),
+	XFS_AGDATA_SCRUB_LIST(rmapbt),
 	NULL,
 };
 



* [PATCH 118/119] xfs: support scrubbing refcount btree
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (116 preceding siblings ...)
  2016-06-17  1:30 ` [PATCH 117/119] xfs: support scrubbing rmap btree Darrick J. Wong
@ 2016-06-17  1:30 ` Darrick J. Wong
  2016-06-17  1:30 ` [PATCH 119/119] xfs: add btree scrub tracepoints Darrick J. Wong
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:30 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Plumb in the pieces necessary to check the refcount btree.  If rmap is
available, check the reference count by performing an interval query
against the rmapbt.

v2: Handle the case where the rmap records are not all at least the
length of the refcount extent.
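[Editor's note: the fragment worklist in the patch below computes, without
visiting individual blocks, whether the rmap intervals overlapping a
refcount record stack up to exactly rc_refcount everywhere.  A per-block
model of the same invariant makes the intent clearer; this sketch is only
illustrative, and the names in it are not the kernel's.]

```c
#include <stdbool.h>

struct interval {
	unsigned int	start;
	unsigned int	len;
};

/*
 * Return true iff every block in [rc_start, rc_start + rc_len) is
 * covered by exactly rc_refcount of the given rmap intervals (as
 * returned by a range query of the rmapbt over the refcount extent).
 */
static bool
refcount_covered(unsigned int rc_start, unsigned int rc_len,
		 unsigned int rc_refcount,
		 const struct interval *rmaps, int nr_rmaps)
{
	unsigned int	bno;
	int		i;

	for (bno = rc_start; bno < rc_start + rc_len; bno++) {
		unsigned int cover = 0;

		for (i = 0; i < nr_rmaps; i++)
			if (rmaps[i].start <= bno &&
			    bno < rmaps[i].start + rmaps[i].len)
				cover++;
		if (cover != rc_refcount)
			return false;
	}
	return true;
}
```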

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_refcount.c       |  224 ++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_refcount.h       |    2 
 fs/xfs/libxfs/xfs_refcount_btree.c |   16 ++-
 fs/xfs/xfs_scrub_sysfs.c           |    2 
 4 files changed, 240 insertions(+), 4 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
index e8d8702..126dd57 100644
--- a/fs/xfs/libxfs/xfs_refcount.c
+++ b/fs/xfs/libxfs/xfs_refcount.c
@@ -37,6 +37,7 @@
 #include "xfs_bit.h"
 #include "xfs_refcount.h"
 #include "xfs_rmap_btree.h"
+#include "xfs_scrub.h"
 
 /* Allowable refcount adjustment amounts. */
 enum xfs_refc_adjust_op {
@@ -1578,3 +1579,226 @@ xfs_refcount_free_cow_extent(
 
 	return __xfs_refcount_add(mp, dfops, &ri);
 }
+
+struct xfs_refcountbt_scrub_fragment {
+	struct xfs_rmap_irec		rm;
+	struct list_head		list;
+};
+
+struct xfs_refcountbt_scrub_rmap_check_info {
+	xfs_nlink_t			nr;
+	struct xfs_refcount_irec	rc;
+	struct list_head		fragments;
+};
+
+static int
+xfs_refcountbt_scrub_rmap_check(
+	struct xfs_btree_cur		*cur,
+	struct xfs_rmap_irec		*rec,
+	void				*priv)
+{
+	struct xfs_refcountbt_scrub_rmap_check_info	*rsrci = priv;
+	struct xfs_refcountbt_scrub_fragment		*frag;
+	xfs_agblock_t			rm_last;
+	xfs_agblock_t			rc_last;
+
+	rm_last = rec->rm_startblock + rec->rm_blockcount;
+	rc_last = rsrci->rc.rc_startblock + rsrci->rc.rc_blockcount;
+	if (rec->rm_startblock <= rsrci->rc.rc_startblock && rm_last >= rc_last)
+		rsrci->nr++;
+	else {
+		frag = kmem_zalloc(sizeof(struct xfs_refcountbt_scrub_fragment),
+				KM_SLEEP);
+		frag->rm = *rec;
+		list_add_tail(&frag->list, &rsrci->fragments);
+	}
+
+	return 0;
+}
+
+STATIC void
+xfs_refcountbt_process_rmap_fragments(
+	struct xfs_mount				*mp,
+	struct xfs_refcountbt_scrub_rmap_check_info	*rsrci)
+{
+	struct list_head				worklist;
+	struct xfs_refcountbt_scrub_fragment		*cur;
+	struct xfs_refcountbt_scrub_fragment		*n;
+	xfs_agblock_t					bno;
+	xfs_agblock_t					rbno;
+	xfs_agblock_t					next_rbno;
+	xfs_nlink_t					nr;
+	xfs_nlink_t					target_nr;
+
+	target_nr = rsrci->rc.rc_refcount - rsrci->nr;
+	if (target_nr == 0)
+		return;
+
+	/*
+	 * There are (rsrci->rc.rc_refcount - rsrci->nr)
+	 * references we haven't found yet.  Pull that many off the
+	 * fragment list and figure out where the smallest rmap ends
+	 * (and therefore the next rmap should start).  All the rmaps
+	 * we pull off should start at or before the beginning of the
+	 * refcount record's range.
+	 */
+	INIT_LIST_HEAD(&worklist);
+	rbno = NULLAGBLOCK;
+	nr = 1;
+	list_for_each_entry_safe(cur, n, &rsrci->fragments, list) {
+		if (cur->rm.rm_startblock > rsrci->rc.rc_startblock)
+			goto fail;
+		bno = cur->rm.rm_startblock + cur->rm.rm_blockcount;
+		if (rbno > bno)
+			rbno = bno;
+		list_del(&cur->list);
+		list_add_tail(&cur->list, &worklist);
+		if (nr == target_nr)
+			break;
+		nr++;
+	}
+
+	if (nr != target_nr)
+		goto fail;
+
+	while (!list_empty(&rsrci->fragments)) {
+		/* Discard any fragments ending at rbno. */
+		nr = 0;
+		next_rbno = NULLAGBLOCK;
+		list_for_each_entry_safe(cur, n, &worklist, list) {
+			bno = cur->rm.rm_startblock + cur->rm.rm_blockcount;
+			if (bno != rbno) {
+				if (next_rbno > bno)
+					next_rbno = bno;
+				continue;
+			}
+			list_del(&cur->list);
+			kmem_free(cur);
+			nr++;
+		}
+
+		/* Empty list?  We're done. */
+		if (list_empty(&rsrci->fragments))
+			break;
+
+		/* Try to add nr rmaps starting at rbno to the worklist. */
+		list_for_each_entry_safe(cur, n, &rsrci->fragments, list) {
+			bno = cur->rm.rm_startblock + cur->rm.rm_blockcount;
+			if (cur->rm.rm_startblock != rbno)
+				goto fail;
+			list_del(&cur->list);
+			list_add_tail(&cur->list, &worklist);
+			if (next_rbno > bno)
+				next_rbno = bno;
+			nr--;
+			if (nr == 0)
+				break;
+		}
+
+		rbno = next_rbno;
+	}
+
+	/*
+	 * Make sure the last extent we processed ends at or beyond
+	 * the end of the refcount extent.
+	 */
+	if (rbno < rsrci->rc.rc_startblock + rsrci->rc.rc_blockcount)
+		goto fail;
+
+	rsrci->nr = rsrci->rc.rc_refcount;
+fail:
+	/* Delete fragments and work list. */
+	while (!list_empty(&worklist)) {
+		cur = list_first_entry(&worklist,
+				struct xfs_refcountbt_scrub_fragment, list);
+		list_del(&cur->list);
+		kmem_free(cur);
+	}
+	while (!list_empty(&rsrci->fragments)) {
+		cur = list_first_entry(&rsrci->fragments,
+				struct xfs_refcountbt_scrub_fragment, list);
+		list_del(&cur->list);
+		kmem_free(cur);
+	}
+}
+
+STATIC int
+xfs_refcountbt_scrub_helper(
+	struct xfs_btree_scrub		*bs,
+	union xfs_btree_rec		*rec)
+{
+	struct xfs_mount		*mp = bs->cur->bc_mp;
+	struct xfs_rmap_irec		low;
+	struct xfs_rmap_irec		high;
+	struct xfs_refcount_irec	irec;
+	struct xfs_refcountbt_scrub_rmap_check_info	rsrci;
+	struct xfs_refcountbt_scrub_fragment		*cur;
+	int				error;
+
+	irec.rc_startblock = be32_to_cpu(rec->refc.rc_startblock);
+	irec.rc_blockcount = be32_to_cpu(rec->refc.rc_blockcount);
+	irec.rc_refcount = be32_to_cpu(rec->refc.rc_refcount);
+
+	XFS_BTREC_SCRUB_CHECK(bs, irec.rc_startblock < mp->m_sb.sb_agblocks);
+	XFS_BTREC_SCRUB_CHECK(bs, irec.rc_startblock < irec.rc_startblock +
+			irec.rc_blockcount);
+	XFS_BTREC_SCRUB_CHECK(bs, (unsigned long long)irec.rc_startblock +
+			irec.rc_blockcount <= mp->m_sb.sb_agblocks);
+	XFS_BTREC_SCRUB_CHECK(bs, irec.rc_refcount >= 1);
+
+	/* confirm the refcount */
+	if (!bs->rmap_cur)
+		return 0;
+
+	memset(&low, 0, sizeof(low));
+	low.rm_startblock = irec.rc_startblock;
+	memset(&high, 0xFF, sizeof(high));
+	high.rm_startblock = irec.rc_startblock + irec.rc_blockcount - 1;
+
+	rsrci.nr = 0;
+	rsrci.rc = irec;
+	INIT_LIST_HEAD(&rsrci.fragments);
+	error = xfs_rmapbt_query_range(bs->rmap_cur, &low, &high,
+			&xfs_refcountbt_scrub_rmap_check, &rsrci);
+	if (error && error != XFS_BTREE_QUERY_RANGE_ABORT)
+		goto err;
+	error = 0;
+	xfs_refcountbt_process_rmap_fragments(mp, &rsrci);
+	XFS_BTREC_SCRUB_CHECK(bs, irec.rc_refcount == rsrci.nr);
+
+err:
+	while (!list_empty(&rsrci.fragments)) {
+		cur = list_first_entry(&rsrci.fragments,
+				struct xfs_refcountbt_scrub_fragment, list);
+		list_del(&cur->list);
+		kmem_free(cur);
+	}
+	return error;
+}
+
+/* Scrub the refcount btree for some AG. */
+int
+xfs_refcountbt_scrub(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno)
+{
+	struct xfs_btree_scrub	bs;
+	int			error;
+
+	error = xfs_alloc_read_agf(mp, NULL, agno, 0, &bs.agf_bp);
+	if (error)
+		return error;
+
+	bs.cur = xfs_refcountbt_init_cursor(mp, NULL, bs.agf_bp, agno, NULL);
+	bs.scrub_rec = xfs_refcountbt_scrub_helper;
+	xfs_rmap_ag_owner(&bs.oinfo, XFS_RMAP_OWN_REFC);
+	error = xfs_btree_scrub(&bs);
+	xfs_btree_del_cursor(bs.cur,
+			error ? XFS_BTREE_ERROR : XFS_BTREE_NOERROR);
+	xfs_trans_brelse(NULL, bs.agf_bp);
+
+	if (!error && bs.error)
+		error = bs.error;
+
+	return error;
+}
diff --git a/fs/xfs/libxfs/xfs_refcount.h b/fs/xfs/libxfs/xfs_refcount.h
index 44b0346..d2317f1 100644
--- a/fs/xfs/libxfs/xfs_refcount.h
+++ b/fs/xfs/libxfs/xfs_refcount.h
@@ -68,4 +68,6 @@ extern int xfs_refcount_free_cow_extent(struct xfs_mount *mp,
 		struct xfs_defer_ops *dfops, xfs_fsblock_t fsb,
 		xfs_extlen_t len);
 
+extern int xfs_refcountbt_scrub(struct xfs_mount *mp, xfs_agnumber_t agno);
+
 #endif	/* __XFS_REFCOUNT_H__ */
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
index abf1ebf..c5d2942 100644
--- a/fs/xfs/libxfs/xfs_refcount_btree.c
+++ b/fs/xfs/libxfs/xfs_refcount_btree.c
@@ -197,6 +197,16 @@ xfs_refcountbt_key_diff(
 	return (__int64_t)be32_to_cpu(kp->rc_startblock) - rec->rc_startblock;
 }
 
+STATIC __int64_t
+xfs_refcountbt_diff_two_keys(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_key	*k1,
+	union xfs_btree_key	*k2)
+{
+	return (__int64_t)be32_to_cpu(k2->refc.rc_startblock) -
+			  be32_to_cpu(k1->refc.rc_startblock);
+}
+
 STATIC bool
 xfs_refcountbt_verify(
 	struct xfs_buf		*bp)
@@ -259,7 +269,6 @@ const struct xfs_buf_ops xfs_refcountbt_buf_ops = {
 	.verify_write		= xfs_refcountbt_write_verify,
 };
 
-#if defined(DEBUG) || defined(XFS_WARN)
 STATIC int
 xfs_refcountbt_keys_inorder(
 	struct xfs_btree_cur	*cur,
@@ -288,13 +297,13 @@ xfs_refcountbt_recs_inorder(
 		b.rc_startblock = be32_to_cpu(r2->refc.rc_startblock);
 		b.rc_blockcount = be32_to_cpu(r2->refc.rc_blockcount);
 		b.rc_refcount = be32_to_cpu(r2->refc.rc_refcount);
+		a = a; b = b;
 		trace_xfs_refcount_rec_order_error(cur->bc_mp,
 				cur->bc_private.a.agno, &a, &b);
 	}
 
 	return ret;
 }
-#endif	/* DEBUG */
 
 static const struct xfs_btree_ops xfs_refcountbt_ops = {
 	.rec_len		= sizeof(struct xfs_refcount_rec),
@@ -311,10 +320,9 @@ static const struct xfs_btree_ops xfs_refcountbt_ops = {
 	.init_ptr_from_cur	= xfs_refcountbt_init_ptr_from_cur,
 	.key_diff		= xfs_refcountbt_key_diff,
 	.buf_ops		= &xfs_refcountbt_buf_ops,
-#if defined(DEBUG) || defined(XFS_WARN)
+	.diff_two_keys		= xfs_refcountbt_diff_two_keys,
 	.keys_inorder		= xfs_refcountbt_keys_inorder,
 	.recs_inorder		= xfs_refcountbt_recs_inorder,
-#endif
 };
 
 /*
diff --git a/fs/xfs/xfs_scrub_sysfs.c b/fs/xfs/xfs_scrub_sysfs.c
index c2256f3..ad51e05 100644
--- a/fs/xfs/xfs_scrub_sysfs.c
+++ b/fs/xfs/xfs_scrub_sysfs.c
@@ -179,6 +179,7 @@ XFS_AGDATA_SCRUB_ATTR(cntbt, NULL);
 XFS_AGDATA_SCRUB_ATTR(inobt, NULL);
 XFS_AGDATA_SCRUB_ATTR(finobt, xfs_sb_version_hasfinobt);
 XFS_AGDATA_SCRUB_ATTR(rmapbt, xfs_sb_version_hasrmapbt);
+XFS_AGDATA_SCRUB_ATTR(refcountbt, xfs_sb_version_hasreflink);
 
 static struct attribute *xfs_agdata_scrub_attrs[] = {
 	XFS_AGDATA_SCRUB_LIST(bnobt),
@@ -186,6 +187,7 @@ static struct attribute *xfs_agdata_scrub_attrs[] = {
 	XFS_AGDATA_SCRUB_LIST(inobt),
 	XFS_AGDATA_SCRUB_LIST(finobt),
 	XFS_AGDATA_SCRUB_LIST(rmapbt),
+	XFS_AGDATA_SCRUB_LIST(refcountbt),
 	NULL,
 };
 



* [PATCH 119/119] xfs: add btree scrub tracepoints
  2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
                   ` (117 preceding siblings ...)
  2016-06-17  1:30 ` [PATCH 118/119] xfs: support scrubbing refcount btree Darrick J. Wong
@ 2016-06-17  1:30 ` Darrick J. Wong
  118 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17  1:30 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_scrub.c |   14 ++++++++++++++
 fs/xfs/xfs_trace.h        |   40 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 54 insertions(+)


diff --git a/fs/xfs/libxfs/xfs_scrub.c b/fs/xfs/libxfs/xfs_scrub.c
index d43d5c5..d43e742 100644
--- a/fs/xfs/libxfs/xfs_scrub.c
+++ b/fs/xfs/libxfs/xfs_scrub.c
@@ -34,6 +34,7 @@
 #include "xfs_rmap_btree.h"
 #include "xfs_log_format.h"
 #include "xfs_trans.h"
+#include "xfs_trace.h"
 #include "xfs_scrub.h"
 
 static const char * const btree_types[] = {
@@ -88,6 +89,12 @@ xfs_btree_scrub_rec(
 	struct xfs_btree_block	*block;
 	struct xfs_btree_block	*keyblock;
 
+	trace_xfs_btree_scrub_rec(cur->bc_mp, cur->bc_private.a.agno,
+			XFS_FSB_TO_AGBNO(cur->bc_mp,
+				XFS_DADDR_TO_FSB(cur->bc_mp,
+					cur->bc_bufs[0]->b_bn)),
+			cur->bc_btnum, 0, cur->bc_nlevels, cur->bc_ptrs[0]);
+
 	block = XFS_BUF_TO_BLOCK(cur->bc_bufs[0]);
 	rec = xfs_btree_rec_addr(cur, cur->bc_ptrs[0], block);
 
@@ -135,6 +142,13 @@ xfs_btree_scrub_key(
 	struct xfs_btree_block	*block;
 	struct xfs_btree_block	*keyblock;
 
+	trace_xfs_btree_scrub_key(cur->bc_mp, cur->bc_private.a.agno,
+			XFS_FSB_TO_AGBNO(cur->bc_mp,
+				XFS_DADDR_TO_FSB(cur->bc_mp,
+					cur->bc_bufs[level]->b_bn)),
+			cur->bc_btnum, level, cur->bc_nlevels,
+			cur->bc_ptrs[level]);
+
 	block = XFS_BUF_TO_BLOCK(cur->bc_bufs[level]);
 	key = xfs_btree_key_addr(cur, cur->bc_ptrs[level], block);
 
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 9fe812f..e295374 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -3428,6 +3428,46 @@ DEFINE_GETFSMAP_EVENT(xfs_getfsmap_low_key);
 DEFINE_GETFSMAP_EVENT(xfs_getfsmap_high_key);
 DEFINE_GETFSMAP_EVENT(xfs_getfsmap_mapping);
 
+/* scrub */
+DECLARE_EVENT_CLASS(xfs_scrub_sbtree_class,
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, xfs_agblock_t bno,
+		 xfs_btnum_t btnum, int level, int nlevels, int ptr),
+	TP_ARGS(mp, agno, bno, btnum, level, nlevels, ptr),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_btnum_t, btnum)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_agblock_t, bno)
+		__field(int, level)
+		__field(int, nlevels)
+		__field(int, ptr)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->agno = agno;
+		__entry->btnum = btnum;
+		__entry->bno = bno;
+		__entry->level = level;
+		__entry->nlevels = nlevels;
+		__entry->ptr = ptr;
+	),
+	TP_printk("dev %d:%d agno %u agbno %u btnum %d level %d nlevels %d ptr %d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->agno,
+		  __entry->bno,
+		  __entry->btnum,
+		  __entry->level,
+		  __entry->nlevels,
+		  __entry->ptr)
+)
+#define DEFINE_SCRUB_SBTREE_EVENT(name) \
+DEFINE_EVENT(xfs_scrub_sbtree_class, name, \
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, xfs_agblock_t bno, \
+		 xfs_btnum_t btnum, int level, int nlevels, int ptr), \
+	TP_ARGS(mp, agno, bno, btnum, level, nlevels, ptr))
+DEFINE_SCRUB_SBTREE_EVENT(xfs_btree_scrub_rec);
+DEFINE_SCRUB_SBTREE_EVENT(xfs_btree_scrub_key);
+
 #endif /* _TRACE_XFS_H */
 
 #undef TRACE_INCLUDE_PATH



* Re: [PATCH 001/119] vfs: fix return type of ioctl_file_dedupe_range
  2016-06-17  1:17 ` [PATCH 001/119] vfs: fix return type of ioctl_file_dedupe_range Darrick J. Wong
@ 2016-06-17 11:32   ` Christoph Hellwig
  2016-06-28 19:19     ` Darrick J. Wong
  0 siblings, 1 reply; 236+ messages in thread
From: Christoph Hellwig @ 2016-06-17 11:32 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, vishal.l.verma, xfs

On Thu, Jun 16, 2016 at 06:17:59PM -0700, Darrick J. Wong wrote:
> All the VFS functions in the dedupe ioctl path return int status, so
> the ioctl handler ought to as well.
> 
> Found by Coverity, CID 1350952.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

This should go out to Al as a separate patch.


* Re: [PATCH 003/119] xfs: check offsets of variable length structures
  2016-06-17  1:18 ` [PATCH 003/119] xfs: check offsets of variable length structures Darrick J. Wong
@ 2016-06-17 11:33   ` Christoph Hellwig
  2016-06-17 17:34   ` Brian Foster
  1 sibling, 0 replies; 236+ messages in thread
From: Christoph Hellwig @ 2016-06-17 11:33 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, vishal.l.verma, xfs

On Thu, Jun 16, 2016 at 06:18:12PM -0700, Darrick J. Wong wrote:
> Some of the directory/attr structures contain variable-length objects,
> so the enclosing structure doesn't have a meaningful fixed size at
> compile time.  We can check the offsets of the members before the
> variable-length member, so do those.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

Looks fine, and should go in independently of the rmap work:

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 004/119] xfs: enable buffer deadlock postmortem diagnosis via ftrace
  2016-06-17  1:18 ` [PATCH 004/119] xfs: enable buffer deadlock postmortem diagnosis via ftrace Darrick J. Wong
@ 2016-06-17 11:34   ` Christoph Hellwig
  2016-06-21  0:47     ` Dave Chinner
  0 siblings, 1 reply; 236+ messages in thread
From: Christoph Hellwig @ 2016-06-17 11:34 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, vishal.l.verma, xfs

> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index efa2a73..2333db7 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -947,7 +947,8 @@ xfs_buf_trylock(
>  	if (locked)
>  		XB_SET_OWNER(bp);
>  
> -	trace_xfs_buf_trylock(bp, _RET_IP_);
> +	locked ? trace_xfs_buf_trylock(bp, _RET_IP_) :
> +		 trace_xfs_buf_trylock_fail(bp, _RET_IP_);
>  	return locked;

I think this should be something like:

	if (locked) {
		XB_SET_OWNER(bp);
		trace_xfs_buf_trylock(bp, _RET_IP_);
	} else {
		trace_xfs_buf_trylock_fail(bp, _RET_IP_);
	}

otherwise this looks good and can go in without the rest of the series.


* Re: [PATCH 005/119] xfs: check for a valid error_tag in errortag_add
  2016-06-17  1:18 ` [PATCH 005/119] xfs: check for a valid error_tag in errortag_add Darrick J. Wong
@ 2016-06-17 11:34   ` Christoph Hellwig
  0 siblings, 0 replies; 236+ messages in thread
From: Christoph Hellwig @ 2016-06-17 11:34 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, vishal.l.verma, xfs

On Thu, Jun 16, 2016 at 06:18:24PM -0700, Darrick J. Wong wrote:
> Currently we don't check the error_tag when someone's trying to set up
> error injection testing.  If userspace passes in a value we don't know
> about, send back an error.  This will help xfstests to _notrun a test
> that uses error injection to test things like log replay.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 007/119] xfs: rearrange xfs_bmap_add_free parameters
  2016-06-17  1:18 ` [PATCH 007/119] xfs: rearrange xfs_bmap_add_free parameters Darrick J. Wong
@ 2016-06-17 11:39   ` Christoph Hellwig
  0 siblings, 0 replies; 236+ messages in thread
From: Christoph Hellwig @ 2016-06-17 11:39 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: david, linux-fsdevel, vishal.l.verma, xfs, Christoph Hellwig,
	Dave Chinner

On Thu, Jun 16, 2016 at 06:18:37PM -0700, Darrick J. Wong wrote:
> This is already in xfsprogs' libxfs, so port it to the kernel.

Oh well, this is something that should have gone into the kernel
at the same time..


* Re: [PATCH 002/119] vfs: support FS_XFLAG_REFLINK and FS_XFLAG_COWEXTSIZE
  2016-06-17  1:18 ` [PATCH 002/119] vfs: support FS_XFLAG_REFLINK and FS_XFLAG_COWEXTSIZE Darrick J. Wong
@ 2016-06-17 11:41   ` Christoph Hellwig
  2016-06-17 12:16     ` Brian Foster
  0 siblings, 1 reply; 236+ messages in thread
From: Christoph Hellwig @ 2016-06-17 11:41 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, vishal.l.verma, xfs

On Thu, Jun 16, 2016 at 06:18:05PM -0700, Darrick J. Wong wrote:
> Introduce XFLAGs for the new XFS reflink inode flag and the CoW extent
> size hint, and actually plumb the CoW extent size hint into the fsxattr
> structure.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

Should go behind all the updates that are useful without any new
rmap or reflink functionality.  In fact it would be great if you
could send out a series with just those little fixes and cleanups
first.


* Re: [PATCH 008/119] xfs: separate freelist fixing into a separate helper
  2016-06-17  1:18 ` [PATCH 008/119] xfs: separate freelist fixing into a separate helper Darrick J. Wong
@ 2016-06-17 11:52   ` Christoph Hellwig
  2016-06-21  0:48     ` Dave Chinner
  2016-06-21  1:40   ` Dave Chinner
  1 sibling, 1 reply; 236+ messages in thread
From: Christoph Hellwig @ 2016-06-17 11:52 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, vishal.l.verma, xfs

> +/* Ensure that the freelist is at full capacity. */
> +int
> +xfs_free_extent_fix_freelist(
> +	struct xfs_trans	*tp,
> +	xfs_agnumber_t		agno,
> +	struct xfs_buf		**agbp)
>  {
> -	xfs_alloc_arg_t	args;
> -	int		error;
> +	xfs_alloc_arg_t		args;

Use struct xfs_alloc_arg if you change this anyway.

> +	int			error;
>  
> -	ASSERT(len != 0);
>  	memset(&args, 0, sizeof(xfs_alloc_arg_t));

Same here.

> -	if (args.agbno + len >
> -			be32_to_cpu(XFS_BUF_TO_AGF(args.agbp)->agf_length)) {
> -		error = -EFSCORRUPTED;
> -		goto error0;
> -	}
> +	XFS_WANT_CORRUPTED_GOTO(mp,
> +			agbno + len <= be32_to_cpu(XFS_BUF_TO_AGF(agbp)->agf_length),
> +			err);

This introduces an overly long line.

But except for these nitpicks this looks fine:

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 009/119] xfs: convert list of extents to free into a regular list
  2016-06-17  1:18 ` [PATCH 009/119] xfs: convert list of extents to free into a regular list Darrick J. Wong
@ 2016-06-17 11:59   ` Christoph Hellwig
  2016-06-18 20:15     ` Darrick J. Wong
  0 siblings, 1 reply; 236+ messages in thread
From: Christoph Hellwig @ 2016-06-17 11:59 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, vishal.l.verma, xfs

>  {
> +	struct xfs_bmap_free_item	*new;		/* new element */
>  #ifdef DEBUG
>  	xfs_agnumber_t		agno;
>  	xfs_agblock_t		agbno;
> @@ -597,17 +595,7 @@ xfs_bmap_add_free(
>  	new = kmem_zone_alloc(xfs_bmap_free_item_zone, KM_SLEEP);
>  	new->xbfi_startblock = bno;
>  	new->xbfi_blockcount = (xfs_extlen_t)len;
> +	list_add(&new->xbfi_list, &flist->xbf_flist);
>  	flist->xbf_count++;

Please kill xbf_count while you're at it, it's entirely superfluous.

> @@ -617,14 +605,10 @@ xfs_bmap_add_free(
>   */
>  void
>  xfs_bmap_del_free(
> -	xfs_bmap_free_t		*flist,	/* free item list header */
> -	xfs_bmap_free_item_t	*prev,	/* previous item on list, if any */
> -	xfs_bmap_free_item_t	*free)	/* list item to be freed */
> +	struct xfs_bmap_free		*flist,	/* free item list header */
> +	struct xfs_bmap_free_item	*free)	/* list item to be freed */

Which then also gets rid of the flist argument here.

> @@ -634,17 +618,16 @@ xfs_bmap_del_free(
>   */
>  void
>  xfs_bmap_cancel(
> +	struct xfs_bmap_free		*flist)	/* list of bmap_free_items */
>  {
> +	struct xfs_bmap_free_item	*free;	/* free list item */
>  
>  	if (flist->xbf_count == 0)
>  		return;
> +	while (!list_empty(&flist->xbf_flist)) {
> +		free = list_first_entry(&flist->xbf_flist,
> +				struct xfs_bmap_free_item, xbfi_list);

	while ((free = list_first_entry_or_null(...))

> +	list_sort((*tp)->t_mountp, &flist->xbf_flist, xfs_bmap_free_list_cmp);

Can you add a comment on why we are sorting the list?


* Re: [PATCH 002/119] vfs: support FS_XFLAG_REFLINK and FS_XFLAG_COWEXTSIZE
  2016-06-17 11:41   ` Christoph Hellwig
@ 2016-06-17 12:16     ` Brian Foster
  2016-06-17 15:06       ` Christoph Hellwig
  2016-06-17 16:54       ` Darrick J. Wong
  0 siblings, 2 replies; 236+ messages in thread
From: Brian Foster @ 2016-06-17 12:16 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Darrick J. Wong, linux-fsdevel, vishal.l.verma, xfs

On Fri, Jun 17, 2016 at 04:41:17AM -0700, Christoph Hellwig wrote:
> On Thu, Jun 16, 2016 at 06:18:05PM -0700, Darrick J. Wong wrote:
> > Introduce XFLAGs for the new XFS reflink inode flag and the CoW extent
> > size hint, and actually plumb the CoW extent size hint into the fsxattr
> > structure.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Should go behind all the updates that are useful without any new
> rmap or reflink functionality.  In fact it would be great if you
> could send out a series with just those little fixes and cleanups
> first.
> 

I'd take that a step further and suggest the entire series be split into
independent feature series, as appropriate. Unless I'm missing
something, I don't think there's any reason these all need to be bundled
together. Further, my expectation is that they probably end up being
merged as independent units, so I think it's easier for everybody for
Darrick to carve that up on the logical boundaries rather than assume
all reviewers and maintainer are going to do so consistently.

Note that I'm not saying this has to be reposted.. I think I can pull
off the rmap bits for the time being. I'm just suggesting that if a
repost is required from this point forward for any of the logical
subunits (deps, rmap, reflink, scrub), I'd suggest to post, version and
changelog those units independently.

Brian

> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs


* Re: [PATCH 006/119] xfs: port differences from xfsprogs libxfs
  2016-06-17  1:18 ` [PATCH 006/119] xfs: port differences from xfsprogs libxfs Darrick J. Wong
@ 2016-06-17 15:06   ` Christoph Hellwig
  2016-06-20  0:21   ` Dave Chinner
  1 sibling, 0 replies; 236+ messages in thread
From: Christoph Hellwig @ 2016-06-17 15:06 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, vishal.l.verma, xfs

I think this needs to be split out into patches, one for each logical
change.

> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index 99b077c..58bdca7 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -2415,7 +2415,9 @@ xfs_alloc_read_agf(
>  			be32_to_cpu(agf->agf_levels[XFS_BTNUM_CNTi]);
>  		spin_lock_init(&pag->pagb_lock);
>  		pag->pagb_count = 0;
> +#ifdef __KERNEL__
>  		pag->pagb_tree = RB_ROOT;
> +#endif

I'd much rather have a dummy tree in libxfs than sprinkling random
ifdefs.

> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 932381c..499e980 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -1425,7 +1425,7 @@ xfs_bmap_search_multi_extents(
>   * Else, *lastxp will be set to the index of the found
>   * entry; *gotp will contain the entry.
>   */
> -STATIC xfs_bmbt_rec_host_t *                 /* pointer to found extent entry */
> +xfs_bmbt_rec_host_t *                 /* pointer to found extent entry */
>  xfs_bmap_search_extents(

probably wants a comment that we keep it public for xfsprogs..

> diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> index 1f88e1c..105979d 100644
> --- a/fs/xfs/libxfs/xfs_btree.c
> +++ b/fs/xfs/libxfs/xfs_btree.c
> @@ -2532,6 +2532,7 @@ error0:
>  	return error;
>  }
>  
> +#ifdef __KERNEL__
>  struct xfs_btree_split_args {
>  	struct xfs_btree_cur	*cur;
>  	int			level;
> @@ -2609,6 +2610,9 @@ xfs_btree_split(
>  	destroy_work_on_stack(&args.work);
>  	return args.result;
>  }
> +#else /* !KERNEL */
> +#define xfs_btree_split	__xfs_btree_split
> +#endif

I'd really prefer to avoid the ifdefs - can't we rename and move
the kernel version instead?  That might be a possibility.

> @@ -115,7 +115,7 @@ do {    \
>  		__XFS_BTREE_STATS_ADD(__mp, ibt, stat, val); break; \
>  	case XFS_BTNUM_FINO:	\
>  		__XFS_BTREE_STATS_ADD(__mp, fibt, stat, val); break; \
> -	case XFS_BTNUM_MAX: ASSERT(0); /* fucking gcc */ ; break; \
> +	case XFS_BTNUM_MAX: ASSERT(0); __mp = __mp /* fucking gcc */ ; break; \

Or add whatever gcc flag we use to silence this one to xfsprogs as well?

> index 3cc3cf7..06b574d 100644
> --- a/fs/xfs/libxfs/xfs_dquot_buf.c
> +++ b/fs/xfs/libxfs/xfs_dquot_buf.c
> @@ -31,10 +31,16 @@
>  #include "xfs_cksum.h"
>  #include "xfs_trace.h"
>  
> +/*
> + * XXX: kernel implementation causes ndquots calc to go real
> + * bad. Just leaving the existing userspace calc here right now.
> + */
>  int
>  xfs_calc_dquots_per_chunk(
>  	unsigned int		nbblks)	/* basic block units */
>  {
> +#ifdef __KERNEL__
> +	/* kernel code that goes wrong in userspace! */
>  	unsigned int	ndquots;
>  
>  	ASSERT(nbblks > 0);
> @@ -42,6 +48,10 @@ xfs_calc_dquots_per_chunk(
>  	do_div(ndquots, sizeof(xfs_dqblk_t));
>  
>  	return ndquots;
> +#else
> +	ASSERT(nbblks > 0);
> +	return BBTOB(nbblks) / sizeof(xfs_dqblk_t);
> +#endif

Eww.  Can someone explain why we aren't always using the userspace
version?  Using do_div on a 32-bit variable seems rather pointless.

> diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> index 9d9559e..794fa66 100644
> --- a/fs/xfs/libxfs/xfs_inode_buf.c
> +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> @@ -56,6 +56,17 @@ xfs_inobp_check(
>  }
>  #endif
>  
> +bool
> +xfs_dinode_good_version(
> +	struct xfs_mount *mp,
> +	__u8		version)
> +{
> +	if (xfs_sb_version_hascrc(&mp->m_sb))
> +		return version == 3;
> +
> +	return version == 1 || version == 2;
> +}

Odd that this appeared in xfsprogs only.  


* Re: [PATCH 002/119] vfs: support FS_XFLAG_REFLINK and FS_XFLAG_COWEXTSIZE
  2016-06-17 12:16     ` Brian Foster
@ 2016-06-17 15:06       ` Christoph Hellwig
  2016-06-17 16:54       ` Darrick J. Wong
  1 sibling, 0 replies; 236+ messages in thread
From: Christoph Hellwig @ 2016-06-17 15:06 UTC (permalink / raw)
  To: Brian Foster
  Cc: Christoph Hellwig, linux-fsdevel, vishal.l.verma, xfs, Darrick J. Wong

On Fri, Jun 17, 2016 at 08:16:05AM -0400, Brian Foster wrote:
> I'd take that a step further and suggest the entire series be split into
> independent feature series, as appropriate.

Yes, that's what I meant.  I just didn't manage to get to the rest yet.

> off the rmap bits for the time being. I'm just suggesting that if a
> repost is required from this point forward for any of the logical
> subunits (deps, rmap, reflink, scrub), I'd suggest to post, version and
> changelog those units independently.

Agreed.


* Re: [PATCH 002/119] vfs: support FS_XFLAG_REFLINK and FS_XFLAG_COWEXTSIZE
  2016-06-17 12:16     ` Brian Foster
  2016-06-17 15:06       ` Christoph Hellwig
@ 2016-06-17 16:54       ` Darrick J. Wong
  2016-06-17 17:38         ` Brian Foster
  1 sibling, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-17 16:54 UTC (permalink / raw)
  To: Brian Foster; +Cc: Christoph Hellwig, linux-fsdevel, vishal.l.verma, xfs

On Fri, Jun 17, 2016 at 08:16:05AM -0400, Brian Foster wrote:
> On Fri, Jun 17, 2016 at 04:41:17AM -0700, Christoph Hellwig wrote:
> > On Thu, Jun 16, 2016 at 06:18:05PM -0700, Darrick J. Wong wrote:
> > > Introduce XFLAGs for the new XFS reflink inode flag and the CoW extent
> > > size hint, and actually plumb the CoW extent size hint into the fsxattr
> > > structure.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Should go behind all the updates that are useful without any new
> > rmap or reflink functionality.  In fact it would be great if you
> > could send out a series with just those little fixes and cleanups
> > first.
> > 
> 
> I'd take that a step further and suggest the entire series be split into
> independent feature series, as appropriate. Unless I'm missing
> something, I don't think there's any reason these all need to be bundled
> together. Further, my expectation is that they probably end up being
> merged as independent units, so I think it's easier for everybody for
> Darrick to carve that up on the logical boundaries rather than assume
> all reviewers and maintainer are going to do so consistently.
> 
> Note that I'm not saying this has to be reposted.. I think I can pull
> off the rmap bits for the time being. I'm just suggesting that if a
> repost is required from this point forward for any of the logical
> subunits (deps, rmap, reflink, scrub), I'd suggest to post, version and
> changelog those units independently.

I'd thought about continuing my old practice of listing which patches
go with which feature... but then got lazy. :(  Cleanups/rmap/reflink/scrub
actually are in their own contiguous sections of the patchbomb, though that
isn't obvious from looking at it.

You ought to be able to pull only as far as the end of the rmap series and
still have a working XFS.  I only did the intensive testing with the full
patchset, but the quick xfstests group ran fine with just the rmap pieces.

Kernel patches:
===============
Cleanups, 1-11
rmap + dependencies, 12-49
    Overlapped interval btree, 12-15
    Deferred operations, 16-22
    rmap, 23-49
reflink + dependencies, 50-111
    AG reservations, 50-52
    refcount btree, 53-68
    deferred remap, 69-73
    cow, 74-88
    reflink, 89-111
getfsmapx, 112
scrub, 113-119

xfsprogs:
=========
Cleanups, 1-15
rmap + deps, 16-70
    Overlapped interval btree, 16-19
    Deferred operations, 20-27
    rmap, 28-70
reflink + dependencies, 71-135
    AG reservations, 71-72
    refcount btree, 73-85
    deferred remap, 86-90
    reflink, 91-135
getfsmapx, 136-138
scrub, 139-145

--D

> 
> Brian
> 


* Re: [PATCH 003/119] xfs: check offsets of variable length structures
  2016-06-17  1:18 ` [PATCH 003/119] xfs: check offsets of variable length structures Darrick J. Wong
  2016-06-17 11:33   ` Christoph Hellwig
@ 2016-06-17 17:34   ` Brian Foster
  2016-06-18 18:01     ` Darrick J. Wong
  1 sibling, 1 reply; 236+ messages in thread
From: Brian Foster @ 2016-06-17 17:34 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, vishal.l.verma, xfs

On Thu, Jun 16, 2016 at 06:18:12PM -0700, Darrick J. Wong wrote:
> Some of the directory/attr structures contain variable-length objects,
> so the enclosing structure doesn't have a meaningful fixed size at
> compile time.  We can check the offsets of the members before the
> variable-length member, so do those.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

I'm missing why this is necessary. Is the intent still to catch
alignment and/or padding issues? If so, isn't the size check sufficient,
regardless of trailing variable size fields?

Perhaps the goal here is to reduce the scope of checking from where it
isn't needed..? For example, xfs_dir2_data_unused_t looks like it has a
field where the offset in the structure is irrelevant, so that's a
possible false positive if that changes down the road. On the flip side,
that doesn't appear to be the case for other structures such as
xfs_attr_leaf_name_[local|remote]_t.

Brian

>  fs/xfs/xfs_ondisk.h |   25 +++++++++++++++++++++++--
>  1 file changed, 23 insertions(+), 2 deletions(-)
> 
> 
> diff --git a/fs/xfs/xfs_ondisk.h b/fs/xfs/xfs_ondisk.h
> index 184c44e..0272301 100644
> --- a/fs/xfs/xfs_ondisk.h
> +++ b/fs/xfs/xfs_ondisk.h
> @@ -22,6 +22,11 @@
>  	BUILD_BUG_ON_MSG(sizeof(structname) != (size), "XFS: sizeof(" \
>  		#structname ") is wrong, expected " #size)
>  
> +#define XFS_CHECK_OFFSET(structname, member, off) \
> +	BUILD_BUG_ON_MSG(offsetof(structname, member) != (off), \
> +		"XFS: offsetof(" #structname ", " #member ") is wrong, " \
> +		"expected " #off)
> +
>  static inline void __init
>  xfs_check_ondisk_structs(void)
>  {
> @@ -75,15 +80,28 @@ xfs_check_ondisk_structs(void)
>  	XFS_CHECK_STRUCT_SIZE(xfs_attr_leaf_name_remote_t,	12);
>  	 */
>  
> +	XFS_CHECK_OFFSET(xfs_attr_leaf_name_local_t, valuelen,	0);
> +	XFS_CHECK_OFFSET(xfs_attr_leaf_name_local_t, namelen,	2);
> +	XFS_CHECK_OFFSET(xfs_attr_leaf_name_local_t, nameval,	3);
> +	XFS_CHECK_OFFSET(xfs_attr_leaf_name_remote_t, valueblk,	0);
> +	XFS_CHECK_OFFSET(xfs_attr_leaf_name_remote_t, valuelen,	4);
> +	XFS_CHECK_OFFSET(xfs_attr_leaf_name_remote_t, namelen,	8);
> +	XFS_CHECK_OFFSET(xfs_attr_leaf_name_remote_t, name,	9);
>  	XFS_CHECK_STRUCT_SIZE(xfs_attr_leafblock_t,		40);
> -	XFS_CHECK_STRUCT_SIZE(xfs_attr_shortform_t,		8);
> +	XFS_CHECK_OFFSET(xfs_attr_shortform_t, hdr.totsize,	0);
> +	XFS_CHECK_OFFSET(xfs_attr_shortform_t, hdr.count,	2);
> +	XFS_CHECK_OFFSET(xfs_attr_shortform_t, list[0].namelen,	4);
> +	XFS_CHECK_OFFSET(xfs_attr_shortform_t, list[0].valuelen, 5);
> +	XFS_CHECK_OFFSET(xfs_attr_shortform_t, list[0].flags,	6);
> +	XFS_CHECK_OFFSET(xfs_attr_shortform_t, list[0].nameval,	7);
>  	XFS_CHECK_STRUCT_SIZE(xfs_da_blkinfo_t,			12);
>  	XFS_CHECK_STRUCT_SIZE(xfs_da_intnode_t,			16);
>  	XFS_CHECK_STRUCT_SIZE(xfs_da_node_entry_t,		8);
>  	XFS_CHECK_STRUCT_SIZE(xfs_da_node_hdr_t,		16);
>  	XFS_CHECK_STRUCT_SIZE(xfs_dir2_data_free_t,		4);
>  	XFS_CHECK_STRUCT_SIZE(xfs_dir2_data_hdr_t,		16);
> -	XFS_CHECK_STRUCT_SIZE(xfs_dir2_data_unused_t,		6);
> +	XFS_CHECK_OFFSET(xfs_dir2_data_unused_t, freetag,	0);
> +	XFS_CHECK_OFFSET(xfs_dir2_data_unused_t, length,	2);
>  	XFS_CHECK_STRUCT_SIZE(xfs_dir2_free_hdr_t,		16);
>  	XFS_CHECK_STRUCT_SIZE(xfs_dir2_free_t,			16);
>  	XFS_CHECK_STRUCT_SIZE(xfs_dir2_ino4_t,			4);
> @@ -94,6 +112,9 @@ xfs_check_ondisk_structs(void)
>  	XFS_CHECK_STRUCT_SIZE(xfs_dir2_leaf_t,			16);
>  	XFS_CHECK_STRUCT_SIZE(xfs_dir2_leaf_tail_t,		4);
>  	XFS_CHECK_STRUCT_SIZE(xfs_dir2_sf_entry_t,		3);
> +	XFS_CHECK_OFFSET(xfs_dir2_sf_entry_t, namelen,		0);
> +	XFS_CHECK_OFFSET(xfs_dir2_sf_entry_t, offset,		1);
> +	XFS_CHECK_OFFSET(xfs_dir2_sf_entry_t, name,		3);
>  	XFS_CHECK_STRUCT_SIZE(xfs_dir2_sf_hdr_t,		10);
>  	XFS_CHECK_STRUCT_SIZE(xfs_dir2_sf_off_t,		2);
>  
> 


* Re: [PATCH 002/119] vfs: support FS_XFLAG_REFLINK and FS_XFLAG_COWEXTSIZE
  2016-06-17 16:54       ` Darrick J. Wong
@ 2016-06-17 17:38         ` Brian Foster
  0 siblings, 0 replies; 236+ messages in thread
From: Brian Foster @ 2016-06-17 17:38 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-fsdevel, xfs, vishal.l.verma

On Fri, Jun 17, 2016 at 09:54:00AM -0700, Darrick J. Wong wrote:
> On Fri, Jun 17, 2016 at 08:16:05AM -0400, Brian Foster wrote:
> > On Fri, Jun 17, 2016 at 04:41:17AM -0700, Christoph Hellwig wrote:
> > > On Thu, Jun 16, 2016 at 06:18:05PM -0700, Darrick J. Wong wrote:
> > > > Introduce XFLAGs for the new XFS reflink inode flag and the CoW extent
> > > > size hint, and actually plumb the CoW extent size hint into the fsxattr
> > > > structure.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > Should go behind all the updates that are useful without any new
> > > rmap or reflink functionality.  In fact it would be great if you
> > > could send out a series with just those little fixes and cleanups
> > > first.
> > > 
> > 
> > I'd take that a step further and suggest the entire series be split into
> > independent feature series, as appropriate. Unless I'm missing
> > something, I don't think there's any reason these all need to be bundled
> > together. Further, my expectation is that they probably end up being
> > merged as independent units, so I think it's easier for everybody for
> > Darrick to carve that up on the logical boundaries rather than assume
> > all reviewers and maintainer are going to do so consistently.
> > 
> > Note that I'm not saying this has to be reposted.. I think I can pull
> > off the rmap bits for the time being. I'm just suggesting that if a
> > repost is required from this point forward for any of the logical
> > subunits (deps, rmap, reflink, scrub), I'd suggest to post, version and
> > changelog those units independently.
> 
> I'd thought about continuing my old practice of listing which patches
> go with which feature... but then got lazy. :(  Cleanups/rmap/reflink/scrub
> actually are in their own contiguous sections of the patchbomb, though that
> isn't obvious from looking at it.
> 

Yeah, I figured as much. I was able to surmise where the rmap stuff
ends. It's not so clear where the line between cleanups vs.
dependencies is, however, and as hch mentioned, some of that stuff
apparently stands on its own (i.e., can be merged without being blocked
on rmap review/test/dev cycles).

> You ought to be able to pull only as far as the end of the rmap series and
> still have a working XFS.  I only did the intensive testing with the full
> patchset, but the quick xfstests group ran fine with just the rmap pieces.
> 

Ok. I'll probably end up testing more with just the rmap bits.

> Kernel patches:
> ===============
> Cleanups, 1-11
> rmap + dependencies, 12-49
>     Overlapped interval btree, 12-15
>     Deferred operations, 16-22
>     rmap, 23-49

Thanks. I still stand by my previous comment wrt to splitting any
subsequent postings, if necessary, into separate series though. ;)

Brian

> reflink + dependencies, 50-111
>     AG reservations, 50-52
>     refcount btree, 53-68
>     deferred remap, 69-73
>     cow, 74-88
>     reflink, 89-111
> getfsmapx, 112
> scrub, 113-119
> 
> xfsprogs:
> =========
> Cleanups, 1-15
> rmap + deps, 16-70
>     Overlapped interval btree, 16-19
>     Deferred operations, 20-27
>     rmap, 28-70
> reflink + dependencies, 71-135
>     AG reservations, 71-72
>     refcount btree, 73-85
>     deferred remap, 86-90
>     reflink, 91-135
> getfsmapx, 136-138
> scrub, 139-145
> 
> --D
> 
> > 
> > Brian
> > 


* Re: [PATCH 003/119] xfs: check offsets of variable length structures
  2016-06-17 17:34   ` Brian Foster
@ 2016-06-18 18:01     ` Darrick J. Wong
  2016-06-20 12:38       ` Brian Foster
  0 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-18 18:01 UTC (permalink / raw)
  To: Brian Foster; +Cc: david, linux-fsdevel, vishal.l.verma, xfs

On Fri, Jun 17, 2016 at 01:34:27PM -0400, Brian Foster wrote:
> On Thu, Jun 16, 2016 at 06:18:12PM -0700, Darrick J. Wong wrote:
> > Some of the directory/attr structures contain variable-length objects,
> > so the enclosing structure doesn't have a meaningful fixed size at
> > compile time.  We can check the offsets of the members before the
> > variable-length member, so do those.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> 
> I'm missing why this is necessary. Is the intent still to catch
> alignment and/or padding issues? If so, isn't the size check sufficient,
> regardless of trailing variable size fields?
> 
> Perhaps the goal here is to reduce the scope of checking from where it
> isn't needed..? For example, xfs_dir2_data_unused_t looks like it has a
> field where the offset in the structure is irrelevant, so that's a
> possible false positive if that changes down the road. On the flip side,
> that doesn't appear to be the case for other structures such as
> xfs_attr_leaf_name_[local|remote]_t.

ISTR making this change to work around behavioral variances in how
much padding gcc adds to structures across its various targets.  The
macros that go along with the variable sized structures work fine,
but testing the sizeof() doesn't work reliably.

--D

> 
> Brian
> 
> >  fs/xfs/xfs_ondisk.h |   25 +++++++++++++++++++++++--
> >  1 file changed, 23 insertions(+), 2 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/xfs_ondisk.h b/fs/xfs/xfs_ondisk.h
> > index 184c44e..0272301 100644
> > --- a/fs/xfs/xfs_ondisk.h
> > +++ b/fs/xfs/xfs_ondisk.h
> > @@ -22,6 +22,11 @@
> >  	BUILD_BUG_ON_MSG(sizeof(structname) != (size), "XFS: sizeof(" \
> >  		#structname ") is wrong, expected " #size)
> >  
> > +#define XFS_CHECK_OFFSET(structname, member, off) \
> > +	BUILD_BUG_ON_MSG(offsetof(structname, member) != (off), \
> > +		"XFS: offsetof(" #structname ", " #member ") is wrong, " \
> > +		"expected " #off)
> > +
> >  static inline void __init
> >  xfs_check_ondisk_structs(void)
> >  {
> > @@ -75,15 +80,28 @@ xfs_check_ondisk_structs(void)
> >  	XFS_CHECK_STRUCT_SIZE(xfs_attr_leaf_name_remote_t,	12);
> >  	 */
> >  
> > +	XFS_CHECK_OFFSET(xfs_attr_leaf_name_local_t, valuelen,	0);
> > +	XFS_CHECK_OFFSET(xfs_attr_leaf_name_local_t, namelen,	2);
> > +	XFS_CHECK_OFFSET(xfs_attr_leaf_name_local_t, nameval,	3);
> > +	XFS_CHECK_OFFSET(xfs_attr_leaf_name_remote_t, valueblk,	0);
> > +	XFS_CHECK_OFFSET(xfs_attr_leaf_name_remote_t, valuelen,	4);
> > +	XFS_CHECK_OFFSET(xfs_attr_leaf_name_remote_t, namelen,	8);
> > +	XFS_CHECK_OFFSET(xfs_attr_leaf_name_remote_t, name,	9);
> >  	XFS_CHECK_STRUCT_SIZE(xfs_attr_leafblock_t,		40);
> > -	XFS_CHECK_STRUCT_SIZE(xfs_attr_shortform_t,		8);
> > +	XFS_CHECK_OFFSET(xfs_attr_shortform_t, hdr.totsize,	0);
> > +	XFS_CHECK_OFFSET(xfs_attr_shortform_t, hdr.count,	2);
> > +	XFS_CHECK_OFFSET(xfs_attr_shortform_t, list[0].namelen,	4);
> > +	XFS_CHECK_OFFSET(xfs_attr_shortform_t, list[0].valuelen, 5);
> > +	XFS_CHECK_OFFSET(xfs_attr_shortform_t, list[0].flags,	6);
> > +	XFS_CHECK_OFFSET(xfs_attr_shortform_t, list[0].nameval,	7);
> >  	XFS_CHECK_STRUCT_SIZE(xfs_da_blkinfo_t,			12);
> >  	XFS_CHECK_STRUCT_SIZE(xfs_da_intnode_t,			16);
> >  	XFS_CHECK_STRUCT_SIZE(xfs_da_node_entry_t,		8);
> >  	XFS_CHECK_STRUCT_SIZE(xfs_da_node_hdr_t,		16);
> >  	XFS_CHECK_STRUCT_SIZE(xfs_dir2_data_free_t,		4);
> >  	XFS_CHECK_STRUCT_SIZE(xfs_dir2_data_hdr_t,		16);
> > -	XFS_CHECK_STRUCT_SIZE(xfs_dir2_data_unused_t,		6);
> > +	XFS_CHECK_OFFSET(xfs_dir2_data_unused_t, freetag,	0);
> > +	XFS_CHECK_OFFSET(xfs_dir2_data_unused_t, length,	2);
> >  	XFS_CHECK_STRUCT_SIZE(xfs_dir2_free_hdr_t,		16);
> >  	XFS_CHECK_STRUCT_SIZE(xfs_dir2_free_t,			16);
> >  	XFS_CHECK_STRUCT_SIZE(xfs_dir2_ino4_t,			4);
> > @@ -94,6 +112,9 @@ xfs_check_ondisk_structs(void)
> >  	XFS_CHECK_STRUCT_SIZE(xfs_dir2_leaf_t,			16);
> >  	XFS_CHECK_STRUCT_SIZE(xfs_dir2_leaf_tail_t,		4);
> >  	XFS_CHECK_STRUCT_SIZE(xfs_dir2_sf_entry_t,		3);
> > +	XFS_CHECK_OFFSET(xfs_dir2_sf_entry_t, namelen,		0);
> > +	XFS_CHECK_OFFSET(xfs_dir2_sf_entry_t, offset,		1);
> > +	XFS_CHECK_OFFSET(xfs_dir2_sf_entry_t, name,		3);
> >  	XFS_CHECK_STRUCT_SIZE(xfs_dir2_sf_hdr_t,		10);
> >  	XFS_CHECK_STRUCT_SIZE(xfs_dir2_sf_off_t,		2);
> >  
> > 


* Re: [PATCH 009/119] xfs: convert list of extents to free into a regular list
  2016-06-17 11:59   ` Christoph Hellwig
@ 2016-06-18 20:15     ` Darrick J. Wong
  2016-06-21  0:57       ` Dave Chinner
  0 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-18 20:15 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, vishal.l.verma, xfs

On Fri, Jun 17, 2016 at 04:59:30AM -0700, Christoph Hellwig wrote:
> >  {
> > +	struct xfs_bmap_free_item	*new;		/* new element */
> >  #ifdef DEBUG
> >  	xfs_agnumber_t		agno;
> >  	xfs_agblock_t		agbno;
> > @@ -597,17 +595,7 @@ xfs_bmap_add_free(
> >  	new = kmem_zone_alloc(xfs_bmap_free_item_zone, KM_SLEEP);
> >  	new->xbfi_startblock = bno;
> >  	new->xbfi_blockcount = (xfs_extlen_t)len;
> > +	list_add(&new->xbfi_list, &flist->xbf_flist);
> >  	flist->xbf_count++;
> 
> Please kill xbf_count while you're at it, it's entirely superflous.

The deferred ops conversion patch kills this off by moving the whole
"defer an op to the next transaction by logging redo items" logic
into a separate file and mechanism.

This patch is just a cleanup to reduce some of the open coded list ugliness
before starting on the rmap stuff.  Once the deferred ops code lands, all
three of these functions go away.

> > @@ -617,14 +605,10 @@ xfs_bmap_add_free(
> >   */
> >  void
> >  xfs_bmap_del_free(
> > -	xfs_bmap_free_t		*flist,	/* free item list header */
> > -	xfs_bmap_free_item_t	*prev,	/* previous item on list, if any */
> > -	xfs_bmap_free_item_t	*free)	/* list item to be freed */
> > +	struct xfs_bmap_free		*flist,	/* free item list header */
> > +	struct xfs_bmap_free_item	*free)	/* list item to be freed */
> 
> Which then also gets rid of the flist argument here.
> 
> > @@ -634,17 +618,16 @@ xfs_bmap_del_free(
> >   */
> >  void
> >  xfs_bmap_cancel(
> > +	struct xfs_bmap_free		*flist)	/* list of bmap_free_items */
> >  {
> > +	struct xfs_bmap_free_item	*free;	/* free list item */
> >  
> >  	if (flist->xbf_count == 0)
> >  		return;
> > +	while (!list_empty(&flist->xbf_flist)) {
> > +		free = list_first_entry(&flist->xbf_flist,
> > +				struct xfs_bmap_free_item, xbfi_list);
> 
> 	while ((free = list_first_entry_or_null(...))
> 
> > +	list_sort((*tp)->t_mountp, &flist->xbf_flist, xfs_bmap_free_list_cmp);
> 
> Can you add a comment on why we are sorting the list?

We sort the list so that we process the freed extents in AG order to
avoid deadlocking.

I'll add a comment to the deferred ops code if there isn't one already.

--D

> 


* Re: [PATCH 006/119] xfs: port differences from xfsprogs libxfs
  2016-06-17  1:18 ` [PATCH 006/119] xfs: port differences from xfsprogs libxfs Darrick J. Wong
  2016-06-17 15:06   ` Christoph Hellwig
@ 2016-06-20  0:21   ` Dave Chinner
  2016-07-13 23:39     ` Darrick J. Wong
  1 sibling, 1 reply; 236+ messages in thread
From: Dave Chinner @ 2016-06-20  0:21 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

On Thu, Jun 16, 2016 at 06:18:30PM -0700, Darrick J. Wong wrote:
> Port various differences between xfsprogs and the kernel.  This
> cleans up both so that we can develop rmap and reflink on the
> same libxfs code.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

Nak. I'm essentially trying to keep the little hacks needed in 
userspace out of the kernel libxfs tree. We quite regularly get
people scanning the kernel tree and trying to remove things like
exported function prototypes that are not used in kernel space,
so the headers in userspace carry those simply to prevent people
continually sending kernel patches that we have to look at and then
ignore...

> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index 99b077c..58bdca7 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -2415,7 +2415,9 @@ xfs_alloc_read_agf(
>  			be32_to_cpu(agf->agf_levels[XFS_BTNUM_CNTi]);
>  		spin_lock_init(&pag->pagb_lock);
>  		pag->pagb_count = 0;
> +#ifdef __KERNEL__
>  		pag->pagb_tree = RB_ROOT;
> +#endif
>  		pag->pagf_init = 1;
>  	}
>  #ifdef DEBUG

e.g. this is an indication that reminds us that there is
functionality in the libxfs kernel tree that isn't in userspace...

> diff --git a/fs/xfs/libxfs/xfs_attr_leaf.h b/fs/xfs/libxfs/xfs_attr_leaf.h
> index 4f2aed0..8ef420a 100644
> --- a/fs/xfs/libxfs/xfs_attr_leaf.h
> +++ b/fs/xfs/libxfs/xfs_attr_leaf.h
> @@ -51,7 +51,7 @@ int	xfs_attr_shortform_getvalue(struct xfs_da_args *args);
>  int	xfs_attr_shortform_to_leaf(struct xfs_da_args *args);
>  int	xfs_attr_shortform_remove(struct xfs_da_args *args);
>  int	xfs_attr_shortform_allfit(struct xfs_buf *bp, struct xfs_inode *dp);
> -int	xfs_attr_shortform_bytesfit(xfs_inode_t *dp, int bytes);
> +int	xfs_attr_shortform_bytesfit(struct xfs_inode *dp, int bytes);
>  void	xfs_attr_fork_remove(struct xfs_inode *ip, struct xfs_trans *tp);

Things like this are fine...

>  
>  /*
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 932381c..499e980 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -1425,7 +1425,7 @@ xfs_bmap_search_multi_extents(
>   * Else, *lastxp will be set to the index of the found
>   * entry; *gotp will contain the entry.
>   */
> -STATIC xfs_bmbt_rec_host_t *                 /* pointer to found extent entry */
> +xfs_bmbt_rec_host_t *                 /* pointer to found extent entry */
>  xfs_bmap_search_extents(
>  	xfs_inode_t     *ip,            /* incore inode pointer */
>  	xfs_fileoff_t   bno,            /* block number searched for */
> diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
> index 423a34e..79e3ebe 100644
> --- a/fs/xfs/libxfs/xfs_bmap.h
> +++ b/fs/xfs/libxfs/xfs_bmap.h
> @@ -231,4 +231,10 @@ int	xfs_bmap_shift_extents(struct xfs_trans *tp, struct xfs_inode *ip,
>  		int num_exts);
>  int	xfs_bmap_split_extent(struct xfs_inode *ip, xfs_fileoff_t split_offset);
>  
> +struct xfs_bmbt_rec_host *
> +	xfs_bmap_search_extents(struct xfs_inode *ip, xfs_fileoff_t bno,
> +				int fork, int *eofp, xfs_extnum_t *lastxp,
> +				struct xfs_bmbt_irec *gotp,
> +				struct xfs_bmbt_irec *prevp);
> +
>  #endif	/* __XFS_BMAP_H__ */

But these are the sort of "clean up the kernel patches" that I was
refering to. If there's a user in kernel space, then fine, otherwise
it doesn't hurt to keep it only in userspace. There are relatively
few of these....

> diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> index 1f88e1c..105979d 100644
> --- a/fs/xfs/libxfs/xfs_btree.c
> +++ b/fs/xfs/libxfs/xfs_btree.c
> @@ -2532,6 +2532,7 @@ error0:
>  	return error;
>  }
>  
> +#ifdef __KERNEL__
>  struct xfs_btree_split_args {
>  	struct xfs_btree_cur	*cur;
>  	int			level;
> @@ -2609,6 +2610,9 @@ xfs_btree_split(
>  	destroy_work_on_stack(&args.work);
>  	return args.result;
>  }
> +#else /* !KERNEL */
> +#define xfs_btree_split	__xfs_btree_split
> +#endif

Same again - this is 4 lines of code that are userspace only. It's a
tiny amount compared to the original difference that these
kernel-only stack splits required, and so not a huge issue.

> --- a/fs/xfs/libxfs/xfs_dquot_buf.c
> +++ b/fs/xfs/libxfs/xfs_dquot_buf.c
> @@ -31,10 +31,16 @@
>  #include "xfs_cksum.h"
>  #include "xfs_trace.h"
>  
> +/*
> + * XXX: kernel implementation causes ndquots calc to go real
> + * bad. Just leaving the existing userspace calc here right now.
> + */
>  int
>  xfs_calc_dquots_per_chunk(
>  	unsigned int		nbblks)	/* basic block units */
>  {
> +#ifdef __KERNEL__
> +	/* kernel code that goes wrong in userspace! */
>  	unsigned int	ndquots;
>  
>  	ASSERT(nbblks > 0);
> @@ -42,6 +48,10 @@ xfs_calc_dquots_per_chunk(
>  	do_div(ndquots, sizeof(xfs_dqblk_t));
>  
>  	return ndquots;
> +#else
> +	ASSERT(nbblks > 0);
> +	return BBTOB(nbblks) / sizeof(xfs_dqblk_t);
> +#endif
>  }

This is a clear case that we need to fix the code to be
correct for both kernel and userspace without modification, not
propagate the userspace hack back into the kernel code.

> diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> index 9d9559e..794fa66 100644
> --- a/fs/xfs/libxfs/xfs_inode_buf.c
> +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> @@ -56,6 +56,17 @@ xfs_inobp_check(
>  }
>  #endif
>  
> +bool
> +xfs_dinode_good_version(
> +	struct xfs_mount *mp,
> +	__u8		version)
> +{
> +	if (xfs_sb_version_hascrc(&mp->m_sb))
> +		return version == 3;
> +
> +	return version == 1 || version == 2;
> +}

This xfs_dinode_good_version() change needs to be a separate patch

>  void	xfs_inobp_check(struct xfs_mount *, struct xfs_buf *);
>  #else
> diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
> index e8f49c0..e5baba3 100644
> --- a/fs/xfs/libxfs/xfs_log_format.h
> +++ b/fs/xfs/libxfs/xfs_log_format.h
> @@ -462,8 +462,8 @@ static inline uint xfs_log_dinode_size(int version)
>  typedef struct xfs_buf_log_format {
>  	unsigned short	blf_type;	/* buf log item type indicator */
>  	unsigned short	blf_size;	/* size of this item */
> -	ushort		blf_flags;	/* misc state */
> -	ushort		blf_len;	/* number of blocks in this buf */
> +	unsigned short	blf_flags;	/* misc state */
> +	unsigned short	blf_len;	/* number of blocks in this buf */
>  	__int64_t	blf_blkno;	/* starting blkno of this buf */
>  	unsigned int	blf_map_size;	/* used size of data bitmap in words */
>  	unsigned int	blf_data_map[XFS_BLF_DATAMAP_SIZE]; /* dirty bitmap */

The removal of ushort/uint from the kernel code needs to be a
separate patch that addresses all the users, not just the couple in
shared headers....

> diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
> index 12ca867..09d6fd0 100644
> --- a/fs/xfs/libxfs/xfs_sb.c
> +++ b/fs/xfs/libxfs/xfs_sb.c
> @@ -261,6 +261,7 @@ xfs_mount_validate_sb(
>  	/*
>  	 * Until this is fixed only page-sized or smaller data blocks work.
>  	 */
> +#ifdef __KERNEL__
>  	if (unlikely(sbp->sb_blocksize > PAGE_SIZE)) {
>  		xfs_warn(mp,
>  		"File system with blocksize %d bytes. "
> @@ -268,6 +269,7 @@ xfs_mount_validate_sb(
>  				sbp->sb_blocksize, PAGE_SIZE);
>  		return -ENOSYS;
>  	}
> +#endif
>  
>  	/*
>  	 * Currently only very few inode sizes are supported.
> @@ -291,10 +293,12 @@ xfs_mount_validate_sb(
>  		return -EFBIG;
>  	}
>  
> +#ifdef __KERNEL__
>  	if (check_inprogress && sbp->sb_inprogress) {
>  		xfs_warn(mp, "Offline file system operation in progress!");
>  		return -EFSCORRUPTED;
>  	}
> +#endif
>  	return 0;
>  }

Again, I don't think this needs to be propagated back into the
kernel code...

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 003/119] xfs: check offsets of variable length structures
  2016-06-18 18:01     ` Darrick J. Wong
@ 2016-06-20 12:38       ` Brian Foster
  0 siblings, 0 replies; 236+ messages in thread
From: Brian Foster @ 2016-06-20 12:38 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

On Sat, Jun 18, 2016 at 11:01:33AM -0700, Darrick J. Wong wrote:
> On Fri, Jun 17, 2016 at 01:34:27PM -0400, Brian Foster wrote:
> > On Thu, Jun 16, 2016 at 06:18:12PM -0700, Darrick J. Wong wrote:
> > > Some of the directory/attr structures contain variable-length objects,
> > > so the enclosing structure doesn't have a meaningful fixed size at
> > > compile time.  We can check the offsets of the members before the
> > > variable-length member, so do those.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > 
> > I'm missing why this is necessary. Is the intent still to catch
> > alignment and/or padding issues? If so, isn't the size check sufficient,
> > regardless of trailing variable size fields?
> > 
> > Perhaps the goal here is to reduce the scope of checking from where it
> > isn't needed..? For example, xfs_dir2_data_unused_t looks like it has a
> > field where the offset in the structure is irrelevant, so that's a
> > possible false positive if that changes down the road. On the flip side,
> > that doesn't appear to be the case for other structures such as
> > xfs_attr_leaf_name_[local|remote]_t.
> 
> ISTR making this change to work around behavioral variances in how
> much padding gcc adds to structures across its various targets.  The
> macros that go along with the variable sized structures work fine,
> but testing the sizeof() doesn't work reliably.
> 

Ok, I take that to mean that we may or may not have padding in some of
the variable structures depending on architecture (and we really only
care about certain fields in those structures). Fair enough, thanks!

Brian

> --D
> 
> > 
> > Brian
> > 
> > >  fs/xfs/xfs_ondisk.h |   25 +++++++++++++++++++++++--
> > >  1 file changed, 23 insertions(+), 2 deletions(-)
> > > 
> > > 
> > > diff --git a/fs/xfs/xfs_ondisk.h b/fs/xfs/xfs_ondisk.h
> > > index 184c44e..0272301 100644
> > > --- a/fs/xfs/xfs_ondisk.h
> > > +++ b/fs/xfs/xfs_ondisk.h
> > > @@ -22,6 +22,11 @@
> > >  	BUILD_BUG_ON_MSG(sizeof(structname) != (size), "XFS: sizeof(" \
> > >  		#structname ") is wrong, expected " #size)
> > >  
> > > +#define XFS_CHECK_OFFSET(structname, member, off) \
> > > +	BUILD_BUG_ON_MSG(offsetof(structname, member) != (off), \
> > > +		"XFS: offsetof(" #structname ", " #member ") is wrong, " \
> > > +		"expected " #off)
> > > +
> > >  static inline void __init
> > >  xfs_check_ondisk_structs(void)
> > >  {
> > > @@ -75,15 +80,28 @@ xfs_check_ondisk_structs(void)
> > >  	XFS_CHECK_STRUCT_SIZE(xfs_attr_leaf_name_remote_t,	12);
> > >  	 */
> > >  
> > > +	XFS_CHECK_OFFSET(xfs_attr_leaf_name_local_t, valuelen,	0);
> > > +	XFS_CHECK_OFFSET(xfs_attr_leaf_name_local_t, namelen,	2);
> > > +	XFS_CHECK_OFFSET(xfs_attr_leaf_name_local_t, nameval,	3);
> > > +	XFS_CHECK_OFFSET(xfs_attr_leaf_name_remote_t, valueblk,	0);
> > > +	XFS_CHECK_OFFSET(xfs_attr_leaf_name_remote_t, valuelen,	4);
> > > +	XFS_CHECK_OFFSET(xfs_attr_leaf_name_remote_t, namelen,	8);
> > > +	XFS_CHECK_OFFSET(xfs_attr_leaf_name_remote_t, name,	9);
> > >  	XFS_CHECK_STRUCT_SIZE(xfs_attr_leafblock_t,		40);
> > > -	XFS_CHECK_STRUCT_SIZE(xfs_attr_shortform_t,		8);
> > > +	XFS_CHECK_OFFSET(xfs_attr_shortform_t, hdr.totsize,	0);
> > > +	XFS_CHECK_OFFSET(xfs_attr_shortform_t, hdr.count,	2);
> > > +	XFS_CHECK_OFFSET(xfs_attr_shortform_t, list[0].namelen,	4);
> > > +	XFS_CHECK_OFFSET(xfs_attr_shortform_t, list[0].valuelen, 5);
> > > +	XFS_CHECK_OFFSET(xfs_attr_shortform_t, list[0].flags,	6);
> > > +	XFS_CHECK_OFFSET(xfs_attr_shortform_t, list[0].nameval,	7);
> > >  	XFS_CHECK_STRUCT_SIZE(xfs_da_blkinfo_t,			12);
> > >  	XFS_CHECK_STRUCT_SIZE(xfs_da_intnode_t,			16);
> > >  	XFS_CHECK_STRUCT_SIZE(xfs_da_node_entry_t,		8);
> > >  	XFS_CHECK_STRUCT_SIZE(xfs_da_node_hdr_t,		16);
> > >  	XFS_CHECK_STRUCT_SIZE(xfs_dir2_data_free_t,		4);
> > >  	XFS_CHECK_STRUCT_SIZE(xfs_dir2_data_hdr_t,		16);
> > > -	XFS_CHECK_STRUCT_SIZE(xfs_dir2_data_unused_t,		6);
> > > +	XFS_CHECK_OFFSET(xfs_dir2_data_unused_t, freetag,	0);
> > > +	XFS_CHECK_OFFSET(xfs_dir2_data_unused_t, length,	2);
> > >  	XFS_CHECK_STRUCT_SIZE(xfs_dir2_free_hdr_t,		16);
> > >  	XFS_CHECK_STRUCT_SIZE(xfs_dir2_free_t,			16);
> > >  	XFS_CHECK_STRUCT_SIZE(xfs_dir2_ino4_t,			4);
> > > @@ -94,6 +112,9 @@ xfs_check_ondisk_structs(void)
> > >  	XFS_CHECK_STRUCT_SIZE(xfs_dir2_leaf_t,			16);
> > >  	XFS_CHECK_STRUCT_SIZE(xfs_dir2_leaf_tail_t,		4);
> > >  	XFS_CHECK_STRUCT_SIZE(xfs_dir2_sf_entry_t,		3);
> > > +	XFS_CHECK_OFFSET(xfs_dir2_sf_entry_t, namelen,		0);
> > > +	XFS_CHECK_OFFSET(xfs_dir2_sf_entry_t, offset,		1);
> > > +	XFS_CHECK_OFFSET(xfs_dir2_sf_entry_t, name,		3);
> > >  	XFS_CHECK_STRUCT_SIZE(xfs_dir2_sf_hdr_t,		10);
> > >  	XFS_CHECK_STRUCT_SIZE(xfs_dir2_sf_off_t,		2);
> > >  
> > > 
> 

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 010/119] xfs: create a standard btree size calculator code
  2016-06-17  1:18 ` [PATCH 010/119] xfs: create a standard btree size calculator code Darrick J. Wong
@ 2016-06-20 14:31   ` Brian Foster
  2016-06-20 19:34     ` Darrick J. Wong
  0 siblings, 1 reply; 236+ messages in thread
From: Brian Foster @ 2016-06-20 14:31 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, vishal.l.verma, xfs

On Thu, Jun 16, 2016 at 06:18:56PM -0700, Darrick J. Wong wrote:
> Create a helper to generate AG btree height calculator functions.
> This will be used (much) later when we get to the refcount btree.
> 
> v2: Use a helper function instead of a macro.
> v3: We can (theoretically) store more than 2^32 records in a btree, so
>     widen the fields to accept that.
> v4: Don't modify xfs_bmap_worst_indlen; the purpose of /that/ function
>     is to estimate the worst-case number of blocks needed for a bmbt
>     expansion, not to calculate the space required to store nr records.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

I think this one should probably be pushed out to where it is used
(easier to review with an example imo). I don't see it used anywhere up
through the rmapbt stuff, anyways...

>  fs/xfs/libxfs/xfs_btree.c |   27 +++++++++++++++++++++++++++
>  fs/xfs/libxfs/xfs_btree.h |    3 +++
>  2 files changed, 30 insertions(+)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> index 105979d..5eb4e40 100644
> --- a/fs/xfs/libxfs/xfs_btree.c
> +++ b/fs/xfs/libxfs/xfs_btree.c
> @@ -4156,3 +4156,30 @@ xfs_btree_sblock_verify(
>  
>  	return true;
>  }
> +
> +/*
> + * Calculate the number of blocks needed to store a given number of records
> + * in a short-format (per-AG metadata) btree.
> + */
> +xfs_extlen_t
> +xfs_btree_calc_size(
> +	struct xfs_mount	*mp,
> +	uint			*limits,
> +	unsigned long long	len)
> +{
> +	int			level;
> +	int			maxrecs;
> +	xfs_extlen_t		rval;
> +
> +	maxrecs = limits[0];
> +	for (level = 0, rval = 0; len > 0; level++) {

len is unsigned, so len > 0 is kind of pointless. Perhaps check len > 1
and kill the check in the loop?

Brian

> +		len += maxrecs - 1;
> +		do_div(len, maxrecs);
> +		rval += len;
> +		if (len == 1)
> +			return rval;
> +		if (level == 0)
> +			maxrecs = limits[1];
> +	}
> +	return rval;
> +}
> diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> index 9a88839..b330f19 100644
> --- a/fs/xfs/libxfs/xfs_btree.h
> +++ b/fs/xfs/libxfs/xfs_btree.h
> @@ -475,4 +475,7 @@ static inline int xfs_btree_get_level(struct xfs_btree_block *block)
>  bool xfs_btree_sblock_v5hdr_verify(struct xfs_buf *bp);
>  bool xfs_btree_sblock_verify(struct xfs_buf *bp, unsigned int max_recs);
>  
> +xfs_extlen_t xfs_btree_calc_size(struct xfs_mount *mp, uint *limits,
> +		unsigned long long len);
> +
>  #endif	/* __XFS_BTREE_H__ */
> 

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 011/119] xfs: refactor btree maxlevels computation
  2016-06-17  1:19 ` [PATCH 011/119] xfs: refactor btree maxlevels computation Darrick J. Wong
@ 2016-06-20 14:31   ` Brian Foster
  2016-06-20 18:23     ` Darrick J. Wong
  0 siblings, 1 reply; 236+ messages in thread
From: Brian Foster @ 2016-06-20 14:31 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, vishal.l.verma, xfs

On Thu, Jun 16, 2016 at 06:19:02PM -0700, Darrick J. Wong wrote:
> Create a common function to calculate the maximum height of a per-AG
> btree.  This will eventually be used by the rmapbt and refcountbt code
> to calculate appropriate maxlevels values for each.  This is important
> because the verifiers and the transaction block reservations depend on
> accurate estimates of many blocks are needed to satisfy a btree split.

			how many

> 
> We were mistakenly using the max bnobt height for all the btrees,
> which creates a dangerous situation since the larger records and keys
> in an rmapbt make it very possible that the rmapbt will be taller than
> the bnobt and so we can run out of transaction block reservation.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/libxfs/xfs_alloc.c  |   15 ++-------------
>  fs/xfs/libxfs/xfs_btree.c  |   19 +++++++++++++++++++
>  fs/xfs/libxfs/xfs_btree.h  |    2 ++
>  fs/xfs/libxfs/xfs_ialloc.c |   19 +++++--------------
>  4 files changed, 28 insertions(+), 27 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index 1c76a0e..c366889 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -1839,19 +1839,8 @@ void
>  xfs_alloc_compute_maxlevels(
>  	xfs_mount_t	*mp)	/* file system mount structure */
>  {
> -	int		level;
> -	uint		maxblocks;
> -	uint		maxleafents;
> -	int		minleafrecs;
> -	int		minnoderecs;
> -
> -	maxleafents = (mp->m_sb.sb_agblocks + 1) / 2;
> -	minleafrecs = mp->m_alloc_mnr[0];
> -	minnoderecs = mp->m_alloc_mnr[1];
> -	maxblocks = (maxleafents + minleafrecs - 1) / minleafrecs;
> -	for (level = 1; maxblocks > 1; level++)
> -		maxblocks = (maxblocks + minnoderecs - 1) / minnoderecs;
> -	mp->m_ag_maxlevels = level;
> +	mp->m_ag_maxlevels = xfs_btree_compute_maxlevels(mp, mp->m_alloc_mnr,
> +			(mp->m_sb.sb_agblocks + 1) / 2);
>  }
>  
>  /*
> diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> index 5eb4e40..046fbcf 100644
> --- a/fs/xfs/libxfs/xfs_btree.c
> +++ b/fs/xfs/libxfs/xfs_btree.c
> @@ -4158,6 +4158,25 @@ xfs_btree_sblock_verify(
>  }
>  
>  /*
> + * Calculate the number of btree levels needed to store a given number of
> + * records in a short-format btree.
> + */
> +uint
> +xfs_btree_compute_maxlevels(
> +	struct xfs_mount	*mp,
> +	uint			*limits,
> +	unsigned long		len)
> +{
> +	uint			level;
> +	unsigned long		maxblocks;
> +
> +	maxblocks = (len + limits[0] - 1) / limits[0];
> +	for (level = 1; maxblocks > 1; level++)
> +		maxblocks = (maxblocks + limits[1] - 1) / limits[1];
> +	return level;
> +}
> +
> +/*
>   * Calculate the number of blocks needed to store a given number of records
>   * in a short-format (per-AG metadata) btree.
>   */
> diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> index b330f19..b955e5d 100644
> --- a/fs/xfs/libxfs/xfs_btree.h
> +++ b/fs/xfs/libxfs/xfs_btree.h
> @@ -477,5 +477,7 @@ bool xfs_btree_sblock_verify(struct xfs_buf *bp, unsigned int max_recs);
>  
>  xfs_extlen_t xfs_btree_calc_size(struct xfs_mount *mp, uint *limits,
>  		unsigned long long len);
> +uint xfs_btree_compute_maxlevels(struct xfs_mount *mp, uint *limits,
> +		unsigned long len);
>  
>  #endif	/* __XFS_BTREE_H__ */
> diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
> index 9d0003c..cda7269 100644
> --- a/fs/xfs/libxfs/xfs_ialloc.c
> +++ b/fs/xfs/libxfs/xfs_ialloc.c
> @@ -2394,20 +2394,11 @@ void
>  xfs_ialloc_compute_maxlevels(
>  	xfs_mount_t	*mp)		/* file system mount structure */
>  {
> -	int		level;
> -	uint		maxblocks;
> -	uint		maxleafents;
> -	int		minleafrecs;
> -	int		minnoderecs;
> -
> -	maxleafents = (1LL << XFS_INO_AGINO_BITS(mp)) >>
> -		XFS_INODES_PER_CHUNK_LOG;
> -	minleafrecs = mp->m_inobt_mnr[0];
> -	minnoderecs = mp->m_inobt_mnr[1];
> -	maxblocks = (maxleafents + minleafrecs - 1) / minleafrecs;
> -	for (level = 1; maxblocks > 1; level++)
> -		maxblocks = (maxblocks + minnoderecs - 1) / minnoderecs;
> -	mp->m_in_maxlevels = level;
> +	uint		inodes;
> +
> +	inodes = (1LL << XFS_INO_AGINO_BITS(mp)) >> XFS_INODES_PER_CHUNK_LOG;
> +	mp->m_in_maxlevels = xfs_btree_compute_maxlevels(mp, mp->m_inobt_mnr,
> +							 inodes);
>  }
>  
>  /*
> 

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 011/119] xfs: refactor btree maxlevels computation
  2016-06-20 14:31   ` Brian Foster
@ 2016-06-20 18:23     ` Darrick J. Wong
  0 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-20 18:23 UTC (permalink / raw)
  To: Brian Foster; +Cc: david, linux-fsdevel, vishal.l.verma, xfs

On Mon, Jun 20, 2016 at 10:31:59AM -0400, Brian Foster wrote:
> On Thu, Jun 16, 2016 at 06:19:02PM -0700, Darrick J. Wong wrote:
> > Create a common function to calculate the maximum height of a per-AG
> > btree.  This will eventually be used by the rmapbt and refcountbt code
> > to calculate appropriate maxlevels values for each.  This is important
> > because the verifiers and the transaction block reservations depend on
> > accurate estimates of many blocks are needed to satisfy a btree split.
> 
> 			how many

Got it, will change for the next posting.

--D

> 
> > 
> > We were mistakenly using the max bnobt height for all the btrees,
> > which creates a dangerous situation since the larger records and keys
> > in an rmapbt make it very possible that the rmapbt will be taller than
> > the bnobt and so we can run out of transaction block reservation.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> 
> Reviewed-by: Brian Foster <bfoster@redhat.com>
> 
> >  fs/xfs/libxfs/xfs_alloc.c  |   15 ++-------------
> >  fs/xfs/libxfs/xfs_btree.c  |   19 +++++++++++++++++++
> >  fs/xfs/libxfs/xfs_btree.h  |    2 ++
> >  fs/xfs/libxfs/xfs_ialloc.c |   19 +++++--------------
> >  4 files changed, 28 insertions(+), 27 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> > index 1c76a0e..c366889 100644
> > --- a/fs/xfs/libxfs/xfs_alloc.c
> > +++ b/fs/xfs/libxfs/xfs_alloc.c
> > @@ -1839,19 +1839,8 @@ void
> >  xfs_alloc_compute_maxlevels(
> >  	xfs_mount_t	*mp)	/* file system mount structure */
> >  {
> > -	int		level;
> > -	uint		maxblocks;
> > -	uint		maxleafents;
> > -	int		minleafrecs;
> > -	int		minnoderecs;
> > -
> > -	maxleafents = (mp->m_sb.sb_agblocks + 1) / 2;
> > -	minleafrecs = mp->m_alloc_mnr[0];
> > -	minnoderecs = mp->m_alloc_mnr[1];
> > -	maxblocks = (maxleafents + minleafrecs - 1) / minleafrecs;
> > -	for (level = 1; maxblocks > 1; level++)
> > -		maxblocks = (maxblocks + minnoderecs - 1) / minnoderecs;
> > -	mp->m_ag_maxlevels = level;
> > +	mp->m_ag_maxlevels = xfs_btree_compute_maxlevels(mp, mp->m_alloc_mnr,
> > +			(mp->m_sb.sb_agblocks + 1) / 2);
> >  }
> >  
> >  /*
> > diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> > index 5eb4e40..046fbcf 100644
> > --- a/fs/xfs/libxfs/xfs_btree.c
> > +++ b/fs/xfs/libxfs/xfs_btree.c
> > @@ -4158,6 +4158,25 @@ xfs_btree_sblock_verify(
> >  }
> >  
> >  /*
> > + * Calculate the number of btree levels needed to store a given number of
> > + * records in a short-format btree.
> > + */
> > +uint
> > +xfs_btree_compute_maxlevels(
> > +	struct xfs_mount	*mp,
> > +	uint			*limits,
> > +	unsigned long		len)
> > +{
> > +	uint			level;
> > +	unsigned long		maxblocks;
> > +
> > +	maxblocks = (len + limits[0] - 1) / limits[0];
> > +	for (level = 1; maxblocks > 1; level++)
> > +		maxblocks = (maxblocks + limits[1] - 1) / limits[1];
> > +	return level;
> > +}
> > +
> > +/*
> >   * Calculate the number of blocks needed to store a given number of records
> >   * in a short-format (per-AG metadata) btree.
> >   */
> > diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> > index b330f19..b955e5d 100644
> > --- a/fs/xfs/libxfs/xfs_btree.h
> > +++ b/fs/xfs/libxfs/xfs_btree.h
> > @@ -477,5 +477,7 @@ bool xfs_btree_sblock_verify(struct xfs_buf *bp, unsigned int max_recs);
> >  
> >  xfs_extlen_t xfs_btree_calc_size(struct xfs_mount *mp, uint *limits,
> >  		unsigned long long len);
> > +uint xfs_btree_compute_maxlevels(struct xfs_mount *mp, uint *limits,
> > +		unsigned long len);
> >  
> >  #endif	/* __XFS_BTREE_H__ */
> > diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
> > index 9d0003c..cda7269 100644
> > --- a/fs/xfs/libxfs/xfs_ialloc.c
> > +++ b/fs/xfs/libxfs/xfs_ialloc.c
> > @@ -2394,20 +2394,11 @@ void
> >  xfs_ialloc_compute_maxlevels(
> >  	xfs_mount_t	*mp)		/* file system mount structure */
> >  {
> > -	int		level;
> > -	uint		maxblocks;
> > -	uint		maxleafents;
> > -	int		minleafrecs;
> > -	int		minnoderecs;
> > -
> > -	maxleafents = (1LL << XFS_INO_AGINO_BITS(mp)) >>
> > -		XFS_INODES_PER_CHUNK_LOG;
> > -	minleafrecs = mp->m_inobt_mnr[0];
> > -	minnoderecs = mp->m_inobt_mnr[1];
> > -	maxblocks = (maxleafents + minleafrecs - 1) / minleafrecs;
> > -	for (level = 1; maxblocks > 1; level++)
> > -		maxblocks = (maxblocks + minnoderecs - 1) / minnoderecs;
> > -	mp->m_in_maxlevels = level;
> > +	uint		inodes;
> > +
> > +	inodes = (1LL << XFS_INO_AGINO_BITS(mp)) >> XFS_INODES_PER_CHUNK_LOG;
> > +	mp->m_in_maxlevels = xfs_btree_compute_maxlevels(mp, mp->m_inobt_mnr,
> > +							 inodes);
> >  }
> >  
> >  /*
> > 

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 010/119] xfs: create a standard btree size calculator code
  2016-06-20 14:31   ` Brian Foster
@ 2016-06-20 19:34     ` Darrick J. Wong
  0 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-20 19:34 UTC (permalink / raw)
  To: Brian Foster; +Cc: david, linux-fsdevel, vishal.l.verma, xfs

On Mon, Jun 20, 2016 at 10:31:49AM -0400, Brian Foster wrote:
> On Thu, Jun 16, 2016 at 06:18:56PM -0700, Darrick J. Wong wrote:
> > Create a helper to generate AG btree height calculator functions.
> > This will be used (much) later when we get to the refcount btree.
> > 
> > v2: Use a helper function instead of a macro.
> > v3: We can (theoretically) store more than 2^32 records in a btree, so
> >     widen the fields to accept that.
> > v4: Don't modify xfs_bmap_worst_indlen; the purpose of /that/ function
> >     is to estimate the worst-case number of blocks needed for a bmbt
> >     expansion, not to calculate the space required to store nr records.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> 
> I think this one should probably be pushed out to where it is used
> (easier to review with an example imo). I don't see it used anywhere up
> through the rmapbt stuff, anyways...

Oh, heh, you're right.  At one point I was using it for the rmapbt, but
nowadays it's only used for per-AG reservations (reflink+rmap) as you
point out, so it could move.

> 
> >  fs/xfs/libxfs/xfs_btree.c |   27 +++++++++++++++++++++++++++
> >  fs/xfs/libxfs/xfs_btree.h |    3 +++
> >  2 files changed, 30 insertions(+)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> > index 105979d..5eb4e40 100644
> > --- a/fs/xfs/libxfs/xfs_btree.c
> > +++ b/fs/xfs/libxfs/xfs_btree.c
> > @@ -4156,3 +4156,30 @@ xfs_btree_sblock_verify(
> >  
> >  	return true;
> >  }
> > +
> > +/*
> > + * Calculate the number of blocks needed to store a given number of records
> > + * in a short-format (per-AG metadata) btree.
> > + */
> > +xfs_extlen_t
> > +xfs_btree_calc_size(
> > +	struct xfs_mount	*mp,
> > +	uint			*limits,
> > +	unsigned long long	len)
> > +{
> > +	int			level;
> > +	int			maxrecs;
> > +	xfs_extlen_t		rval;
> > +
> > +	maxrecs = limits[0];
> > +	for (level = 0, rval = 0; len > 0; level++) {
> 
> len is unsigned, so len > 0 is kind of pointless. Perhaps check len > 1
> and kill the check in the loop?

Yup.  Thank you for pointing that out.

--D

> 
> Brian
> 
> > +		len += maxrecs - 1;
> > +		do_div(len, maxrecs);
> > +		rval += len;
> > +		if (len == 1)
> > +			return rval;
> > +		if (level == 0)
> > +			maxrecs = limits[1];
> > +	}
> > +	return rval;
> > +}
> > diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> > index 9a88839..b330f19 100644
> > --- a/fs/xfs/libxfs/xfs_btree.h
> > +++ b/fs/xfs/libxfs/xfs_btree.h
> > @@ -475,4 +475,7 @@ static inline int xfs_btree_get_level(struct xfs_btree_block *block)
> >  bool xfs_btree_sblock_v5hdr_verify(struct xfs_buf *bp);
> >  bool xfs_btree_sblock_verify(struct xfs_buf *bp, unsigned int max_recs);
> >  
> > +xfs_extlen_t xfs_btree_calc_size(struct xfs_mount *mp, uint *limits,
> > +		unsigned long long len);
> > +
> >  #endif	/* __XFS_BTREE_H__ */
> > 
> > _______________________________________________
> > xfs mailing list
> > xfs@oss.sgi.com
> > http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 004/119] xfs: enable buffer deadlock postmortem diagnosis via ftrace
  2016-06-17 11:34   ` Christoph Hellwig
@ 2016-06-21  0:47     ` Dave Chinner
  0 siblings, 0 replies; 236+ messages in thread
From: Dave Chinner @ 2016-06-21  0:47 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Darrick J. Wong, linux-fsdevel, vishal.l.verma, xfs

On Fri, Jun 17, 2016 at 04:34:23AM -0700, Christoph Hellwig wrote:
> > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > index efa2a73..2333db7 100644
> > --- a/fs/xfs/xfs_buf.c
> > +++ b/fs/xfs/xfs_buf.c
> > @@ -947,7 +947,8 @@ xfs_buf_trylock(
> >  	if (locked)
> >  		XB_SET_OWNER(bp);
> >  
> > -	trace_xfs_buf_trylock(bp, _RET_IP_);
> > +	locked ? trace_xfs_buf_trylock(bp, _RET_IP_) :
> > +		 trace_xfs_buf_trylock_fail(bp, _RET_IP_);
> >  	return locked;
> 
> I think this should be something like:
> 
> 	if (locked) {
> 		XB_SET_OWNER(bp);
> 		trace_xfs_buf_trylock(bp, _RET_IP_);
> 	} else {
> 		trace_xfs_buf_trylock_fail(bp, _RET_IP_);
> 	}
> 
> otherwise this looks good and can go in without the rest of the series.

I'll fix that up on commit.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 008/119] xfs: separate freelist fixing into a separate helper
  2016-06-17 11:52   ` Christoph Hellwig
@ 2016-06-21  0:48     ` Dave Chinner
  0 siblings, 0 replies; 236+ messages in thread
From: Dave Chinner @ 2016-06-21  0:48 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Darrick J. Wong, linux-fsdevel, vishal.l.verma, xfs

On Fri, Jun 17, 2016 at 04:52:04AM -0700, Christoph Hellwig wrote:
> > +/* Ensure that the freelist is at full capacity. */
> > +int
> > +xfs_free_extent_fix_freelist(
> > +	struct xfs_trans	*tp,
> > +	xfs_agnumber_t		agno,
> > +	struct xfs_buf		**agbp)
> >  {
> > -	xfs_alloc_arg_t	args;
> > -	int		error;
> > +	xfs_alloc_arg_t		args;
> 
> Use struct xfs_alloc_arg if you change this anyway.
> 
> > +	int			error;
> >  
> > -	ASSERT(len != 0);
> >  	memset(&args, 0, sizeof(xfs_alloc_arg_t));
> 
> Same here.
> 
> > -	if (args.agbno + len >
> > -			be32_to_cpu(XFS_BUF_TO_AGF(args.agbp)->agf_length)) {
> > -		error = -EFSCORRUPTED;
> > -		goto error0;
> > -	}
> > +	XFS_WANT_CORRUPTED_GOTO(mp,
> > +			agbno + len <= be32_to_cpu(XFS_BUF_TO_AGF(agbp)->agf_length),
> > +			err);
> 
> This introduces an overly long line.
> 
> But except for these nitpicks this looks fine:

I'll clean them up on commit.

-Dave.

-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 009/119] xfs: convert list of extents to free into a regular list
  2016-06-18 20:15     ` Darrick J. Wong
@ 2016-06-21  0:57       ` Dave Chinner
  0 siblings, 0 replies; 236+ messages in thread
From: Dave Chinner @ 2016-06-21  0:57 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-fsdevel, vishal.l.verma, xfs

On Sat, Jun 18, 2016 at 01:15:10PM -0700, Darrick J. Wong wrote:
> On Fri, Jun 17, 2016 at 04:59:30AM -0700, Christoph Hellwig wrote:
> > >  {
> > > +	struct xfs_bmap_free_item	*new;		/* new element */
> > >  #ifdef DEBUG
> > >  	xfs_agnumber_t		agno;
> > >  	xfs_agblock_t		agbno;
> > > @@ -597,17 +595,7 @@ xfs_bmap_add_free(
> > >  	new = kmem_zone_alloc(xfs_bmap_free_item_zone, KM_SLEEP);
> > >  	new->xbfi_startblock = bno;
> > >  	new->xbfi_blockcount = (xfs_extlen_t)len;
> > > +	list_add(&new->xbfi_list, &flist->xbf_flist);
> > >  	flist->xbf_count++;
> > 
> > Please kill xbf_count while you're at it, it's entirely superfluous.
> 
> The deferred ops conversion patch kills this off by moving the whole
> "defer an op to the next transaction by logging redo items" logic
> into a separate file and mechanism.
> 
> This patch is just a cleanup to reduce some of the open coded list ugliness
> before starting on the rmap stuff.  Once the deferred ops code lands, all
> three of these functions go away.

Ok, so because all these functions go away, I'll take this patch now
without the suggested cleanups so that you don't have to rework it.

....

> > > +	list_sort((*tp)->t_mountp, &flist->xbf_flist, xfs_bmap_free_list_cmp);
> > 
> > Can you add a comment on why we are sorting the list?
> 
> We sort the list so that we process the freed extents in AG order to
> avoid deadlocking.
> 
> I'll add a comment to the deferred ops code if there isn't one already.

This seems best - add the cleanup to the later patches rather than
having to rework lots of patches because of minor mods to the early
ones...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 008/119] xfs: separate freelist fixing into a separate helper
  2016-06-17  1:18 ` [PATCH 008/119] xfs: separate freelist fixing into a separate helper Darrick J. Wong
  2016-06-17 11:52   ` Christoph Hellwig
@ 2016-06-21  1:40   ` Dave Chinner
  1 sibling, 0 replies; 236+ messages in thread
From: Dave Chinner @ 2016-06-21  1:40 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

On Thu, Jun 16, 2016 at 06:18:43PM -0700, Darrick J. Wong wrote:
> From: Dave Chinner <david@fromorbit.com>
> 
> Break up xfs_free_extent() into a helper that fixes the freelist.
> This helper will be used subsequently to ensure the freelist during
> deferred rmap processing.
> 
> Signed-off-by: Dave Chinner <david@fromorbit.com>

Just noticed - should be from/sob dchinner@redhat.com. I'll fix this
up, too.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 012/119] xfs: during btree split, save new block key & ptr for future insertion
  2016-06-17  1:19 ` [PATCH 012/119] xfs: during btree split, save new block key & ptr for future insertion Darrick J. Wong
@ 2016-06-21 13:00   ` Brian Foster
  2016-06-27 22:30     ` Darrick J. Wong
  0 siblings, 1 reply; 236+ messages in thread
From: Brian Foster @ 2016-06-21 13:00 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, vishal.l.verma, xfs

On Thu, Jun 16, 2016 at 06:19:08PM -0700, Darrick J. Wong wrote:
> When a btree block has to be split, we pass the new block's ptr from
> xfs_btree_split() back to xfs_btree_insert() via a pointer parameter;
> however, we pass the block's key through the cursor's record.  It is a
> little weird to "initialize" a record from a key since the non-key
> attributes will have garbage values.
> 
> When we go to add support for interval queries, we have to be able to
> pass the lowest and highest keys accessible via a pointer.  There's no
> clean way to pass this back through the cursor's record field.
> Therefore, pass the key directly back to xfs_btree_insert() the same
> way that we pass the btree_ptr.
> 
> As a bonus, we no longer need init_rec_from_key and can drop it from the
> codebase.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_alloc_btree.c  |   12 ----------
>  fs/xfs/libxfs/xfs_bmap_btree.c   |   12 ----------
>  fs/xfs/libxfs/xfs_btree.c        |   44 +++++++++++++++++++-------------------
>  fs/xfs/libxfs/xfs_btree.h        |    2 --
>  fs/xfs/libxfs/xfs_ialloc_btree.c |   10 ---------
>  5 files changed, 22 insertions(+), 58 deletions(-)
> 
> 
...
> diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> index 046fbcf..a096539 100644
> --- a/fs/xfs/libxfs/xfs_btree.c
> +++ b/fs/xfs/libxfs/xfs_btree.c
...
> @@ -2929,16 +2927,16 @@ xfs_btree_insrec(
>  	struct xfs_btree_cur	*cur,	/* btree cursor */
>  	int			level,	/* level to insert record at */
>  	union xfs_btree_ptr	*ptrp,	/* i/o: block number inserted */
> -	union xfs_btree_rec	*recp,	/* i/o: record data inserted */
> +	union xfs_btree_key	*key,	/* i/o: block key for ptrp */
>  	struct xfs_btree_cur	**curp,	/* output: new cursor replacing cur */
>  	int			*stat)	/* success/failure */
>  {
>  	struct xfs_btree_block	*block;	/* btree block */
>  	struct xfs_buf		*bp;	/* buffer for block */
> -	union xfs_btree_key	key;	/* btree key */
>  	union xfs_btree_ptr	nptr;	/* new block ptr */
>  	struct xfs_btree_cur	*ncur;	/* new btree cursor */
> -	union xfs_btree_rec	nrec;	/* new record count */
> +	union xfs_btree_key	nkey;	/* new block key */
> +	union xfs_btree_rec	rec;	/* record to insert */
>  	int			optr;	/* old key/record index */
>  	int			ptr;	/* key/record index */
>  	int			numrecs;/* number of records */
> @@ -2947,8 +2945,14 @@ xfs_btree_insrec(
>  	int			i;
>  #endif
>  
> +	/* Make a key out of the record data to be inserted, and save it. */
> +	if (level == 0) {
> +		cur->bc_ops->init_rec_from_cur(cur, &rec);
> +		cur->bc_ops->init_key_from_rec(key, &rec);
> +	}

The level == 0 check looks a bit hacky to me. IOW, I think it's cleaner
that the key is initialized once in the caller rather than check for a
particular iteration down in xfs_btree_insrec(). That said,
xfs_btree_insrec() still needs rec initialized in the level == 0 case.

I wonder if we could create an inline xfs_btree_init_key_from_cur()
helper to combine the above calls, invoke it once in xfs_btree_insert(),
then push down the ->init_rec_from_cur() calls to the contexts further
down in this function where rec is actually required. There are only two
and one of them is DEBUG code. Thoughts?

> +
>  	XFS_BTREE_TRACE_CURSOR(cur, XBT_ENTRY);
> -	XFS_BTREE_TRACE_ARGIPR(cur, level, *ptrp, recp);
> +	XFS_BTREE_TRACE_ARGIPR(cur, level, *ptrp, &rec);
>  

So these look like unimplemented dummy tracing hooks. It sounds like
previously rec could have a junk value after a btree split, but now it
looks like rec is junk for every non-zero level. Kind of annoying, I
wonder if we can just kill these.. :/

Brian

>  	ncur = NULL;
>  
> @@ -2973,9 +2977,6 @@ xfs_btree_insrec(
>  		return 0;
>  	}
>  
> -	/* Make a key out of the record data to be inserted, and save it. */
> -	cur->bc_ops->init_key_from_rec(&key, recp);
> -
>  	optr = ptr;
>  
>  	XFS_BTREE_STATS_INC(cur, insrec);
> @@ -2992,10 +2993,10 @@ xfs_btree_insrec(
>  	/* Check that the new entry is being inserted in the right place. */
>  	if (ptr <= numrecs) {
>  		if (level == 0) {
> -			ASSERT(cur->bc_ops->recs_inorder(cur, recp,
> +			ASSERT(cur->bc_ops->recs_inorder(cur, &rec,
>  				xfs_btree_rec_addr(cur, ptr, block)));
>  		} else {
> -			ASSERT(cur->bc_ops->keys_inorder(cur, &key,
> +			ASSERT(cur->bc_ops->keys_inorder(cur, key,
>  				xfs_btree_key_addr(cur, ptr, block)));
>  		}
>  	}
> @@ -3008,7 +3009,7 @@ xfs_btree_insrec(
>  	xfs_btree_set_ptr_null(cur, &nptr);
>  	if (numrecs == cur->bc_ops->get_maxrecs(cur, level)) {
>  		error = xfs_btree_make_block_unfull(cur, level, numrecs,
> -					&optr, &ptr, &nptr, &ncur, &nrec, stat);
> +					&optr, &ptr, &nptr, &ncur, &nkey, stat);
>  		if (error || *stat == 0)
>  			goto error0;
>  	}
> @@ -3058,7 +3059,7 @@ xfs_btree_insrec(
>  #endif
>  
>  		/* Now put the new data in, bump numrecs and log it. */
> -		xfs_btree_copy_keys(cur, kp, &key, 1);
> +		xfs_btree_copy_keys(cur, kp, key, 1);
>  		xfs_btree_copy_ptrs(cur, pp, ptrp, 1);
>  		numrecs++;
>  		xfs_btree_set_numrecs(block, numrecs);
> @@ -3079,7 +3080,7 @@ xfs_btree_insrec(
>  		xfs_btree_shift_recs(cur, rp, 1, numrecs - ptr + 1);
>  
>  		/* Now put the new data in, bump numrecs and log it. */
> -		xfs_btree_copy_recs(cur, rp, recp, 1);
> +		xfs_btree_copy_recs(cur, rp, &rec, 1);
>  		xfs_btree_set_numrecs(block, ++numrecs);
>  		xfs_btree_log_recs(cur, bp, ptr, numrecs);
>  #ifdef DEBUG
> @@ -3095,7 +3096,7 @@ xfs_btree_insrec(
>  
>  	/* If we inserted at the start of a block, update the parents' keys. */
>  	if (optr == 1) {
> -		error = xfs_btree_updkey(cur, &key, level + 1);
> +		error = xfs_btree_updkey(cur, key, level + 1);
>  		if (error)
>  			goto error0;
>  	}
> @@ -3105,7 +3106,7 @@ xfs_btree_insrec(
>  	 * we are at the far right edge of the tree, update it.
>  	 */
>  	if (xfs_btree_is_lastrec(cur, block, level)) {
> -		cur->bc_ops->update_lastrec(cur, block, recp,
> +		cur->bc_ops->update_lastrec(cur, block, &rec,
>  					    ptr, LASTREC_INSREC);
>  	}
>  
> @@ -3115,7 +3116,7 @@ xfs_btree_insrec(
>  	 */
>  	*ptrp = nptr;
>  	if (!xfs_btree_ptr_is_null(cur, &nptr)) {
> -		*recp = nrec;
> +		*key = nkey;
>  		*curp = ncur;
>  	}
>  
> @@ -3146,14 +3147,13 @@ xfs_btree_insert(
>  	union xfs_btree_ptr	nptr;	/* new block number (split result) */
>  	struct xfs_btree_cur	*ncur;	/* new cursor (split result) */
>  	struct xfs_btree_cur	*pcur;	/* previous level's cursor */
> -	union xfs_btree_rec	rec;	/* record to insert */
> +	union xfs_btree_key	key;	/* key of block to insert */
>  
>  	level = 0;
>  	ncur = NULL;
>  	pcur = cur;
>  
>  	xfs_btree_set_ptr_null(cur, &nptr);
> -	cur->bc_ops->init_rec_from_cur(cur, &rec);
>  
>  	/*
>  	 * Loop going up the tree, starting at the leaf level.
> @@ -3165,7 +3165,7 @@ xfs_btree_insert(
>  		 * Insert nrec/nptr into this level of the tree.
>  		 * Note if we fail, nptr will be null.
>  		 */
> -		error = xfs_btree_insrec(pcur, level, &nptr, &rec, &ncur, &i);
> +		error = xfs_btree_insrec(pcur, level, &nptr, &key, &ncur, &i);
>  		if (error) {
>  			if (pcur != cur)
>  				xfs_btree_del_cursor(pcur, XFS_BTREE_ERROR);
> diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> index b955e5d..b99c018 100644
> --- a/fs/xfs/libxfs/xfs_btree.h
> +++ b/fs/xfs/libxfs/xfs_btree.h
> @@ -158,8 +158,6 @@ struct xfs_btree_ops {
>  	/* init values of btree structures */
>  	void	(*init_key_from_rec)(union xfs_btree_key *key,
>  				     union xfs_btree_rec *rec);
> -	void	(*init_rec_from_key)(union xfs_btree_key *key,
> -				     union xfs_btree_rec *rec);
>  	void	(*init_rec_from_cur)(struct xfs_btree_cur *cur,
>  				     union xfs_btree_rec *rec);
>  	void	(*init_ptr_from_cur)(struct xfs_btree_cur *cur,
> diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
> index 89c21d7..88da2ad 100644
> --- a/fs/xfs/libxfs/xfs_ialloc_btree.c
> +++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
> @@ -146,14 +146,6 @@ xfs_inobt_init_key_from_rec(
>  }
>  
>  STATIC void
> -xfs_inobt_init_rec_from_key(
> -	union xfs_btree_key	*key,
> -	union xfs_btree_rec	*rec)
> -{
> -	rec->inobt.ir_startino = key->inobt.ir_startino;
> -}
> -
> -STATIC void
>  xfs_inobt_init_rec_from_cur(
>  	struct xfs_btree_cur	*cur,
>  	union xfs_btree_rec	*rec)
> @@ -314,7 +306,6 @@ static const struct xfs_btree_ops xfs_inobt_ops = {
>  	.get_minrecs		= xfs_inobt_get_minrecs,
>  	.get_maxrecs		= xfs_inobt_get_maxrecs,
>  	.init_key_from_rec	= xfs_inobt_init_key_from_rec,
> -	.init_rec_from_key	= xfs_inobt_init_rec_from_key,
>  	.init_rec_from_cur	= xfs_inobt_init_rec_from_cur,
>  	.init_ptr_from_cur	= xfs_inobt_init_ptr_from_cur,
>  	.key_diff		= xfs_inobt_key_diff,
> @@ -336,7 +327,6 @@ static const struct xfs_btree_ops xfs_finobt_ops = {
>  	.get_minrecs		= xfs_inobt_get_minrecs,
>  	.get_maxrecs		= xfs_inobt_get_maxrecs,
>  	.init_key_from_rec	= xfs_inobt_init_key_from_rec,
> -	.init_rec_from_key	= xfs_inobt_init_rec_from_key,
>  	.init_rec_from_cur	= xfs_inobt_init_rec_from_cur,
>  	.init_ptr_from_cur	= xfs_finobt_init_ptr_from_cur,
>  	.key_diff		= xfs_inobt_key_diff,
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs


* Re: [PATCH 013/119] xfs: support btrees with overlapping intervals for keys
  2016-06-17  1:19 ` [PATCH 013/119] xfs: support btrees with overlapping intervals for keys Darrick J. Wong
@ 2016-06-22 15:17   ` Brian Foster
  2016-06-28  3:26     ` Darrick J. Wong
  2016-07-06  4:59   ` Dave Chinner
  1 sibling, 1 reply; 236+ messages in thread
From: Brian Foster @ 2016-06-22 15:17 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, vishal.l.verma, xfs

On Thu, Jun 16, 2016 at 06:19:15PM -0700, Darrick J. Wong wrote:
> On a filesystem with both reflink and reverse mapping enabled, it's
> possible to have multiple rmap records referring to the same blocks on
> disk.  When overlapping intervals are possible, querying a classic
> btree to find all records intersecting a given interval is inefficient
> because we cannot use the left side of the search interval to filter
> out non-matching records the same way that we can use the existing
> btree key to filter out records coming after the right side of the
> search interval.  This will become important once we want to use the
> rmap btree to rebuild BMBTs, or implement the (future) fsmap ioctl.
> 
> (For the non-overlapping case, we can perform such queries trivially
> by starting at the left side of the interval and walking the tree
> until we pass the right side.)
> 
> Therefore, extend the btree code to come closer to supporting
> intervals as a first-class record attribute.  This involves widening
> the btree node's key space to store both the lowest key reachable via
> the node pointer (as the btree does now) and the highest key reachable
> via the same pointer and teaching the btree modifying functions to
> keep the highest-key records up to date.
> 
> This behavior can be turned on via a new btree ops flag so that btrees
> that cannot store overlapping intervals don't pay the overhead costs
> in terms of extra code and disk format changes.
> 
> v2: When we're deleting a record in a btree that supports overlapped
> interval records and the deletion results in two btree blocks being
> joined, we defer updating the high/low keys until after all possible
> joining (at higher levels in the tree) has finished.  At this point,
> the btree pointers at all levels have been updated to remove the empty
> blocks and we can update the low and high keys.
> 
> When we're doing this, we must be careful to update the keys of all
> node pointers up to the root instead of stopping at the first set of
> keys that don't need updating.  This is because it's possible for a
> single deletion to cause joining of multiple levels of tree, and so
> we need to update everything going back to the root.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

I think I get the gist of this and it mostly looks OK to me. A few
questions and minor comments...

>  fs/xfs/libxfs/xfs_btree.c |  379 +++++++++++++++++++++++++++++++++++++++++----
>  fs/xfs/libxfs/xfs_btree.h |   16 ++
>  fs/xfs/xfs_trace.h        |   36 ++++
>  3 files changed, 395 insertions(+), 36 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> index a096539..afcafd6 100644
> --- a/fs/xfs/libxfs/xfs_btree.c
> +++ b/fs/xfs/libxfs/xfs_btree.c
> @@ -52,6 +52,11 @@ static const __uint32_t xfs_magics[2][XFS_BTNUM_MAX] = {
>  	xfs_magics[!!((cur)->bc_flags & XFS_BTREE_CRC_BLOCKS)][cur->bc_btnum]
>  
>  
> +struct xfs_btree_double_key {
> +	union xfs_btree_key	low;
> +	union xfs_btree_key	high;
> +};
> +
>  STATIC int				/* error (0 or EFSCORRUPTED) */
>  xfs_btree_check_lblock(
>  	struct xfs_btree_cur	*cur,	/* btree cursor */
> @@ -428,6 +433,30 @@ xfs_btree_dup_cursor(
>   * into a btree block (xfs_btree_*_offset) or return a pointer to the given
>   * record, key or pointer (xfs_btree_*_addr).  Note that all addressing
>   * inside the btree block is done using indices starting at one, not zero!
> + *
> + * If XFS_BTREE_OVERLAPPING is set, then this btree supports keys containing
> + * overlapping intervals.  In such a tree, records are still sorted lowest to
> + * highest and indexed by the smallest key value that refers to the record.
> + * However, nodes are different: each pointer has two associated keys -- one
> + * indexing the lowest key available in the block(s) below (the same behavior
> + * as the key in a regular btree) and another indexing the highest key
> + * available in the block(s) below.  Because records are /not/ sorted by the
> + * highest key, all leaf block updates require us to compute the highest key
> + * that matches any record in the leaf and to recursively update the high keys
> + * in the nodes going further up in the tree, if necessary.  Nodes look like
> + * this:
> + *
> + *		+--------+-----+-----+-----+-----+-----+-------+-------+-----+
> + * Non-Leaf:	| header | lo1 | hi1 | lo2 | hi2 | ... | ptr 1 | ptr 2 | ... |
> + *		+--------+-----+-----+-----+-----+-----+-------+-------+-----+
> + *
> + * To perform an interval query on an overlapped tree, perform the usual
> + * depth-first search and use the low and high keys to decide if we can skip
> + * that particular node.  If a leaf node is reached, return the records that
> + * intersect the interval.  Note that an interval query may return numerous
> + * entries.  For a non-overlapped tree, simply search for the record associated
> + * with the lowest key and iterate forward until a non-matching record is
> + * found.
>   */
>  
>  /*
> @@ -445,6 +474,17 @@ static inline size_t xfs_btree_block_len(struct xfs_btree_cur *cur)
>  	return XFS_BTREE_SBLOCK_LEN;
>  }
>  
> +/* Return size of btree block keys for this btree instance. */
> +static inline size_t xfs_btree_key_len(struct xfs_btree_cur *cur)
> +{
> +	size_t			len;
> +
> +	len = cur->bc_ops->key_len;
> +	if (cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING)
> +		len *= 2;
> +	return len;
> +}
> +
>  /*
>   * Return size of btree block pointers for this btree instance.
>   */
> @@ -475,7 +515,19 @@ xfs_btree_key_offset(
>  	int			n)
>  {
>  	return xfs_btree_block_len(cur) +
> -		(n - 1) * cur->bc_ops->key_len;
> +		(n - 1) * xfs_btree_key_len(cur);
> +}
> +
> +/*
> + * Calculate offset of the n-th high key in a btree block.
> + */
> +STATIC size_t
> +xfs_btree_high_key_offset(
> +	struct xfs_btree_cur	*cur,
> +	int			n)
> +{
> +	return xfs_btree_block_len(cur) +
> +		(n - 1) * xfs_btree_key_len(cur) + cur->bc_ops->key_len;
>  }
>  
>  /*
> @@ -488,7 +540,7 @@ xfs_btree_ptr_offset(
>  	int			level)
>  {
>  	return xfs_btree_block_len(cur) +
> -		cur->bc_ops->get_maxrecs(cur, level) * cur->bc_ops->key_len +
> +		cur->bc_ops->get_maxrecs(cur, level) * xfs_btree_key_len(cur) +
>  		(n - 1) * xfs_btree_ptr_len(cur);
>  }
>  
> @@ -519,6 +571,19 @@ xfs_btree_key_addr(
>  }
>  
>  /*
> + * Return a pointer to the n-th high key in the btree block.
> + */
> +STATIC union xfs_btree_key *
> +xfs_btree_high_key_addr(
> +	struct xfs_btree_cur	*cur,
> +	int			n,
> +	struct xfs_btree_block	*block)
> +{
> +	return (union xfs_btree_key *)
> +		((char *)block + xfs_btree_high_key_offset(cur, n));
> +}
> +
> +/*
>   * Return a pointer to the n-th block pointer in the btree block.
>   */
>  STATIC union xfs_btree_ptr *
> @@ -1217,7 +1282,7 @@ xfs_btree_copy_keys(
>  	int			numkeys)
>  {
>  	ASSERT(numkeys >= 0);
> -	memcpy(dst_key, src_key, numkeys * cur->bc_ops->key_len);
> +	memcpy(dst_key, src_key, numkeys * xfs_btree_key_len(cur));
>  }
>  
>  /*
> @@ -1263,8 +1328,8 @@ xfs_btree_shift_keys(
>  	ASSERT(numkeys >= 0);
>  	ASSERT(dir == 1 || dir == -1);
>  
> -	dst_key = (char *)key + (dir * cur->bc_ops->key_len);
> -	memmove(dst_key, key, numkeys * cur->bc_ops->key_len);
> +	dst_key = (char *)key + (dir * xfs_btree_key_len(cur));
> +	memmove(dst_key, key, numkeys * xfs_btree_key_len(cur));
>  }
>  
>  /*
> @@ -1879,6 +1944,180 @@ error0:
>  	return error;
>  }
>  
> +/* Determine the low and high keys of a leaf block */
> +STATIC void
> +xfs_btree_find_leaf_keys(
> +	struct xfs_btree_cur	*cur,
> +	struct xfs_btree_block	*block,
> +	union xfs_btree_key	*low,
> +	union xfs_btree_key	*high)
> +{
> +	int			n;
> +	union xfs_btree_rec	*rec;
> +	union xfs_btree_key	max_hkey;
> +	union xfs_btree_key	hkey;
> +
> +	rec = xfs_btree_rec_addr(cur, 1, block);
> +	cur->bc_ops->init_key_from_rec(low, rec);
> +
> +	if (!(cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING))
> +		return;
> +
> +	cur->bc_ops->init_high_key_from_rec(&max_hkey, rec);
> +	for (n = 2; n <= xfs_btree_get_numrecs(block); n++) {
> +		rec = xfs_btree_rec_addr(cur, n, block);
> +		cur->bc_ops->init_high_key_from_rec(&hkey, rec);
> +		if (cur->bc_ops->diff_two_keys(cur, &max_hkey, &hkey) > 0)
> +			max_hkey = hkey;
> +	}
> +
> +	*high = max_hkey;
> +}
> +
> +/* Determine the low and high keys of a node block */
> +STATIC void
> +xfs_btree_find_node_keys(
> +	struct xfs_btree_cur	*cur,
> +	struct xfs_btree_block	*block,
> +	union xfs_btree_key	*low,
> +	union xfs_btree_key	*high)
> +{
> +	int			n;
> +	union xfs_btree_key	*hkey;
> +	union xfs_btree_key	*max_hkey;
> +
> +	*low = *xfs_btree_key_addr(cur, 1, block);
> +
> +	if (!(cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING))
> +		return;
> +
> +	max_hkey = xfs_btree_high_key_addr(cur, 1, block);
> +	for (n = 2; n <= xfs_btree_get_numrecs(block); n++) {
> +		hkey = xfs_btree_high_key_addr(cur, n, block);
> +		if (cur->bc_ops->diff_two_keys(cur, max_hkey, hkey) > 0)
> +			max_hkey = hkey;
> +	}
> +
> +	*high = *max_hkey;
> +}
> +
> +/*
> + * Update parental low & high keys from some block all the way back to the
> + * root of the btree.
> + */
> +STATIC int
> +__xfs_btree_updkeys(
> +	struct xfs_btree_cur	*cur,
> +	int			level,
> +	struct xfs_btree_block	*block,
> +	struct xfs_buf		*bp0,
> +	bool			force_all)
> +{
> +	union xfs_btree_key	lkey;	/* keys from current level */
> +	union xfs_btree_key	hkey;
> +	union xfs_btree_key	*nlkey;	/* keys from the next level up */
> +	union xfs_btree_key	*nhkey;
> +	struct xfs_buf		*bp;
> +	int			ptr = -1;

ptr doesn't appear to require initialization.

> +
> +	if (!(cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING))
> +		return 0;
> +
> +	if (level + 1 >= cur->bc_nlevels)
> +		return 0;

This could use a comment to indicate we're checking for a parent level
to update.

> +
> +	trace_xfs_btree_updkeys(cur, level, bp0);
> +
> +	if (level == 0)
> +		xfs_btree_find_leaf_keys(cur, block, &lkey, &hkey);
> +	else
> +		xfs_btree_find_node_keys(cur, block, &lkey, &hkey);
> +	for (level++; level < cur->bc_nlevels; level++) {
> +		block = xfs_btree_get_block(cur, level, &bp);
> +		trace_xfs_btree_updkeys(cur, level, bp);
> +		ptr = cur->bc_ptrs[level];
> +		nlkey = xfs_btree_key_addr(cur, ptr, block);
> +		nhkey = xfs_btree_high_key_addr(cur, ptr, block);
> +		if (!(cur->bc_ops->diff_two_keys(cur, nlkey, &lkey) != 0 ||
> +		      cur->bc_ops->diff_two_keys(cur, nhkey, &hkey) != 0) &&
> +		    !force_all)
> +			break;
> +		memcpy(nlkey, &lkey, cur->bc_ops->key_len);
> +		memcpy(nhkey, &hkey, cur->bc_ops->key_len);
> +		xfs_btree_log_keys(cur, bp, ptr, ptr);
> +		if (level + 1 >= cur->bc_nlevels)
> +			break;
> +		xfs_btree_find_node_keys(cur, block, &lkey, &hkey);
> +	}
> +
> +	return 0;
> +}
> +
> +/*
> + * Update all the keys from a sibling block at some level in the cursor back
> + * to the root, stopping when we find a key pair that doesn't need updating.
> + */
> +STATIC int
> +xfs_btree_sibling_updkeys(
> +	struct xfs_btree_cur	*cur,
> +	int			level,
> +	int			ptr,
> +	struct xfs_btree_block	*block,
> +	struct xfs_buf		*bp0)
> +{
> +	struct xfs_btree_cur	*ncur;
> +	int			stat;
> +	int			error;
> +
> +	error = xfs_btree_dup_cursor(cur, &ncur);
> +	if (error)
> +		return error;
> +
> +	if (level + 1 >= ncur->bc_nlevels)
> +		error = -EDOM;
> +	else if (ptr == XFS_BB_RIGHTSIB)
> +		error = xfs_btree_increment(ncur, level + 1, &stat);
> +	else if (ptr == XFS_BB_LEFTSIB)
> +		error = xfs_btree_decrement(ncur, level + 1, &stat);
> +	else
> +		error = -EBADE;

So we inc/dec the cursor at the next level up the tree, then update the
keys up that path with the __xfs_btree_updkeys() call below. The inc/dec
calls explicitly say that they don't alter the cursor below the level,
so it looks like we'd end up with a weird cursor path here.

Digging around further, it looks like we pass the sibling bp/block
pointers from the caller and thus __xfs_btree_updkeys() should do the
correct thing, but this is not very clear. If I'm on the right track,
I'd suggest to add a big fat comment here. :)

> +	if (error || !stat)
> +		return error;

Looks like a potential cursor leak on error.

> +
> +	error = __xfs_btree_updkeys(ncur, level, block, bp0, false);
> +	xfs_btree_del_cursor(ncur, XFS_BTREE_NOERROR);
> +	return error;
> +}
> +
> +/*
> + * Update all the keys from some level in cursor back to the root, stopping
> + * when we find a key pair that don't need updating.
> + */
> +STATIC int
> +xfs_btree_updkeys(
> +	struct xfs_btree_cur	*cur,
> +	int			level)
> +{
> +	struct xfs_buf		*bp;
> +	struct xfs_btree_block	*block;
> +
> +	block = xfs_btree_get_block(cur, level, &bp);
> +	return __xfs_btree_updkeys(cur, level, block, bp, false);
> +}
> +
> +/* Update all the keys from some level in cursor back to the root. */
> +STATIC int
> +xfs_btree_updkeys_force(
> +	struct xfs_btree_cur	*cur,
> +	int			level)
> +{
> +	struct xfs_buf		*bp;
> +	struct xfs_btree_block	*block;
> +
> +	block = xfs_btree_get_block(cur, level, &bp);
> +	return __xfs_btree_updkeys(cur, level, block, bp, true);
> +}
> +
>  /*
>   * Update keys at all levels from here to the root along the cursor's path.
>   */
> @@ -1893,6 +2132,9 @@ xfs_btree_updkey(
>  	union xfs_btree_key	*kp;
>  	int			ptr;
>  
> +	if (cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING)
> +		return 0;
> +
>  	XFS_BTREE_TRACE_CURSOR(cur, XBT_ENTRY);
>  	XFS_BTREE_TRACE_ARGIK(cur, level, keyp);
>  
> @@ -1970,7 +2212,8 @@ xfs_btree_update(
>  					    ptr, LASTREC_UPDATE);
>  	}
>  
> -	/* Updating first rec in leaf. Pass new key value up to our parent. */
> +	/* Pass new key value up to our parent. */
> +	xfs_btree_updkeys(cur, 0);
>  	if (ptr == 1) {
>  		union xfs_btree_key	key;
>  
> @@ -2149,7 +2392,9 @@ xfs_btree_lshift(
>  		rkp = &key;
>  	}
>  
> -	/* Update the parent key values of right. */
> +	/* Update the parent key values of left and right. */
> +	xfs_btree_sibling_updkeys(cur, level, XFS_BB_LEFTSIB, left, lbp);
> +	xfs_btree_updkeys(cur, level);
>  	error = xfs_btree_updkey(cur, rkp, level + 1);
>  	if (error)
>  		goto error0;
> @@ -2321,6 +2566,9 @@ xfs_btree_rshift(
>  	if (error)
>  		goto error1;
>  
> +	/* Update left and right parent pointers */
> +	xfs_btree_updkeys(cur, level);
> +	xfs_btree_updkeys(tcur, level);

In this case, we grab the last record of the block, increment from there
and update using the cursor. This is much more straightforward, imo.
Could we use this approach in the left shift case as well?

>  	error = xfs_btree_updkey(tcur, rkp, level + 1);
>  	if (error)
>  		goto error1;
> @@ -2356,7 +2604,7 @@ __xfs_btree_split(
>  	struct xfs_btree_cur	*cur,
>  	int			level,
>  	union xfs_btree_ptr	*ptrp,
> -	union xfs_btree_key	*key,
> +	struct xfs_btree_double_key	*key,
>  	struct xfs_btree_cur	**curp,
>  	int			*stat)		/* success/failure */
>  {
> @@ -2452,9 +2700,6 @@ __xfs_btree_split(
>  
>  		xfs_btree_log_keys(cur, rbp, 1, rrecs);
>  		xfs_btree_log_ptrs(cur, rbp, 1, rrecs);
> -
> -		/* Grab the keys to the entries moved to the right block */
> -		xfs_btree_copy_keys(cur, key, rkp, 1);
>  	} else {
>  		/* It's a leaf.  Move records.  */
>  		union xfs_btree_rec	*lrp;	/* left record pointer */
> @@ -2465,12 +2710,8 @@ __xfs_btree_split(
>  
>  		xfs_btree_copy_recs(cur, rrp, lrp, rrecs);
>  		xfs_btree_log_recs(cur, rbp, 1, rrecs);
> -
> -		cur->bc_ops->init_key_from_rec(key,
> -			xfs_btree_rec_addr(cur, 1, right));
>  	}
>  
> -
>  	/*
>  	 * Find the left block number by looking in the buffer.
>  	 * Adjust numrecs, sibling pointers.
> @@ -2484,6 +2725,12 @@ __xfs_btree_split(
>  	xfs_btree_set_numrecs(left, lrecs);
>  	xfs_btree_set_numrecs(right, xfs_btree_get_numrecs(right) + rrecs);
>  
> +	/* Find the low & high keys for the new block. */
> +	if (level > 0)
> +		xfs_btree_find_node_keys(cur, right, &key->low, &key->high);
> +	else
> +		xfs_btree_find_leaf_keys(cur, right, &key->low, &key->high);
> +

Why not push these into the above if/else where the previous key
copy/init calls were removed from?

>  	xfs_btree_log_block(cur, rbp, XFS_BB_ALL_BITS);
>  	xfs_btree_log_block(cur, lbp, XFS_BB_NUMRECS | XFS_BB_RIGHTSIB);
>  
> @@ -2499,6 +2746,10 @@ __xfs_btree_split(
>  		xfs_btree_set_sibling(cur, rrblock, &rptr, XFS_BB_LEFTSIB);
>  		xfs_btree_log_block(cur, rrbp, XFS_BB_LEFTSIB);
>  	}
> +
> +	/* Update the left block's keys... */
> +	xfs_btree_updkeys(cur, level);
> +
>  	/*
>  	 * If the cursor is really in the right block, move it there.
>  	 * If it's just pointing past the last entry in left, then we'll
> @@ -2537,7 +2788,7 @@ struct xfs_btree_split_args {
>  	struct xfs_btree_cur	*cur;
>  	int			level;
>  	union xfs_btree_ptr	*ptrp;
> -	union xfs_btree_key	*key;
> +	struct xfs_btree_double_key	*key;
>  	struct xfs_btree_cur	**curp;
>  	int			*stat;		/* success/failure */
>  	int			result;
> @@ -2586,7 +2837,7 @@ xfs_btree_split(
>  	struct xfs_btree_cur	*cur,
>  	int			level,
>  	union xfs_btree_ptr	*ptrp,
> -	union xfs_btree_key	*key,
> +	struct xfs_btree_double_key	*key,
>  	struct xfs_btree_cur	**curp,
>  	int			*stat)		/* success/failure */
>  {
> @@ -2806,27 +3057,27 @@ xfs_btree_new_root(
>  		bp = lbp;
>  		nptr = 2;
>  	}
> +
>  	/* Fill in the new block's btree header and log it. */
>  	xfs_btree_init_block_cur(cur, nbp, cur->bc_nlevels, 2);
>  	xfs_btree_log_block(cur, nbp, XFS_BB_ALL_BITS);
>  	ASSERT(!xfs_btree_ptr_is_null(cur, &lptr) &&
>  			!xfs_btree_ptr_is_null(cur, &rptr));
> -

?

>  	/* Fill in the key data in the new root. */
>  	if (xfs_btree_get_level(left) > 0) {
> -		xfs_btree_copy_keys(cur,
> +		xfs_btree_find_node_keys(cur, left,
>  				xfs_btree_key_addr(cur, 1, new),
> -				xfs_btree_key_addr(cur, 1, left), 1);
> -		xfs_btree_copy_keys(cur,
> +				xfs_btree_high_key_addr(cur, 1, new));
> +		xfs_btree_find_node_keys(cur, right,
>  				xfs_btree_key_addr(cur, 2, new),
> -				xfs_btree_key_addr(cur, 1, right), 1);
> +				xfs_btree_high_key_addr(cur, 2, new));
>  	} else {
> -		cur->bc_ops->init_key_from_rec(
> -				xfs_btree_key_addr(cur, 1, new),
> -				xfs_btree_rec_addr(cur, 1, left));
> -		cur->bc_ops->init_key_from_rec(
> -				xfs_btree_key_addr(cur, 2, new),
> -				xfs_btree_rec_addr(cur, 1, right));
> +		xfs_btree_find_leaf_keys(cur, left,
> +			xfs_btree_key_addr(cur, 1, new),
> +			xfs_btree_high_key_addr(cur, 1, new));
> +		xfs_btree_find_leaf_keys(cur, right,
> +			xfs_btree_key_addr(cur, 2, new),
> +			xfs_btree_high_key_addr(cur, 2, new));
>  	}
>  	xfs_btree_log_keys(cur, nbp, 1, 2);
>  
> @@ -2837,6 +3088,7 @@ xfs_btree_new_root(
>  		xfs_btree_ptr_addr(cur, 2, new), &rptr, 1);
>  	xfs_btree_log_ptrs(cur, nbp, 1, 2);
>  
> +

Extra line.

>  	/* Fix up the cursor. */
>  	xfs_btree_setbuf(cur, cur->bc_nlevels, nbp);
>  	cur->bc_ptrs[cur->bc_nlevels] = nptr;
> @@ -2862,7 +3114,7 @@ xfs_btree_make_block_unfull(
>  	int			*index,	/* new tree index */
>  	union xfs_btree_ptr	*nptr,	/* new btree ptr */
>  	struct xfs_btree_cur	**ncur,	/* new btree cursor */
> -	union xfs_btree_key	*key, /* key of new block */
> +	struct xfs_btree_double_key	*key,	/* key of new block */
>  	int			*stat)
>  {
>  	int			error = 0;
> @@ -2918,6 +3170,22 @@ xfs_btree_make_block_unfull(
>  	return 0;
>  }
>  
> +/* Copy a double key into a btree block. */
> +static void
> +xfs_btree_copy_double_keys(
> +	struct xfs_btree_cur	*cur,
> +	int			ptr,
> +	struct xfs_btree_block	*block,
> +	struct xfs_btree_double_key	*key)
> +{
> +	memcpy(xfs_btree_key_addr(cur, ptr, block), &key->low,
> +			cur->bc_ops->key_len);
> +
> +	if (cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING)
> +		memcpy(xfs_btree_high_key_addr(cur, ptr, block), &key->high,
> +				cur->bc_ops->key_len);
> +}
> +
>  /*
>   * Insert one record/level.  Return information to the caller
>   * allowing the next level up to proceed if necessary.
> @@ -2927,7 +3195,7 @@ xfs_btree_insrec(
>  	struct xfs_btree_cur	*cur,	/* btree cursor */
>  	int			level,	/* level to insert record at */
>  	union xfs_btree_ptr	*ptrp,	/* i/o: block number inserted */
> -	union xfs_btree_key	*key,	/* i/o: block key for ptrp */
> +	struct xfs_btree_double_key	*key, /* i/o: block key for ptrp */
>  	struct xfs_btree_cur	**curp,	/* output: new cursor replacing cur */
>  	int			*stat)	/* success/failure */
>  {
> @@ -2935,7 +3203,7 @@ xfs_btree_insrec(
>  	struct xfs_buf		*bp;	/* buffer for block */
>  	union xfs_btree_ptr	nptr;	/* new block ptr */
>  	struct xfs_btree_cur	*ncur;	/* new btree cursor */
> -	union xfs_btree_key	nkey;	/* new block key */
> +	struct xfs_btree_double_key	nkey;	/* new block key */
>  	union xfs_btree_rec	rec;	/* record to insert */
>  	int			optr;	/* old key/record index */
>  	int			ptr;	/* key/record index */
> @@ -2944,11 +3212,12 @@ xfs_btree_insrec(
>  #ifdef DEBUG
>  	int			i;
>  #endif
> +	xfs_daddr_t		old_bn;
>  
>  	/* Make a key out of the record data to be inserted, and save it. */
>  	if (level == 0) {
>  		cur->bc_ops->init_rec_from_cur(cur, &rec);
> -		cur->bc_ops->init_key_from_rec(key, &rec);
> +		cur->bc_ops->init_key_from_rec(&key->low, &rec);
>  	}
>  
>  	XFS_BTREE_TRACE_CURSOR(cur, XBT_ENTRY);
> @@ -2983,6 +3252,7 @@ xfs_btree_insrec(
>  
>  	/* Get pointers to the btree buffer and block. */
>  	block = xfs_btree_get_block(cur, level, &bp);
> +	old_bn = bp ? bp->b_bn : XFS_BUF_DADDR_NULL;
>  	numrecs = xfs_btree_get_numrecs(block);
>  
>  #ifdef DEBUG
> @@ -2996,7 +3266,7 @@ xfs_btree_insrec(
>  			ASSERT(cur->bc_ops->recs_inorder(cur, &rec,
>  				xfs_btree_rec_addr(cur, ptr, block)));
>  		} else {
> -			ASSERT(cur->bc_ops->keys_inorder(cur, key,
> +			ASSERT(cur->bc_ops->keys_inorder(cur, &key->low,
>  				xfs_btree_key_addr(cur, ptr, block)));
>  		}
>  	}
> @@ -3059,7 +3329,7 @@ xfs_btree_insrec(
>  #endif
>  
>  		/* Now put the new data in, bump numrecs and log it. */
> -		xfs_btree_copy_keys(cur, kp, key, 1);
> +		xfs_btree_copy_double_keys(cur, ptr, block, key);
>  		xfs_btree_copy_ptrs(cur, pp, ptrp, 1);
>  		numrecs++;
>  		xfs_btree_set_numrecs(block, numrecs);
> @@ -3095,8 +3365,24 @@ xfs_btree_insrec(
>  	xfs_btree_log_block(cur, bp, XFS_BB_NUMRECS);
>  
>  	/* If we inserted at the start of a block, update the parents' keys. */

This comment is associated with the code block that has been pushed
further down, no?

> +	if (ncur && bp->b_bn != old_bn) {
> +		/*
> +		 * We just inserted into a new tree block, which means that
> +		 * the key for the block is in nkey, not the tree.
> +		 */
> +		if (level == 0)
> +			xfs_btree_find_leaf_keys(cur, block, &nkey.low,
> +					&nkey.high);
> +		else
> +			xfs_btree_find_node_keys(cur, block, &nkey.low,
> +					&nkey.high);
> +	} else {
> +		/* Updating the left block, do it the standard way. */
> +		xfs_btree_updkeys(cur, level);
> +	}
> +

Not quite sure I follow the purpose of this hunk. Is this for the case
where a btree split occurs, nkey is filled in for the new/right block
and then (after nkey is filled in) the new record ends up being added to
the new block? If so, what about the case where ncur is not created?
(It looks like that's possible from the code, but I could easily be
missing some context as to why that's not the case.)

In any event, I think we could elaborate a bit in the comment on why
this is necessary. I'd also move it above the top-level if/else.
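
Maybe something along these lines, if my reading is right:

```
	/*
	 * If a split moved the record we just inserted into the new
	 * (right-hand) block, the parent's key for that block lives in
	 * nkey rather than in the tree, so recompute nkey from the
	 * block contents.  Otherwise the record landed in the original
	 * block and we can update the parent keys in place.
	 */
```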

>  	if (optr == 1) {
> -		error = xfs_btree_updkey(cur, key, level + 1);
> +		error = xfs_btree_updkey(cur, &key->low, level + 1);
>  		if (error)
>  			goto error0;
>  	}
> @@ -3147,7 +3433,7 @@ xfs_btree_insert(
>  	union xfs_btree_ptr	nptr;	/* new block number (split result) */
>  	struct xfs_btree_cur	*ncur;	/* new cursor (split result) */
>  	struct xfs_btree_cur	*pcur;	/* previous level's cursor */
> -	union xfs_btree_key	key;	/* key of block to insert */
> +	struct xfs_btree_double_key	key;	/* key of block to insert */

Probably should fix up the function param alignment here and in the
couple or so other places where we make this change.

Brian

>  
>  	level = 0;
>  	ncur = NULL;
> @@ -3552,6 +3838,7 @@ xfs_btree_delrec(
>  	 * If we deleted the leftmost entry in the block, update the
>  	 * key values above us in the tree.
>  	 */
> +	xfs_btree_updkeys(cur, level);
>  	if (ptr == 1) {
>  		error = xfs_btree_updkey(cur, keyp, level + 1);
>  		if (error)
> @@ -3882,6 +4169,16 @@ xfs_btree_delrec(
>  	if (level > 0)
>  		cur->bc_ptrs[level]--;
>  
> +	/*
> +	 * We combined blocks, so we have to update the parent keys if the
> +	 * btree supports overlapped intervals.  However, bc_ptrs[level + 1]
> +	 * points to the old block so that the caller knows which record to
> +	 * delete.  Therefore, the caller must be savvy enough to call updkeys
> +	 * for us if we return stat == 2.  The other exit points from this
> +	 * function don't require deletions further up the tree, so they can
> +	 * call updkeys directly.
> +	 */
> +
>  	XFS_BTREE_TRACE_CURSOR(cur, XBT_EXIT);
>  	/* Return value means the next level up has something to do. */
>  	*stat = 2;
> @@ -3907,6 +4204,7 @@ xfs_btree_delete(
>  	int			error;	/* error return value */
>  	int			level;
>  	int			i;
> +	bool			joined = false;
>  
>  	XFS_BTREE_TRACE_CURSOR(cur, XBT_ENTRY);
>  
> @@ -3920,8 +4218,17 @@ xfs_btree_delete(
>  		error = xfs_btree_delrec(cur, level, &i);
>  		if (error)
>  			goto error0;
> +		if (i == 2)
> +			joined = true;
>  	}
>  
> +	/*
> +	 * If we combined blocks as part of deleting the record, delrec won't
> +	 * have updated the parent keys so we have to do that here.
> +	 */
> +	if (joined)
> +		xfs_btree_updkeys_force(cur, 0);
> +
>  	if (i == 0) {
>  		for (level = 1; level < cur->bc_nlevels; level++) {
>  			if (cur->bc_ptrs[level] == 0) {
> diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> index b99c018..a5ec6c7 100644
> --- a/fs/xfs/libxfs/xfs_btree.h
> +++ b/fs/xfs/libxfs/xfs_btree.h
> @@ -126,6 +126,9 @@ struct xfs_btree_ops {
>  	size_t	key_len;
>  	size_t	rec_len;
>  
> +	/* flags */
> +	uint	flags;
> +
>  	/* cursor operations */
>  	struct xfs_btree_cur *(*dup_cursor)(struct xfs_btree_cur *);
>  	void	(*update_cursor)(struct xfs_btree_cur *src,
> @@ -162,11 +165,21 @@ struct xfs_btree_ops {
>  				     union xfs_btree_rec *rec);
>  	void	(*init_ptr_from_cur)(struct xfs_btree_cur *cur,
>  				     union xfs_btree_ptr *ptr);
> +	void	(*init_high_key_from_rec)(union xfs_btree_key *key,
> +					  union xfs_btree_rec *rec);
>  
>  	/* difference between key value and cursor value */
>  	__int64_t (*key_diff)(struct xfs_btree_cur *cur,
>  			      union xfs_btree_key *key);
>  
> +	/*
> +	 * Difference between key2 and key1 -- positive if key2 > key1,
> +	 * negative if key2 < key1, and zero if equal.
> +	 */
> +	__int64_t (*diff_two_keys)(struct xfs_btree_cur *cur,
> +				   union xfs_btree_key *key1,
> +				   union xfs_btree_key *key2);
> +
>  	const struct xfs_buf_ops	*buf_ops;
>  
>  #if defined(DEBUG) || defined(XFS_WARN)
> @@ -182,6 +195,9 @@ struct xfs_btree_ops {
>  #endif
>  };
>  
> +/* btree ops flags */
> +#define XFS_BTREE_OPS_OVERLAPPING	(1<<0)	/* overlapping intervals */
> +
>  /*
>   * Reasons for the update_lastrec method to be called.
>   */
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index 68f27f7..ffea28c 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -38,6 +38,7 @@ struct xlog_recover_item;
>  struct xfs_buf_log_format;
>  struct xfs_inode_log_format;
>  struct xfs_bmbt_irec;
> +struct xfs_btree_cur;
>  
>  DECLARE_EVENT_CLASS(xfs_attr_list_class,
>  	TP_PROTO(struct xfs_attr_list_context *ctx),
> @@ -2183,6 +2184,41 @@ DEFINE_DISCARD_EVENT(xfs_discard_toosmall);
>  DEFINE_DISCARD_EVENT(xfs_discard_exclude);
>  DEFINE_DISCARD_EVENT(xfs_discard_busy);
>  
> +/* btree cursor events */
> +DECLARE_EVENT_CLASS(xfs_btree_cur_class,
> +	TP_PROTO(struct xfs_btree_cur *cur, int level, struct xfs_buf *bp),
> +	TP_ARGS(cur, level, bp),
> +	TP_STRUCT__entry(
> +		__field(dev_t, dev)
> +		__field(xfs_btnum_t, btnum)
> +		__field(int, level)
> +		__field(int, nlevels)
> +		__field(int, ptr)
> +		__field(xfs_daddr_t, daddr)
> +	),
> +	TP_fast_assign(
> +		__entry->dev = cur->bc_mp->m_super->s_dev;
> +		__entry->btnum = cur->bc_btnum;
> +		__entry->level = level;
> +		__entry->nlevels = cur->bc_nlevels;
> +		__entry->ptr = cur->bc_ptrs[level];
> +		__entry->daddr = bp->b_bn;
> +	),
> +	TP_printk("dev %d:%d btnum %d level %d/%d ptr %d daddr 0x%llx",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> +		  __entry->btnum,
> +		  __entry->level,
> +		  __entry->nlevels,
> +		  __entry->ptr,
> +		  (unsigned long long)__entry->daddr)
> +)
> +
> +#define DEFINE_BTREE_CUR_EVENT(name) \
> +DEFINE_EVENT(xfs_btree_cur_class, name, \
> +	TP_PROTO(struct xfs_btree_cur *cur, int level, struct xfs_buf *bp), \
> +	TP_ARGS(cur, level, bp))
> +DEFINE_BTREE_CUR_EVENT(xfs_btree_updkeys);
> +
>  #endif /* _TRACE_XFS_H */
>  
>  #undef TRACE_INCLUDE_PATH
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 014/119] xfs: introduce interval queries on btrees
  2016-06-17  1:19 ` [PATCH 014/119] xfs: introduce interval queries on btrees Darrick J. Wong
@ 2016-06-22 15:18   ` Brian Foster
  2016-06-27 21:07     ` Darrick J. Wong
  0 siblings, 1 reply; 236+ messages in thread
From: Brian Foster @ 2016-06-22 15:18 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, vishal.l.verma, xfs

On Thu, Jun 16, 2016 at 06:19:21PM -0700, Darrick J. Wong wrote:
> Create a function to enable querying of btree records mapping to a
> range of keys.  This will be used in subsequent patches to allow
> querying the reverse mapping btree to find the extents mapped to a
> range of physical blocks, though the generic code can be used for
> any range query.
> 
> v2: add some shortcuts so that we can jump out of processing once
> we know there won't be any more records to find.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_btree.c |  249 +++++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/libxfs/xfs_btree.h |   22 +++-
>  fs/xfs/xfs_trace.h        |    1 
>  3 files changed, 267 insertions(+), 5 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> index afcafd6..5f5cf23 100644
> --- a/fs/xfs/libxfs/xfs_btree.c
> +++ b/fs/xfs/libxfs/xfs_btree.c
> @@ -4509,3 +4509,252 @@ xfs_btree_calc_size(
>  	}
>  	return rval;
>  }
> +
> +/* Query a regular btree for all records overlapping a given interval. */

Can you elaborate on the search algorithm used? (More for reference
against the overlapped query, as that one is more complex).
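
For reference, my reading of the loop below, modeled over a plain sorted
array of non-overlapping extent records (untested, names made up, just to
check my understanding): the XFS_LOOKUP_LE positioning means only the
first record visited can end entirely before the range, which is why the
code below only needs the firstrec special case.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a non-overlapping extent record keyed by 'start'. */
struct model_rec {
	long	start;	/* low key */
	long	len;	/* record covers [start, start + len - 1] */
};

/*
 * Model of xfs_btree_simple_query_range(): position at the last record
 * whose key is <= low (or the first record, if none), then walk right
 * until a record starts past high.  Returns how many records the
 * callback would have seen.
 */
static size_t model_query_range(const struct model_rec *recs, size_t nrecs,
				long low, long high)
{
	size_t	i = 0;
	size_t	found = 0;

	/* XFS_LOOKUP_LE: last record with start <= low. */
	while (i + 1 < nrecs && recs[i + 1].start <= low)
		i++;

	for (; i < nrecs; i++) {
		/*
		 * Only this first record can end entirely before the
		 * range (records are sorted and don't overlap); skip it
		 * if so.
		 */
		if (recs[i].start + recs[i].len - 1 < low)
			continue;
		/* Past the end of the range?  Done. */
		if (recs[i].start > high)
			break;
		found++;	/* this is where fn() would be invoked */
	}
	return found;
}
```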

> +STATIC int
> +xfs_btree_simple_query_range(
> +	struct xfs_btree_cur		*cur,
> +	union xfs_btree_irec		*low_rec,
> +	union xfs_btree_irec		*high_rec,
> +	xfs_btree_query_range_fn	fn,
> +	void				*priv)
> +{
> +	union xfs_btree_rec		*recp;
> +	union xfs_btree_rec		rec;
> +	union xfs_btree_key		low_key;
> +	union xfs_btree_key		high_key;
> +	union xfs_btree_key		rec_key;
> +	__int64_t			diff;
> +	int				stat;
> +	bool				firstrec = true;
> +	int				error;
> +
> +	ASSERT(cur->bc_ops->init_high_key_from_rec);
> +
> +	/* Find the keys of both ends of the interval. */
> +	cur->bc_rec = *high_rec;
> +	cur->bc_ops->init_rec_from_cur(cur, &rec);
> +	cur->bc_ops->init_key_from_rec(&high_key, &rec);
> +
> +	cur->bc_rec = *low_rec;
> +	cur->bc_ops->init_rec_from_cur(cur, &rec);
> +	cur->bc_ops->init_key_from_rec(&low_key, &rec);
> +
> +	/* Find the leftmost record. */
> +	stat = 0;
> +	error = xfs_btree_lookup(cur, XFS_LOOKUP_LE, &stat);
> +	if (error)
> +		goto out;
> +
> +	while (stat) {
> +		/* Find the record. */
> +		error = xfs_btree_get_rec(cur, &recp, &stat);
> +		if (error || !stat)
> +			break;
> +
> +		/* Can we tell if this record is too low? */
> +		if (firstrec) {
> +			cur->bc_rec = *low_rec;
> +			cur->bc_ops->init_high_key_from_rec(&rec_key, recp);
> +			diff = cur->bc_ops->key_diff(cur, &rec_key);
> +			if (diff < 0)
> +				goto advloop;
> +		}
> +		firstrec = false;

This could move up into the if block.

> +
> +		/* Have we gone past the end? */
> +		cur->bc_rec = *high_rec;
> +		cur->bc_ops->init_key_from_rec(&rec_key, recp);

I'd move this up to immediately after the xfs_btree_get_rec() call and
eliminate the duplicate in the 'if (firstrec)' block above.

> +		diff = cur->bc_ops->key_diff(cur, &rec_key);
> +		if (diff > 0)
> +			break;
> +
> +		/* Callback */
> +		error = fn(cur, recp, priv);
> +		if (error < 0 || error == XFS_BTREE_QUERY_RANGE_ABORT)
> +			break;
> +
> +advloop:
> +		/* Move on to the next record. */
> +		error = xfs_btree_increment(cur, 0, &stat);
> +		if (error)
> +			break;
> +	}
> +
> +out:
> +	return error;
> +}
> +
> +/*
> + * Query an overlapped interval btree for all records overlapping a given
> + * interval.
> + */

Same comment here, can you elaborate on the search algorithm? Also, I
think an example or generic description of the rules around what records
this query returns (e.g., low_rec/high_rec vs. record low/high keys)
would be useful, particularly since I, at least, don't have much context
on the rmap+reflink scenarios quite yet.

> +STATIC int
> +xfs_btree_overlapped_query_range(
> +	struct xfs_btree_cur		*cur,
> +	union xfs_btree_irec		*low_rec,
> +	union xfs_btree_irec		*high_rec,
> +	xfs_btree_query_range_fn	fn,
> +	void				*priv)
> +{
> +	union xfs_btree_ptr		ptr;
> +	union xfs_btree_ptr		*pp;
> +	union xfs_btree_key		rec_key;
> +	union xfs_btree_key		low_key;
> +	union xfs_btree_key		high_key;
> +	union xfs_btree_key		*lkp;
> +	union xfs_btree_key		*hkp;
> +	union xfs_btree_rec		rec;
> +	union xfs_btree_rec		*recp;
> +	struct xfs_btree_block		*block;
> +	__int64_t			ldiff;
> +	__int64_t			hdiff;
> +	int				level;
> +	struct xfs_buf			*bp;
> +	int				i;
> +	int				error;
> +
> +	/* Find the keys of both ends of the interval. */
> +	cur->bc_rec = *high_rec;
> +	cur->bc_ops->init_rec_from_cur(cur, &rec);
> +	cur->bc_ops->init_key_from_rec(&high_key, &rec);
> +
> +	cur->bc_rec = *low_rec;
> +	cur->bc_ops->init_rec_from_cur(cur, &rec);
> +	cur->bc_ops->init_key_from_rec(&low_key, &rec);
> +
> +	/* Load the root of the btree. */
> +	level = cur->bc_nlevels - 1;
> +	cur->bc_ops->init_ptr_from_cur(cur, &ptr);
> +	error = xfs_btree_lookup_get_block(cur, level, &ptr, &block);
> +	if (error)
> +		return error;
> +	xfs_btree_get_block(cur, level, &bp);
> +	trace_xfs_btree_overlapped_query_range(cur, level, bp);
> +#ifdef DEBUG
> +	error = xfs_btree_check_block(cur, block, level, bp);
> +	if (error)
> +		goto out;
> +#endif
> +	cur->bc_ptrs[level] = 1;
> +
> +	while (level < cur->bc_nlevels) {
> +		block = XFS_BUF_TO_BLOCK(cur->bc_bufs[level]);
> +
> +		if (level == 0) {
> +			/* End of leaf, pop back towards the root. */
> +			if (cur->bc_ptrs[level] >
> +			    be16_to_cpu(block->bb_numrecs)) {
> +leaf_pop_up:
> +				if (level < cur->bc_nlevels - 1)
> +					cur->bc_ptrs[level + 1]++;
> +				level++;
> +				continue;
> +			}
> +
> +			recp = xfs_btree_rec_addr(cur, cur->bc_ptrs[0], block);
> +
> +			cur->bc_ops->init_high_key_from_rec(&rec_key, recp);
> +			ldiff = cur->bc_ops->diff_two_keys(cur, &low_key,
> +					&rec_key);
> +
> +			cur->bc_ops->init_key_from_rec(&rec_key, recp);
> +			hdiff = cur->bc_ops->diff_two_keys(cur, &rec_key,
> +					&high_key);
> +

This looked a little funny to me because I expected diff_two_keys() to
basically be param1 - param2. Looking ahead at the rmapbt code, it is in
fact the other way around. I'm not sure we have precedent for either
way, tbh. I still have to stare at this some more, but I wonder if a
"does record overlap" helper (with comments) would help clean this up a
bit.
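
FWIW, the two diffs reduce to the classic closed-interval overlap test:
the record [rec_low, rec_high] overlaps the query [low_key, high_key]
iff rec_high >= low_key and rec_low <= high_key. A standalone sketch of
what such a helper could look like, with plain integers standing in for
the key unions (names are made up):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Stand-in for ->diff_two_keys(): positive if key2 > key1, negative if
 * key2 < key1, zero if equal (same convention as the ops comment).
 */
static long diff_two_keys(long key1, long key2)
{
	return key2 - key1;
}

/*
 * Does the record [rec_low, rec_high] overlap the query range
 * [low_key, high_key]?  Mirrors the ldiff/hdiff checks in the hunk.
 */
static bool rec_overlaps_range(long rec_low, long rec_high,
			       long low_key, long high_key)
{
	/* >= 0 means the record ends at or after the start of the range */
	long ldiff = diff_two_keys(low_key, rec_high);
	/* >= 0 means the record starts at or before the end of the range */
	long hdiff = diff_two_keys(rec_low, high_key);

	return ldiff >= 0 && hdiff >= 0;
}
```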

> +			/* If the record matches, callback */
> +			if (ldiff >= 0 && hdiff >= 0) {
> +				error = fn(cur, recp, priv);
> +				if (error < 0 ||
> +				    error == XFS_BTREE_QUERY_RANGE_ABORT)
> +					break;
> +			} else if (hdiff < 0) {
> +				/* Record is larger than high key; pop. */
> +				goto leaf_pop_up;
> +			}
> +			cur->bc_ptrs[level]++;
> +			continue;
> +		}
> +
> +		/* End of node, pop back towards the root. */
> +		if (cur->bc_ptrs[level] > be16_to_cpu(block->bb_numrecs)) {
> +node_pop_up:
> +			if (level < cur->bc_nlevels - 1)
> +				cur->bc_ptrs[level + 1]++;
> +			level++;
> +			continue;

Looks like the same code as leaf_pop_up. I wonder if we can bury this at the
end of the loop with a common label.
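
I.e., something like this untested sketch, with the leaf and node "end
of block" checks (and the two "past the range" checks) all jumping to a
single label at the bottom of the loop:

```
		...
		cur->bc_ptrs[level]++;
		continue;

pop_up:
		if (level < cur->bc_nlevels - 1)
			cur->bc_ptrs[level + 1]++;
		level++;
	}
```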

> +		}
> +
> +		lkp = xfs_btree_key_addr(cur, cur->bc_ptrs[level], block);
> +		hkp = xfs_btree_high_key_addr(cur, cur->bc_ptrs[level], block);
> +		pp = xfs_btree_ptr_addr(cur, cur->bc_ptrs[level], block);
> +
> +		ldiff = cur->bc_ops->diff_two_keys(cur, &low_key, hkp);
> +		hdiff = cur->bc_ops->diff_two_keys(cur, lkp, &high_key);
> +
> +		/* If the key matches, drill another level deeper. */
> +		if (ldiff >= 0 && hdiff >= 0) {
> +			level--;
> +			error = xfs_btree_lookup_get_block(cur, level, pp,
> +					&block);
> +			if (error)
> +				goto out;
> +			xfs_btree_get_block(cur, level, &bp);
> +			trace_xfs_btree_overlapped_query_range(cur, level, bp);
> +#ifdef DEBUG
> +			error = xfs_btree_check_block(cur, block, level, bp);
> +			if (error)
> +				goto out;
> +#endif
> +			cur->bc_ptrs[level] = 1;
> +			continue;
> +		} else if (hdiff < 0) {
> +			/* The low key is larger than the upper range; pop. */
> +			goto node_pop_up;
> +		}
> +		cur->bc_ptrs[level]++;
> +	}
> +
> +out:
> +	/*
> +	 * If we don't end this function with the cursor pointing at a record
> +	 * block, a subsequent non-error cursor deletion will not release
> +	 * node-level buffers, causing a buffer leak.  This is quite possible
> +	 * with a zero-results range query, so release the buffers if we
> +	 * failed to return any results.
> +	 */
> +	if (cur->bc_bufs[0] == NULL) {
> +		for (i = 0; i < cur->bc_nlevels; i++) {
> +			if (cur->bc_bufs[i]) {
> +				xfs_trans_brelse(cur->bc_tp, cur->bc_bufs[i]);
> +				cur->bc_bufs[i] = NULL;
> +				cur->bc_ptrs[i] = 0;
> +				cur->bc_ra[i] = 0;
> +			}
> +		}
> +	}
> +
> +	return error;
> +}
> +
> +/*
> + * Query a btree for all records overlapping a given interval of keys.  The
> + * supplied function will be called with each record found; return one of the
> + * XFS_BTREE_QUERY_RANGE_{CONTINUE,ABORT} values or the usual negative error
> + * code.  This function returns XFS_BTREE_QUERY_RANGE_ABORT, zero, or a
> + * negative error code.
> + */
> +int
> +xfs_btree_query_range(
> +	struct xfs_btree_cur		*cur,
> +	union xfs_btree_irec		*low_rec,
> +	union xfs_btree_irec		*high_rec,
> +	xfs_btree_query_range_fn	fn,
> +	void				*priv)
> +{
> +	if (!(cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING))
> +		return xfs_btree_simple_query_range(cur, low_rec,
> +				high_rec, fn, priv);
> +	return xfs_btree_overlapped_query_range(cur, low_rec, high_rec,
> +			fn, priv);
> +}
> diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> index a5ec6c7..898fee5 100644
> --- a/fs/xfs/libxfs/xfs_btree.h
> +++ b/fs/xfs/libxfs/xfs_btree.h
> @@ -206,6 +206,12 @@ struct xfs_btree_ops {
>  #define LASTREC_DELREC	2
>  
>  
> +union xfs_btree_irec {
> +	xfs_alloc_rec_incore_t		a;
> +	xfs_bmbt_irec_t			b;
> +	xfs_inobt_rec_incore_t		i;
> +};
> +

We might as well kill off the typedef usage here.
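
I.e.:

```
union xfs_btree_irec {
	struct xfs_alloc_rec_incore	a;
	struct xfs_bmbt_irec		b;
	struct xfs_inobt_rec_incore	i;
};
```

(assuming those are the struct tags behind the typedefs).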

Brian

>  /*
>   * Btree cursor structure.
>   * This collects all information needed by the btree code in one place.
> @@ -216,11 +222,7 @@ typedef struct xfs_btree_cur
>  	struct xfs_mount	*bc_mp;	/* file system mount struct */
>  	const struct xfs_btree_ops *bc_ops;
>  	uint			bc_flags; /* btree features - below */
> -	union {
> -		xfs_alloc_rec_incore_t	a;
> -		xfs_bmbt_irec_t		b;
> -		xfs_inobt_rec_incore_t	i;
> -	}		bc_rec;		/* current insert/search record value */
> +	union xfs_btree_irec	bc_rec;	/* current insert/search record value */
>  	struct xfs_buf	*bc_bufs[XFS_BTREE_MAXLEVELS];	/* buf ptr per level */
>  	int		bc_ptrs[XFS_BTREE_MAXLEVELS];	/* key/record # */
>  	__uint8_t	bc_ra[XFS_BTREE_MAXLEVELS];	/* readahead bits */
> @@ -494,4 +496,14 @@ xfs_extlen_t xfs_btree_calc_size(struct xfs_mount *mp, uint *limits,
>  uint xfs_btree_compute_maxlevels(struct xfs_mount *mp, uint *limits,
>  		unsigned long len);
>  
> +/* return codes */
> +#define XFS_BTREE_QUERY_RANGE_CONTINUE	0	/* keep iterating */
> +#define XFS_BTREE_QUERY_RANGE_ABORT	1	/* stop iterating */
> +typedef int (*xfs_btree_query_range_fn)(struct xfs_btree_cur *cur,
> +		union xfs_btree_rec *rec, void *priv);
> +
> +int xfs_btree_query_range(struct xfs_btree_cur *cur,
> +		union xfs_btree_irec *low_rec, union xfs_btree_irec *high_rec,
> +		xfs_btree_query_range_fn fn, void *priv);
> +
>  #endif	/* __XFS_BTREE_H__ */
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index ffea28c..f0ac9c9 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -2218,6 +2218,7 @@ DEFINE_EVENT(xfs_btree_cur_class, name, \
>  	TP_PROTO(struct xfs_btree_cur *cur, int level, struct xfs_buf *bp), \
>  	TP_ARGS(cur, level, bp))
>  DEFINE_BTREE_CUR_EVENT(xfs_btree_updkeys);
> +DEFINE_BTREE_CUR_EVENT(xfs_btree_overlapped_query_range);
>  
>  #endif /* _TRACE_XFS_H */
>  
> 


* Re: [PATCH 015/119] xfs: refactor btree owner change into a separate visit-blocks function
  2016-06-17  1:19 ` [PATCH 015/119] xfs: refactor btree owner change into a separate visit-blocks function Darrick J. Wong
@ 2016-06-23 17:19   ` Brian Foster
  0 siblings, 0 replies; 236+ messages in thread
From: Brian Foster @ 2016-06-23 17:19 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, vishal.l.verma, xfs

On Thu, Jun 16, 2016 at 06:19:28PM -0700, Darrick J. Wong wrote:
> Refactor the btree_change_owner function into a more generic apparatus
> which visits all blocks in a btree.  We'll use this in a subsequent
> patch for counting btree blocks for AG reservations.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/libxfs/xfs_btree.c |  141 +++++++++++++++++++++++++++++----------------
>  fs/xfs/libxfs/xfs_btree.h |    5 ++
>  2 files changed, 96 insertions(+), 50 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> index 5f5cf23..eac876a 100644
> --- a/fs/xfs/libxfs/xfs_btree.c
> +++ b/fs/xfs/libxfs/xfs_btree.c
> @@ -4289,6 +4289,81 @@ xfs_btree_get_rec(
>  	return 0;
>  }
>  
> +/* Visit a block in a btree. */
> +STATIC int
> +xfs_btree_visit_block(
> +	struct xfs_btree_cur		*cur,
> +	int				level,
> +	xfs_btree_visit_blocks_fn	fn,
> +	void				*data)
> +{
> +	struct xfs_btree_block		*block;
> +	struct xfs_buf			*bp;
> +	union xfs_btree_ptr		rptr;
> +	int				error;
> +
> +	/* do right sibling readahead */
> +	xfs_btree_readahead(cur, level, XFS_BTCUR_RIGHTRA);
> +	block = xfs_btree_get_block(cur, level, &bp);
> +
> +	/* process the block */
> +	error = fn(cur, level, data);
> +	if (error)
> +		return error;
> +
> +	/* now read rh sibling block for next iteration */
> +	xfs_btree_get_sibling(cur, block, &rptr, XFS_BB_RIGHTSIB);
> +	if (xfs_btree_ptr_is_null(cur, &rptr))
> +		return -ENOENT;
> +
> +	return xfs_btree_lookup_get_block(cur, level, &rptr, &block);
> +}
> +
> +
> +/* Visit every block in a btree. */
> +int
> +xfs_btree_visit_blocks(
> +	struct xfs_btree_cur		*cur,
> +	xfs_btree_visit_blocks_fn	fn,
> +	void				*data)
> +{
> +	union xfs_btree_ptr		lptr;
> +	int				level;
> +	struct xfs_btree_block		*block = NULL;
> +	int				error = 0;
> +
> +	cur->bc_ops->init_ptr_from_cur(cur, &lptr);
> +
> +	/* for each level */
> +	for (level = cur->bc_nlevels - 1; level >= 0; level--) {
> +		/* grab the left hand block */
> +		error = xfs_btree_lookup_get_block(cur, level, &lptr, &block);
> +		if (error)
> +			return error;
> +
> +		/* readahead the left most block for the next level down */
> +		if (level > 0) {
> +			union xfs_btree_ptr     *ptr;
> +
> +			ptr = xfs_btree_ptr_addr(cur, 1, block);
> +			xfs_btree_readahead_ptr(cur, ptr, 1);
> +
> +			/* save for the next iteration of the loop */
> +			lptr = *ptr;
> +		}
> +
> +		/* for each buffer in the level */
> +		do {
> +			error = xfs_btree_visit_block(cur, level, fn, data);
> +		} while (!error);
> +
> +		if (error != -ENOENT)
> +			return error;
> +	}
> +
> +	return 0;
> +}
> +
>  /*
>   * Change the owner of a btree.
>   *
> @@ -4313,26 +4388,27 @@ xfs_btree_get_rec(
>   * just queue the modified buffer as delayed write buffer so the transaction
>   * recovery completion writes the changes to disk.
>   */
> +struct xfs_btree_block_change_owner_info {
> +	__uint64_t		new_owner;
> +	struct list_head	*buffer_list;
> +};
> +
>  static int
>  xfs_btree_block_change_owner(
>  	struct xfs_btree_cur	*cur,
>  	int			level,
> -	__uint64_t		new_owner,
> -	struct list_head	*buffer_list)
> +	void			*data)
>  {
> +	struct xfs_btree_block_change_owner_info	*bbcoi = data;
>  	struct xfs_btree_block	*block;
>  	struct xfs_buf		*bp;
> -	union xfs_btree_ptr     rptr;
> -
> -	/* do right sibling readahead */
> -	xfs_btree_readahead(cur, level, XFS_BTCUR_RIGHTRA);
>  
>  	/* modify the owner */
>  	block = xfs_btree_get_block(cur, level, &bp);
>  	if (cur->bc_flags & XFS_BTREE_LONG_PTRS)
> -		block->bb_u.l.bb_owner = cpu_to_be64(new_owner);
> +		block->bb_u.l.bb_owner = cpu_to_be64(bbcoi->new_owner);
>  	else
> -		block->bb_u.s.bb_owner = cpu_to_be32(new_owner);
> +		block->bb_u.s.bb_owner = cpu_to_be32(bbcoi->new_owner);
>  
>  	/*
>  	 * If the block is a root block hosted in an inode, we might not have a
> @@ -4346,19 +4422,14 @@ xfs_btree_block_change_owner(
>  			xfs_trans_ordered_buf(cur->bc_tp, bp);
>  			xfs_btree_log_block(cur, bp, XFS_BB_OWNER);
>  		} else {
> -			xfs_buf_delwri_queue(bp, buffer_list);
> +			xfs_buf_delwri_queue(bp, bbcoi->buffer_list);
>  		}
>  	} else {
>  		ASSERT(cur->bc_flags & XFS_BTREE_ROOT_IN_INODE);
>  		ASSERT(level == cur->bc_nlevels - 1);
>  	}
>  
> -	/* now read rh sibling block for next iteration */
> -	xfs_btree_get_sibling(cur, block, &rptr, XFS_BB_RIGHTSIB);
> -	if (xfs_btree_ptr_is_null(cur, &rptr))
> -		return -ENOENT;
> -
> -	return xfs_btree_lookup_get_block(cur, level, &rptr, &block);
> +	return 0;
>  }
>  
>  int
> @@ -4367,43 +4438,13 @@ xfs_btree_change_owner(
>  	__uint64_t		new_owner,
>  	struct list_head	*buffer_list)
>  {
> -	union xfs_btree_ptr     lptr;
> -	int			level;
> -	struct xfs_btree_block	*block = NULL;
> -	int			error = 0;
> +	struct xfs_btree_block_change_owner_info	bbcoi;
>  
> -	cur->bc_ops->init_ptr_from_cur(cur, &lptr);
> +	bbcoi.new_owner = new_owner;
> +	bbcoi.buffer_list = buffer_list;
>  
> -	/* for each level */
> -	for (level = cur->bc_nlevels - 1; level >= 0; level--) {
> -		/* grab the left hand block */
> -		error = xfs_btree_lookup_get_block(cur, level, &lptr, &block);
> -		if (error)
> -			return error;
> -
> -		/* readahead the left most block for the next level down */
> -		if (level > 0) {
> -			union xfs_btree_ptr     *ptr;
> -
> -			ptr = xfs_btree_ptr_addr(cur, 1, block);
> -			xfs_btree_readahead_ptr(cur, ptr, 1);
> -
> -			/* save for the next iteration of the loop */
> -			lptr = *ptr;
> -		}
> -
> -		/* for each buffer in the level */
> -		do {
> -			error = xfs_btree_block_change_owner(cur, level,
> -							     new_owner,
> -							     buffer_list);
> -		} while (!error);
> -
> -		if (error != -ENOENT)
> -			return error;
> -	}
> -
> -	return 0;
> +	return xfs_btree_visit_blocks(cur, xfs_btree_block_change_owner,
> +			&bbcoi);
>  }
>  
>  /**
> diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> index 898fee5..0ec3055 100644
> --- a/fs/xfs/libxfs/xfs_btree.h
> +++ b/fs/xfs/libxfs/xfs_btree.h
> @@ -506,4 +506,9 @@ int xfs_btree_query_range(struct xfs_btree_cur *cur,
>  		union xfs_btree_irec *low_rec, union xfs_btree_irec *high_rec,
>  		xfs_btree_query_range_fn fn, void *priv);
>  
> +typedef int (*xfs_btree_visit_blocks_fn)(struct xfs_btree_cur *cur, int level,
> +		void *data);
> +int xfs_btree_visit_blocks(struct xfs_btree_cur *cur,
> +		xfs_btree_visit_blocks_fn fn, void *data);
> +
>  #endif	/* __XFS_BTREE_H__ */
> 


* Re: [PATCH 016/119] xfs: move deferred operations into a separate file
  2016-06-17  1:19 ` [PATCH 016/119] xfs: move deferred operations into a separate file Darrick J. Wong
@ 2016-06-27 13:14   ` Brian Foster
  2016-06-27 19:14     ` Darrick J. Wong
  0 siblings, 1 reply; 236+ messages in thread
From: Brian Foster @ 2016-06-27 13:14 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, vishal.l.verma, xfs

On Thu, Jun 16, 2016 at 06:19:34PM -0700, Darrick J. Wong wrote:
> All the code around struct xfs_bmap_free basically implements a
> deferred operation framework through which we can roll transactions
> (to unlock buffers and avoid violating lock order rules) while
> managing all the necessary log redo items.  Previously we only used
> this code to free extents after some sort of mapping operation, but
> with the advent of rmap and reflink, we suddenly need to do more than
> that.
> 
> With that in mind, xfs_bmap_free really becomes a deferred ops control
> structure.  Rename the structure and move the deferred ops into their
> own file to avoid further bloating of the bmap code.
> 
> v2: actually sort the work items by AG to avoid deadlocks
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

So if I'm following this correctly, we 1) abstract the bmap freeing
infrastructure into a generic mechanism and 2) enhance it a bit to
provide things like partial intent completion. If so, and for future
reference, this would probably be easier to review if the abstraction
and enhancement were done separately. It's probably not worth that at
this point, however...

>  fs/xfs/Makefile           |    2 
>  fs/xfs/libxfs/xfs_defer.c |  471 +++++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/libxfs/xfs_defer.h |   96 +++++++++
>  fs/xfs/xfs_defer_item.c   |   36 +++
>  fs/xfs/xfs_super.c        |    2 
>  5 files changed, 607 insertions(+)
>  create mode 100644 fs/xfs/libxfs/xfs_defer.c
>  create mode 100644 fs/xfs/libxfs/xfs_defer.h
>  create mode 100644 fs/xfs/xfs_defer_item.c
> 
> 
> diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
> index 3542d94..ad46a2d 100644
> --- a/fs/xfs/Makefile
> +++ b/fs/xfs/Makefile
> @@ -39,6 +39,7 @@ xfs-y				+= $(addprefix libxfs/, \
>  				   xfs_btree.o \
>  				   xfs_da_btree.o \
>  				   xfs_da_format.o \
> +				   xfs_defer.o \
>  				   xfs_dir2.o \
>  				   xfs_dir2_block.o \
>  				   xfs_dir2_data.o \
> @@ -66,6 +67,7 @@ xfs-y				+= xfs_aops.o \
>  				   xfs_attr_list.o \
>  				   xfs_bmap_util.o \
>  				   xfs_buf.o \
> +				   xfs_defer_item.o \
>  				   xfs_dir2_readdir.o \
>  				   xfs_discard.o \
>  				   xfs_error.o \
> diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
> new file mode 100644
> index 0000000..ad14e33e
> --- /dev/null
> +++ b/fs/xfs/libxfs/xfs_defer.c
> @@ -0,0 +1,471 @@
> +/*
> + * Copyright (C) 2016 Oracle.  All Rights Reserved.
> + *
> + * Author: Darrick J. Wong <darrick.wong@oracle.com>
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version 2
> + * of the License, or (at your option) any later version.
> + *
> + * This program is distributed in the hope that it would be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write the Free Software Foundation,
> + * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
> + */
> +#include "xfs.h"
> +#include "xfs_fs.h"
> +#include "xfs_shared.h"
> +#include "xfs_format.h"
> +#include "xfs_log_format.h"
> +#include "xfs_trans_resv.h"
> +#include "xfs_bit.h"
> +#include "xfs_sb.h"
> +#include "xfs_mount.h"
> +#include "xfs_defer.h"
> +#include "xfs_trans.h"
> +#include "xfs_trace.h"
> +
> +/*
> + * Deferred Operations in XFS
> + *
> + * Due to the way locking rules work in XFS, certain transactions (block
> + * mapping and unmapping, typically) have permanent reservations so that
> + * we can roll the transaction to adhere to AG locking order rules and
> + * to unlock buffers between metadata updates.  Prior to rmap/reflink,
> + * the mapping code had a mechanism to perform these deferrals for
> + * extents that were going to be freed; this code makes that facility
> + * more generic.
> + *
> + * When adding the reverse mapping and reflink features, it became
> + * necessary to perform complex remapping multi-transactions to comply
> + * with AG locking order rules, and to be able to spread a single
> + * refcount update operation (an operation on an n-block extent can
> + * update as many as n records!) among multiple transactions.  XFS can
> + * roll a transaction to facilitate this, but using this facility
> + * requires us to log "intent" items in case log recovery needs to
> + * redo the operation, and to log "done" items to indicate that redo
> + * is not necessary.
> + *
> + * The xfs_defer_ops structure tracks incoming deferred work (which is
> + * work that has not yet had an intent logged) in xfs_defer_intake.

Do you mean xfs_defer_pending rather than xfs_defer_intake?

> + * There is one xfs_defer_intake for each type of deferrable
> + * operation.  Each new deferral is placed in the op's intake list,
> + * where it waits for the caller to finish the deferred operations.
> + *
> + * Finishing a set of deferred operations is an involved process.  To
> + * start, we define "rolling a deferred-op transaction" as follows:
> + *
> + * > For each xfs_defer_intake,
> + *   - Sort the items on the intake list in AG order.
> + *   - Create a log intent item for that type.
> + *   - Attach to it the items on the intake list.
> + *   - Stash the intent+items for later in an xfs_defer_pending.

Does this mean "the pending list?"

Thanks for the big comment and example below. It looks like the
terminology is a bit out of sync with the latest code (I'm guessing the
design and data structures evolved a bit since this was written).

> + *   - Attach the xfs_defer_pending to the xfs_defer_ops work list.
> + * > Roll the transaction.
> + *
> + * NOTE: To avoid exceeding the transaction reservation, we limit the
> + * number of items that we attach to a given xfs_defer_pending.
> + *
> + * The actual finishing process looks like this:
> + *
> + * > For each xfs_defer_pending in the xfs_defer_ops work list,
> + *   - Roll the deferred-op transaction as above.
> + *   - Create a log done item for that type, and attach it to the
> + *     intent item.
> + *   - For each work item attached to the intent item,
> + *     * Perform the described action.
> + *     * Attach the work item to the log done item.
> + *     * If the result of doing the work was -EAGAIN, log a fresh
> + *       intent item and attach all remaining work items to it.  Put
> + *       the xfs_defer_pending item back on the work list, and repeat
> + *       the loop.  This allows us to make partial progress even if
> + *       the transaction is too full to finish the job.
> + *
> + * The key here is that we must log an intent item for all pending
> + * work items every time we roll the transaction, and that we must log
> + * a done item as soon as the work is completed.  With this mechanism
> + * we can perform complex remapping operations, chaining intent items
> + * as needed.
> + *
> + * This is an example of remapping the extent (E, E+B) into file X at
> + * offset A and dealing with the extent (C, C+B) already being mapped
> + * there:
> + * +-------------------------------------------------+
> + * | Unmap file X startblock C offset A length B     | t0
> + * | Intent to reduce refcount for extent (C, B)     |
> + * | Intent to remove rmap (X, C, A, B)              |
> + * | Intent to free extent (D, 1) (bmbt block)       |
> + * | Intent to map (X, A, B) at startblock E         |
> + * +-------------------------------------------------+
> + * | Map file X startblock E offset A length B       | t1
> + * | Done mapping (X, E, A, B)                       |
> + * | Intent to increase refcount for extent (E, B)   |
> + * | Intent to add rmap (X, E, A, B)                 |
> + * +-------------------------------------------------+
> + * | Reduce refcount for extent (C, B)               | t2
> + * | Done reducing refcount for extent (C, B)        |
> + * | Increase refcount for extent (E, B)             |
> + * | Done increasing refcount for extent (E, B)      |
> + * | Intent to free extent (C, B)                    |
> + * | Intent to free extent (F, 1) (refcountbt block) |
> + * | Intent to remove rmap (F, 1, REFC)              |
> + * +-------------------------------------------------+
> + * | Remove rmap (X, C, A, B)                        | t3
> + * | Done removing rmap (X, C, A, B)                 |
> + * | Add rmap (X, E, A, B)                           |
> + * | Done adding rmap (X, E, A, B)                   |
> + * | Remove rmap (F, 1, REFC)                        |
> + * | Done removing rmap (F, 1, REFC)                 |
> + * +-------------------------------------------------+
> + * | Free extent (C, B)                              | t4
> + * | Done freeing extent (C, B)                      |
> + * | Free extent (D, 1)                              |
> + * | Done freeing extent (D, 1)                      |
> + * | Free extent (F, 1)                              |
> + * | Done freeing extent (F, 1)                      |
> + * +-------------------------------------------------+
> + *
> + * If we should crash before t2 commits, log recovery replays
> + * the following intent items:
> + *
> + * - Intent to reduce refcount for extent (C, B)
> + * - Intent to remove rmap (X, C, A, B)
> + * - Intent to free extent (D, 1) (bmbt block)
> + * - Intent to increase refcount for extent (E, B)
> + * - Intent to add rmap (X, E, A, B)
> + *
> + * In the process of recovering, it should also generate and take care
> + * of these intent items:
> + *
> + * - Intent to free extent (C, B)
> + * - Intent to free extent (F, 1) (refcountbt block)
> + * - Intent to remove rmap (F, 1, REFC)
> + */
> +
> +static const struct xfs_defer_op_type *defer_op_types[XFS_DEFER_OPS_TYPE_MAX];
> +
> +/*
> + * For each pending item in the intake list, log its intent item and the
> + * associated extents, then add the entire intake list to the end of
> + * the pending list.
> + */
> +STATIC void

I don't think we're using 'STATIC' any longer. Better to use 'static' so
we can eventually kill off the former.

> +xfs_defer_intake_work(
> +	struct xfs_trans		*tp,
> +	struct xfs_defer_ops		*dop)
> +{
> +	struct list_head		*li;
> +	struct xfs_defer_pending	*dfp;
> +
> +	list_for_each_entry(dfp, &dop->dop_intake, dfp_list) {
> +		dfp->dfp_intent = dfp->dfp_type->create_intent(tp,
> +				dfp->dfp_count);
> +		list_sort(tp->t_mountp, &dfp->dfp_work,
> +				dfp->dfp_type->diff_items);
> +		list_for_each(li, &dfp->dfp_work)
> +			dfp->dfp_type->log_item(tp, dfp->dfp_intent, li);
> +	}
> +
> +	list_splice_tail_init(&dop->dop_intake, &dop->dop_pending);
> +}
> +
> +/* Abort all the intents that were committed. */
> +STATIC void
> +xfs_defer_trans_abort(
> +	struct xfs_trans		*tp,
> +	struct xfs_defer_ops		*dop,
> +	int				error)
> +{
> +	struct xfs_defer_pending	*dfp;
> +
> +	/*
> +	 * If the transaction was committed, drop the intent reference
> +	 * since we're bailing out of here. The other reference is
> +	 * dropped when the intent hits the AIL.  If the transaction
> +	 * was not committed, the intent is freed by the intent item
> +	 * unlock handler on abort.
> +	 */
> +	if (!dop->dop_committed)
> +		return;
> +
> +	/* Abort intent items. */
> +	list_for_each_entry(dfp, &dop->dop_pending, dfp_list) {
> +		if (dfp->dfp_committed)
> +			dfp->dfp_type->abort_intent(dfp->dfp_intent);
> +	}
> +
> +	/* Shut down FS. */
> +	xfs_force_shutdown(tp->t_mountp, (error == -EFSCORRUPTED) ?
> +			SHUTDOWN_CORRUPT_INCORE : SHUTDOWN_META_IO_ERROR);
> +}
> +
> +/* Roll a transaction so we can do some deferred op processing. */
> +STATIC int
> +xfs_defer_trans_roll(
> +	struct xfs_trans		**tp,
> +	struct xfs_defer_ops		*dop,
> +	struct xfs_inode		*ip)
> +{
> +	int				i;
> +	int				error;
> +
> +	/* Log all the joined inodes except the one we passed in. */
> +	for (i = 0; i < XFS_DEFER_OPS_NR_INODES && dop->dop_inodes[i]; i++) {
> +		if (dop->dop_inodes[i] == ip)
> +			continue;
> +		xfs_trans_log_inode(*tp, dop->dop_inodes[i], XFS_ILOG_CORE);
> +	}
> +
> +	/* Roll the transaction. */
> +	error = xfs_trans_roll(tp, ip);
> +	if (error) {
> +		xfs_defer_trans_abort(*tp, dop, error);
> +		return error;
> +	}
> +	dop->dop_committed = true;
> +
> +	/* Log all the joined inodes except the one we passed in. */

Rejoin?

> +	for (i = 0; i < XFS_DEFER_OPS_NR_INODES && dop->dop_inodes[i]; i++) {
> +		if (dop->dop_inodes[i] == ip)
> +			continue;
> +		xfs_trans_ijoin(*tp, dop->dop_inodes[i], 0);
> +	}
> +
> +	return error;
> +}
> +
> +/* Do we have any work items to finish? */
> +bool
> +xfs_defer_has_unfinished_work(
> +	struct xfs_defer_ops		*dop)
> +{
> +	return !list_empty(&dop->dop_pending) || !list_empty(&dop->dop_intake);
> +}
> +
> +/*
> + * Add this inode to the deferred op.  Each joined inode is relogged
> + * each time we roll the transaction, in addition to any inode passed
> + * to xfs_defer_finish().
> + */
> +int
> +xfs_defer_join(
> +	struct xfs_defer_ops		*dop,
> +	struct xfs_inode		*ip)
> +{
> +	int				i;
> +
> +	for (i = 0; i < XFS_DEFER_OPS_NR_INODES; i++) {
> +		if (dop->dop_inodes[i] == ip)
> +			return 0;
> +		else if (dop->dop_inodes[i] == NULL) {
> +			dop->dop_inodes[i] = ip;
> +			return 0;
> +		}
> +	}
> +
> +	return -EFSCORRUPTED;
> +}
> +
> +/*
> + * Finish all the pending work.  This involves logging intent items for
> + * any work items that wandered in since the last transaction roll (if
> + * one has even happened), rolling the transaction, and finishing the
> + * work items in the first item on the logged-and-pending list.
> + *
> + * If an inode is provided, relog it to the new transaction.
> + */
> +int
> +xfs_defer_finish(
> +	struct xfs_trans		**tp,
> +	struct xfs_defer_ops		*dop,
> +	struct xfs_inode		*ip)
> +{
> +	struct xfs_defer_pending	*dfp;
> +	struct list_head		*li;
> +	struct list_head		*n;
> +	void				*done_item = NULL;
> +	void				*state;
> +	int				error = 0;
> +	void				(*cleanup_fn)(struct xfs_trans *, void *, int);
> +
> +	ASSERT((*tp)->t_flags & XFS_TRANS_PERM_LOG_RES);
> +
> +	/* Until we run out of pending work to finish... */
> +	while (xfs_defer_has_unfinished_work(dop)) {
> +		/* Log intents for work items sitting in the intake. */
> +		xfs_defer_intake_work(*tp, dop);
> +
> +		/* Roll the transaction. */
> +		error = xfs_defer_trans_roll(tp, dop, ip);
> +		if (error)
> +			goto out;
> +
> +		/* Mark all pending intents as committed. */
> +		list_for_each_entry_reverse(dfp, &dop->dop_pending, dfp_list) {
> +			if (dfp->dfp_committed)
> +				break;
> +			dfp->dfp_committed = true;
> +		}
> +
> +		/* Log an intent-done item for the first pending item. */
> +		dfp = list_first_entry(&dop->dop_pending,
> +				struct xfs_defer_pending, dfp_list);
> +		done_item = dfp->dfp_type->create_done(*tp, dfp->dfp_intent,
> +				dfp->dfp_count);
> +		cleanup_fn = dfp->dfp_type->finish_cleanup;
> +
> +		/* Finish the work items. */
> +		state = NULL;
> +		list_for_each_safe(li, n, &dfp->dfp_work) {
> +			list_del(li);
> +			dfp->dfp_count--;
> +			error = dfp->dfp_type->finish_item(*tp, dop, li,
> +					done_item, &state);
> +			if (error == -EAGAIN) {
> +				/*
> +				 * If the caller needs to try again, put the
> +				 * item back on the pending list and jump out
> +				 * for further processing.

A little confused by the terminology here. Perhaps better to say "back
on the work list" rather than "pending list"?

Also, what is the meaning/purpose of -EAGAIN here? This isn't used by
the extent free bits so I'm missing some context. For example, is there
an issue with carrying a done_item with an unexpected list count? Is it
expected that xfs_defer_finish() will not return until -EAGAIN is
"cleared" (does relogging below and rolling somehow address this)?
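To make sure I understand the -EAGAIN contract, here's a toy model of what I think the loop is doing: finish_item() bails when the (simulated) reservation runs out, and the caller "rolls" and retries the same item until the list drains. The budget abstraction and all names below are mine, not the patch's:

```c
#include <assert.h>
#include <errno.h>

/*
 * Toy model of partial intent completion.  "budget" stands in for the
 * remaining transaction reservation; rolling resets it.  Assumes each
 * item fits within a fresh reservation so the loop always progresses.
 */
static int
finish_item(int *budget, int cost)
{
	if (*budget < cost)
		return -EAGAIN;		/* transaction too full; roll first */
	*budget -= cost;
	return 0;
}

/*
 * Drain the items, rolling (and, in the real code, logging a fresh
 * intent for the remainder) whenever finish_item() returns -EAGAIN.
 * Returns how many rolls -EAGAIN forced.
 */
static int
finish_all(const int *costs, int n, int budget_per_trans)
{
	int budget = budget_per_trans;
	int rolls = 0;
	int i = 0;

	while (i < n) {
		if (finish_item(&budget, costs[i]) == -EAGAIN) {
			budget = budget_per_trans;	/* roll */
			rolls++;
			continue;			/* retry same item */
		}
		i++;
	}
	return rolls;
}
```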

> +				 */
> +				list_add(li, &dfp->dfp_work);
> +				dfp->dfp_count++;
> +				break;
> +			} else if (error) {
> +				/*
> +				 * Clean up after ourselves and jump out.
> +				 * xfs_defer_cancel will take care of freeing
> +				 * all these lists and stuff.
> +				 */
> +				if (cleanup_fn)
> +					cleanup_fn(*tp, state, error);
> +				xfs_defer_trans_abort(*tp, dop, error);
> +				goto out;
> +			}
> +		}
> +		if (error == -EAGAIN) {
> +			/*
> +			 * Log a new intent, relog all the remaining work
> +			 * items to the new intent, attach the new intent to
> +			 * the dfp, and leave the dfp at the head of the list
> +			 * for further processing.
> +			 */

Similar to the above, could you elaborate on the mechanics of this with
respect to the log?  E.g., the comment kind of just repeats what the
code does as opposed to explaining why it's here. Is the point here to
log a new intent in the same transaction as the done item to ensure
that we (atomically) indicate that certain operations need to be
replayed if this transaction hits the log and then we crash?
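If the answer is yes, a toy recovery model shows why that pairing would matter: recovery replays any intent with no matching done record, so if a done item ever hit the log without the replacement intent for the leftover work, that work would be silently dropped. (The model and all names below are mine, not the patch's:)

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy log-recovery model: an intent is replayed iff no later done
 * record matches it.  This is why the done item for the finished work
 * and the fresh intent for the remainder must commit atomically.
 */
enum rec_type { INTENT, DONE };

struct log_rec {
	enum rec_type	type;
	int		id;
};

/* Count the intents recovery would replay after a crash. */
static int
replay_count(const struct log_rec *log, int n)
{
	int count = 0;

	for (int i = 0; i < n; i++) {
		bool done = false;

		if (log[i].type != INTENT)
			continue;
		for (int j = i + 1; j < n; j++)
			if (log[j].type == DONE && log[j].id == log[i].id)
				done = true;
		if (!done)
			count++;
	}
	return count;
}
```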

Brian

> +			dfp->dfp_intent = dfp->dfp_type->create_intent(*tp,
> +					dfp->dfp_count);
> +			list_for_each(li, &dfp->dfp_work)
> +				dfp->dfp_type->log_item(*tp, dfp->dfp_intent,
> +						li);
> +		} else {
> +			/* Done with the dfp, free it. */
> +			list_del(&dfp->dfp_list);
> +			kmem_free(dfp);
> +		}
> +
> +		if (cleanup_fn)
> +			cleanup_fn(*tp, state, error);
> +	}
> +
> +out:
> +	return error;
> +}
> +
> +/*
> + * Free up any items left in the list.
> + */
> +void
> +xfs_defer_cancel(
> +	struct xfs_defer_ops		*dop)
> +{
> +	struct xfs_defer_pending	*dfp;
> +	struct xfs_defer_pending	*pli;
> +	struct list_head		*pwi;
> +	struct list_head		*n;
> +
> +	/*
> +	 * Free the pending items.  Caller should already have arranged
> +	 * for the intent items to be released.
> +	 */
> +	list_for_each_entry_safe(dfp, pli, &dop->dop_intake, dfp_list) {
> +		list_del(&dfp->dfp_list);
> +		list_for_each_safe(pwi, n, &dfp->dfp_work) {
> +			list_del(pwi);
> +			dfp->dfp_count--;
> +			dfp->dfp_type->cancel_item(pwi);
> +		}
> +		ASSERT(dfp->dfp_count == 0);
> +		kmem_free(dfp);
> +	}
> +	list_for_each_entry_safe(dfp, pli, &dop->dop_pending, dfp_list) {
> +		list_del(&dfp->dfp_list);
> +		list_for_each_safe(pwi, n, &dfp->dfp_work) {
> +			list_del(pwi);
> +			dfp->dfp_count--;
> +			dfp->dfp_type->cancel_item(pwi);
> +		}
> +		ASSERT(dfp->dfp_count == 0);
> +		kmem_free(dfp);
> +	}
> +}
> +
> +/* Add an item for later deferred processing. */
> +void
> +xfs_defer_add(
> +	struct xfs_defer_ops		*dop,
> +	enum xfs_defer_ops_type		type,
> +	struct list_head		*li)
> +{
> +	struct xfs_defer_pending	*dfp = NULL;
> +
> +	/*
> +	 * Add the item to a pending item at the end of the intake list.
> +	 * If the last pending item has the same type, reuse it.  Else,
> +	 * create a new pending item at the end of the intake list.
> +	 */
> +	if (!list_empty(&dop->dop_intake)) {
> +		dfp = list_last_entry(&dop->dop_intake,
> +				struct xfs_defer_pending, dfp_list);
> +		if (dfp->dfp_type->type != type ||
> +		    (dfp->dfp_type->max_items &&
> +		     dfp->dfp_count >= dfp->dfp_type->max_items))
> +			dfp = NULL;
> +	}
> +	if (!dfp) {
> +		dfp = kmem_alloc(sizeof(struct xfs_defer_pending),
> +				KM_SLEEP | KM_NOFS);
> +		dfp->dfp_type = defer_op_types[type];
> +		dfp->dfp_committed = false;
> +		dfp->dfp_intent = NULL;
> +		dfp->dfp_count = 0;
> +		INIT_LIST_HEAD(&dfp->dfp_work);
> +		list_add_tail(&dfp->dfp_list, &dop->dop_intake);
> +	}
> +
> +	list_add_tail(li, &dfp->dfp_work);
> +	dfp->dfp_count++;
> +}
> +
> +/* Initialize a deferred operation list. */
> +void
> +xfs_defer_init_op_type(
> +	const struct xfs_defer_op_type	*type)
> +{
> +	defer_op_types[type->type] = type;
> +}
> +
> +/* Initialize a deferred operation. */
> +void
> +xfs_defer_init(
> +	struct xfs_defer_ops		*dop,
> +	xfs_fsblock_t			*fbp)
> +{
> +	dop->dop_committed = false;
> +	dop->dop_low = false;
> +	memset(&dop->dop_inodes, 0, sizeof(dop->dop_inodes));
> +	*fbp = NULLFSBLOCK;
> +	INIT_LIST_HEAD(&dop->dop_intake);
> +	INIT_LIST_HEAD(&dop->dop_pending);
> +}
> diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
> new file mode 100644
> index 0000000..85c7a3a
> --- /dev/null
> +++ b/fs/xfs/libxfs/xfs_defer.h
> @@ -0,0 +1,96 @@
> +/*
> + * Copyright (C) 2016 Oracle.  All Rights Reserved.
> + *
> + * Author: Darrick J. Wong <darrick.wong@oracle.com>
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version 2
> + * of the License, or (at your option) any later version.
> + *
> + * This program is distributed in the hope that it would be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write the Free Software Foundation,
> + * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
> + */
> +#ifndef __XFS_DEFER_H__
> +#define	__XFS_DEFER_H__
> +
> +struct xfs_defer_op_type;
> +
> +/*
> + * Save a log intent item and a list of extents, so that we can replay
> + * whatever action had to happen to the extent list and file the log done
> + * item.
> + */
> +struct xfs_defer_pending {
> +	const struct xfs_defer_op_type	*dfp_type;	/* function pointers */
> +	struct list_head		dfp_list;	/* pending items */
> +	bool				dfp_committed;	/* committed trans? */
> +	void				*dfp_intent;	/* log intent item */
> +	struct list_head		dfp_work;	/* work items */
> +	unsigned int			dfp_count;	/* # extent items */
> +};
> +
> +/*
> + * Header for deferred operation list.
> + *
> + * dop_low is used by the allocator to activate the lowspace algorithm -
> + * when free space is running low the extent allocator may choose to
> + * allocate an extent from an AG without leaving sufficient space for
> + * a btree split when inserting the new extent.  In this case the allocator
> + * will enable the lowspace algorithm which is supposed to allow further
> + * allocations (such as btree splits and newroots) to allocate from
> + * sequential AGs.  In order to avoid locking AGs out of order the lowspace
> + * algorithm will start searching for free space from AG 0.  If the correct
> + * transaction reservations have been made then this algorithm will eventually
> + * find all the space it needs.
> + */
> +enum xfs_defer_ops_type {
> +	XFS_DEFER_OPS_TYPE_MAX,
> +};
> +
> +#define XFS_DEFER_OPS_NR_INODES	2	/* join up to two inodes */
> +
> +struct xfs_defer_ops {
> +	bool			dop_committed;	/* did any trans commit? */
> +	bool			dop_low;	/* alloc in low mode */
> +	struct list_head	dop_intake;	/* unlogged pending work */
> +	struct list_head	dop_pending;	/* logged pending work */
> +
> +	/* relog these inodes with each roll */
> +	struct xfs_inode	*dop_inodes[XFS_DEFER_OPS_NR_INODES];
> +};
> +
> +void xfs_defer_add(struct xfs_defer_ops *dop, enum xfs_defer_ops_type type,
> +		struct list_head *h);
> +int xfs_defer_finish(struct xfs_trans **tp, struct xfs_defer_ops *dop,
> +		struct xfs_inode *ip);
> +void xfs_defer_cancel(struct xfs_defer_ops *dop);
> +void xfs_defer_init(struct xfs_defer_ops *dop, xfs_fsblock_t *fbp);
> +bool xfs_defer_has_unfinished_work(struct xfs_defer_ops *dop);
> +int xfs_defer_join(struct xfs_defer_ops *dop, struct xfs_inode *ip);
> +
> +/* Description of a deferred type. */
> +struct xfs_defer_op_type {
> +	enum xfs_defer_ops_type	type;
> +	unsigned int		max_items;
> +	void (*abort_intent)(void *);
> +	void *(*create_done)(struct xfs_trans *, void *, unsigned int);
> +	int (*finish_item)(struct xfs_trans *, struct xfs_defer_ops *,
> +			struct list_head *, void *, void **);
> +	void (*finish_cleanup)(struct xfs_trans *, void *, int);
> +	void (*cancel_item)(struct list_head *);
> +	int (*diff_items)(void *, struct list_head *, struct list_head *);
> +	void *(*create_intent)(struct xfs_trans *, uint);
> +	void (*log_item)(struct xfs_trans *, void *, struct list_head *);
> +};
> +
> +void xfs_defer_init_op_type(const struct xfs_defer_op_type *type);
> +void xfs_defer_init_types(void);
> +
> +#endif /* __XFS_DEFER_H__ */
> diff --git a/fs/xfs/xfs_defer_item.c b/fs/xfs/xfs_defer_item.c
> new file mode 100644
> index 0000000..849088d
> --- /dev/null
> +++ b/fs/xfs/xfs_defer_item.c
> @@ -0,0 +1,36 @@
> +/*
> + * Copyright (C) 2016 Oracle.  All Rights Reserved.
> + *
> + * Author: Darrick J. Wong <darrick.wong@oracle.com>
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version 2
> + * of the License, or (at your option) any later version.
> + *
> + * This program is distributed in the hope that it would be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write the Free Software Foundation,
> + * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
> + */
> +#include "xfs.h"
> +#include "xfs_fs.h"
> +#include "xfs_shared.h"
> +#include "xfs_format.h"
> +#include "xfs_log_format.h"
> +#include "xfs_trans_resv.h"
> +#include "xfs_bit.h"
> +#include "xfs_sb.h"
> +#include "xfs_mount.h"
> +#include "xfs_defer.h"
> +#include "xfs_trans.h"
> +
> +/* Initialize the deferred operation types. */
> +void
> +xfs_defer_init_types(void)
> +{
> +}
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 09722a7..bf63f6d 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -46,6 +46,7 @@
>  #include "xfs_quota.h"
>  #include "xfs_sysfs.h"
>  #include "xfs_ondisk.h"
> +#include "xfs_defer.h"
>  
>  #include <linux/namei.h>
>  #include <linux/init.h>
> @@ -1850,6 +1851,7 @@ init_xfs_fs(void)
>  	printk(KERN_INFO XFS_VERSION_STRING " with "
>  			 XFS_BUILD_OPTIONS " enabled\n");
>  
> +	xfs_defer_init_types();
>  	xfs_dir_startup();
>  
>  	error = xfs_init_zones();
> 


* Re: [PATCH 017/119] xfs: add tracepoints for the deferred ops mechanism
  2016-06-17  1:19 ` [PATCH 017/119] xfs: add tracepoints for the deferred ops mechanism Darrick J. Wong
@ 2016-06-27 13:15   ` Brian Foster
  0 siblings, 0 replies; 236+ messages in thread
From: Brian Foster @ 2016-06-27 13:15 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, vishal.l.verma, xfs

On Thu, Jun 16, 2016 at 06:19:40PM -0700, Darrick J. Wong wrote:
> Add tracepoints for the internals of the deferred ops mechanism
> and tracepoint classes for clients of the dops, to make debugging
> easier.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/libxfs/xfs_defer.c |   19 ++++
>  fs/xfs/xfs_defer_item.c   |    1 
>  fs/xfs/xfs_trace.c        |    1 
>  fs/xfs/xfs_trace.h        |  198 +++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 219 insertions(+)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
> index ad14e33e..b4e7faa 100644
> --- a/fs/xfs/libxfs/xfs_defer.c
> +++ b/fs/xfs/libxfs/xfs_defer.c
> @@ -163,6 +163,7 @@ xfs_defer_intake_work(
>  	struct xfs_defer_pending	*dfp;
>  
>  	list_for_each_entry(dfp, &dop->dop_intake, dfp_list) {
> +		trace_xfs_defer_intake_work(tp->t_mountp, dfp);
>  		dfp->dfp_intent = dfp->dfp_type->create_intent(tp,
>  				dfp->dfp_count);
>  		list_sort(tp->t_mountp, &dfp->dfp_work,
> @@ -183,6 +184,7 @@ xfs_defer_trans_abort(
>  {
>  	struct xfs_defer_pending	*dfp;
>  
> +	trace_xfs_defer_trans_abort(tp->t_mountp, dop);
>  	/*
>  	 * If the transaction was committed, drop the intent reference
>  	 * since we're bailing out of here. The other reference is
> @@ -195,6 +197,7 @@ xfs_defer_trans_abort(
>  
>  	/* Abort intent items. */
>  	list_for_each_entry(dfp, &dop->dop_pending, dfp_list) {
> +		trace_xfs_defer_pending_abort(tp->t_mountp, dfp);
>  		if (dfp->dfp_committed)
>  			dfp->dfp_type->abort_intent(dfp->dfp_intent);
>  	}
> @@ -221,9 +224,12 @@ xfs_defer_trans_roll(
>  		xfs_trans_log_inode(*tp, dop->dop_inodes[i], XFS_ILOG_CORE);
>  	}
>  
> +	trace_xfs_defer_trans_roll((*tp)->t_mountp, dop);
> +
>  	/* Roll the transaction. */
>  	error = xfs_trans_roll(tp, ip);
>  	if (error) {
> +		trace_xfs_defer_trans_roll_error((*tp)->t_mountp, dop, error);
>  		xfs_defer_trans_abort(*tp, dop, error);
>  		return error;
>  	}
> @@ -295,6 +301,8 @@ xfs_defer_finish(
>  
>  	ASSERT((*tp)->t_flags & XFS_TRANS_PERM_LOG_RES);
>  
> +	trace_xfs_defer_finish((*tp)->t_mountp, dop);
> +
>  	/* Until we run out of pending work to finish... */
>  	while (xfs_defer_has_unfinished_work(dop)) {
>  		/* Log intents for work items sitting in the intake. */
> @@ -309,12 +317,14 @@ xfs_defer_finish(
>  		list_for_each_entry_reverse(dfp, &dop->dop_pending, dfp_list) {
>  			if (dfp->dfp_committed)
>  				break;
> +			trace_xfs_defer_pending_commit((*tp)->t_mountp, dfp);
>  			dfp->dfp_committed = true;
>  		}
>  
>  		/* Log an intent-done item for the first pending item. */
>  		dfp = list_first_entry(&dop->dop_pending,
>  				struct xfs_defer_pending, dfp_list);
> +		trace_xfs_defer_pending_finish((*tp)->t_mountp, dfp);
>  		done_item = dfp->dfp_type->create_done(*tp, dfp->dfp_intent,
>  				dfp->dfp_count);
>  		cleanup_fn = dfp->dfp_type->finish_cleanup;
> @@ -370,6 +380,10 @@ xfs_defer_finish(
>  	}
>  
>  out:
> +	if (error)
> +		trace_xfs_defer_finish_error((*tp)->t_mountp, dop, error);
> +	else
> +		trace_xfs_defer_finish_done((*tp)->t_mountp, dop);
>  	return error;
>  }
>  
> @@ -385,11 +399,14 @@ xfs_defer_cancel(
>  	struct list_head		*pwi;
>  	struct list_head		*n;
>  
> +	trace_xfs_defer_cancel(NULL, dop);
> +
>  	/*
>  	 * Free the pending items.  Caller should already have arranged
>  	 * for the intent items to be released.
>  	 */
>  	list_for_each_entry_safe(dfp, pli, &dop->dop_intake, dfp_list) {
> +		trace_xfs_defer_intake_cancel(NULL, dfp);
>  		list_del(&dfp->dfp_list);
>  		list_for_each_safe(pwi, n, &dfp->dfp_work) {
>  			list_del(pwi);
> @@ -400,6 +417,7 @@ xfs_defer_cancel(
>  		kmem_free(dfp);
>  	}
>  	list_for_each_entry_safe(dfp, pli, &dop->dop_pending, dfp_list) {
> +		trace_xfs_defer_pending_cancel(NULL, dfp);
>  		list_del(&dfp->dfp_list);
>  		list_for_each_safe(pwi, n, &dfp->dfp_work) {
>  			list_del(pwi);
> @@ -468,4 +486,5 @@ xfs_defer_init(
>  	*fbp = NULLFSBLOCK;
>  	INIT_LIST_HEAD(&dop->dop_intake);
>  	INIT_LIST_HEAD(&dop->dop_pending);
> +	trace_xfs_defer_init(NULL, dop);
>  }
> diff --git a/fs/xfs/xfs_defer_item.c b/fs/xfs/xfs_defer_item.c
> index 849088d..4c2ba28 100644
> --- a/fs/xfs/xfs_defer_item.c
> +++ b/fs/xfs/xfs_defer_item.c
> @@ -28,6 +28,7 @@
>  #include "xfs_mount.h"
>  #include "xfs_defer.h"
>  #include "xfs_trans.h"
> +#include "xfs_trace.h"
>  
>  /* Initialize the deferred operation types. */
>  void
> diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c
> index 13a0298..3971527 100644
> --- a/fs/xfs/xfs_trace.c
> +++ b/fs/xfs/xfs_trace.c
> @@ -22,6 +22,7 @@
>  #include "xfs_log_format.h"
>  #include "xfs_trans_resv.h"
>  #include "xfs_mount.h"
> +#include "xfs_defer.h"
>  #include "xfs_da_format.h"
>  #include "xfs_inode.h"
>  #include "xfs_btree.h"
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index f0ac9c9..5923014 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -2220,6 +2220,204 @@ DEFINE_EVENT(xfs_btree_cur_class, name, \
>  DEFINE_BTREE_CUR_EVENT(xfs_btree_updkeys);
>  DEFINE_BTREE_CUR_EVENT(xfs_btree_overlapped_query_range);
>  
> +/* deferred ops */
> +struct xfs_defer_pending;
> +struct xfs_defer_intake;
> +struct xfs_defer_ops;
> +
> +DECLARE_EVENT_CLASS(xfs_defer_class,
> +	TP_PROTO(struct xfs_mount *mp, struct xfs_defer_ops *dop),
> +	TP_ARGS(mp, dop),
> +	TP_STRUCT__entry(
> +		__field(dev_t, dev)
> +		__field(void *, dop)
> +		__field(bool, committed)
> +		__field(bool, low)
> +	),
> +	TP_fast_assign(
> +		__entry->dev = mp ? mp->m_super->s_dev : 0;
> +		__entry->dop = dop;
> +		__entry->committed = dop->dop_committed;
> +		__entry->low = dop->dop_low;
> +	),
> +	TP_printk("dev %d:%d ops %p committed %d low %d\n",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> +		  __entry->dop,
> +		  __entry->committed,
> +		  __entry->low)
> +)
> +#define DEFINE_DEFER_EVENT(name) \
> +DEFINE_EVENT(xfs_defer_class, name, \
> +	TP_PROTO(struct xfs_mount *mp, struct xfs_defer_ops *dop), \
> +	TP_ARGS(mp, dop))
> +
> +DECLARE_EVENT_CLASS(xfs_defer_error_class,
> +	TP_PROTO(struct xfs_mount *mp, struct xfs_defer_ops *dop, int error),
> +	TP_ARGS(mp, dop, error),
> +	TP_STRUCT__entry(
> +		__field(dev_t, dev)
> +		__field(void *, dop)
> +		__field(bool, committed)
> +		__field(bool, low)
> +		__field(int, error)
> +	),
> +	TP_fast_assign(
> +		__entry->dev = mp ? mp->m_super->s_dev : 0;
> +		__entry->dop = dop;
> +		__entry->committed = dop->dop_committed;
> +		__entry->low = dop->dop_low;
> +		__entry->error = error;
> +	),
> +	TP_printk("dev %d:%d ops %p committed %d low %d err %d",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> +		  __entry->dop,
> +		  __entry->committed,
> +		  __entry->low,
> +		  __entry->error)
> +)
> +#define DEFINE_DEFER_ERROR_EVENT(name) \
> +DEFINE_EVENT(xfs_defer_error_class, name, \
> +	TP_PROTO(struct xfs_mount *mp, struct xfs_defer_ops *dop, int error), \
> +	TP_ARGS(mp, dop, error))
> +
> +DECLARE_EVENT_CLASS(xfs_defer_pending_class,
> +	TP_PROTO(struct xfs_mount *mp, struct xfs_defer_pending *dfp),
> +	TP_ARGS(mp, dfp),
> +	TP_STRUCT__entry(
> +		__field(dev_t, dev)
> +		__field(int, type)
> +		__field(void *, intent)
> +		__field(bool, committed)
> +		__field(int, nr)
> +	),
> +	TP_fast_assign(
> +		__entry->dev = mp ? mp->m_super->s_dev : 0;
> +		__entry->type = dfp->dfp_type->type;
> +		__entry->intent = dfp->dfp_intent;
> +		__entry->committed = dfp->dfp_committed;
> +		__entry->nr = dfp->dfp_count;
> +	),
> +	TP_printk("dev %d:%d optype %d intent %p committed %d nr %d\n",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> +		  __entry->type,
> +		  __entry->intent,
> +		  __entry->committed,
> +		  __entry->nr)
> +)
> +#define DEFINE_DEFER_PENDING_EVENT(name) \
> +DEFINE_EVENT(xfs_defer_pending_class, name, \
> +	TP_PROTO(struct xfs_mount *mp, struct xfs_defer_pending *dfp), \
> +	TP_ARGS(mp, dfp))
> +
> +DECLARE_EVENT_CLASS(xfs_phys_extent_deferred_class,
> +	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
> +		 int type, xfs_agblock_t agbno, xfs_extlen_t len),
> +	TP_ARGS(mp, agno, type, agbno, len),
> +	TP_STRUCT__entry(
> +		__field(dev_t, dev)
> +		__field(xfs_agnumber_t, agno)
> +		__field(int, type)
> +		__field(xfs_agblock_t, agbno)
> +		__field(xfs_extlen_t, len)
> +	),
> +	TP_fast_assign(
> +		__entry->dev = mp->m_super->s_dev;
> +		__entry->agno = agno;
> +		__entry->type = type;
> +		__entry->agbno = agbno;
> +		__entry->len = len;
> +	),
> +	TP_printk("dev %d:%d op %d agno %u agbno %u len %u",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> +		  __entry->type,
> +		  __entry->agno,
> +		  __entry->agbno,
> +		  __entry->len)
> +);
> +#define DEFINE_PHYS_EXTENT_DEFERRED_EVENT(name) \
> +DEFINE_EVENT(xfs_phys_extent_deferred_class, name, \
> +	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, \
> +		 int type, \
> +		 xfs_agblock_t bno, \
> +		 xfs_extlen_t len), \
> +	TP_ARGS(mp, agno, type, bno, len))
> +
> +DECLARE_EVENT_CLASS(xfs_map_extent_deferred_class,
> +	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
> +		 int op,
> +		 xfs_agblock_t agbno,
> +		 xfs_ino_t ino,
> +		 int whichfork,
> +		 xfs_fileoff_t offset,
> +		 xfs_filblks_t len,
> +		 xfs_exntst_t state),
> +	TP_ARGS(mp, agno, op, agbno, ino, whichfork, offset, len, state),
> +	TP_STRUCT__entry(
> +		__field(dev_t, dev)
> +		__field(xfs_agnumber_t, agno)
> +		__field(xfs_ino_t, ino)
> +		__field(xfs_agblock_t, agbno)
> +		__field(int, whichfork)
> +		__field(xfs_fileoff_t, l_loff)
> +		__field(xfs_filblks_t, l_len)
> +		__field(xfs_exntst_t, l_state)
> +		__field(int, op)
> +	),
> +	TP_fast_assign(
> +		__entry->dev = mp->m_super->s_dev;
> +		__entry->agno = agno;
> +		__entry->ino = ino;
> +		__entry->agbno = agbno;
> +		__entry->whichfork = whichfork;
> +		__entry->l_loff = offset;
> +		__entry->l_len = len;
> +		__entry->l_state = state;
> +		__entry->op = op;
> +	),
> +	TP_printk("dev %d:%d op %d agno %u agbno %u owner %lld %s offset %llu len %llu state %d",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> +		  __entry->op,
> +		  __entry->agno,
> +		  __entry->agbno,
> +		  __entry->ino,
> +		  __entry->whichfork == XFS_ATTR_FORK ? "attr" : "data",
> +		  __entry->l_loff,
> +		  __entry->l_len,
> +		  __entry->l_state)
> +);
> +#define DEFINE_MAP_EXTENT_DEFERRED_EVENT(name) \
> +DEFINE_EVENT(xfs_map_extent_deferred_class, name, \
> +	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, \
> +		 int op, \
> +		 xfs_agblock_t agbno, \
> +		 xfs_ino_t ino, \
> +		 int whichfork, \
> +		 xfs_fileoff_t offset, \
> +		 xfs_filblks_t len, \
> +		 xfs_exntst_t state), \
> +	TP_ARGS(mp, agno, op, agbno, ino, whichfork, offset, len, state))
> +
> +DEFINE_DEFER_EVENT(xfs_defer_init);
> +DEFINE_DEFER_EVENT(xfs_defer_cancel);
> +DEFINE_DEFER_EVENT(xfs_defer_trans_roll);
> +DEFINE_DEFER_EVENT(xfs_defer_trans_abort);
> +DEFINE_DEFER_EVENT(xfs_defer_finish);
> +DEFINE_DEFER_EVENT(xfs_defer_finish_done);
> +
> +DEFINE_DEFER_ERROR_EVENT(xfs_defer_trans_roll_error);
> +DEFINE_DEFER_ERROR_EVENT(xfs_defer_finish_error);
> +DEFINE_DEFER_ERROR_EVENT(xfs_defer_op_finish_error);
> +
> +DEFINE_DEFER_PENDING_EVENT(xfs_defer_intake_work);
> +DEFINE_DEFER_PENDING_EVENT(xfs_defer_intake_cancel);
> +DEFINE_DEFER_PENDING_EVENT(xfs_defer_pending_commit);
> +DEFINE_DEFER_PENDING_EVENT(xfs_defer_pending_cancel);
> +DEFINE_DEFER_PENDING_EVENT(xfs_defer_pending_finish);
> +DEFINE_DEFER_PENDING_EVENT(xfs_defer_pending_abort);
> +
> +DEFINE_PHYS_EXTENT_DEFERRED_EVENT(xfs_defer_phys_extent);
> +DEFINE_MAP_EXTENT_DEFERRED_EVENT(xfs_defer_map_extent);
> +
>  #endif /* _TRACE_XFS_H */
>  
>  #undef TRACE_INCLUDE_PATH
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 018/119] xfs: enable the xfs_defer mechanism to process extents to free
  2016-06-17  1:19 ` [PATCH 018/119] xfs: enable the xfs_defer mechanism to process extents to free Darrick J. Wong
@ 2016-06-27 13:15   ` Brian Foster
  2016-06-27 21:41     ` Darrick J. Wong
  0 siblings, 1 reply; 236+ messages in thread
From: Brian Foster @ 2016-06-27 13:15 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, vishal.l.verma, xfs

On Thu, Jun 16, 2016 at 06:19:47PM -0700, Darrick J. Wong wrote:
> Connect the xfs_defer mechanism with the pieces that we'll need to
> handle deferred extent freeing.  We'll wire up the existing code to
> our new deferred mechanism later.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

Could we merge this with the xfs_trans_*efi/efd* bits? We'd need to
preserve some calls for recovery, but it looks like other parts are only
used by the deferred ops infrastructure at this point.

Brian

>  fs/xfs/libxfs/xfs_defer.h |    1 
>  fs/xfs/xfs_defer_item.c   |  108 +++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 109 insertions(+)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
> index 85c7a3a..743fc32 100644
> --- a/fs/xfs/libxfs/xfs_defer.h
> +++ b/fs/xfs/libxfs/xfs_defer.h
> @@ -51,6 +51,7 @@ struct xfs_defer_pending {
>   * find all the space it needs.
>   */
>  enum xfs_defer_ops_type {
> +	XFS_DEFER_OPS_TYPE_FREE,
>  	XFS_DEFER_OPS_TYPE_MAX,
>  };
>  
> diff --git a/fs/xfs/xfs_defer_item.c b/fs/xfs/xfs_defer_item.c
> index 4c2ba28..127a54e 100644
> --- a/fs/xfs/xfs_defer_item.c
> +++ b/fs/xfs/xfs_defer_item.c
> @@ -29,9 +29,117 @@
>  #include "xfs_defer.h"
>  #include "xfs_trans.h"
>  #include "xfs_trace.h"
> +#include "xfs_bmap.h"
> +#include "xfs_extfree_item.h"
> +
> +/* Extent Freeing */
> +
> +/* Sort bmap items by AG. */
> +static int
> +xfs_bmap_free_diff_items(
> +	void				*priv,
> +	struct list_head		*a,
> +	struct list_head		*b)
> +{
> +	struct xfs_mount		*mp = priv;
> +	struct xfs_bmap_free_item	*ra;
> +	struct xfs_bmap_free_item	*rb;
> +
> +	ra = container_of(a, struct xfs_bmap_free_item, xbfi_list);
> +	rb = container_of(b, struct xfs_bmap_free_item, xbfi_list);
> +	return  XFS_FSB_TO_AGNO(mp, ra->xbfi_startblock) -
> +		XFS_FSB_TO_AGNO(mp, rb->xbfi_startblock);
> +}
> +
> +/* Get an EFI. */
> +STATIC void *
> +xfs_bmap_free_create_intent(
> +	struct xfs_trans		*tp,
> +	unsigned int			count)
> +{
> +	return xfs_trans_get_efi(tp, count);
> +}
> +
> +/* Log a free extent to the intent item. */
> +STATIC void
> +xfs_bmap_free_log_item(
> +	struct xfs_trans		*tp,
> +	void				*intent,
> +	struct list_head		*item)
> +{
> +	struct xfs_bmap_free_item	*free;
> +
> +	free = container_of(item, struct xfs_bmap_free_item, xbfi_list);
> +	xfs_trans_log_efi_extent(tp, intent, free->xbfi_startblock,
> +			free->xbfi_blockcount);
> +}
> +
> +/* Get an EFD so we can process all the free extents. */
> +STATIC void *
> +xfs_bmap_free_create_done(
> +	struct xfs_trans		*tp,
> +	void				*intent,
> +	unsigned int			count)
> +{
> +	return xfs_trans_get_efd(tp, intent, count);
> +}
> +
> +/* Process a free extent. */
> +STATIC int
> +xfs_bmap_free_finish_item(
> +	struct xfs_trans		*tp,
> +	struct xfs_defer_ops		*dop,
> +	struct list_head		*item,
> +	void				*done_item,
> +	void				**state)
> +{
> +	struct xfs_bmap_free_item	*free;
> +	int				error;
> +
> +	free = container_of(item, struct xfs_bmap_free_item, xbfi_list);
> +	error = xfs_trans_free_extent(tp, done_item,
> +			free->xbfi_startblock,
> +			free->xbfi_blockcount);
> +	kmem_free(free);
> +	return error;
> +}
> +
> +/* Abort all pending EFIs. */
> +STATIC void
> +xfs_bmap_free_abort_intent(
> +	void				*intent)
> +{
> +	xfs_efi_release(intent);
> +}
> +
> +/* Cancel a free extent. */
> +STATIC void
> +xfs_bmap_free_cancel_item(
> +	struct list_head		*item)
> +{
> +	struct xfs_bmap_free_item	*free;
> +
> +	free = container_of(item, struct xfs_bmap_free_item, xbfi_list);
> +	kmem_free(free);
> +}
> +
> +const struct xfs_defer_op_type xfs_extent_free_defer_type = {
> +	.type		= XFS_DEFER_OPS_TYPE_FREE,
> +	.max_items	= XFS_EFI_MAX_FAST_EXTENTS,
> +	.diff_items	= xfs_bmap_free_diff_items,
> +	.create_intent	= xfs_bmap_free_create_intent,
> +	.abort_intent	= xfs_bmap_free_abort_intent,
> +	.log_item	= xfs_bmap_free_log_item,
> +	.create_done	= xfs_bmap_free_create_done,
> +	.finish_item	= xfs_bmap_free_finish_item,
> +	.cancel_item	= xfs_bmap_free_cancel_item,
> +};
> +
> +/* Deferred Item Initialization */
>  
>  /* Initialize the deferred operation types. */
>  void
>  xfs_defer_init_types(void)
>  {
> +	xfs_defer_init_op_type(&xfs_extent_free_defer_type);
>  }
> 


* Re: [PATCH 016/119] xfs: move deferred operations into a separate file
  2016-06-27 13:14   ` Brian Foster
@ 2016-06-27 19:14     ` Darrick J. Wong
  2016-06-28 12:32       ` Brian Foster
  0 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-27 19:14 UTC (permalink / raw)
  To: Brian Foster; +Cc: david, linux-fsdevel, vishal.l.verma, xfs

On Mon, Jun 27, 2016 at 09:14:54AM -0400, Brian Foster wrote:
> On Thu, Jun 16, 2016 at 06:19:34PM -0700, Darrick J. Wong wrote:
> > All the code around struct xfs_bmap_free basically implements a
> > deferred operation framework through which we can roll transactions
> > (to unlock buffers and avoid violating lock order rules) while
> > managing all the necessary log redo items.  Previously we only used
> > this code to free extents after some sort of mapping operation, but
> > with the advent of rmap and reflink, we suddenly need to do more than
> > that.
> > 
> > With that in mind, xfs_bmap_free really becomes a deferred ops control
> > structure.  Rename the structure and move the deferred ops into their
> > own file to avoid further bloating of the bmap code.
> > 
> > v2: actually sort the work items by AG to avoid deadlocks
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> 
> So if I'm following this correctly, we 1.) abstract the bmap freeing
> infrastructure into a generic mechanism and 2.) enhance it a bit to
> provide things like partial intent completion, etc.

[Back from vacation]

Yup.  The partial intent completion code is for use by the refcount adjust
function because in the worst case an adjustment of N blocks could require
N record updates.

> If so and for future
> reference, this would probably be easier to review if the abstraction
> and enhancement were done separately. It's probably not worth that at
> this point, however...

It wouldn't be difficult to separate them; the partial intent completion
bits are the two code blocks below that handle the -EAGAIN case.

(On the other hand it's so little code that I figured I might as well
just do the whole file all at once.)

> >  fs/xfs/Makefile           |    2 
> >  fs/xfs/libxfs/xfs_defer.c |  471 +++++++++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/libxfs/xfs_defer.h |   96 +++++++++
> >  fs/xfs/xfs_defer_item.c   |   36 +++
> >  fs/xfs/xfs_super.c        |    2 
> >  5 files changed, 607 insertions(+)
> >  create mode 100644 fs/xfs/libxfs/xfs_defer.c
> >  create mode 100644 fs/xfs/libxfs/xfs_defer.h
> >  create mode 100644 fs/xfs/xfs_defer_item.c
> > 
> > 
> > diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
> > index 3542d94..ad46a2d 100644
> > --- a/fs/xfs/Makefile
> > +++ b/fs/xfs/Makefile
> > @@ -39,6 +39,7 @@ xfs-y				+= $(addprefix libxfs/, \
> >  				   xfs_btree.o \
> >  				   xfs_da_btree.o \
> >  				   xfs_da_format.o \
> > +				   xfs_defer.o \
> >  				   xfs_dir2.o \
> >  				   xfs_dir2_block.o \
> >  				   xfs_dir2_data.o \
> > @@ -66,6 +67,7 @@ xfs-y				+= xfs_aops.o \
> >  				   xfs_attr_list.o \
> >  				   xfs_bmap_util.o \
> >  				   xfs_buf.o \
> > +				   xfs_defer_item.o \
> >  				   xfs_dir2_readdir.o \
> >  				   xfs_discard.o \
> >  				   xfs_error.o \
> > diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
> > new file mode 100644
> > index 0000000..ad14e33e
> > --- /dev/null
> > +++ b/fs/xfs/libxfs/xfs_defer.c
> > @@ -0,0 +1,471 @@
> > +/*
> > + * Copyright (C) 2016 Oracle.  All Rights Reserved.
> > + *
> > + * Author: Darrick J. Wong <darrick.wong@oracle.com>
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU General Public License
> > + * as published by the Free Software Foundation; either version 2
> > + * of the License, or (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it would be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public License
> > + * along with this program; if not, write the Free Software Foundation,
> > + * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
> > + */
> > +#include "xfs.h"
> > +#include "xfs_fs.h"
> > +#include "xfs_shared.h"
> > +#include "xfs_format.h"
> > +#include "xfs_log_format.h"
> > +#include "xfs_trans_resv.h"
> > +#include "xfs_bit.h"
> > +#include "xfs_sb.h"
> > +#include "xfs_mount.h"
> > +#include "xfs_defer.h"
> > +#include "xfs_trans.h"
> > +#include "xfs_trace.h"
> > +
> > +/*
> > + * Deferred Operations in XFS
> > + *
> > + * Due to the way locking rules work in XFS, certain transactions (block
> > + * mapping and unmapping, typically) have permanent reservations so that
> > + * we can roll the transaction to adhere to AG locking order rules and
> > + * to unlock buffers between metadata updates.  Prior to rmap/reflink,
> > + * the mapping code had a mechanism to perform these deferrals for
> > + * extents that were going to be freed; this code makes that facility
> > + * more generic.
> > + *
> > + * When adding the reverse mapping and reflink features, it became
> > + * necessary to perform complex remapping multi-transactions to comply
> > + * with AG locking order rules, and to be able to spread a single
> > + * refcount update operation (an operation on an n-block extent can
> > + * update as many as n records!) among multiple transactions.  XFS can
> > + * roll a transaction to facilitate this, but using this facility
> > + * requires us to log "intent" items in case log recovery needs to
> > + * redo the operation, and to log "done" items to indicate that redo
> > + * is not necessary.
> > + *
> > + * The xfs_defer_ops structure tracks incoming deferred work (which is
> > + * work that has not yet had an intent logged) in xfs_defer_intake.
> 
> Do you mean xfs_defer_pending rather than xfs_defer_intake?

Ugh, I forgot to update the documentation.  Unlogged work originally got
its own xfs_defer_intake item, but I then realized that it was basically
a subset of xfs_defer_pending.  Then I reorganized the data structures
so that we create one x_d_p, attach items to it, and later log the redo
item and move it from the intake list to the pending list.

So.... doc changes will be inline in the message.

Just replace the whole paragraph with:

"* Deferred work is tracked in xfs_defer_pending items.  Each pending
 * item tracks one type of deferred work.  Incoming work items (which
 * have not yet had an intent logged) are attached to a pending item
 * on the dop_intake list, where they wait for the caller to finish
 * the deferred operations."

> > + * There is one xfs_defer_intake for each type of deferrable
> > + * operation.  Each new deferral is placed in the op's intake list,
> > + * where it waits for the caller to finish the deferred operations.
> > + *
> > + * Finishing a set of deferred operations is an involved process.  To
> > + * start, we define "rolling a deferred-op transaction" as follows:
> > + *
> > + * > For each xfs_defer_intake,

"For each xfs_defer_pending on the dop_intake list,"

> > + *   - Sort the items on the intake list in AG order.

"Sort the work items in AG order.  XFS locking order rules require us
to lock buffers in AG order."

> > + *   - Create a log intent item for that type.
> > + *   - Attach to it the items on the intake list.

"Attach it to the pending item."

> > + *   - Stash the intent+items for later in an xfs_defer_pending.
> 
> Does this mean "the pending list?"

No, the intent and work items are already attached to the x_d_p item.
This should read:

"Move the xfs_defer_pending item from the dop_intake list to the dop_pending
list."

> Thanks for the big comment and example below. It looks like the
> terminology is a bit out of sync with the latest code (I'm guessing
> design and data structures evolved a bit since this was written).

Yes.

> > + *   - Attach the xfs_defer_pending to the xfs_defer_ops work list.

This line becomes redundant with above.

> > + * > Roll the transaction.
> > + *
> > + * NOTE: To avoid exceeding the transaction reservation, we limit the
> > + * number of items that we attach to a given xfs_defer_pending.
> > + *
> > + * The actual finishing process looks like this:
> > + *
> > + * > For each xfs_defer_pending in the xfs_defer_ops work list,

"For each xfs_defer_pending in the dop_pending list,"

> > + *   - Roll the deferred-op transaction as above.
> > + *   - Create a log done item for that type, and attach it to the
> > + *     intent item.
> > + *   - For each work item attached to the intent item,
> > + *     * Perform the described action.
> > + *     * Attach the work item to the log done item.
> > + *     * If the result of doing the work was -EAGAIN, log a fresh
> > + *       intent item and attach all remaining work items to it.  Put
> > + *       the xfs_defer_pending item back on the work list, and repeat
> > + *       the loop.  This allows us to make partial progress even if
> > + *       the transaction is too full to finish the job.
> > + *
> > + * The key here is that we must log an intent item for all pending
> > + * work items every time we roll the transaction, and that we must log
> > + * a done item as soon as the work is completed.  With this mechanism
> > + * we can perform complex remapping operations, chaining intent items
> > + * as needed.
> > + *
> > + * This is an example of remapping the extent (E, E+B) into file X at
> > + * offset A and dealing with the extent (C, C+B) already being mapped
> > + * there:
> > + * +-------------------------------------------------+
> > + * | Unmap file X startblock C offset A length B     | t0
> > + * | Intent to reduce refcount for extent (C, B)     |
> > + * | Intent to remove rmap (X, C, A, B)              |
> > + * | Intent to free extent (D, 1) (bmbt block)       |
> > + * | Intent to map (X, A, B) at startblock E         |
> > + * +-------------------------------------------------+
> > + * | Map file X startblock E offset A length B       | t1
> > + * | Done mapping (X, E, A, B)                       |
> > + * | Intent to increase refcount for extent (E, B)   |
> > + * | Intent to add rmap (X, E, A, B)                 |
> > + * +-------------------------------------------------+
> > + * | Reduce refcount for extent (C, B)               | t2
> > + * | Done reducing refcount for extent (C, B)        |
> > + * | Increase refcount for extent (E, B)             |
> > + * | Done increasing refcount for extent (E, B)      |
> > + * | Intent to free extent (C, B)                    |
> > + * | Intent to free extent (F, 1) (refcountbt block) |
> > + * | Intent to remove rmap (F, 1, REFC)              |
> > + * +-------------------------------------------------+
> > + * | Remove rmap (X, C, A, B)                        | t3
> > + * | Done removing rmap (X, C, A, B)                 |
> > + * | Add rmap (X, E, A, B)                           |
> > + * | Done adding rmap (X, E, A, B)                   |
> > + * | Remove rmap (F, 1, REFC)                        |
> > + * | Done removing rmap (F, 1, REFC)                 |
> > + * +-------------------------------------------------+
> > + * | Free extent (C, B)                              | t4
> > + * | Done freeing extent (C, B)                      |
> > + * | Free extent (D, 1)                              |
> > + * | Done freeing extent (D, 1)                      |
> > + * | Free extent (F, 1)                              |
> > + * | Done freeing extent (F, 1)                      |
> > + * +-------------------------------------------------+
> > + *
> > + * If we should crash before t2 commits, log recovery replays
> > + * the following intent items:
> > + *
> > + * - Intent to reduce refcount for extent (C, B)
> > + * - Intent to remove rmap (X, C, A, B)
> > + * - Intent to free extent (D, 1) (bmbt block)
> > + * - Intent to increase refcount for extent (E, B)
> > + * - Intent to add rmap (X, E, A, B)
> > + *
> > + * In the process of recovering, it should also generate and take care
> > + * of these intent items:
> > + *
> > + * - Intent to free extent (C, B)
> > + * - Intent to free extent (F, 1) (refcountbt block)
> > + * - Intent to remove rmap (F, 1, REFC)
> > + */
> > +
> > +static const struct xfs_defer_op_type *defer_op_types[XFS_DEFER_OPS_TYPE_MAX];
> > +
> > +/*
> > + * For each pending item in the intake list, log its intent item and the
> > + * associated extents, then add the entire intake list to the end of
> > + * the pending list.
> > + */
> > +STATIC void
> 
> I don't think we're using 'STATIC' any longer. Better to use 'static' so
> we can eventually kill off the former.

<shrug> For debugging I've found it useful to have all these internal
functions show up in stack traces, etc.  On the other hand, I've noticed
that all the new patches have eschewed STATIC for static.

Will queue this for the next time I do full-patchbomb edits.

> > +xfs_defer_intake_work(
> > +	struct xfs_trans		*tp,
> > +	struct xfs_defer_ops		*dop)
> > +{
> > +	struct list_head		*li;
> > +	struct xfs_defer_pending	*dfp;
> > +
> > +	list_for_each_entry(dfp, &dop->dop_intake, dfp_list) {
> > +		dfp->dfp_intent = dfp->dfp_type->create_intent(tp,
> > +				dfp->dfp_count);
> > +		list_sort(tp->t_mountp, &dfp->dfp_work,
> > +				dfp->dfp_type->diff_items);
> > +		list_for_each(li, &dfp->dfp_work)
> > +			dfp->dfp_type->log_item(tp, dfp->dfp_intent, li);
> > +	}
> > +
> > +	list_splice_tail_init(&dop->dop_intake, &dop->dop_pending);
> > +}
> > +
> > +/* Abort all the intents that were committed. */
> > +STATIC void
> > +xfs_defer_trans_abort(
> > +	struct xfs_trans		*tp,
> > +	struct xfs_defer_ops		*dop,
> > +	int				error)
> > +{
> > +	struct xfs_defer_pending	*dfp;
> > +
> > +	/*
> > +	 * If the transaction was committed, drop the intent reference
> > +	 * since we're bailing out of here. The other reference is
> > +	 * dropped when the intent hits the AIL.  If the transaction
> > +	 * was not committed, the intent is freed by the intent item
> > +	 * unlock handler on abort.
> > +	 */
> > +	if (!dop->dop_committed)
> > +		return;
> > +
> > +	/* Abort intent items. */
> > +	list_for_each_entry(dfp, &dop->dop_pending, dfp_list) {
> > +		if (dfp->dfp_committed)
> > +			dfp->dfp_type->abort_intent(dfp->dfp_intent);
> > +	}
> > +
> > +	/* Shut down FS. */
> > +	xfs_force_shutdown(tp->t_mountp, (error == -EFSCORRUPTED) ?
> > +			SHUTDOWN_CORRUPT_INCORE : SHUTDOWN_META_IO_ERROR);
> > +}
> > +
> > +/* Roll a transaction so we can do some deferred op processing. */
> > +STATIC int
> > +xfs_defer_trans_roll(
> > +	struct xfs_trans		**tp,
> > +	struct xfs_defer_ops		*dop,
> > +	struct xfs_inode		*ip)
> > +{
> > +	int				i;
> > +	int				error;
> > +
> > +	/* Log all the joined inodes except the one we passed in. */
> > +	for (i = 0; i < XFS_DEFER_OPS_NR_INODES && dop->dop_inodes[i]; i++) {
> > +		if (dop->dop_inodes[i] == ip)
> > +			continue;
> > +		xfs_trans_log_inode(*tp, dop->dop_inodes[i], XFS_ILOG_CORE);
> > +	}
> > +
> > +	/* Roll the transaction. */
> > +	error = xfs_trans_roll(tp, ip);
> > +	if (error) {
> > +		xfs_defer_trans_abort(*tp, dop, error);
> > +		return error;
> > +	}
> > +	dop->dop_committed = true;
> > +
> > +	/* Log all the joined inodes except the one we passed in. */
> 
> Rejoin?

Er, yes. :)

> > +	for (i = 0; i < XFS_DEFER_OPS_NR_INODES && dop->dop_inodes[i]; i++) {
> > +		if (dop->dop_inodes[i] == ip)
> > +			continue;
> > +		xfs_trans_ijoin(*tp, dop->dop_inodes[i], 0);
> > +	}
> > +
> > +	return error;
> > +}
> > +
> > +/* Do we have any work items to finish? */
> > +bool
> > +xfs_defer_has_unfinished_work(
> > +	struct xfs_defer_ops		*dop)
> > +{
> > +	return !list_empty(&dop->dop_pending) || !list_empty(&dop->dop_intake);
> > +}
> > +
> > +/*
> > + * Add this inode to the deferred op.  Each joined inode is relogged
> > + * each time we roll the transaction, in addition to any inode passed
> > + * to xfs_defer_finish().
> > + */
> > +int
> > +xfs_defer_join(
> > +	struct xfs_defer_ops		*dop,
> > +	struct xfs_inode		*ip)
> > +{
> > +	int				i;
> > +
> > +	for (i = 0; i < XFS_DEFER_OPS_NR_INODES; i++) {
> > +		if (dop->dop_inodes[i] == ip)
> > +			return 0;
> > +		else if (dop->dop_inodes[i] == NULL) {
> > +			dop->dop_inodes[i] = ip;
> > +			return 0;
> > +		}
> > +	}
> > +
> > +	return -EFSCORRUPTED;
> > +}
> > +
> > +/*
> > + * Finish all the pending work.  This involves logging intent items for
> > + * any work items that wandered in since the last transaction roll (if
> > + * one has even happened), rolling the transaction, and finishing the
> > + * work items in the first item on the logged-and-pending list.
> > + *
> > + * If an inode is provided, relog it to the new transaction.
> > + */
> > +int
> > +xfs_defer_finish(
> > +	struct xfs_trans		**tp,
> > +	struct xfs_defer_ops		*dop,
> > +	struct xfs_inode		*ip)
> > +{
> > +	struct xfs_defer_pending	*dfp;
> > +	struct list_head		*li;
> > +	struct list_head		*n;
> > +	void				*done_item = NULL;
> > +	void				*state;
> > +	int				error = 0;
> > +	void				(*cleanup_fn)(struct xfs_trans *, void *, int);
> > +
> > +	ASSERT((*tp)->t_flags & XFS_TRANS_PERM_LOG_RES);
> > +
> > +	/* Until we run out of pending work to finish... */
> > +	while (xfs_defer_has_unfinished_work(dop)) {
> > +		/* Log intents for work items sitting in the intake. */
> > +		xfs_defer_intake_work(*tp, dop);
> > +
> > +		/* Roll the transaction. */
> > +		error = xfs_defer_trans_roll(tp, dop, ip);
> > +		if (error)
> > +			goto out;
> > +
> > +		/* Mark all pending intents as committed. */
> > +		list_for_each_entry_reverse(dfp, &dop->dop_pending, dfp_list) {
> > +			if (dfp->dfp_committed)
> > +				break;
> > +			dfp->dfp_committed = true;
> > +		}
> > +
> > +		/* Log an intent-done item for the first pending item. */
> > +		dfp = list_first_entry(&dop->dop_pending,
> > +				struct xfs_defer_pending, dfp_list);
> > +		done_item = dfp->dfp_type->create_done(*tp, dfp->dfp_intent,
> > +				dfp->dfp_count);
> > +		cleanup_fn = dfp->dfp_type->finish_cleanup;
> > +
> > +		/* Finish the work items. */
> > +		state = NULL;
> > +		list_for_each_safe(li, n, &dfp->dfp_work) {
> > +			list_del(li);
> > +			dfp->dfp_count--;
> > +			error = dfp->dfp_type->finish_item(*tp, dop, li,
> > +					done_item, &state);
> > +			if (error == -EAGAIN) {
> > +				/*
> > +				 * If the caller needs to try again, put the
> > +				 * item back on the pending list and jump out
> > +				 * for further processing.
> 
> A little confused by the terminology here. Perhaps better to say "back
> on the work list" rather than "pending list?"

Yes.

> Also, what is the meaning/purpose of -EAGAIN here? This isn't used by
> the extent free bits so I'm missing some context.

Generally, ->finish_item() uses -EAGAIN to signal that it couldn't finish
the work item and that it's necessary to log a new redo item and try again.

Practically, the only user of this mechanism is the refcountbt adjust function.
It might be the case that we want to adjust N blocks, but some pathological
user has creatively used reflink to create many refcount records.  In that
case we could blow out the transaction reservation logging all the updates.

To avoid that, the refcount code tries to guess (conservatively) when it
might be getting close and returns a short *adjusted.  See the call sites of
xfs_refcount_still_have_space().  Next, xfs_trans_log_finish_refcount_update()
will notice the short adjust returned and fixes up the CUD item to have a
reduced cud_nextents and to reflect where the operation stopped.  Then,
xfs_refcount_update_finish_item() notices the short return, updates the work
item list, and returns -EAGAIN.  Finally, xfs_defer_finish() sees the -EAGAIN
and requeues the work item so that we resume refcount adjusting after the
transaction rolls.

> For example, is there
> an issue with carrying a done_item with an unexpected list count?

AFAICT, nothing in log recovery ever checks that the list counts of the
intent and done items actually match, let alone the extents logged with
them.  It only seems to care if there's an efd such that efd->efd_efi_id ==
efi->efi_id, in which case it won't replay the efi.

I don't know if that was a deliberate part of the log design, but the
lack of checking helps us here.

> Is it
> expected that xfs_defer_finish() will not return until -EAGAIN is
> "cleared" (does relogging below and rolling somehow address this)?

Yes, relogging and rolling gives us a fresh transaction with which to
continue updating.

> > +				 */
> > +				list_add(li, &dfp->dfp_work);
> > +				dfp->dfp_count++;
> > +				break;
> > +			} else if (error) {
> > +				/*
> > +				 * Clean up after ourselves and jump out.
> > +				 * xfs_defer_cancel will take care of freeing
> > +				 * all these lists and stuff.
> > +				 */
> > +				if (cleanup_fn)
> > +					cleanup_fn(*tp, state, error);
> > +				xfs_defer_trans_abort(*tp, dop, error);
> > +				goto out;
> > +			}
> > +		}
> > +		if (error == -EAGAIN) {
> > +			/*
> > +			 * Log a new intent, relog all the remaining work
> > +			 * items to the new intent, attach the new intent to
> > +			 * the dfp, and leave the dfp at the head of the list
> > +			 * for further processing.
> > +			 */
> 
> Similar to the above, could you elaborate on the mechanics of this with
> respect to the log?  E.g., the comment kind of just repeats what the
> code does as opposed to explain why it's here. Is the point here to log
> a new intent in the same transaction as the done item to ensure that we
> (atomically) indicate that certain operations need to be replayed if
> this transaction hits the log and then we crash?

Yes.

"This effectively replaces the old intent item with a new one listing only
the work items that were not completed when ->finish_item() returned -EAGAIN.
After the subsequent transaction roll, we'll resume where we left off with a
fresh transaction."

Thank you for the review!

--D

> Brian
> 
> > +			dfp->dfp_intent = dfp->dfp_type->create_intent(*tp,
> > +					dfp->dfp_count);
> > +			list_for_each(li, &dfp->dfp_work)
> > +				dfp->dfp_type->log_item(*tp, dfp->dfp_intent,
> > +						li);
> > +		} else {
> > +			/* Done with the dfp, free it. */
> > +			list_del(&dfp->dfp_list);
> > +			kmem_free(dfp);
> > +		}
> > +
> > +		if (cleanup_fn)
> > +			cleanup_fn(*tp, state, error);
> > +	}
> > +
> > +out:
> > +	return error;
> > +}
> > +
> > +/*
> > + * Free up any items left in the list.
> > + */
> > +void
> > +xfs_defer_cancel(
> > +	struct xfs_defer_ops		*dop)
> > +{
> > +	struct xfs_defer_pending	*dfp;
> > +	struct xfs_defer_pending	*pli;
> > +	struct list_head		*pwi;
> > +	struct list_head		*n;
> > +
> > +	/*
> > +	 * Free the pending items.  Caller should already have arranged
> > +	 * for the intent items to be released.
> > +	 */
> > +	list_for_each_entry_safe(dfp, pli, &dop->dop_intake, dfp_list) {
> > +		list_del(&dfp->dfp_list);
> > +		list_for_each_safe(pwi, n, &dfp->dfp_work) {
> > +			list_del(pwi);
> > +			dfp->dfp_count--;
> > +			dfp->dfp_type->cancel_item(pwi);
> > +		}
> > +		ASSERT(dfp->dfp_count == 0);
> > +		kmem_free(dfp);
> > +	}
> > +	list_for_each_entry_safe(dfp, pli, &dop->dop_pending, dfp_list) {
> > +		list_del(&dfp->dfp_list);
> > +		list_for_each_safe(pwi, n, &dfp->dfp_work) {
> > +			list_del(pwi);
> > +			dfp->dfp_count--;
> > +			dfp->dfp_type->cancel_item(pwi);
> > +		}
> > +		ASSERT(dfp->dfp_count == 0);
> > +		kmem_free(dfp);
> > +	}
> > +}
> > +
> > +/* Add an item for later deferred processing. */
> > +void
> > +xfs_defer_add(
> > +	struct xfs_defer_ops		*dop,
> > +	enum xfs_defer_ops_type		type,
> > +	struct list_head		*li)
> > +{
> > +	struct xfs_defer_pending	*dfp = NULL;
> > +
> > +	/*
> > +	 * Add the item to a pending item at the end of the intake list.
> > +	 * If the last pending item has the same type, reuse it.  Else,
> > +	 * create a new pending item at the end of the intake list.
> > +	 */
> > +	if (!list_empty(&dop->dop_intake)) {
> > +		dfp = list_last_entry(&dop->dop_intake,
> > +				struct xfs_defer_pending, dfp_list);
> > +		if (dfp->dfp_type->type != type ||
> > +		    (dfp->dfp_type->max_items &&
> > +		     dfp->dfp_count >= dfp->dfp_type->max_items))
> > +			dfp = NULL;
> > +	}
> > +	if (!dfp) {
> > +		dfp = kmem_alloc(sizeof(struct xfs_defer_pending),
> > +				KM_SLEEP | KM_NOFS);
> > +		dfp->dfp_type = defer_op_types[type];
> > +		dfp->dfp_committed = false;
> > +		dfp->dfp_intent = NULL;
> > +		dfp->dfp_count = 0;
> > +		INIT_LIST_HEAD(&dfp->dfp_work);
> > +		list_add_tail(&dfp->dfp_list, &dop->dop_intake);
> > +	}
> > +
> > +	list_add_tail(li, &dfp->dfp_work);
> > +	dfp->dfp_count++;
> > +}
> > +
> > +/* Register a deferred operation type. */
> > +void
> > +xfs_defer_init_op_type(
> > +	const struct xfs_defer_op_type	*type)
> > +{
> > +	defer_op_types[type->type] = type;
> > +}
> > +
> > +/* Initialize a deferred operation. */
> > +void
> > +xfs_defer_init(
> > +	struct xfs_defer_ops		*dop,
> > +	xfs_fsblock_t			*fbp)
> > +{
> > +	dop->dop_committed = false;
> > +	dop->dop_low = false;
> > +	memset(&dop->dop_inodes, 0, sizeof(dop->dop_inodes));
> > +	*fbp = NULLFSBLOCK;
> > +	INIT_LIST_HEAD(&dop->dop_intake);
> > +	INIT_LIST_HEAD(&dop->dop_pending);
> > +}
> > diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
> > new file mode 100644
> > index 0000000..85c7a3a
> > --- /dev/null
> > +++ b/fs/xfs/libxfs/xfs_defer.h
> > @@ -0,0 +1,96 @@
> > +/*
> > + * Copyright (C) 2016 Oracle.  All Rights Reserved.
> > + *
> > + * Author: Darrick J. Wong <darrick.wong@oracle.com>
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU General Public License
> > + * as published by the Free Software Foundation; either version 2
> > + * of the License, or (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it would be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public License
> > + * along with this program; if not, write the Free Software Foundation,
> > + * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
> > + */
> > +#ifndef __XFS_DEFER_H__
> > +#define	__XFS_DEFER_H__
> > +
> > +struct xfs_defer_op_type;
> > +
> > +/*
> > + * Save a log intent item and a list of extents, so that we can replay
> > + * whatever action had to happen to the extent list and file the log done
> > + * item.
> > + */
> > +struct xfs_defer_pending {
> > +	const struct xfs_defer_op_type	*dfp_type;	/* function pointers */
> > +	struct list_head		dfp_list;	/* pending items */
> > +	bool				dfp_committed;	/* committed trans? */
> > +	void				*dfp_intent;	/* log intent item */
> > +	struct list_head		dfp_work;	/* work items */
> > +	unsigned int			dfp_count;	/* # extent items */
> > +};
> > +
> > +/*
> > + * Header for deferred operation list.
> > + *
> > + * dop_low is used by the allocator to activate the lowspace algorithm -
> > + * when free space is running low the extent allocator may choose to
> > + * allocate an extent from an AG without leaving sufficient space for
> > + * a btree split when inserting the new extent.  In this case the allocator
> > + * will enable the lowspace algorithm which is supposed to allow further
> > + * allocations (such as btree splits and newroots) to allocate from
> > + * sequential AGs.  In order to avoid locking AGs out of order the lowspace
> > + * algorithm will start searching for free space from AG 0.  If the correct
> > + * transaction reservations have been made then this algorithm will eventually
> > + * find all the space it needs.
> > + */
> > +enum xfs_defer_ops_type {
> > +	XFS_DEFER_OPS_TYPE_MAX,
> > +};
> > +
> > +#define XFS_DEFER_OPS_NR_INODES	2	/* join up to two inodes */
> > +
> > +struct xfs_defer_ops {
> > +	bool			dop_committed;	/* did any trans commit? */
> > +	bool			dop_low;	/* alloc in low mode */
> > +	struct list_head	dop_intake;	/* unlogged pending work */
> > +	struct list_head	dop_pending;	/* logged pending work */
> > +
> > +	/* relog these inodes with each roll */
> > +	struct xfs_inode	*dop_inodes[XFS_DEFER_OPS_NR_INODES];
> > +};
> > +
> > +void xfs_defer_add(struct xfs_defer_ops *dop, enum xfs_defer_ops_type type,
> > +		struct list_head *h);
> > +int xfs_defer_finish(struct xfs_trans **tp, struct xfs_defer_ops *dop,
> > +		struct xfs_inode *ip);
> > +void xfs_defer_cancel(struct xfs_defer_ops *dop);
> > +void xfs_defer_init(struct xfs_defer_ops *dop, xfs_fsblock_t *fbp);
> > +bool xfs_defer_has_unfinished_work(struct xfs_defer_ops *dop);
> > +int xfs_defer_join(struct xfs_defer_ops *dop, struct xfs_inode *ip);
> > +
> > +/* Description of a deferred type. */
> > +struct xfs_defer_op_type {
> > +	enum xfs_defer_ops_type	type;
> > +	unsigned int		max_items;
> > +	void (*abort_intent)(void *);
> > +	void *(*create_done)(struct xfs_trans *, void *, unsigned int);
> > +	int (*finish_item)(struct xfs_trans *, struct xfs_defer_ops *,
> > +			struct list_head *, void *, void **);
> > +	void (*finish_cleanup)(struct xfs_trans *, void *, int);
> > +	void (*cancel_item)(struct list_head *);
> > +	int (*diff_items)(void *, struct list_head *, struct list_head *);
> > +	void *(*create_intent)(struct xfs_trans *, uint);
> > +	void (*log_item)(struct xfs_trans *, void *, struct list_head *);
> > +};
> > +
> > +void xfs_defer_init_op_type(const struct xfs_defer_op_type *type);
> > +void xfs_defer_init_types(void);
> > +
> > +#endif /* __XFS_DEFER_H__ */
> > diff --git a/fs/xfs/xfs_defer_item.c b/fs/xfs/xfs_defer_item.c
> > new file mode 100644
> > index 0000000..849088d
> > --- /dev/null
> > +++ b/fs/xfs/xfs_defer_item.c
> > @@ -0,0 +1,36 @@
> > +/*
> > + * Copyright (C) 2016 Oracle.  All Rights Reserved.
> > + *
> > + * Author: Darrick J. Wong <darrick.wong@oracle.com>
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU General Public License
> > + * as published by the Free Software Foundation; either version 2
> > + * of the License, or (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it would be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public License
> > + * along with this program; if not, write the Free Software Foundation,
> > + * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
> > + */
> > +#include "xfs.h"
> > +#include "xfs_fs.h"
> > +#include "xfs_shared.h"
> > +#include "xfs_format.h"
> > +#include "xfs_log_format.h"
> > +#include "xfs_trans_resv.h"
> > +#include "xfs_bit.h"
> > +#include "xfs_sb.h"
> > +#include "xfs_mount.h"
> > +#include "xfs_defer.h"
> > +#include "xfs_trans.h"
> > +
> > +/* Initialize the deferred operation types. */
> > +void
> > +xfs_defer_init_types(void)
> > +{
> > +}
> > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > index 09722a7..bf63f6d 100644
> > --- a/fs/xfs/xfs_super.c
> > +++ b/fs/xfs/xfs_super.c
> > @@ -46,6 +46,7 @@
> >  #include "xfs_quota.h"
> >  #include "xfs_sysfs.h"
> >  #include "xfs_ondisk.h"
> > +#include "xfs_defer.h"
> >  
> >  #include <linux/namei.h>
> >  #include <linux/init.h>
> > @@ -1850,6 +1851,7 @@ init_xfs_fs(void)
> >  	printk(KERN_INFO XFS_VERSION_STRING " with "
> >  			 XFS_BUILD_OPTIONS " enabled\n");
> >  
> > +	xfs_defer_init_types();
> >  	xfs_dir_startup();
> >  
> >  	error = xfs_init_zones();
> > 
> > _______________________________________________
> > xfs mailing list
> > xfs@oss.sgi.com
> > http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 014/119] xfs: introduce interval queries on btrees
  2016-06-22 15:18   ` Brian Foster
@ 2016-06-27 21:07     ` Darrick J. Wong
  2016-06-28 12:32       ` Brian Foster
  0 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-27 21:07 UTC (permalink / raw)
  To: Brian Foster; +Cc: david, linux-fsdevel, vishal.l.verma, xfs

On Wed, Jun 22, 2016 at 11:18:00AM -0400, Brian Foster wrote:
> On Thu, Jun 16, 2016 at 06:19:21PM -0700, Darrick J. Wong wrote:
> > Create a function to enable querying of btree records mapping to a
> > range of keys.  This will be used in subsequent patches to allow
> > querying the reverse mapping btree to find the extents mapped to a
> > range of physical blocks, though the generic code can be used for
> > any range query.
> > 
> > v2: add some shortcuts so that we can jump out of processing once
> > we know there won't be any more records to find.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/libxfs/xfs_btree.c |  249 +++++++++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/libxfs/xfs_btree.h |   22 +++-
> >  fs/xfs/xfs_trace.h        |    1 
> >  3 files changed, 267 insertions(+), 5 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> > index afcafd6..5f5cf23 100644
> > --- a/fs/xfs/libxfs/xfs_btree.c
> > +++ b/fs/xfs/libxfs/xfs_btree.c
> > @@ -4509,3 +4509,252 @@ xfs_btree_calc_size(
> >  	}
> >  	return rval;
> >  }
> > +
> > +/* Query a regular btree for all records overlapping a given interval. */
> 
> Can you elaborate on the search algorithm used? (More for reference
> against the overlapped query, as that one is more complex).

Ok.  Both query_range functions aim to return all records intersecting the
given range.

For non-overlapped btrees, we start with a LE lookup of the low key and
return each record we find until we reach a record with a key greater than
the high key.

For overlapped btrees, we follow the procedure in the "Interval trees"
section of _Introduction to Algorithms_, which is 14.3 in the 2nd and
3rd editions.  The query algorithm is roughly as follows:

For any leaf btree node, generate the low and high keys for each record.
If there's a range overlap with the query's low and high keys, pass the
record to the iterator function.

For any internal btree node, compare the low and high keys for each pointer
against the query's low and high keys.  If there's an overlap, follow the
pointer downwards in the tree.

(I could render the figures in the book as ASCII art if anyone wants.)

> 
> > +STATIC int
> > +xfs_btree_simple_query_range(
> > +	struct xfs_btree_cur		*cur,
> > +	union xfs_btree_irec		*low_rec,
> > +	union xfs_btree_irec		*high_rec,
> > +	xfs_btree_query_range_fn	fn,
> > +	void				*priv)
> > +{
> > +	union xfs_btree_rec		*recp;
> > +	union xfs_btree_rec		rec;
> > +	union xfs_btree_key		low_key;
> > +	union xfs_btree_key		high_key;
> > +	union xfs_btree_key		rec_key;
> > +	__int64_t			diff;
> > +	int				stat;
> > +	bool				firstrec = true;
> > +	int				error;
> > +
> > +	ASSERT(cur->bc_ops->init_high_key_from_rec);
> > +
> > +	/* Find the keys of both ends of the interval. */
> > +	cur->bc_rec = *high_rec;
> > +	cur->bc_ops->init_rec_from_cur(cur, &rec);
> > +	cur->bc_ops->init_key_from_rec(&high_key, &rec);
> > +
> > +	cur->bc_rec = *low_rec;
> > +	cur->bc_ops->init_rec_from_cur(cur, &rec);
> > +	cur->bc_ops->init_key_from_rec(&low_key, &rec);
> > +
> > +	/* Find the leftmost record. */
> > +	stat = 0;
> > +	error = xfs_btree_lookup(cur, XFS_LOOKUP_LE, &stat);
> > +	if (error)
> > +		goto out;
> > +
> > +	while (stat) {
> > +		/* Find the record. */
> > +		error = xfs_btree_get_rec(cur, &recp, &stat);
> > +		if (error || !stat)
> > +			break;
> > +
> > +		/* Can we tell if this record is too low? */
> > +		if (firstrec) {
> > +			cur->bc_rec = *low_rec;
> > +			cur->bc_ops->init_high_key_from_rec(&rec_key, recp);
> > +			diff = cur->bc_ops->key_diff(cur, &rec_key);
> > +			if (diff < 0)
> > +				goto advloop;
> > +		}
> > +		firstrec = false;
> 
> This could move up into the if block.

Ok.

> > +
> > +		/* Have we gone past the end? */
> > +		cur->bc_rec = *high_rec;
> > +		cur->bc_ops->init_key_from_rec(&rec_key, recp);
> 
> I'd move this up to immediately after the xfs_btree_get_rec() call and
> eliminate the duplicate in the 'if (firstrec)' block above.

Ok.  That key ought to be named rec_hkey too.

> > +		diff = cur->bc_ops->key_diff(cur, &rec_key);
> > +		if (diff > 0)
> > +			break;
> > +
> > +		/* Callback */
> > +		error = fn(cur, recp, priv);
> > +		if (error < 0 || error == XFS_BTREE_QUERY_RANGE_ABORT)
> > +			break;
> > +
> > +advloop:
> > +		/* Move on to the next record. */
> > +		error = xfs_btree_increment(cur, 0, &stat);
> > +		if (error)
> > +			break;
> > +	}
> > +
> > +out:
> > +	return error;
> > +}
> > +
> > +/*
> > + * Query an overlapped interval btree for all records overlapping a given
> > + * interval.
> > + */
> 
> Same comment here, can you elaborate on the search algorithm? Also, I
> think an example or generic description of the rules around what records
> this query returns (e.g., low_rec/high_rec vs. record low/high keys)
> would be useful, particularly since I, at least, don't have much context
> on the rmap+reflink scenarios quite yet.

Let's say you have a bunch of (overlapped) rmap records:

1: +- file A startblock B offset C length D -----------+
2:      +- file E startblock F offset G length H --------------+
3:      +- file I startblock F offset J length K --+
4:                                                        +- file L... --+

Now say we want to map block (B+D) into file A at offset (C+D).  Ideally, we'd
simply increment the length of record 1.  But how do we find that record that
ends at (B+D-1)?  A LE lookup of (B+D-1) would return record 3 because the
keys are ordered first by startblock.  An interval query would return records
1 and 2 because they both overlap (B+D-1), and from that we can pick out
record 1 as the appropriate left neighbor.

In the non-overlapped case you can do a LE lookup and decrement the cursor
because a record's interval must end before the next record.

> > +STATIC int
> > +xfs_btree_overlapped_query_range(
> > +	struct xfs_btree_cur		*cur,
> > +	union xfs_btree_irec		*low_rec,
> > +	union xfs_btree_irec		*high_rec,
> > +	xfs_btree_query_range_fn	fn,
> > +	void				*priv)
> > +{
> > +	union xfs_btree_ptr		ptr;
> > +	union xfs_btree_ptr		*pp;
> > +	union xfs_btree_key		rec_key;
> > +	union xfs_btree_key		low_key;
> > +	union xfs_btree_key		high_key;
> > +	union xfs_btree_key		*lkp;
> > +	union xfs_btree_key		*hkp;
> > +	union xfs_btree_rec		rec;
> > +	union xfs_btree_rec		*recp;
> > +	struct xfs_btree_block		*block;
> > +	__int64_t			ldiff;
> > +	__int64_t			hdiff;
> > +	int				level;
> > +	struct xfs_buf			*bp;
> > +	int				i;
> > +	int				error;
> > +
> > +	/* Find the keys of both ends of the interval. */
> > +	cur->bc_rec = *high_rec;
> > +	cur->bc_ops->init_rec_from_cur(cur, &rec);
> > +	cur->bc_ops->init_key_from_rec(&high_key, &rec);
> > +
> > +	cur->bc_rec = *low_rec;
> > +	cur->bc_ops->init_rec_from_cur(cur, &rec);
> > +	cur->bc_ops->init_key_from_rec(&low_key, &rec);
> > +
> > +	/* Load the root of the btree. */
> > +	level = cur->bc_nlevels - 1;
> > +	cur->bc_ops->init_ptr_from_cur(cur, &ptr);
> > +	error = xfs_btree_lookup_get_block(cur, level, &ptr, &block);
> > +	if (error)
> > +		return error;
> > +	xfs_btree_get_block(cur, level, &bp);
> > +	trace_xfs_btree_overlapped_query_range(cur, level, bp);
> > +#ifdef DEBUG
> > +	error = xfs_btree_check_block(cur, block, level, bp);
> > +	if (error)
> > +		goto out;
> > +#endif
> > +	cur->bc_ptrs[level] = 1;
> > +
> > +	while (level < cur->bc_nlevels) {
> > +		block = XFS_BUF_TO_BLOCK(cur->bc_bufs[level]);
> > +
> > +		if (level == 0) {
> > +			/* End of leaf, pop back towards the root. */
> > +			if (cur->bc_ptrs[level] >
> > +			    be16_to_cpu(block->bb_numrecs)) {
> > +leaf_pop_up:
> > +				if (level < cur->bc_nlevels - 1)
> > +					cur->bc_ptrs[level + 1]++;
> > +				level++;
> > +				continue;
> > +			}
> > +
> > +			recp = xfs_btree_rec_addr(cur, cur->bc_ptrs[0], block);
> > +
> > +			cur->bc_ops->init_high_key_from_rec(&rec_key, recp);
> > +			ldiff = cur->bc_ops->diff_two_keys(cur, &low_key,
> > +					&rec_key);
> > +
> > +			cur->bc_ops->init_key_from_rec(&rec_key, recp);
> > +			hdiff = cur->bc_ops->diff_two_keys(cur, &rec_key,
> > +					&high_key);
> > +
> 
> This looked a little funny to me because I expected diff_two_keys() to
> basically be param1 - param2. Looking ahead at the rmapbt code, it is in
> fact the other way around. I'm not sure we have precedent for either
> way, tbh. I still have to stare at this some more, but I wonder if a
> "does record overlap" helper (with comments) would help clean this up a
> bit.

You're correct this is exactly the opposite of the compare functions in
the C library and the rest of the kernel.  I'll fix that up.

> > +			/* If the record matches, callback */
> > +			if (ldiff >= 0 && hdiff >= 0) {

Ok, I'll make it a little clearer what we're testing here:

/*
 * If (record's high key >= query's low key) and
 *    (query's high key >= record's low key), then
 * this record overlaps the query range, so callback.
 */


> > +				error = fn(cur, recp, priv);
> > +				if (error < 0 ||
> > +				    error == XFS_BTREE_QUERY_RANGE_ABORT)
> > +					break;
> > +			} else if (hdiff < 0) {
> > +				/* Record is larger than high key; pop. */
> > +				goto leaf_pop_up;
> > +			}
> > +			cur->bc_ptrs[level]++;
> > +			continue;
> > +		}
> > +
> > +		/* End of node, pop back towards the root. */
> > +		if (cur->bc_ptrs[level] > be16_to_cpu(block->bb_numrecs)) {
> > +node_pop_up:
> > +			if (level < cur->bc_nlevels - 1)
> > +				cur->bc_ptrs[level + 1]++;
> > +			level++;
> > +			continue;
> 
> Looks like same code as leaf_pop_up. I wonder if we can bury this at the
> end of the loop with a common label.

Yep.

> > +		}
> > +
> > +		lkp = xfs_btree_key_addr(cur, cur->bc_ptrs[level], block);
> > +		hkp = xfs_btree_high_key_addr(cur, cur->bc_ptrs[level], block);
> > +		pp = xfs_btree_ptr_addr(cur, cur->bc_ptrs[level], block);
> > +
> > +		ldiff = cur->bc_ops->diff_two_keys(cur, &low_key, hkp);
> > +		hdiff = cur->bc_ops->diff_two_keys(cur, lkp, &high_key);
> > +
> > +		/* If the key matches, drill another level deeper. */
> > +		if (ldiff >= 0 && hdiff >= 0) {
> > +			level--;
> > +			error = xfs_btree_lookup_get_block(cur, level, pp,
> > +					&block);
> > +			if (error)
> > +				goto out;
> > +			xfs_btree_get_block(cur, level, &bp);
> > +			trace_xfs_btree_overlapped_query_range(cur, level, bp);
> > +#ifdef DEBUG
> > +			error = xfs_btree_check_block(cur, block, level, bp);
> > +			if (error)
> > +				goto out;
> > +#endif
> > +			cur->bc_ptrs[level] = 1;
> > +			continue;
> > +		} else if (hdiff < 0) {
> > +			/* The low key is larger than the upper range; pop. */
> > +			goto node_pop_up;
> > +		}
> > +		cur->bc_ptrs[level]++;
> > +	}
> > +
> > +out:
> > +	/*
> > +	 * If we don't end this function with the cursor pointing at a record
> > +	 * block, a subsequent non-error cursor deletion will not release
> > +	 * node-level buffers, causing a buffer leak.  This is quite possible
> > +	 * with a zero-results range query, so release the buffers if we
> > +	 * failed to return any results.
> > +	 */
> > +	if (cur->bc_bufs[0] == NULL) {
> > +		for (i = 0; i < cur->bc_nlevels; i++) {
> > +			if (cur->bc_bufs[i]) {
> > +				xfs_trans_brelse(cur->bc_tp, cur->bc_bufs[i]);
> > +				cur->bc_bufs[i] = NULL;
> > +				cur->bc_ptrs[i] = 0;
> > +				cur->bc_ra[i] = 0;
> > +			}
> > +		}
> > +	}
> > +
> > +	return error;
> > +}
> > +
> > +/*
> > + * Query a btree for all records overlapping a given interval of keys.  The
> > + * supplied function will be called with each record found; return one of the
> > + * XFS_BTREE_QUERY_RANGE_{CONTINUE,ABORT} values or the usual negative error
> > + * code.  This function returns XFS_BTREE_QUERY_RANGE_ABORT, zero, or a
> > + * negative error code.
> > + */
> > +int
> > +xfs_btree_query_range(
> > +	struct xfs_btree_cur		*cur,
> > +	union xfs_btree_irec		*low_rec,
> > +	union xfs_btree_irec		*high_rec,
> > +	xfs_btree_query_range_fn	fn,
> > +	void				*priv)
> > +{
> > +	if (!(cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING))
> > +		return xfs_btree_simple_query_range(cur, low_rec,
> > +				high_rec, fn, priv);
> > +	return xfs_btree_overlapped_query_range(cur, low_rec, high_rec,
> > +			fn, priv);
> > +}
> > diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> > index a5ec6c7..898fee5 100644
> > --- a/fs/xfs/libxfs/xfs_btree.h
> > +++ b/fs/xfs/libxfs/xfs_btree.h
> > @@ -206,6 +206,12 @@ struct xfs_btree_ops {
> >  #define LASTREC_DELREC	2
> >  
> >  
> > +union xfs_btree_irec {
> > +	xfs_alloc_rec_incore_t		a;
> > +	xfs_bmbt_irec_t			b;
> > +	xfs_inobt_rec_incore_t		i;
> > +};
> > +
> 
> We might as well kill off the typedef usage here.

Ok.  Thx for the review!

--D

> 
> Brian
> 
> >  /*
> >   * Btree cursor structure.
> >   * This collects all information needed by the btree code in one place.
> > @@ -216,11 +222,7 @@ typedef struct xfs_btree_cur
> >  	struct xfs_mount	*bc_mp;	/* file system mount struct */
> >  	const struct xfs_btree_ops *bc_ops;
> >  	uint			bc_flags; /* btree features - below */
> > -	union {
> > -		xfs_alloc_rec_incore_t	a;
> > -		xfs_bmbt_irec_t		b;
> > -		xfs_inobt_rec_incore_t	i;
> > -	}		bc_rec;		/* current insert/search record value */
> > +	union xfs_btree_irec	bc_rec;	/* current insert/search record value */
> >  	struct xfs_buf	*bc_bufs[XFS_BTREE_MAXLEVELS];	/* buf ptr per level */
> >  	int		bc_ptrs[XFS_BTREE_MAXLEVELS];	/* key/record # */
> >  	__uint8_t	bc_ra[XFS_BTREE_MAXLEVELS];	/* readahead bits */
> > @@ -494,4 +496,14 @@ xfs_extlen_t xfs_btree_calc_size(struct xfs_mount *mp, uint *limits,
> >  uint xfs_btree_compute_maxlevels(struct xfs_mount *mp, uint *limits,
> >  		unsigned long len);
> >  
> > +/* return codes */
> > +#define XFS_BTREE_QUERY_RANGE_CONTINUE	0	/* keep iterating */
> > +#define XFS_BTREE_QUERY_RANGE_ABORT	1	/* stop iterating */
> > +typedef int (*xfs_btree_query_range_fn)(struct xfs_btree_cur *cur,
> > +		union xfs_btree_rec *rec, void *priv);
> > +
> > +int xfs_btree_query_range(struct xfs_btree_cur *cur,
> > +		union xfs_btree_irec *low_rec, union xfs_btree_irec *high_rec,
> > +		xfs_btree_query_range_fn fn, void *priv);
> > +
> >  #endif	/* __XFS_BTREE_H__ */
> > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > index ffea28c..f0ac9c9 100644
> > --- a/fs/xfs/xfs_trace.h
> > +++ b/fs/xfs/xfs_trace.h
> > @@ -2218,6 +2218,7 @@ DEFINE_EVENT(xfs_btree_cur_class, name, \
> >  	TP_PROTO(struct xfs_btree_cur *cur, int level, struct xfs_buf *bp), \
> >  	TP_ARGS(cur, level, bp))
> >  DEFINE_BTREE_CUR_EVENT(xfs_btree_updkeys);
> > +DEFINE_BTREE_CUR_EVENT(xfs_btree_overlapped_query_range);
> >  
> >  #endif /* _TRACE_XFS_H */
> >  
> > 

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 018/119] xfs: enable the xfs_defer mechanism to process extents to free
  2016-06-27 13:15   ` Brian Foster
@ 2016-06-27 21:41     ` Darrick J. Wong
  2016-06-27 22:00       ` Darrick J. Wong
  0 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-27 21:41 UTC (permalink / raw)
  To: Brian Foster; +Cc: david, linux-fsdevel, vishal.l.verma, xfs

On Mon, Jun 27, 2016 at 09:15:08AM -0400, Brian Foster wrote:
> On Thu, Jun 16, 2016 at 06:19:47PM -0700, Darrick J. Wong wrote:
> > Connect the xfs_defer mechanism with the pieces that we'll need to
> > handle deferred extent freeing.  We'll wire up the existing code to
> > our new deferred mechanism later.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> 
> Could we merge this with the xfs_trans_*efi/efd* bits? We'd need to
> preserve some calls for recovery, but it looks like other parts are only
> used by the deferred ops infrastructure at this point.

Yes, we could replace xfs_bmap_free_create_{intent,done} with
xfs_trans_get_ef[id] and lose the silly functions.  I'll go take
care of all of them.

--D

> 
> Brian
> 
> >  fs/xfs/libxfs/xfs_defer.h |    1 
> >  fs/xfs/xfs_defer_item.c   |  108 +++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 109 insertions(+)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
> > index 85c7a3a..743fc32 100644
> > --- a/fs/xfs/libxfs/xfs_defer.h
> > +++ b/fs/xfs/libxfs/xfs_defer.h
> > @@ -51,6 +51,7 @@ struct xfs_defer_pending {
> >   * find all the space it needs.
> >   */
> >  enum xfs_defer_ops_type {
> > +	XFS_DEFER_OPS_TYPE_FREE,
> >  	XFS_DEFER_OPS_TYPE_MAX,
> >  };
> >  
> > diff --git a/fs/xfs/xfs_defer_item.c b/fs/xfs/xfs_defer_item.c
> > index 4c2ba28..127a54e 100644
> > --- a/fs/xfs/xfs_defer_item.c
> > +++ b/fs/xfs/xfs_defer_item.c
> > @@ -29,9 +29,117 @@
> >  #include "xfs_defer.h"
> >  #include "xfs_trans.h"
> >  #include "xfs_trace.h"
> > +#include "xfs_bmap.h"
> > +#include "xfs_extfree_item.h"
> > +
> > +/* Extent Freeing */
> > +
> > +/* Sort bmap items by AG. */
> > +static int
> > +xfs_bmap_free_diff_items(
> > +	void				*priv,
> > +	struct list_head		*a,
> > +	struct list_head		*b)
> > +{
> > +	struct xfs_mount		*mp = priv;
> > +	struct xfs_bmap_free_item	*ra;
> > +	struct xfs_bmap_free_item	*rb;
> > +
> > +	ra = container_of(a, struct xfs_bmap_free_item, xbfi_list);
> > +	rb = container_of(b, struct xfs_bmap_free_item, xbfi_list);
> > +	return  XFS_FSB_TO_AGNO(mp, ra->xbfi_startblock) -
> > +		XFS_FSB_TO_AGNO(mp, rb->xbfi_startblock);
> > +}
> > +
> > +/* Get an EFI. */
> > +STATIC void *
> > +xfs_bmap_free_create_intent(
> > +	struct xfs_trans		*tp,
> > +	unsigned int			count)
> > +{
> > +	return xfs_trans_get_efi(tp, count);
> > +}
> > +
> > +/* Log a free extent to the intent item. */
> > +STATIC void
> > +xfs_bmap_free_log_item(
> > +	struct xfs_trans		*tp,
> > +	void				*intent,
> > +	struct list_head		*item)
> > +{
> > +	struct xfs_bmap_free_item	*free;
> > +
> > +	free = container_of(item, struct xfs_bmap_free_item, xbfi_list);
> > +	xfs_trans_log_efi_extent(tp, intent, free->xbfi_startblock,
> > +			free->xbfi_blockcount);
> > +}
> > +
> > +/* Get an EFD so we can process all the free extents. */
> > +STATIC void *
> > +xfs_bmap_free_create_done(
> > +	struct xfs_trans		*tp,
> > +	void				*intent,
> > +	unsigned int			count)
> > +{
> > +	return xfs_trans_get_efd(tp, intent, count);
> > +}
> > +
> > +/* Process a free extent. */
> > +STATIC int
> > +xfs_bmap_free_finish_item(
> > +	struct xfs_trans		*tp,
> > +	struct xfs_defer_ops		*dop,
> > +	struct list_head		*item,
> > +	void				*done_item,
> > +	void				**state)
> > +{
> > +	struct xfs_bmap_free_item	*free;
> > +	int				error;
> > +
> > +	free = container_of(item, struct xfs_bmap_free_item, xbfi_list);
> > +	error = xfs_trans_free_extent(tp, done_item,
> > +			free->xbfi_startblock,
> > +			free->xbfi_blockcount);
> > +	kmem_free(free);
> > +	return error;
> > +}
> > +
> > +/* Abort all pending EFIs. */
> > +STATIC void
> > +xfs_bmap_free_abort_intent(
> > +	void				*intent)
> > +{
> > +	xfs_efi_release(intent);
> > +}
> > +
> > +/* Cancel a free extent. */
> > +STATIC void
> > +xfs_bmap_free_cancel_item(
> > +	struct list_head		*item)
> > +{
> > +	struct xfs_bmap_free_item	*free;
> > +
> > +	free = container_of(item, struct xfs_bmap_free_item, xbfi_list);
> > +	kmem_free(free);
> > +}
> > +
> > +const struct xfs_defer_op_type xfs_extent_free_defer_type = {
> > +	.type		= XFS_DEFER_OPS_TYPE_FREE,
> > +	.max_items	= XFS_EFI_MAX_FAST_EXTENTS,
> > +	.diff_items	= xfs_bmap_free_diff_items,
> > +	.create_intent	= xfs_bmap_free_create_intent,
> > +	.abort_intent	= xfs_bmap_free_abort_intent,
> > +	.log_item	= xfs_bmap_free_log_item,
> > +	.create_done	= xfs_bmap_free_create_done,
> > +	.finish_item	= xfs_bmap_free_finish_item,
> > +	.cancel_item	= xfs_bmap_free_cancel_item,
> > +};
> > +
> > +/* Deferred Item Initialization */
> >  
> >  /* Initialize the deferred operation types. */
> >  void
> >  xfs_defer_init_types(void)
> >  {
> > +	xfs_defer_init_op_type(&xfs_extent_free_defer_type);
> >  }
> > 

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 018/119] xfs: enable the xfs_defer mechanism to process extents to free
  2016-06-27 21:41     ` Darrick J. Wong
@ 2016-06-27 22:00       ` Darrick J. Wong
  2016-06-28 12:32         ` Brian Foster
  0 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-27 22:00 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, vishal.l.verma, xfs

On Mon, Jun 27, 2016 at 02:41:45PM -0700, Darrick J. Wong wrote:
> On Mon, Jun 27, 2016 at 09:15:08AM -0400, Brian Foster wrote:
> > On Thu, Jun 16, 2016 at 06:19:47PM -0700, Darrick J. Wong wrote:
> > > Connect the xfs_defer mechanism with the pieces that we'll need to
> > > handle deferred extent freeing.  We'll wire up the existing code to
> > > our new deferred mechanism later.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > 
> > Could we merge this with the xfs_trans_*efi/efd* bits? We'd need to
> > preserve some calls for recovery, but it looks like other parts are only
> > used by the deferred ops infrastructure at this point.
> 
> Yes, we could replace xfs_bmap_free_create_{intent,done} with
> xfs_trans_get_ef[id] and lose the silly functions.  I'll go take
> care of all of them.

Hah, gcc complains about the mismatch in pointer types for the second
argument.

fs/xfs/xfs_defer_item.c:504:17: error: initialization from incompatible pointer type [-Werror=incompatible-pointer-types]
  .create_done = xfs_trans_get_bud,
                 ^
fs/xfs/xfs_defer_item.c:504:17: note: (near initialization for ‘xfs_bmap_update_defer_type.create_done’)

I guess one could put in a cast to coerce the types, at the cost of
uglifying the code.  <shrug> Opinions?
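For anyone following along, here's a minimal stand-alone reproduction of
the type clash, sketched with illustrative stand-in structs and names
rather than the real XFS definitions:

```c
/* Hypothetical minimal reproduction of the incompatible-pointer-types
 * error quoted above.  The ops table takes void * so one table layout
 * can serve many item types, but the concrete helper is typed against
 * specific structs. */
struct intent { int id; };
struct done  { int id; };

struct op_type {
	void *(*create_done)(void *intent, unsigned int count);
};

static struct done *get_done(struct intent *ip, unsigned int count)
{
	static struct done d;

	d.id = ip->id + (int)count;
	return &d;
}

/* Without the cast, this initializer trips the gcc error; with it, the
 * build is clean.  Note that calling through a function pointer of a
 * mismatched type is technically undefined behavior in standard C,
 * even though it works on the usual ABIs. */
static const struct op_type example_type = {
	.create_done = (void *(*)(void *, unsigned int))get_done,
};
```

The alternative is to keep thin wrapper functions with the void *
signatures, which keeps the types honest at the cost of the "silly
functions" mentioned earlier in the thread.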

--D

> 
> --D
> 
> > 
> > Brian
> > 
> > >  fs/xfs/libxfs/xfs_defer.h |    1 
> > >  fs/xfs/xfs_defer_item.c   |  108 +++++++++++++++++++++++++++++++++++++++++++++
> > >  2 files changed, 109 insertions(+)
> > > 
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
> > > index 85c7a3a..743fc32 100644
> > > --- a/fs/xfs/libxfs/xfs_defer.h
> > > +++ b/fs/xfs/libxfs/xfs_defer.h
> > > @@ -51,6 +51,7 @@ struct xfs_defer_pending {
> > >   * find all the space it needs.
> > >   */
> > >  enum xfs_defer_ops_type {
> > > +	XFS_DEFER_OPS_TYPE_FREE,
> > >  	XFS_DEFER_OPS_TYPE_MAX,
> > >  };
> > >  
> > > diff --git a/fs/xfs/xfs_defer_item.c b/fs/xfs/xfs_defer_item.c
> > > index 4c2ba28..127a54e 100644
> > > --- a/fs/xfs/xfs_defer_item.c
> > > +++ b/fs/xfs/xfs_defer_item.c
> > > @@ -29,9 +29,117 @@
> > >  #include "xfs_defer.h"
> > >  #include "xfs_trans.h"
> > >  #include "xfs_trace.h"
> > > +#include "xfs_bmap.h"
> > > +#include "xfs_extfree_item.h"
> > > +
> > > +/* Extent Freeing */
> > > +
> > > +/* Sort bmap items by AG. */
> > > +static int
> > > +xfs_bmap_free_diff_items(
> > > +	void				*priv,
> > > +	struct list_head		*a,
> > > +	struct list_head		*b)
> > > +{
> > > +	struct xfs_mount		*mp = priv;
> > > +	struct xfs_bmap_free_item	*ra;
> > > +	struct xfs_bmap_free_item	*rb;
> > > +
> > > +	ra = container_of(a, struct xfs_bmap_free_item, xbfi_list);
> > > +	rb = container_of(b, struct xfs_bmap_free_item, xbfi_list);
> > > +	return  XFS_FSB_TO_AGNO(mp, ra->xbfi_startblock) -
> > > +		XFS_FSB_TO_AGNO(mp, rb->xbfi_startblock);
> > > +}
> > > +
> > > +/* Get an EFI. */
> > > +STATIC void *
> > > +xfs_bmap_free_create_intent(
> > > +	struct xfs_trans		*tp,
> > > +	unsigned int			count)
> > > +{
> > > +	return xfs_trans_get_efi(tp, count);
> > > +}
> > > +
> > > +/* Log a free extent to the intent item. */
> > > +STATIC void
> > > +xfs_bmap_free_log_item(
> > > +	struct xfs_trans		*tp,
> > > +	void				*intent,
> > > +	struct list_head		*item)
> > > +{
> > > +	struct xfs_bmap_free_item	*free;
> > > +
> > > +	free = container_of(item, struct xfs_bmap_free_item, xbfi_list);
> > > +	xfs_trans_log_efi_extent(tp, intent, free->xbfi_startblock,
> > > +			free->xbfi_blockcount);
> > > +}
> > > +
> > > +/* Get an EFD so we can process all the free extents. */
> > > +STATIC void *
> > > +xfs_bmap_free_create_done(
> > > +	struct xfs_trans		*tp,
> > > +	void				*intent,
> > > +	unsigned int			count)
> > > +{
> > > +	return xfs_trans_get_efd(tp, intent, count);
> > > +}
> > > +
> > > +/* Process a free extent. */
> > > +STATIC int
> > > +xfs_bmap_free_finish_item(
> > > +	struct xfs_trans		*tp,
> > > +	struct xfs_defer_ops		*dop,
> > > +	struct list_head		*item,
> > > +	void				*done_item,
> > > +	void				**state)
> > > +{
> > > +	struct xfs_bmap_free_item	*free;
> > > +	int				error;
> > > +
> > > +	free = container_of(item, struct xfs_bmap_free_item, xbfi_list);
> > > +	error = xfs_trans_free_extent(tp, done_item,
> > > +			free->xbfi_startblock,
> > > +			free->xbfi_blockcount);
> > > +	kmem_free(free);
> > > +	return error;
> > > +}
> > > +
> > > +/* Abort all pending EFIs. */
> > > +STATIC void
> > > +xfs_bmap_free_abort_intent(
> > > +	void				*intent)
> > > +{
> > > +	xfs_efi_release(intent);
> > > +}
> > > +
> > > +/* Cancel a free extent. */
> > > +STATIC void
> > > +xfs_bmap_free_cancel_item(
> > > +	struct list_head		*item)
> > > +{
> > > +	struct xfs_bmap_free_item	*free;
> > > +
> > > +	free = container_of(item, struct xfs_bmap_free_item, xbfi_list);
> > > +	kmem_free(free);
> > > +}
> > > +
> > > +const struct xfs_defer_op_type xfs_extent_free_defer_type = {
> > > +	.type		= XFS_DEFER_OPS_TYPE_FREE,
> > > +	.max_items	= XFS_EFI_MAX_FAST_EXTENTS,
> > > +	.diff_items	= xfs_bmap_free_diff_items,
> > > +	.create_intent	= xfs_bmap_free_create_intent,
> > > +	.abort_intent	= xfs_bmap_free_abort_intent,
> > > +	.log_item	= xfs_bmap_free_log_item,
> > > +	.create_done	= xfs_bmap_free_create_done,
> > > +	.finish_item	= xfs_bmap_free_finish_item,
> > > +	.cancel_item	= xfs_bmap_free_cancel_item,
> > > +};
> > > +
> > > +/* Deferred Item Initialization */
> > >  
> > >  /* Initialize the deferred operation types. */
> > >  void
> > >  xfs_defer_init_types(void)
> > >  {
> > > +	xfs_defer_init_op_type(&xfs_extent_free_defer_type);
> > >  }
> > > 
> 

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 012/119] xfs: during btree split, save new block key & ptr for future insertion
  2016-06-21 13:00   ` Brian Foster
@ 2016-06-27 22:30     ` Darrick J. Wong
  2016-06-28 12:31       ` Brian Foster
  0 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-27 22:30 UTC (permalink / raw)
  To: Brian Foster; +Cc: david, linux-fsdevel, vishal.l.verma, xfs

On Tue, Jun 21, 2016 at 09:00:45AM -0400, Brian Foster wrote:
> On Thu, Jun 16, 2016 at 06:19:08PM -0700, Darrick J. Wong wrote:
> > When a btree block has to be split, we pass the new block's ptr from
> > xfs_btree_split() back to xfs_btree_insert() via a pointer parameter;
> > however, we pass the block's key through the cursor's record.  It is a
> > little weird to "initialize" a record from a key since the non-key
> > attributes will have garbage values.
> > 
> > When we go to add support for interval queries, we have to be able to
> > pass the lowest and highest keys accessible via a pointer.  There's no
> > clean way to pass this back through the cursor's record field.
> > Therefore, pass the key directly back to xfs_btree_insert() the same
> > way that we pass the btree_ptr.
> > 
> > As a bonus, we no longer need init_rec_from_key and can drop it from the
> > codebase.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/libxfs/xfs_alloc_btree.c  |   12 ----------
> >  fs/xfs/libxfs/xfs_bmap_btree.c   |   12 ----------
> >  fs/xfs/libxfs/xfs_btree.c        |   44 +++++++++++++++++++-------------------
> >  fs/xfs/libxfs/xfs_btree.h        |    2 --
> >  fs/xfs/libxfs/xfs_ialloc_btree.c |   10 ---------
> >  5 files changed, 22 insertions(+), 58 deletions(-)
> > 
> > 
> ...
> > diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> > index 046fbcf..a096539 100644
> > --- a/fs/xfs/libxfs/xfs_btree.c
> > +++ b/fs/xfs/libxfs/xfs_btree.c
> ...
> > @@ -2929,16 +2927,16 @@ xfs_btree_insrec(
> >  	struct xfs_btree_cur	*cur,	/* btree cursor */
> >  	int			level,	/* level to insert record at */
> >  	union xfs_btree_ptr	*ptrp,	/* i/o: block number inserted */
> > -	union xfs_btree_rec	*recp,	/* i/o: record data inserted */
> > +	union xfs_btree_key	*key,	/* i/o: block key for ptrp */
> >  	struct xfs_btree_cur	**curp,	/* output: new cursor replacing cur */
> >  	int			*stat)	/* success/failure */
> >  {
> >  	struct xfs_btree_block	*block;	/* btree block */
> >  	struct xfs_buf		*bp;	/* buffer for block */
> > -	union xfs_btree_key	key;	/* btree key */
> >  	union xfs_btree_ptr	nptr;	/* new block ptr */
> >  	struct xfs_btree_cur	*ncur;	/* new btree cursor */
> > -	union xfs_btree_rec	nrec;	/* new record count */
> > +	union xfs_btree_key	nkey;	/* new block key */
> > +	union xfs_btree_rec	rec;	/* record to insert */
> >  	int			optr;	/* old key/record index */
> >  	int			ptr;	/* key/record index */
> >  	int			numrecs;/* number of records */
> > @@ -2947,8 +2945,14 @@ xfs_btree_insrec(
> >  	int			i;
> >  #endif
> >  
> > +	/* Make a key out of the record data to be inserted, and save it. */
> > +	if (level == 0) {
> > +		cur->bc_ops->init_rec_from_cur(cur, &rec);
> > +		cur->bc_ops->init_key_from_rec(key, &rec);
> > +	}
> 
> The level == 0 check looks a bit hacky to me. IOW, I think it's cleaner
> that the key is initialized once in the caller rather than check for a
> particular iteration down in xfs_btree_insrec(). That said,
> xfs_btree_insrec() still needs rec initialized in the level == 0 case.
> 
> I wonder if we could create an inline xfs_btree_init_key_from_cur()
> helper to combine the above calls, invoke it once in xfs_btree_insert(),
> then push down the ->init_rec_from_cur() calls to the contexts further
> down in this function where rec is actually required. There are only two
> and one of them is DEBUG code. Thoughts?

How about I make btree_insert set both &key and &rec at the start and
pass them both into btree_insrec?  That would eliminate the hacky check
above and fix the dummy tracing hook too, in case it ever does anything.
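Roughly, the shape I have in mind -- sketched here with simplified
stand-in types rather than the real xfs_btree structures -- is to derive
the key from the record once in the caller and pass both down, so the
recursive insert helper no longer needs a level == 0 special case:

```c
/* Illustrative sketch only: caller-side initialization of both the key
 * and the record, instead of deriving them inside the helper on the
 * first (level 0) iteration. */
struct rec { int start; int len; };
struct key { int start; };

static void init_key_from_rec(struct key *k, const struct rec *r)
{
	k->start = r->start;
}

/* At level 0 the record itself is inserted; at higher levels only the
 * key is consulted.  Either way, both arrive fully initialized. */
static int insrec(int level, const struct key *k, const struct rec *r)
{
	if (level == 0)
		return r->start + r->len;	/* pretend: insert record */
	return k->start;			/* pretend: insert key/ptr */
}

static int insert(const struct rec *r)
{
	struct key k;

	/* Initialize once, up front, instead of inside insrec(). */
	init_key_from_rec(&k, r);
	return insrec(0, &k, r);
}
```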

> 
> > +
> >  	XFS_BTREE_TRACE_CURSOR(cur, XBT_ENTRY);
> > -	XFS_BTREE_TRACE_ARGIPR(cur, level, *ptrp, recp);
> > +	XFS_BTREE_TRACE_ARGIPR(cur, level, *ptrp, &rec);
> >  
> 
> So these look like unimplemented dummy tracing hooks. It sounds like
> previously rec could have a junk value after a btree split, but now it
> looks like rec is junk for every non-zero level. Kind of annoying, I
> wonder if we can just kill these.. :/

<shrug> I have no opinion either way. :)

--D
> 
> Brian
> 
> >  	ncur = NULL;
> >  
> > @@ -2973,9 +2977,6 @@ xfs_btree_insrec(
> >  		return 0;
> >  	}
> >  
> > -	/* Make a key out of the record data to be inserted, and save it. */
> > -	cur->bc_ops->init_key_from_rec(&key, recp);
> > -
> >  	optr = ptr;
> >  
> >  	XFS_BTREE_STATS_INC(cur, insrec);
> > @@ -2992,10 +2993,10 @@ xfs_btree_insrec(
> >  	/* Check that the new entry is being inserted in the right place. */
> >  	if (ptr <= numrecs) {
> >  		if (level == 0) {
> > -			ASSERT(cur->bc_ops->recs_inorder(cur, recp,
> > +			ASSERT(cur->bc_ops->recs_inorder(cur, &rec,
> >  				xfs_btree_rec_addr(cur, ptr, block)));
> >  		} else {
> > -			ASSERT(cur->bc_ops->keys_inorder(cur, &key,
> > +			ASSERT(cur->bc_ops->keys_inorder(cur, key,
> >  				xfs_btree_key_addr(cur, ptr, block)));
> >  		}
> >  	}
> > @@ -3008,7 +3009,7 @@ xfs_btree_insrec(
> >  	xfs_btree_set_ptr_null(cur, &nptr);
> >  	if (numrecs == cur->bc_ops->get_maxrecs(cur, level)) {
> >  		error = xfs_btree_make_block_unfull(cur, level, numrecs,
> > -					&optr, &ptr, &nptr, &ncur, &nrec, stat);
> > +					&optr, &ptr, &nptr, &ncur, &nkey, stat);
> >  		if (error || *stat == 0)
> >  			goto error0;
> >  	}
> > @@ -3058,7 +3059,7 @@ xfs_btree_insrec(
> >  #endif
> >  
> >  		/* Now put the new data in, bump numrecs and log it. */
> > -		xfs_btree_copy_keys(cur, kp, &key, 1);
> > +		xfs_btree_copy_keys(cur, kp, key, 1);
> >  		xfs_btree_copy_ptrs(cur, pp, ptrp, 1);
> >  		numrecs++;
> >  		xfs_btree_set_numrecs(block, numrecs);
> > @@ -3079,7 +3080,7 @@ xfs_btree_insrec(
> >  		xfs_btree_shift_recs(cur, rp, 1, numrecs - ptr + 1);
> >  
> >  		/* Now put the new data in, bump numrecs and log it. */
> > -		xfs_btree_copy_recs(cur, rp, recp, 1);
> > +		xfs_btree_copy_recs(cur, rp, &rec, 1);
> >  		xfs_btree_set_numrecs(block, ++numrecs);
> >  		xfs_btree_log_recs(cur, bp, ptr, numrecs);
> >  #ifdef DEBUG
> > @@ -3095,7 +3096,7 @@ xfs_btree_insrec(
> >  
> >  	/* If we inserted at the start of a block, update the parents' keys. */
> >  	if (optr == 1) {
> > -		error = xfs_btree_updkey(cur, &key, level + 1);
> > +		error = xfs_btree_updkey(cur, key, level + 1);
> >  		if (error)
> >  			goto error0;
> >  	}
> > @@ -3105,7 +3106,7 @@ xfs_btree_insrec(
> >  	 * we are at the far right edge of the tree, update it.
> >  	 */
> >  	if (xfs_btree_is_lastrec(cur, block, level)) {
> > -		cur->bc_ops->update_lastrec(cur, block, recp,
> > +		cur->bc_ops->update_lastrec(cur, block, &rec,
> >  					    ptr, LASTREC_INSREC);
> >  	}
> >  
> > @@ -3115,7 +3116,7 @@ xfs_btree_insrec(
> >  	 */
> >  	*ptrp = nptr;
> >  	if (!xfs_btree_ptr_is_null(cur, &nptr)) {
> > -		*recp = nrec;
> > +		*key = nkey;
> >  		*curp = ncur;
> >  	}
> >  
> > @@ -3146,14 +3147,13 @@ xfs_btree_insert(
> >  	union xfs_btree_ptr	nptr;	/* new block number (split result) */
> >  	struct xfs_btree_cur	*ncur;	/* new cursor (split result) */
> >  	struct xfs_btree_cur	*pcur;	/* previous level's cursor */
> > -	union xfs_btree_rec	rec;	/* record to insert */
> > +	union xfs_btree_key	key;	/* key of block to insert */
> >  
> >  	level = 0;
> >  	ncur = NULL;
> >  	pcur = cur;
> >  
> >  	xfs_btree_set_ptr_null(cur, &nptr);
> > -	cur->bc_ops->init_rec_from_cur(cur, &rec);
> >  
> >  	/*
> >  	 * Loop going up the tree, starting at the leaf level.
> > @@ -3165,7 +3165,7 @@ xfs_btree_insert(
> >  		 * Insert nrec/nptr into this level of the tree.
> >  		 * Note if we fail, nptr will be null.
> >  		 */
> > -		error = xfs_btree_insrec(pcur, level, &nptr, &rec, &ncur, &i);
> > +		error = xfs_btree_insrec(pcur, level, &nptr, &key, &ncur, &i);
> >  		if (error) {
> >  			if (pcur != cur)
> >  				xfs_btree_del_cursor(pcur, XFS_BTREE_ERROR);
> > diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> > index b955e5d..b99c018 100644
> > --- a/fs/xfs/libxfs/xfs_btree.h
> > +++ b/fs/xfs/libxfs/xfs_btree.h
> > @@ -158,8 +158,6 @@ struct xfs_btree_ops {
> >  	/* init values of btree structures */
> >  	void	(*init_key_from_rec)(union xfs_btree_key *key,
> >  				     union xfs_btree_rec *rec);
> > -	void	(*init_rec_from_key)(union xfs_btree_key *key,
> > -				     union xfs_btree_rec *rec);
> >  	void	(*init_rec_from_cur)(struct xfs_btree_cur *cur,
> >  				     union xfs_btree_rec *rec);
> >  	void	(*init_ptr_from_cur)(struct xfs_btree_cur *cur,
> > diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
> > index 89c21d7..88da2ad 100644
> > --- a/fs/xfs/libxfs/xfs_ialloc_btree.c
> > +++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
> > @@ -146,14 +146,6 @@ xfs_inobt_init_key_from_rec(
> >  }
> >  
> >  STATIC void
> > -xfs_inobt_init_rec_from_key(
> > -	union xfs_btree_key	*key,
> > -	union xfs_btree_rec	*rec)
> > -{
> > -	rec->inobt.ir_startino = key->inobt.ir_startino;
> > -}
> > -
> > -STATIC void
> >  xfs_inobt_init_rec_from_cur(
> >  	struct xfs_btree_cur	*cur,
> >  	union xfs_btree_rec	*rec)
> > @@ -314,7 +306,6 @@ static const struct xfs_btree_ops xfs_inobt_ops = {
> >  	.get_minrecs		= xfs_inobt_get_minrecs,
> >  	.get_maxrecs		= xfs_inobt_get_maxrecs,
> >  	.init_key_from_rec	= xfs_inobt_init_key_from_rec,
> > -	.init_rec_from_key	= xfs_inobt_init_rec_from_key,
> >  	.init_rec_from_cur	= xfs_inobt_init_rec_from_cur,
> >  	.init_ptr_from_cur	= xfs_inobt_init_ptr_from_cur,
> >  	.key_diff		= xfs_inobt_key_diff,
> > @@ -336,7 +327,6 @@ static const struct xfs_btree_ops xfs_finobt_ops = {
> >  	.get_minrecs		= xfs_inobt_get_minrecs,
> >  	.get_maxrecs		= xfs_inobt_get_maxrecs,
> >  	.init_key_from_rec	= xfs_inobt_init_key_from_rec,
> > -	.init_rec_from_key	= xfs_inobt_init_rec_from_key,
> >  	.init_rec_from_cur	= xfs_inobt_init_rec_from_cur,
> >  	.init_ptr_from_cur	= xfs_finobt_init_ptr_from_cur,
> >  	.key_diff		= xfs_inobt_key_diff,
> > 

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 013/119] xfs: support btrees with overlapping intervals for keys
  2016-06-22 15:17   ` Brian Foster
@ 2016-06-28  3:26     ` Darrick J. Wong
  2016-06-28 12:32       ` Brian Foster
  0 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-28  3:26 UTC (permalink / raw)
  To: Brian Foster; +Cc: david, linux-fsdevel, vishal.l.verma, xfs

On Wed, Jun 22, 2016 at 11:17:06AM -0400, Brian Foster wrote:
> On Thu, Jun 16, 2016 at 06:19:15PM -0700, Darrick J. Wong wrote:
> > On a filesystem with both reflink and reverse mapping enabled, it's
> > possible to have multiple rmap records referring to the same blocks on
> > disk.  When overlapping intervals are possible, querying a classic
> > btree to find all records intersecting a given interval is inefficient
> > because we cannot use the left side of the search interval to filter
> > out non-matching records the same way that we can use the existing
> > btree key to filter out records coming after the right side of the
> > search interval.  This will become important once we want to use the
> > rmap btree to rebuild BMBTs, or implement the (future) fsmap ioctl.
> > 
> > (For the non-overlapping case, we can perform such queries trivially
> > by starting at the left side of the interval and walking the tree
> > until we pass the right side.)
> > 
> > Therefore, extend the btree code to come closer to supporting
> > intervals as a first-class record attribute.  This involves widening
> > the btree node's key space to store both the lowest key reachable via
> > the node pointer (as the btree does now) and the highest key reachable
> > via the same pointer and teaching the btree modifying functions to
> > keep the highest-key records up to date.
> > 
> > This behavior can be turned on via a new btree ops flag so that btrees
> > that cannot store overlapping intervals don't pay the overhead costs
> > in terms of extra code and disk format changes.
> > 
> > v2: When we're deleting a record in a btree that supports overlapped
> > interval records and the deletion results in two btree blocks being
> > joined, we defer updating the high/low keys until after all possible
> > joining (at higher levels in the tree) have finished.  At this point,
> > the btree pointers at all levels have been updated to remove the empty
> > blocks and we can update the low and high keys.
> > 
> > When we're doing this, we must be careful to update the keys of all
> > node pointers up to the root instead of stopping at the first set of
> > keys that don't need updating.  This is because it's possible for a
> > single deletion to cause joining of multiple levels of tree, and so
> > we need to update everything going back to the root.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> 
> I think I get the gist of this and it mostly looks Ok to me. A few
> questions and minor comments...

Ok.

> >  fs/xfs/libxfs/xfs_btree.c |  379 +++++++++++++++++++++++++++++++++++++++++----
> >  fs/xfs/libxfs/xfs_btree.h |   16 ++
> >  fs/xfs/xfs_trace.h        |   36 ++++
> >  3 files changed, 395 insertions(+), 36 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> > index a096539..afcafd6 100644
> > --- a/fs/xfs/libxfs/xfs_btree.c
> > +++ b/fs/xfs/libxfs/xfs_btree.c
> > @@ -52,6 +52,11 @@ static const __uint32_t xfs_magics[2][XFS_BTNUM_MAX] = {
> >  	xfs_magics[!!((cur)->bc_flags & XFS_BTREE_CRC_BLOCKS)][cur->bc_btnum]
> >  
> >  
> > +struct xfs_btree_double_key {
> > +	union xfs_btree_key	low;
> > +	union xfs_btree_key	high;
> > +};
> > +
> >  STATIC int				/* error (0 or EFSCORRUPTED) */
> >  xfs_btree_check_lblock(
> >  	struct xfs_btree_cur	*cur,	/* btree cursor */
> > @@ -428,6 +433,30 @@ xfs_btree_dup_cursor(
> >   * into a btree block (xfs_btree_*_offset) or return a pointer to the given
> >   * record, key or pointer (xfs_btree_*_addr).  Note that all addressing
> >   * inside the btree block is done using indices starting at one, not zero!
> > + *
> > + * If XFS_BTREE_OVERLAPPING is set, then this btree supports keys containing
> > + * overlapping intervals.  In such a tree, records are still sorted lowest to
> > + * highest and indexed by the smallest key value that refers to the record.
> > + * However, nodes are different: each pointer has two associated keys -- one
> > + * indexing the lowest key available in the block(s) below (the same behavior
> > + * as the key in a regular btree) and another indexing the highest key
> > + * available in the block(s) below.  Because records are /not/ sorted by the
> > + * highest key, all leaf block updates require us to compute the highest key
> > + * that matches any record in the leaf and to recursively update the high keys
> > + * in the nodes going further up in the tree, if necessary.  Nodes look like
> > + * this:
> > + *
> > + *		+--------+-----+-----+-----+-----+-----+-------+-------+-----+
> > + * Non-Leaf:	| header | lo1 | hi1 | lo2 | hi2 | ... | ptr 1 | ptr 2 | ... |
> > + *		+--------+-----+-----+-----+-----+-----+-------+-------+-----+
> > + *
> > + * To perform an interval query on an overlapped tree, perform the usual
> > + * depth-first search and use the low and high keys to decide if we can skip
> > + * that particular node.  If a leaf node is reached, return the records that
> > + * intersect the interval.  Note that an interval query may return numerous
> > + * entries.  For a non-overlapped tree, simply search for the record associated
> > + * with the lowest key and iterate forward until a non-matching record is
> > + * found.
> >   */
> >  
> >  /*
> > @@ -445,6 +474,17 @@ static inline size_t xfs_btree_block_len(struct xfs_btree_cur *cur)
> >  	return XFS_BTREE_SBLOCK_LEN;
> >  }
> >  
> > +/* Return size of btree block keys for this btree instance. */
> > +static inline size_t xfs_btree_key_len(struct xfs_btree_cur *cur)
> > +{
> > +	size_t			len;
> > +
> > +	len = cur->bc_ops->key_len;
> > +	if (cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING)
> > +		len *= 2;
> > +	return len;
> > +}
> > +
> >  /*
> >   * Return size of btree block pointers for this btree instance.
> >   */
> > @@ -475,7 +515,19 @@ xfs_btree_key_offset(
> >  	int			n)
> >  {
> >  	return xfs_btree_block_len(cur) +
> > -		(n - 1) * cur->bc_ops->key_len;
> > +		(n - 1) * xfs_btree_key_len(cur);
> > +}
> > +
> > +/*
> > + * Calculate offset of the n-th high key in a btree block.
> > + */
> > +STATIC size_t
> > +xfs_btree_high_key_offset(
> > +	struct xfs_btree_cur	*cur,
> > +	int			n)
> > +{
> > +	return xfs_btree_block_len(cur) +
> > +		(n - 1) * xfs_btree_key_len(cur) + cur->bc_ops->key_len;
> >  }
> >  
> >  /*
> > @@ -488,7 +540,7 @@ xfs_btree_ptr_offset(
> >  	int			level)
> >  {
> >  	return xfs_btree_block_len(cur) +
> > -		cur->bc_ops->get_maxrecs(cur, level) * cur->bc_ops->key_len +
> > +		cur->bc_ops->get_maxrecs(cur, level) * xfs_btree_key_len(cur) +
> >  		(n - 1) * xfs_btree_ptr_len(cur);
> >  }
> >  
> > @@ -519,6 +571,19 @@ xfs_btree_key_addr(
> >  }
> >  
> >  /*
> > + * Return a pointer to the n-th high key in the btree block.
> > + */
> > +STATIC union xfs_btree_key *
> > +xfs_btree_high_key_addr(
> > +	struct xfs_btree_cur	*cur,
> > +	int			n,
> > +	struct xfs_btree_block	*block)
> > +{
> > +	return (union xfs_btree_key *)
> > +		((char *)block + xfs_btree_high_key_offset(cur, n));
> > +}
> > +
> > +/*
> >   * Return a pointer to the n-th block pointer in the btree block.
> >   */
> >  STATIC union xfs_btree_ptr *
> > @@ -1217,7 +1282,7 @@ xfs_btree_copy_keys(
> >  	int			numkeys)
> >  {
> >  	ASSERT(numkeys >= 0);
> > -	memcpy(dst_key, src_key, numkeys * cur->bc_ops->key_len);
> > +	memcpy(dst_key, src_key, numkeys * xfs_btree_key_len(cur));
> >  }
> >  
> >  /*
> > @@ -1263,8 +1328,8 @@ xfs_btree_shift_keys(
> >  	ASSERT(numkeys >= 0);
> >  	ASSERT(dir == 1 || dir == -1);
> >  
> > -	dst_key = (char *)key + (dir * cur->bc_ops->key_len);
> > -	memmove(dst_key, key, numkeys * cur->bc_ops->key_len);
> > +	dst_key = (char *)key + (dir * xfs_btree_key_len(cur));
> > +	memmove(dst_key, key, numkeys * xfs_btree_key_len(cur));
> >  }
> >  
> >  /*
> > @@ -1879,6 +1944,180 @@ error0:
> >  	return error;
> >  }
> >  
> > +/* Determine the low and high keys of a leaf block */
> > +STATIC void
> > +xfs_btree_find_leaf_keys(
> > +	struct xfs_btree_cur	*cur,
> > +	struct xfs_btree_block	*block,
> > +	union xfs_btree_key	*low,
> > +	union xfs_btree_key	*high)
> > +{
> > +	int			n;
> > +	union xfs_btree_rec	*rec;
> > +	union xfs_btree_key	max_hkey;
> > +	union xfs_btree_key	hkey;
> > +
> > +	rec = xfs_btree_rec_addr(cur, 1, block);
> > +	cur->bc_ops->init_key_from_rec(low, rec);
> > +
> > +	if (!(cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING))
> > +		return;
> > +
> > +	cur->bc_ops->init_high_key_from_rec(&max_hkey, rec);
> > +	for (n = 2; n <= xfs_btree_get_numrecs(block); n++) {
> > +		rec = xfs_btree_rec_addr(cur, n, block);
> > +		cur->bc_ops->init_high_key_from_rec(&hkey, rec);
> > +		if (cur->bc_ops->diff_two_keys(cur, &max_hkey, &hkey) > 0)
> > +			max_hkey = hkey;
> > +	}
> > +
> > +	*high = max_hkey;
> > +}
> > +
> > +/* Determine the low and high keys of a node block */
> > +STATIC void
> > +xfs_btree_find_node_keys(
> > +	struct xfs_btree_cur	*cur,
> > +	struct xfs_btree_block	*block,
> > +	union xfs_btree_key	*low,
> > +	union xfs_btree_key	*high)
> > +{
> > +	int			n;
> > +	union xfs_btree_key	*hkey;
> > +	union xfs_btree_key	*max_hkey;
> > +
> > +	*low = *xfs_btree_key_addr(cur, 1, block);
> > +
> > +	if (!(cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING))
> > +		return;
> > +
> > +	max_hkey = xfs_btree_high_key_addr(cur, 1, block);
> > +	for (n = 2; n <= xfs_btree_get_numrecs(block); n++) {
> > +		hkey = xfs_btree_high_key_addr(cur, n, block);
> > +		if (cur->bc_ops->diff_two_keys(cur, max_hkey, hkey) > 0)
> > +			max_hkey = hkey;
> > +	}
> > +
> > +	*high = *max_hkey;
> > +}
> > +
> > +/*
> > + * Update parental low & high keys from some block all the way back to the
> > + * root of the btree.
> > + */
> > +STATIC int
> > +__xfs_btree_updkeys(
> > +	struct xfs_btree_cur	*cur,
> > +	int			level,
> > +	struct xfs_btree_block	*block,
> > +	struct xfs_buf		*bp0,
> > +	bool			force_all)
> > +{
> > +	union xfs_btree_key	lkey;	/* keys from current level */
> > +	union xfs_btree_key	hkey;
> > +	union xfs_btree_key	*nlkey;	/* keys from the next level up */
> > +	union xfs_btree_key	*nhkey;
> > +	struct xfs_buf		*bp;
> > +	int			ptr = -1;
> 
> ptr doesn't appear to require initialization.

Ok.

> 
> > +
> > +	if (!(cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING))
> > +		return 0;
> > +
> > +	if (level + 1 >= cur->bc_nlevels)
> > +		return 0;
> 
> This could use a comment to indicate we're checking for a parent level
> to update.

Ok.

> 
> > +
> > +	trace_xfs_btree_updkeys(cur, level, bp0);
> > +
> > +	if (level == 0)
> > +		xfs_btree_find_leaf_keys(cur, block, &lkey, &hkey);
> > +	else
> > +		xfs_btree_find_node_keys(cur, block, &lkey, &hkey);
> > +	for (level++; level < cur->bc_nlevels; level++) {
> > +		block = xfs_btree_get_block(cur, level, &bp);
> > +		trace_xfs_btree_updkeys(cur, level, bp);
> > +		ptr = cur->bc_ptrs[level];
> > +		nlkey = xfs_btree_key_addr(cur, ptr, block);
> > +		nhkey = xfs_btree_high_key_addr(cur, ptr, block);
> > +		if (!(cur->bc_ops->diff_two_keys(cur, nlkey, &lkey) != 0 ||
> > +		      cur->bc_ops->diff_two_keys(cur, nhkey, &hkey) != 0) &&
> > +		    !force_all)
> > +			break;
> > +		memcpy(nlkey, &lkey, cur->bc_ops->key_len);
> > +		memcpy(nhkey, &hkey, cur->bc_ops->key_len);
> > +		xfs_btree_log_keys(cur, bp, ptr, ptr);
> > +		if (level + 1 >= cur->bc_nlevels)
> > +			break;
> > +		xfs_btree_find_node_keys(cur, block, &lkey, &hkey);
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/*
> > + * Update all the keys from a sibling block at some level in the cursor back
> > + * to the root, stopping when we find a key pair that doesn't need updating.
> > + */
> > +STATIC int
> > +xfs_btree_sibling_updkeys(
> > +	struct xfs_btree_cur	*cur,
> > +	int			level,
> > +	int			ptr,
> > +	struct xfs_btree_block	*block,
> > +	struct xfs_buf		*bp0)
> > +{
> > +	struct xfs_btree_cur	*ncur;
> > +	int			stat;
> > +	int			error;
> > +
> > +	error = xfs_btree_dup_cursor(cur, &ncur);
> > +	if (error)
> > +		return error;
> > +
> > +	if (level + 1 >= ncur->bc_nlevels)
> > +		error = -EDOM;
> > +	else if (ptr == XFS_BB_RIGHTSIB)
> > +		error = xfs_btree_increment(ncur, level + 1, &stat);
> > +	else if (ptr == XFS_BB_LEFTSIB)
> > +		error = xfs_btree_decrement(ncur, level + 1, &stat);
> > +	else
> > +		error = -EBADE;
> 
> So we inc/dec the cursor at the next level up the tree, then update the
> keys up that path with the __xfs_btree_updkeys() call below. The inc/dec
> calls explicitly say that they don't alter the cursor below the level,
> so it looks like we'd end up with a weird cursor path here.
> 
> Digging around further, it looks like we pass the sibling bp/block
> pointers from the caller and thus __xfs_btree_updkeys() should do the
> correct thing, but this is not very clear. If I'm on the right track,
> I'd suggest to add a big fat comment here. :)

Yep.

/*
 * The caller passed us the sibling block in bp0/block, but the
 * (duplicate) cursor points to original block and not the sibling.
 * Therefore we must adjust the cursor at the next level higher
 * to point to the sibling block we were handed.  Only then can
 * we go up the tree updating keys.
 */

> > +	if (error || !stat)
> > +		return error;
> 
> Looks like a potential cursor leak on error.

Oops!

> > +
> > +	error = __xfs_btree_updkeys(ncur, level, block, bp0, false);
> > +	xfs_btree_del_cursor(ncur, XFS_BTREE_NOERROR);
> > +	return error;
> > +}
> > +
> > +/*
> > + * Update all the keys from some level in cursor back to the root, stopping
> > + * when we find a key pair that doesn't need updating.
> > + */
> > +STATIC int
> > +xfs_btree_updkeys(
> > +	struct xfs_btree_cur	*cur,
> > +	int			level)
> > +{
> > +	struct xfs_buf		*bp;
> > +	struct xfs_btree_block	*block;
> > +
> > +	block = xfs_btree_get_block(cur, level, &bp);
> > +	return __xfs_btree_updkeys(cur, level, block, bp, false);
> > +}
> > +
> > +/* Update all the keys from some level in cursor back to the root. */
> > +STATIC int
> > +xfs_btree_updkeys_force(
> > +	struct xfs_btree_cur	*cur,
> > +	int			level)
> > +{
> > +	struct xfs_buf		*bp;
> > +	struct xfs_btree_block	*block;
> > +
> > +	block = xfs_btree_get_block(cur, level, &bp);
> > +	return __xfs_btree_updkeys(cur, level, block, bp, true);
> > +}
> > +
> >  /*
> >   * Update keys at all levels from here to the root along the cursor's path.
> >   */
> > @@ -1893,6 +2132,9 @@ xfs_btree_updkey(
> >  	union xfs_btree_key	*kp;
> >  	int			ptr;
> >  
> > +	if (cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING)
> > +		return 0;
> > +
> >  	XFS_BTREE_TRACE_CURSOR(cur, XBT_ENTRY);
> >  	XFS_BTREE_TRACE_ARGIK(cur, level, keyp);
> >  
> > @@ -1970,7 +2212,8 @@ xfs_btree_update(
> >  					    ptr, LASTREC_UPDATE);
> >  	}
> >  
> > -	/* Updating first rec in leaf. Pass new key value up to our parent. */
> > +	/* Pass new key value up to our parent. */
> > +	xfs_btree_updkeys(cur, 0);
> >  	if (ptr == 1) {
> >  		union xfs_btree_key	key;
> >  
> > @@ -2149,7 +2392,9 @@ xfs_btree_lshift(
> >  		rkp = &key;
> >  	}
> >  
> > -	/* Update the parent key values of right. */
> > +	/* Update the parent key values of left and right. */
> > +	xfs_btree_sibling_updkeys(cur, level, XFS_BB_LEFTSIB, left, lbp);
> > +	xfs_btree_updkeys(cur, level);
> >  	error = xfs_btree_updkey(cur, rkp, level + 1);
> >  	if (error)
> >  		goto error0;
> > @@ -2321,6 +2566,9 @@ xfs_btree_rshift(
> >  	if (error)
> >  		goto error1;
> >  
> > +	/* Update left and right parent pointers */
> > +	xfs_btree_updkeys(cur, level);
> > +	xfs_btree_updkeys(tcur, level);
> 
> In this case, we grab the last record of the block, increment from there
> and update using the cursor. This is much more straightforward, imo.
> Could we use this approach in the left shift case as well?

Yes, I think so.  I might have started refactoring btree_sibling_updkeys
out of existence and got distracted, since there isn't anything that uses
the RIGHTSIB ptr value.

> >  	error = xfs_btree_updkey(tcur, rkp, level + 1);
> >  	if (error)
> >  		goto error1;
> > @@ -2356,7 +2604,7 @@ __xfs_btree_split(
> >  	struct xfs_btree_cur	*cur,
> >  	int			level,
> >  	union xfs_btree_ptr	*ptrp,
> > -	union xfs_btree_key	*key,
> > +	struct xfs_btree_double_key	*key,
> >  	struct xfs_btree_cur	**curp,
> >  	int			*stat)		/* success/failure */
> >  {
> > @@ -2452,9 +2700,6 @@ __xfs_btree_split(
> >  
> >  		xfs_btree_log_keys(cur, rbp, 1, rrecs);
> >  		xfs_btree_log_ptrs(cur, rbp, 1, rrecs);
> > -
> > -		/* Grab the keys to the entries moved to the right block */
> > -		xfs_btree_copy_keys(cur, key, rkp, 1);
> >  	} else {
> >  		/* It's a leaf.  Move records.  */
> >  		union xfs_btree_rec	*lrp;	/* left record pointer */
> > @@ -2465,12 +2710,8 @@ __xfs_btree_split(
> >  
> >  		xfs_btree_copy_recs(cur, rrp, lrp, rrecs);
> >  		xfs_btree_log_recs(cur, rbp, 1, rrecs);
> > -
> > -		cur->bc_ops->init_key_from_rec(key,
> > -			xfs_btree_rec_addr(cur, 1, right));
> >  	}
> >  
> > -
> >  	/*
> >  	 * Find the left block number by looking in the buffer.
> >  	 * Adjust numrecs, sibling pointers.
> > @@ -2484,6 +2725,12 @@ __xfs_btree_split(
> >  	xfs_btree_set_numrecs(left, lrecs);
> >  	xfs_btree_set_numrecs(right, xfs_btree_get_numrecs(right) + rrecs);
> >  
> > +	/* Find the low & high keys for the new block. */
> > +	if (level > 0)
> > +		xfs_btree_find_node_keys(cur, right, &key->low, &key->high);
> > +	else
> > +		xfs_btree_find_leaf_keys(cur, right, &key->low, &key->high);
> > +
> 
> Why not push these into the above if/else where the previous key
> copy/init calls were removed from?

We don't set bb_numrecs on the right block until the line above the new
hunk, and the btree_find_*_keys functions require numrecs to be set.

The removed key copy/init calls only looked at keys[1].

That said, it's trivial to move the set_numrecs calls above the if statement.

> >  	xfs_btree_log_block(cur, rbp, XFS_BB_ALL_BITS);
> >  	xfs_btree_log_block(cur, lbp, XFS_BB_NUMRECS | XFS_BB_RIGHTSIB);
> >  
> > @@ -2499,6 +2746,10 @@ __xfs_btree_split(
> >  		xfs_btree_set_sibling(cur, rrblock, &rptr, XFS_BB_LEFTSIB);
> >  		xfs_btree_log_block(cur, rrbp, XFS_BB_LEFTSIB);
> >  	}
> > +
> > +	/* Update the left block's keys... */
> > +	xfs_btree_updkeys(cur, level);
> > +
> >  	/*
> >  	 * If the cursor is really in the right block, move it there.
> >  	 * If it's just pointing past the last entry in left, then we'll
> > @@ -2537,7 +2788,7 @@ struct xfs_btree_split_args {
> >  	struct xfs_btree_cur	*cur;
> >  	int			level;
> >  	union xfs_btree_ptr	*ptrp;
> > -	union xfs_btree_key	*key;
> > +	struct xfs_btree_double_key	*key;
> >  	struct xfs_btree_cur	**curp;
> >  	int			*stat;		/* success/failure */
> >  	int			result;
> > @@ -2586,7 +2837,7 @@ xfs_btree_split(
> >  	struct xfs_btree_cur	*cur,
> >  	int			level,
> >  	union xfs_btree_ptr	*ptrp,
> > -	union xfs_btree_key	*key,
> > +	struct xfs_btree_double_key	*key,
> >  	struct xfs_btree_cur	**curp,
> >  	int			*stat)		/* success/failure */
> >  {
> > @@ -2806,27 +3057,27 @@ xfs_btree_new_root(
> >  		bp = lbp;
> >  		nptr = 2;
> >  	}
> > +
> >  	/* Fill in the new block's btree header and log it. */
> >  	xfs_btree_init_block_cur(cur, nbp, cur->bc_nlevels, 2);
> >  	xfs_btree_log_block(cur, nbp, XFS_BB_ALL_BITS);
> >  	ASSERT(!xfs_btree_ptr_is_null(cur, &lptr) &&
> >  			!xfs_btree_ptr_is_null(cur, &rptr));
> > -
> 
> ?

Don't know why I did that.  I like having one blank line before a chunk
of code, but there's no reason to remove that one.

> >  	/* Fill in the key data in the new root. */
> >  	if (xfs_btree_get_level(left) > 0) {
> > -		xfs_btree_copy_keys(cur,
> > +		xfs_btree_find_node_keys(cur, left,
> >  				xfs_btree_key_addr(cur, 1, new),
> > -				xfs_btree_key_addr(cur, 1, left), 1);
> > -		xfs_btree_copy_keys(cur,
> > +				xfs_btree_high_key_addr(cur, 1, new));
> > +		xfs_btree_find_node_keys(cur, right,
> >  				xfs_btree_key_addr(cur, 2, new),
> > -				xfs_btree_key_addr(cur, 1, right), 1);
> > +				xfs_btree_high_key_addr(cur, 2, new));
> >  	} else {
> > -		cur->bc_ops->init_key_from_rec(
> > -				xfs_btree_key_addr(cur, 1, new),
> > -				xfs_btree_rec_addr(cur, 1, left));
> > -		cur->bc_ops->init_key_from_rec(
> > -				xfs_btree_key_addr(cur, 2, new),
> > -				xfs_btree_rec_addr(cur, 1, right));
> > +		xfs_btree_find_leaf_keys(cur, left,
> > +			xfs_btree_key_addr(cur, 1, new),
> > +			xfs_btree_high_key_addr(cur, 1, new));
> > +		xfs_btree_find_leaf_keys(cur, right,
> > +			xfs_btree_key_addr(cur, 2, new),
> > +			xfs_btree_high_key_addr(cur, 2, new));
> >  	}
> >  	xfs_btree_log_keys(cur, nbp, 1, 2);
> >  
> > @@ -2837,6 +3088,7 @@ xfs_btree_new_root(
> >  		xfs_btree_ptr_addr(cur, 2, new), &rptr, 1);
> >  	xfs_btree_log_ptrs(cur, nbp, 1, 2);
> >  
> > +
> 
> Extra line.

Removed.

> >  	/* Fix up the cursor. */
> >  	xfs_btree_setbuf(cur, cur->bc_nlevels, nbp);
> >  	cur->bc_ptrs[cur->bc_nlevels] = nptr;
> > @@ -2862,7 +3114,7 @@ xfs_btree_make_block_unfull(
> >  	int			*index,	/* new tree index */
> >  	union xfs_btree_ptr	*nptr,	/* new btree ptr */
> >  	struct xfs_btree_cur	**ncur,	/* new btree cursor */
> > -	union xfs_btree_key	*key, /* key of new block */
> > +	struct xfs_btree_double_key	*key,	/* key of new block */
> >  	int			*stat)
> >  {
> >  	int			error = 0;
> > @@ -2918,6 +3170,22 @@ xfs_btree_make_block_unfull(
> >  	return 0;
> >  }
> >  
> > +/* Copy a double key into a btree block. */
> > +static void
> > +xfs_btree_copy_double_keys(
> > +	struct xfs_btree_cur	*cur,
> > +	int			ptr,
> > +	struct xfs_btree_block	*block,
> > +	struct xfs_btree_double_key	*key)
> > +{
> > +	memcpy(xfs_btree_key_addr(cur, ptr, block), &key->low,
> > +			cur->bc_ops->key_len);
> > +
> > +	if (cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING)
> > +		memcpy(xfs_btree_high_key_addr(cur, ptr, block), &key->high,
> > +				cur->bc_ops->key_len);
> > +}
> > +
> >  /*
> >   * Insert one record/level.  Return information to the caller
> >   * allowing the next level up to proceed if necessary.
> > @@ -2927,7 +3195,7 @@ xfs_btree_insrec(
> >  	struct xfs_btree_cur	*cur,	/* btree cursor */
> >  	int			level,	/* level to insert record at */
> >  	union xfs_btree_ptr	*ptrp,	/* i/o: block number inserted */
> > -	union xfs_btree_key	*key,	/* i/o: block key for ptrp */
> > +	struct xfs_btree_double_key	*key, /* i/o: block key for ptrp */
> >  	struct xfs_btree_cur	**curp,	/* output: new cursor replacing cur */
> >  	int			*stat)	/* success/failure */
> >  {
> > @@ -2935,7 +3203,7 @@ xfs_btree_insrec(
> >  	struct xfs_buf		*bp;	/* buffer for block */
> >  	union xfs_btree_ptr	nptr;	/* new block ptr */
> >  	struct xfs_btree_cur	*ncur;	/* new btree cursor */
> > -	union xfs_btree_key	nkey;	/* new block key */
> > +	struct xfs_btree_double_key	nkey;	/* new block key */
> >  	union xfs_btree_rec	rec;	/* record to insert */
> >  	int			optr;	/* old key/record index */
> >  	int			ptr;	/* key/record index */
> > @@ -2944,11 +3212,12 @@ xfs_btree_insrec(
> >  #ifdef DEBUG
> >  	int			i;
> >  #endif
> > +	xfs_daddr_t		old_bn;
> >  
> >  	/* Make a key out of the record data to be inserted, and save it. */
> >  	if (level == 0) {
> >  		cur->bc_ops->init_rec_from_cur(cur, &rec);
> > -		cur->bc_ops->init_key_from_rec(key, &rec);
> > +		cur->bc_ops->init_key_from_rec(&key->low, &rec);
> >  	}
> >  
> >  	XFS_BTREE_TRACE_CURSOR(cur, XBT_ENTRY);
> > @@ -2983,6 +3252,7 @@ xfs_btree_insrec(
> >  
> >  	/* Get pointers to the btree buffer and block. */
> >  	block = xfs_btree_get_block(cur, level, &bp);
> > +	old_bn = bp ? bp->b_bn : XFS_BUF_DADDR_NULL;
> >  	numrecs = xfs_btree_get_numrecs(block);
> >  
> >  #ifdef DEBUG
> > @@ -2996,7 +3266,7 @@ xfs_btree_insrec(
> >  			ASSERT(cur->bc_ops->recs_inorder(cur, &rec,
> >  				xfs_btree_rec_addr(cur, ptr, block)));
> >  		} else {
> > -			ASSERT(cur->bc_ops->keys_inorder(cur, key,
> > +			ASSERT(cur->bc_ops->keys_inorder(cur, &key->low,
> >  				xfs_btree_key_addr(cur, ptr, block)));
> >  		}
> >  	}
> > @@ -3059,7 +3329,7 @@ xfs_btree_insrec(
> >  #endif
> >  
> >  		/* Now put the new data in, bump numrecs and log it. */
> > -		xfs_btree_copy_keys(cur, kp, key, 1);
> > +		xfs_btree_copy_double_keys(cur, ptr, block, key);
> >  		xfs_btree_copy_ptrs(cur, pp, ptrp, 1);
> >  		numrecs++;
> >  		xfs_btree_set_numrecs(block, numrecs);
> > @@ -3095,8 +3365,24 @@ xfs_btree_insrec(
> >  	xfs_btree_log_block(cur, bp, XFS_BB_NUMRECS);
> >  
> >  	/* If we inserted at the start of a block, update the parents' keys. */
> 
> This comment is associated with the codeblock that has been pushed
> further down, no?

Correct.  I think that got mismerged somewhere along the way.

> > +	if (ncur && bp->b_bn != old_bn) {
> > +		/*
> > +		 * We just inserted into a new tree block, which means that
> > +		 * the key for the block is in nkey, not the tree.
> > +		 */
> > +		if (level == 0)
> > +			xfs_btree_find_leaf_keys(cur, block, &nkey.low,
> > +					&nkey.high);
> > +		else
> > +			xfs_btree_find_node_keys(cur, block, &nkey.low,
> > +					&nkey.high);
> > +	} else {
> > +		/* Updating the left block, do it the standard way. */
> > +		xfs_btree_updkeys(cur, level);
> > +	}
> > +
> 
> Not quite sure I follow the purpose of this hunk. Is this for the case
> where a btree split occurs, nkey is filled in for the new/right block
> and then (after nkey is filled in) the new record ends up being added to
> the new block? If so, what about the case where ncur is not created?
> (It looks like that's possible from the code, but I could easily be
> missing some context as to why that's not the case.)

Yes, the first part of the if-else hunk is to fill out nkey when we've
split a btree block.  Now that I look at it again, I think that whole
weird conditional could be folded into the xfs_btree_ptr_is_null()
check later on.

Commentage for now:

/*
 * If we just inserted a new tree block, we have to find the low
 * and high keys for the new block and arrange to pass them back
 * separately.  If we're just updating a block we can use the
 * regular tree update mechanism.
 */

> In any event, I think we could elaborate a bit in the comment on why
> this is necessary. I'd also move it above the top-level if/else.
> 
> >  	if (optr == 1) {
> > -		error = xfs_btree_updkey(cur, key, level + 1);
> > +		error = xfs_btree_updkey(cur, &key->low, level + 1);
> >  		if (error)
> >  			goto error0;
> >  	}
> > @@ -3147,7 +3433,7 @@ xfs_btree_insert(
> >  	union xfs_btree_ptr	nptr;	/* new block number (split result) */
> >  	struct xfs_btree_cur	*ncur;	/* new cursor (split result) */
> >  	struct xfs_btree_cur	*pcur;	/* previous level's cursor */
> > -	union xfs_btree_key	key;	/* key of block to insert */
> > +	struct xfs_btree_double_key	key;	/* key of block to insert */
> 
> Probably should fix up the function param alignment here and the couple
> other or so places we make this change.

I changed the name to xfs_btree_bigkey, which avoids the alignment problems.

--D

> 
> Brian
> 
> >  
> >  	level = 0;
> >  	ncur = NULL;
> > @@ -3552,6 +3838,7 @@ xfs_btree_delrec(
> >  	 * If we deleted the leftmost entry in the block, update the
> >  	 * key values above us in the tree.
> >  	 */
> > +	xfs_btree_updkeys(cur, level);
> >  	if (ptr == 1) {
> >  		error = xfs_btree_updkey(cur, keyp, level + 1);
> >  		if (error)
> > @@ -3882,6 +4169,16 @@ xfs_btree_delrec(
> >  	if (level > 0)
> >  		cur->bc_ptrs[level]--;
> >  
> > +	/*
> > +	 * We combined blocks, so we have to update the parent keys if the
> > +	 * btree supports overlapped intervals.  However, bc_ptrs[level + 1]
> > +	 * points to the old block so that the caller knows which record to
> > +	 * delete.  Therefore, the caller must be savvy enough to call updkeys
> > +	 * for us if we return stat == 2.  The other exit points from this
> > +	 * function don't require deletions further up the tree, so they can
> > +	 * call updkeys directly.
> > +	 */
> > +
> >  	XFS_BTREE_TRACE_CURSOR(cur, XBT_EXIT);
> >  	/* Return value means the next level up has something to do. */
> >  	*stat = 2;
> > @@ -3907,6 +4204,7 @@ xfs_btree_delete(
> >  	int			error;	/* error return value */
> >  	int			level;
> >  	int			i;
> > +	bool			joined = false;
> >  
> >  	XFS_BTREE_TRACE_CURSOR(cur, XBT_ENTRY);
> >  
> > @@ -3920,8 +4218,17 @@ xfs_btree_delete(
> >  		error = xfs_btree_delrec(cur, level, &i);
> >  		if (error)
> >  			goto error0;
> > +		if (i == 2)
> > +			joined = true;
> >  	}
> >  
> > +	/*
> > +	 * If we combined blocks as part of deleting the record, delrec won't
> > +	 * have updated the parent keys so we have to do that here.
> > +	 */
> > +	if (joined)
> > +		xfs_btree_updkeys_force(cur, 0);
> > +
> >  	if (i == 0) {
> >  		for (level = 1; level < cur->bc_nlevels; level++) {
> >  			if (cur->bc_ptrs[level] == 0) {
> > diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> > index b99c018..a5ec6c7 100644
> > --- a/fs/xfs/libxfs/xfs_btree.h
> > +++ b/fs/xfs/libxfs/xfs_btree.h
> > @@ -126,6 +126,9 @@ struct xfs_btree_ops {
> >  	size_t	key_len;
> >  	size_t	rec_len;
> >  
> > +	/* flags */
> > +	uint	flags;
> > +
> >  	/* cursor operations */
> >  	struct xfs_btree_cur *(*dup_cursor)(struct xfs_btree_cur *);
> >  	void	(*update_cursor)(struct xfs_btree_cur *src,
> > @@ -162,11 +165,21 @@ struct xfs_btree_ops {
> >  				     union xfs_btree_rec *rec);
> >  	void	(*init_ptr_from_cur)(struct xfs_btree_cur *cur,
> >  				     union xfs_btree_ptr *ptr);
> > +	void	(*init_high_key_from_rec)(union xfs_btree_key *key,
> > +					  union xfs_btree_rec *rec);
> >  
> >  	/* difference between key value and cursor value */
> >  	__int64_t (*key_diff)(struct xfs_btree_cur *cur,
> >  			      union xfs_btree_key *key);
> >  
> > +	/*
> > +	 * Difference between key2 and key1 -- positive if key2 > key1,
> > +	 * negative if key2 < key1, and zero if equal.
> > +	 */
> > +	__int64_t (*diff_two_keys)(struct xfs_btree_cur *cur,
> > +				   union xfs_btree_key *key1,
> > +				   union xfs_btree_key *key2);
> > +
> >  	const struct xfs_buf_ops	*buf_ops;
> >  
> >  #if defined(DEBUG) || defined(XFS_WARN)
> > @@ -182,6 +195,9 @@ struct xfs_btree_ops {
> >  #endif
> >  };
> >  
> > +/* btree ops flags */
> > +#define XFS_BTREE_OPS_OVERLAPPING	(1<<0)	/* overlapping intervals */
> > +
> >  /*
> >   * Reasons for the update_lastrec method to be called.
> >   */
> > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > index 68f27f7..ffea28c 100644
> > --- a/fs/xfs/xfs_trace.h
> > +++ b/fs/xfs/xfs_trace.h
> > @@ -38,6 +38,7 @@ struct xlog_recover_item;
> >  struct xfs_buf_log_format;
> >  struct xfs_inode_log_format;
> >  struct xfs_bmbt_irec;
> > +struct xfs_btree_cur;
> >  
> >  DECLARE_EVENT_CLASS(xfs_attr_list_class,
> >  	TP_PROTO(struct xfs_attr_list_context *ctx),
> > @@ -2183,6 +2184,41 @@ DEFINE_DISCARD_EVENT(xfs_discard_toosmall);
> >  DEFINE_DISCARD_EVENT(xfs_discard_exclude);
> >  DEFINE_DISCARD_EVENT(xfs_discard_busy);
> >  
> > +/* btree cursor events */
> > +DECLARE_EVENT_CLASS(xfs_btree_cur_class,
> > +	TP_PROTO(struct xfs_btree_cur *cur, int level, struct xfs_buf *bp),
> > +	TP_ARGS(cur, level, bp),
> > +	TP_STRUCT__entry(
> > +		__field(dev_t, dev)
> > +		__field(xfs_btnum_t, btnum)
> > +		__field(int, level)
> > +		__field(int, nlevels)
> > +		__field(int, ptr)
> > +		__field(xfs_daddr_t, daddr)
> > +	),
> > +	TP_fast_assign(
> > +		__entry->dev = cur->bc_mp->m_super->s_dev;
> > +		__entry->btnum = cur->bc_btnum;
> > +		__entry->level = level;
> > +		__entry->nlevels = cur->bc_nlevels;
> > +		__entry->ptr = cur->bc_ptrs[level];
> > +		__entry->daddr = bp->b_bn;
> > +	),
> > +	TP_printk("dev %d:%d btnum %d level %d/%d ptr %d daddr 0x%llx",
> > +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > +		  __entry->btnum,
> > +		  __entry->level,
> > +		  __entry->nlevels,
> > +		  __entry->ptr,
> > +		  (unsigned long long)__entry->daddr)
> > +)
> > +
> > +#define DEFINE_BTREE_CUR_EVENT(name) \
> > +DEFINE_EVENT(xfs_btree_cur_class, name, \
> > +	TP_PROTO(struct xfs_btree_cur *cur, int level, struct xfs_buf *bp), \
> > +	TP_ARGS(cur, level, bp))
> > +DEFINE_BTREE_CUR_EVENT(xfs_btree_updkeys);
> > +
> >  #endif /* _TRACE_XFS_H */
> >  
> >  #undef TRACE_INCLUDE_PATH
> > 
> > _______________________________________________
> > xfs mailing list
> > xfs@oss.sgi.com
> > http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 012/119] xfs: during btree split, save new block key & ptr for future insertion
  2016-06-27 22:30     ` Darrick J. Wong
@ 2016-06-28 12:31       ` Brian Foster
  0 siblings, 0 replies; 236+ messages in thread
From: Brian Foster @ 2016-06-28 12:31 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

On Mon, Jun 27, 2016 at 03:30:23PM -0700, Darrick J. Wong wrote:
> On Tue, Jun 21, 2016 at 09:00:45AM -0400, Brian Foster wrote:
> > On Thu, Jun 16, 2016 at 06:19:08PM -0700, Darrick J. Wong wrote:
> > > When a btree block has to be split, we pass the new block's ptr from
> > > xfs_btree_split() back to xfs_btree_insert() via a pointer parameter;
> > > however, we pass the block's key through the cursor's record.  It is a
> > > little weird to "initialize" a record from a key since the non-key
> > > attributes will have garbage values.
> > > 
> > > When we go to add support for interval queries, we have to be able to
> > > pass the lowest and highest keys accessible via a pointer.  There's no
> > > clean way to pass this back through the cursor's record field.
> > > Therefore, pass the key directly back to xfs_btree_insert() the same
> > > way that we pass the btree_ptr.
> > > 
> > > As a bonus, we no longer need init_rec_from_key and can drop it from the
> > > codebase.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_alloc_btree.c  |   12 ----------
> > >  fs/xfs/libxfs/xfs_bmap_btree.c   |   12 ----------
> > >  fs/xfs/libxfs/xfs_btree.c        |   44 +++++++++++++++++++-------------------
> > >  fs/xfs/libxfs/xfs_btree.h        |    2 --
> > >  fs/xfs/libxfs/xfs_ialloc_btree.c |   10 ---------
> > >  5 files changed, 22 insertions(+), 58 deletions(-)
> > > 
> > > 
> > ...
> > > diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> > > index 046fbcf..a096539 100644
> > > --- a/fs/xfs/libxfs/xfs_btree.c
> > > +++ b/fs/xfs/libxfs/xfs_btree.c
> > ...
> > > @@ -2929,16 +2927,16 @@ xfs_btree_insrec(
> > >  	struct xfs_btree_cur	*cur,	/* btree cursor */
> > >  	int			level,	/* level to insert record at */
> > >  	union xfs_btree_ptr	*ptrp,	/* i/o: block number inserted */
> > > -	union xfs_btree_rec	*recp,	/* i/o: record data inserted */
> > > +	union xfs_btree_key	*key,	/* i/o: block key for ptrp */
> > >  	struct xfs_btree_cur	**curp,	/* output: new cursor replacing cur */
> > >  	int			*stat)	/* success/failure */
> > >  {
> > >  	struct xfs_btree_block	*block;	/* btree block */
> > >  	struct xfs_buf		*bp;	/* buffer for block */
> > > -	union xfs_btree_key	key;	/* btree key */
> > >  	union xfs_btree_ptr	nptr;	/* new block ptr */
> > >  	struct xfs_btree_cur	*ncur;	/* new btree cursor */
> > > -	union xfs_btree_rec	nrec;	/* new record count */
> > > +	union xfs_btree_key	nkey;	/* new block key */
> > > +	union xfs_btree_rec	rec;	/* record to insert */
> > >  	int			optr;	/* old key/record index */
> > >  	int			ptr;	/* key/record index */
> > >  	int			numrecs;/* number of records */
> > > @@ -2947,8 +2945,14 @@ xfs_btree_insrec(
> > >  	int			i;
> > >  #endif
> > >  
> > > +	/* Make a key out of the record data to be inserted, and save it. */
> > > +	if (level == 0) {
> > > +		cur->bc_ops->init_rec_from_cur(cur, &rec);
> > > +		cur->bc_ops->init_key_from_rec(key, &rec);
> > > +	}
> > 
> > The level == 0 check looks a bit hacky to me. IOW, I think it's cleaner
> > that the key is initialized once in the caller rather than check for a
> > particular iteration down in xfs_btree_insrec(). That said,
> > xfs_btree_insrec() still needs rec initialized in the level == 0 case.
> > 
> > I wonder if we could create an inline xfs_btree_init_key_from_cur()
> > helper to combine the above calls, invoke it once in xfs_btree_insert(),
> > then push down the ->init_rec_from_cur() calls to the contexts further
> > down in this function where rec is actually required. There are only two
> > and one of them is DEBUG code. Thoughts?
> 
> How about I make btree_insert set both &key and &rec at the start and
> pass them both into btree_insrec?  That would eliminate the hacky check
> above and fix the dummy tracing hook too, in case it ever does anything.
> 

That seems fine to me.

Brian

> > 
> > > +
> > >  	XFS_BTREE_TRACE_CURSOR(cur, XBT_ENTRY);
> > > -	XFS_BTREE_TRACE_ARGIPR(cur, level, *ptrp, recp);
> > > +	XFS_BTREE_TRACE_ARGIPR(cur, level, *ptrp, &rec);
> > >  
> > 
> > So these look like unimplemented dummy tracing hooks. It sounds like
> > previously rec could have a junk value after a btree split, but now it
> > looks like rec is junk for every non-zero level. Kind of annoying, I
> > wonder if we can just kill these.. :/
> 
> <shrug> I have no opinion either way. :)
> 
> --D
> > 
> > Brian
> > 
> > >  	ncur = NULL;
> > >  
> > > @@ -2973,9 +2977,6 @@ xfs_btree_insrec(
> > >  		return 0;
> > >  	}
> > >  
> > > -	/* Make a key out of the record data to be inserted, and save it. */
> > > -	cur->bc_ops->init_key_from_rec(&key, recp);
> > > -
> > >  	optr = ptr;
> > >  
> > >  	XFS_BTREE_STATS_INC(cur, insrec);
> > > @@ -2992,10 +2993,10 @@ xfs_btree_insrec(
> > >  	/* Check that the new entry is being inserted in the right place. */
> > >  	if (ptr <= numrecs) {
> > >  		if (level == 0) {
> > > -			ASSERT(cur->bc_ops->recs_inorder(cur, recp,
> > > +			ASSERT(cur->bc_ops->recs_inorder(cur, &rec,
> > >  				xfs_btree_rec_addr(cur, ptr, block)));
> > >  		} else {
> > > -			ASSERT(cur->bc_ops->keys_inorder(cur, &key,
> > > +			ASSERT(cur->bc_ops->keys_inorder(cur, key,
> > >  				xfs_btree_key_addr(cur, ptr, block)));
> > >  		}
> > >  	}
> > > @@ -3008,7 +3009,7 @@ xfs_btree_insrec(
> > >  	xfs_btree_set_ptr_null(cur, &nptr);
> > >  	if (numrecs == cur->bc_ops->get_maxrecs(cur, level)) {
> > >  		error = xfs_btree_make_block_unfull(cur, level, numrecs,
> > > -					&optr, &ptr, &nptr, &ncur, &nrec, stat);
> > > +					&optr, &ptr, &nptr, &ncur, &nkey, stat);
> > >  		if (error || *stat == 0)
> > >  			goto error0;
> > >  	}
> > > @@ -3058,7 +3059,7 @@ xfs_btree_insrec(
> > >  #endif
> > >  
> > >  		/* Now put the new data in, bump numrecs and log it. */
> > > -		xfs_btree_copy_keys(cur, kp, &key, 1);
> > > +		xfs_btree_copy_keys(cur, kp, key, 1);
> > >  		xfs_btree_copy_ptrs(cur, pp, ptrp, 1);
> > >  		numrecs++;
> > >  		xfs_btree_set_numrecs(block, numrecs);
> > > @@ -3079,7 +3080,7 @@ xfs_btree_insrec(
> > >  		xfs_btree_shift_recs(cur, rp, 1, numrecs - ptr + 1);
> > >  
> > >  		/* Now put the new data in, bump numrecs and log it. */
> > > -		xfs_btree_copy_recs(cur, rp, recp, 1);
> > > +		xfs_btree_copy_recs(cur, rp, &rec, 1);
> > >  		xfs_btree_set_numrecs(block, ++numrecs);
> > >  		xfs_btree_log_recs(cur, bp, ptr, numrecs);
> > >  #ifdef DEBUG
> > > @@ -3095,7 +3096,7 @@ xfs_btree_insrec(
> > >  
> > >  	/* If we inserted at the start of a block, update the parents' keys. */
> > >  	if (optr == 1) {
> > > -		error = xfs_btree_updkey(cur, &key, level + 1);
> > > +		error = xfs_btree_updkey(cur, key, level + 1);
> > >  		if (error)
> > >  			goto error0;
> > >  	}
> > > @@ -3105,7 +3106,7 @@ xfs_btree_insrec(
> > >  	 * we are at the far right edge of the tree, update it.
> > >  	 */
> > >  	if (xfs_btree_is_lastrec(cur, block, level)) {
> > > -		cur->bc_ops->update_lastrec(cur, block, recp,
> > > +		cur->bc_ops->update_lastrec(cur, block, &rec,
> > >  					    ptr, LASTREC_INSREC);
> > >  	}
> > >  
> > > @@ -3115,7 +3116,7 @@ xfs_btree_insrec(
> > >  	 */
> > >  	*ptrp = nptr;
> > >  	if (!xfs_btree_ptr_is_null(cur, &nptr)) {
> > > -		*recp = nrec;
> > > +		*key = nkey;
> > >  		*curp = ncur;
> > >  	}
> > >  
> > > @@ -3146,14 +3147,13 @@ xfs_btree_insert(
> > >  	union xfs_btree_ptr	nptr;	/* new block number (split result) */
> > >  	struct xfs_btree_cur	*ncur;	/* new cursor (split result) */
> > >  	struct xfs_btree_cur	*pcur;	/* previous level's cursor */
> > > -	union xfs_btree_rec	rec;	/* record to insert */
> > > +	union xfs_btree_key	key;	/* key of block to insert */
> > >  
> > >  	level = 0;
> > >  	ncur = NULL;
> > >  	pcur = cur;
> > >  
> > >  	xfs_btree_set_ptr_null(cur, &nptr);
> > > -	cur->bc_ops->init_rec_from_cur(cur, &rec);
> > >  
> > >  	/*
> > >  	 * Loop going up the tree, starting at the leaf level.
> > > @@ -3165,7 +3165,7 @@ xfs_btree_insert(
> > >  		 * Insert nrec/nptr into this level of the tree.
> > >  		 * Note if we fail, nptr will be null.
> > >  		 */
> > > -		error = xfs_btree_insrec(pcur, level, &nptr, &rec, &ncur, &i);
> > > +		error = xfs_btree_insrec(pcur, level, &nptr, &key, &ncur, &i);
> > >  		if (error) {
> > >  			if (pcur != cur)
> > >  				xfs_btree_del_cursor(pcur, XFS_BTREE_ERROR);
> > > diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> > > index b955e5d..b99c018 100644
> > > --- a/fs/xfs/libxfs/xfs_btree.h
> > > +++ b/fs/xfs/libxfs/xfs_btree.h
> > > @@ -158,8 +158,6 @@ struct xfs_btree_ops {
> > >  	/* init values of btree structures */
> > >  	void	(*init_key_from_rec)(union xfs_btree_key *key,
> > >  				     union xfs_btree_rec *rec);
> > > -	void	(*init_rec_from_key)(union xfs_btree_key *key,
> > > -				     union xfs_btree_rec *rec);
> > >  	void	(*init_rec_from_cur)(struct xfs_btree_cur *cur,
> > >  				     union xfs_btree_rec *rec);
> > >  	void	(*init_ptr_from_cur)(struct xfs_btree_cur *cur,
> > > diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
> > > index 89c21d7..88da2ad 100644
> > > --- a/fs/xfs/libxfs/xfs_ialloc_btree.c
> > > +++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
> > > @@ -146,14 +146,6 @@ xfs_inobt_init_key_from_rec(
> > >  }
> > >  
> > >  STATIC void
> > > -xfs_inobt_init_rec_from_key(
> > > -	union xfs_btree_key	*key,
> > > -	union xfs_btree_rec	*rec)
> > > -{
> > > -	rec->inobt.ir_startino = key->inobt.ir_startino;
> > > -}
> > > -
> > > -STATIC void
> > >  xfs_inobt_init_rec_from_cur(
> > >  	struct xfs_btree_cur	*cur,
> > >  	union xfs_btree_rec	*rec)
> > > @@ -314,7 +306,6 @@ static const struct xfs_btree_ops xfs_inobt_ops = {
> > >  	.get_minrecs		= xfs_inobt_get_minrecs,
> > >  	.get_maxrecs		= xfs_inobt_get_maxrecs,
> > >  	.init_key_from_rec	= xfs_inobt_init_key_from_rec,
> > > -	.init_rec_from_key	= xfs_inobt_init_rec_from_key,
> > >  	.init_rec_from_cur	= xfs_inobt_init_rec_from_cur,
> > >  	.init_ptr_from_cur	= xfs_inobt_init_ptr_from_cur,
> > >  	.key_diff		= xfs_inobt_key_diff,
> > > @@ -336,7 +327,6 @@ static const struct xfs_btree_ops xfs_finobt_ops = {
> > >  	.get_minrecs		= xfs_inobt_get_minrecs,
> > >  	.get_maxrecs		= xfs_inobt_get_maxrecs,
> > >  	.init_key_from_rec	= xfs_inobt_init_key_from_rec,
> > > -	.init_rec_from_key	= xfs_inobt_init_rec_from_key,
> > >  	.init_rec_from_cur	= xfs_inobt_init_rec_from_cur,
> > >  	.init_ptr_from_cur	= xfs_finobt_init_ptr_from_cur,
> > >  	.key_diff		= xfs_inobt_key_diff,
> > > 

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 013/119] xfs: support btrees with overlapping intervals for keys
  2016-06-28  3:26     ` Darrick J. Wong
@ 2016-06-28 12:32       ` Brian Foster
  2016-06-28 17:36         ` Darrick J. Wong
  0 siblings, 1 reply; 236+ messages in thread
From: Brian Foster @ 2016-06-28 12:32 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

On Mon, Jun 27, 2016 at 08:26:21PM -0700, Darrick J. Wong wrote:
> On Wed, Jun 22, 2016 at 11:17:06AM -0400, Brian Foster wrote:
> > On Thu, Jun 16, 2016 at 06:19:15PM -0700, Darrick J. Wong wrote:
> > > On a filesystem with both reflink and reverse mapping enabled, it's
> > > possible to have multiple rmap records referring to the same blocks on
> > > disk.  When overlapping intervals are possible, querying a classic
> > > btree to find all records intersecting a given interval is inefficient
> > > because we cannot use the left side of the search interval to filter
> > > out non-matching records the same way that we can use the existing
> > > btree key to filter out records coming after the right side of the
> > > search interval.  This will become important once we want to use the
> > > rmap btree to rebuild BMBTs, or implement the (future) fsmap ioctl.
> > > 
> > > (For the non-overlapping case, we can perform such queries trivially
> > > by starting at the left side of the interval and walking the tree
> > > until we pass the right side.)
> > > 
> > > Therefore, extend the btree code to come closer to supporting
> > > intervals as a first-class record attribute.  This involves widening
> > > the btree node's key space to store both the lowest key reachable via
> > > the node pointer (as the btree does now) and the highest key reachable
> > > via the same pointer and teaching the btree modifying functions to
> > > keep the highest-key records up to date.
> > > 
> > > This behavior can be turned on via a new btree ops flag so that btrees
> > > that cannot store overlapping intervals don't pay the overhead costs
> > > in terms of extra code and disk format changes.
> > > 
> > > v2: When we're deleting a record in a btree that supports overlapped
> > > interval records and the deletion results in two btree blocks being
> > > joined, we defer updating the high/low keys until after all possible
> > > joining (at higher levels in the tree) have finished.  At this point,
> > > the btree pointers at all levels have been updated to remove the empty
> > > blocks and we can update the low and high keys.
> > > 
> > > When we're doing this, we must be careful to update the keys of all
> > > node pointers up to the root instead of stopping at the first set of
> > > keys that don't need updating.  This is because it's possible for a
> > > single deletion to cause joining of multiple levels of tree, and so
> > > we need to update everything going back to the root.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > 
> > I think I get the gist of this and it mostly looks Ok to me. A few
> > questions and minor comments...
> 
> Ok.
> 
> > >  fs/xfs/libxfs/xfs_btree.c |  379 +++++++++++++++++++++++++++++++++++++++++----
> > >  fs/xfs/libxfs/xfs_btree.h |   16 ++
> > >  fs/xfs/xfs_trace.h        |   36 ++++
> > >  3 files changed, 395 insertions(+), 36 deletions(-)
> > > 
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> > > index a096539..afcafd6 100644
> > > --- a/fs/xfs/libxfs/xfs_btree.c
> > > +++ b/fs/xfs/libxfs/xfs_btree.c
...
> > > @@ -2149,7 +2392,9 @@ xfs_btree_lshift(
> > >  		rkp = &key;
> > >  	}
> > >  
> > > -	/* Update the parent key values of right. */
> > > +	/* Update the parent key values of left and right. */
> > > +	xfs_btree_sibling_updkeys(cur, level, XFS_BB_LEFTSIB, left, lbp);
> > > +	xfs_btree_updkeys(cur, level);
> > >  	error = xfs_btree_updkey(cur, rkp, level + 1);
> > >  	if (error)
> > >  		goto error0;
> > > @@ -2321,6 +2566,9 @@ xfs_btree_rshift(
> > >  	if (error)
> > >  		goto error1;
> > >  
> > > +	/* Update left and right parent pointers */
> > > +	xfs_btree_updkeys(cur, level);
> > > +	xfs_btree_updkeys(tcur, level);
> > 
> > In this case, we grab the last record of the block, increment from there
> > and update using the cursor. This is much more straightforward, imo.
> > Could we use this approach in the left shift case as well?
> 
> Yes, I think so.  I might have started refactoring btree_sibling_updkeys
> out of existence and got distracted, since there isn't anything that uses
> the RIGHTSIB ptr value.
> 

Ok, I think that would be much cleaner.

> > >  	error = xfs_btree_updkey(tcur, rkp, level + 1);
> > >  	if (error)
> > >  		goto error1;
> > > @@ -2356,7 +2604,7 @@ __xfs_btree_split(
> > >  	struct xfs_btree_cur	*cur,
> > >  	int			level,
> > >  	union xfs_btree_ptr	*ptrp,
> > > -	union xfs_btree_key	*key,
> > > +	struct xfs_btree_double_key	*key,
> > >  	struct xfs_btree_cur	**curp,
> > >  	int			*stat)		/* success/failure */
> > >  {
> > > @@ -2452,9 +2700,6 @@ __xfs_btree_split(
> > >  
> > >  		xfs_btree_log_keys(cur, rbp, 1, rrecs);
> > >  		xfs_btree_log_ptrs(cur, rbp, 1, rrecs);
> > > -
> > > -		/* Grab the keys to the entries moved to the right block */
> > > -		xfs_btree_copy_keys(cur, key, rkp, 1);
> > >  	} else {
> > >  		/* It's a leaf.  Move records.  */
> > >  		union xfs_btree_rec	*lrp;	/* left record pointer */
> > > @@ -2465,12 +2710,8 @@ __xfs_btree_split(
> > >  
> > >  		xfs_btree_copy_recs(cur, rrp, lrp, rrecs);
> > >  		xfs_btree_log_recs(cur, rbp, 1, rrecs);
> > > -
> > > -		cur->bc_ops->init_key_from_rec(key,
> > > -			xfs_btree_rec_addr(cur, 1, right));
> > >  	}
> > >  
> > > -
> > >  	/*
> > >  	 * Find the left block number by looking in the buffer.
> > >  	 * Adjust numrecs, sibling pointers.
> > > @@ -2484,6 +2725,12 @@ __xfs_btree_split(
> > >  	xfs_btree_set_numrecs(left, lrecs);
> > >  	xfs_btree_set_numrecs(right, xfs_btree_get_numrecs(right) + rrecs);
> > >  
> > > +	/* Find the low & high keys for the new block. */
> > > +	if (level > 0)
> > > +		xfs_btree_find_node_keys(cur, right, &key->low, &key->high);
> > > +	else
> > > +		xfs_btree_find_leaf_keys(cur, right, &key->low, &key->high);
> > > +
> > 
> > Why not push these into the above if/else where the previous key
> > copy/init calls were removed from?
> 
> We don't set bb_numrecs on the right block until the line above the new
> hunk, and the btree_find_*_keys functions require numrecs to be set.
> 
> The removed key copy/init calls only looked at keys[1].
> 
> That said, it's trivial to move the set_numrecs calls above the if statement.
> 

Ok, thanks. No need to shuffle it around. I'd suggest a one-liner
comment though so somebody doesn't blindly refactor this down the road.
It also sounds like the find keys functions could use ASSERT() checks
for a sane bb_numrecs.

> > >  	xfs_btree_log_block(cur, rbp, XFS_BB_ALL_BITS);
> > >  	xfs_btree_log_block(cur, lbp, XFS_BB_NUMRECS | XFS_BB_RIGHTSIB);
> > >  
...
> > > @@ -3095,8 +3365,24 @@ xfs_btree_insrec(
> > >  	xfs_btree_log_block(cur, bp, XFS_BB_NUMRECS);
> > >  
> > >  	/* If we inserted at the start of a block, update the parents' keys. */
> > 
> > This comment is associated with the codeblock that has been pushed
> > further down, no?
> 
> Correct.  I think that got mismerged somewhere along the way.
> 
> > > +	if (ncur && bp->b_bn != old_bn) {
> > > +		/*
> > > +		 * We just inserted into a new tree block, which means that
> > > +		 * the key for the block is in nkey, not the tree.
> > > +		 */
> > > +		if (level == 0)
> > > +			xfs_btree_find_leaf_keys(cur, block, &nkey.low,
> > > +					&nkey.high);
> > > +		else
> > > +			xfs_btree_find_node_keys(cur, block, &nkey.low,
> > > +					&nkey.high);
> > > +	} else {
> > > +		/* Updating the left block, do it the standard way. */
> > > +		xfs_btree_updkeys(cur, level);
> > > +	}
> > > +
> > 
> > Not quite sure I follow the purpose of this hunk. Is this for the case
> > where a btree split occurs, nkey is filled in for the new/right block
> > and then (after nkey is filled in) the new record ends up being added to
> > the new block? If so, what about the case where ncur is not created?
> > (It looks like that's possible from the code, but I could easily be
> > missing some context as to why that's not the case.)
> 
> Yes, the first part of the if-else hunk is to fill out nkey when we've
> split a btree block.  Now that I look at it again, I think that whole
> weird conditional could be replaced with the same xfs_btree_ptr_is_null()
> check later on.  I think it can also be combined with it.
> 

Ok.

> Commentage for now:
> 
> /*
>  * If we just inserted a new tree block, we have to find the low
>  * and high keys for the new block and arrange to pass them back
>  * separately.  If we're just updating a block we can use the
>  * regular tree update mechanism.
>  */
> 

Couldn't you just point out that nkey may not be coherent with the new
block if the new record was inserted therein?

> > In any event, I think we could elaborate a bit in the comment on why
> > this is necessary. I'd also move it above the top-level if/else.
> > 
> > >  	if (optr == 1) {
> > > -		error = xfs_btree_updkey(cur, key, level + 1);
> > > +		error = xfs_btree_updkey(cur, &key->low, level + 1);
> > >  		if (error)
> > >  			goto error0;
> > >  	}
> > > @@ -3147,7 +3433,7 @@ xfs_btree_insert(
> > >  	union xfs_btree_ptr	nptr;	/* new block number (split result) */
> > >  	struct xfs_btree_cur	*ncur;	/* new cursor (split result) */
> > >  	struct xfs_btree_cur	*pcur;	/* previous level's cursor */
> > > -	union xfs_btree_key	key;	/* key of block to insert */
> > > +	struct xfs_btree_double_key	key;	/* key of block to insert */
> > 
> > Probably should fix up the function param alignment here and the couple
> > other or so places we make this change.
> 
> I changed the name to xfs_btree_bigkey, which avoids the alignment problems.
> 

Sounds good.

Brian

> --D
> 
> > 
> > Brian
> > 
> > >  
> > >  	level = 0;
> > >  	ncur = NULL;
> > > @@ -3552,6 +3838,7 @@ xfs_btree_delrec(
> > >  	 * If we deleted the leftmost entry in the block, update the
> > >  	 * key values above us in the tree.
> > >  	 */
> > > +	xfs_btree_updkeys(cur, level);
> > >  	if (ptr == 1) {
> > >  		error = xfs_btree_updkey(cur, keyp, level + 1);
> > >  		if (error)
> > > @@ -3882,6 +4169,16 @@ xfs_btree_delrec(
> > >  	if (level > 0)
> > >  		cur->bc_ptrs[level]--;
> > >  
> > > +	/*
> > > +	 * We combined blocks, so we have to update the parent keys if the
> > > +	 * btree supports overlapped intervals.  However, bc_ptrs[level + 1]
> > > +	 * points to the old block so that the caller knows which record to
> > > +	 * delete.  Therefore, the caller must be savvy enough to call updkeys
> > > +	 * for us if we return stat == 2.  The other exit points from this
> > > +	 * function don't require deletions further up the tree, so they can
> > > +	 * call updkeys directly.
> > > +	 */
> > > +
> > >  	XFS_BTREE_TRACE_CURSOR(cur, XBT_EXIT);
> > >  	/* Return value means the next level up has something to do. */
> > >  	*stat = 2;
> > > @@ -3907,6 +4204,7 @@ xfs_btree_delete(
> > >  	int			error;	/* error return value */
> > >  	int			level;
> > >  	int			i;
> > > +	bool			joined = false;
> > >  
> > >  	XFS_BTREE_TRACE_CURSOR(cur, XBT_ENTRY);
> > >  
> > > @@ -3920,8 +4218,17 @@ xfs_btree_delete(
> > >  		error = xfs_btree_delrec(cur, level, &i);
> > >  		if (error)
> > >  			goto error0;
> > > +		if (i == 2)
> > > +			joined = true;
> > >  	}
> > >  
> > > +	/*
> > > +	 * If we combined blocks as part of deleting the record, delrec won't
> > > +	 * have updated the parent keys so we have to do that here.
> > > +	 */
> > > +	if (joined)
> > > +		xfs_btree_updkeys_force(cur, 0);
> > > +
> > >  	if (i == 0) {
> > >  		for (level = 1; level < cur->bc_nlevels; level++) {
> > >  			if (cur->bc_ptrs[level] == 0) {
> > > diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> > > index b99c018..a5ec6c7 100644
> > > --- a/fs/xfs/libxfs/xfs_btree.h
> > > +++ b/fs/xfs/libxfs/xfs_btree.h
> > > @@ -126,6 +126,9 @@ struct xfs_btree_ops {
> > >  	size_t	key_len;
> > >  	size_t	rec_len;
> > >  
> > > +	/* flags */
> > > +	uint	flags;
> > > +
> > >  	/* cursor operations */
> > >  	struct xfs_btree_cur *(*dup_cursor)(struct xfs_btree_cur *);
> > >  	void	(*update_cursor)(struct xfs_btree_cur *src,
> > > @@ -162,11 +165,21 @@ struct xfs_btree_ops {
> > >  				     union xfs_btree_rec *rec);
> > >  	void	(*init_ptr_from_cur)(struct xfs_btree_cur *cur,
> > >  				     union xfs_btree_ptr *ptr);
> > > +	void	(*init_high_key_from_rec)(union xfs_btree_key *key,
> > > +					  union xfs_btree_rec *rec);
> > >  
> > >  	/* difference between key value and cursor value */
> > >  	__int64_t (*key_diff)(struct xfs_btree_cur *cur,
> > >  			      union xfs_btree_key *key);
> > >  
> > > +	/*
> > > +	 * Difference between key2 and key1 -- positive if key2 > key1,
> > > +	 * negative if key2 < key1, and zero if equal.
> > > +	 */
> > > +	__int64_t (*diff_two_keys)(struct xfs_btree_cur *cur,
> > > +				   union xfs_btree_key *key1,
> > > +				   union xfs_btree_key *key2);
> > > +
> > >  	const struct xfs_buf_ops	*buf_ops;
> > >  
> > >  #if defined(DEBUG) || defined(XFS_WARN)
> > > @@ -182,6 +195,9 @@ struct xfs_btree_ops {
> > >  #endif
> > >  };
> > >  
> > > +/* btree ops flags */
> > > +#define XFS_BTREE_OPS_OVERLAPPING	(1<<0)	/* overlapping intervals */
> > > +
> > >  /*
> > >   * Reasons for the update_lastrec method to be called.
> > >   */
> > > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > > index 68f27f7..ffea28c 100644
> > > --- a/fs/xfs/xfs_trace.h
> > > +++ b/fs/xfs/xfs_trace.h
> > > @@ -38,6 +38,7 @@ struct xlog_recover_item;
> > >  struct xfs_buf_log_format;
> > >  struct xfs_inode_log_format;
> > >  struct xfs_bmbt_irec;
> > > +struct xfs_btree_cur;
> > >  
> > >  DECLARE_EVENT_CLASS(xfs_attr_list_class,
> > >  	TP_PROTO(struct xfs_attr_list_context *ctx),
> > > @@ -2183,6 +2184,41 @@ DEFINE_DISCARD_EVENT(xfs_discard_toosmall);
> > >  DEFINE_DISCARD_EVENT(xfs_discard_exclude);
> > >  DEFINE_DISCARD_EVENT(xfs_discard_busy);
> > >  
> > > +/* btree cursor events */
> > > +DECLARE_EVENT_CLASS(xfs_btree_cur_class,
> > > +	TP_PROTO(struct xfs_btree_cur *cur, int level, struct xfs_buf *bp),
> > > +	TP_ARGS(cur, level, bp),
> > > +	TP_STRUCT__entry(
> > > +		__field(dev_t, dev)
> > > +		__field(xfs_btnum_t, btnum)
> > > +		__field(int, level)
> > > +		__field(int, nlevels)
> > > +		__field(int, ptr)
> > > +		__field(xfs_daddr_t, daddr)
> > > +	),
> > > +	TP_fast_assign(
> > > +		__entry->dev = cur->bc_mp->m_super->s_dev;
> > > +		__entry->btnum = cur->bc_btnum;
> > > +		__entry->level = level;
> > > +		__entry->nlevels = cur->bc_nlevels;
> > > +		__entry->ptr = cur->bc_ptrs[level];
> > > +		__entry->daddr = bp->b_bn;
> > > +	),
> > > +	TP_printk("dev %d:%d btnum %d level %d/%d ptr %d daddr 0x%llx",
> > > +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > > +		  __entry->btnum,
> > > +		  __entry->level,
> > > +		  __entry->nlevels,
> > > +		  __entry->ptr,
> > > +		  (unsigned long long)__entry->daddr)
> > > +)
> > > +
> > > +#define DEFINE_BTREE_CUR_EVENT(name) \
> > > +DEFINE_EVENT(xfs_btree_cur_class, name, \
> > > +	TP_PROTO(struct xfs_btree_cur *cur, int level, struct xfs_buf *bp), \
> > > +	TP_ARGS(cur, level, bp))
> > > +DEFINE_BTREE_CUR_EVENT(xfs_btree_updkeys);
> > > +
> > >  #endif /* _TRACE_XFS_H */
> > >  
> > >  #undef TRACE_INCLUDE_PATH
> > > 
> > > _______________________________________________
> > > xfs mailing list
> > > xfs@oss.sgi.com
> > > http://oss.sgi.com/mailman/listinfo/xfs
> 

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 014/119] xfs: introduce interval queries on btrees
  2016-06-27 21:07     ` Darrick J. Wong
@ 2016-06-28 12:32       ` Brian Foster
  2016-06-28 16:29         ` Darrick J. Wong
  0 siblings, 1 reply; 236+ messages in thread
From: Brian Foster @ 2016-06-28 12:32 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

On Mon, Jun 27, 2016 at 02:07:46PM -0700, Darrick J. Wong wrote:
> On Wed, Jun 22, 2016 at 11:18:00AM -0400, Brian Foster wrote:
> > On Thu, Jun 16, 2016 at 06:19:21PM -0700, Darrick J. Wong wrote:
> > > Create a function to enable querying of btree records mapping to a
> > > range of keys.  This will be used in subsequent patches to allow
> > > querying the reverse mapping btree to find the extents mapped to a
> > > range of physical blocks, though the generic code can be used for
> > > any range query.
> > > 
> > > v2: add some shortcuts so that we can jump out of processing once
> > > we know there won't be any more records to find.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_btree.c |  249 +++++++++++++++++++++++++++++++++++++++++++++
> > >  fs/xfs/libxfs/xfs_btree.h |   22 +++-
> > >  fs/xfs/xfs_trace.h        |    1 
> > >  3 files changed, 267 insertions(+), 5 deletions(-)
> > > 
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> > > index afcafd6..5f5cf23 100644
> > > --- a/fs/xfs/libxfs/xfs_btree.c
> > > +++ b/fs/xfs/libxfs/xfs_btree.c
> > > @@ -4509,3 +4509,252 @@ xfs_btree_calc_size(
> > >  	}
> > >  	return rval;
> > >  }
> > > +
> > > +/* Query a regular btree for all records overlapping a given interval. */
> > 
> > Can you elaborate on the search algorithm used? (More for reference
> > against the overlapped query, as that one is more complex).
> 
> Ok.  Both query_range functions aim to return all records intersecting the
> given range.
> 
> For non-overlapped btrees, we start with an LE lookup of the low key and
> return each record we find until we reach a record with a key greater
> than the high key.
> 
> For overlapped btrees, we follow the procedure in the "Interval trees"
> section of _Introduction to Algorithms_, which is 14.3 in the 2nd and
> 3rd editions.  The query algorithm is roughly as follows:
> 
> For any leaf btree node, generate the low and high keys for the record.
> If there's a range overlap with the query's low and high keys, pass the
> record to the iterator function.
> 
> For any internal btree node, compare the low and high keys for each pointer
> against the query's low and high keys.  If there's an overlap, follow the
> pointer downwards in the tree.
> 
> (I could render the figures in the book as ASCII art if anyone wants.)
> 

Thanks. I meant more to update the comments above each function. :) No
need to go as far as ASCII art I don't think (the external reference
might be good though). I was really just looking for something that says
"this function is supposed to do <whatever>" so somebody reading through
it has a starting point of reference.
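[For the archives, the simple-query procedure described above -- position at
the low key, walk forward, stop once past the high key -- can be sketched
with a toy flat-array model.  Everything below is illustrative (made-up names
and types, not the kernel API), and a real btree positions with the LE lookup
rather than scanning from the start:]

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for sorted, non-overlapping btree records. */
struct rec {
	int	key;
};

typedef int (*range_fn)(const struct rec *rec, void *priv);

/*
 * Call fn on every record with low_key <= key <= high_key.  Because
 * the records are sorted and do not overlap, we can stop as soon as
 * one record's key exceeds high_key.
 */
static int
simple_query_range(
	const struct rec	*recs,
	size_t			nrecs,
	int			low_key,
	int			high_key,
	range_fn		fn,
	void			*priv)
{
	size_t			i;
	int			error;

	for (i = 0; i < nrecs; i++) {
		/* Record is below the range; keep going. */
		if (recs[i].key < low_key)
			continue;
		/* Have we gone past the end? */
		if (recs[i].key > high_key)
			break;
		/* Callback */
		error = fn(&recs[i], priv);
		if (error)
			return error;
	}
	return 0;
}

/* Example callback: count the records in the range. */
static int
count_fn(
	const struct rec	*rec,
	void			*priv)
{
	(void)rec;
	(*(int *)priv)++;
	return 0;
}
```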

> > 
> > > +STATIC int
> > > +xfs_btree_simple_query_range(
> > > +	struct xfs_btree_cur		*cur,
> > > +	union xfs_btree_irec		*low_rec,
> > > +	union xfs_btree_irec		*high_rec,
> > > +	xfs_btree_query_range_fn	fn,
> > > +	void				*priv)
> > > +{
> > > +	union xfs_btree_rec		*recp;
> > > +	union xfs_btree_rec		rec;
> > > +	union xfs_btree_key		low_key;
> > > +	union xfs_btree_key		high_key;
> > > +	union xfs_btree_key		rec_key;
> > > +	__int64_t			diff;
> > > +	int				stat;
> > > +	bool				firstrec = true;
> > > +	int				error;
> > > +
> > > +	ASSERT(cur->bc_ops->init_high_key_from_rec);
> > > +
> > > +	/* Find the keys of both ends of the interval. */
> > > +	cur->bc_rec = *high_rec;
> > > +	cur->bc_ops->init_rec_from_cur(cur, &rec);
> > > +	cur->bc_ops->init_key_from_rec(&high_key, &rec);
> > > +
> > > +	cur->bc_rec = *low_rec;
> > > +	cur->bc_ops->init_rec_from_cur(cur, &rec);
> > > +	cur->bc_ops->init_key_from_rec(&low_key, &rec);
> > > +
> > > +	/* Find the leftmost record. */
> > > +	stat = 0;
> > > +	error = xfs_btree_lookup(cur, XFS_LOOKUP_LE, &stat);
> > > +	if (error)
> > > +		goto out;
> > > +
> > > +	while (stat) {
> > > +		/* Find the record. */
> > > +		error = xfs_btree_get_rec(cur, &recp, &stat);
> > > +		if (error || !stat)
> > > +			break;
> > > +
> > > +		/* Can we tell if this record is too low? */
> > > +		if (firstrec) {
> > > +			cur->bc_rec = *low_rec;
> > > +			cur->bc_ops->init_high_key_from_rec(&rec_key, recp);
> > > +			diff = cur->bc_ops->key_diff(cur, &rec_key);
> > > +			if (diff < 0)
> > > +				goto advloop;
> > > +		}
> > > +		firstrec = false;
> > 
> > This could move up into the if block.
> 
> Ok.
> 
> > > +
> > > +		/* Have we gone past the end? */
> > > +		cur->bc_rec = *high_rec;
> > > +		cur->bc_ops->init_key_from_rec(&rec_key, recp);
> > 
> > I'd move this up to immediately after the xfs_btree_get_rec() call and
> > eliminate the duplicate in the 'if (firstrec)' block above.
> 
> Ok.  That key ought to be named rec_hkey too.
> 
> > > +		diff = cur->bc_ops->key_diff(cur, &rec_key);
> > > +		if (diff > 0)
> > > +			break;
> > > +
> > > +		/* Callback */
> > > +		error = fn(cur, recp, priv);
> > > +		if (error < 0 || error == XFS_BTREE_QUERY_RANGE_ABORT)
> > > +			break;
> > > +
> > > +advloop:
> > > +		/* Move on to the next record. */
> > > +		error = xfs_btree_increment(cur, 0, &stat);
> > > +		if (error)
> > > +			break;
> > > +	}
> > > +
> > > +out:
> > > +	return error;
> > > +}
> > > +
> > > +/*
> > > + * Query an overlapped interval btree for all records overlapping a given
> > > + * interval.
> > > + */
> > 
> > Same comment here, can you elaborate on the search algorithm? Also, I
> > think an example or generic description of the rules around what records
> > this query returns (e.g., low_rec/high_rec vs. record low/high keys)
> > would be useful, particularly since I, at least, don't have much context
> > on the rmap+reflink scenarios quite yet.
> 
> Let's say you have a bunch of (overlapped) rmap records:
> 
> 1: +- file A startblock B offset C length D -----------+
> 2:      +- file E startblock F offset G length H --------------+
> 3:      +- file I startblock F offset J length K --+
> 4:                                                        +- file L... --+
> 
> Now say we want to map block (B+D) into file A at offset (C+D).  Ideally, we'd
> simply increment the length of record 1.  But how do we find the record that
> ends at (B+D-1)?  An LE lookup of (B+D-1) would return record 3 because the
> keys are ordered first by startblock.  An interval query would return records
> 1 and 2 because they both overlap (B+D-1), and from that we can pick out
> record 1 as the appropriate left neighbor.
> 

Great, thanks. Can you include this content in the comment above the
function as well?

Brian

> In the non-overlapped case you can do an LE lookup and decrement the cursor
> because a record's interval must end before the next record.
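[The record-selection rule in the example above boils down to a test on
closed intervals.  A minimal sketch, with hypothetical types and block
numbers standing in for the rmap records -- this is not the kernel code:]

```c
#include <assert.h>
#include <stdbool.h>

/* Closed interval [lo, hi], like an rmap extent
 * [startblock, startblock + blockcount - 1]. */
struct ival {
	long	lo;
	long	hi;
};

/*
 * Two closed intervals overlap iff each one starts at or before the
 * other one ends; this is the same test the overlapped query applies
 * to every record and every node pointer on the way down the tree.
 */
static bool
ivals_overlap(
	const struct ival	*a,
	const struct ival	*b)
{
	return a->hi >= b->lo && b->hi >= a->lo;
}
```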
> 
> > > +STATIC int
> > > +xfs_btree_overlapped_query_range(
> > > +	struct xfs_btree_cur		*cur,
> > > +	union xfs_btree_irec		*low_rec,
> > > +	union xfs_btree_irec		*high_rec,
> > > +	xfs_btree_query_range_fn	fn,
> > > +	void				*priv)
> > > +{
> > > +	union xfs_btree_ptr		ptr;
> > > +	union xfs_btree_ptr		*pp;
> > > +	union xfs_btree_key		rec_key;
> > > +	union xfs_btree_key		low_key;
> > > +	union xfs_btree_key		high_key;
> > > +	union xfs_btree_key		*lkp;
> > > +	union xfs_btree_key		*hkp;
> > > +	union xfs_btree_rec		rec;
> > > +	union xfs_btree_rec		*recp;
> > > +	struct xfs_btree_block		*block;
> > > +	__int64_t			ldiff;
> > > +	__int64_t			hdiff;
> > > +	int				level;
> > > +	struct xfs_buf			*bp;
> > > +	int				i;
> > > +	int				error;
> > > +
> > > +	/* Find the keys of both ends of the interval. */
> > > +	cur->bc_rec = *high_rec;
> > > +	cur->bc_ops->init_rec_from_cur(cur, &rec);
> > > +	cur->bc_ops->init_key_from_rec(&high_key, &rec);
> > > +
> > > +	cur->bc_rec = *low_rec;
> > > +	cur->bc_ops->init_rec_from_cur(cur, &rec);
> > > +	cur->bc_ops->init_key_from_rec(&low_key, &rec);
> > > +
> > > +	/* Load the root of the btree. */
> > > +	level = cur->bc_nlevels - 1;
> > > +	cur->bc_ops->init_ptr_from_cur(cur, &ptr);
> > > +	error = xfs_btree_lookup_get_block(cur, level, &ptr, &block);
> > > +	if (error)
> > > +		return error;
> > > +	xfs_btree_get_block(cur, level, &bp);
> > > +	trace_xfs_btree_overlapped_query_range(cur, level, bp);
> > > +#ifdef DEBUG
> > > +	error = xfs_btree_check_block(cur, block, level, bp);
> > > +	if (error)
> > > +		goto out;
> > > +#endif
> > > +	cur->bc_ptrs[level] = 1;
> > > +
> > > +	while (level < cur->bc_nlevels) {
> > > +		block = XFS_BUF_TO_BLOCK(cur->bc_bufs[level]);
> > > +
> > > +		if (level == 0) {
> > > +			/* End of leaf, pop back towards the root. */
> > > +			if (cur->bc_ptrs[level] >
> > > +			    be16_to_cpu(block->bb_numrecs)) {
> > > +leaf_pop_up:
> > > +				if (level < cur->bc_nlevels - 1)
> > > +					cur->bc_ptrs[level + 1]++;
> > > +				level++;
> > > +				continue;
> > > +			}
> > > +
> > > +			recp = xfs_btree_rec_addr(cur, cur->bc_ptrs[0], block);
> > > +
> > > +			cur->bc_ops->init_high_key_from_rec(&rec_key, recp);
> > > +			ldiff = cur->bc_ops->diff_two_keys(cur, &low_key,
> > > +					&rec_key);
> > > +
> > > +			cur->bc_ops->init_key_from_rec(&rec_key, recp);
> > > +			hdiff = cur->bc_ops->diff_two_keys(cur, &rec_key,
> > > +					&high_key);
> > > +
> > 
> > This looked a little funny to me because I expected diff_two_keys() to
> > basically be param1 - param2. Looking ahead at the rmapbt code, it is in
> > fact the other way around. I'm not sure we have precedent for either
> > way, tbh. I still have to stare at this some more, but I wonder if a
> > "does record overlap" helper (with comments) would help clean this up a
> > bit.
> 
> You're correct this is exactly the opposite of the compare functions in
> the C library and the rest of the kernel.  I'll fix that up.
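[The C-library convention referred to here is the qsort()/memcmp() one:
negative when the first argument sorts lower, zero on equality, positive
otherwise.  A hypothetical two-field key comparator in that style --
illustrative only, not the actual rmapbt key layout:]

```c
#include <assert.h>

/* Hypothetical compound key, ordered by startblock, then owner. */
struct key {
	long	startblock;
	long	owner;
};

/* key1 - key2: negative if key1 < key2, zero if equal, positive if
 * key1 > key2, matching the usual C comparator convention. */
static long
diff_two_keys(
	const struct key	*key1,
	const struct key	*key2)
{
	if (key1->startblock != key2->startblock)
		return key1->startblock - key2->startblock;
	return key1->owner - key2->owner;
}
```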
> 
> > > +			/* If the record matches, callback */
> > > +			if (ldiff >= 0 && hdiff >= 0) {
> 
> Ok, I'll make it a little clearer what we're testing here:
> 
> /*
>  * If (record's high key >= query's low key) and
>  *    (query's high key >= record's low key), then
>  * this record overlaps the query range, so callback.
>  */
> 
> 
> > > +				error = fn(cur, recp, priv);
> > > +				if (error < 0 ||
> > > +				    error == XFS_BTREE_QUERY_RANGE_ABORT)
> > > +					break;
> > > +			} else if (hdiff < 0) {
> > > +				/* Record is larger than high key; pop. */
> > > +				goto leaf_pop_up;
> > > +			}
> > > +			cur->bc_ptrs[level]++;
> > > +			continue;
> > > +		}
> > > +
> > > +		/* End of node, pop back towards the root. */
> > > +		if (cur->bc_ptrs[level] > be16_to_cpu(block->bb_numrecs)) {
> > > +node_pop_up:
> > > +			if (level < cur->bc_nlevels - 1)
> > > +				cur->bc_ptrs[level + 1]++;
> > > +			level++;
> > > +			continue;
> > 
> > Looks like same code as leaf_pop_up. I wonder if we can bury this at the
> > end of the loop with a common label.
> 
> Yep.
> 
> > > +		}
> > > +
> > > +		lkp = xfs_btree_key_addr(cur, cur->bc_ptrs[level], block);
> > > +		hkp = xfs_btree_high_key_addr(cur, cur->bc_ptrs[level], block);
> > > +		pp = xfs_btree_ptr_addr(cur, cur->bc_ptrs[level], block);
> > > +
> > > +		ldiff = cur->bc_ops->diff_two_keys(cur, &low_key, hkp);
> > > +		hdiff = cur->bc_ops->diff_two_keys(cur, lkp, &high_key);
> > > +
> > > +		/* If the key matches, drill another level deeper. */
> > > +		if (ldiff >= 0 && hdiff >= 0) {
> > > +			level--;
> > > +			error = xfs_btree_lookup_get_block(cur, level, pp,
> > > +					&block);
> > > +			if (error)
> > > +				goto out;
> > > +			xfs_btree_get_block(cur, level, &bp);
> > > +			trace_xfs_btree_overlapped_query_range(cur, level, bp);
> > > +#ifdef DEBUG
> > > +			error = xfs_btree_check_block(cur, block, level, bp);
> > > +			if (error)
> > > +				goto out;
> > > +#endif
> > > +			cur->bc_ptrs[level] = 1;
> > > +			continue;
> > > +		} else if (hdiff < 0) {
> > > +			/* The low key is larger than the upper range; pop. */
> > > +			goto node_pop_up;
> > > +		}
> > > +		cur->bc_ptrs[level]++;
> > > +	}
> > > +
> > > +out:
> > > +	/*
> > > +	 * If we don't end this function with the cursor pointing at a record
> > > +	 * block, a subsequent non-error cursor deletion will not release
> > > +	 * node-level buffers, causing a buffer leak.  This is quite possible
> > > +	 * with a zero-results range query, so release the buffers if we
> > > +	 * failed to return any results.
> > > +	 */
> > > +	if (cur->bc_bufs[0] == NULL) {
> > > +		for (i = 0; i < cur->bc_nlevels; i++) {
> > > +			if (cur->bc_bufs[i]) {
> > > +				xfs_trans_brelse(cur->bc_tp, cur->bc_bufs[i]);
> > > +				cur->bc_bufs[i] = NULL;
> > > +				cur->bc_ptrs[i] = 0;
> > > +				cur->bc_ra[i] = 0;
> > > +			}
> > > +		}
> > > +	}
> > > +
> > > +	return error;
> > > +}
> > > +
> > > +/*
> > > + * Query a btree for all records overlapping a given interval of keys.  The
> > > + * supplied function will be called with each record found; return one of the
> > > + * XFS_BTREE_QUERY_RANGE_{CONTINUE,ABORT} values or the usual negative error
> > > + * code.  This function returns XFS_BTREE_QUERY_RANGE_ABORT, zero, or a
> > > + * negative error code.
> > > + */
> > > +int
> > > +xfs_btree_query_range(
> > > +	struct xfs_btree_cur		*cur,
> > > +	union xfs_btree_irec		*low_rec,
> > > +	union xfs_btree_irec		*high_rec,
> > > +	xfs_btree_query_range_fn	fn,
> > > +	void				*priv)
> > > +{
> > > +	if (!(cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING))
> > > +		return xfs_btree_simple_query_range(cur, low_rec,
> > > +				high_rec, fn, priv);
> > > +	return xfs_btree_overlapped_query_range(cur, low_rec, high_rec,
> > > +			fn, priv);
> > > +}
> > > diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> > > index a5ec6c7..898fee5 100644
> > > --- a/fs/xfs/libxfs/xfs_btree.h
> > > +++ b/fs/xfs/libxfs/xfs_btree.h
> > > @@ -206,6 +206,12 @@ struct xfs_btree_ops {
> > >  #define LASTREC_DELREC	2
> > >  
> > >  
> > > +union xfs_btree_irec {
> > > +	xfs_alloc_rec_incore_t		a;
> > > +	xfs_bmbt_irec_t			b;
> > > +	xfs_inobt_rec_incore_t		i;
> > > +};
> > > +
> > 
> > We might as well kill off the typedef usage here.
> 
> Ok.  Thx for the review!
> 
> --D
> 
> > 
> > Brian
> > 
> > >  /*
> > >   * Btree cursor structure.
> > >   * This collects all information needed by the btree code in one place.
> > > @@ -216,11 +222,7 @@ typedef struct xfs_btree_cur
> > >  	struct xfs_mount	*bc_mp;	/* file system mount struct */
> > >  	const struct xfs_btree_ops *bc_ops;
> > >  	uint			bc_flags; /* btree features - below */
> > > -	union {
> > > -		xfs_alloc_rec_incore_t	a;
> > > -		xfs_bmbt_irec_t		b;
> > > -		xfs_inobt_rec_incore_t	i;
> > > -	}		bc_rec;		/* current insert/search record value */
> > > +	union xfs_btree_irec	bc_rec;	/* current insert/search record value */
> > >  	struct xfs_buf	*bc_bufs[XFS_BTREE_MAXLEVELS];	/* buf ptr per level */
> > >  	int		bc_ptrs[XFS_BTREE_MAXLEVELS];	/* key/record # */
> > >  	__uint8_t	bc_ra[XFS_BTREE_MAXLEVELS];	/* readahead bits */
> > > @@ -494,4 +496,14 @@ xfs_extlen_t xfs_btree_calc_size(struct xfs_mount *mp, uint *limits,
> > >  uint xfs_btree_compute_maxlevels(struct xfs_mount *mp, uint *limits,
> > >  		unsigned long len);
> > >  
> > > +/* return codes */
> > > +#define XFS_BTREE_QUERY_RANGE_CONTINUE	0	/* keep iterating */
> > > +#define XFS_BTREE_QUERY_RANGE_ABORT	1	/* stop iterating */
> > > +typedef int (*xfs_btree_query_range_fn)(struct xfs_btree_cur *cur,
> > > +		union xfs_btree_rec *rec, void *priv);
> > > +
> > > +int xfs_btree_query_range(struct xfs_btree_cur *cur,
> > > +		union xfs_btree_irec *low_rec, union xfs_btree_irec *high_rec,
> > > +		xfs_btree_query_range_fn fn, void *priv);
> > > +
> > >  #endif	/* __XFS_BTREE_H__ */
> > > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > > index ffea28c..f0ac9c9 100644
> > > --- a/fs/xfs/xfs_trace.h
> > > +++ b/fs/xfs/xfs_trace.h
> > > @@ -2218,6 +2218,7 @@ DEFINE_EVENT(xfs_btree_cur_class, name, \
> > >  	TP_PROTO(struct xfs_btree_cur *cur, int level, struct xfs_buf *bp), \
> > >  	TP_ARGS(cur, level, bp))
> > >  DEFINE_BTREE_CUR_EVENT(xfs_btree_updkeys);
> > > +DEFINE_BTREE_CUR_EVENT(xfs_btree_overlapped_query_range);
> > >  
> > >  #endif /* _TRACE_XFS_H */
> > >  
> > > 

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 016/119] xfs: move deferred operations into a separate file
  2016-06-27 19:14     ` Darrick J. Wong
@ 2016-06-28 12:32       ` Brian Foster
  2016-06-28 18:51         ` Darrick J. Wong
  0 siblings, 1 reply; 236+ messages in thread
From: Brian Foster @ 2016-06-28 12:32 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

On Mon, Jun 27, 2016 at 12:14:01PM -0700, Darrick J. Wong wrote:
> On Mon, Jun 27, 2016 at 09:14:54AM -0400, Brian Foster wrote:
> > On Thu, Jun 16, 2016 at 06:19:34PM -0700, Darrick J. Wong wrote:
> > > All the code around struct xfs_bmap_free basically implements a
> > > deferred operation framework through which we can roll transactions
> > > (to unlock buffers and avoid violating lock order rules) while
> > > managing all the necessary log redo items.  Previously we only used
> > > this code to free extents after some sort of mapping operation, but
> > > with the advent of rmap and reflink, we suddenly need to do more than
> > > that.
> > > 
> > > With that in mind, xfs_bmap_free really becomes a deferred ops control
> > > structure.  Rename the structure and move the deferred ops into their
> > > own file to avoid further bloating of the bmap code.
> > > 
> > > v2: actually sort the work items by AG to avoid deadlocks
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > 
> > So if I'm following this correctly, we 1.) abstract the bmap freeing
> > infrastructure into a generic mechanism and 2.) enhance it a bit to
> > provide things like partial intent completion, etc.
> 
> [Back from vacation]
> 
> Yup.  The partial intent completion code is for use by the refcount adjust
> function because in the worst case an adjustment of N blocks could require
> N record updates.
> 

Ok, technically those bits could be punted off to the reflink series.

> > If so and for future
> > reference, this would probably be easier to review if the abstraction
> > and enhancement were done separately. It's probably not worth that at
> > this point, however...
> 
> It wouldn't be difficult to separate them; the partial intent completion
> are the two code blocks below that handle the -EAGAIN case.
> 

That's kind of what I figured, since otherwise most of the rest of the
code maps to the xfs_bmap_*() stuff.

> (On the other hand it's so little code that I figured I might as well
> just do the whole file all at once.)
> 

It's more a matter of simplifying review when a change is explicitly
refactoring vs. having to read through and identify where the
enhancements actually are. It leaves a cleaner git history and tends to
simplify backporting as well, fwiw.

That said, I don't mind leaving this one as is at this point.
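[For readers without the patch in front of them, the overall shape of the
deferred-ops mechanism being discussed -- queue work on an intake, then
repeatedly roll the transaction and finish items -- can be modeled in
miniature.  The names and structure below are purely illustrative, nothing
like the real kernel types:]

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ITEMS	16

/* Toy deferred-ops control structure. */
struct toy_defer_ops {
	int	intake[MAX_ITEMS];	/* work queued but not finished */
	size_t	n_intake;
	int	done[MAX_ITEMS];	/* work finished */
	size_t	n_done;
	int	rolls;			/* transaction "rolls" performed */
};

/* Queue a work item on the intake. */
static void
toy_defer_add(
	struct toy_defer_ops	*dop,
	int			item)
{
	dop->intake[dop->n_intake++] = item;
}

/*
 * Until we run out of pending work: "roll the transaction" (in the real
 * code this commits the current transaction and opens a fresh one so
 * each step fits within a single reservation), then finish a work item.
 */
static void
toy_defer_finish(
	struct toy_defer_ops	*dop)
{
	while (dop->n_intake > 0) {
		dop->rolls++;
		dop->done[dop->n_done++] = dop->intake[--dop->n_intake];
	}
}
```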

> > >  fs/xfs/Makefile           |    2 
> > >  fs/xfs/libxfs/xfs_defer.c |  471 +++++++++++++++++++++++++++++++++++++++++++++
> > >  fs/xfs/libxfs/xfs_defer.h |   96 +++++++++
> > >  fs/xfs/xfs_defer_item.c   |   36 +++
> > >  fs/xfs/xfs_super.c        |    2 
> > >  5 files changed, 607 insertions(+)
> > >  create mode 100644 fs/xfs/libxfs/xfs_defer.c
> > >  create mode 100644 fs/xfs/libxfs/xfs_defer.h
> > >  create mode 100644 fs/xfs/xfs_defer_item.c
> > > 
> > > 
> > > diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
> > > index 3542d94..ad46a2d 100644
> > > --- a/fs/xfs/Makefile
> > > +++ b/fs/xfs/Makefile
> > > @@ -39,6 +39,7 @@ xfs-y				+= $(addprefix libxfs/, \
> > >  				   xfs_btree.o \
> > >  				   xfs_da_btree.o \
> > >  				   xfs_da_format.o \
> > > +				   xfs_defer.o \
> > >  				   xfs_dir2.o \
> > >  				   xfs_dir2_block.o \
> > >  				   xfs_dir2_data.o \
> > > @@ -66,6 +67,7 @@ xfs-y				+= xfs_aops.o \
> > >  				   xfs_attr_list.o \
> > >  				   xfs_bmap_util.o \
> > >  				   xfs_buf.o \
> > > +				   xfs_defer_item.o \
> > >  				   xfs_dir2_readdir.o \
> > >  				   xfs_discard.o \
> > >  				   xfs_error.o \
> > > diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
> > > new file mode 100644
> > > index 0000000..ad14e33e
> > > --- /dev/null
> > > +++ b/fs/xfs/libxfs/xfs_defer.c
...
> > > +int
> > > +xfs_defer_finish(
> > > +	struct xfs_trans		**tp,
> > > +	struct xfs_defer_ops		*dop,
> > > +	struct xfs_inode		*ip)
> > > +{
> > > +	struct xfs_defer_pending	*dfp;
> > > +	struct list_head		*li;
> > > +	struct list_head		*n;
> > > +	void				*done_item = NULL;
> > > +	void				*state;
> > > +	int				error = 0;
> > > +	void				(*cleanup_fn)(struct xfs_trans *, void *, int);
> > > +
> > > +	ASSERT((*tp)->t_flags & XFS_TRANS_PERM_LOG_RES);
> > > +
> > > +	/* Until we run out of pending work to finish... */
> > > +	while (xfs_defer_has_unfinished_work(dop)) {
> > > +		/* Log intents for work items sitting in the intake. */
> > > +		xfs_defer_intake_work(*tp, dop);
> > > +
> > > +		/* Roll the transaction. */
> > > +		error = xfs_defer_trans_roll(tp, dop, ip);
> > > +		if (error)
> > > +			goto out;
> > > +
> > > +		/* Mark all pending intents as committed. */
> > > +		list_for_each_entry_reverse(dfp, &dop->dop_pending, dfp_list) {
> > > +			if (dfp->dfp_committed)
> > > +				break;
> > > +			dfp->dfp_committed = true;
> > > +		}
> > > +
> > > +		/* Log an intent-done item for the first pending item. */
> > > +		dfp = list_first_entry(&dop->dop_pending,
> > > +				struct xfs_defer_pending, dfp_list);
> > > +		done_item = dfp->dfp_type->create_done(*tp, dfp->dfp_intent,
> > > +				dfp->dfp_count);
> > > +		cleanup_fn = dfp->dfp_type->finish_cleanup;
> > > +
> > > +		/* Finish the work items. */
> > > +		state = NULL;
> > > +		list_for_each_safe(li, n, &dfp->dfp_work) {
> > > +			list_del(li);
> > > +			dfp->dfp_count--;
> > > +			error = dfp->dfp_type->finish_item(*tp, dop, li,
> > > +					done_item, &state);
> > > +			if (error == -EAGAIN) {
> > > +				/*
> > > +				 * If the caller needs to try again, put the
> > > +				 * item back on the pending list and jump out
> > > +				 * for further processing.
> > 
> > A little confused by the terminology here. Perhaps better to say "back
> > on the work list" rather than "pending list?"
> 
> Yes.
> 
> > Also, what is the meaning/purpose of -EAGAIN here? This isn't used by
> > the extent free bits so I'm missing some context.
> 
> Generally, the ->finish_item() uses -EAGAIN to signal that it couldn't finish
> the work item and that it's necessary to log a new redo item and try again.
> 

Ah, Ok. So it is explicitly part of the dfops interface/infrastructure.
I think that is worth documenting above with a comment (i.e., "certain
callers might require many transactions, use -EAGAIN to indicate ...
blah blah").

> Practically, the only user of this mechanism is the refcountbt adjust function.
> It might be the case that we want to adjust N blocks, but some pathological
> user has creatively used reflink to create many refcount records.  In that
> case we could blow out the transaction reservation logging all the updates.
> 
> To avoid that, the refcount code tries to guess (conservatively) when it
> might be getting close and returns a short *adjusted.  See the call sites of
> xfs_refcount_still_have_space().  Next, xfs_trans_log_finish_refcount_update()
> will notice the short adjust returned and fixes up the CUD item to have a
> reduced cud_nextents and to reflect where the operation stopped.  Then,
> xfs_refcount_update_finish_item() notices the short return, updates the work
> item list, and returns -EAGAIN.  Finally, xfs_defer_finish() sees the -EAGAIN
> and requeues the work item so that we resume refcount adjusting after the
> transaction rolls.
> 

Hmm, this makes me think that maybe it is better to split this up into
two patches for now after all. I'm expecting this is going to be merged
along with the rmap bits before the refcount stuff and I'm not a huge
fan of putting in infrastructure code without users, moreso without
fully understanding how/why said code is going to be used (and I'm not
really looking to jump ahead into the refcount stuff yet).

> > For example, is there
> > an issue with carrying a done_item with an unexpected list count?
> 
> AFAICT, nothing in log recovery ever checks that the list counts of the
> intent and done items actually match, let alone the extents logged with
> them.  It only seems to care if there's an efd such that efd->efd_efi_id ==
> efi->efi_id, in which case it won't replay the efi.
> 

Yeah, I didn't notice any issues with respect to EFI/EFD handling,
though I didn't look too hard because it doesn't use this -EAGAIN
mechanism. If it did, I think you might hit the odd ASSERT() check here
or there (see xfs_efd_item_format()), but that's probably not
catastrophic. I think it also affects the size of the transaction
written to the log, fwiw.

I ask more because it's unexpected to have a structure with a list count
that doesn't match the actual number of items and I don't see it called
out anywhere. This might be another good reason to punt this part off to
the reflink series...

> I don't know if that was a deliberate part of the log design, but the
> lack of checking helps us here.
> 
> > Is it
> > expected that xfs_defer_finish() will not return until -EAGAIN is
> > "cleared" (does relogging below and rolling somehow address this)?
> 
> Yes, relogging and rolling gives us a fresh transaction with which to
> continue updating.
> 
> > > +				 */
> > > +				list_add(li, &dfp->dfp_work);
> > > +				dfp->dfp_count++;
> > > +				break;
> > > +			} else if (error) {
> > > +				/*
> > > +				 * Clean up after ourselves and jump out.
> > > +				 * xfs_defer_cancel will take care of freeing
> > > +				 * all these lists and stuff.
> > > +				 */
> > > +				if (cleanup_fn)
> > > +					cleanup_fn(*tp, state, error);
> > > +				xfs_defer_trans_abort(*tp, dop, error);
> > > +				goto out;
> > > +			}
> > > +		}
> > > +		if (error == -EAGAIN) {
> > > +			/*
> > > +			 * Log a new intent, relog all the remaining work
> > > +			 * items to the new intent, attach the new intent to
> > > +			 * the dfp, and leave the dfp at the head of the list
> > > +			 * for further processing.
> > > +			 */
> > 
> > Similar to the above, could you elaborate on the mechanics of this with
> > respect to the log?  E.g., the comment kind of just repeats what the
> > code does as opposed to explain why it's here. Is the point here to log
> > a new intent in the same transaction as the done item to ensure that we
> > (atomically) indicate that certain operations need to be replayed if
> > this transaction hits the log and then we crash?
> 
> Yes.
> 
> "This effectively replaces the old intent item with a new one listing only
> the work items that were not completed when ->finish_item() returned -EAGAIN.
> After the subsequent transaction roll, we'll resume where we left off with a
> fresh transaction."
> 

I'd point out the relevance of doing so in the same transaction,
otherwise sounds good.

Brian

> Thank you for the review!
> 
> --D
> 
> > Brian
> > 
> > > +			dfp->dfp_intent = dfp->dfp_type->create_intent(*tp,
> > > +					dfp->dfp_count);
> > > +			list_for_each(li, &dfp->dfp_work)
> > > +				dfp->dfp_type->log_item(*tp, dfp->dfp_intent,
> > > +						li);
> > > +		} else {
> > > +			/* Done with the dfp, free it. */
> > > +			list_del(&dfp->dfp_list);
> > > +			kmem_free(dfp);
> > > +		}
> > > +
> > > +		if (cleanup_fn)
> > > +			cleanup_fn(*tp, state, error);
> > > +	}
> > > +
> > > +out:
> > > +	return error;
> > > +}
> > > +
> > > +/*
> > > + * Free up any items left in the list.
> > > + */
> > > +void
> > > +xfs_defer_cancel(
> > > +	struct xfs_defer_ops		*dop)
> > > +{
> > > +	struct xfs_defer_pending	*dfp;
> > > +	struct xfs_defer_pending	*pli;
> > > +	struct list_head		*pwi;
> > > +	struct list_head		*n;
> > > +
> > > +	/*
> > > +	 * Free the pending items.  Caller should already have arranged
> > > +	 * for the intent items to be released.
> > > +	 */
> > > +	list_for_each_entry_safe(dfp, pli, &dop->dop_intake, dfp_list) {
> > > +		list_del(&dfp->dfp_list);
> > > +		list_for_each_safe(pwi, n, &dfp->dfp_work) {
> > > +			list_del(pwi);
> > > +			dfp->dfp_count--;
> > > +			dfp->dfp_type->cancel_item(pwi);
> > > +		}
> > > +		ASSERT(dfp->dfp_count == 0);
> > > +		kmem_free(dfp);
> > > +	}
> > > +	list_for_each_entry_safe(dfp, pli, &dop->dop_pending, dfp_list) {
> > > +		list_del(&dfp->dfp_list);
> > > +		list_for_each_safe(pwi, n, &dfp->dfp_work) {
> > > +			list_del(pwi);
> > > +			dfp->dfp_count--;
> > > +			dfp->dfp_type->cancel_item(pwi);
> > > +		}
> > > +		ASSERT(dfp->dfp_count == 0);
> > > +		kmem_free(dfp);
> > > +	}
> > > +}
> > > +
> > > +/* Add an item for later deferred processing. */
> > > +void
> > > +xfs_defer_add(
> > > +	struct xfs_defer_ops		*dop,
> > > +	enum xfs_defer_ops_type		type,
> > > +	struct list_head		*li)
> > > +{
> > > +	struct xfs_defer_pending	*dfp = NULL;
> > > +
> > > +	/*
> > > +	 * Add the item to a pending item at the end of the intake list.
> > > +	 * If the last pending item has the same type, reuse it.  Else,
> > > +	 * create a new pending item at the end of the intake list.
> > > +	 */
> > > +	if (!list_empty(&dop->dop_intake)) {
> > > +		dfp = list_last_entry(&dop->dop_intake,
> > > +				struct xfs_defer_pending, dfp_list);
> > > +		if (dfp->dfp_type->type != type ||
> > > +		    (dfp->dfp_type->max_items &&
> > > +		     dfp->dfp_count >= dfp->dfp_type->max_items))
> > > +			dfp = NULL;
> > > +	}
> > > +	if (!dfp) {
> > > +		dfp = kmem_alloc(sizeof(struct xfs_defer_pending),
> > > +				KM_SLEEP | KM_NOFS);
> > > +		dfp->dfp_type = defer_op_types[type];
> > > +		dfp->dfp_committed = false;
> > > +		dfp->dfp_intent = NULL;
> > > +		dfp->dfp_count = 0;
> > > +		INIT_LIST_HEAD(&dfp->dfp_work);
> > > +		list_add_tail(&dfp->dfp_list, &dop->dop_intake);
> > > +	}
> > > +
> > > +	list_add_tail(li, &dfp->dfp_work);
> > > +	dfp->dfp_count++;
> > > +}
> > > +
> > > +/* Initialize a deferred operation list. */
> > > +void
> > > +xfs_defer_init_op_type(
> > > +	const struct xfs_defer_op_type	*type)
> > > +{
> > > +	defer_op_types[type->type] = type;
> > > +}
> > > +
> > > +/* Initialize a deferred operation. */
> > > +void
> > > +xfs_defer_init(
> > > +	struct xfs_defer_ops		*dop,
> > > +	xfs_fsblock_t			*fbp)
> > > +{
> > > +	dop->dop_committed = false;
> > > +	dop->dop_low = false;
> > > +	memset(&dop->dop_inodes, 0, sizeof(dop->dop_inodes));
> > > +	*fbp = NULLFSBLOCK;
> > > +	INIT_LIST_HEAD(&dop->dop_intake);
> > > +	INIT_LIST_HEAD(&dop->dop_pending);
> > > +}
> > > diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
> > > new file mode 100644
> > > index 0000000..85c7a3a
> > > --- /dev/null
> > > +++ b/fs/xfs/libxfs/xfs_defer.h
> > > @@ -0,0 +1,96 @@
> > > +/*
> > > + * Copyright (C) 2016 Oracle.  All Rights Reserved.
> > > + *
> > > + * Author: Darrick J. Wong <darrick.wong@oracle.com>
> > > + *
> > > + * This program is free software; you can redistribute it and/or
> > > + * modify it under the terms of the GNU General Public License
> > > + * as published by the Free Software Foundation; either version 2
> > > + * of the License, or (at your option) any later version.
> > > + *
> > > + * This program is distributed in the hope that it would be useful,
> > > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > > + * GNU General Public License for more details.
> > > + *
> > > + * You should have received a copy of the GNU General Public License
> > > + * along with this program; if not, write the Free Software Foundation,
> > > + * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
> > > + */
> > > +#ifndef __XFS_DEFER_H__
> > > +#define	__XFS_DEFER_H__
> > > +
> > > +struct xfs_defer_op_type;
> > > +
> > > +/*
> > > + * Save a log intent item and a list of extents, so that we can replay
> > > + * whatever action had to happen to the extent list and file the log done
> > > + * item.
> > > + */
> > > +struct xfs_defer_pending {
> > > +	const struct xfs_defer_op_type	*dfp_type;	/* function pointers */
> > > +	struct list_head		dfp_list;	/* pending items */
> > > +	bool				dfp_committed;	/* committed trans? */
> > > +	void				*dfp_intent;	/* log intent item */
> > > +	struct list_head		dfp_work;	/* work items */
> > > +	unsigned int			dfp_count;	/* # extent items */
> > > +};
> > > +
> > > +/*
> > > + * Header for deferred operation list.
> > > + *
> > > + * dop_low is used by the allocator to activate the lowspace algorithm -
> > > + * when free space is running low the extent allocator may choose to
> > > + * allocate an extent from an AG without leaving sufficient space for
> > > + * a btree split when inserting the new extent.  In this case the allocator
> > > + * will enable the lowspace algorithm which is supposed to allow further
> > > + * allocations (such as btree splits and newroots) to allocate from
> > > + * sequential AGs.  In order to avoid locking AGs out of order the lowspace
> > > + * algorithm will start searching for free space from AG 0.  If the correct
> > > + * transaction reservations have been made then this algorithm will eventually
> > > + * find all the space it needs.
> > > + */
> > > +enum xfs_defer_ops_type {
> > > +	XFS_DEFER_OPS_TYPE_MAX,
> > > +};
> > > +
> > > +#define XFS_DEFER_OPS_NR_INODES	2	/* join up to two inodes */
> > > +
> > > +struct xfs_defer_ops {
> > > +	bool			dop_committed;	/* did any trans commit? */
> > > +	bool			dop_low;	/* alloc in low mode */
> > > +	struct list_head	dop_intake;	/* unlogged pending work */
> > > +	struct list_head	dop_pending;	/* logged pending work */
> > > +
> > > +	/* relog these inodes with each roll */
> > > +	struct xfs_inode	*dop_inodes[XFS_DEFER_OPS_NR_INODES];
> > > +};
> > > +
> > > +void xfs_defer_add(struct xfs_defer_ops *dop, enum xfs_defer_ops_type type,
> > > +		struct list_head *h);
> > > +int xfs_defer_finish(struct xfs_trans **tp, struct xfs_defer_ops *dop,
> > > +		struct xfs_inode *ip);
> > > +void xfs_defer_cancel(struct xfs_defer_ops *dop);
> > > +void xfs_defer_init(struct xfs_defer_ops *dop, xfs_fsblock_t *fbp);
> > > +bool xfs_defer_has_unfinished_work(struct xfs_defer_ops *dop);
> > > +int xfs_defer_join(struct xfs_defer_ops *dop, struct xfs_inode *ip);
> > > +
> > > +/* Description of a deferred type. */
> > > +struct xfs_defer_op_type {
> > > +	enum xfs_defer_ops_type	type;
> > > +	unsigned int		max_items;
> > > +	void (*abort_intent)(void *);
> > > +	void *(*create_done)(struct xfs_trans *, void *, unsigned int);
> > > +	int (*finish_item)(struct xfs_trans *, struct xfs_defer_ops *,
> > > +			struct list_head *, void *, void **);
> > > +	void (*finish_cleanup)(struct xfs_trans *, void *, int);
> > > +	void (*cancel_item)(struct list_head *);
> > > +	int (*diff_items)(void *, struct list_head *, struct list_head *);
> > > +	void *(*create_intent)(struct xfs_trans *, uint);
> > > +	void (*log_item)(struct xfs_trans *, void *, struct list_head *);
> > > +};
> > > +
> > > +void xfs_defer_init_op_type(const struct xfs_defer_op_type *type);
> > > +void xfs_defer_init_types(void);
> > > +
> > > +#endif /* __XFS_DEFER_H__ */
> > > diff --git a/fs/xfs/xfs_defer_item.c b/fs/xfs/xfs_defer_item.c
> > > new file mode 100644
> > > index 0000000..849088d
> > > --- /dev/null
> > > +++ b/fs/xfs/xfs_defer_item.c
> > > @@ -0,0 +1,36 @@
> > > +/*
> > > + * Copyright (C) 2016 Oracle.  All Rights Reserved.
> > > + *
> > > + * Author: Darrick J. Wong <darrick.wong@oracle.com>
> > > + *
> > > + * This program is free software; you can redistribute it and/or
> > > + * modify it under the terms of the GNU General Public License
> > > + * as published by the Free Software Foundation; either version 2
> > > + * of the License, or (at your option) any later version.
> > > + *
> > > + * This program is distributed in the hope that it would be useful,
> > > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > > + * GNU General Public License for more details.
> > > + *
> > > + * You should have received a copy of the GNU General Public License
> > > + * along with this program; if not, write the Free Software Foundation,
> > > + * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
> > > + */
> > > +#include "xfs.h"
> > > +#include "xfs_fs.h"
> > > +#include "xfs_shared.h"
> > > +#include "xfs_format.h"
> > > +#include "xfs_log_format.h"
> > > +#include "xfs_trans_resv.h"
> > > +#include "xfs_bit.h"
> > > +#include "xfs_sb.h"
> > > +#include "xfs_mount.h"
> > > +#include "xfs_defer.h"
> > > +#include "xfs_trans.h"
> > > +
> > > +/* Initialize the deferred operation types. */
> > > +void
> > > +xfs_defer_init_types(void)
> > > +{
> > > +}
> > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > > index 09722a7..bf63f6d 100644
> > > --- a/fs/xfs/xfs_super.c
> > > +++ b/fs/xfs/xfs_super.c
> > > @@ -46,6 +46,7 @@
> > >  #include "xfs_quota.h"
> > >  #include "xfs_sysfs.h"
> > >  #include "xfs_ondisk.h"
> > > +#include "xfs_defer.h"
> > >  
> > >  #include <linux/namei.h>
> > >  #include <linux/init.h>
> > > @@ -1850,6 +1851,7 @@ init_xfs_fs(void)
> > >  	printk(KERN_INFO XFS_VERSION_STRING " with "
> > >  			 XFS_BUILD_OPTIONS " enabled\n");
> > >  
> > > +	xfs_defer_init_types();
> > >  	xfs_dir_startup();
> > >  
> > >  	error = xfs_init_zones();
> > > 
> > > _______________________________________________
> > > xfs mailing list
> > > xfs@oss.sgi.com
> > > http://oss.sgi.com/mailman/listinfo/xfs
> 

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 018/119] xfs: enable the xfs_defer mechanism to process extents to free
  2016-06-27 22:00       ` Darrick J. Wong
@ 2016-06-28 12:32         ` Brian Foster
  2016-06-28 16:33           ` Darrick J. Wong
  0 siblings, 1 reply; 236+ messages in thread
From: Brian Foster @ 2016-06-28 12:32 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

On Mon, Jun 27, 2016 at 03:00:03PM -0700, Darrick J. Wong wrote:
> On Mon, Jun 27, 2016 at 02:41:45PM -0700, Darrick J. Wong wrote:
> > On Mon, Jun 27, 2016 at 09:15:08AM -0400, Brian Foster wrote:
> > > On Thu, Jun 16, 2016 at 06:19:47PM -0700, Darrick J. Wong wrote:
> > > > Connect the xfs_defer mechanism with the pieces that we'll need to
> > > > handle deferred extent freeing.  We'll wire up the existing code to
> > > > our new deferred mechanism later.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > ---
> > > 
> > > Could we merge this with the xfs_trans_*efi/efd* bits? We'd need to
> > > preserve some calls for recovery, but it looks like other parts are only
> > > used by the deferred ops infrastructure at this point.
> > 
> > Yes, we could replace xfs_bmap_free_create_{intent,done} with
> > xfs_trans_get_ef[id] and lose the silly functions.  I'll go take
> > care of all of them.
> 
> Hah, gcc complains about the mismatch in pointer types for the second
> argument.
> 
> fs/xfs/xfs_defer_item.c:504:17: error: initialization from incompatible pointer type [-Werror=incompatible-pointer-types]
>   .create_done = xfs_trans_get_bud,
>                  ^
> fs/xfs/xfs_defer_item.c:504:17: note: (near initialization for ‘xfs_bmap_update_defer_type.create_done’)
> 
> I guess one could put in an ugly cast to coerce the types at a cost of
> uglifying the code.  <shrug> Opinions?
> 

Not sure what you mean here... I should be more clear. What I was
thinking is to nuke xfs_defer_item.c, define the 'const struct
xfs_defer_op_type xfs_extent_free_defer_type' right in
xfs_trans_extfree.c (perhaps rename the file) and wire up the functions
appropriately. Just change the function signatures if you need to. For
something like xfs_trans_get_efd(), maybe refactor the guts into a
static helper that both the defer callback and xfs_trans_get_efd() can
use, since we need the latter for log recovery. Will something like that
work?

To put it simply, it looks like we have at least a few places where a
defer item type has a callback that calls yet another interface, yet is
the only caller of the latter (e.g.,
xfs_bmap_free_create_intent()->xfs_trans_get_efi()). 

Brian

> --D
> 
> > 
> > --D
> > 
> > > 
> > > Brian
> > > 
> > > >  fs/xfs/libxfs/xfs_defer.h |    1 
> > > >  fs/xfs/xfs_defer_item.c   |  108 +++++++++++++++++++++++++++++++++++++++++++++
> > > >  2 files changed, 109 insertions(+)
> > > > 
> > > > 
> > > > diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
> > > > index 85c7a3a..743fc32 100644
> > > > --- a/fs/xfs/libxfs/xfs_defer.h
> > > > +++ b/fs/xfs/libxfs/xfs_defer.h
> > > > @@ -51,6 +51,7 @@ struct xfs_defer_pending {
> > > >   * find all the space it needs.
> > > >   */
> > > >  enum xfs_defer_ops_type {
> > > > +	XFS_DEFER_OPS_TYPE_FREE,
> > > >  	XFS_DEFER_OPS_TYPE_MAX,
> > > >  };
> > > >  
> > > > diff --git a/fs/xfs/xfs_defer_item.c b/fs/xfs/xfs_defer_item.c
> > > > index 4c2ba28..127a54e 100644
> > > > --- a/fs/xfs/xfs_defer_item.c
> > > > +++ b/fs/xfs/xfs_defer_item.c
> > > > @@ -29,9 +29,117 @@
> > > >  #include "xfs_defer.h"
> > > >  #include "xfs_trans.h"
> > > >  #include "xfs_trace.h"
> > > > +#include "xfs_bmap.h"
> > > > +#include "xfs_extfree_item.h"
> > > > +
> > > > +/* Extent Freeing */
> > > > +
> > > > +/* Sort bmap items by AG. */
> > > > +static int
> > > > +xfs_bmap_free_diff_items(
> > > > +	void				*priv,
> > > > +	struct list_head		*a,
> > > > +	struct list_head		*b)
> > > > +{
> > > > +	struct xfs_mount		*mp = priv;
> > > > +	struct xfs_bmap_free_item	*ra;
> > > > +	struct xfs_bmap_free_item	*rb;
> > > > +
> > > > +	ra = container_of(a, struct xfs_bmap_free_item, xbfi_list);
> > > > +	rb = container_of(b, struct xfs_bmap_free_item, xbfi_list);
> > > > +	return  XFS_FSB_TO_AGNO(mp, ra->xbfi_startblock) -
> > > > +		XFS_FSB_TO_AGNO(mp, rb->xbfi_startblock);
> > > > +}
> > > > +
> > > > +/* Get an EFI. */
> > > > +STATIC void *
> > > > +xfs_bmap_free_create_intent(
> > > > +	struct xfs_trans		*tp,
> > > > +	unsigned int			count)
> > > > +{
> > > > +	return xfs_trans_get_efi(tp, count);
> > > > +}
> > > > +
> > > > +/* Log a free extent to the intent item. */
> > > > +STATIC void
> > > > +xfs_bmap_free_log_item(
> > > > +	struct xfs_trans		*tp,
> > > > +	void				*intent,
> > > > +	struct list_head		*item)
> > > > +{
> > > > +	struct xfs_bmap_free_item	*free;
> > > > +
> > > > +	free = container_of(item, struct xfs_bmap_free_item, xbfi_list);
> > > > +	xfs_trans_log_efi_extent(tp, intent, free->xbfi_startblock,
> > > > +			free->xbfi_blockcount);
> > > > +}
> > > > +
> > > > +/* Get an EFD so we can process all the free extents. */
> > > > +STATIC void *
> > > > +xfs_bmap_free_create_done(
> > > > +	struct xfs_trans		*tp,
> > > > +	void				*intent,
> > > > +	unsigned int			count)
> > > > +{
> > > > +	return xfs_trans_get_efd(tp, intent, count);
> > > > +}
> > > > +
> > > > +/* Process a free extent. */
> > > > +STATIC int
> > > > +xfs_bmap_free_finish_item(
> > > > +	struct xfs_trans		*tp,
> > > > +	struct xfs_defer_ops		*dop,
> > > > +	struct list_head		*item,
> > > > +	void				*done_item,
> > > > +	void				**state)
> > > > +{
> > > > +	struct xfs_bmap_free_item	*free;
> > > > +	int				error;
> > > > +
> > > > +	free = container_of(item, struct xfs_bmap_free_item, xbfi_list);
> > > > +	error = xfs_trans_free_extent(tp, done_item,
> > > > +			free->xbfi_startblock,
> > > > +			free->xbfi_blockcount);
> > > > +	kmem_free(free);
> > > > +	return error;
> > > > +}
> > > > +
> > > > +/* Abort all pending EFIs. */
> > > > +STATIC void
> > > > +xfs_bmap_free_abort_intent(
> > > > +	void				*intent)
> > > > +{
> > > > +	xfs_efi_release(intent);
> > > > +}
> > > > +
> > > > +/* Cancel a free extent. */
> > > > +STATIC void
> > > > +xfs_bmap_free_cancel_item(
> > > > +	struct list_head		*item)
> > > > +{
> > > > +	struct xfs_bmap_free_item	*free;
> > > > +
> > > > +	free = container_of(item, struct xfs_bmap_free_item, xbfi_list);
> > > > +	kmem_free(free);
> > > > +}
> > > > +
> > > > +const struct xfs_defer_op_type xfs_extent_free_defer_type = {
> > > > +	.type		= XFS_DEFER_OPS_TYPE_FREE,
> > > > +	.max_items	= XFS_EFI_MAX_FAST_EXTENTS,
> > > > +	.diff_items	= xfs_bmap_free_diff_items,
> > > > +	.create_intent	= xfs_bmap_free_create_intent,
> > > > +	.abort_intent	= xfs_bmap_free_abort_intent,
> > > > +	.log_item	= xfs_bmap_free_log_item,
> > > > +	.create_done	= xfs_bmap_free_create_done,
> > > > +	.finish_item	= xfs_bmap_free_finish_item,
> > > > +	.cancel_item	= xfs_bmap_free_cancel_item,
> > > > +};
> > > > +
> > > > +/* Deferred Item Initialization */
> > > >  
> > > >  /* Initialize the deferred operation types. */
> > > >  void
> > > >  xfs_defer_init_types(void)
> > > >  {
> > > > +	xfs_defer_init_op_type(&xfs_extent_free_defer_type);
> > > >  }
> > > > 
> > > > _______________________________________________
> > > > xfs mailing list
> > > > xfs@oss.sgi.com
> > > > http://oss.sgi.com/mailman/listinfo/xfs
> > 
> 

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 014/119] xfs: introduce interval queries on btrees
  2016-06-28 12:32       ` Brian Foster
@ 2016-06-28 16:29         ` Darrick J. Wong
  0 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-28 16:29 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, vishal.l.verma, xfs

On Tue, Jun 28, 2016 at 08:32:19AM -0400, Brian Foster wrote:
> On Mon, Jun 27, 2016 at 02:07:46PM -0700, Darrick J. Wong wrote:
> > On Wed, Jun 22, 2016 at 11:18:00AM -0400, Brian Foster wrote:
> > > On Thu, Jun 16, 2016 at 06:19:21PM -0700, Darrick J. Wong wrote:
> > > > Create a function to enable querying of btree records mapping to a
> > > > range of keys.  This will be used in subsequent patches to allow
> > > > querying the reverse mapping btree to find the extents mapped to a
> > > > range of physical blocks, though the generic code can be used for
> > > > any range query.
> > > > 
> > > > v2: add some shortcuts so that we can jump out of processing once
> > > > we know there won't be any more records to find.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > ---
> > > >  fs/xfs/libxfs/xfs_btree.c |  249 +++++++++++++++++++++++++++++++++++++++++++++
> > > >  fs/xfs/libxfs/xfs_btree.h |   22 +++-
> > > >  fs/xfs/xfs_trace.h        |    1 
> > > >  3 files changed, 267 insertions(+), 5 deletions(-)
> > > > 
> > > > 
> > > > diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> > > > index afcafd6..5f5cf23 100644
> > > > --- a/fs/xfs/libxfs/xfs_btree.c
> > > > +++ b/fs/xfs/libxfs/xfs_btree.c
> > > > @@ -4509,3 +4509,252 @@ xfs_btree_calc_size(
> > > >  	}
> > > >  	return rval;
> > > >  }
> > > > +
> > > > +/* Query a regular btree for all records overlapping a given interval. */
> > > 
> > > Can you elaborate on the search algorithm used? (More for reference
> > > against the overlapped query, as that one is more complex).
> > 
> > Ok.  Both query_range functions aim to return all records intersecting the
> > given range.
> > 
> > For non-overlapped btrees, we start with an LE lookup of the low key and
> > return each record we find until we reach a record with a key greater
> > than the high key.
> > 
> > For overlapped btrees, we follow the procedure in the "Interval trees"
> > section of _Introduction to Algorithms_, which is 14.3 in the 2nd and
> > 3rd editions.  The query algorithm is roughly as follows:
> > 
> > For any leaf btree node, generate the low and high keys for the record.
> > If there's a range overlap with the query's low and high keys, pass the
> > record to the iterator function.
> > 
> > For any internal btree node, compare the low and high keys for each pointer
> > against the query's low and high keys.  If there's an overlap, follow the
> > pointer downwards in the tree.
> > 
> > (I could render the figures in the book as ASCII art if anyone wants.)
> > 
> 
> Thanks. I meant more to update the comments above each function. :) No
> need to go as far as ASCII art I don't think (the external reference
> might be good though). I was really just looking for something that says
> "this function is supposed to do <whatever>" so somebody reading through
> it has a starting point of reference.

Ok, I pasted a (somewhat reworded) version of the above in the comments.

> > > 
> > > > +STATIC int
> > > > +xfs_btree_simple_query_range(
> > > > +	struct xfs_btree_cur		*cur,
> > > > +	union xfs_btree_irec		*low_rec,
> > > > +	union xfs_btree_irec		*high_rec,
> > > > +	xfs_btree_query_range_fn	fn,
> > > > +	void				*priv)
> > > > +{
> > > > +	union xfs_btree_rec		*recp;
> > > > +	union xfs_btree_rec		rec;
> > > > +	union xfs_btree_key		low_key;
> > > > +	union xfs_btree_key		high_key;
> > > > +	union xfs_btree_key		rec_key;
> > > > +	__int64_t			diff;
> > > > +	int				stat;
> > > > +	bool				firstrec = true;
> > > > +	int				error;
> > > > +
> > > > +	ASSERT(cur->bc_ops->init_high_key_from_rec);
> > > > +
> > > > +	/* Find the keys of both ends of the interval. */
> > > > +	cur->bc_rec = *high_rec;
> > > > +	cur->bc_ops->init_rec_from_cur(cur, &rec);
> > > > +	cur->bc_ops->init_key_from_rec(&high_key, &rec);
> > > > +
> > > > +	cur->bc_rec = *low_rec;
> > > > +	cur->bc_ops->init_rec_from_cur(cur, &rec);
> > > > +	cur->bc_ops->init_key_from_rec(&low_key, &rec);
> > > > +
> > > > +	/* Find the leftmost record. */
> > > > +	stat = 0;
> > > > +	error = xfs_btree_lookup(cur, XFS_LOOKUP_LE, &stat);
> > > > +	if (error)
> > > > +		goto out;
> > > > +
> > > > +	while (stat) {
> > > > +		/* Find the record. */
> > > > +		error = xfs_btree_get_rec(cur, &recp, &stat);
> > > > +		if (error || !stat)
> > > > +			break;
> > > > +
> > > > +		/* Can we tell if this record is too low? */
> > > > +		if (firstrec) {
> > > > +			cur->bc_rec = *low_rec;
> > > > +			cur->bc_ops->init_high_key_from_rec(&rec_key, recp);
> > > > +			diff = cur->bc_ops->key_diff(cur, &rec_key);
> > > > +			if (diff < 0)
> > > > +				goto advloop;
> > > > +		}
> > > > +		firstrec = false;
> > > 
> > > This could move up into the if block.
> > 
> > Ok.
> > 
> > > > +
> > > > +		/* Have we gone past the end? */
> > > > +		cur->bc_rec = *high_rec;
> > > > +		cur->bc_ops->init_key_from_rec(&rec_key, recp);
> > > 
> > > I'd move this up to immediately after the xfs_btree_get_rec() call and
> > > eliminate the duplicate in the 'if (firstrec)' block above.
> > 
> > Ok.  That key ought to be named rec_hkey too.
> > 
> > > > +		diff = cur->bc_ops->key_diff(cur, &rec_key);
> > > > +		if (diff > 0)
> > > > +			break;
> > > > +
> > > > +		/* Callback */
> > > > +		error = fn(cur, recp, priv);
> > > > +		if (error < 0 || error == XFS_BTREE_QUERY_RANGE_ABORT)
> > > > +			break;
> > > > +
> > > > +advloop:
> > > > +		/* Move on to the next record. */
> > > > +		error = xfs_btree_increment(cur, 0, &stat);
> > > > +		if (error)
> > > > +			break;
> > > > +	}
> > > > +
> > > > +out:
> > > > +	return error;
> > > > +}
> > > > +
> > > > +/*
> > > > + * Query an overlapped interval btree for all records overlapping a given
> > > > + * interval.
> > > > + */
> > > 
> > > Same comment here, can you elaborate on the search algorithm? Also, I
> > > think an example or generic description of the rules around what records
> > > this query returns (e.g., low_rec/high_rec vs. record low/high keys)
> > > would be useful, particularly since I, at least, don't have much context
> > > on the rmap+reflink scenarios quite yet.
> > 
> > Let's say you have a bunch of (overlapped) rmap records:
> > 
> > 1: +- file A startblock B offset C length D -----------+
> > 2:      +- file E startblock F offset G length H --------------+
> > 3:      +- file I startblock F offset J length K --+
> > 4:                                                        +- file L... --+
> > 
> > Now say we want to map block (B+D) into file A at offset (C+D).  Ideally, we'd
> > simply increment the length of record 1.  But how do we find that record that
> > ends at (B+D-1)?  A LE lookup of (B+D-1) would return record 3 because the
> > keys are ordered first by startblock.  An interval query would return records
> > 1 and 2 because they both overlap (B+D-1), and from that we can pick out
> > record 1 as the appropriate left neighbor.
> > 
> 
> Great, thanks.. can you include this content in the comment above the
> function as well?

Added this to the comments as well, since it documents the only justification
for any of this overlapped interval btree stuff. :)

--D

> 
> Brian
> 
> > In the non-overlapped case you can do a LE lookup and decrement the cursor
> > because a record's interval must end before the next record.
> > 
> > > > +STATIC int
> > > > +xfs_btree_overlapped_query_range(
> > > > +	struct xfs_btree_cur		*cur,
> > > > +	union xfs_btree_irec		*low_rec,
> > > > +	union xfs_btree_irec		*high_rec,
> > > > +	xfs_btree_query_range_fn	fn,
> > > > +	void				*priv)
> > > > +{
> > > > +	union xfs_btree_ptr		ptr;
> > > > +	union xfs_btree_ptr		*pp;
> > > > +	union xfs_btree_key		rec_key;
> > > > +	union xfs_btree_key		low_key;
> > > > +	union xfs_btree_key		high_key;
> > > > +	union xfs_btree_key		*lkp;
> > > > +	union xfs_btree_key		*hkp;
> > > > +	union xfs_btree_rec		rec;
> > > > +	union xfs_btree_rec		*recp;
> > > > +	struct xfs_btree_block		*block;
> > > > +	__int64_t			ldiff;
> > > > +	__int64_t			hdiff;
> > > > +	int				level;
> > > > +	struct xfs_buf			*bp;
> > > > +	int				i;
> > > > +	int				error;
> > > > +
> > > > +	/* Find the keys of both ends of the interval. */
> > > > +	cur->bc_rec = *high_rec;
> > > > +	cur->bc_ops->init_rec_from_cur(cur, &rec);
> > > > +	cur->bc_ops->init_key_from_rec(&high_key, &rec);
> > > > +
> > > > +	cur->bc_rec = *low_rec;
> > > > +	cur->bc_ops->init_rec_from_cur(cur, &rec);
> > > > +	cur->bc_ops->init_key_from_rec(&low_key, &rec);
> > > > +
> > > > +	/* Load the root of the btree. */
> > > > +	level = cur->bc_nlevels - 1;
> > > > +	cur->bc_ops->init_ptr_from_cur(cur, &ptr);
> > > > +	error = xfs_btree_lookup_get_block(cur, level, &ptr, &block);
> > > > +	if (error)
> > > > +		return error;
> > > > +	xfs_btree_get_block(cur, level, &bp);
> > > > +	trace_xfs_btree_overlapped_query_range(cur, level, bp);
> > > > +#ifdef DEBUG
> > > > +	error = xfs_btree_check_block(cur, block, level, bp);
> > > > +	if (error)
> > > > +		goto out;
> > > > +#endif
> > > > +	cur->bc_ptrs[level] = 1;
> > > > +
> > > > +	while (level < cur->bc_nlevels) {
> > > > +		block = XFS_BUF_TO_BLOCK(cur->bc_bufs[level]);
> > > > +
> > > > +		if (level == 0) {
> > > > +			/* End of leaf, pop back towards the root. */
> > > > +			if (cur->bc_ptrs[level] >
> > > > +			    be16_to_cpu(block->bb_numrecs)) {
> > > > +leaf_pop_up:
> > > > +				if (level < cur->bc_nlevels - 1)
> > > > +					cur->bc_ptrs[level + 1]++;
> > > > +				level++;
> > > > +				continue;
> > > > +			}
> > > > +
> > > > +			recp = xfs_btree_rec_addr(cur, cur->bc_ptrs[0], block);
> > > > +
> > > > +			cur->bc_ops->init_high_key_from_rec(&rec_key, recp);
> > > > +			ldiff = cur->bc_ops->diff_two_keys(cur, &low_key,
> > > > +					&rec_key);
> > > > +
> > > > +			cur->bc_ops->init_key_from_rec(&rec_key, recp);
> > > > +			hdiff = cur->bc_ops->diff_two_keys(cur, &rec_key,
> > > > +					&high_key);
> > > > +
> > > 
> > > This looked a little funny to me because I expected diff_two_keys() to
> > > basically be param1 - param2. Looking ahead at the rmapbt code, it is in
> > > fact the other way around. I'm not sure we have precedent for either
> > > way, tbh. I still have to stare at this some more, but I wonder if a
> > > "does record overlap" helper (with comments) would help clean this up a
> > > bit.
> > 
> > You're correct this is exactly the opposite of the compare functions in
> > the C library and the rest of the kernel.  I'll fix that up.
> > 
> > > > +			/* If the record matches, callback */
> > > > +			if (ldiff >= 0 && hdiff >= 0) {
> > 
> > Ok, I'll make it a little clearer what we're testing here:
> > 
> > /*
> >  * If (record's high key >= query's low key) and
> >  *    (query's high key >= record's low key), then
> >  * this record overlaps the query range, so callback.
> >  */
> > 
> > 
> > > > +				error = fn(cur, recp, priv);
> > > > +				if (error < 0 ||
> > > > +				    error == XFS_BTREE_QUERY_RANGE_ABORT)
> > > > +					break;
> > > > +			} else if (hdiff < 0) {
> > > > +				/* Record is larger than high key; pop. */
> > > > +				goto leaf_pop_up;
> > > > +			}
> > > > +			cur->bc_ptrs[level]++;
> > > > +			continue;
> > > > +		}
> > > > +
> > > > +		/* End of node, pop back towards the root. */
> > > > +		if (cur->bc_ptrs[level] > be16_to_cpu(block->bb_numrecs)) {
> > > > +node_pop_up:
> > > > +			if (level < cur->bc_nlevels - 1)
> > > > +				cur->bc_ptrs[level + 1]++;
> > > > +			level++;
> > > > +			continue;
> > > 
> > > Looks like same code as leaf_pop_up. I wonder if we can bury this at the
> > > end of the loop with a common label.
> > 
> > Yep.
> > 
> > > > +		}
> > > > +
> > > > +		lkp = xfs_btree_key_addr(cur, cur->bc_ptrs[level], block);
> > > > +		hkp = xfs_btree_high_key_addr(cur, cur->bc_ptrs[level], block);
> > > > +		pp = xfs_btree_ptr_addr(cur, cur->bc_ptrs[level], block);
> > > > +
> > > > +		ldiff = cur->bc_ops->diff_two_keys(cur, &low_key, hkp);
> > > > +		hdiff = cur->bc_ops->diff_two_keys(cur, lkp, &high_key);
> > > > +
> > > > +		/* If the key matches, drill another level deeper. */
> > > > +		if (ldiff >= 0 && hdiff >= 0) {
> > > > +			level--;
> > > > +			error = xfs_btree_lookup_get_block(cur, level, pp,
> > > > +					&block);
> > > > +			if (error)
> > > > +				goto out;
> > > > +			xfs_btree_get_block(cur, level, &bp);
> > > > +			trace_xfs_btree_overlapped_query_range(cur, level, bp);
> > > > +#ifdef DEBUG
> > > > +			error = xfs_btree_check_block(cur, block, level, bp);
> > > > +			if (error)
> > > > +				goto out;
> > > > +#endif
> > > > +			cur->bc_ptrs[level] = 1;
> > > > +			continue;
> > > > +		} else if (hdiff < 0) {
> > > > +			/* The low key is larger than the upper range; pop. */
> > > > +			goto node_pop_up;
> > > > +		}
> > > > +		cur->bc_ptrs[level]++;
> > > > +	}
> > > > +
> > > > +out:
> > > > +	/*
> > > > +	 * If we don't end this function with the cursor pointing at a record
> > > > +	 * block, a subsequent non-error cursor deletion will not release
> > > > +	 * node-level buffers, causing a buffer leak.  This is quite possible
> > > > +	 * with a zero-results range query, so release the buffers if we
> > > > +	 * failed to return any results.
> > > > +	 */
> > > > +	if (cur->bc_bufs[0] == NULL) {
> > > > +		for (i = 0; i < cur->bc_nlevels; i++) {
> > > > +			if (cur->bc_bufs[i]) {
> > > > +				xfs_trans_brelse(cur->bc_tp, cur->bc_bufs[i]);
> > > > +				cur->bc_bufs[i] = NULL;
> > > > +				cur->bc_ptrs[i] = 0;
> > > > +				cur->bc_ra[i] = 0;
> > > > +			}
> > > > +		}
> > > > +	}
> > > > +
> > > > +	return error;
> > > > +}
> > > > +
> > > > +/*
> > > > + * Query a btree for all records overlapping a given interval of keys.  The
> > > > + * supplied function will be called with each record found; return one of the
> > > > + * XFS_BTREE_QUERY_RANGE_{CONTINUE,ABORT} values or the usual negative error
> > > > + * code.  This function returns XFS_BTREE_QUERY_RANGE_ABORT, zero, or a
> > > > + * negative error code.
> > > > + */
> > > > +int
> > > > +xfs_btree_query_range(
> > > > +	struct xfs_btree_cur		*cur,
> > > > +	union xfs_btree_irec		*low_rec,
> > > > +	union xfs_btree_irec		*high_rec,
> > > > +	xfs_btree_query_range_fn	fn,
> > > > +	void				*priv)
> > > > +{
> > > > +	if (!(cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING))
> > > > +		return xfs_btree_simple_query_range(cur, low_rec,
> > > > +				high_rec, fn, priv);
> > > > +	return xfs_btree_overlapped_query_range(cur, low_rec, high_rec,
> > > > +			fn, priv);
> > > > +}
> > > > diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> > > > index a5ec6c7..898fee5 100644
> > > > --- a/fs/xfs/libxfs/xfs_btree.h
> > > > +++ b/fs/xfs/libxfs/xfs_btree.h
> > > > @@ -206,6 +206,12 @@ struct xfs_btree_ops {
> > > >  #define LASTREC_DELREC	2
> > > >  
> > > >  
> > > > +union xfs_btree_irec {
> > > > +	xfs_alloc_rec_incore_t		a;
> > > > +	xfs_bmbt_irec_t			b;
> > > > +	xfs_inobt_rec_incore_t		i;
> > > > +};
> > > > +
> > > 
> > > We might as well kill off the typedef usage here.
> > 
> > Ok.  Thx for the review!
> > 
> > --D
> > 
> > > 
> > > Brian
> > > 
> > > >  /*
> > > >   * Btree cursor structure.
> > > >   * This collects all information needed by the btree code in one place.
> > > > @@ -216,11 +222,7 @@ typedef struct xfs_btree_cur
> > > >  	struct xfs_mount	*bc_mp;	/* file system mount struct */
> > > >  	const struct xfs_btree_ops *bc_ops;
> > > >  	uint			bc_flags; /* btree features - below */
> > > > -	union {
> > > > -		xfs_alloc_rec_incore_t	a;
> > > > -		xfs_bmbt_irec_t		b;
> > > > -		xfs_inobt_rec_incore_t	i;
> > > > -	}		bc_rec;		/* current insert/search record value */
> > > > +	union xfs_btree_irec	bc_rec;	/* current insert/search record value */
> > > >  	struct xfs_buf	*bc_bufs[XFS_BTREE_MAXLEVELS];	/* buf ptr per level */
> > > >  	int		bc_ptrs[XFS_BTREE_MAXLEVELS];	/* key/record # */
> > > >  	__uint8_t	bc_ra[XFS_BTREE_MAXLEVELS];	/* readahead bits */
> > > > @@ -494,4 +496,14 @@ xfs_extlen_t xfs_btree_calc_size(struct xfs_mount *mp, uint *limits,
> > > >  uint xfs_btree_compute_maxlevels(struct xfs_mount *mp, uint *limits,
> > > >  		unsigned long len);
> > > >  
> > > > +/* return codes */
> > > > +#define XFS_BTREE_QUERY_RANGE_CONTINUE	0	/* keep iterating */
> > > > +#define XFS_BTREE_QUERY_RANGE_ABORT	1	/* stop iterating */
> > > > +typedef int (*xfs_btree_query_range_fn)(struct xfs_btree_cur *cur,
> > > > +		union xfs_btree_rec *rec, void *priv);
> > > > +
> > > > +int xfs_btree_query_range(struct xfs_btree_cur *cur,
> > > > +		union xfs_btree_irec *low_rec, union xfs_btree_irec *high_rec,
> > > > +		xfs_btree_query_range_fn fn, void *priv);
> > > > +
> > > >  #endif	/* __XFS_BTREE_H__ */
> > > > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > > > index ffea28c..f0ac9c9 100644
> > > > --- a/fs/xfs/xfs_trace.h
> > > > +++ b/fs/xfs/xfs_trace.h
> > > > @@ -2218,6 +2218,7 @@ DEFINE_EVENT(xfs_btree_cur_class, name, \
> > > >  	TP_PROTO(struct xfs_btree_cur *cur, int level, struct xfs_buf *bp), \
> > > >  	TP_ARGS(cur, level, bp))
> > > >  DEFINE_BTREE_CUR_EVENT(xfs_btree_updkeys);
> > > > +DEFINE_BTREE_CUR_EVENT(xfs_btree_overlapped_query_range);
> > > >  
> > > >  #endif /* _TRACE_XFS_H */
> > > >  
> > > > 
> > > > _______________________________________________
> > > > xfs mailing list
> > > > xfs@oss.sgi.com
> > > > http://oss.sgi.com/mailman/listinfo/xfs
> > 


* Re: [PATCH 018/119] xfs: enable the xfs_defer mechanism to process extents to free
  2016-06-28 12:32         ` Brian Foster
@ 2016-06-28 16:33           ` Darrick J. Wong
  0 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-28 16:33 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, vishal.l.verma, xfs

On Tue, Jun 28, 2016 at 08:32:39AM -0400, Brian Foster wrote:
> On Mon, Jun 27, 2016 at 03:00:03PM -0700, Darrick J. Wong wrote:
> > On Mon, Jun 27, 2016 at 02:41:45PM -0700, Darrick J. Wong wrote:
> > > On Mon, Jun 27, 2016 at 09:15:08AM -0400, Brian Foster wrote:
> > > > On Thu, Jun 16, 2016 at 06:19:47PM -0700, Darrick J. Wong wrote:
> > > > > Connect the xfs_defer mechanism with the pieces that we'll need to
> > > > > handle deferred extent freeing.  We'll wire up the existing code to
> > > > > our new deferred mechanism later.
> > > > > 
> > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > ---
> > > > 
> > > > Could we merge this with the xfs_trans_*efi/efd* bits? We'd need to
> > > > preserve some calls for recovery, but it looks like other parts are only
> > > > used by the deferred ops infrastructure at this point.
> > > 
> > > Yes, we could replace xfs_bmap_free_create_{intent,done} with
> > > xfs_trans_get_ef[id] and lose the silly functions.  I'll go take
> > > care of all of them.
> > 
> > Hah, gcc complains about the mismatch in pointer types for the second
> > argument.
> > 
> > fs/xfs/xfs_defer_item.c:504:17: error: initialization from incompatible pointer type [-Werror=incompatible-pointer-types]
> >   .create_done = xfs_trans_get_bud,
> >                  ^
> > fs/xfs/xfs_defer_item.c:504:17: note: (near initialization for ‘xfs_bmap_update_defer_type.create_done’)
> > 
> > I guess one could put in an ugly cast to coerce the types at a cost of
> > uglifying the code.  <shrug> Opinions?
> > 
> 
> Not sure what you mean here... I should be more clear. What I was
> thinking is to nuke xfs_defer_item.c, define the 'const struct
> xfs_defer_op_type xfs_extent_free_defer_type' right in
> xfs_trans_extfree.c (perhaps rename the file) and wire up the functions
> appropriately. Just change the function signatures if you need to. For
> something like xfs_trans_get_efd(), maybe refactor the guts into a
> static helper that both the defer callback and xfs_trans_get_efd() can
> use, since we need the latter for log recovery. Will something like that
> work?
> 
> To put it simply, it looks like we have at least a few places where
> defer item has a callback that calls yet another interface, but is the
> only caller of the latter (e.g.,
> xfs_bmap_free_create_intent()->xfs_trans_get_efi()). 

*Oh*.  Yes, certainly all the stuff in xfs_defer_item.c can be broken up by
type and moved into xfs_trans_*.c.

The libxfs version will retain libxfs/defer_item.c since the deferred op types
there simply call the appropriate libxfs functions without any logging.

--D

> 
> Brian
> 
> > --D
> > 
> > > 
> > > --D
> > > 
> > > > 
> > > > Brian
> > > > 
> > > > >  fs/xfs/libxfs/xfs_defer.h |    1 
> > > > >  fs/xfs/xfs_defer_item.c   |  108 +++++++++++++++++++++++++++++++++++++++++++++
> > > > >  2 files changed, 109 insertions(+)
> > > > > 
> > > > > 
> > > > > diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
> > > > > index 85c7a3a..743fc32 100644
> > > > > --- a/fs/xfs/libxfs/xfs_defer.h
> > > > > +++ b/fs/xfs/libxfs/xfs_defer.h
> > > > > @@ -51,6 +51,7 @@ struct xfs_defer_pending {
> > > > >   * find all the space it needs.
> > > > >   */
> > > > >  enum xfs_defer_ops_type {
> > > > > +	XFS_DEFER_OPS_TYPE_FREE,
> > > > >  	XFS_DEFER_OPS_TYPE_MAX,
> > > > >  };
> > > > >  
> > > > > diff --git a/fs/xfs/xfs_defer_item.c b/fs/xfs/xfs_defer_item.c
> > > > > index 4c2ba28..127a54e 100644
> > > > > --- a/fs/xfs/xfs_defer_item.c
> > > > > +++ b/fs/xfs/xfs_defer_item.c
> > > > > @@ -29,9 +29,117 @@
> > > > >  #include "xfs_defer.h"
> > > > >  #include "xfs_trans.h"
> > > > >  #include "xfs_trace.h"
> > > > > +#include "xfs_bmap.h"
> > > > > +#include "xfs_extfree_item.h"
> > > > > +
> > > > > +/* Extent Freeing */
> > > > > +
> > > > > +/* Sort bmap items by AG. */
> > > > > +static int
> > > > > +xfs_bmap_free_diff_items(
> > > > > +	void				*priv,
> > > > > +	struct list_head		*a,
> > > > > +	struct list_head		*b)
> > > > > +{
> > > > > +	struct xfs_mount		*mp = priv;
> > > > > +	struct xfs_bmap_free_item	*ra;
> > > > > +	struct xfs_bmap_free_item	*rb;
> > > > > +
> > > > > +	ra = container_of(a, struct xfs_bmap_free_item, xbfi_list);
> > > > > +	rb = container_of(b, struct xfs_bmap_free_item, xbfi_list);
> > > > > +	return  XFS_FSB_TO_AGNO(mp, ra->xbfi_startblock) -
> > > > > +		XFS_FSB_TO_AGNO(mp, rb->xbfi_startblock);
> > > > > +}
> > > > > +
> > > > > +/* Get an EFI. */
> > > > > +STATIC void *
> > > > > +xfs_bmap_free_create_intent(
> > > > > +	struct xfs_trans		*tp,
> > > > > +	unsigned int			count)
> > > > > +{
> > > > > +	return xfs_trans_get_efi(tp, count);
> > > > > +}
> > > > > +
> > > > > +/* Log a free extent to the intent item. */
> > > > > +STATIC void
> > > > > +xfs_bmap_free_log_item(
> > > > > +	struct xfs_trans		*tp,
> > > > > +	void				*intent,
> > > > > +	struct list_head		*item)
> > > > > +{
> > > > > +	struct xfs_bmap_free_item	*free;
> > > > > +
> > > > > +	free = container_of(item, struct xfs_bmap_free_item, xbfi_list);
> > > > > +	xfs_trans_log_efi_extent(tp, intent, free->xbfi_startblock,
> > > > > +			free->xbfi_blockcount);
> > > > > +}
> > > > > +
> > > > > +/* Get an EFD so we can process all the free extents. */
> > > > > +STATIC void *
> > > > > +xfs_bmap_free_create_done(
> > > > > +	struct xfs_trans		*tp,
> > > > > +	void				*intent,
> > > > > +	unsigned int			count)
> > > > > +{
> > > > > +	return xfs_trans_get_efd(tp, intent, count);
> > > > > +}
> > > > > +
> > > > > +/* Process a free extent. */
> > > > > +STATIC int
> > > > > +xfs_bmap_free_finish_item(
> > > > > +	struct xfs_trans		*tp,
> > > > > +	struct xfs_defer_ops		*dop,
> > > > > +	struct list_head		*item,
> > > > > +	void				*done_item,
> > > > > +	void				**state)
> > > > > +{
> > > > > +	struct xfs_bmap_free_item	*free;
> > > > > +	int				error;
> > > > > +
> > > > > +	free = container_of(item, struct xfs_bmap_free_item, xbfi_list);
> > > > > +	error = xfs_trans_free_extent(tp, done_item,
> > > > > +			free->xbfi_startblock,
> > > > > +			free->xbfi_blockcount);
> > > > > +	kmem_free(free);
> > > > > +	return error;
> > > > > +}
> > > > > +
> > > > > +/* Abort all pending EFIs. */
> > > > > +STATIC void
> > > > > +xfs_bmap_free_abort_intent(
> > > > > +	void				*intent)
> > > > > +{
> > > > > +	xfs_efi_release(intent);
> > > > > +}
> > > > > +
> > > > > +/* Cancel a free extent. */
> > > > > +STATIC void
> > > > > +xfs_bmap_free_cancel_item(
> > > > > +	struct list_head		*item)
> > > > > +{
> > > > > +	struct xfs_bmap_free_item	*free;
> > > > > +
> > > > > +	free = container_of(item, struct xfs_bmap_free_item, xbfi_list);
> > > > > +	kmem_free(free);
> > > > > +}
> > > > > +
> > > > > +const struct xfs_defer_op_type xfs_extent_free_defer_type = {
> > > > > +	.type		= XFS_DEFER_OPS_TYPE_FREE,
> > > > > +	.max_items	= XFS_EFI_MAX_FAST_EXTENTS,
> > > > > +	.diff_items	= xfs_bmap_free_diff_items,
> > > > > +	.create_intent	= xfs_bmap_free_create_intent,
> > > > > +	.abort_intent	= xfs_bmap_free_abort_intent,
> > > > > +	.log_item	= xfs_bmap_free_log_item,
> > > > > +	.create_done	= xfs_bmap_free_create_done,
> > > > > +	.finish_item	= xfs_bmap_free_finish_item,
> > > > > +	.cancel_item	= xfs_bmap_free_cancel_item,
> > > > > +};
> > > > > +
> > > > > +/* Deferred Item Initialization */
> > > > >  
> > > > >  /* Initialize the deferred operation types. */
> > > > >  void
> > > > >  xfs_defer_init_types(void)
> > > > >  {
> > > > > +	xfs_defer_init_op_type(&xfs_extent_free_defer_type);
> > > > >  }
> > > > > 
> > > 
> > 


* Re: [PATCH 013/119] xfs: support btrees with overlapping intervals for keys
  2016-06-28 12:32       ` Brian Foster
@ 2016-06-28 17:36         ` Darrick J. Wong
  0 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-28 17:36 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, vishal.l.verma, xfs

On Tue, Jun 28, 2016 at 08:32:04AM -0400, Brian Foster wrote:
> On Mon, Jun 27, 2016 at 08:26:21PM -0700, Darrick J. Wong wrote:
> > On Wed, Jun 22, 2016 at 11:17:06AM -0400, Brian Foster wrote:
> > > On Thu, Jun 16, 2016 at 06:19:15PM -0700, Darrick J. Wong wrote:
> > > > On a filesystem with both reflink and reverse mapping enabled, it's
> > > > possible to have multiple rmap records referring to the same blocks on
> > > > disk.  When overlapping intervals are possible, querying a classic
> > > > btree to find all records intersecting a given interval is inefficient
> > > > because we cannot use the left side of the search interval to filter
> > > > out non-matching records the same way that we can use the existing
> > > > btree key to filter out records coming after the right side of the
> > > > search interval.  This will become important once we want to use the
> > > > rmap btree to rebuild BMBTs, or implement the (future) fsmap ioctl.
> > > > 
> > > > (For the non-overlapping case, we can perform such queries trivially
> > > > by starting at the left side of the interval and walking the tree
> > > > until we pass the right side.)
> > > > 
> > > > Therefore, extend the btree code to come closer to supporting
> > > > intervals as a first-class record attribute.  This involves widening
> > > > the btree node's key space to store both the lowest key reachable via
> > > > the node pointer (as the btree does now) and the highest key reachable
> > > > via the same pointer and teaching the btree modifying functions to
> > > > keep the highest-key records up to date.
> > > > 
> > > > This behavior can be turned on via a new btree ops flag so that btrees
> > > > that cannot store overlapping intervals don't pay the overhead costs
> > > > in terms of extra code and disk format changes.
> > > > 
> > > > v2: When we're deleting a record in a btree that supports overlapped
> > > > interval records and the deletion results in two btree blocks being
> > > > joined, we defer updating the high/low keys until after all possible
> > > > joining (at higher levels in the tree) have finished.  At this point,
> > > > the btree pointers at all levels have been updated to remove the empty
> > > > blocks and we can update the low and high keys.
> > > > 
> > > > When we're doing this, we must be careful to update the keys of all
> > > > node pointers up to the root instead of stopping at the first set of
> > > > keys that don't need updating.  This is because it's possible for a
> > > > single deletion to cause joining of multiple levels of tree, and so
> > > > we need to update everything going back to the root.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > ---
> > > 
> > > I think I get the gist of this and it mostly looks Ok to me. A few
> > > questions and minor comments...
> > 
> > Ok.
> > 
> > > >  fs/xfs/libxfs/xfs_btree.c |  379 +++++++++++++++++++++++++++++++++++++++++----
> > > >  fs/xfs/libxfs/xfs_btree.h |   16 ++
> > > >  fs/xfs/xfs_trace.h        |   36 ++++
> > > >  3 files changed, 395 insertions(+), 36 deletions(-)
> > > > 
> > > > 
> > > > diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> > > > index a096539..afcafd6 100644
> > > > --- a/fs/xfs/libxfs/xfs_btree.c
> > > > +++ b/fs/xfs/libxfs/xfs_btree.c
> ...
> > > > @@ -2149,7 +2392,9 @@ xfs_btree_lshift(
> > > >  		rkp = &key;
> > > >  	}
> > > >  
> > > > -	/* Update the parent key values of right. */
> > > > +	/* Update the parent key values of left and right. */
> > > > +	xfs_btree_sibling_updkeys(cur, level, XFS_BB_LEFTSIB, left, lbp);
> > > > +	xfs_btree_updkeys(cur, level);
> > > >  	error = xfs_btree_updkey(cur, rkp, level + 1);
> > > >  	if (error)
> > > >  		goto error0;
> > > > @@ -2321,6 +2566,9 @@ xfs_btree_rshift(
> > > >  	if (error)
> > > >  		goto error1;
> > > >  
> > > > +	/* Update left and right parent pointers */
> > > > +	xfs_btree_updkeys(cur, level);
> > > > +	xfs_btree_updkeys(tcur, level);
> > > 
> > > In this case, we grab the last record of the block, increment from there
> > > and update using the cursor. This is much more straightforward, imo.
> > > Could we use this approach in the left shift case as well?
> > 
> > Yes, I think so.  I might have started refactoring btree_sibling_updkeys
> > out of existence and got distracted, since there isn't anything that uses
> > the RIGHTSIB ptr value.
> > 
> 
> Ok, I think that would be much cleaner.

Done.

> > > >  	error = xfs_btree_updkey(tcur, rkp, level + 1);
> > > >  	if (error)
> > > >  		goto error1;
> > > > @@ -2356,7 +2604,7 @@ __xfs_btree_split(
> > > >  	struct xfs_btree_cur	*cur,
> > > >  	int			level,
> > > >  	union xfs_btree_ptr	*ptrp,
> > > > -	union xfs_btree_key	*key,
> > > > +	struct xfs_btree_double_key	*key,
> > > >  	struct xfs_btree_cur	**curp,
> > > >  	int			*stat)		/* success/failure */
> > > >  {
> > > > @@ -2452,9 +2700,6 @@ __xfs_btree_split(
> > > >  
> > > >  		xfs_btree_log_keys(cur, rbp, 1, rrecs);
> > > >  		xfs_btree_log_ptrs(cur, rbp, 1, rrecs);
> > > > -
> > > > -		/* Grab the keys to the entries moved to the right block */
> > > > -		xfs_btree_copy_keys(cur, key, rkp, 1);
> > > >  	} else {
> > > >  		/* It's a leaf.  Move records.  */
> > > >  		union xfs_btree_rec	*lrp;	/* left record pointer */
> > > > @@ -2465,12 +2710,8 @@ __xfs_btree_split(
> > > >  
> > > >  		xfs_btree_copy_recs(cur, rrp, lrp, rrecs);
> > > >  		xfs_btree_log_recs(cur, rbp, 1, rrecs);
> > > > -
> > > > -		cur->bc_ops->init_key_from_rec(key,
> > > > -			xfs_btree_rec_addr(cur, 1, right));
> > > >  	}
> > > >  
> > > > -
> > > >  	/*
> > > >  	 * Find the left block number by looking in the buffer.
> > > >  	 * Adjust numrecs, sibling pointers.
> > > > @@ -2484,6 +2725,12 @@ __xfs_btree_split(
> > > >  	xfs_btree_set_numrecs(left, lrecs);
> > > >  	xfs_btree_set_numrecs(right, xfs_btree_get_numrecs(right) + rrecs);
> > > >  
> > > > +	/* Find the low & high keys for the new block. */
> > > > +	if (level > 0)
> > > > +		xfs_btree_find_node_keys(cur, right, &key->low, &key->high);
> > > > +	else
> > > > +		xfs_btree_find_leaf_keys(cur, right, &key->low, &key->high);
> > > > +
> > > 
> > > Why not push these into the above if/else where the previous key
> > > copy/init calls were removed from?
> > 
> > We don't set bb_numrecs on the right block until the line above the new
> > hunk, and the btree_find_*_keys functions require numrecs to be set.
> > 
> > The removed key copy/init calls only looked at keys[1].
> > 
> > That said, it's trivial to move the set_numrecs calls above the if statement.
> > 
> 
> Ok, thanks. No need to shuffle it around. I'd suggest a one-liner
> comment though so somebody doesn't blindly refactor this down the road.
> It also sounds like the find keys functions could use ASSERT() checks
> for a sane bb_numrecs.

Hmm.  I already moved it, oh well.

It _does_ make the function less messy, so I'll leave it unless anyone yells.

> > > >  	xfs_btree_log_block(cur, rbp, XFS_BB_ALL_BITS);
> > > >  	xfs_btree_log_block(cur, lbp, XFS_BB_NUMRECS | XFS_BB_RIGHTSIB);
> > > >  
> ...
> > > > @@ -3095,8 +3365,24 @@ xfs_btree_insrec(
> > > >  	xfs_btree_log_block(cur, bp, XFS_BB_NUMRECS);
> > > >  
> > > >  	/* If we inserted at the start of a block, update the parents' keys. */
> > > 
> > > This comment is associated with the codeblock that has been pushed
> > > further down, no?
> > 
> > Correct.  I think that got mismerged somewhere along the way.
> > 
> > > > +	if (ncur && bp->b_bn != old_bn) {
> > > > +		/*
> > > > +		 * We just inserted into a new tree block, which means that
> > > > +		 * the key for the block is in nkey, not the tree.
> > > > +		 */
> > > > +		if (level == 0)
> > > > +			xfs_btree_find_leaf_keys(cur, block, &nkey.low,
> > > > +					&nkey.high);
> > > > +		else
> > > > +			xfs_btree_find_node_keys(cur, block, &nkey.low,
> > > > +					&nkey.high);
> > > > +	} else {
> > > > +		/* Updating the left block, do it the standard way. */
> > > > +		xfs_btree_updkeys(cur, level);
> > > > +	}
> > > > +
> > > 
> > > Not quite sure I follow the purpose of this hunk. Is this for the case
> > > where a btree split occurs, nkey is filled in for the new/right block
> > > and then (after nkey is filled in) the new record ends up being added to
> > > the new block? If so, what about the case where ncur is not created?
> > > (It looks like that's possible from the code, but I could easily be
> > > missing some context as to why that's not the case.)
> > 
> > Yes, the first part of the if-else hunk is to fill out nkey when we've
> > split a btree block.  Now that I look at it again, I think that whole
> > weird conditional could be replaced with the same xfs_btree_ptr_is_null()
> > check later on.  I think it can also be combined with it.

This is incorrect.  The only time we want to perform the nkey recalculation
is in the specific case where we split a btree block and the cursor ends up
pointing into the new block for the insertion.  We cannot gate the nkey
recalc on whether or not &nptr is null, because if we insert into the left
block after a split we'll recalculate nkey (the right block's keys) using
the left block's data, which is incorrect.  We probably want to do the key
recalc ahead of calling ->update_lastrec because the callback could modify
the cursor.

So I'll just leave the code mostly as is, clarify the comments about what
we're doing and why, and change the if statement to:

if (bp && bp->b_bn != old_bn)

Also for some reason I neglected to check the return code from
xfs_btree_updkeys, so I will go fix that.  At the moment it doesn't matter
because updkeys never fails, but I might as well fix it now.

> Ok.
> 
> > Commentage for now:
> > 
> > /*
> >  * If we just inserted a new tree block, we have to find the low
> >  * and high keys for the new block and arrange to pass them back
> >  * separately.  If we're just updating a block we can use the
> >  * regular tree update mechanism.
> >  */
> > 
> 
> Couldn't you just point out that nkey may not be coherent with the new
> block if the new record was inserted therein?

Yes, that would be less convoluted.  Done. :)

> > > In any event, I think we could elaborate a bit in the comment on why
> > > this is necessary. I'd also move it above the top-level if/else.
> > > 
> > > >  	if (optr == 1) {
> > > > -		error = xfs_btree_updkey(cur, key, level + 1);
> > > > +		error = xfs_btree_updkey(cur, &key->low, level + 1);
> > > >  		if (error)
> > > >  			goto error0;
> > > >  	}
> > > > @@ -3147,7 +3433,7 @@ xfs_btree_insert(
> > > >  	union xfs_btree_ptr	nptr;	/* new block number (split result) */
> > > >  	struct xfs_btree_cur	*ncur;	/* new cursor (split result) */
> > > >  	struct xfs_btree_cur	*pcur;	/* previous level's cursor */
> > > > -	union xfs_btree_key	key;	/* key of block to insert */
> > > > +	struct xfs_btree_double_key	key;	/* key of block to insert */
> > > 
> > > Probably should fix up the function param alignment here and the couple
> > > other or so places we make this change.
> > 
> > I changed the name to xfs_btree_bigkey, which avoids the alignment problems.
> > 
> 
> Sounds good.

--D

> 
> Brian
> 
> > --D
> > 
> > > 
> > > Brian
> > > 
> > > >  
> > > >  	level = 0;
> > > >  	ncur = NULL;
> > > > @@ -3552,6 +3838,7 @@ xfs_btree_delrec(
> > > >  	 * If we deleted the leftmost entry in the block, update the
> > > >  	 * key values above us in the tree.
> > > >  	 */
> > > > +	xfs_btree_updkeys(cur, level);
> > > >  	if (ptr == 1) {
> > > >  		error = xfs_btree_updkey(cur, keyp, level + 1);
> > > >  		if (error)
> > > > @@ -3882,6 +4169,16 @@ xfs_btree_delrec(
> > > >  	if (level > 0)
> > > >  		cur->bc_ptrs[level]--;
> > > >  
> > > > +	/*
> > > > +	 * We combined blocks, so we have to update the parent keys if the
> > > > +	 * btree supports overlapped intervals.  However, bc_ptrs[level + 1]
> > > > +	 * points to the old block so that the caller knows which record to
> > > > +	 * delete.  Therefore, the caller must be savvy enough to call updkeys
> > > > +	 * for us if we return stat == 2.  The other exit points from this
> > > > +	 * function don't require deletions further up the tree, so they can
> > > > +	 * call updkeys directly.
> > > > +	 */
> > > > +
> > > >  	XFS_BTREE_TRACE_CURSOR(cur, XBT_EXIT);
> > > >  	/* Return value means the next level up has something to do. */
> > > >  	*stat = 2;
> > > > @@ -3907,6 +4204,7 @@ xfs_btree_delete(
> > > >  	int			error;	/* error return value */
> > > >  	int			level;
> > > >  	int			i;
> > > > +	bool			joined = false;
> > > >  
> > > >  	XFS_BTREE_TRACE_CURSOR(cur, XBT_ENTRY);
> > > >  
> > > > @@ -3920,8 +4218,17 @@ xfs_btree_delete(
> > > >  		error = xfs_btree_delrec(cur, level, &i);
> > > >  		if (error)
> > > >  			goto error0;
> > > > +		if (i == 2)
> > > > +			joined = true;
> > > >  	}
> > > >  
> > > > +	/*
> > > > +	 * If we combined blocks as part of deleting the record, delrec won't
> > > > +	 * have updated the parent keys so we have to do that here.
> > > > +	 */
> > > > +	if (joined)
> > > > +		xfs_btree_updkeys_force(cur, 0);
> > > > +
> > > >  	if (i == 0) {
> > > >  		for (level = 1; level < cur->bc_nlevels; level++) {
> > > >  			if (cur->bc_ptrs[level] == 0) {
> > > > diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> > > > index b99c018..a5ec6c7 100644
> > > > --- a/fs/xfs/libxfs/xfs_btree.h
> > > > +++ b/fs/xfs/libxfs/xfs_btree.h
> > > > @@ -126,6 +126,9 @@ struct xfs_btree_ops {
> > > >  	size_t	key_len;
> > > >  	size_t	rec_len;
> > > >  
> > > > +	/* flags */
> > > > +	uint	flags;
> > > > +
> > > >  	/* cursor operations */
> > > >  	struct xfs_btree_cur *(*dup_cursor)(struct xfs_btree_cur *);
> > > >  	void	(*update_cursor)(struct xfs_btree_cur *src,
> > > > @@ -162,11 +165,21 @@ struct xfs_btree_ops {
> > > >  				     union xfs_btree_rec *rec);
> > > >  	void	(*init_ptr_from_cur)(struct xfs_btree_cur *cur,
> > > >  				     union xfs_btree_ptr *ptr);
> > > > +	void	(*init_high_key_from_rec)(union xfs_btree_key *key,
> > > > +					  union xfs_btree_rec *rec);
> > > >  
> > > >  	/* difference between key value and cursor value */
> > > >  	__int64_t (*key_diff)(struct xfs_btree_cur *cur,
> > > >  			      union xfs_btree_key *key);
> > > >  
> > > > +	/*
> > > > +	 * Difference between key2 and key1 -- positive if key2 > key1,
> > > > +	 * negative if key2 < key1, and zero if equal.
> > > > +	 */
> > > > +	__int64_t (*diff_two_keys)(struct xfs_btree_cur *cur,
> > > > +				   union xfs_btree_key *key1,
> > > > +				   union xfs_btree_key *key2);
> > > > +
> > > >  	const struct xfs_buf_ops	*buf_ops;
> > > >  
> > > >  #if defined(DEBUG) || defined(XFS_WARN)
> > > > @@ -182,6 +195,9 @@ struct xfs_btree_ops {
> > > >  #endif
> > > >  };
> > > >  
> > > > +/* btree ops flags */
> > > > +#define XFS_BTREE_OPS_OVERLAPPING	(1<<0)	/* overlapping intervals */
> > > > +
> > > >  /*
> > > >   * Reasons for the update_lastrec method to be called.
> > > >   */
> > > > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > > > index 68f27f7..ffea28c 100644
> > > > --- a/fs/xfs/xfs_trace.h
> > > > +++ b/fs/xfs/xfs_trace.h
> > > > @@ -38,6 +38,7 @@ struct xlog_recover_item;
> > > >  struct xfs_buf_log_format;
> > > >  struct xfs_inode_log_format;
> > > >  struct xfs_bmbt_irec;
> > > > +struct xfs_btree_cur;
> > > >  
> > > >  DECLARE_EVENT_CLASS(xfs_attr_list_class,
> > > >  	TP_PROTO(struct xfs_attr_list_context *ctx),
> > > > @@ -2183,6 +2184,41 @@ DEFINE_DISCARD_EVENT(xfs_discard_toosmall);
> > > >  DEFINE_DISCARD_EVENT(xfs_discard_exclude);
> > > >  DEFINE_DISCARD_EVENT(xfs_discard_busy);
> > > >  
> > > > +/* btree cursor events */
> > > > +DECLARE_EVENT_CLASS(xfs_btree_cur_class,
> > > > +	TP_PROTO(struct xfs_btree_cur *cur, int level, struct xfs_buf *bp),
> > > > +	TP_ARGS(cur, level, bp),
> > > > +	TP_STRUCT__entry(
> > > > +		__field(dev_t, dev)
> > > > +		__field(xfs_btnum_t, btnum)
> > > > +		__field(int, level)
> > > > +		__field(int, nlevels)
> > > > +		__field(int, ptr)
> > > > +		__field(xfs_daddr_t, daddr)
> > > > +	),
> > > > +	TP_fast_assign(
> > > > +		__entry->dev = cur->bc_mp->m_super->s_dev;
> > > > +		__entry->btnum = cur->bc_btnum;
> > > > +		__entry->level = level;
> > > > +		__entry->nlevels = cur->bc_nlevels;
> > > > +		__entry->ptr = cur->bc_ptrs[level];
> > > > +		__entry->daddr = bp->b_bn;
> > > > +	),
> > > > +	TP_printk("dev %d:%d btnum %d level %d/%d ptr %d daddr 0x%llx",
> > > > +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > > > +		  __entry->btnum,
> > > > +		  __entry->level,
> > > > +		  __entry->nlevels,
> > > > +		  __entry->ptr,
> > > > +		  (unsigned long long)__entry->daddr)
> > > > +)
> > > > +
> > > > +#define DEFINE_BTREE_CUR_EVENT(name) \
> > > > +DEFINE_EVENT(xfs_btree_cur_class, name, \
> > > > +	TP_PROTO(struct xfs_btree_cur *cur, int level, struct xfs_buf *bp), \
> > > > +	TP_ARGS(cur, level, bp))
> > > > +DEFINE_BTREE_CUR_EVENT(xfs_btree_updkeys);
> > > > +
> > > >  #endif /* _TRACE_XFS_H */
> > > >  
> > > >  #undef TRACE_INCLUDE_PATH
> > > > 
> > > > _______________________________________________
> > > > xfs mailing list
> > > > xfs@oss.sgi.com
> > > > http://oss.sgi.com/mailman/listinfo/xfs
> > 

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 016/119] xfs: move deferred operations into a separate file
  2016-06-28 12:32       ` Brian Foster
@ 2016-06-28 18:51         ` Darrick J. Wong
  0 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-28 18:51 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, vishal.l.verma, xfs

On Tue, Jun 28, 2016 at 08:32:32AM -0400, Brian Foster wrote:
> On Mon, Jun 27, 2016 at 12:14:01PM -0700, Darrick J. Wong wrote:
> > On Mon, Jun 27, 2016 at 09:14:54AM -0400, Brian Foster wrote:
> > > On Thu, Jun 16, 2016 at 06:19:34PM -0700, Darrick J. Wong wrote:
> > > > All the code around struct xfs_bmap_free basically implements a
> > > > deferred operation framework through which we can roll transactions
> > > > (to unlock buffers and avoid violating lock order rules) while
> > > > managing all the necessary log redo items.  Previously we only used
> > > > this code to free extents after some sort of mapping operation, but
> > > > with the advent of rmap and reflink, we suddenly need to do more than
> > > > that.
> > > > 
> > > > With that in mind, xfs_bmap_free really becomes a deferred ops control
> > > > structure.  Rename the structure and move the deferred ops into their
> > > > own file to avoid further bloating of the bmap code.
> > > > 
> > > > v2: actually sort the work items by AG to avoid deadlocks
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > ---
> > > 
> > > So if I'm following this correctly, we 1.) abstract the bmap freeing
> > > infrastructure into a generic mechanism and 2.) enhance it a bit to
> > > provide things like partial intent completion, etc.
> > 
> > [Back from vacation]
> > 
> > Yup.  The partial intent completion code is for use by the refcount adjust
> > function because in the worst case an adjustment of N blocks could require
> > N record updates.
> > 
> 
> Ok, technically those bits could be punted off to the reflink series.
> 
> > > If so and for future
> > > reference, this would probably be easier to review if the abstraction
> > > and enhancement were done separately. It's probably not worth that at
> > > this point, however...
> > 
> > It wouldn't be difficult to separate them; the partial intent completion
> > bits are the two code blocks below that handle the -EAGAIN case.
> > 
> 
> That's kind of what I figured, since otherwise most of the rest of the
> code maps to the xfs_bmap_*() stuff.
> 
> > (On the other hand it's so little code that I figured I might as well
> > just do the whole file all at once.)
> > 
> 
> It's more a matter of simplifying review when a change is explicitly
> refactoring vs. having to read through and identify where the
> enhancements actually are. It leaves a cleaner git history and tends to
> simplify backporting as well, fwiw.

Point taken, the new functionality could be a separate patch.

Or rather, the two chunks of code and a gigantic comment explaining how it
should be used will become a separate patch.

> That said, I don't mind leaving this one as is at this point.
> 
> > > >  fs/xfs/Makefile           |    2 
> > > >  fs/xfs/libxfs/xfs_defer.c |  471 +++++++++++++++++++++++++++++++++++++++++++++
> > > >  fs/xfs/libxfs/xfs_defer.h |   96 +++++++++
> > > >  fs/xfs/xfs_defer_item.c   |   36 +++
> > > >  fs/xfs/xfs_super.c        |    2 
> > > >  5 files changed, 607 insertions(+)
> > > >  create mode 100644 fs/xfs/libxfs/xfs_defer.c
> > > >  create mode 100644 fs/xfs/libxfs/xfs_defer.h
> > > >  create mode 100644 fs/xfs/xfs_defer_item.c
> > > > 
> > > > 
> > > > diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
> > > > index 3542d94..ad46a2d 100644
> > > > --- a/fs/xfs/Makefile
> > > > +++ b/fs/xfs/Makefile
> > > > @@ -39,6 +39,7 @@ xfs-y				+= $(addprefix libxfs/, \
> > > >  				   xfs_btree.o \
> > > >  				   xfs_da_btree.o \
> > > >  				   xfs_da_format.o \
> > > > +				   xfs_defer.o \
> > > >  				   xfs_dir2.o \
> > > >  				   xfs_dir2_block.o \
> > > >  				   xfs_dir2_data.o \
> > > > @@ -66,6 +67,7 @@ xfs-y				+= xfs_aops.o \
> > > >  				   xfs_attr_list.o \
> > > >  				   xfs_bmap_util.o \
> > > >  				   xfs_buf.o \
> > > > +				   xfs_defer_item.o \
> > > >  				   xfs_dir2_readdir.o \
> > > >  				   xfs_discard.o \
> > > >  				   xfs_error.o \
> > > > diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
> > > > new file mode 100644
> > > > index 0000000..ad14e33e
> > > > --- /dev/null
> > > > +++ b/fs/xfs/libxfs/xfs_defer.c
> ...
> > > > +int
> > > > +xfs_defer_finish(
> > > > +	struct xfs_trans		**tp,
> > > > +	struct xfs_defer_ops		*dop,
> > > > +	struct xfs_inode		*ip)
> > > > +{
> > > > +	struct xfs_defer_pending	*dfp;
> > > > +	struct list_head		*li;
> > > > +	struct list_head		*n;
> > > > +	void				*done_item = NULL;
> > > > +	void				*state;
> > > > +	int				error = 0;
> > > > +	void				(*cleanup_fn)(struct xfs_trans *, void *, int);
> > > > +
> > > > +	ASSERT((*tp)->t_flags & XFS_TRANS_PERM_LOG_RES);
> > > > +
> > > > +	/* Until we run out of pending work to finish... */
> > > > +	while (xfs_defer_has_unfinished_work(dop)) {
> > > > +		/* Log intents for work items sitting in the intake. */
> > > > +		xfs_defer_intake_work(*tp, dop);
> > > > +
> > > > +		/* Roll the transaction. */
> > > > +		error = xfs_defer_trans_roll(tp, dop, ip);
> > > > +		if (error)
> > > > +			goto out;
> > > > +
> > > > +		/* Mark all pending intents as committed. */
> > > > +		list_for_each_entry_reverse(dfp, &dop->dop_pending, dfp_list) {
> > > > +			if (dfp->dfp_committed)
> > > > +				break;
> > > > +			dfp->dfp_committed = true;
> > > > +		}
> > > > +
> > > > +		/* Log an intent-done item for the first pending item. */
> > > > +		dfp = list_first_entry(&dop->dop_pending,
> > > > +				struct xfs_defer_pending, dfp_list);
> > > > +		done_item = dfp->dfp_type->create_done(*tp, dfp->dfp_intent,
> > > > +				dfp->dfp_count);
> > > > +		cleanup_fn = dfp->dfp_type->finish_cleanup;
> > > > +
> > > > +		/* Finish the work items. */
> > > > +		state = NULL;
> > > > +		list_for_each_safe(li, n, &dfp->dfp_work) {
> > > > +			list_del(li);
> > > > +			dfp->dfp_count--;
> > > > +			error = dfp->dfp_type->finish_item(*tp, dop, li,
> > > > +					done_item, &state);
> > > > +			if (error == -EAGAIN) {
> > > > +				/*
> > > > +				 * If the caller needs to try again, put the
> > > > +				 * item back on the pending list and jump out
> > > > +				 * for further processing.
> > > 
> > > A little confused by the terminology here. Perhaps better to say "back
> > > on the work list" rather than "pending list?"
> > 
> > Yes.
> > 
> > > Also, what is the meaning/purpose of -EAGAIN here? This isn't used by
> > > the extent free bits so I'm missing some context.
> > 
> > Generally, the ->finish_item() uses -EAGAIN to signal that it couldn't finish
> > the work item and that it's necessary to log a new redo item and try again.
> > 
> 
> Ah, Ok. So it is explicitly part of the dfops interface/infrastructure.
> I think that is worth documenting above with a comment (i.e., "certain
> callers might require many transactions, use -EAGAIN to indicate ...
> blah blah").
> 
> > Practically, the only user of this mechanism is the refcountbt adjust function.
> > It might be the case that we want to adjust N blocks, but some pathological
> > user has creatively used reflink to create many refcount records.  In that
> > case we could blow out the transaction reservation logging all the updates.
> > 
> > To avoid that, the refcount code tries to guess (conservatively) when it
> > might be getting close and returns a short *adjusted.  See the call sites of
> > xfs_refcount_still_have_space().  Next, xfs_trans_log_finish_refcount_update()
> > will notice the short adjust returned and fixes up the CUD item to have a
> > reduced cud_nextents and to reflect where the operation stopped.  Then,
> > xfs_refcount_update_finish_item() notices the short return, updates the work
> > item list, and returns -EAGAIN.  Finally, xfs_defer_finish() sees the -EAGAIN
> > and requeues the work item so that we resume refcount adjusting after the
> > transaction rolls.
> > 
> 
> Hmm, this makes me think that maybe it is better to split this up into
> two patches for now after all. I'm expecting this is going to be merged
> along with the rmap bits before the refcount stuff and I'm not a huge
> fan of putting in infrastructure code without users, moreso without
> fully understanding how/why said code is going to be used (and I'm not
> really looking to jump ahead into the refcount stuff yet).

<nod>
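
(As a plain-C illustration of the requeue behavior described above — toy names only; toy_finish_item and toy_defer_finish are invented here and merely mimic the ->finish_item()/-EAGAIN contract, not the real xfs_defer_finish code:)

```c
#include <assert.h>

#define TOY_EAGAIN	11	/* stand-in for -EAGAIN */
#define TOY_BATCH	2	/* fake per-transaction update budget */

/*
 * Pretend to finish one work item.  Returns 0 if it completed, or
 * TOY_EAGAIN if the per-transaction budget ran out and the caller
 * must roll the transaction and retry this item.
 */
static int
toy_finish_item(int *budget)
{
	if (*budget == 0)
		return TOY_EAGAIN;
	(*budget)--;
	return 0;
}

/*
 * Drain @nitems work items.  Whenever finish_item says EAGAIN, we
 * "relog a new intent" for the remaining items and "roll" the
 * transaction (here: reset the budget), then resume where we left
 * off.  Returns the number of rolls it took.
 */
static int
toy_defer_finish(int nitems)
{
	int budget = TOY_BATCH;
	int rolls = 0;

	while (nitems > 0) {
		if (toy_finish_item(&budget) == TOY_EAGAIN) {
			/* Relog remaining work to a fresh intent... */
			budget = TOY_BATCH;	/* ...and roll the transaction. */
			rolls++;
			continue;
		}
		nitems--;
	}
	return rolls;
}
```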

> > > For example, is there
> > > an issue with carrying a done_item with an unexpected list count?
> > 
> > AFAICT, nothing in log recovery ever checks that the list counts of the
> > intent and done items actually match, let alone the extents logged with
> > them.  It only seems to care if there's an efd such that efd->efd_efi_id ==
> > efi->efi_id, in which case it won't replay the efi.
> > 
> 
> Yeah, I didn't notice any issues with respect to EFI/EFD handling,
> though I didn't look too hard because it doesn't use this -EAGAIN
> mechanism. If it did, I think you might hit the odd ASSERT() check here
> or there (see xfs_efd_item_format()), but that's probably not
> catastrophic. I think it also affects the size of the transaction
> written to the log, fwiw.

Yes, xfs_trans_log_finish_refcount_update fixes the list count in the
CUD to avoid triggering the ASSERT in xfs_cud_item_format().

> I ask more because it's unexpected to have a structure with a list count
> that doesn't match the actual number of items and I don't see it called
> out anywhere. This might be another good reason to punt this part off to
> the reflink series...
> 
> > I don't know if that was a deliberate part of the log design, but the
> > lack of checking helps us here.
> > 
> > > Is it
> > > expected that xfs_defer_finish() will not return until -EAGAIN is
> > > "cleared" (does relogging below and rolling somehow address this)?
> > 
> > Yes, relogging and rolling gives us a fresh transaction with which to
> > continue updating.
> > 
> > > > +				 */
> > > > +				list_add(li, &dfp->dfp_work);
> > > > +				dfp->dfp_count++;
> > > > +				break;
> > > > +			} else if (error) {
> > > > +				/*
> > > > +				 * Clean up after ourselves and jump out.
> > > > +				 * xfs_defer_cancel will take care of freeing
> > > > +				 * all these lists and stuff.
> > > > +				 */
> > > > +				if (cleanup_fn)
> > > > +					cleanup_fn(*tp, state, error);
> > > > +				xfs_defer_trans_abort(*tp, dop, error);
> > > > +				goto out;
> > > > +			}
> > > > +		}
> > > > +		if (error == -EAGAIN) {
> > > > +			/*
> > > > +			 * Log a new intent, relog all the remaining work
> > > > +			 * items to the new intent, attach the new intent to
> > > > +			 * the dfp, and leave the dfp at the head of the list
> > > > +			 * for further processing.
> > > > +			 */
> > > 
> > > Similar to the above, could you elaborate on the mechanics of this with
> > > respect to the log?  E.g., the comment kind of just repeats what the
> > > code does as opposed to explain why it's here. Is the point here to log
> > > a new intent in the same transaction as the done item to ensure that we
> > > (atomically) indicate that certain operations need to be replayed if
> > > this transaction hits the log and then we crash?
> > 
> > Yes.
> > 
> > "This effectively replaces the old intent item with a new one listing only
> > the work items that were not completed when ->finish_item() returned -EAGAIN.
> > After the subsequent transaction roll, we'll resume where we left off with a
> > fresh transaction."
> > 
> 
> I'd point out the relevance of doing so in the same transaction,
> otherwise sounds good.

Ok.
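
(To make the recovery angle concrete, here's a toy model of the log — invented names, nothing like the real xlog code: each intent carries an id, and recovery replays an intent only if no done item with a matching id made it to the log, mirroring how an EFI is skipped when a matching EFD (efd_efi_id == efi_id) is present. That pairing is exactly why the done item for the old intent and the fresh intent for the leftover work must land in the same transaction.)

```c
#include <assert.h>
#include <stddef.h>

enum toy_log_type { TOY_INTENT, TOY_DONE };

struct toy_log_item {
	enum toy_log_type type;
	int		  id;		/* pairs an intent with its done item */
	int		  nextents;	/* work items covered by an intent */
};

/*
 * Count how many extents recovery would replay: sum the extents of
 * every intent that has no matching done item anywhere in the log.
 */
static int
toy_replay(const struct toy_log_item *log, size_t nitems)
{
	int replay = 0;
	size_t i, j;

	for (i = 0; i < nitems; i++) {
		int done = 0;

		if (log[i].type != TOY_INTENT)
			continue;
		for (j = 0; j < nitems; j++)
			if (log[j].type == TOY_DONE && log[j].id == log[i].id)
				done = 1;
		if (!done)
			replay += log[i].nextents;
	}
	return replay;
}
```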

--D

> 
> Brian
> 
> > Thank you for the review!
> > 
> > --D
> > 
> > > Brian
> > > 
> > > > +			dfp->dfp_intent = dfp->dfp_type->create_intent(*tp,
> > > > +					dfp->dfp_count);
> > > > +			list_for_each(li, &dfp->dfp_work)
> > > > +				dfp->dfp_type->log_item(*tp, dfp->dfp_intent,
> > > > +						li);
> > > > +		} else {
> > > > +			/* Done with the dfp, free it. */
> > > > +			list_del(&dfp->dfp_list);
> > > > +			kmem_free(dfp);
> > > > +		}
> > > > +
> > > > +		if (cleanup_fn)
> > > > +			cleanup_fn(*tp, state, error);
> > > > +	}
> > > > +
> > > > +out:
> > > > +	return error;
> > > > +}
> > > > +
> > > > +/*
> > > > + * Free up any items left in the list.
> > > > + */
> > > > +void
> > > > +xfs_defer_cancel(
> > > > +	struct xfs_defer_ops		*dop)
> > > > +{
> > > > +	struct xfs_defer_pending	*dfp;
> > > > +	struct xfs_defer_pending	*pli;
> > > > +	struct list_head		*pwi;
> > > > +	struct list_head		*n;
> > > > +
> > > > +	/*
> > > > +	 * Free the pending items.  Caller should already have arranged
> > > > +	 * for the intent items to be released.
> > > > +	 */
> > > > +	list_for_each_entry_safe(dfp, pli, &dop->dop_intake, dfp_list) {
> > > > +		list_del(&dfp->dfp_list);
> > > > +		list_for_each_safe(pwi, n, &dfp->dfp_work) {
> > > > +			list_del(pwi);
> > > > +			dfp->dfp_count--;
> > > > +			dfp->dfp_type->cancel_item(pwi);
> > > > +		}
> > > > +		ASSERT(dfp->dfp_count == 0);
> > > > +		kmem_free(dfp);
> > > > +	}
> > > > +	list_for_each_entry_safe(dfp, pli, &dop->dop_pending, dfp_list) {
> > > > +		list_del(&dfp->dfp_list);
> > > > +		list_for_each_safe(pwi, n, &dfp->dfp_work) {
> > > > +			list_del(pwi);
> > > > +			dfp->dfp_count--;
> > > > +			dfp->dfp_type->cancel_item(pwi);
> > > > +		}
> > > > +		ASSERT(dfp->dfp_count == 0);
> > > > +		kmem_free(dfp);
> > > > +	}
> > > > +}
> > > > +
> > > > +/* Add an item for later deferred processing. */
> > > > +void
> > > > +xfs_defer_add(
> > > > +	struct xfs_defer_ops		*dop,
> > > > +	enum xfs_defer_ops_type		type,
> > > > +	struct list_head		*li)
> > > > +{
> > > > +	struct xfs_defer_pending	*dfp = NULL;
> > > > +
> > > > +	/*
> > > > +	 * Add the item to a pending item at the end of the intake list.
> > > > +	 * If the last pending item has the same type, reuse it.  Else,
> > > > +	 * create a new pending item at the end of the intake list.
> > > > +	 */
> > > > +	if (!list_empty(&dop->dop_intake)) {
> > > > +		dfp = list_last_entry(&dop->dop_intake,
> > > > +				struct xfs_defer_pending, dfp_list);
> > > > +		if (dfp->dfp_type->type != type ||
> > > > +		    (dfp->dfp_type->max_items &&
> > > > +		     dfp->dfp_count >= dfp->dfp_type->max_items))
> > > > +			dfp = NULL;
> > > > +	}
> > > > +	if (!dfp) {
> > > > +		dfp = kmem_alloc(sizeof(struct xfs_defer_pending),
> > > > +				KM_SLEEP | KM_NOFS);
> > > > +		dfp->dfp_type = defer_op_types[type];
> > > > +		dfp->dfp_committed = false;
> > > > +		dfp->dfp_intent = NULL;
> > > > +		dfp->dfp_count = 0;
> > > > +		INIT_LIST_HEAD(&dfp->dfp_work);
> > > > +		list_add_tail(&dfp->dfp_list, &dop->dop_intake);
> > > > +	}
> > > > +
> > > > +	list_add_tail(li, &dfp->dfp_work);
> > > > +	dfp->dfp_count++;
> > > > +}
> > > > +
> > > > +/* Initialize a deferred operation list. */
> > > > +void
> > > > +xfs_defer_init_op_type(
> > > > +	const struct xfs_defer_op_type	*type)
> > > > +{
> > > > +	defer_op_types[type->type] = type;
> > > > +}
> > > > +
> > > > +/* Initialize a deferred operation. */
> > > > +void
> > > > +xfs_defer_init(
> > > > +	struct xfs_defer_ops		*dop,
> > > > +	xfs_fsblock_t			*fbp)
> > > > +{
> > > > +	dop->dop_committed = false;
> > > > +	dop->dop_low = false;
> > > > +	memset(&dop->dop_inodes, 0, sizeof(dop->dop_inodes));
> > > > +	*fbp = NULLFSBLOCK;
> > > > +	INIT_LIST_HEAD(&dop->dop_intake);
> > > > +	INIT_LIST_HEAD(&dop->dop_pending);
> > > > +}
> > > > diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
> > > > new file mode 100644
> > > > index 0000000..85c7a3a
> > > > --- /dev/null
> > > > +++ b/fs/xfs/libxfs/xfs_defer.h
> > > > @@ -0,0 +1,96 @@
> > > > +/*
> > > > + * Copyright (C) 2016 Oracle.  All Rights Reserved.
> > > > + *
> > > > + * Author: Darrick J. Wong <darrick.wong@oracle.com>
> > > > + *
> > > > + * This program is free software; you can redistribute it and/or
> > > > + * modify it under the terms of the GNU General Public License
> > > > + * as published by the Free Software Foundation; either version 2
> > > > + * of the License, or (at your option) any later version.
> > > > + *
> > > > + * This program is distributed in the hope that it would be useful,
> > > > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > > > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > > > + * GNU General Public License for more details.
> > > > + *
> > > > + * You should have received a copy of the GNU General Public License
> > > > + * along with this program; if not, write the Free Software Foundation,
> > > > + * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
> > > > + */
> > > > +#ifndef __XFS_DEFER_H__
> > > > +#define	__XFS_DEFER_H__
> > > > +
> > > > +struct xfs_defer_op_type;
> > > > +
> > > > +/*
> > > > + * Save a log intent item and a list of extents, so that we can replay
> > > > + * whatever action had to happen to the extent list and file the log done
> > > > + * item.
> > > > + */
> > > > +struct xfs_defer_pending {
> > > > +	const struct xfs_defer_op_type	*dfp_type;	/* function pointers */
> > > > +	struct list_head		dfp_list;	/* pending items */
> > > > +	bool				dfp_committed;	/* committed trans? */
> > > > +	void				*dfp_intent;	/* log intent item */
> > > > +	struct list_head		dfp_work;	/* work items */
> > > > +	unsigned int			dfp_count;	/* # extent items */
> > > > +};
> > > > +
> > > > +/*
> > > > + * Header for deferred operation list.
> > > > + *
> > > > + * dop_low is used by the allocator to activate the lowspace algorithm -
> > > > + * when free space is running low the extent allocator may choose to
> > > > + * allocate an extent from an AG without leaving sufficient space for
> > > > + * a btree split when inserting the new extent.  In this case the allocator
> > > > + * will enable the lowspace algorithm which is supposed to allow further
> > > > + * allocations (such as btree splits and newroots) to allocate from
> > > > + * sequential AGs.  In order to avoid locking AGs out of order the lowspace
> > > > + * algorithm will start searching for free space from AG 0.  If the correct
> > > > + * transaction reservations have been made then this algorithm will eventually
> > > > + * find all the space it needs.
> > > > + */
> > > > +enum xfs_defer_ops_type {
> > > > +	XFS_DEFER_OPS_TYPE_MAX,
> > > > +};
> > > > +
> > > > +#define XFS_DEFER_OPS_NR_INODES	2	/* join up to two inodes */
> > > > +
> > > > +struct xfs_defer_ops {
> > > > +	bool			dop_committed;	/* did any trans commit? */
> > > > +	bool			dop_low;	/* alloc in low mode */
> > > > +	struct list_head	dop_intake;	/* unlogged pending work */
> > > > +	struct list_head	dop_pending;	/* logged pending work */
> > > > +
> > > > +	/* relog these inodes with each roll */
> > > > +	struct xfs_inode	*dop_inodes[XFS_DEFER_OPS_NR_INODES];
> > > > +};
> > > > +
> > > > +void xfs_defer_add(struct xfs_defer_ops *dop, enum xfs_defer_ops_type type,
> > > > +		struct list_head *h);
> > > > +int xfs_defer_finish(struct xfs_trans **tp, struct xfs_defer_ops *dop,
> > > > +		struct xfs_inode *ip);
> > > > +void xfs_defer_cancel(struct xfs_defer_ops *dop);
> > > > +void xfs_defer_init(struct xfs_defer_ops *dop, xfs_fsblock_t *fbp);
> > > > +bool xfs_defer_has_unfinished_work(struct xfs_defer_ops *dop);
> > > > +int xfs_defer_join(struct xfs_defer_ops *dop, struct xfs_inode *ip);
> > > > +
> > > > +/* Description of a deferred type. */
> > > > +struct xfs_defer_op_type {
> > > > +	enum xfs_defer_ops_type	type;
> > > > +	unsigned int		max_items;
> > > > +	void (*abort_intent)(void *);
> > > > +	void *(*create_done)(struct xfs_trans *, void *, unsigned int);
> > > > +	int (*finish_item)(struct xfs_trans *, struct xfs_defer_ops *,
> > > > +			struct list_head *, void *, void **);
> > > > +	void (*finish_cleanup)(struct xfs_trans *, void *, int);
> > > > +	void (*cancel_item)(struct list_head *);
> > > > +	int (*diff_items)(void *, struct list_head *, struct list_head *);
> > > > +	void *(*create_intent)(struct xfs_trans *, uint);
> > > > +	void (*log_item)(struct xfs_trans *, void *, struct list_head *);
> > > > +};
> > > > +
> > > > +void xfs_defer_init_op_type(const struct xfs_defer_op_type *type);
> > > > +void xfs_defer_init_types(void);
> > > > +
> > > > +#endif /* __XFS_DEFER_H__ */
> > > > diff --git a/fs/xfs/xfs_defer_item.c b/fs/xfs/xfs_defer_item.c
> > > > new file mode 100644
> > > > index 0000000..849088d
> > > > --- /dev/null
> > > > +++ b/fs/xfs/xfs_defer_item.c
> > > > @@ -0,0 +1,36 @@
> > > > +/*
> > > > + * Copyright (C) 2016 Oracle.  All Rights Reserved.
> > > > + *
> > > > + * Author: Darrick J. Wong <darrick.wong@oracle.com>
> > > > + *
> > > > + * This program is free software; you can redistribute it and/or
> > > > + * modify it under the terms of the GNU General Public License
> > > > + * as published by the Free Software Foundation; either version 2
> > > > + * of the License, or (at your option) any later version.
> > > > + *
> > > > + * This program is distributed in the hope that it would be useful,
> > > > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > > > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > > > + * GNU General Public License for more details.
> > > > + *
> > > > + * You should have received a copy of the GNU General Public License
> > > > + * along with this program; if not, write the Free Software Foundation,
> > > > + * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
> > > > + */
> > > > +#include "xfs.h"
> > > > +#include "xfs_fs.h"
> > > > +#include "xfs_shared.h"
> > > > +#include "xfs_format.h"
> > > > +#include "xfs_log_format.h"
> > > > +#include "xfs_trans_resv.h"
> > > > +#include "xfs_bit.h"
> > > > +#include "xfs_sb.h"
> > > > +#include "xfs_mount.h"
> > > > +#include "xfs_defer.h"
> > > > +#include "xfs_trans.h"
> > > > +
> > > > +/* Initialize the deferred operation types. */
> > > > +void
> > > > +xfs_defer_init_types(void)
> > > > +{
> > > > +}
> > > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > > > index 09722a7..bf63f6d 100644
> > > > --- a/fs/xfs/xfs_super.c
> > > > +++ b/fs/xfs/xfs_super.c
> > > > @@ -46,6 +46,7 @@
> > > >  #include "xfs_quota.h"
> > > >  #include "xfs_sysfs.h"
> > > >  #include "xfs_ondisk.h"
> > > > +#include "xfs_defer.h"
> > > >  
> > > >  #include <linux/namei.h>
> > > >  #include <linux/init.h>
> > > > @@ -1850,6 +1851,7 @@ init_xfs_fs(void)
> > > >  	printk(KERN_INFO XFS_VERSION_STRING " with "
> > > >  			 XFS_BUILD_OPTIONS " enabled\n");
> > > >  
> > > > +	xfs_defer_init_types();
> > > >  	xfs_dir_startup();
> > > >  
> > > >  	error = xfs_init_zones();
> > > > 
> > 


* Re: [PATCH 001/119] vfs: fix return type of ioctl_file_dedupe_range
  2016-06-17 11:32   ` Christoph Hellwig
@ 2016-06-28 19:19     ` Darrick J. Wong
  0 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-28 19:19 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, vishal.l.verma, xfs

On Fri, Jun 17, 2016 at 04:32:23AM -0700, Christoph Hellwig wrote:
> On Thu, Jun 16, 2016 at 06:17:59PM -0700, Darrick J. Wong wrote:
> > All the VFS functions in the dedupe ioctl path return int status, so
> > the ioctl handler ought to as well.
> > 
> > Found by Coverity, CID 1350952.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> 
> This should go out to Al as a separate patch.

I sent it to him back in February, though he hasn't responded.

--D

> 


* Re: [PATCH 020/119] xfs: change xfs_bmap_{finish, cancel, init, free} -> xfs_defer_*
  2016-06-17  1:20 ` [PATCH 020/119] xfs: change xfs_bmap_{finish, cancel, init, free} -> xfs_defer_* Darrick J. Wong
@ 2016-06-30  0:11   ` Darrick J. Wong
  0 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-30  0:11 UTC (permalink / raw)
  To: david; +Cc: linux-fsdevel, vishal.l.verma, xfs

On Thu, Jun 16, 2016 at 06:20:00PM -0700, Darrick J. Wong wrote:
> Drop the compatibility shims that we were using to integrate the new
> deferred operation mechanism into the existing code.  No new code.

I've since renamed xfs_bmap_free_item to xfs_extent_free_item to
better reflect the increased separation between bmap and extent free.

--D

> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_attr.c        |   58 ++++++++++++++++++------------------
>  fs/xfs/libxfs/xfs_attr_remote.c |   14 ++++-----
>  fs/xfs/libxfs/xfs_bmap.c        |   38 ++++++++++++------------
>  fs/xfs/libxfs/xfs_bmap.h        |   10 +++---
>  fs/xfs/libxfs/xfs_btree.h       |    5 ++-
>  fs/xfs/libxfs/xfs_da_btree.h    |    4 +--
>  fs/xfs/libxfs/xfs_defer.h       |    7 ----
>  fs/xfs/libxfs/xfs_dir2.c        |    6 ++--
>  fs/xfs/libxfs/xfs_dir2.h        |    8 +++--
>  fs/xfs/libxfs/xfs_ialloc.c      |    6 ++--
>  fs/xfs/libxfs/xfs_ialloc.h      |    2 +
>  fs/xfs/libxfs/xfs_trans_resv.c  |    4 +--
>  fs/xfs/xfs_bmap_util.c          |   28 +++++++++---------
>  fs/xfs/xfs_dquot.c              |   10 +++---
>  fs/xfs/xfs_inode.c              |   62 ++++++++++++++++++++-------------------
>  fs/xfs/xfs_inode.h              |    4 +--
>  fs/xfs/xfs_iomap.c              |   24 ++++++++-------
>  fs/xfs/xfs_rtalloc.c            |    8 +++--
>  fs/xfs/xfs_symlink.c            |   16 +++++-----
>  19 files changed, 154 insertions(+), 160 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
> index 79d3a30..66baf97 100644
> --- a/fs/xfs/libxfs/xfs_attr.c
> +++ b/fs/xfs/libxfs/xfs_attr.c
> @@ -204,7 +204,7 @@ xfs_attr_set(
>  {
>  	struct xfs_mount	*mp = dp->i_mount;
>  	struct xfs_da_args	args;
> -	struct xfs_bmap_free	flist;
> +	struct xfs_defer_ops	flist;
>  	struct xfs_trans_res	tres;
>  	xfs_fsblock_t		firstblock;
>  	int			rsvd = (flags & ATTR_ROOT) != 0;
> @@ -317,13 +317,13 @@ xfs_attr_set(
>  		 * It won't fit in the shortform, transform to a leaf block.
>  		 * GROT: another possible req'mt for a double-split btree op.
>  		 */
> -		xfs_bmap_init(args.flist, args.firstblock);
> +		xfs_defer_init(args.flist, args.firstblock);
>  		error = xfs_attr_shortform_to_leaf(&args);
>  		if (!error)
> -			error = xfs_bmap_finish(&args.trans, args.flist, dp);
> +			error = xfs_defer_finish(&args.trans, args.flist, dp);
>  		if (error) {
>  			args.trans = NULL;
> -			xfs_bmap_cancel(&flist);
> +			xfs_defer_cancel(&flist);
>  			goto out;
>  		}
>  
> @@ -383,7 +383,7 @@ xfs_attr_remove(
>  {
>  	struct xfs_mount	*mp = dp->i_mount;
>  	struct xfs_da_args	args;
> -	struct xfs_bmap_free	flist;
> +	struct xfs_defer_ops	flist;
>  	xfs_fsblock_t		firstblock;
>  	int			error;
>  
> @@ -585,13 +585,13 @@ xfs_attr_leaf_addname(xfs_da_args_t *args)
>  		 * Commit that transaction so that the node_addname() call
>  		 * can manage its own transactions.
>  		 */
> -		xfs_bmap_init(args->flist, args->firstblock);
> +		xfs_defer_init(args->flist, args->firstblock);
>  		error = xfs_attr3_leaf_to_node(args);
>  		if (!error)
> -			error = xfs_bmap_finish(&args->trans, args->flist, dp);
> +			error = xfs_defer_finish(&args->trans, args->flist, dp);
>  		if (error) {
>  			args->trans = NULL;
> -			xfs_bmap_cancel(args->flist);
> +			xfs_defer_cancel(args->flist);
>  			return error;
>  		}
>  
> @@ -675,15 +675,15 @@ xfs_attr_leaf_addname(xfs_da_args_t *args)
>  		 * If the result is small enough, shrink it all into the inode.
>  		 */
>  		if ((forkoff = xfs_attr_shortform_allfit(bp, dp))) {
> -			xfs_bmap_init(args->flist, args->firstblock);
> +			xfs_defer_init(args->flist, args->firstblock);
>  			error = xfs_attr3_leaf_to_shortform(bp, args, forkoff);
>  			/* bp is gone due to xfs_da_shrink_inode */
>  			if (!error)
> -				error = xfs_bmap_finish(&args->trans,
> +				error = xfs_defer_finish(&args->trans,
>  							args->flist, dp);
>  			if (error) {
>  				args->trans = NULL;
> -				xfs_bmap_cancel(args->flist);
> +				xfs_defer_cancel(args->flist);
>  				return error;
>  			}
>  		}
> @@ -738,14 +738,14 @@ xfs_attr_leaf_removename(xfs_da_args_t *args)
>  	 * If the result is small enough, shrink it all into the inode.
>  	 */
>  	if ((forkoff = xfs_attr_shortform_allfit(bp, dp))) {
> -		xfs_bmap_init(args->flist, args->firstblock);
> +		xfs_defer_init(args->flist, args->firstblock);
>  		error = xfs_attr3_leaf_to_shortform(bp, args, forkoff);
>  		/* bp is gone due to xfs_da_shrink_inode */
>  		if (!error)
> -			error = xfs_bmap_finish(&args->trans, args->flist, dp);
> +			error = xfs_defer_finish(&args->trans, args->flist, dp);
>  		if (error) {
>  			args->trans = NULL;
> -			xfs_bmap_cancel(args->flist);
> +			xfs_defer_cancel(args->flist);
>  			return error;
>  		}
>  	}
> @@ -864,14 +864,14 @@ restart:
>  			 */
>  			xfs_da_state_free(state);
>  			state = NULL;
> -			xfs_bmap_init(args->flist, args->firstblock);
> +			xfs_defer_init(args->flist, args->firstblock);
>  			error = xfs_attr3_leaf_to_node(args);
>  			if (!error)
> -				error = xfs_bmap_finish(&args->trans,
> +				error = xfs_defer_finish(&args->trans,
>  							args->flist, dp);
>  			if (error) {
>  				args->trans = NULL;
> -				xfs_bmap_cancel(args->flist);
> +				xfs_defer_cancel(args->flist);
>  				goto out;
>  			}
>  
> @@ -892,13 +892,13 @@ restart:
>  		 * in the index/blkno/rmtblkno/rmtblkcnt fields and
>  		 * in the index2/blkno2/rmtblkno2/rmtblkcnt2 fields.
>  		 */
> -		xfs_bmap_init(args->flist, args->firstblock);
> +		xfs_defer_init(args->flist, args->firstblock);
>  		error = xfs_da3_split(state);
>  		if (!error)
> -			error = xfs_bmap_finish(&args->trans, args->flist, dp);
> +			error = xfs_defer_finish(&args->trans, args->flist, dp);
>  		if (error) {
>  			args->trans = NULL;
> -			xfs_bmap_cancel(args->flist);
> +			xfs_defer_cancel(args->flist);
>  			goto out;
>  		}
>  	} else {
> @@ -991,14 +991,14 @@ restart:
>  		 * Check to see if the tree needs to be collapsed.
>  		 */
>  		if (retval && (state->path.active > 1)) {
> -			xfs_bmap_init(args->flist, args->firstblock);
> +			xfs_defer_init(args->flist, args->firstblock);
>  			error = xfs_da3_join(state);
>  			if (!error)
> -				error = xfs_bmap_finish(&args->trans,
> +				error = xfs_defer_finish(&args->trans,
>  							args->flist, dp);
>  			if (error) {
>  				args->trans = NULL;
> -				xfs_bmap_cancel(args->flist);
> +				xfs_defer_cancel(args->flist);
>  				goto out;
>  			}
>  		}
> @@ -1114,13 +1114,13 @@ xfs_attr_node_removename(xfs_da_args_t *args)
>  	 * Check to see if the tree needs to be collapsed.
>  	 */
>  	if (retval && (state->path.active > 1)) {
> -		xfs_bmap_init(args->flist, args->firstblock);
> +		xfs_defer_init(args->flist, args->firstblock);
>  		error = xfs_da3_join(state);
>  		if (!error)
> -			error = xfs_bmap_finish(&args->trans, args->flist, dp);
> +			error = xfs_defer_finish(&args->trans, args->flist, dp);
>  		if (error) {
>  			args->trans = NULL;
> -			xfs_bmap_cancel(args->flist);
> +			xfs_defer_cancel(args->flist);
>  			goto out;
>  		}
>  		/*
> @@ -1147,15 +1147,15 @@ xfs_attr_node_removename(xfs_da_args_t *args)
>  			goto out;
>  
>  		if ((forkoff = xfs_attr_shortform_allfit(bp, dp))) {
> -			xfs_bmap_init(args->flist, args->firstblock);
> +			xfs_defer_init(args->flist, args->firstblock);
>  			error = xfs_attr3_leaf_to_shortform(bp, args, forkoff);
>  			/* bp is gone due to xfs_da_shrink_inode */
>  			if (!error)
> -				error = xfs_bmap_finish(&args->trans,
> +				error = xfs_defer_finish(&args->trans,
>  							args->flist, dp);
>  			if (error) {
>  				args->trans = NULL;
> -				xfs_bmap_cancel(args->flist);
> +				xfs_defer_cancel(args->flist);
>  				goto out;
>  			}
>  		} else
> diff --git a/fs/xfs/libxfs/xfs_attr_remote.c b/fs/xfs/libxfs/xfs_attr_remote.c
> index 93a9ce1..aabb516 100644
> --- a/fs/xfs/libxfs/xfs_attr_remote.c
> +++ b/fs/xfs/libxfs/xfs_attr_remote.c
> @@ -461,16 +461,16 @@ xfs_attr_rmtval_set(
>  		 * extent and then crash then the block may not contain the
>  		 * correct metadata after log recovery occurs.
>  		 */
> -		xfs_bmap_init(args->flist, args->firstblock);
> +		xfs_defer_init(args->flist, args->firstblock);
>  		nmap = 1;
>  		error = xfs_bmapi_write(args->trans, dp, (xfs_fileoff_t)lblkno,
>  				  blkcnt, XFS_BMAPI_ATTRFORK, args->firstblock,
>  				  args->total, &map, &nmap, args->flist);
>  		if (!error)
> -			error = xfs_bmap_finish(&args->trans, args->flist, dp);
> +			error = xfs_defer_finish(&args->trans, args->flist, dp);
>  		if (error) {
>  			args->trans = NULL;
> -			xfs_bmap_cancel(args->flist);
> +			xfs_defer_cancel(args->flist);
>  			return error;
>  		}
>  
> @@ -504,7 +504,7 @@ xfs_attr_rmtval_set(
>  
>  		ASSERT(blkcnt > 0);
>  
> -		xfs_bmap_init(args->flist, args->firstblock);
> +		xfs_defer_init(args->flist, args->firstblock);
>  		nmap = 1;
>  		error = xfs_bmapi_read(dp, (xfs_fileoff_t)lblkno,
>  				       blkcnt, &map, &nmap,
> @@ -604,16 +604,16 @@ xfs_attr_rmtval_remove(
>  	blkcnt = args->rmtblkcnt;
>  	done = 0;
>  	while (!done) {
> -		xfs_bmap_init(args->flist, args->firstblock);
> +		xfs_defer_init(args->flist, args->firstblock);
>  		error = xfs_bunmapi(args->trans, args->dp, lblkno, blkcnt,
>  				    XFS_BMAPI_ATTRFORK, 1, args->firstblock,
>  				    args->flist, &done);
>  		if (!error)
> -			error = xfs_bmap_finish(&args->trans, args->flist,
> +			error = xfs_defer_finish(&args->trans, args->flist,
>  						args->dp);
>  		if (error) {
>  			args->trans = NULL;
> -			xfs_bmap_cancel(args->flist);
> +			xfs_defer_cancel(args->flist);
>  			return error;
>  		}
>  
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 64ca97f..45ce7bd 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -572,7 +572,7 @@ xfs_bmap_validate_ret(
>  void
>  xfs_bmap_add_free(
>  	struct xfs_mount	*mp,		/* mount point structure */
> -	struct xfs_bmap_free	*flist,		/* list of extents */
> +	struct xfs_defer_ops	*flist,		/* list of extents */
>  	xfs_fsblock_t		bno,		/* fs block number of extent */
>  	xfs_filblks_t		len)		/* length of extent */
>  {
> @@ -672,7 +672,7 @@ xfs_bmap_extents_to_btree(
>  	xfs_trans_t		*tp,		/* transaction pointer */
>  	xfs_inode_t		*ip,		/* incore inode pointer */
>  	xfs_fsblock_t		*firstblock,	/* first-block-allocated */
> -	xfs_bmap_free_t		*flist,		/* blocks freed in xaction */
> +	struct xfs_defer_ops	*flist,		/* blocks freed in xaction */
>  	xfs_btree_cur_t		**curp,		/* cursor returned to caller */
>  	int			wasdel,		/* converting a delayed alloc */
>  	int			*logflagsp,	/* inode logging flags */
> @@ -940,7 +940,7 @@ xfs_bmap_add_attrfork_btree(
>  	xfs_trans_t		*tp,		/* transaction pointer */
>  	xfs_inode_t		*ip,		/* incore inode pointer */
>  	xfs_fsblock_t		*firstblock,	/* first block allocated */
> -	xfs_bmap_free_t		*flist,		/* blocks to free at commit */
> +	struct xfs_defer_ops	*flist,		/* blocks to free at commit */
>  	int			*flags)		/* inode logging flags */
>  {
>  	xfs_btree_cur_t		*cur;		/* btree cursor */
> @@ -983,7 +983,7 @@ xfs_bmap_add_attrfork_extents(
>  	xfs_trans_t		*tp,		/* transaction pointer */
>  	xfs_inode_t		*ip,		/* incore inode pointer */
>  	xfs_fsblock_t		*firstblock,	/* first block allocated */
> -	xfs_bmap_free_t		*flist,		/* blocks to free at commit */
> +	struct xfs_defer_ops	*flist,		/* blocks to free at commit */
>  	int			*flags)		/* inode logging flags */
>  {
>  	xfs_btree_cur_t		*cur;		/* bmap btree cursor */
> @@ -1018,7 +1018,7 @@ xfs_bmap_add_attrfork_local(
>  	xfs_trans_t		*tp,		/* transaction pointer */
>  	xfs_inode_t		*ip,		/* incore inode pointer */
>  	xfs_fsblock_t		*firstblock,	/* first block allocated */
> -	xfs_bmap_free_t		*flist,		/* blocks to free at commit */
> +	struct xfs_defer_ops	*flist,		/* blocks to free at commit */
>  	int			*flags)		/* inode logging flags */
>  {
>  	xfs_da_args_t		dargs;		/* args for dir/attr code */
> @@ -1059,7 +1059,7 @@ xfs_bmap_add_attrfork(
>  	int			rsvd)		/* xact may use reserved blks */
>  {
>  	xfs_fsblock_t		firstblock;	/* 1st block/ag allocated */
> -	xfs_bmap_free_t		flist;		/* freed extent records */
> +	struct xfs_defer_ops	flist;		/* freed extent records */
>  	xfs_mount_t		*mp;		/* mount structure */
>  	xfs_trans_t		*tp;		/* transaction pointer */
>  	int			blks;		/* space reservation */
> @@ -1125,7 +1125,7 @@ xfs_bmap_add_attrfork(
>  	ip->i_afp = kmem_zone_zalloc(xfs_ifork_zone, KM_SLEEP);
>  	ip->i_afp->if_flags = XFS_IFEXTENTS;
>  	logflags = 0;
> -	xfs_bmap_init(&flist, &firstblock);
> +	xfs_defer_init(&flist, &firstblock);
>  	switch (ip->i_d.di_format) {
>  	case XFS_DINODE_FMT_LOCAL:
>  		error = xfs_bmap_add_attrfork_local(tp, ip, &firstblock, &flist,
> @@ -1165,7 +1165,7 @@ xfs_bmap_add_attrfork(
>  			xfs_log_sb(tp);
>  	}
>  
> -	error = xfs_bmap_finish(&tp, &flist, NULL);
> +	error = xfs_defer_finish(&tp, &flist, NULL);
>  	if (error)
>  		goto bmap_cancel;
>  	error = xfs_trans_commit(tp);
> @@ -1173,7 +1173,7 @@ xfs_bmap_add_attrfork(
>  	return error;
>  
>  bmap_cancel:
> -	xfs_bmap_cancel(&flist);
> +	xfs_defer_cancel(&flist);
>  trans_cancel:
>  	xfs_trans_cancel(tp);
>  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> @@ -2214,7 +2214,7 @@ xfs_bmap_add_extent_unwritten_real(
>  	xfs_btree_cur_t		**curp,	/* if *curp is null, not a btree */
>  	xfs_bmbt_irec_t		*new,	/* new data to add to file extents */
>  	xfs_fsblock_t		*first,	/* pointer to firstblock variable */
> -	xfs_bmap_free_t		*flist,	/* list of extents to be freed */
> +	struct xfs_defer_ops	*flist,	/* list of extents to be freed */
>  	int			*logflagsp) /* inode logging flags */
>  {
>  	xfs_btree_cur_t		*cur;	/* btree cursor */
> @@ -4447,7 +4447,7 @@ xfs_bmapi_write(
>  	xfs_extlen_t		total,		/* total blocks needed */
>  	struct xfs_bmbt_irec	*mval,		/* output: map values */
>  	int			*nmap,		/* i/o: mval size/count */
> -	struct xfs_bmap_free	*flist)		/* i/o: list extents to free */
> +	struct xfs_defer_ops	*flist)		/* i/o: list extents to free */
>  {
>  	struct xfs_mount	*mp = ip->i_mount;
>  	struct xfs_ifork	*ifp;
> @@ -4735,7 +4735,7 @@ xfs_bmap_del_extent(
>  	xfs_inode_t		*ip,	/* incore inode pointer */
>  	xfs_trans_t		*tp,	/* current transaction pointer */
>  	xfs_extnum_t		*idx,	/* extent number to update/delete */
> -	xfs_bmap_free_t		*flist,	/* list of extents to be freed */
> +	struct xfs_defer_ops	*flist,	/* list of extents to be freed */
>  	xfs_btree_cur_t		*cur,	/* if null, not a btree */
>  	xfs_bmbt_irec_t		*del,	/* data to remove from extents */
>  	int			*logflagsp, /* inode logging flags */
> @@ -5064,7 +5064,7 @@ xfs_bunmapi(
>  	xfs_extnum_t		nexts,		/* number of extents max */
>  	xfs_fsblock_t		*firstblock,	/* first allocated block
>  						   controls a.g. for allocs */
> -	xfs_bmap_free_t		*flist,		/* i/o: list extents to free */
> +	struct xfs_defer_ops	*flist,		/* i/o: list extents to free */
>  	int			*done)		/* set if not done yet */
>  {
>  	xfs_btree_cur_t		*cur;		/* bmap btree cursor */
> @@ -5678,7 +5678,7 @@ xfs_bmap_shift_extents(
>  	int			*done,
>  	xfs_fileoff_t		stop_fsb,
>  	xfs_fsblock_t		*firstblock,
> -	struct xfs_bmap_free	*flist,
> +	struct xfs_defer_ops	*flist,
>  	enum shift_direction	direction,
>  	int			num_exts)
>  {
> @@ -5832,7 +5832,7 @@ xfs_bmap_split_extent_at(
>  	struct xfs_inode	*ip,
>  	xfs_fileoff_t		split_fsb,
>  	xfs_fsblock_t		*firstfsb,
> -	struct xfs_bmap_free	*free_list)
> +	struct xfs_defer_ops	*free_list)
>  {
>  	int				whichfork = XFS_DATA_FORK;
>  	struct xfs_btree_cur		*cur = NULL;
> @@ -5971,7 +5971,7 @@ xfs_bmap_split_extent(
>  {
>  	struct xfs_mount        *mp = ip->i_mount;
>  	struct xfs_trans        *tp;
> -	struct xfs_bmap_free    free_list;
> +	struct xfs_defer_ops    free_list;
>  	xfs_fsblock_t           firstfsb;
>  	int                     error;
>  
> @@ -5983,21 +5983,21 @@ xfs_bmap_split_extent(
>  	xfs_ilock(ip, XFS_ILOCK_EXCL);
>  	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
>  
> -	xfs_bmap_init(&free_list, &firstfsb);
> +	xfs_defer_init(&free_list, &firstfsb);
>  
>  	error = xfs_bmap_split_extent_at(tp, ip, split_fsb,
>  			&firstfsb, &free_list);
>  	if (error)
>  		goto out;
>  
> -	error = xfs_bmap_finish(&tp, &free_list, NULL);
> +	error = xfs_defer_finish(&tp, &free_list, NULL);
>  	if (error)
>  		goto out;
>  
>  	return xfs_trans_commit(tp);
>  
>  out:
> -	xfs_bmap_cancel(&free_list);
> +	xfs_defer_cancel(&free_list);
>  	xfs_trans_cancel(tp);
>  	return error;
>  }
> diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
> index 6681bd9..e2a0425 100644
> --- a/fs/xfs/libxfs/xfs_bmap.h
> +++ b/fs/xfs/libxfs/xfs_bmap.h
> @@ -32,7 +32,7 @@ extern kmem_zone_t	*xfs_bmap_free_item_zone;
>   */
>  struct xfs_bmalloca {
>  	xfs_fsblock_t		*firstblock; /* i/o first block allocated */
> -	struct xfs_bmap_free	*flist;	/* bmap freelist */
> +	struct xfs_defer_ops	*flist;	/* bmap freelist */
>  	struct xfs_trans	*tp;	/* transaction pointer */
>  	struct xfs_inode	*ip;	/* incore inode pointer */
>  	struct xfs_bmbt_irec	prev;	/* extent before the new one */
> @@ -164,7 +164,7 @@ void	xfs_bmap_trace_exlist(struct xfs_inode *ip, xfs_extnum_t cnt,
>  
>  int	xfs_bmap_add_attrfork(struct xfs_inode *ip, int size, int rsvd);
>  void	xfs_bmap_local_to_extents_empty(struct xfs_inode *ip, int whichfork);
> -void	xfs_bmap_add_free(struct xfs_mount *mp, struct xfs_bmap_free *flist,
> +void	xfs_bmap_add_free(struct xfs_mount *mp, struct xfs_defer_ops *flist,
>  			  xfs_fsblock_t bno, xfs_filblks_t len);
>  void	xfs_bmap_compute_maxlevels(struct xfs_mount *mp, int whichfork);
>  int	xfs_bmap_first_unused(struct xfs_trans *tp, struct xfs_inode *ip,
> @@ -186,18 +186,18 @@ int	xfs_bmapi_write(struct xfs_trans *tp, struct xfs_inode *ip,
>  		xfs_fileoff_t bno, xfs_filblks_t len, int flags,
>  		xfs_fsblock_t *firstblock, xfs_extlen_t total,
>  		struct xfs_bmbt_irec *mval, int *nmap,
> -		struct xfs_bmap_free *flist);
> +		struct xfs_defer_ops *flist);
>  int	xfs_bunmapi(struct xfs_trans *tp, struct xfs_inode *ip,
>  		xfs_fileoff_t bno, xfs_filblks_t len, int flags,
>  		xfs_extnum_t nexts, xfs_fsblock_t *firstblock,
> -		struct xfs_bmap_free *flist, int *done);
> +		struct xfs_defer_ops *flist, int *done);
>  int	xfs_check_nostate_extents(struct xfs_ifork *ifp, xfs_extnum_t idx,
>  		xfs_extnum_t num);
>  uint	xfs_default_attroffset(struct xfs_inode *ip);
>  int	xfs_bmap_shift_extents(struct xfs_trans *tp, struct xfs_inode *ip,
>  		xfs_fileoff_t *next_fsb, xfs_fileoff_t offset_shift_fsb,
>  		int *done, xfs_fileoff_t stop_fsb, xfs_fsblock_t *firstblock,
> -		struct xfs_bmap_free *flist, enum shift_direction direction,
> +		struct xfs_defer_ops *flist, enum shift_direction direction,
>  		int num_exts);
>  int	xfs_bmap_split_extent(struct xfs_inode *ip, xfs_fileoff_t split_offset);
>  
> diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> index 0ec3055..ae714a8 100644
> --- a/fs/xfs/libxfs/xfs_btree.h
> +++ b/fs/xfs/libxfs/xfs_btree.h
> @@ -19,7 +19,7 @@
>  #define	__XFS_BTREE_H__
>  
>  struct xfs_buf;
> -struct xfs_bmap_free;
> +struct xfs_defer_ops;
>  struct xfs_inode;
>  struct xfs_mount;
>  struct xfs_trans;
> @@ -234,11 +234,12 @@ typedef struct xfs_btree_cur
>  	union {
>  		struct {			/* needed for BNO, CNT, INO */
>  			struct xfs_buf	*agbp;	/* agf/agi buffer pointer */
> +			struct xfs_defer_ops *flist;	/* deferred updates */
>  			xfs_agnumber_t	agno;	/* ag number */
>  		} a;
>  		struct {			/* needed for BMAP */
>  			struct xfs_inode *ip;	/* pointer to our inode */
> -			struct xfs_bmap_free *flist;	/* list to free after */
> +			struct xfs_defer_ops *flist;	/* deferred updates */
>  			xfs_fsblock_t	firstblock;	/* 1st blk allocated */
>  			int		allocated;	/* count of alloced */
>  			short		forksize;	/* fork's inode space */
> diff --git a/fs/xfs/libxfs/xfs_da_btree.h b/fs/xfs/libxfs/xfs_da_btree.h
> index 6e153e3..249813a 100644
> --- a/fs/xfs/libxfs/xfs_da_btree.h
> +++ b/fs/xfs/libxfs/xfs_da_btree.h
> @@ -19,7 +19,7 @@
>  #ifndef __XFS_DA_BTREE_H__
>  #define	__XFS_DA_BTREE_H__
>  
> -struct xfs_bmap_free;
> +struct xfs_defer_ops;
>  struct xfs_inode;
>  struct xfs_trans;
>  struct zone;
> @@ -70,7 +70,7 @@ typedef struct xfs_da_args {
>  	xfs_ino_t	inumber;	/* input/output inode number */
>  	struct xfs_inode *dp;		/* directory inode to manipulate */
>  	xfs_fsblock_t	*firstblock;	/* ptr to firstblock for bmap calls */
> -	struct xfs_bmap_free *flist;	/* ptr to freelist for bmap_finish */
> +	struct xfs_defer_ops *flist;	/* ptr to freelist for bmap_finish */
>  	struct xfs_trans *trans;	/* current trans (changes over time) */
>  	xfs_extlen_t	total;		/* total blocks needed, for 1st bmap */
>  	int		whichfork;	/* data or attribute fork */
> diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
> index 4c05ba6..743fc32 100644
> --- a/fs/xfs/libxfs/xfs_defer.h
> +++ b/fs/xfs/libxfs/xfs_defer.h
> @@ -94,11 +94,4 @@ struct xfs_defer_op_type {
>  void xfs_defer_init_op_type(const struct xfs_defer_op_type *type);
>  void xfs_defer_init_types(void);
>  
> -/* XXX: compatibility shims, will go away in the next patch */
> -#define xfs_bmap_finish		xfs_defer_finish
> -#define xfs_bmap_cancel		xfs_defer_cancel
> -#define xfs_bmap_init		xfs_defer_init
> -#define xfs_bmap_free		xfs_defer_ops
> -typedef struct xfs_defer_ops	xfs_bmap_free_t;
> -
>  #endif /* __XFS_DEFER_H__ */
> diff --git a/fs/xfs/libxfs/xfs_dir2.c b/fs/xfs/libxfs/xfs_dir2.c
> index 945c0345..0523100 100644
> --- a/fs/xfs/libxfs/xfs_dir2.c
> +++ b/fs/xfs/libxfs/xfs_dir2.c
> @@ -260,7 +260,7 @@ xfs_dir_createname(
>  	struct xfs_name		*name,
>  	xfs_ino_t		inum,		/* new entry inode number */
>  	xfs_fsblock_t		*first,		/* bmap's firstblock */
> -	xfs_bmap_free_t		*flist,		/* bmap's freeblock list */
> +	struct xfs_defer_ops	*flist,		/* bmap's freeblock list */
>  	xfs_extlen_t		total)		/* bmap's total block count */
>  {
>  	struct xfs_da_args	*args;
> @@ -437,7 +437,7 @@ xfs_dir_removename(
>  	struct xfs_name	*name,
>  	xfs_ino_t	ino,
>  	xfs_fsblock_t	*first,		/* bmap's firstblock */
> -	xfs_bmap_free_t	*flist,		/* bmap's freeblock list */
> +	struct xfs_defer_ops	*flist,		/* bmap's freeblock list */
>  	xfs_extlen_t	total)		/* bmap's total block count */
>  {
>  	struct xfs_da_args *args;
> @@ -499,7 +499,7 @@ xfs_dir_replace(
>  	struct xfs_name	*name,		/* name of entry to replace */
>  	xfs_ino_t	inum,		/* new inode number */
>  	xfs_fsblock_t	*first,		/* bmap's firstblock */
> -	xfs_bmap_free_t	*flist,		/* bmap's freeblock list */
> +	struct xfs_defer_ops	*flist,		/* bmap's freeblock list */
>  	xfs_extlen_t	total)		/* bmap's total block count */
>  {
>  	struct xfs_da_args *args;
> diff --git a/fs/xfs/libxfs/xfs_dir2.h b/fs/xfs/libxfs/xfs_dir2.h
> index 0a62e73..5737d85 100644
> --- a/fs/xfs/libxfs/xfs_dir2.h
> +++ b/fs/xfs/libxfs/xfs_dir2.h
> @@ -18,7 +18,7 @@
>  #ifndef __XFS_DIR2_H__
>  #define __XFS_DIR2_H__
>  
> -struct xfs_bmap_free;
> +struct xfs_defer_ops;
>  struct xfs_da_args;
>  struct xfs_inode;
>  struct xfs_mount;
> @@ -129,18 +129,18 @@ extern int xfs_dir_init(struct xfs_trans *tp, struct xfs_inode *dp,
>  extern int xfs_dir_createname(struct xfs_trans *tp, struct xfs_inode *dp,
>  				struct xfs_name *name, xfs_ino_t inum,
>  				xfs_fsblock_t *first,
> -				struct xfs_bmap_free *flist, xfs_extlen_t tot);
> +				struct xfs_defer_ops *flist, xfs_extlen_t tot);
>  extern int xfs_dir_lookup(struct xfs_trans *tp, struct xfs_inode *dp,
>  				struct xfs_name *name, xfs_ino_t *inum,
>  				struct xfs_name *ci_name);
>  extern int xfs_dir_removename(struct xfs_trans *tp, struct xfs_inode *dp,
>  				struct xfs_name *name, xfs_ino_t ino,
>  				xfs_fsblock_t *first,
> -				struct xfs_bmap_free *flist, xfs_extlen_t tot);
> +				struct xfs_defer_ops *flist, xfs_extlen_t tot);
>  extern int xfs_dir_replace(struct xfs_trans *tp, struct xfs_inode *dp,
>  				struct xfs_name *name, xfs_ino_t inum,
>  				xfs_fsblock_t *first,
> -				struct xfs_bmap_free *flist, xfs_extlen_t tot);
> +				struct xfs_defer_ops *flist, xfs_extlen_t tot);
>  extern int xfs_dir_canenter(struct xfs_trans *tp, struct xfs_inode *dp,
>  				struct xfs_name *name);
>  
> diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
> index 9ae9a43..f2e29a1 100644
> --- a/fs/xfs/libxfs/xfs_ialloc.c
> +++ b/fs/xfs/libxfs/xfs_ialloc.c
> @@ -1818,7 +1818,7 @@ xfs_difree_inode_chunk(
>  	struct xfs_mount		*mp,
>  	xfs_agnumber_t			agno,
>  	struct xfs_inobt_rec_incore	*rec,
> -	struct xfs_bmap_free		*flist)
> +	struct xfs_defer_ops		*flist)
>  {
>  	xfs_agblock_t	sagbno = XFS_AGINO_TO_AGBNO(mp, rec->ir_startino);
>  	int		startidx, endidx;
> @@ -1890,7 +1890,7 @@ xfs_difree_inobt(
>  	struct xfs_trans		*tp,
>  	struct xfs_buf			*agbp,
>  	xfs_agino_t			agino,
> -	struct xfs_bmap_free		*flist,
> +	struct xfs_defer_ops		*flist,
>  	struct xfs_icluster		*xic,
>  	struct xfs_inobt_rec_incore	*orec)
>  {
> @@ -2122,7 +2122,7 @@ int
>  xfs_difree(
>  	struct xfs_trans	*tp,		/* transaction pointer */
>  	xfs_ino_t		inode,		/* inode to be freed */
> -	struct xfs_bmap_free	*flist,		/* extents to free */
> +	struct xfs_defer_ops	*flist,		/* extents to free */
>  	struct xfs_icluster	*xic)	/* cluster info if deleted */
>  {
>  	/* REFERENCED */
> diff --git a/fs/xfs/libxfs/xfs_ialloc.h b/fs/xfs/libxfs/xfs_ialloc.h
> index 6e450df..2e06b67 100644
> --- a/fs/xfs/libxfs/xfs_ialloc.h
> +++ b/fs/xfs/libxfs/xfs_ialloc.h
> @@ -95,7 +95,7 @@ int					/* error */
>  xfs_difree(
>  	struct xfs_trans *tp,		/* transaction pointer */
>  	xfs_ino_t	inode,		/* inode to be freed */
> -	struct xfs_bmap_free *flist,	/* extents to free */
> +	struct xfs_defer_ops *flist,	/* extents to free */
>  	struct xfs_icluster *ifree);	/* cluster info if deleted */
>  
>  /*
> diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
> index 68cb1e7..4c7eb9d 100644
> --- a/fs/xfs/libxfs/xfs_trans_resv.c
> +++ b/fs/xfs/libxfs/xfs_trans_resv.c
> @@ -153,9 +153,9 @@ xfs_calc_finobt_res(
>   * item logged to try to account for the overhead of the transaction mechanism.
>   *
>   * Note:  Most of the reservations underestimate the number of allocation
> - * groups into which they could free extents in the xfs_bmap_finish() call.
> + * groups into which they could free extents in the xfs_defer_finish() call.
>   * This is because the number in the worst case is quite high and quite
> - * unusual.  In order to fix this we need to change xfs_bmap_finish() to free
> + * unusual.  In order to fix this we need to change xfs_defer_finish() to free
>   * extents in only a single AG at a time.  This will require changes to the
>   * EFI code as well, however, so that the EFI for the extents not freed is
>   * logged again in each transaction.  See SGI PV #261917.
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index 972a27a..928dfa4 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -685,7 +685,7 @@ xfs_bmap_punch_delalloc_range(
>  		xfs_bmbt_irec_t	imap;
>  		int		nimaps = 1;
>  		xfs_fsblock_t	firstblock;
> -		xfs_bmap_free_t flist;
> +		struct xfs_defer_ops flist;
>  
>  		/*
>  		 * Map the range first and check that it is a delalloc extent
> @@ -721,7 +721,7 @@ xfs_bmap_punch_delalloc_range(
>  		 * allocated or freed for a delalloc extent and hence we need
>  		 * don't cancel or finish them after the xfs_bunmapi() call.
>  		 */
> -		xfs_bmap_init(&flist, &firstblock);
> +		xfs_defer_init(&flist, &firstblock);
>  		error = xfs_bunmapi(NULL, ip, start_fsb, 1, 0, 1, &firstblock,
>  					&flist, &done);
>  		if (error)
> @@ -884,7 +884,7 @@ xfs_alloc_file_space(
>  	int			rt;
>  	xfs_trans_t		*tp;
>  	xfs_bmbt_irec_t		imaps[1], *imapp;
> -	xfs_bmap_free_t		free_list;
> +	struct xfs_defer_ops	free_list;
>  	uint			qblocks, resblks, resrtextents;
>  	int			error;
>  
> @@ -975,7 +975,7 @@ xfs_alloc_file_space(
>  
>  		xfs_trans_ijoin(tp, ip, 0);
>  
> -		xfs_bmap_init(&free_list, &firstfsb);
> +		xfs_defer_init(&free_list, &firstfsb);
>  		error = xfs_bmapi_write(tp, ip, startoffset_fsb,
>  					allocatesize_fsb, alloc_type, &firstfsb,
>  					resblks, imapp, &nimaps, &free_list);
> @@ -985,7 +985,7 @@ xfs_alloc_file_space(
>  		/*
>  		 * Complete the transaction
>  		 */
> -		error = xfs_bmap_finish(&tp, &free_list, NULL);
> +		error = xfs_defer_finish(&tp, &free_list, NULL);
>  		if (error)
>  			goto error0;
>  
> @@ -1008,7 +1008,7 @@ xfs_alloc_file_space(
>  	return error;
>  
>  error0:	/* Cancel bmap, unlock inode, unreserve quota blocks, cancel trans */
> -	xfs_bmap_cancel(&free_list);
> +	xfs_defer_cancel(&free_list);
>  	xfs_trans_unreserve_quota_nblks(tp, ip, (long)qblocks, 0, quota_flag);
>  
>  error1:	/* Just cancel transaction */
> @@ -1122,7 +1122,7 @@ xfs_free_file_space(
>  	xfs_fileoff_t		endoffset_fsb;
>  	int			error;
>  	xfs_fsblock_t		firstfsb;
> -	xfs_bmap_free_t		free_list;
> +	struct xfs_defer_ops	free_list;
>  	xfs_bmbt_irec_t		imap;
>  	xfs_off_t		ioffset;
>  	xfs_off_t		iendoffset;
> @@ -1245,7 +1245,7 @@ xfs_free_file_space(
>  		/*
>  		 * issue the bunmapi() call to free the blocks
>  		 */
> -		xfs_bmap_init(&free_list, &firstfsb);
> +		xfs_defer_init(&free_list, &firstfsb);
>  		error = xfs_bunmapi(tp, ip, startoffset_fsb,
>  				  endoffset_fsb - startoffset_fsb,
>  				  0, 2, &firstfsb, &free_list, &done);
> @@ -1255,7 +1255,7 @@ xfs_free_file_space(
>  		/*
>  		 * complete the transaction
>  		 */
> -		error = xfs_bmap_finish(&tp, &free_list, NULL);
> -		error = xfs_defer_finish(&tp, &free_list, NULL);
>  		if (error)
>  			goto error0;
>  
> @@ -1267,7 +1267,7 @@ xfs_free_file_space(
>  	return error;
>  
>   error0:
> -	xfs_bmap_cancel(&free_list);
> +	xfs_defer_cancel(&free_list);
>   error1:
>  	xfs_trans_cancel(tp);
>  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> @@ -1333,7 +1333,7 @@ xfs_shift_file_space(
>  	struct xfs_mount	*mp = ip->i_mount;
>  	struct xfs_trans	*tp;
>  	int			error;
> -	struct xfs_bmap_free	free_list;
> +	struct xfs_defer_ops	free_list;
>  	xfs_fsblock_t		first_block;
>  	xfs_fileoff_t		stop_fsb;
>  	xfs_fileoff_t		next_fsb;
> @@ -1411,7 +1411,7 @@ xfs_shift_file_space(
>  
>  		xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
>  
> -		xfs_bmap_init(&free_list, &first_block);
> +		xfs_defer_init(&free_list, &first_block);
>  
>  		/*
>  		 * We are using the write transaction in which max 2 bmbt
> @@ -1423,7 +1423,7 @@ xfs_shift_file_space(
>  		if (error)
>  			goto out_bmap_cancel;
>  
> -		error = xfs_bmap_finish(&tp, &free_list, NULL);
> +		error = xfs_defer_finish(&tp, &free_list, NULL);
>  		if (error)
>  			goto out_bmap_cancel;
>  
> @@ -1433,7 +1433,7 @@ xfs_shift_file_space(
>  	return error;
>  
>  out_bmap_cancel:
> -	xfs_bmap_cancel(&free_list);
> +	xfs_defer_cancel(&free_list);
>  out_trans_cancel:
>  	xfs_trans_cancel(tp);
>  	return error;
> diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
> index be17f0a..764e1cc 100644
> --- a/fs/xfs/xfs_dquot.c
> +++ b/fs/xfs/xfs_dquot.c
> @@ -307,7 +307,7 @@ xfs_qm_dqalloc(
>  	xfs_buf_t	**O_bpp)
>  {
>  	xfs_fsblock_t	firstblock;
> -	xfs_bmap_free_t flist;
> +	struct xfs_defer_ops flist;
>  	xfs_bmbt_irec_t map;
>  	int		nmaps, error;
>  	xfs_buf_t	*bp;
> @@ -320,7 +320,7 @@ xfs_qm_dqalloc(
>  	/*
>  	 * Initialize the bmap freelist prior to calling bmapi code.
>  	 */
> -	xfs_bmap_init(&flist, &firstblock);
> +	xfs_defer_init(&flist, &firstblock);
>  	xfs_ilock(quotip, XFS_ILOCK_EXCL);
>  	/*
>  	 * Return if this type of quotas is turned off while we didn't
> @@ -368,7 +368,7 @@ xfs_qm_dqalloc(
>  			      dqp->dq_flags & XFS_DQ_ALLTYPES, bp);
>  
>  	/*
> -	 * xfs_bmap_finish() may commit the current transaction and
> +	 * xfs_defer_finish() may commit the current transaction and
>  	 * start a second transaction if the freelist is not empty.
>  	 *
>  	 * Since we still want to modify this buffer, we need to
> @@ -382,7 +382,7 @@ xfs_qm_dqalloc(
>  
>  	xfs_trans_bhold(tp, bp);
>  
> -	error = xfs_bmap_finish(tpp, &flist, NULL);
> +	error = xfs_defer_finish(tpp, &flist, NULL);
>  	if (error)
>  		goto error1;
>  
> @@ -398,7 +398,7 @@ xfs_qm_dqalloc(
>  	return 0;
>  
>  error1:
> -	xfs_bmap_cancel(&flist);
> +	xfs_defer_cancel(&flist);
>  error0:
>  	xfs_iunlock(quotip, XFS_ILOCK_EXCL);
>  
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index d2389bb..3ce50da 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -1123,7 +1123,7 @@ xfs_create(
>  	struct xfs_inode	*ip = NULL;
>  	struct xfs_trans	*tp = NULL;
>  	int			error;
> -	xfs_bmap_free_t		free_list;
> +	struct xfs_defer_ops	free_list;
>  	xfs_fsblock_t		first_block;
>  	bool                    unlock_dp_on_error = false;
>  	prid_t			prid;
> @@ -1183,7 +1183,7 @@ xfs_create(
>  		      XFS_IOLOCK_PARENT | XFS_ILOCK_PARENT);
>  	unlock_dp_on_error = true;
>  
> -	xfs_bmap_init(&free_list, &first_block);
> +	xfs_defer_init(&free_list, &first_block);
>  
>  	/*
>  	 * Reserve disk quota and the inode.
> @@ -1254,7 +1254,7 @@ xfs_create(
>  	 */
>  	xfs_qm_vop_create_dqattach(tp, ip, udqp, gdqp, pdqp);
>  
> -	error = xfs_bmap_finish(&tp, &free_list, NULL);
> +	error = xfs_defer_finish(&tp, &free_list, NULL);
>  	if (error)
>  		goto out_bmap_cancel;
>  
> @@ -1270,7 +1270,7 @@ xfs_create(
>  	return 0;
>  
>   out_bmap_cancel:
> -	xfs_bmap_cancel(&free_list);
> +	xfs_defer_cancel(&free_list);
>   out_trans_cancel:
>  	xfs_trans_cancel(tp);
>   out_release_inode:
> @@ -1402,7 +1402,7 @@ xfs_link(
>  	xfs_mount_t		*mp = tdp->i_mount;
>  	xfs_trans_t		*tp;
>  	int			error;
> -	xfs_bmap_free_t         free_list;
> +	struct xfs_defer_ops	free_list;
>  	xfs_fsblock_t           first_block;
>  	int			resblks;
>  
> @@ -1453,7 +1453,7 @@ xfs_link(
>  			goto error_return;
>  	}
>  
> -	xfs_bmap_init(&free_list, &first_block);
> +	xfs_defer_init(&free_list, &first_block);
>  
>  	/*
>  	 * Handle initial link state of O_TMPFILE inode
> @@ -1483,9 +1483,9 @@ xfs_link(
>  	if (mp->m_flags & (XFS_MOUNT_WSYNC|XFS_MOUNT_DIRSYNC))
>  		xfs_trans_set_sync(tp);
>  
> -	error = xfs_bmap_finish(&tp, &free_list, NULL);
> +	error = xfs_defer_finish(&tp, &free_list, NULL);
>  	if (error) {
> -		xfs_bmap_cancel(&free_list);
> +		xfs_defer_cancel(&free_list);
>  		goto error_return;
>  	}
>  
> @@ -1527,7 +1527,7 @@ xfs_itruncate_extents(
>  {
>  	struct xfs_mount	*mp = ip->i_mount;
>  	struct xfs_trans	*tp = *tpp;
> -	xfs_bmap_free_t		free_list;
> +	struct xfs_defer_ops	free_list;
>  	xfs_fsblock_t		first_block;
>  	xfs_fileoff_t		first_unmap_block;
>  	xfs_fileoff_t		last_block;
> @@ -1563,7 +1563,7 @@ xfs_itruncate_extents(
>  	ASSERT(first_unmap_block < last_block);
>  	unmap_len = last_block - first_unmap_block + 1;
>  	while (!done) {
> -		xfs_bmap_init(&free_list, &first_block);
> +		xfs_defer_init(&free_list, &first_block);
>  		error = xfs_bunmapi(tp, ip,
>  				    first_unmap_block, unmap_len,
>  				    xfs_bmapi_aflag(whichfork),
> @@ -1577,7 +1577,7 @@ xfs_itruncate_extents(
>  		 * Duplicate the transaction that has the permanent
>  		 * reservation and commit the old transaction.
>  		 */
> -		error = xfs_bmap_finish(&tp, &free_list, ip);
> +		error = xfs_defer_finish(&tp, &free_list, ip);
>  		if (error)
>  			goto out_bmap_cancel;
>  
> @@ -1603,7 +1603,7 @@ out_bmap_cancel:
>  	 * the transaction can be properly aborted.  We just need to make sure
>  	 * we're not holding any resources that we were not when we came in.
>  	 */
> -	xfs_bmap_cancel(&free_list);
> +	xfs_defer_cancel(&free_list);
>  	goto out;
>  }
>  
> @@ -1744,7 +1744,7 @@ STATIC int
>  xfs_inactive_ifree(
>  	struct xfs_inode *ip)
>  {
> -	xfs_bmap_free_t		free_list;
> +	struct xfs_defer_ops	free_list;
>  	xfs_fsblock_t		first_block;
>  	struct xfs_mount	*mp = ip->i_mount;
>  	struct xfs_trans	*tp;
> @@ -1781,7 +1781,7 @@ xfs_inactive_ifree(
>  	xfs_ilock(ip, XFS_ILOCK_EXCL);
>  	xfs_trans_ijoin(tp, ip, 0);
>  
> -	xfs_bmap_init(&free_list, &first_block);
> +	xfs_defer_init(&free_list, &first_block);
>  	error = xfs_ifree(tp, ip, &free_list);
>  	if (error) {
>  		/*
> @@ -1808,11 +1808,11 @@ xfs_inactive_ifree(
>  	 * Just ignore errors at this point.  There is nothing we can do except
>  	 * to try to keep going. Make sure it's not a silent error.
>  	 */
> -	error = xfs_bmap_finish(&tp, &free_list, NULL);
> +	error = xfs_defer_finish(&tp, &free_list, NULL);
>  	if (error) {
> -		xfs_notice(mp, "%s: xfs_bmap_finish returned error %d",
> +		xfs_notice(mp, "%s: xfs_defer_finish returned error %d",
>  			__func__, error);
> -		xfs_bmap_cancel(&free_list);
> +		xfs_defer_cancel(&free_list);
>  	}
>  	error = xfs_trans_commit(tp);
>  	if (error)
> @@ -2368,7 +2368,7 @@ int
>  xfs_ifree(
>  	xfs_trans_t	*tp,
>  	xfs_inode_t	*ip,
> -	xfs_bmap_free_t	*flist)
> +	struct xfs_defer_ops	*flist)
>  {
>  	int			error;
>  	struct xfs_icluster	xic = { 0 };
> @@ -2475,7 +2475,7 @@ xfs_iunpin_wait(
>   * directory entry.
>   *
>   * This is still safe from a transactional point of view - it is not until we
> - * get to xfs_bmap_finish() that we have the possibility of multiple
> + * get to xfs_defer_finish() that we have the possibility of multiple
>   * transactions in this operation. Hence as long as we remove the directory
>   * entry and drop the link count in the first transaction of the remove
>   * operation, there are no transactional constraints on the ordering here.
> @@ -2490,7 +2490,7 @@ xfs_remove(
>  	xfs_trans_t             *tp = NULL;
>  	int			is_dir = S_ISDIR(VFS_I(ip)->i_mode);
>  	int                     error = 0;
> -	xfs_bmap_free_t         free_list;
> +	struct xfs_defer_ops	free_list;
>  	xfs_fsblock_t           first_block;
>  	uint			resblks;
>  
> @@ -2572,7 +2572,7 @@ xfs_remove(
>  	if (error)
>  		goto out_trans_cancel;
>  
> -	xfs_bmap_init(&free_list, &first_block);
> +	xfs_defer_init(&free_list, &first_block);
>  	error = xfs_dir_removename(tp, dp, name, ip->i_ino,
>  					&first_block, &free_list, resblks);
>  	if (error) {
> @@ -2588,7 +2588,7 @@ xfs_remove(
>  	if (mp->m_flags & (XFS_MOUNT_WSYNC|XFS_MOUNT_DIRSYNC))
>  		xfs_trans_set_sync(tp);
>  
> -	error = xfs_bmap_finish(&tp, &free_list, NULL);
> +	error = xfs_defer_finish(&tp, &free_list, NULL);
>  	if (error)
>  		goto out_bmap_cancel;
>  
> @@ -2602,7 +2602,7 @@ xfs_remove(
>  	return 0;
>  
>   out_bmap_cancel:
> -	xfs_bmap_cancel(&free_list);
> +	xfs_defer_cancel(&free_list);
>   out_trans_cancel:
>  	xfs_trans_cancel(tp);
>   std_return:
> @@ -2663,7 +2663,7 @@ xfs_sort_for_rename(
>  static int
>  xfs_finish_rename(
>  	struct xfs_trans	*tp,
> -	struct xfs_bmap_free	*free_list)
> +	struct xfs_defer_ops	*free_list)
>  {
>  	int			error;
>  
> @@ -2674,9 +2674,9 @@ xfs_finish_rename(
>  	if (tp->t_mountp->m_flags & (XFS_MOUNT_WSYNC|XFS_MOUNT_DIRSYNC))
>  		xfs_trans_set_sync(tp);
>  
> -	error = xfs_bmap_finish(&tp, free_list, NULL);
> +	error = xfs_defer_finish(&tp, free_list, NULL);
>  	if (error) {
> -		xfs_bmap_cancel(free_list);
> +		xfs_defer_cancel(free_list);
>  		xfs_trans_cancel(tp);
>  		return error;
>  	}
> @@ -2698,7 +2698,7 @@ xfs_cross_rename(
>  	struct xfs_inode	*dp2,
>  	struct xfs_name		*name2,
>  	struct xfs_inode	*ip2,
> -	struct xfs_bmap_free	*free_list,
> +	struct xfs_defer_ops	*free_list,
>  	xfs_fsblock_t		*first_block,
>  	int			spaceres)
>  {
> @@ -2801,7 +2801,7 @@ xfs_cross_rename(
>  	return xfs_finish_rename(tp, free_list);
>  
>  out_trans_abort:
> -	xfs_bmap_cancel(free_list);
> +	xfs_defer_cancel(free_list);
>  	xfs_trans_cancel(tp);
>  	return error;
>  }
> @@ -2856,7 +2856,7 @@ xfs_rename(
>  {
>  	struct xfs_mount	*mp = src_dp->i_mount;
>  	struct xfs_trans	*tp;
> -	struct xfs_bmap_free	free_list;
> +	struct xfs_defer_ops	free_list;
>  	xfs_fsblock_t		first_block;
>  	struct xfs_inode	*wip = NULL;		/* whiteout inode */
>  	struct xfs_inode	*inodes[__XFS_SORT_INODES];
> @@ -2945,7 +2945,7 @@ xfs_rename(
>  		goto out_trans_cancel;
>  	}
>  
> -	xfs_bmap_init(&free_list, &first_block);
> +	xfs_defer_init(&free_list, &first_block);
>  
>  	/* RENAME_EXCHANGE is unique from here on. */
>  	if (flags & RENAME_EXCHANGE)
> @@ -3131,7 +3131,7 @@ xfs_rename(
>  	return error;
>  
>  out_bmap_cancel:
> -	xfs_bmap_cancel(&free_list);
> +	xfs_defer_cancel(&free_list);
>  out_trans_cancel:
>  	xfs_trans_cancel(tp);
>  out_release_wip:
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index 99d7522..633f2af 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -27,7 +27,7 @@
>  struct xfs_dinode;
>  struct xfs_inode;
>  struct xfs_buf;
> -struct xfs_bmap_free;
> +struct xfs_defer_ops;
>  struct xfs_bmbt_irec;
>  struct xfs_inode_log_item;
>  struct xfs_mount;
> @@ -398,7 +398,7 @@ uint		xfs_ilock_attr_map_shared(struct xfs_inode *);
>  
>  uint		xfs_ip2xflags(struct xfs_inode *);
>  int		xfs_ifree(struct xfs_trans *, xfs_inode_t *,
> -			   struct xfs_bmap_free *);
> +			   struct xfs_defer_ops *);
>  int		xfs_itruncate_extents(struct xfs_trans **, struct xfs_inode *,
>  				      int, xfs_fsize_t);
>  void		xfs_iext_realloc(xfs_inode_t *, int, int);
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index b090bc1..cb7abe84 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -128,7 +128,7 @@ xfs_iomap_write_direct(
>  	int		quota_flag;
>  	int		rt;
>  	xfs_trans_t	*tp;
> -	xfs_bmap_free_t free_list;
> +	struct xfs_defer_ops free_list;
>  	uint		qblocks, resblks, resrtextents;
>  	int		error;
>  	int		lockmode;
> @@ -231,7 +231,7 @@ xfs_iomap_write_direct(
>  	 * From this point onwards we overwrite the imap pointer that the
>  	 * caller gave to us.
>  	 */
> -	xfs_bmap_init(&free_list, &firstfsb);
> +	xfs_defer_init(&free_list, &firstfsb);
>  	nimaps = 1;
>  	error = xfs_bmapi_write(tp, ip, offset_fsb, count_fsb,
>  				bmapi_flags, &firstfsb, resblks, imap,
> @@ -242,7 +242,7 @@ xfs_iomap_write_direct(
>  	/*
>  	 * Complete the transaction
>  	 */
> -	error = xfs_bmap_finish(&tp, &free_list, NULL);
> +	error = xfs_defer_finish(&tp, &free_list, NULL);
>  	if (error)
>  		goto out_bmap_cancel;
>  
> @@ -266,7 +266,7 @@ out_unlock:
>  	return error;
>  
>  out_bmap_cancel:
> -	xfs_bmap_cancel(&free_list);
> +	xfs_defer_cancel(&free_list);
>  	xfs_trans_unreserve_quota_nblks(tp, ip, (long)qblocks, 0, quota_flag);
>  out_trans_cancel:
>  	xfs_trans_cancel(tp);
> @@ -685,7 +685,7 @@ xfs_iomap_write_allocate(
>  	xfs_fileoff_t	offset_fsb, last_block;
>  	xfs_fileoff_t	end_fsb, map_start_fsb;
>  	xfs_fsblock_t	first_block;
> -	xfs_bmap_free_t	free_list;
> +	struct xfs_defer_ops	free_list;
>  	xfs_filblks_t	count_fsb;
>  	xfs_trans_t	*tp;
>  	int		nimaps;
> @@ -727,7 +727,7 @@ xfs_iomap_write_allocate(
>  			xfs_ilock(ip, XFS_ILOCK_EXCL);
>  			xfs_trans_ijoin(tp, ip, 0);
>  
> -			xfs_bmap_init(&free_list, &first_block);
> +			xfs_defer_init(&free_list, &first_block);
>  
>  			/*
>  			 * it is possible that the extents have changed since
> @@ -787,7 +787,7 @@ xfs_iomap_write_allocate(
>  			if (error)
>  				goto trans_cancel;
>  
> -			error = xfs_bmap_finish(&tp, &free_list, NULL);
> +			error = xfs_defer_finish(&tp, &free_list, NULL);
>  			if (error)
>  				goto trans_cancel;
>  
> @@ -821,7 +821,7 @@ xfs_iomap_write_allocate(
>  	}
>  
>  trans_cancel:
> -	xfs_bmap_cancel(&free_list);
> +	xfs_defer_cancel(&free_list);
>  	xfs_trans_cancel(tp);
>  error0:
>  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> @@ -842,7 +842,7 @@ xfs_iomap_write_unwritten(
>  	int		nimaps;
>  	xfs_trans_t	*tp;
>  	xfs_bmbt_irec_t imap;
> -	xfs_bmap_free_t free_list;
> +	struct xfs_defer_ops free_list;
>  	xfs_fsize_t	i_size;
>  	uint		resblks;
>  	int		error;
> @@ -886,7 +886,7 @@ xfs_iomap_write_unwritten(
>  		/*
>  		 * Modify the unwritten extent state of the buffer.
>  		 */
> -		xfs_bmap_init(&free_list, &firstfsb);
> +		xfs_defer_init(&free_list, &firstfsb);
>  		nimaps = 1;
>  		error = xfs_bmapi_write(tp, ip, offset_fsb, count_fsb,
>  					XFS_BMAPI_CONVERT, &firstfsb, resblks,
> @@ -909,7 +909,7 @@ xfs_iomap_write_unwritten(
>  			xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
>  		}
>  
> -		error = xfs_bmap_finish(&tp, &free_list, NULL);
> +		error = xfs_defer_finish(&tp, &free_list, NULL);
>  		if (error)
>  			goto error_on_bmapi_transaction;
>  
> @@ -936,7 +936,7 @@ xfs_iomap_write_unwritten(
>  	return 0;
>  
>  error_on_bmapi_transaction:
> -	xfs_bmap_cancel(&free_list);
> +	xfs_defer_cancel(&free_list);
>  	xfs_trans_cancel(tp);
>  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>  	return error;
> diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
> index 627f7e6..c761a6a 100644
> --- a/fs/xfs/xfs_rtalloc.c
> +++ b/fs/xfs/xfs_rtalloc.c
> @@ -770,7 +770,7 @@ xfs_growfs_rt_alloc(
>  	xfs_daddr_t		d;		/* disk block address */
>  	int			error;		/* error return value */
>  	xfs_fsblock_t		firstblock;/* first block allocated in xaction */
> -	struct xfs_bmap_free	flist;		/* list of freed blocks */
> +	struct xfs_defer_ops	flist;		/* list of freed blocks */
>  	xfs_fsblock_t		fsbno;		/* filesystem block for bno */
>  	struct xfs_bmbt_irec	map;		/* block map output */
>  	int			nmap;		/* number of block maps */
> @@ -795,7 +795,7 @@ xfs_growfs_rt_alloc(
>  		xfs_ilock(ip, XFS_ILOCK_EXCL);
>  		xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
>  
> -		xfs_bmap_init(&flist, &firstblock);
> +		xfs_defer_init(&flist, &firstblock);
>  		/*
>  		 * Allocate blocks to the bitmap file.
>  		 */
> @@ -810,7 +810,7 @@ xfs_growfs_rt_alloc(
>  		/*
>  		 * Free any blocks freed up in the transaction, then commit.
>  		 */
> -		error = xfs_bmap_finish(&tp, &flist, NULL);
> +		error = xfs_defer_finish(&tp, &flist, NULL);
>  		if (error)
>  			goto out_bmap_cancel;
>  		error = xfs_trans_commit(tp);
> @@ -863,7 +863,7 @@ xfs_growfs_rt_alloc(
>  	return 0;
>  
>  out_bmap_cancel:
> -	xfs_bmap_cancel(&flist);
> +	xfs_defer_cancel(&flist);
>  out_trans_cancel:
>  	xfs_trans_cancel(tp);
>  	return error;
> diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c
> index 20af47b..3b005ec 100644
> --- a/fs/xfs/xfs_symlink.c
> +++ b/fs/xfs/xfs_symlink.c
> @@ -173,7 +173,7 @@ xfs_symlink(
>  	struct xfs_inode	*ip = NULL;
>  	int			error = 0;
>  	int			pathlen;
> -	struct xfs_bmap_free	free_list;
> +	struct xfs_defer_ops	free_list;
>  	xfs_fsblock_t		first_block;
>  	bool                    unlock_dp_on_error = false;
>  	xfs_fileoff_t		first_fsb;
> @@ -270,7 +270,7 @@ xfs_symlink(
>  	 * Initialize the bmap freelist prior to calling either
>  	 * bmapi or the directory create code.
>  	 */
> -	xfs_bmap_init(&free_list, &first_block);
> +	xfs_defer_init(&free_list, &first_block);
>  
>  	/*
>  	 * Allocate an inode for the symlink.
> @@ -377,7 +377,7 @@ xfs_symlink(
>  		xfs_trans_set_sync(tp);
>  	}
>  
> -	error = xfs_bmap_finish(&tp, &free_list, NULL);
> +	error = xfs_defer_finish(&tp, &free_list, NULL);
>  	if (error)
>  		goto out_bmap_cancel;
>  
> @@ -393,7 +393,7 @@ xfs_symlink(
>  	return 0;
>  
>  out_bmap_cancel:
> -	xfs_bmap_cancel(&free_list);
> +	xfs_defer_cancel(&free_list);
>  out_trans_cancel:
>  	xfs_trans_cancel(tp);
>  out_release_inode:
> @@ -427,7 +427,7 @@ xfs_inactive_symlink_rmt(
>  	int		done;
>  	int		error;
>  	xfs_fsblock_t	first_block;
> -	xfs_bmap_free_t	free_list;
> +	struct xfs_defer_ops	free_list;
>  	int		i;
>  	xfs_mount_t	*mp;
>  	xfs_bmbt_irec_t	mval[XFS_SYMLINK_MAPS];
> @@ -466,7 +466,7 @@ xfs_inactive_symlink_rmt(
>  	 * Find the block(s) so we can inval and unmap them.
>  	 */
>  	done = 0;
> -	xfs_bmap_init(&free_list, &first_block);
> +	xfs_defer_init(&free_list, &first_block);
>  	nmaps = ARRAY_SIZE(mval);
>  	error = xfs_bmapi_read(ip, 0, xfs_symlink_blocks(mp, size),
>  				mval, &nmaps, 0);
> @@ -496,7 +496,7 @@ xfs_inactive_symlink_rmt(
>  	/*
>  	 * Commit the first transaction.  This logs the EFI and the inode.
>  	 */
> -	error = xfs_bmap_finish(&tp, &free_list, ip);
> +	error = xfs_defer_finish(&tp, &free_list, ip);
>  	if (error)
>  		goto error_bmap_cancel;
>  	/*
> @@ -526,7 +526,7 @@ xfs_inactive_symlink_rmt(
>  	return 0;
>  
>  error_bmap_cancel:
> -	xfs_bmap_cancel(&free_list);
> +	xfs_defer_cancel(&free_list);
>  error_trans_cancel:
>  	xfs_trans_cancel(tp);
>  error_unlock:
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 038/119] xfs: convert unwritten status of reverse mappings
  2016-06-17  1:21 ` [PATCH 038/119] xfs: convert unwritten status of reverse mappings Darrick J. Wong
@ 2016-06-30  0:15   ` Darrick J. Wong
  2016-07-13 18:27   ` Brian Foster
  1 sibling, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-06-30  0:15 UTC (permalink / raw)
  To: david; +Cc: linux-fsdevel, vishal.l.verma, xfs

On Thu, Jun 16, 2016 at 06:21:55PM -0700, Darrick J. Wong wrote:
> Provide a function to convert an unwritten extent to a real one and
> vice versa.
> 
> v2: Move unwritten bit to rm_offset.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_rmap.c |  442 ++++++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_trace.h       |    6 +
>  2 files changed, 448 insertions(+)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
> index 1043c63..53ba14e 100644
> --- a/fs/xfs/libxfs/xfs_rmap.c
> +++ b/fs/xfs/libxfs/xfs_rmap.c
> @@ -610,6 +610,448 @@ out_error:
>  	return error;
>  }
>  
> +#define RMAP_LEFT_CONTIG	(1 << 0)
> +#define RMAP_RIGHT_CONTIG	(1 << 1)
> +#define RMAP_LEFT_FILLING	(1 << 2)
> +#define RMAP_RIGHT_FILLING	(1 << 3)
> +#define RMAP_LEFT_VALID		(1 << 6)
> +#define RMAP_RIGHT_VALID	(1 << 7)
> +
> +#define LEFT		r[0]
> +#define RIGHT		r[1]
> +#define PREV		r[2]
> +#define NEW		r[3]
> +
> +/*
> + * Convert an unwritten extent to a real extent or vice versa.
> + * Does not handle overlapping extents.
> + */
> +STATIC int
> +__xfs_rmap_convert(
> +	struct xfs_btree_cur	*cur,
> +	xfs_agblock_t		bno,
> +	xfs_extlen_t		len,
> +	bool			unwritten,
> +	struct xfs_owner_info	*oinfo)
> +{
> +	struct xfs_mount	*mp = cur->bc_mp;
> +	struct xfs_rmap_irec	r[4];	/* neighbor extent entries */
> +					/* left is 0, right is 1, prev is 2 */
> +					/* new is 3 */
> +	uint64_t		owner;
> +	uint64_t		offset;
> +	uint64_t		new_endoff;
> +	unsigned int		oldext;
> +	unsigned int		newext;
> +	unsigned int		flags = 0;
> +	int			i;
> +	int			state = 0;
> +	int			error;
> +
> +	xfs_owner_info_unpack(oinfo, &owner, &offset, &flags);
> +	ASSERT(!(XFS_RMAP_NON_INODE_OWNER(owner) ||
> +			(flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))));
> +	oldext = unwritten ? XFS_RMAP_UNWRITTEN : 0;
> +	new_endoff = offset + len;
> +	trace_xfs_rmap_convert(mp, cur->bc_private.a.agno, bno, len,
> +			unwritten, oinfo);
> +
> +	/*
> +	 * For the initial lookup, look for an exact match or the left-adjacent
> +	 * record for our insertion point. This will also give us the record for
> +	 * start block contiguity tests.
> +	 */
> +	error = xfs_rmap_lookup_le(cur, bno, len, owner, offset, oldext, &i);
> +	if (error)
> +		goto done;
> +	XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
> +
> +	error = xfs_rmap_get_rec(cur, &PREV, &i);
> +	if (error)
> +		goto done;
> +	XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
> +	trace_xfs_rmap_lookup_le_range_result(cur->bc_mp,
> +			cur->bc_private.a.agno, PREV.rm_startblock,
> +			PREV.rm_blockcount, PREV.rm_owner,
> +			PREV.rm_offset, PREV.rm_flags);
> +
> +	ASSERT(PREV.rm_offset <= offset);
> +	ASSERT(PREV.rm_offset + PREV.rm_blockcount >= new_endoff);
> +	ASSERT((PREV.rm_flags & XFS_RMAP_UNWRITTEN) == oldext);
> +	newext = ~oldext & XFS_RMAP_UNWRITTEN;
> +
> +	/*
> +	 * Set flags determining what part of the previous oldext allocation
> +	 * extent is being replaced by a newext allocation.
> +	 */
> +	if (PREV.rm_offset == offset)
> +		state |= RMAP_LEFT_FILLING;
> +	if (PREV.rm_offset + PREV.rm_blockcount == new_endoff)
> +		state |= RMAP_RIGHT_FILLING;
> +
> +	/*
> +	 * Decrement the cursor to see if we have a left-adjacent record to our
> +	 * insertion point. This will give us the record for end block
> +	 * contiguity tests.
> +	 */
> +	error = xfs_btree_decrement(cur, 0, &i);
> +	if (error)
> +		goto done;
> +	if (i) {
> +		state |= RMAP_LEFT_VALID;
> +		error = xfs_rmap_get_rec(cur, &LEFT, &i);
> +		if (error)
> +			goto done;
> +		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
> +		XFS_WANT_CORRUPTED_GOTO(mp,
> +				LEFT.rm_startblock + LEFT.rm_blockcount <= bno,
> +				done);
> +		trace_xfs_rmap_find_left_neighbor_result(cur->bc_mp,
> +				cur->bc_private.a.agno, LEFT.rm_startblock,
> +				LEFT.rm_blockcount, LEFT.rm_owner,
> +				LEFT.rm_offset, LEFT.rm_flags);
> +		if (LEFT.rm_startblock + LEFT.rm_blockcount == bno &&
> +		    LEFT.rm_offset + LEFT.rm_blockcount == offset &&
> +		    xfs_rmap_is_mergeable(&LEFT, owner, offset, len, newext))
> +			state |= RMAP_LEFT_CONTIG;
> +	}
> +
> +	/*
> +	 * Increment the cursor to see if we have a right-adjacent record to our
> +	 * insertion point. This will give us the record for end block
> +	 * contiguity tests.
> +	 */
> +	error = xfs_btree_increment(cur, 0, &i);
> +	if (error)
> +		goto done;
> +	XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
> +	error = xfs_btree_increment(cur, 0, &i);
> +	if (error)
> +		goto done;
> +	if (i) {
> +		state |= RMAP_RIGHT_VALID;
> +		error = xfs_rmap_get_rec(cur, &RIGHT, &i);
> +		if (error)
> +			goto done;
> +		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
> +		XFS_WANT_CORRUPTED_GOTO(mp, bno + len <= RIGHT.rm_startblock,
> +					done);
> +		trace_xfs_rmap_convert_gtrec(cur->bc_mp,
> +				cur->bc_private.a.agno, RIGHT.rm_startblock,
> +				RIGHT.rm_blockcount, RIGHT.rm_owner,
> +				RIGHT.rm_offset, RIGHT.rm_flags);
> +		if (bno + len == RIGHT.rm_startblock &&
> +		    offset + len == RIGHT.rm_offset &&
> +		    xfs_rmap_is_mergeable(&RIGHT, owner, offset, len, newext))
> +			state |= RMAP_RIGHT_CONTIG;
> +	}
> +
> +	/* check that left + prev + right is not too long */
> +	if ((state & (RMAP_LEFT_FILLING | RMAP_LEFT_CONTIG |
> +			 RMAP_RIGHT_FILLING | RMAP_RIGHT_CONTIG)) ==
> +	    (RMAP_LEFT_FILLING | RMAP_LEFT_CONTIG |
> +	     RMAP_RIGHT_FILLING | RMAP_RIGHT_CONTIG) &&
> +	    (unsigned long)LEFT.rm_blockcount + len +
> +	     RIGHT.rm_blockcount > XFS_RMAP_LEN_MAX)
> +		state &= ~RMAP_RIGHT_CONTIG;
> +
> +	trace_xfs_rmap_convert_state(mp, cur->bc_private.a.agno, state,
> +			_RET_IP_);
> +
> +	/* reset the cursor back to PREV */
> +	error = xfs_rmap_lookup_le(cur, bno, len, owner, offset, oldext, &i);
> +	if (error)
> +		goto done;
> +	XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
> +
> +	/*
> +	 * Switch out based on the FILLING and CONTIG state bits.
> +	 */
> +	switch (state & (RMAP_LEFT_FILLING | RMAP_LEFT_CONTIG |
> +			 RMAP_RIGHT_FILLING | RMAP_RIGHT_CONTIG)) {
> +	case RMAP_LEFT_FILLING | RMAP_LEFT_CONTIG |
> +	     RMAP_RIGHT_FILLING | RMAP_RIGHT_CONTIG:
> +		/*
> +		 * Setting all of a previous oldext extent to newext.
> +		 * The left and right neighbors are both contiguous with new.
> +		 */
> +		error = xfs_btree_increment(cur, 0, &i);
> +		if (error)
> +			goto done;
> +		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
> +		trace_xfs_rmapbt_delete(mp, cur->bc_private.a.agno,
> +				RIGHT.rm_startblock, RIGHT.rm_blockcount,
> +				RIGHT.rm_owner, RIGHT.rm_offset,
> +				RIGHT.rm_flags);
> +		error = xfs_btree_delete(cur, &i);
> +		if (error)
> +			goto done;
> +		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
> +		error = xfs_btree_decrement(cur, 0, &i);
> +		if (error)
> +			goto done;
> +		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
> +		trace_xfs_rmapbt_delete(mp, cur->bc_private.a.agno,
> +				PREV.rm_startblock, PREV.rm_blockcount,
> +				PREV.rm_owner, PREV.rm_offset,
> +				PREV.rm_flags);
> +		error = xfs_btree_delete(cur, &i);
> +		if (error)
> +			goto done;
> +		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
> +		error = xfs_btree_decrement(cur, 0, &i);
> +		if (error)
> +			goto done;
> +		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
> +		NEW = LEFT;
> +		NEW.rm_blockcount += PREV.rm_blockcount + RIGHT.rm_blockcount;
> +		error = xfs_rmap_update(cur, &NEW);
> +		if (error)
> +			goto done;
> +		break;
> +
> +	case RMAP_LEFT_FILLING | RMAP_RIGHT_FILLING | RMAP_LEFT_CONTIG:
> +		/*
> +		 * Setting all of a previous oldext extent to newext.
> +		 * The left neighbor is contiguous, the right is not.
> +		 */
> +		trace_xfs_rmapbt_delete(mp, cur->bc_private.a.agno,
> +				PREV.rm_startblock, PREV.rm_blockcount,
> +				PREV.rm_owner, PREV.rm_offset,
> +				PREV.rm_flags);
> +		error = xfs_btree_delete(cur, &i);
> +		if (error)
> +			goto done;
> +		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
> +		error = xfs_btree_decrement(cur, 0, &i);
> +		if (error)
> +			goto done;
> +		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
> +		NEW = LEFT;
> +		NEW.rm_blockcount += PREV.rm_blockcount;
> +		error = xfs_rmap_update(cur, &NEW);
> +		if (error)
> +			goto done;
> +		break;
> +
> +	case RMAP_LEFT_FILLING | RMAP_RIGHT_FILLING | RMAP_RIGHT_CONTIG:
> +		/*
> +		 * Setting all of a previous oldext extent to newext.
> +		 * The right neighbor is contiguous, the left is not.
> +		 */
> +		error = xfs_btree_increment(cur, 0, &i);
> +		if (error)
> +			goto done;
> +		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
> +		trace_xfs_rmapbt_delete(mp, cur->bc_private.a.agno,
> +				RIGHT.rm_startblock, RIGHT.rm_blockcount,
> +				RIGHT.rm_owner, RIGHT.rm_offset,
> +				RIGHT.rm_flags);
> +		error = xfs_btree_delete(cur, &i);
> +		if (error)
> +			goto done;
> +		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
> +		error = xfs_btree_decrement(cur, 0, &i);
> +		if (error)
> +			goto done;
> +		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
> +		NEW.rm_startblock = bno;
> +		NEW.rm_owner = owner;
> +		NEW.rm_offset = offset;
> +		NEW.rm_blockcount = len + RIGHT.rm_blockcount;
> +		NEW.rm_flags = newext;
> +		error = xfs_rmap_update(cur, &NEW);
> +		if (error)
> +			goto done;
> +		break;
> +
> +	case RMAP_LEFT_FILLING | RMAP_RIGHT_FILLING:
> +		/*
> +		 * Setting all of a previous oldext extent to newext.
> +		 * Neither the left nor right neighbors are contiguous with
> +		 * the new one.
> +		 */
> +		NEW = PREV;
> +		NEW.rm_flags = newext;
> +		error = xfs_rmap_update(cur, &NEW);
> +		if (error)
> +			goto done;
> +		break;
> +
> +	case RMAP_LEFT_FILLING | RMAP_LEFT_CONTIG:
> +		/*
> +		 * Setting the first part of a previous oldext extent to newext.
> +		 * The left neighbor is contiguous.
> +		 */
> +		NEW = PREV;
> +		NEW.rm_offset += len;
> +		NEW.rm_startblock += len;
> +		NEW.rm_blockcount -= len;
> +		error = xfs_rmap_update(cur, &NEW);
> +		if (error)
> +			goto done;
> +		error = xfs_btree_decrement(cur, 0, &i);
> +		if (error)
> +			goto done;
> +		NEW = LEFT;
> +		NEW.rm_blockcount += len;
> +		error = xfs_rmap_update(cur, &NEW);
> +		if (error)
> +			goto done;
> +		break;
> +
> +	case RMAP_LEFT_FILLING:
> +		/*
> +		 * Setting the first part of a previous oldext extent to newext.
> +		 * The left neighbor is not contiguous.
> +		 */
> +		NEW = PREV;
> +		NEW.rm_startblock += len;
> +		NEW.rm_offset += len;
> +		NEW.rm_blockcount -= len;
> +		error = xfs_rmap_update(cur, &NEW);
> +		if (error)
> +			goto done;
> +		NEW.rm_startblock = bno;
> +		NEW.rm_owner = owner;
> +		NEW.rm_offset = offset;
> +		NEW.rm_blockcount = len;
> +		NEW.rm_flags = newext;
> +		cur->bc_rec.r = NEW;
> +		trace_xfs_rmapbt_insert(mp, cur->bc_private.a.agno, bno,
> +				len, owner, offset, newext);
> +		error = xfs_btree_insert(cur, &i);
> +		if (error)
> +			goto done;
> +		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
> +		break;
> +
> +	case RMAP_RIGHT_FILLING | RMAP_RIGHT_CONTIG:
> +		/*
> +		 * Setting the last part of a previous oldext extent to newext.
> +		 * The right neighbor is contiguous with the new allocation.
> +		 */
> +		NEW = PREV;
> +		NEW.rm_blockcount -= len;
> +		error = xfs_rmap_update(cur, &NEW);
> +		if (error)
> +			goto done;
> +		error = xfs_btree_increment(cur, 0, &i);
> +		if (error)
> +			goto done;
> +		NEW = RIGHT;
> +		NEW.rm_offset = offset;
> +		NEW.rm_startblock = bno;
> +		NEW.rm_blockcount += len;
> +		error = xfs_rmap_update(cur, &NEW);
> +		if (error)
> +			goto done;
> +		break;
> +
> +	case RMAP_RIGHT_FILLING:
> +		/*
> +		 * Setting the last part of a previous oldext extent to newext.
> +		 * The right neighbor is not contiguous.
> +		 */
> +		NEW = PREV;
> +		NEW.rm_blockcount -= len;
> +		error = xfs_rmap_update(cur, &NEW);
> +		if (error)
> +			goto done;
> +		error = xfs_rmap_lookup_eq(cur, bno, len, owner, offset,
> +				oldext, &i);
> +		if (error)
> +			goto done;
> +		XFS_WANT_CORRUPTED_GOTO(mp, i == 0, done);
> +		NEW.rm_startblock = bno;
> +		NEW.rm_owner = owner;
> +		NEW.rm_offset = offset;
> +		NEW.rm_blockcount = len;
> +		NEW.rm_flags = newext;
> +		cur->bc_rec.r = NEW;
> +		trace_xfs_rmapbt_insert(mp, cur->bc_private.a.agno, bno,
> +				len, owner, offset, newext);
> +		error = xfs_btree_insert(cur, &i);
> +		if (error)
> +			goto done;
> +		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
> +		break;
> +
> +	case 0:
> +		/*
> +		 * Setting the middle part of a previous oldext extent to
> +		 * newext.  Contiguity is impossible here.
> +		 * One extent becomes three extents.
> +		 */
> +		/* new right extent - oldext */
> +		NEW.rm_startblock = bno + len;
> +		NEW.rm_owner = owner;
> +		NEW.rm_offset = new_endoff;
> +		NEW.rm_blockcount = PREV.rm_offset + PREV.rm_blockcount -
> +				new_endoff;
> +		NEW.rm_flags = PREV.rm_flags;
> +		error = xfs_rmap_update(cur, &NEW);
> +		if (error)
> +			goto done;
> +		/* new left extent - oldext */
> +		NEW = PREV;
> +		NEW.rm_blockcount = offset - PREV.rm_offset;
> +		cur->bc_rec.r = NEW;
> +		trace_xfs_rmapbt_insert(mp, cur->bc_private.a.agno,
> +				NEW.rm_startblock, NEW.rm_blockcount,
> +				NEW.rm_owner, NEW.rm_offset,
> +				NEW.rm_flags);
> +		error = xfs_btree_insert(cur, &i);
> +		if (error)
> +			goto done;
> +		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
> +		/*
> +		 * Reset the cursor to the position of the new extent
> +		 * we are about to insert as we can't trust it after
> +		 * the previous insert.
> +		 */
> +		error = xfs_rmap_lookup_eq(cur, bno, len, owner, offset,
> +				oldext, &i);
> +		if (error)
> +			goto done;
> +		XFS_WANT_CORRUPTED_GOTO(mp, i == 0, done);
> +		/* new middle extent - newext */
> +		cur->bc_rec.b.br_state = newext;

Wrong, should be:

		cur->bc_rec.r.rm_flags &= ~XFS_RMAP_UNWRITTEN;
		cur->bc_rec.r.rm_flags |= newext;

We're modifying the rmapbt here, not the bmbt, so it makes no sense to touch
the bmbt_irec in the cursor.  Modify the rmap_irec instead.  Incidentally
this just happens not to fail because the fields line up....

--D

> +		trace_xfs_rmapbt_insert(mp, cur->bc_private.a.agno, bno, len,
> +				owner, offset, newext);
> +		error = xfs_btree_insert(cur, &i);
> +		if (error)
> +			goto done;
> +		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
> +		break;
> +
> +	case RMAP_LEFT_FILLING | RMAP_LEFT_CONTIG | RMAP_RIGHT_CONTIG:
> +	case RMAP_RIGHT_FILLING | RMAP_LEFT_CONTIG | RMAP_RIGHT_CONTIG:
> +	case RMAP_LEFT_FILLING | RMAP_RIGHT_CONTIG:
> +	case RMAP_RIGHT_FILLING | RMAP_LEFT_CONTIG:
> +	case RMAP_LEFT_CONTIG | RMAP_RIGHT_CONTIG:
> +	case RMAP_LEFT_CONTIG:
> +	case RMAP_RIGHT_CONTIG:
> +		/*
> +		 * These cases are all impossible.
> +		 */
> +		ASSERT(0);
> +	}
> +
> +	trace_xfs_rmap_convert_done(mp, cur->bc_private.a.agno, bno, len,
> +			unwritten, oinfo);
> +done:
> +	if (error)
> +		trace_xfs_rmap_convert_error(cur->bc_mp,
> +				cur->bc_private.a.agno, error, _RET_IP_);
> +	return error;
> +}
> +
> +#undef	NEW
> +#undef	LEFT
> +#undef	RIGHT
> +#undef	PREV
> +
>  struct xfs_rmapbt_query_range_info {
>  	xfs_rmapbt_query_range_fn	fn;
>  	void				*priv;
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index 3ebceb0..6466adc 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -2497,6 +2497,10 @@ DEFINE_RMAP_EVENT(xfs_rmap_free_extent_error);
>  DEFINE_RMAP_EVENT(xfs_rmap_alloc_extent);
>  DEFINE_RMAP_EVENT(xfs_rmap_alloc_extent_done);
>  DEFINE_RMAP_EVENT(xfs_rmap_alloc_extent_error);
> +DEFINE_RMAP_EVENT(xfs_rmap_convert);
> +DEFINE_RMAP_EVENT(xfs_rmap_convert_done);
> +DEFINE_AG_ERROR_EVENT(xfs_rmap_convert_error);
> +DEFINE_AG_ERROR_EVENT(xfs_rmap_convert_state);
>  
>  DECLARE_EVENT_CLASS(xfs_rmapbt_class,
>  	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
> @@ -2551,6 +2555,8 @@ DEFINE_AG_ERROR_EVENT(xfs_rmapbt_delete_error);
>  DEFINE_AG_ERROR_EVENT(xfs_rmapbt_update_error);
>  DEFINE_RMAPBT_EVENT(xfs_rmap_lookup_le_range_result);
>  DEFINE_RMAPBT_EVENT(xfs_rmap_map_gtrec);
> +DEFINE_RMAPBT_EVENT(xfs_rmap_convert_gtrec);
> +DEFINE_RMAPBT_EVENT(xfs_rmap_find_left_neighbor_result);
>  
>  #endif /* _TRACE_XFS_H */
>  
> 


* Re: [PATCH 023/119] xfs: introduce rmap btree definitions
  2016-06-17  1:20 ` [PATCH 023/119] xfs: introduce rmap btree definitions Darrick J. Wong
@ 2016-06-30 17:32   ` Brian Foster
  0 siblings, 0 replies; 236+ messages in thread
From: Brian Foster @ 2016-06-30 17:32 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, vishal.l.verma, Dave Chinner, xfs

On Thu, Jun 16, 2016 at 06:20:19PM -0700, Darrick J. Wong wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Add new per-ag rmap btree definitions to the per-ag structures. The
> rmap btree will sit in the empty slots on disk after the free space
> btrees, and hence form a part of the array of space management
> btrees. This requires the definition of the btree to be contiguous
> with the free space btrees.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Signed-off-by: Dave Chinner <david@fromorbit.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/libxfs/xfs_alloc.c  |    6 ++++++
>  fs/xfs/libxfs/xfs_btree.c  |    4 ++--
>  fs/xfs/libxfs/xfs_btree.h  |    3 +++
>  fs/xfs/libxfs/xfs_format.h |   22 +++++++++++++++++-----
>  fs/xfs/libxfs/xfs_types.h  |    4 ++--
>  5 files changed, 30 insertions(+), 9 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index 56c8690..b61e9c6 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -2272,6 +2272,10 @@ xfs_agf_verify(
>  	    be32_to_cpu(agf->agf_levels[XFS_BTNUM_CNT]) > XFS_BTREE_MAXLEVELS)
>  		return false;
>  
> +	if (xfs_sb_version_hasrmapbt(&mp->m_sb) &&
> +	    be32_to_cpu(agf->agf_levels[XFS_BTNUM_RMAP]) > XFS_BTREE_MAXLEVELS)
> +		return false;
> +
>  	/*
>  	 * during growfs operations, the perag is not fully initialised,
>  	 * so we can't use it for any useful checking. growfs ensures we can't
> @@ -2403,6 +2407,8 @@ xfs_alloc_read_agf(
>  			be32_to_cpu(agf->agf_levels[XFS_BTNUM_BNOi]);
>  		pag->pagf_levels[XFS_BTNUM_CNTi] =
>  			be32_to_cpu(agf->agf_levels[XFS_BTNUM_CNTi]);
> +		pag->pagf_levels[XFS_BTNUM_RMAPi] =
> +			be32_to_cpu(agf->agf_levels[XFS_BTNUM_RMAPi]);
>  		spin_lock_init(&pag->pagb_lock);
>  		pag->pagb_count = 0;
>  #ifdef __KERNEL__
> diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> index 5b3743a..624b572 100644
> --- a/fs/xfs/libxfs/xfs_btree.c
> +++ b/fs/xfs/libxfs/xfs_btree.c
> @@ -44,9 +44,9 @@ kmem_zone_t	*xfs_btree_cur_zone;
>   * Btree magic numbers.
>   */
>  static const __uint32_t xfs_magics[2][XFS_BTNUM_MAX] = {
> -	{ XFS_ABTB_MAGIC, XFS_ABTC_MAGIC, XFS_BMAP_MAGIC, XFS_IBT_MAGIC,
> +	{ XFS_ABTB_MAGIC, XFS_ABTC_MAGIC, 0, XFS_BMAP_MAGIC, XFS_IBT_MAGIC,
>  	  XFS_FIBT_MAGIC },
> -	{ XFS_ABTB_CRC_MAGIC, XFS_ABTC_CRC_MAGIC,
> +	{ XFS_ABTB_CRC_MAGIC, XFS_ABTC_CRC_MAGIC, XFS_RMAP_CRC_MAGIC,
>  	  XFS_BMAP_CRC_MAGIC, XFS_IBT_CRC_MAGIC, XFS_FIBT_CRC_MAGIC }
>  };
>  #define xfs_btree_magic(cur) \
> diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> index 7483cac..202fdd3 100644
> --- a/fs/xfs/libxfs/xfs_btree.h
> +++ b/fs/xfs/libxfs/xfs_btree.h
> @@ -63,6 +63,7 @@ union xfs_btree_rec {
>  #define	XFS_BTNUM_BMAP	((xfs_btnum_t)XFS_BTNUM_BMAPi)
>  #define	XFS_BTNUM_INO	((xfs_btnum_t)XFS_BTNUM_INOi)
>  #define	XFS_BTNUM_FINO	((xfs_btnum_t)XFS_BTNUM_FINOi)
> +#define	XFS_BTNUM_RMAP	((xfs_btnum_t)XFS_BTNUM_RMAPi)
>  
>  /*
>   * For logging record fields.
> @@ -95,6 +96,7 @@ do {    \
>  	case XFS_BTNUM_BMAP: __XFS_BTREE_STATS_INC(__mp, bmbt, stat); break; \
>  	case XFS_BTNUM_INO: __XFS_BTREE_STATS_INC(__mp, ibt, stat); break; \
>  	case XFS_BTNUM_FINO: __XFS_BTREE_STATS_INC(__mp, fibt, stat); break; \
> +	case XFS_BTNUM_RMAP: break;	\
>  	case XFS_BTNUM_MAX: ASSERT(0); __mp = __mp /* fucking gcc */ ; break; \
>  	}       \
>  } while (0)
> @@ -115,6 +117,7 @@ do {    \
>  		__XFS_BTREE_STATS_ADD(__mp, ibt, stat, val); break; \
>  	case XFS_BTNUM_FINO:	\
>  		__XFS_BTREE_STATS_ADD(__mp, fibt, stat, val); break; \
> +	case XFS_BTNUM_RMAP: break; \
>  	case XFS_BTNUM_MAX: ASSERT(0); __mp = __mp /* fucking gcc */ ; break; \
>  	}       \
>  } while (0)
> diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> index ba528b3..8ca4a3d 100644
> --- a/fs/xfs/libxfs/xfs_format.h
> +++ b/fs/xfs/libxfs/xfs_format.h
> @@ -455,6 +455,7 @@ xfs_sb_has_compat_feature(
>  }
>  
>  #define XFS_SB_FEAT_RO_COMPAT_FINOBT   (1 << 0)		/* free inode btree */
> +#define XFS_SB_FEAT_RO_COMPAT_RMAPBT   (1 << 1)		/* reverse map btree */
>  #define XFS_SB_FEAT_RO_COMPAT_ALL \
>  		(XFS_SB_FEAT_RO_COMPAT_FINOBT)
>  #define XFS_SB_FEAT_RO_COMPAT_UNKNOWN	~XFS_SB_FEAT_RO_COMPAT_ALL
> @@ -538,6 +539,12 @@ static inline bool xfs_sb_version_hasmetauuid(struct xfs_sb *sbp)
>  		(sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_META_UUID);
>  }
>  
> +static inline bool xfs_sb_version_hasrmapbt(struct xfs_sb *sbp)
> +{
> +	return (XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5) &&
> +		(sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_RMAPBT);
> +}
> +
>  /*
>   * end of superblock version macros
>   */
> @@ -598,10 +605,10 @@ xfs_is_quota_inode(struct xfs_sb *sbp, xfs_ino_t ino)
>  #define	XFS_AGI_GOOD_VERSION(v)	((v) == XFS_AGI_VERSION)
>  
>  /*
> - * Btree number 0 is bno, 1 is cnt.  This value gives the size of the
> + * Btree number 0 is bno, 1 is cnt, 2 is rmap. This value gives the size of the
>   * arrays below.
>   */
> -#define	XFS_BTNUM_AGF	((int)XFS_BTNUM_CNTi + 1)
> +#define	XFS_BTNUM_AGF	((int)XFS_BTNUM_RMAPi + 1)
>  
>  /*
>   * The second word of agf_levels in the first a.g. overlaps the EFS
> @@ -618,12 +625,10 @@ typedef struct xfs_agf {
>  	__be32		agf_seqno;	/* sequence # starting from 0 */
>  	__be32		agf_length;	/* size in blocks of a.g. */
>  	/*
> -	 * Freespace information
> +	 * Freespace and rmap information
>  	 */
>  	__be32		agf_roots[XFS_BTNUM_AGF];	/* root blocks */
> -	__be32		agf_spare0;	/* spare field */
>  	__be32		agf_levels[XFS_BTNUM_AGF];	/* btree levels */
> -	__be32		agf_spare1;	/* spare field */
>  
>  	__be32		agf_flfirst;	/* first freelist block's index */
>  	__be32		agf_fllast;	/* last freelist block's index */
> @@ -1307,6 +1312,13 @@ typedef __be32 xfs_inobt_ptr_t;
>  #define	XFS_FIBT_BLOCK(mp)		((xfs_agblock_t)(XFS_IBT_BLOCK(mp) + 1))
>  
>  /*
> + * Reverse mapping btree format definitions
> + *
> + * There is a btree for the reverse map per allocation group
> + */
> +#define	XFS_RMAP_CRC_MAGIC	0x524d4233	/* 'RMB3' */
> +
> +/*
>   * The first data block of an AG depends on whether the filesystem was formatted
>   * with the finobt feature. If so, account for the finobt reserved root btree
>   * block.
> diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
> index f0d145a..da87796 100644
> --- a/fs/xfs/libxfs/xfs_types.h
> +++ b/fs/xfs/libxfs/xfs_types.h
> @@ -111,8 +111,8 @@ typedef enum {
>  } xfs_lookup_t;
>  
>  typedef enum {
> -	XFS_BTNUM_BNOi, XFS_BTNUM_CNTi, XFS_BTNUM_BMAPi, XFS_BTNUM_INOi,
> -	XFS_BTNUM_FINOi, XFS_BTNUM_MAX
> +	XFS_BTNUM_BNOi, XFS_BTNUM_CNTi, XFS_BTNUM_RMAPi, XFS_BTNUM_BMAPi,
> +	XFS_BTNUM_INOi, XFS_BTNUM_FINOi, XFS_BTNUM_MAX
>  } xfs_btnum_t;
>  
>  struct xfs_name {
> 

* Re: [PATCH 024/119] xfs: add rmap btree stats infrastructure
  2016-06-17  1:20 ` [PATCH 024/119] xfs: add rmap btree stats infrastructure Darrick J. Wong
@ 2016-06-30 17:32   ` Brian Foster
  0 siblings, 0 replies; 236+ messages in thread
From: Brian Foster @ 2016-06-30 17:32 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, vishal.l.verma, Dave Chinner, xfs

On Thu, Jun 16, 2016 at 06:20:26PM -0700, Darrick J. Wong wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The rmap btree will require the same stats as all the other generic
> btrees, so add al the code for that now.

		 all

> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Signed-off-by: Dave Chinner <david@fromorbit.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/libxfs/xfs_btree.h |    5 +++--
>  fs/xfs/xfs_stats.c        |    1 +
>  fs/xfs/xfs_stats.h        |   18 +++++++++++++++++-
>  3 files changed, 21 insertions(+), 3 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> index 202fdd3..a29067c 100644
> --- a/fs/xfs/libxfs/xfs_btree.h
> +++ b/fs/xfs/libxfs/xfs_btree.h
> @@ -96,7 +96,7 @@ do {    \
>  	case XFS_BTNUM_BMAP: __XFS_BTREE_STATS_INC(__mp, bmbt, stat); break; \
>  	case XFS_BTNUM_INO: __XFS_BTREE_STATS_INC(__mp, ibt, stat); break; \
>  	case XFS_BTNUM_FINO: __XFS_BTREE_STATS_INC(__mp, fibt, stat); break; \
> -	case XFS_BTNUM_RMAP: break;	\
> +	case XFS_BTNUM_RMAP: __XFS_BTREE_STATS_INC(__mp, rmap, stat); break; \
>  	case XFS_BTNUM_MAX: ASSERT(0); __mp = __mp /* fucking gcc */ ; break; \
>  	}       \
>  } while (0)
> @@ -117,7 +117,8 @@ do {    \
>  		__XFS_BTREE_STATS_ADD(__mp, ibt, stat, val); break; \
>  	case XFS_BTNUM_FINO:	\
>  		__XFS_BTREE_STATS_ADD(__mp, fibt, stat, val); break; \
> -	case XFS_BTNUM_RMAP: break; \
> +	case XFS_BTNUM_RMAP:	\
> +		__XFS_BTREE_STATS_ADD(__mp, rmap, stat, val); break; \
>  	case XFS_BTNUM_MAX: ASSERT(0); __mp = __mp /* fucking gcc */ ; break; \
>  	}       \
>  } while (0)
> diff --git a/fs/xfs/xfs_stats.c b/fs/xfs/xfs_stats.c
> index 8686df6..f04f547 100644
> --- a/fs/xfs/xfs_stats.c
> +++ b/fs/xfs/xfs_stats.c
> @@ -61,6 +61,7 @@ int xfs_stats_format(struct xfsstats __percpu *stats, char *buf)
>  		{ "bmbt2",		XFSSTAT_END_BMBT_V2		},
>  		{ "ibt2",		XFSSTAT_END_IBT_V2		},
>  		{ "fibt2",		XFSSTAT_END_FIBT_V2		},
> +		{ "rmapbt",		XFSSTAT_END_RMAP_V2		},
>  		/* we print both series of quota information together */
>  		{ "qm",			XFSSTAT_END_QM			},
>  	};
> diff --git a/fs/xfs/xfs_stats.h b/fs/xfs/xfs_stats.h
> index 483b0ef..657865f 100644
> --- a/fs/xfs/xfs_stats.h
> +++ b/fs/xfs/xfs_stats.h
> @@ -197,7 +197,23 @@ struct xfsstats {
>  	__uint32_t		xs_fibt_2_alloc;
>  	__uint32_t		xs_fibt_2_free;
>  	__uint32_t		xs_fibt_2_moves;
> -#define XFSSTAT_END_XQMSTAT		(XFSSTAT_END_FIBT_V2+6)
> +#define XFSSTAT_END_RMAP_V2		(XFSSTAT_END_FIBT_V2+15)
> +	__uint32_t		xs_rmap_2_lookup;
> +	__uint32_t		xs_rmap_2_compare;
> +	__uint32_t		xs_rmap_2_insrec;
> +	__uint32_t		xs_rmap_2_delrec;
> +	__uint32_t		xs_rmap_2_newroot;
> +	__uint32_t		xs_rmap_2_killroot;
> +	__uint32_t		xs_rmap_2_increment;
> +	__uint32_t		xs_rmap_2_decrement;
> +	__uint32_t		xs_rmap_2_lshift;
> +	__uint32_t		xs_rmap_2_rshift;
> +	__uint32_t		xs_rmap_2_split;
> +	__uint32_t		xs_rmap_2_join;
> +	__uint32_t		xs_rmap_2_alloc;
> +	__uint32_t		xs_rmap_2_free;
> +	__uint32_t		xs_rmap_2_moves;
> +#define XFSSTAT_END_XQMSTAT		(XFSSTAT_END_RMAP_V2+6)
>  	__uint32_t		xs_qm_dqreclaims;
>  	__uint32_t		xs_qm_dqreclaim_misses;
>  	__uint32_t		xs_qm_dquot_dups;
> 

* Re: [PATCH 025/119] xfs: rmap btree add more reserved blocks
  2016-06-17  1:20 ` [PATCH 025/119] xfs: rmap btree add more reserved blocks Darrick J. Wong
@ 2016-06-30 17:32   ` Brian Foster
  0 siblings, 0 replies; 236+ messages in thread
From: Brian Foster @ 2016-06-30 17:32 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, vishal.l.verma, Dave Chinner, xfs

On Thu, Jun 16, 2016 at 06:20:32PM -0700, Darrick J. Wong wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> XFS reserves a small amount of space in each AG for the minimum
> number of free blocks needed for operation. Adding the rmap btree
> increases the number of reserved blocks, but it also increases the
> complexity of the calculation as the free inode btree is optional
> (like the rmapbt).
> 
> Rather than calculate the prealloc blocks every time we need to
> check it, add a function to calculate it at mount time and store it
> in the struct xfs_mount, and convert the XFS_PREALLOC_BLOCKS macro
> just to use the xfs-mount variable directly.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Signed-off-by: Dave Chinner <david@fromorbit.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/libxfs/xfs_alloc.c  |   11 +++++++++++
>  fs/xfs/libxfs/xfs_alloc.h  |    2 ++
>  fs/xfs/libxfs/xfs_format.h |    9 +--------
>  fs/xfs/xfs_fsops.c         |    6 +++---
>  fs/xfs/xfs_mount.c         |    2 ++
>  fs/xfs/xfs_mount.h         |    1 +
>  6 files changed, 20 insertions(+), 11 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index b61e9c6..fb00042 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -50,6 +50,17 @@ STATIC int xfs_alloc_ag_vextent_size(xfs_alloc_arg_t *);
>  STATIC int xfs_alloc_ag_vextent_small(xfs_alloc_arg_t *,
>  		xfs_btree_cur_t *, xfs_agblock_t *, xfs_extlen_t *, int *);
>  
> +xfs_extlen_t
> +xfs_prealloc_blocks(
> +	struct xfs_mount	*mp)
> +{
> +	if (xfs_sb_version_hasrmapbt(&mp->m_sb))
> +		return XFS_RMAP_BLOCK(mp) + 1;
> +	if (xfs_sb_version_hasfinobt(&mp->m_sb))
> +		return XFS_FIBT_BLOCK(mp) + 1;
> +	return XFS_IBT_BLOCK(mp) + 1;
> +}
> +
>  /*
>   * Lookup the record equal to [bno, len] in the btree given by cur.
>   */
> diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
> index cf268b2..20b54aa 100644
> --- a/fs/xfs/libxfs/xfs_alloc.h
> +++ b/fs/xfs/libxfs/xfs_alloc.h
> @@ -232,4 +232,6 @@ int xfs_alloc_fix_freelist(struct xfs_alloc_arg *args, int flags);
>  int xfs_free_extent_fix_freelist(struct xfs_trans *tp, xfs_agnumber_t agno,
>  		struct xfs_buf **agbp);
>  
> +xfs_extlen_t xfs_prealloc_blocks(struct xfs_mount *mp);
> +
>  #endif	/* __XFS_ALLOC_H__ */
> diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> index 8ca4a3d..b5b0901 100644
> --- a/fs/xfs/libxfs/xfs_format.h
> +++ b/fs/xfs/libxfs/xfs_format.h
> @@ -1318,18 +1318,11 @@ typedef __be32 xfs_inobt_ptr_t;
>   */
>  #define	XFS_RMAP_CRC_MAGIC	0x524d4233	/* 'RMB3' */
>  
> -/*
> - * The first data block of an AG depends on whether the filesystem was formatted
> - * with the finobt feature. If so, account for the finobt reserved root btree
> - * block.
> - */
> -#define XFS_PREALLOC_BLOCKS(mp) \
> +#define	XFS_RMAP_BLOCK(mp) \
>  	(xfs_sb_version_hasfinobt(&((mp)->m_sb)) ? \
>  	 XFS_FIBT_BLOCK(mp) + 1 : \
>  	 XFS_IBT_BLOCK(mp) + 1)
>  
> -
> -
>  /*
>   * BMAP Btree format definitions
>   *
> diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> index 064fce1..62162d4 100644
> --- a/fs/xfs/xfs_fsops.c
> +++ b/fs/xfs/xfs_fsops.c
> @@ -243,7 +243,7 @@ xfs_growfs_data_private(
>  		agf->agf_flfirst = cpu_to_be32(1);
>  		agf->agf_fllast = 0;
>  		agf->agf_flcount = 0;
> -		tmpsize = agsize - XFS_PREALLOC_BLOCKS(mp);
> +		tmpsize = agsize - mp->m_ag_prealloc_blocks;
>  		agf->agf_freeblks = cpu_to_be32(tmpsize);
>  		agf->agf_longest = cpu_to_be32(tmpsize);
>  		if (xfs_sb_version_hascrc(&mp->m_sb))
> @@ -340,7 +340,7 @@ xfs_growfs_data_private(
>  						agno, 0);
>  
>  		arec = XFS_ALLOC_REC_ADDR(mp, XFS_BUF_TO_BLOCK(bp), 1);
> -		arec->ar_startblock = cpu_to_be32(XFS_PREALLOC_BLOCKS(mp));
> +		arec->ar_startblock = cpu_to_be32(mp->m_ag_prealloc_blocks);
>  		arec->ar_blockcount = cpu_to_be32(
>  			agsize - be32_to_cpu(arec->ar_startblock));
>  
> @@ -369,7 +369,7 @@ xfs_growfs_data_private(
>  						agno, 0);
>  
>  		arec = XFS_ALLOC_REC_ADDR(mp, XFS_BUF_TO_BLOCK(bp), 1);
> -		arec->ar_startblock = cpu_to_be32(XFS_PREALLOC_BLOCKS(mp));
> +		arec->ar_startblock = cpu_to_be32(mp->m_ag_prealloc_blocks);
>  		arec->ar_blockcount = cpu_to_be32(
>  			agsize - be32_to_cpu(arec->ar_startblock));
>  		nfree += be32_to_cpu(arec->ar_blockcount);
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index bf63682..b4153f0 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -231,6 +231,8 @@ xfs_initialize_perag(
>  
>  	if (maxagi)
>  		*maxagi = index;
> +
> +	mp->m_ag_prealloc_blocks = xfs_prealloc_blocks(mp);
>  	return 0;
>  
>  out_unwind:
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index c1b798c..0537b1f 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -119,6 +119,7 @@ typedef struct xfs_mount {
>  	uint			m_ag_maxlevels;	/* XFS_AG_MAXLEVELS */
>  	uint			m_bm_maxlevels[2]; /* XFS_BM_MAXLEVELS */
>  	uint			m_in_maxlevels;	/* max inobt btree levels. */
> +	xfs_extlen_t		m_ag_prealloc_blocks; /* reserved ag blocks */
>  	struct radix_tree_root	m_perag_tree;	/* per-ag accounting info */
>  	spinlock_t		m_perag_lock;	/* lock for m_perag_tree */
>  	struct mutex		m_growlock;	/* growfs mutex */
> 

* Re: [PATCH 026/119] xfs: add owner field to extent allocation and freeing
  2016-06-17  1:20 ` [PATCH 026/119] xfs: add owner field to extent allocation and freeing Darrick J. Wong
@ 2016-07-06  4:01   ` Dave Chinner
  2016-07-06  6:44     ` Darrick J. Wong
  2016-07-07 15:12   ` Brian Foster
  1 sibling, 1 reply; 236+ messages in thread
From: Dave Chinner @ 2016-07-06  4:01 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, vishal.l.verma, xfs, Dave Chinner

On Thu, Jun 16, 2016 at 06:20:39PM -0700, Darrick J. Wong wrote:
> For the rmap btree to work, we have to feed the extent owner
> information to the allocation and freeing functions. This
> information is what will end up in the rmap btree that tracks
> allocated extents. While we technically don't need the owner
> information when freeing extents, passing it allows us to validate
> that the extent we are removing from the rmap btree actually
> belonged to the owner we expected it to belong to.
....

> --- a/fs/xfs/libxfs/xfs_format.h
> +++ b/fs/xfs/libxfs/xfs_format.h
> @@ -1318,6 +1318,71 @@ typedef __be32 xfs_inobt_ptr_t;
>   */
>  #define	XFS_RMAP_CRC_MAGIC	0x524d4233	/* 'RMB3' */
>  
> +/*
> + * Ownership info for an extent.  This is used to create reverse-mapping
> + * entries.
> + */
> +#define XFS_OWNER_INFO_ATTR_FORK	(1 << 0)
> +#define XFS_OWNER_INFO_BMBT_BLOCK	(1 << 1)
> +struct xfs_owner_info {
> +	uint64_t		oi_owner;
> +	xfs_fileoff_t		oi_offset;
> +	unsigned int		oi_flags;
> +};
> +
> +static inline void
> +xfs_rmap_ag_owner(
> +	struct xfs_owner_info	*oi,
> +	uint64_t		owner)
> +{
> +	oi->oi_owner = owner;
> +	oi->oi_offset = 0;
> +	oi->oi_flags = 0;
> +}
> +
> +static inline void
> +xfs_rmap_ino_bmbt_owner(
> +	struct xfs_owner_info	*oi,
> +	xfs_ino_t		ino,
> +	int			whichfork)
> +{
> +	oi->oi_owner = ino;
> +	oi->oi_offset = 0;
> +	oi->oi_flags = XFS_OWNER_INFO_BMBT_BLOCK;
> +	if (whichfork == XFS_ATTR_FORK)
> +		oi->oi_flags |= XFS_OWNER_INFO_ATTR_FORK;
> +}
> +
> +static inline void
> +xfs_rmap_ino_owner(
> +	struct xfs_owner_info	*oi,
> +	xfs_ino_t		ino,
> +	int			whichfork,
> +	xfs_fileoff_t		offset)
> +{
> +	oi->oi_owner = ino;
> +	oi->oi_offset = offset;
> +	oi->oi_flags = 0;
> +	if (whichfork == XFS_ATTR_FORK)
> +		oi->oi_flags |= XFS_OWNER_INFO_ATTR_FORK;
> +}

One of the things we've avoided doing so far is putting functions
like this into xfs_format.h. xfs_format.h is really just for the
on-disk format definitions, not the code to access/pack/unpack it.
Hence I think these sorts of functions need to be moved to
xfs_rmap.h....

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 028/119] xfs: define the on-disk rmap btree format
  2016-06-17  1:20 ` [PATCH 028/119] xfs: define the on-disk rmap btree format Darrick J. Wong
@ 2016-07-06  4:05   ` Dave Chinner
  2016-07-06  6:44     ` Darrick J. Wong
  2016-07-07 18:41   ` Brian Foster
  1 sibling, 1 reply; 236+ messages in thread
From: Dave Chinner @ 2016-07-06  4:05 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, vishal.l.verma, xfs, Dave Chinner

On Thu, Jun 16, 2016 at 06:20:52PM -0700, Darrick J. Wong wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Now we have all the surrounding call infrastructure in place, we can
> start filling out the rmap btree implementation. Start with the
> on-disk btree format; add everything needed to read, write and
> manipulate rmap btree blocks. This prepares the way for adding the
> btree operations implementation.
> 
> [darrick: record owner and offset info in rmap btree]
> [darrick: fork, bmbt and unwritten state in rmap btree]
> [darrick: flags are a separate field in xfs_rmap_irec]
> [darrick: calculate maxlevels separately]
> [darrick: move the 'unwritten' bit into unused parts of rm_offset]
.....
> diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> index 97f354f..6efc7a3 100644
> --- a/fs/xfs/libxfs/xfs_format.h
> +++ b/fs/xfs/libxfs/xfs_format.h
> @@ -1383,11 +1383,151 @@ xfs_rmap_ino_owner(
>  #define XFS_RMAP_OWN_INODES	(-7ULL)	/* Inode chunk */
>  #define XFS_RMAP_OWN_MIN	(-8ULL) /* guard */
>  
> +#define XFS_RMAP_NON_INODE_OWNER(owner)	(!!((owner) & (1ULL << 63)))
> +
> +/*
> + * Data record structure
> + */
> +struct xfs_rmap_rec {
> +	__be32		rm_startblock;	/* extent start block */
> +	__be32		rm_blockcount;	/* extent length */
> +	__be64		rm_owner;	/* extent owner */
> +	__be64		rm_offset;	/* offset within the owner */
> +};
> +
> +/*
> + * rmap btree record
> + *  rm_offset:63 is the attribute fork flag
> + *  rm_offset:62 is the bmbt block flag
> + *  rm_offset:61 is the unwritten extent flag (same as l0:63 in bmbt)
> + *  rm_offset:54-60 aren't used and should be zero
> + *  rm_offset:0-53 is the block offset within the inode
> + */
> +#define XFS_RMAP_OFF_ATTR_FORK	((__uint64_t)1ULL << 63)
> +#define XFS_RMAP_OFF_BMBT_BLOCK	((__uint64_t)1ULL << 62)
> +#define XFS_RMAP_OFF_UNWRITTEN	((__uint64_t)1ULL << 61)
> +
> +#define XFS_RMAP_LEN_MAX	((__uint32_t)~0U)
> +#define XFS_RMAP_OFF_FLAGS	(XFS_RMAP_OFF_ATTR_FORK | \
> +				 XFS_RMAP_OFF_BMBT_BLOCK | \
> +				 XFS_RMAP_OFF_UNWRITTEN)
> +#define XFS_RMAP_OFF_MASK	((__uint64_t)0x3FFFFFFFFFFFFFULL)
> +
> +#define XFS_RMAP_OFF(off)		((off) & XFS_RMAP_OFF_MASK)
> +
> +#define XFS_RMAP_IS_BMBT_BLOCK(off)	(!!((off) & XFS_RMAP_OFF_BMBT_BLOCK))
> +#define XFS_RMAP_IS_ATTR_FORK(off)	(!!((off) & XFS_RMAP_OFF_ATTR_FORK))
> +#define XFS_RMAP_IS_UNWRITTEN(off)	(!!((off) & XFS_RMAP_OFF_UNWRITTEN))
> +
> +#define RMAPBT_STARTBLOCK_BITLEN	32
> +#define RMAPBT_BLOCKCOUNT_BITLEN	32
> +#define RMAPBT_OWNER_BITLEN		64
> +#define RMAPBT_ATTRFLAG_BITLEN		1
> +#define RMAPBT_BMBTFLAG_BITLEN		1
> +#define RMAPBT_EXNTFLAG_BITLEN		1
> +#define RMAPBT_UNUSED_OFFSET_BITLEN	7
> +#define RMAPBT_OFFSET_BITLEN		54
> +
> +#define XFS_RMAP_ATTR_FORK		(1 << 0)
> +#define XFS_RMAP_BMBT_BLOCK		(1 << 1)
> +#define XFS_RMAP_UNWRITTEN		(1 << 2)
> +#define XFS_RMAP_KEY_FLAGS		(XFS_RMAP_ATTR_FORK | \
> +					 XFS_RMAP_BMBT_BLOCK)
> +#define XFS_RMAP_REC_FLAGS		(XFS_RMAP_UNWRITTEN)
> +struct xfs_rmap_irec {
> +	xfs_agblock_t	rm_startblock;	/* extent start block */
> +	xfs_extlen_t	rm_blockcount;	/* extent length */
> +	__uint64_t	rm_owner;	/* extent owner */
> +	__uint64_t	rm_offset;	/* offset within the owner */
> +	unsigned int	rm_flags;	/* state flags */
> +};

Same as my last comment about xfs_format.h. Up to here is all good -
they are format definitions. But these:

> +
> +static inline __u64
> +xfs_rmap_irec_offset_pack(
> +	const struct xfs_rmap_irec	*irec)
> +{
> +	__u64			x;
> +
> +	x = XFS_RMAP_OFF(irec->rm_offset);
> +	if (irec->rm_flags & XFS_RMAP_ATTR_FORK)
> +		x |= XFS_RMAP_OFF_ATTR_FORK;
> +	if (irec->rm_flags & XFS_RMAP_BMBT_BLOCK)
> +		x |= XFS_RMAP_OFF_BMBT_BLOCK;
> +	if (irec->rm_flags & XFS_RMAP_UNWRITTEN)
> +		x |= XFS_RMAP_OFF_UNWRITTEN;
> +	return x;
> +}
> +
> +static inline int
> +xfs_rmap_irec_offset_unpack(
> +	__u64			offset,
> +	struct xfs_rmap_irec	*irec)
> +{
> +	if (offset & ~(XFS_RMAP_OFF_MASK | XFS_RMAP_OFF_FLAGS))
> +		return -EFSCORRUPTED;
> +	irec->rm_offset = XFS_RMAP_OFF(offset);
> +	if (offset & XFS_RMAP_OFF_ATTR_FORK)
> +		irec->rm_flags |= XFS_RMAP_ATTR_FORK;
> +	if (offset & XFS_RMAP_OFF_BMBT_BLOCK)
> +		irec->rm_flags |= XFS_RMAP_BMBT_BLOCK;
> +	if (offset & XFS_RMAP_OFF_UNWRITTEN)
> +		irec->rm_flags |= XFS_RMAP_UNWRITTEN;
> +	return 0;
> +}

And these:

> +static inline void
> +xfs_owner_info_unpack(
> +	struct xfs_owner_info	*oinfo,
> +	uint64_t		*owner,
> +	uint64_t		*offset,
> +	unsigned int		*flags)
> +{
> +	unsigned int		r = 0;
> +
> +	*owner = oinfo->oi_owner;
> +	*offset = oinfo->oi_offset;
> +	if (oinfo->oi_flags & XFS_OWNER_INFO_ATTR_FORK)
> +		r |= XFS_RMAP_ATTR_FORK;
> +	if (oinfo->oi_flags & XFS_OWNER_INFO_BMBT_BLOCK)
> +		r |= XFS_RMAP_BMBT_BLOCK;
> +	*flags = r;
> +}
> +
> +static inline void
> +xfs_owner_info_pack(
> +	struct xfs_owner_info	*oinfo,
> +	uint64_t		owner,
> +	uint64_t		offset,
> +	unsigned int		flags)
> +{
> +	oinfo->oi_owner = owner;
> +	oinfo->oi_offset = XFS_RMAP_OFF(offset);
> +	oinfo->oi_flags = 0;
> +	if (flags & XFS_RMAP_ATTR_FORK)
> +		oinfo->oi_flags |= XFS_OWNER_INFO_ATTR_FORK;
> +	if (flags & XFS_RMAP_BMBT_BLOCK)
> +		oinfo->oi_flags |= XFS_OWNER_INFO_BMBT_BLOCK;
> +}
> +

really belong in xfs_rmap.h or xfs_rmap_btree.h.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 013/119] xfs: support btrees with overlapping intervals for keys
  2016-06-17  1:19 ` [PATCH 013/119] xfs: support btrees with overlapping intervals for keys Darrick J. Wong
  2016-06-22 15:17   ` Brian Foster
@ 2016-07-06  4:59   ` Dave Chinner
  2016-07-06  8:09     ` Darrick J. Wong
  1 sibling, 1 reply; 236+ messages in thread
From: Dave Chinner @ 2016-07-06  4:59 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

On Thu, Jun 16, 2016 at 06:19:15PM -0700, Darrick J. Wong wrote:
> On a filesystem with both reflink and reverse mapping enabled, it's
> possible to have multiple rmap records referring to the same blocks on
> disk.  When overlapping intervals are possible, querying a classic
> btree to find all records intersecting a given interval is inefficient
> because we cannot use the left side of the search interval to filter
> out non-matching records the same way that we can use the existing
> btree key to filter out records coming after the right side of the
> search interval.  This will become important once we want to use the
> rmap btree to rebuild BMBTs, or implement the (future) fsmap ioctl.

I thought I didn't have much to say about this, but then I started
writing down all my questions.....

> @@ -445,6 +474,17 @@ static inline size_t xfs_btree_block_len(struct xfs_btree_cur *cur)
>  	return XFS_BTREE_SBLOCK_LEN;
>  }
>  
> +/* Return size of btree block keys for this btree instance. */
> +static inline size_t xfs_btree_key_len(struct xfs_btree_cur *cur)
> +{
> +	size_t			len;
> +
> +	len = cur->bc_ops->key_len;
> +	if (cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING)
> +		len *= 2;
> +	return len;
> +}

So there's magic here. Why can't the cur->bc_ops->key_len be set
appropriately when it is initialised?

>  /*
>   * Return size of btree block pointers for this btree instance.
>   */
> @@ -475,7 +515,19 @@ xfs_btree_key_offset(
>  	int			n)
>  {
>  	return xfs_btree_block_len(cur) +
> -		(n - 1) * cur->bc_ops->key_len;
> +		(n - 1) * xfs_btree_key_len(cur);
> +}

because this effectively means the key length and offsets for
a btree with the XFS_BTREE_OPS_OVERLAPPING flag set are *always*
cur->bc_ops->key_len * 2.

> +
> +/*
> + * Calculate offset of the n-th high key in a btree block.
> + */
> +STATIC size_t
> +xfs_btree_high_key_offset(
> +	struct xfs_btree_cur	*cur,
> +	int			n)
> +{
> +	return xfs_btree_block_len(cur) +
> +		(n - 1) * xfs_btree_key_len(cur) + cur->bc_ops->key_len;
>  }

And this is the only case where we use a "half key" length to pull
the offset of the high key. Wouldn't it be better to be explicit
about the high key offset rather than encode magic numbers to infer
that the "overlapping key is really two key lengths with the high
key at plus one key len". IMO, this is better:

xfs_btree_high_key_offset(
	struct xfs_btree_cur	*cur,
	int			n)
{
	ASSERT(cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING);
	return xfs_btree_block_len(cur) +
		 (n - 1) * cur->bc_ops->key_len +
		 offset_of(struct xfs_btree_double_key, high);
}

It means there are much fewer code changes needed for supporting
the XFS_BTREE_OPS_OVERLAPPING flag, too.

> +STATIC void
> +xfs_btree_find_leaf_keys(
> +	struct xfs_btree_cur	*cur,
> +	struct xfs_btree_block	*block,
> +	union xfs_btree_key	*low,
> +	union xfs_btree_key	*high)
> +{
> +	int			n;
> +	union xfs_btree_rec	*rec;
> +	union xfs_btree_key	max_hkey;
> +	union xfs_btree_key	hkey;
> +
> +	rec = xfs_btree_rec_addr(cur, 1, block);
> +	cur->bc_ops->init_key_from_rec(low, rec);
> +
> +	if (!(cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING))
> +		return;

When I see conditionals like this, it makes me want to add
a btree specific method. i.e.

	bc_ops->find_leaf_keys()
	bc_ops->find_node_keys()

and we hook them up to generic functions that don't require
checks against feature flags.

i.e:

xfs_btree_find_leaf_low_key()
{
	rec = xfs_btree_rec_addr(cur, 1, block);
	cur->bc_ops->init_key_from_rec(low, rec);
}

xfs_btree_find_leaf_low_high_keys()
{
	xfs_btree_find_leaf_low_key();

	/*
	 * high key finding code here, which is the same function
	 * for both keys and pointers
	 */
}

.....

> +/*
> + * Update parental low & high keys from some block all the way back to the
> + * root of the btree.
> + */
> +STATIC int
> +__xfs_btree_updkeys(

I kept getting confused by xfs_btree_updkey() and
xfs_btree_updkeys(). Can we choose a better name for this parent key
update?


> +	struct xfs_btree_cur	*cur,
> +	int			level,
> +	struct xfs_btree_block	*block,
> +	struct xfs_buf		*bp0,
> +	bool			force_all)
> +{
> +	union xfs_btree_key	lkey;	/* keys from current level */
> +	union xfs_btree_key	hkey;
> +	union xfs_btree_key	*nlkey;	/* keys from the next level up */
> +	union xfs_btree_key	*nhkey;
> +	struct xfs_buf		*bp;
> +	int			ptr = -1;
> +
> +	if (!(cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING))
> +		return 0;

And, again, it's probably better to use a btree op callout for
this, especially when you've added this to xfs_btree_updkey():

> @@ -1893,6 +2132,9 @@ xfs_btree_updkey(
>  	union xfs_btree_key	*kp;
>  	int			ptr;
>  
> +	if (cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING)
> +		return 0;
> +
>  	XFS_BTREE_TRACE_CURSOR(cur, XBT_ENTRY);
>  	XFS_BTREE_TRACE_ARGIK(cur, level, keyp);

i.e. one or the other "updkey" does something, but not both.
Extremely confusing to see both called but then only one does
anything.

[back to __xfs_btree_updkeys()]

> +
> +	if (level + 1 >= cur->bc_nlevels)
> +		return 0;
> +
> +	trace_xfs_btree_updkeys(cur, level, bp0);
> +
> +	if (level == 0)
> +		xfs_btree_find_leaf_keys(cur, block, &lkey, &hkey);
> +	else
> +		xfs_btree_find_node_keys(cur, block, &lkey, &hkey);

And this code fragment is repeated in many places, so i think a
helper is warranted for this. That also reminds me - the "find" in
the name is confusing - it's not "finding" as much as it is
"getting" the low and high key values from the current block.

It's especially confusing when you do this:

> @@ -1970,7 +2212,8 @@ xfs_btree_update(
>  					    ptr, LASTREC_UPDATE);
>  	}
>  
> -	/* Updating first rec in leaf. Pass new key value up to our parent. */
> +	/* Pass new key value up to our parent. */
> +	xfs_btree_updkeys(cur, 0);
>  	if (ptr == 1) {
>  		union xfs_btree_key	key;

You're throwing away the error from xfs_btree_updkeys() at, AFAICT,
all call sites. This update can fail, so I suspect this needs to
check and handle the update.

>  
> @@ -2149,7 +2392,9 @@ xfs_btree_lshift(
>  		rkp = &key;
>  	}
>  
> -	/* Update the parent key values of right. */
> +	/* Update the parent key values of left and right. */
> +	xfs_btree_sibling_updkeys(cur, level, XFS_BB_LEFTSIB, left, lbp);
> +	xfs_btree_updkeys(cur, level);
>  	error = xfs_btree_updkey(cur, rkp, level + 1);
>  	if (error)
>  		goto error0;

Remember what I said above about xfs_btree_updkeys/xfs_btree_updkey
being confusing? Here we have 3 different key update functions, all
doing different stuff, taking different parameters. None of the code
is consistent in how these updates are done - they are all different
combinations of these functions, so I'm not sure how we are supposed
to verify the correct updates are being done now or in the future.

How can we hide this complexity from the generic btree code?

> @@ -2321,6 +2566,9 @@ xfs_btree_rshift(
>  	if (error)
>  		goto error1;
>  
> +	/* Update left and right parent pointers */
> +	xfs_btree_updkeys(cur, level);
> +	xfs_btree_updkeys(tcur, level);
>  	error = xfs_btree_updkey(tcur, rkp, level + 1);
>  	if (error)
>  		goto error1;

Different.

> @@ -2499,6 +2746,10 @@ __xfs_btree_split(
>  		xfs_btree_set_sibling(cur, rrblock, &rptr, XFS_BB_LEFTSIB);
>  		xfs_btree_log_block(cur, rrbp, XFS_BB_LEFTSIB);
>  	}
> +
> +	/* Update the left block's keys... */
> +	xfs_btree_updkeys(cur, level);

different...

> @@ -2806,27 +3057,27 @@ xfs_btree_new_root(
>  		bp = lbp;
>  		nptr = 2;
>  	}
> +
>  	/* Fill in the new block's btree header and log it. */
>  	xfs_btree_init_block_cur(cur, nbp, cur->bc_nlevels, 2);
>  	xfs_btree_log_block(cur, nbp, XFS_BB_ALL_BITS);
>  	ASSERT(!xfs_btree_ptr_is_null(cur, &lptr) &&
>  			!xfs_btree_ptr_is_null(cur, &rptr));
> -
>  	/* Fill in the key data in the new root. */
>  	if (xfs_btree_get_level(left) > 0) {
> -		xfs_btree_copy_keys(cur,
> +		xfs_btree_find_node_keys(cur, left,
>  				xfs_btree_key_addr(cur, 1, new),
> -				xfs_btree_key_addr(cur, 1, left), 1);
> -		xfs_btree_copy_keys(cur,
> +				xfs_btree_high_key_addr(cur, 1, new));
> +		xfs_btree_find_node_keys(cur, right,
>  				xfs_btree_key_addr(cur, 2, new),
> -				xfs_btree_key_addr(cur, 1, right), 1);
> +				xfs_btree_high_key_addr(cur, 2, new));

And this took me ages to work out - you replaced
xfs_btree_copy_keys() with xfs_btree_find_node_keys() which means
the fact that we are copying a key from one block to another has
been lost.  It wasn't until I realised that
xfs_btree_find_node_keys() was writing directly into the new block
record that it was an equivalent operation to a copy.

This is why I don't like the name xfs_btree_find_*_keys() - when it
is used like this it badly obfuscates what operation is being
performed - it's most definitely not a find operation being
performed. i.e. xfs_btree_copy_keys() documents the operation in
an obvious and straightforward manner, the new code takes time and
thought to decipher.

Perhaps you could move it all to inside xfs_btree_copy_keys(), so
the complexity is hidden from the higher-level btree manipulation
functions...

> +/* Copy a double key into a btree block. */
> +static void
> +xfs_btree_copy_double_keys(
> +	struct xfs_btree_cur	*cur,
> +	int			ptr,
> +	struct xfs_btree_block	*block,
> +	struct xfs_btree_double_key	*key)
> +{
> +	memcpy(xfs_btree_key_addr(cur, ptr, block), &key->low,
> +			cur->bc_ops->key_len);
> +
> +	if (cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING)
> +		memcpy(xfs_btree_high_key_addr(cur, ptr, block), &key->high,
> +				cur->bc_ops->key_len);
> +}

This should be located next to xfs_btree_copy_keys().

>  	/* If we inserted at the start of a block, update the parents' keys. */
> +	if (ncur && bp->b_bn != old_bn) {
> +		/*
> +		 * We just inserted into a new tree block, which means that
> +		 * the key for the block is in nkey, not the tree.
> +		 */
> +		if (level == 0)
> +			xfs_btree_find_leaf_keys(cur, block, &nkey.low,
> +					&nkey.high);
> +		else
> +			xfs_btree_find_node_keys(cur, block, &nkey.low,
> +					&nkey.high);
> +	} else {
> +		/* Updating the left block, do it the standard way. */
> +		xfs_btree_updkeys(cur, level);
> +	}
> +
>  	if (optr == 1) {
> -		error = xfs_btree_updkey(cur, key, level + 1);
> +		error = xfs_btree_updkey(cur, &key->low, level + 1);
>  		if (error)
>  			goto error0;
>  	}

This is another of those "huh, what" moments I had with all the
different _updkey functions....

> diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> index b99c018..a5ec6c7 100644
> --- a/fs/xfs/libxfs/xfs_btree.h
> +++ b/fs/xfs/libxfs/xfs_btree.h
> @@ -126,6 +126,9 @@ struct xfs_btree_ops {
>  	size_t	key_len;
>  	size_t	rec_len;
>  
> +	/* flags */
> +	uint	flags;
> +
.....
> @@ -182,6 +195,9 @@ struct xfs_btree_ops {
>  #endif
>  };
>  
> +/* btree ops flags */
> +#define XFS_BTREE_OPS_OVERLAPPING	(1<<0)	/* overlapping intervals */
> +

why did you put this in the struct btree_ops and not in the
btree cursor ->bc_flags field like all the other btree specific
customisations like:

/* cursor flags */
#define XFS_BTREE_LONG_PTRS             (1<<0)  /* pointers are 64bits long */
#define XFS_BTREE_ROOT_IN_INODE         (1<<1)  /* root may be variable size */
#define XFS_BTREE_LASTREC_UPDATE        (1<<2)  /* track last rec externally */
#define XFS_BTREE_CRC_BLOCKS            (1<<3)  /* uses extended btree blocks */

i.e. we should have all the structural/behavioural flags in the one
place, not split across different structures....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 026/119] xfs: add owner field to extent allocation and freeing
  2016-07-06  4:01   ` Dave Chinner
@ 2016-07-06  6:44     ` Darrick J. Wong
  0 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-07-06  6:44 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, vishal.l.verma, xfs, Dave Chinner

On Wed, Jul 06, 2016 at 02:01:20PM +1000, Dave Chinner wrote:
> On Thu, Jun 16, 2016 at 06:20:39PM -0700, Darrick J. Wong wrote:
> > For the rmap btree to work, we have to feed the extent owner
> > information to the allocation and freeing functions. This
> > information is what will end up in the rmap btree that tracks
> > allocated extents. While we technically don't need the owner
> > information when freeing extents, passing it allows us to validate
> > that the extent we are removing from the rmap btree actually
> > belonged to the owner we expected it to belong to.
> ....
> 
> > --- a/fs/xfs/libxfs/xfs_format.h
> > +++ b/fs/xfs/libxfs/xfs_format.h
> > @@ -1318,6 +1318,71 @@ typedef __be32 xfs_inobt_ptr_t;
> >   */
> >  #define	XFS_RMAP_CRC_MAGIC	0x524d4233	/* 'RMB3' */
> >  
> > +/*
> > + * Ownership info for an extent.  This is used to create reverse-mapping
> > + * entries.
> > + */
> > +#define XFS_OWNER_INFO_ATTR_FORK	(1 << 0)
> > +#define XFS_OWNER_INFO_BMBT_BLOCK	(1 << 1)
> > +struct xfs_owner_info {
> > +	uint64_t		oi_owner;
> > +	xfs_fileoff_t		oi_offset;
> > +	unsigned int		oi_flags;
> > +};
> > +
> > +static inline void
> > +xfs_rmap_ag_owner(
> > +	struct xfs_owner_info	*oi,
> > +	uint64_t		owner)
> > +{
> > +	oi->oi_owner = owner;
> > +	oi->oi_offset = 0;
> > +	oi->oi_flags = 0;
> > +}
> > +
> > +static inline void
> > +xfs_rmap_ino_bmbt_owner(
> > +	struct xfs_owner_info	*oi,
> > +	xfs_ino_t		ino,
> > +	int			whichfork)
> > +{
> > +	oi->oi_owner = ino;
> > +	oi->oi_offset = 0;
> > +	oi->oi_flags = XFS_OWNER_INFO_BMBT_BLOCK;
> > +	if (whichfork == XFS_ATTR_FORK)
> > +		oi->oi_flags |= XFS_OWNER_INFO_ATTR_FORK;
> > +}
> > +
> > +static inline void
> > +xfs_rmap_ino_owner(
> > +	struct xfs_owner_info	*oi,
> > +	xfs_ino_t		ino,
> > +	int			whichfork,
> > +	xfs_fileoff_t		offset)
> > +{
> > +	oi->oi_owner = ino;
> > +	oi->oi_offset = offset;
> > +	oi->oi_flags = 0;
> > +	if (whichfork == XFS_ATTR_FORK)
> > +		oi->oi_flags |= XFS_OWNER_INFO_ATTR_FORK;
> > +}
> 
> One of the things we've avoided doing so far is putting functions
> like this into xfs_format.h. xfs_format.h is really just for the
> on-disk format definitions, not the code to access/pack/unpack it.
> Hence I think these sorts of functions need to be moved to
> xfs_rmap.h....

Yes.

I've already split xfs_rmap_btree.h into xfs_rmap.h (high level rmap
functions) and xfs_rmap_btree.h (low level btree functions) for the
realtime rmapbt code split.  Won't be difficult to move these over
from xfs_format.h.

Speaking of which, I've pushed that along to the point that the
kernel-side implementation is at pre-alpha eatmydata stage, check
works well enough that xfstests doesn't explode, and we can collect
rt rmaps for checking and rebuilding of the tree.  I'll try to finish
that tomorrow.

--D

> 
> Cheers,
> 
> Dave.
> 
> -- 
> Dave Chinner
> david@fromorbit.com

* Re: [PATCH 028/119] xfs: define the on-disk rmap btree format
  2016-07-06  4:05   ` Dave Chinner
@ 2016-07-06  6:44     ` Darrick J. Wong
  0 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-07-06  6:44 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, vishal.l.verma, xfs, Dave Chinner

On Wed, Jul 06, 2016 at 02:05:55PM +1000, Dave Chinner wrote:
> On Thu, Jun 16, 2016 at 06:20:52PM -0700, Darrick J. Wong wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Now we have all the surrounding call infrastructure in place, we can
> > start filling out the rmap btree implementation. Start with the
> > on-disk btree format; add everything needed to read, write and
> > manipulate rmap btree blocks. This prepares the way for adding the
> > btree operations implementation.
> > 
> > [darrick: record owner and offset info in rmap btree]
> > [darrick: fork, bmbt and unwritten state in rmap btree]
> > [darrick: flags are a separate field in xfs_rmap_irec]
> > [darrick: calculate maxlevels separately]
> > [darrick: move the 'unwritten' bit into unused parts of rm_offset]
> .....
> > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > index 97f354f..6efc7a3 100644
> > --- a/fs/xfs/libxfs/xfs_format.h
> > +++ b/fs/xfs/libxfs/xfs_format.h
> > @@ -1383,11 +1383,151 @@ xfs_rmap_ino_owner(
> >  #define XFS_RMAP_OWN_INODES	(-7ULL)	/* Inode chunk */
> >  #define XFS_RMAP_OWN_MIN	(-8ULL) /* guard */
> >  
> > +#define XFS_RMAP_NON_INODE_OWNER(owner)	(!!((owner) & (1ULL << 63)))
> > +
> > +/*
> > + * Data record structure
> > + */
> > +struct xfs_rmap_rec {
> > +	__be32		rm_startblock;	/* extent start block */
> > +	__be32		rm_blockcount;	/* extent length */
> > +	__be64		rm_owner;	/* extent owner */
> > +	__be64		rm_offset;	/* offset within the owner */
> > +};
> > +
> > +/*
> > + * rmap btree record
> > + *  rm_offset:63 is the attribute fork flag
> > + *  rm_offset:62 is the bmbt block flag
> > + *  rm_offset:61 is the unwritten extent flag (same as l0:63 in bmbt)
> > + *  rm_offset:54-60 aren't used and should be zero
> > + *  rm_offset:0-53 is the block offset within the inode
> > + */
> > +#define XFS_RMAP_OFF_ATTR_FORK	((__uint64_t)1ULL << 63)
> > +#define XFS_RMAP_OFF_BMBT_BLOCK	((__uint64_t)1ULL << 62)
> > +#define XFS_RMAP_OFF_UNWRITTEN	((__uint64_t)1ULL << 61)
> > +
> > +#define XFS_RMAP_LEN_MAX	((__uint32_t)~0U)
> > +#define XFS_RMAP_OFF_FLAGS	(XFS_RMAP_OFF_ATTR_FORK | \
> > +				 XFS_RMAP_OFF_BMBT_BLOCK | \
> > +				 XFS_RMAP_OFF_UNWRITTEN)
> > +#define XFS_RMAP_OFF_MASK	((__uint64_t)0x3FFFFFFFFFFFFFULL)
> > +
> > +#define XFS_RMAP_OFF(off)		((off) & XFS_RMAP_OFF_MASK)
> > +
> > +#define XFS_RMAP_IS_BMBT_BLOCK(off)	(!!((off) & XFS_RMAP_OFF_BMBT_BLOCK))
> > +#define XFS_RMAP_IS_ATTR_FORK(off)	(!!((off) & XFS_RMAP_OFF_ATTR_FORK))
> > +#define XFS_RMAP_IS_UNWRITTEN(off)	(!!((off) & XFS_RMAP_OFF_UNWRITTEN))
> > +
> > +#define RMAPBT_STARTBLOCK_BITLEN	32
> > +#define RMAPBT_BLOCKCOUNT_BITLEN	32
> > +#define RMAPBT_OWNER_BITLEN		64
> > +#define RMAPBT_ATTRFLAG_BITLEN		1
> > +#define RMAPBT_BMBTFLAG_BITLEN		1
> > +#define RMAPBT_EXNTFLAG_BITLEN		1
> > +#define RMAPBT_UNUSED_OFFSET_BITLEN	7
> > +#define RMAPBT_OFFSET_BITLEN		54
> > +
> > +#define XFS_RMAP_ATTR_FORK		(1 << 0)
> > +#define XFS_RMAP_BMBT_BLOCK		(1 << 1)
> > +#define XFS_RMAP_UNWRITTEN		(1 << 2)
> > +#define XFS_RMAP_KEY_FLAGS		(XFS_RMAP_ATTR_FORK | \
> > +					 XFS_RMAP_BMBT_BLOCK)
> > +#define XFS_RMAP_REC_FLAGS		(XFS_RMAP_UNWRITTEN)
> > +struct xfs_rmap_irec {
> > +	xfs_agblock_t	rm_startblock;	/* extent start block */
> > +	xfs_extlen_t	rm_blockcount;	/* extent length */
> > +	__uint64_t	rm_owner;	/* extent owner */
> > +	__uint64_t	rm_offset;	/* offset within the owner */
> > +	unsigned int	rm_flags;	/* state flags */
> > +};
> 
> Same as my last comment about xfs_format.h. Up to here is all good -
> they are format definitions. But these:
> 
> > +
> > +static inline __u64
> > +xfs_rmap_irec_offset_pack(
> > +	const struct xfs_rmap_irec	*irec)
> > +{
> > +	__u64			x;
> > +
> > +	x = XFS_RMAP_OFF(irec->rm_offset);
> > +	if (irec->rm_flags & XFS_RMAP_ATTR_FORK)
> > +		x |= XFS_RMAP_OFF_ATTR_FORK;
> > +	if (irec->rm_flags & XFS_RMAP_BMBT_BLOCK)
> > +		x |= XFS_RMAP_OFF_BMBT_BLOCK;
> > +	if (irec->rm_flags & XFS_RMAP_UNWRITTEN)
> > +		x |= XFS_RMAP_OFF_UNWRITTEN;
> > +	return x;
> > +}
> > +
> > +static inline int
> > +xfs_rmap_irec_offset_unpack(
> > +	__u64			offset,
> > +	struct xfs_rmap_irec	*irec)
> > +{
> > +	if (offset & ~(XFS_RMAP_OFF_MASK | XFS_RMAP_OFF_FLAGS))
> > +		return -EFSCORRUPTED;
> > +	irec->rm_offset = XFS_RMAP_OFF(offset);
> > +	if (offset & XFS_RMAP_OFF_ATTR_FORK)
> > +		irec->rm_flags |= XFS_RMAP_ATTR_FORK;
> > +	if (offset & XFS_RMAP_OFF_BMBT_BLOCK)
> > +		irec->rm_flags |= XFS_RMAP_BMBT_BLOCK;
> > +	if (offset & XFS_RMAP_OFF_UNWRITTEN)
> > +		irec->rm_flags |= XFS_RMAP_UNWRITTEN;
> > +	return 0;
> > +}
> 
> And these:
> 
> > +static inline void
> > +xfs_owner_info_unpack(
> > +	struct xfs_owner_info	*oinfo,
> > +	uint64_t		*owner,
> > +	uint64_t		*offset,
> > +	unsigned int		*flags)
> > +{
> > +	unsigned int		r = 0;
> > +
> > +	*owner = oinfo->oi_owner;
> > +	*offset = oinfo->oi_offset;
> > +	if (oinfo->oi_flags & XFS_OWNER_INFO_ATTR_FORK)
> > +		r |= XFS_RMAP_ATTR_FORK;
> > +	if (oinfo->oi_flags & XFS_OWNER_INFO_BMBT_BLOCK)
> > +		r |= XFS_RMAP_BMBT_BLOCK;
> > +	*flags = r;
> > +}
> > +
> > +static inline void
> > +xfs_owner_info_pack(
> > +	struct xfs_owner_info	*oinfo,
> > +	uint64_t		owner,
> > +	uint64_t		offset,
> > +	unsigned int		flags)
> > +{
> > +	oinfo->oi_owner = owner;
> > +	oinfo->oi_offset = XFS_RMAP_OFF(offset);
> > +	oinfo->oi_flags = 0;
> > +	if (flags & XFS_RMAP_ATTR_FORK)
> > +		oinfo->oi_flags |= XFS_OWNER_INFO_ATTR_FORK;
> > +	if (flags & XFS_RMAP_BMBT_BLOCK)
> > +		oinfo->oi_flags |= XFS_OWNER_INFO_BMBT_BLOCK;
> > +}
> > +
> 
> really belong in xfs_rmap.h or xfs_rmap_btree.h.

Yep.  I think these'll end up in xfs_rmap_btree.h.
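
For reference, the rm_offset encoding being discussed round-trips like
this; a standalone sketch mirroring the quoted pack/unpack helpers
(constants copied from the patch, but the short function names and the
-1 error return are illustrative, not the kernel API):

```c
#include <assert.h>
#include <stdint.h>

/* Bit layout of rm_offset, copied from the quoted patch. */
#define XFS_RMAP_OFF_ATTR_FORK   ((uint64_t)1ULL << 63)
#define XFS_RMAP_OFF_BMBT_BLOCK  ((uint64_t)1ULL << 62)
#define XFS_RMAP_OFF_UNWRITTEN   ((uint64_t)1ULL << 61)
#define XFS_RMAP_OFF_MASK        ((uint64_t)0x3FFFFFFFFFFFFFULL)
#define XFS_RMAP_OFF(off)        ((off) & XFS_RMAP_OFF_MASK)

/* In-core flags, also from the patch. */
#define XFS_RMAP_ATTR_FORK   (1 << 0)
#define XFS_RMAP_BMBT_BLOCK  (1 << 1)
#define XFS_RMAP_UNWRITTEN   (1 << 2)

/* Pack in-core (flags, offset) into the on-disk rm_offset word. */
static uint64_t pack(unsigned int flags, uint64_t offset)
{
	uint64_t x = XFS_RMAP_OFF(offset);

	if (flags & XFS_RMAP_ATTR_FORK)
		x |= XFS_RMAP_OFF_ATTR_FORK;
	if (flags & XFS_RMAP_BMBT_BLOCK)
		x |= XFS_RMAP_OFF_BMBT_BLOCK;
	if (flags & XFS_RMAP_UNWRITTEN)
		x |= XFS_RMAP_OFF_UNWRITTEN;
	return x;
}

/* Unpack; returns -1 if a reserved bit (54-60) is set. */
static int unpack(uint64_t off, unsigned int *flags, uint64_t *offset)
{
	if (off & ~(XFS_RMAP_OFF_MASK | XFS_RMAP_OFF_ATTR_FORK |
		    XFS_RMAP_OFF_BMBT_BLOCK | XFS_RMAP_OFF_UNWRITTEN))
		return -1;
	*offset = XFS_RMAP_OFF(off);
	*flags = 0;
	if (off & XFS_RMAP_OFF_ATTR_FORK)
		*flags |= XFS_RMAP_ATTR_FORK;
	if (off & XFS_RMAP_OFF_BMBT_BLOCK)
		*flags |= XFS_RMAP_BMBT_BLOCK;
	if (off & XFS_RMAP_OFF_UNWRITTEN)
		*flags |= XFS_RMAP_UNWRITTEN;
	return 0;
}
```

Bits 54-60 are the reserved hole that the unpack side rejects, which
is what turns a corrupt on-disk offset into -EFSCORRUPTED in the real
helper.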

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

* Re: [PATCH 013/119] xfs: support btrees with overlapping intervals for keys
  2016-07-06  4:59   ` Dave Chinner
@ 2016-07-06  8:09     ` Darrick J. Wong
  0 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-07-06  8:09 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, vishal.l.verma, xfs

On Wed, Jul 06, 2016 at 02:59:41PM +1000, Dave Chinner wrote:
> On Thu, Jun 16, 2016 at 06:19:15PM -0700, Darrick J. Wong wrote:
> > On a filesystem with both reflink and reverse mapping enabled, it's
> > possible to have multiple rmap records referring to the same blocks on
> > disk.  When overlapping intervals are possible, querying a classic
> > btree to find all records intersecting a given interval is inefficient
> > because we cannot use the left side of the search interval to filter
> > out non-matching records the same way that we can use the existing
> > btree key to filter out records coming after the right side of the
> > search interval.  This will become important once we want to use the
> > rmap btree to rebuild BMBTs, or implement the (future) fsmap ioctl.
> 
> I thought I didn't have much to say about this, but then I started
> writing down all my questions.....

I'd have been surprised if you didn't have much to say--

*I* certainly had plenty to say about this code when I dug back into it
last week to make XFS_BTREE_ROOT_IN_INODE work for level == 0 roots.
Most of it was unprintable. :P
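
The search problem described in the commit message quoted above boils
down to two predicates, sketched here with illustrative types and
names (not the actual XFS helpers): with overlapping intervals, a
record starting well to the left of the query window can still
intersect it, so only a per-subtree *high* key lets a lookup prune
subtrees.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical simplified interval record, sorted by start block. */
struct irec { uint32_t start; uint32_t len; };

/* Does record r intersect the query window [qlo, qhi]? */
static int intersects(const struct irec *r, uint32_t qlo, uint32_t qhi)
{
	return r->start <= qhi && r->start + r->len - 1 >= qlo;
}

/*
 * Subtree pruning test.  The low key alone cannot rule a subtree out
 * on the left side of the window, because a record starting before
 * qlo may extend into the window.  The high key (max endpoint of
 * everything reachable below) can: if it is below qlo, nothing in
 * that subtree intersects [qlo, qhi].
 */
static int subtree_may_intersect(uint32_t low_key, uint32_t high_key,
				 uint32_t qlo, uint32_t qhi)
{
	return low_key <= qhi && high_key >= qlo;
}
```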

> > @@ -445,6 +474,17 @@ static inline size_t xfs_btree_block_len(struct xfs_btree_cur *cur)
> >  	return XFS_BTREE_SBLOCK_LEN;
> >  }
> >  
> > +/* Return size of btree block keys for this btree instance. */
> > +static inline size_t xfs_btree_key_len(struct xfs_btree_cur *cur)
> > +{
> > +	size_t			len;
> > +
> > +	len = cur->bc_ops->key_len;
> > +	if (cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING)
> > +		len *= 2;
> > +	return len;
> > +}
> 
> So there's magic here. Why can't the cur->bc_ops->key_len be set
> appropriately when it is initialised?
> 
> >  /*
> >   * Return size of btree block pointers for this btree instance.
> >   */
> > @@ -475,7 +515,19 @@ xfs_btree_key_offset(
> >  	int			n)
> >  {
> >  	return xfs_btree_block_len(cur) +
> > -		(n - 1) * cur->bc_ops->key_len;
> > +		(n - 1) * xfs_btree_key_len(cur);
> > +}
> 
> because this effectively means the key length and offsets for
> a btree with the XFS_BTREE_OPS_OVERLAPPING flag set is *always*
> cur->bc_ops->key_len * 2.

I designed the code around the idea that in going from a regular btree
to an overlapped btree, the key_len stays the same but the number of
keys doubles.  I can change everything such that key_len doubles but
the number of keys stays the same.  For the few places where we
actually update the low and high keys separately (basically updkeys)
we'll have to be a little careful with key_len.

> > +
> > +/*
> > + * Calculate offset of the n-th high key in a btree block.
> > + */
> > +STATIC size_t
> > +xfs_btree_high_key_offset(
> > +	struct xfs_btree_cur	*cur,
> > +	int			n)
> > +{
> > +	return xfs_btree_block_len(cur) +
> > +		(n - 1) * xfs_btree_key_len(cur) + cur->bc_ops->key_len;
> >  }
> 
> And this is the only case where we use a "half key" length to pull
> the offset of the high key. Wouldn't it be better to be explicit
> about the high key offset rather than encode magic numbers to infer
> that the "overlapping key is really two key lengths with the high
> key at plus one key len". IMO, this is better:
> 
> xfs_btree_high_key_offset(
> 	struct xfs_btree_cur	*cur,
> 	int			n)
> {
> 	ASSERT(cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING);
> 	return xfs_btree_block_len(cur) +
> 		 (n - 1) * cur->bc_ops->key_len +
> 		 offsetof(struct xfs_btree_double_key, high);
> }
> 
> It means there are much fewer code changes needed for supporting
> the XFS_BTREE_OPS_OVERLAPPING flag, too.

Sure.
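
Concretely, the offsetof() form reads like this; a minimal standalone
sketch where BLOCK_HDR_LEN, KEY_LEN and the key struct are stand-ins
for the real xfs_btree definitions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative "double key": a low and a high key per slot. */
struct double_key {
	uint64_t low;
	uint64_t high;
};

#define BLOCK_HDR_LEN	56	/* stand-in for xfs_btree_block_len() */

/* Offset of the n-th key slot in an overlapping-interval block. */
static size_t key_offset(int n)
{
	return BLOCK_HDR_LEN + (n - 1) * sizeof(struct double_key);
}

/*
 * Offset of the n-th high key: same slot, plus the position of the
 * high member -- no "key_len * 2 plus one key_len" magic needed.
 */
static size_t high_key_offset(int n)
{
	return key_offset(n) + offsetof(struct double_key, high);
}
```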

> > +STATIC void
> > +xfs_btree_find_leaf_keys(
> > +	struct xfs_btree_cur	*cur,
> > +	struct xfs_btree_block	*block,
> > +	union xfs_btree_key	*low,
> > +	union xfs_btree_key	*high)
> > +{
> > +	int			n;
> > +	union xfs_btree_rec	*rec;
> > +	union xfs_btree_key	max_hkey;
> > +	union xfs_btree_key	hkey;
> > +
> > +	rec = xfs_btree_rec_addr(cur, 1, block);
> > +	cur->bc_ops->init_key_from_rec(low, rec);
> > +
> > +	if (!(cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING))
> > +		return;
> 
> When I see conditionals like this, it makes me want to add
> a btree specific method. i.e.
> 
> 	bc_ops->find_leaf_keys()
> 	bc_ops->find_node_keys()
> 
> and we hook them up to generic functions that don't require
> checks against feature flags.
> 
> i.e:
> 
> xfs_btree_find_leaf_low_key()
> {
> 	rec = xfs_btree_rec_addr(cur, 1, block);
> 	cur->bc_ops->init_key_from_rec(low, rec);
> }
> 
> xfs_btree_find_leaf_low_high_keys()
> {
> 	xfs_btree_find_leaf_low_key();
> 
> 	/*
> 	 * high key finding code here, which is the same function
> 	 * for both keys and pointers
> 	 */
> }

The thing is, there's nothing in xfs_btree_find_*_keys that's specific
to a btree.  I rather like only having to set one thing in the
btree_ops to get the overlapped mode, rather than having to remember
to make sure that such-and-such-functions are paired with
such-and-such flags.

Well, maybe it wouldn't be so bad.  I think there's only three
functions that need this treatment.

> .....
> 
> > +/*
> > + * Update parental low & high keys from some block all the way back to the
> > + * root of the btree.
> > + */
> > +STATIC int
> > +__xfs_btree_updkeys(
> 
> I kept getting confused by xfs_btree_updkey() and
> xfs_btree_updkeys(). Can we chose a better name for this parent key
> update?

I /think/ I want to collapse them into a single ->updkeys() function.

And maybe rename it to update_parent_keys()?

> > +	struct xfs_btree_cur	*cur,
> > +	int			level,
> > +	struct xfs_btree_block	*block,
> > +	struct xfs_buf		*bp0,
> > +	bool			force_all)
> > +{
> > +	union xfs_btree_key	lkey;	/* keys from current level */
> > +	union xfs_btree_key	hkey;
> > +	union xfs_btree_key	*nlkey;	/* keys from the next level up */
> > +	union xfs_btree_key	*nhkey;
> > +	struct xfs_buf		*bp;
> > +	int			ptr = -1;
> > +
> > +	if (!(cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING))
> > +		return 0;
> 
> And, again, it's a probably better to use a btree op callout for
> this, especially when you've added this to xfs_btree_updkey():
> 
> > @@ -1893,6 +2132,9 @@ xfs_btree_updkey(
> >  	union xfs_btree_key	*kp;
> >  	int			ptr;
> >  
> > +	if (cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING)
> > +		return 0;
> > +
> >  	XFS_BTREE_TRACE_CURSOR(cur, XBT_ENTRY);
> >  	XFS_BTREE_TRACE_ARGIK(cur, level, keyp);
> 
> i.e. one or the other "updkey" does something, but not both.
> Extremely confusing to see both called but then only one do
> anything.
> 
> [back to __xfs_btree_updkeys()]
> 
> > +
> > +	if (level + 1 >= cur->bc_nlevels)
> > +		return 0;
> > +
> > +	trace_xfs_btree_updkeys(cur, level, bp0);
> > +
> > +	if (level == 0)
> > +		xfs_btree_find_leaf_keys(cur, block, &lkey, &hkey);
> > +	else
> > +		xfs_btree_find_node_keys(cur, block, &lkey, &hkey);
> 
> And this code fragment is repeated in many places, so I think a
> helper is warranted for this. That also reminds me - the "find" in
> the name is confusing - it's not "finding" as much as it is
> "getting" the low and high key values from the current block.
> 
> It's especially confusing when you do this:
> 
> > @@ -1970,7 +2212,8 @@ xfs_btree_update(
> >  					    ptr, LASTREC_UPDATE);
> >  	}
> >  
> > -	/* Updating first rec in leaf. Pass new key value up to our parent. */
> > +	/* Pass new key value up to our parent. */
> > +	xfs_btree_updkeys(cur, 0);
> >  	if (ptr == 1) {
> >  		union xfs_btree_key	key;
> 
> You're throwing away the error from xfs_btree_updkeys() at, AFAICT,
> all call sites. This update can fail, so I suspect this needs to
> check and handle the update.

Yes, that's a bug, albeit a theoretical one since updkeys can't fail
at the moment.

(I fixed this one already in my djwong-wtf tree.)

> > @@ -2149,7 +2392,9 @@ xfs_btree_lshift(
> >  		rkp = &key;
> >  	}
> >  
> > -	/* Update the parent key values of right. */
> > +	/* Update the parent key values of left and right. */
> > +	xfs_btree_sibling_updkeys(cur, level, XFS_BB_LEFTSIB, left, lbp);
> > +	xfs_btree_updkeys(cur, level);
> >  	error = xfs_btree_updkey(cur, rkp, level + 1);
> >  	if (error)
> >  		goto error0;
> 
> Remember what I said above about xfs_btree_updkeys/xfs_btree_updkey
> being confusing? Here we have 3 different key update functions, all
> doing different stuff, taking different parameters. None of the code
> is consistent in how these updates are done - they are all different
> combinations of these functions, so I'm not sure how we are supposed
> to verify the correct updates are being done now or in the future.
> 
> How can we hide this complexity from the generic btree code?

I refactored this mess after bfoster complained, but even after that
there's still a conditional.  We need to updkeys the right block
regardless, but we only need to updkeys the left block if it's an
overlapped tree, which leads to this:

/*
 * Using a temporary cursor, update the parent key values of
 * the block on the left.
 */
error = xfs_btree_dup_cursor(cur, &tcur);
if (error)
	goto error0;
i = xfs_btree_firstrec(tcur, level);
XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, i == 1, error0);

error = xfs_btree_decrement(tcur, level, &i);
if (error)
	goto error1;

/* Update left and right parent pointers */
error = xfs_btree_updkeys(cur, level);
if (error)
	goto error1;
error = xfs_btree_updkeys(tcur, level);
if (error)
	goto error1;
error = xfs_btree_updkey(cur, rkp, level + 1);
if (error)
	goto error0;

xfs_btree_del_cursor(tcur, XFS_BTREE_NOERROR);

Still yucky, will have to meditate on this further...

> > @@ -2321,6 +2566,9 @@ xfs_btree_rshift(
> >  	if (error)
> >  		goto error1;
> >  
> > +	/* Update left and right parent pointers */
> > +	xfs_btree_updkeys(cur, level);
> > +	xfs_btree_updkeys(tcur, level);
> >  	error = xfs_btree_updkey(tcur, rkp, level + 1);
> >  	if (error)
> >  		goto error1;
> 
> Different.
> 
> > @@ -2499,6 +2746,10 @@ __xfs_btree_split(
> >  		xfs_btree_set_sibling(cur, rrblock, &rptr, XFS_BB_LEFTSIB);
> >  		xfs_btree_log_block(cur, rrbp, XFS_BB_LEFTSIB);
> >  	}
> > +
> > +	/* Update the left block's keys... */
> > +	xfs_btree_updkeys(cur, level);
> 
> different...
> 
> > @@ -2806,27 +3057,27 @@ xfs_btree_new_root(
> >  		bp = lbp;
> >  		nptr = 2;
> >  	}
> > +
> >  	/* Fill in the new block's btree header and log it. */
> >  	xfs_btree_init_block_cur(cur, nbp, cur->bc_nlevels, 2);
> >  	xfs_btree_log_block(cur, nbp, XFS_BB_ALL_BITS);
> >  	ASSERT(!xfs_btree_ptr_is_null(cur, &lptr) &&
> >  			!xfs_btree_ptr_is_null(cur, &rptr));
> > -
> >  	/* Fill in the key data in the new root. */
> >  	if (xfs_btree_get_level(left) > 0) {
> > -		xfs_btree_copy_keys(cur,
> > +		xfs_btree_find_node_keys(cur, left,
> >  				xfs_btree_key_addr(cur, 1, new),
> > -				xfs_btree_key_addr(cur, 1, left), 1);
> > -		xfs_btree_copy_keys(cur,
> > +				xfs_btree_high_key_addr(cur, 1, new));
> > +		xfs_btree_find_node_keys(cur, right,
> >  				xfs_btree_key_addr(cur, 2, new),
> > -				xfs_btree_key_addr(cur, 1, right), 1);
> > +				xfs_btree_high_key_addr(cur, 2, new));
> 
> And this took me ages to work out - you replaced
> xfs_btree_copy_keys() with xfs_btree_find_node_keys() which means
> the fact that we are copying a key from one block to another has
> been lost.

That's because we're not strictly copying keys from left and right
into the root anymore.  Yes, the low part of the key is a straight
copy, but we have to iterate left and right, respectively, to
calculate the high keys that go in keys 1 & 2 in the root block.
The high key of a given tree node is the maximum of all the keys or
records in that node, or put another way, it's the highest key
reachable in that subtree...
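
That "maximum of everything reachable" rule is small enough to sketch;
the types and names below are hypothetical, and the real helpers walk
xfs_btree_rec_addr()/xfs_btree_key_addr() slots rather than a plain
array:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical simplified leaf record: an interval. */
struct rec { uint32_t start; uint32_t len; };

/*
 * Compute a leaf block's (low, high) keys.  The low key is just the
 * first record's key (records are sorted by start), but the high key
 * must scan every record: with overlapping intervals, a later record
 * can end *before* an earlier one does.
 */
static void find_block_keys(const struct rec *recs, int nrecs,
			    uint32_t *low, uint32_t *high)
{
	uint32_t max_end = 0;
	int i;

	*low = recs[0].start;
	for (i = 0; i < nrecs; i++) {
		uint32_t end = recs[i].start + recs[i].len - 1;

		if (end > max_end)
			max_end = end;
	}
	*high = max_end;
}
```

Note how the first record dominates the high key here even though two
later records come after it in sort order.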

> It wasn't until I realised that
> xfs_btree_find_node_keys() was writing directly into the new block
> record that it was an equivalent operation to a copy.
> 
> This is why I don't like the name xfs_btree_find_*_keys() - when it
> is used like this it badly obfuscates what operation is being
> performed - it's most definitely not a find operation being
> performed. i.e. xfs_btree_copy_keys() documents the operation in
> an obvious and straightforward manner, the new code takes time and
> thought to decipher.
> 
> Perhaps you could move it all to inside xfs_btree_copy_keys(), so
> the complexity is hidden from the higher-level btree manipulation
> functions...

...so it's not a strict copy.

> > +/* Copy a double key into a btree block. */
> > +static void
> > +xfs_btree_copy_double_keys(
> > +	struct xfs_btree_cur	*cur,
> > +	int			ptr,
> > +	struct xfs_btree_block	*block,
> > +	struct xfs_btree_double_key	*key)
> > +{
> > +	memcpy(xfs_btree_key_addr(cur, ptr, block), &key->low,
> > +			cur->bc_ops->key_len);
> > +
> > +	if (cur->bc_ops->flags & XFS_BTREE_OPS_OVERLAPPING)
> > +		memcpy(xfs_btree_high_key_addr(cur, ptr, block), &key->high,
> > +				cur->bc_ops->key_len);
> > +}
> 
> This should be located next to xfs_btree_copy_keys().
> 
> >  	/* If we inserted at the start of a block, update the parents' keys. */

BTW, I replaced the above comment with:

/*
 * If we just inserted into a new tree block, we have to
 * recalculate nkey here because nkey is out of date.
 *
 * Otherwise we're just updating an existing block (having
 * shoved some records into the new tree block), so use the
 * regular key update mechanism.
 */

> > +	if (ncur && bp->b_bn != old_bn) {
> > +		/*
> > +		 * We just inserted into a new tree block, which means that
> > +		 * the key for the block is in nkey, not the tree.
> > +		 */
> > +		if (level == 0)
> > +			xfs_btree_find_leaf_keys(cur, block, &nkey.low,
> > +					&nkey.high);
> > +		else
> > +			xfs_btree_find_node_keys(cur, block, &nkey.low,
> > +					&nkey.high);
> > +	} else {
> > +		/* Updating the left block, do it the standard way. */
> > +		xfs_btree_updkeys(cur, level);
> > +	}
> > +
> >  	if (optr == 1) {
> > -		error = xfs_btree_updkey(cur, key, level + 1);
> > +		error = xfs_btree_updkey(cur, &key->low, level + 1);
> >  		if (error)
> >  			goto error0;
> >  	}
> 
> This is another of those "huh, what" moments I had with all the
> different _updkey functions....

Ditto.  It took me a long time to figure out what the original code
was doing here, and therefore what was the correct thing to do for the
overlapped btree.

> > diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> > index b99c018..a5ec6c7 100644
> > --- a/fs/xfs/libxfs/xfs_btree.h
> > +++ b/fs/xfs/libxfs/xfs_btree.h
> > @@ -126,6 +126,9 @@ struct xfs_btree_ops {
> >  	size_t	key_len;
> >  	size_t	rec_len;
> >  
> > +	/* flags */
> > +	uint	flags;
> > +
> .....
> > @@ -182,6 +195,9 @@ struct xfs_btree_ops {
> >  #endif
> >  };
> >  
> > +/* btree ops flags */
> > +#define XFS_BTREE_OPS_OVERLAPPING	(1<<0)	/* overlapping intervals */
> > +
> 
> why did you put this in the struct btree_ops and not in the
> btree cursor ->bc_flags field like all the other btree specific
> customisations like:
> 
> /* cursor flags */
> #define XFS_BTREE_LONG_PTRS             (1<<0)  /* pointers are 64bits long */
> #define XFS_BTREE_ROOT_IN_INODE         (1<<1)  /* root may be variable size */
> #define XFS_BTREE_LASTREC_UPDATE        (1<<2)  /* track last rec externally */
> #define XFS_BTREE_CRC_BLOCKS            (1<<3)  /* uses extended btree blocks */
> 
> i.e. we should have all the structural/behavioural flags in the one
> place, not split across different structures....

At the time I thought that it would be a good idea in the long run to
move the btree flags that can't be changed without changes to the
btree_ops into a btree_ops specific flags field.  At the time I didn't
know that I'd end up adding only one flag or that the only btree ops
change I'd need was init_high_key_from_rec, so when I took a second
look last week I put eliminating the flags field on the todo list.

Ok, enough for one night. :)

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

* Re: [PATCH 026/119] xfs: add owner field to extent allocation and freeing
  2016-06-17  1:20 ` [PATCH 026/119] xfs: add owner field to extent allocation and freeing Darrick J. Wong
  2016-07-06  4:01   ` Dave Chinner
@ 2016-07-07 15:12   ` Brian Foster
  2016-07-07 19:09     ` Darrick J. Wong
  1 sibling, 1 reply; 236+ messages in thread
From: Brian Foster @ 2016-07-07 15:12 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, vishal.l.verma, Dave Chinner, xfs

On Thu, Jun 16, 2016 at 06:20:39PM -0700, Darrick J. Wong wrote:
> For the rmap btree to work, we have to feed the extent owner
> information to the allocation and freeing functions. This
> information is what will end up in the rmap btree that tracks
> allocated extents. While we technically don't need the owner
> information when freeing extents, passing it allows us to validate
> that the extent we are removing from the rmap btree actually
> belonged to the owner we expected it to belong to.
> 
> We also define a special set of owner values for internal metadata
> that would otherwise have no owner. This allows us to tell the
> difference between metadata owned by different per-ag btrees, as
> well as static fs metadata (e.g. AG headers) and internal journal
> blocks.
> 
> There are also a couple of special cases we need to take care of -
> during EFI recovery, we don't actually know who the original owner
> was, so we need to pass a wildcard to indicate that we aren't
> checking the owner for validity. We also need special handling in
> growfs, as we "free" the space in the last AG when extending it, but
> because it's new space it has no actual owner...
> 
> While touching the xfs_bmap_add_free() function, re-order the
> parameters to put the struct xfs_mount first.
> 
> Extend the owner field to include both the owner type and some sort
> of index within the owner.  The index field will be used to support
> reverse mappings when reflink is enabled.
> 
> This is based upon a patch originally from Dave Chinner. It has been
> extended to add more owner information with the intent of helping
> recovery operations when things go wrong (e.g. offset of user data
> block in a file).
> 
> v2: When we're freeing extents from an EFI, we don't have the owner
> information available (rmap updates have their own redo items).
> xfs_free_extent therefore doesn't need to do an rmap update, but the
> log replay code doesn't signal this correctly.  Fix it so that it
> does.
> 
> [dchinner: de-shout the xfs_rmap_*_owner helpers]
> [darrick: minor style fixes suggested by Christoph Hellwig]
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> Reviewed-by: Dave Chinner <dchinner@redhat.com>
> Signed-off-by: Dave Chinner <david@fromorbit.com>
> ---
>  fs/xfs/libxfs/xfs_alloc.c        |   11 +++++-
>  fs/xfs/libxfs/xfs_alloc.h        |    4 ++
>  fs/xfs/libxfs/xfs_bmap.c         |   17 ++++++++--
>  fs/xfs/libxfs/xfs_bmap.h         |    4 ++
>  fs/xfs/libxfs/xfs_bmap_btree.c   |    6 +++-
>  fs/xfs/libxfs/xfs_format.h       |   65 ++++++++++++++++++++++++++++++++++++++
>  fs/xfs/libxfs/xfs_ialloc.c       |    7 +++-
>  fs/xfs/libxfs/xfs_ialloc_btree.c |    7 ++++
>  fs/xfs/xfs_defer_item.c          |    3 +-
>  fs/xfs/xfs_fsops.c               |   16 +++++++--
>  fs/xfs/xfs_log_recover.c         |    5 ++-
>  fs/xfs/xfs_trans.h               |    2 +
>  fs/xfs/xfs_trans_extfree.c       |    5 ++-
>  13 files changed, 131 insertions(+), 21 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index fb00042..eed26f9 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -1596,6 +1596,7 @@ xfs_free_ag_extent(
>  	xfs_agnumber_t	agno,	/* allocation group number */
>  	xfs_agblock_t	bno,	/* starting block number */
>  	xfs_extlen_t	len,	/* length of extent */
> +	struct xfs_owner_info	*oinfo,	/* extent owner */

Alignment, here and a couple other places in the patch.

>  	int		isfl)	/* set if is freelist blocks - no sb acctg */
>  {
>  	xfs_btree_cur_t	*bno_cur;	/* cursor for by-block btree */
> @@ -2005,13 +2006,15 @@ xfs_alloc_fix_freelist(
>  	 * back on the free list? Maybe we should only do this when space is
>  	 * getting low or the AGFL is more than half full?
>  	 */
> +	xfs_rmap_ag_owner(&targs.oinfo, XFS_RMAP_OWN_AG);
>  	while (pag->pagf_flcount > need) {
>  		struct xfs_buf	*bp;
>  
>  		error = xfs_alloc_get_freelist(tp, agbp, &bno, 0);
>  		if (error)
>  			goto out_agbp_relse;
> -		error = xfs_free_ag_extent(tp, agbp, args->agno, bno, 1, 1);
> +		error = xfs_free_ag_extent(tp, agbp, args->agno, bno, 1,
> +					   &targs.oinfo, 1);
>  		if (error)
>  			goto out_agbp_relse;
>  		bp = xfs_btree_get_bufs(mp, tp, args->agno, bno, 0);
> @@ -2021,6 +2024,7 @@ xfs_alloc_fix_freelist(
>  	memset(&targs, 0, sizeof(targs));
>  	targs.tp = tp;
>  	targs.mp = mp;
> +	xfs_rmap_ag_owner(&targs.oinfo, XFS_RMAP_OWN_AG);
>  	targs.agbp = agbp;
>  	targs.agno = args->agno;
>  	targs.alignment = targs.minlen = targs.prod = targs.isfl = 1;
> @@ -2711,7 +2715,8 @@ int				/* error */
>  xfs_free_extent(
>  	struct xfs_trans	*tp,	/* transaction pointer */
>  	xfs_fsblock_t		bno,	/* starting block number of extent */
> -	xfs_extlen_t		len)	/* length of extent */
> +	xfs_extlen_t		len,	/* length of extent */
> +	struct xfs_owner_info	*oinfo)	/* extent owner */
>  {
>  	struct xfs_mount	*mp = tp->t_mountp;
>  	struct xfs_buf		*agbp;
> @@ -2739,7 +2744,7 @@ xfs_free_extent(
>  			agbno + len <= be32_to_cpu(XFS_BUF_TO_AGF(agbp)->agf_length),
>  			err);
>  
> -	error = xfs_free_ag_extent(tp, agbp, agno, agbno, len, 0);
> +	error = xfs_free_ag_extent(tp, agbp, agno, agbno, len, oinfo, 0);
>  	if (error)
>  		goto err;
>  
> diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
> index 20b54aa..0721a48 100644
> --- a/fs/xfs/libxfs/xfs_alloc.h
> +++ b/fs/xfs/libxfs/xfs_alloc.h
> @@ -123,6 +123,7 @@ typedef struct xfs_alloc_arg {
>  	char		isfl;		/* set if is freelist blocks - !acctg */
>  	char		userdata;	/* mask defining userdata treatment */
>  	xfs_fsblock_t	firstblock;	/* io first block allocated */
> +	struct xfs_owner_info	oinfo;	/* owner of blocks being allocated */
>  } xfs_alloc_arg_t;
>  
>  /*
> @@ -210,7 +211,8 @@ int				/* error */
>  xfs_free_extent(
>  	struct xfs_trans *tp,	/* transaction pointer */
>  	xfs_fsblock_t	bno,	/* starting block number of extent */
> -	xfs_extlen_t	len);	/* length of extent */
> +	xfs_extlen_t	len,	/* length of extent */
> +	struct xfs_owner_info	*oinfo);	/* extent owner */
>  
>  int				/* error */
>  xfs_alloc_lookup_ge(
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 3a6d3e3..2c28f2a 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -574,7 +574,8 @@ xfs_bmap_add_free(
>  	struct xfs_mount	*mp,		/* mount point structure */
>  	struct xfs_defer_ops	*dfops,		/* list of extents */
>  	xfs_fsblock_t		bno,		/* fs block number of extent */
> -	xfs_filblks_t		len)		/* length of extent */
> +	xfs_filblks_t		len,		/* length of extent */
> +	struct xfs_owner_info	*oinfo)		/* extent owner */
>  {
>  	struct xfs_bmap_free_item	*new;		/* new element */
>  #ifdef DEBUG
> @@ -593,9 +594,14 @@ xfs_bmap_add_free(
>  	ASSERT(agbno + len <= mp->m_sb.sb_agblocks);
>  #endif
>  	ASSERT(xfs_bmap_free_item_zone != NULL);
> +
>  	new = kmem_zone_alloc(xfs_bmap_free_item_zone, KM_SLEEP);
>  	new->xbfi_startblock = bno;
>  	new->xbfi_blockcount = (xfs_extlen_t)len;
> +	if (oinfo)
> +		memcpy(&new->xbfi_oinfo, oinfo, sizeof(struct xfs_owner_info));
> +	else
> +		memset(&new->xbfi_oinfo, 0, sizeof(struct xfs_owner_info));

How about just using KM_ZERO on the allocation and doing something like
'if (oinfo) new->xbfi_oinfo = *oinfo'?

BTW, what's the use case for a zeroed out oinfo if we explicitly define
null/unknown owner types?

>  	trace_xfs_bmap_free_defer(mp, XFS_FSB_TO_AGNO(mp, bno), 0,
>  			XFS_FSB_TO_AGBNO(mp, bno), len);
>  	xfs_defer_add(dfops, XFS_DEFER_OPS_TYPE_FREE, &new->xbfi_list);
> @@ -628,6 +634,7 @@ xfs_bmap_btree_to_extents(
>  	xfs_mount_t		*mp;	/* mount point structure */
>  	__be64			*pp;	/* ptr to block address */
>  	struct xfs_btree_block	*rblock;/* root btree block */
> +	struct xfs_owner_info	oinfo;
>  
>  	mp = ip->i_mount;
>  	ifp = XFS_IFORK_PTR(ip, whichfork);
> @@ -651,7 +658,8 @@ xfs_bmap_btree_to_extents(
>  	cblock = XFS_BUF_TO_BLOCK(cbp);
>  	if ((error = xfs_btree_check_block(cur, cblock, 0, cbp)))
>  		return error;
> -	xfs_bmap_add_free(mp, cur->bc_private.b.dfops, cbno, 1);
> +	xfs_rmap_ino_bmbt_owner(&oinfo, ip->i_ino, whichfork);
> +	xfs_bmap_add_free(mp, cur->bc_private.b.dfops, cbno, 1, &oinfo);
>  	ip->i_d.di_nblocks--;
>  	xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT, -1L);
>  	xfs_trans_binval(tp, cbp);
> @@ -732,6 +740,7 @@ xfs_bmap_extents_to_btree(
>  	memset(&args, 0, sizeof(args));
>  	args.tp = tp;
>  	args.mp = mp;
> +	xfs_rmap_ino_bmbt_owner(&args.oinfo, ip->i_ino, whichfork);
>  	args.firstblock = *firstblock;
>  	if (*firstblock == NULLFSBLOCK) {
>  		args.type = XFS_ALLOCTYPE_START_BNO;
> @@ -878,6 +887,7 @@ xfs_bmap_local_to_extents(
>  	memset(&args, 0, sizeof(args));
>  	args.tp = tp;
>  	args.mp = ip->i_mount;
> +	xfs_rmap_ino_owner(&args.oinfo, ip->i_ino, whichfork, 0);
>  	args.firstblock = *firstblock;
>  	/*
>  	 * Allocate a block.  We know we need only one, since the
> @@ -4839,6 +4849,7 @@ xfs_bmap_del_extent(
>  		nblks = 0;
>  		do_fx = 0;
>  	}
> +
>  	/*
>  	 * Set flag value to use in switch statement.
>  	 * Left-contig is 2, right-contig is 1.
> @@ -5026,7 +5037,7 @@ xfs_bmap_del_extent(
>  	 */
>  	if (do_fx)
>  		xfs_bmap_add_free(mp, dfops, del->br_startblock,
> -			del->br_blockcount);
> +				  del->br_blockcount, NULL);

Any reason we don't set the owner here?

>  	/*
>  	 * Adjust inode # blocks in the file.
>  	 */
> diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
> index 8c5f530..862ea464 100644
> --- a/fs/xfs/libxfs/xfs_bmap.h
> +++ b/fs/xfs/libxfs/xfs_bmap.h
> @@ -67,6 +67,7 @@ struct xfs_bmap_free_item
>  	xfs_fsblock_t		xbfi_startblock;/* starting fs block number */
>  	xfs_extlen_t		xbfi_blockcount;/* number of blocks in extent */
>  	struct list_head	xbfi_list;
> +	struct xfs_owner_info	xbfi_oinfo;	/* extent owner */
>  };
>  
>  #define	XFS_BMAP_MAX_NMAP	4
> @@ -165,7 +166,8 @@ void	xfs_bmap_trace_exlist(struct xfs_inode *ip, xfs_extnum_t cnt,
>  int	xfs_bmap_add_attrfork(struct xfs_inode *ip, int size, int rsvd);
>  void	xfs_bmap_local_to_extents_empty(struct xfs_inode *ip, int whichfork);
>  void	xfs_bmap_add_free(struct xfs_mount *mp, struct xfs_defer_ops *dfops,
> -			  xfs_fsblock_t bno, xfs_filblks_t len);
> +			  xfs_fsblock_t bno, xfs_filblks_t len,
> +			  struct xfs_owner_info *oinfo);
>  void	xfs_bmap_compute_maxlevels(struct xfs_mount *mp, int whichfork);
>  int	xfs_bmap_first_unused(struct xfs_trans *tp, struct xfs_inode *ip,
>  		xfs_extlen_t len, xfs_fileoff_t *unused, int whichfork);
> diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
> index 18b5361..3e68f9a 100644
> --- a/fs/xfs/libxfs/xfs_bmap_btree.c
> +++ b/fs/xfs/libxfs/xfs_bmap_btree.c
> @@ -447,6 +447,8 @@ xfs_bmbt_alloc_block(
>  	args.mp = cur->bc_mp;
>  	args.fsbno = cur->bc_private.b.firstblock;
>  	args.firstblock = args.fsbno;
> +	xfs_rmap_ino_bmbt_owner(&args.oinfo, cur->bc_private.b.ip->i_ino,
> +			cur->bc_private.b.whichfork);
>  
>  	if (args.fsbno == NULLFSBLOCK) {
>  		args.fsbno = be64_to_cpu(start->l);
> @@ -526,8 +528,10 @@ xfs_bmbt_free_block(
>  	struct xfs_inode	*ip = cur->bc_private.b.ip;
>  	struct xfs_trans	*tp = cur->bc_tp;
>  	xfs_fsblock_t		fsbno = XFS_DADDR_TO_FSB(mp, XFS_BUF_ADDR(bp));
> +	struct xfs_owner_info	oinfo;
>  
> -	xfs_bmap_add_free(mp, cur->bc_private.b.dfops, fsbno, 1);
> +	xfs_rmap_ino_bmbt_owner(&oinfo, ip->i_ino, cur->bc_private.b.whichfork);
> +	xfs_bmap_add_free(mp, cur->bc_private.b.dfops, fsbno, 1, &oinfo);
>  	ip->i_d.di_nblocks--;
>  
>  	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> index b5b0901..97f354f 100644
> --- a/fs/xfs/libxfs/xfs_format.h
> +++ b/fs/xfs/libxfs/xfs_format.h
> @@ -1318,6 +1318,71 @@ typedef __be32 xfs_inobt_ptr_t;
>   */
>  #define	XFS_RMAP_CRC_MAGIC	0x524d4233	/* 'RMB3' */
>  
> +/*
> + * Ownership info for an extent.  This is used to create reverse-mapping
> + * entries.
> + */
> +#define XFS_OWNER_INFO_ATTR_FORK	(1 << 0)
> +#define XFS_OWNER_INFO_BMBT_BLOCK	(1 << 1)
> +struct xfs_owner_info {
> +	uint64_t		oi_owner;
> +	xfs_fileoff_t		oi_offset;
> +	unsigned int		oi_flags;
> +};
> +
> +static inline void
> +xfs_rmap_ag_owner(
> +	struct xfs_owner_info	*oi,
> +	uint64_t		owner)
> +{
> +	oi->oi_owner = owner;
> +	oi->oi_offset = 0;
> +	oi->oi_flags = 0;
> +}
> +
> +static inline void
> +xfs_rmap_ino_bmbt_owner(
> +	struct xfs_owner_info	*oi,
> +	xfs_ino_t		ino,
> +	int			whichfork)
> +{
> +	oi->oi_owner = ino;
> +	oi->oi_offset = 0;
> +	oi->oi_flags = XFS_OWNER_INFO_BMBT_BLOCK;
> +	if (whichfork == XFS_ATTR_FORK)
> +		oi->oi_flags |= XFS_OWNER_INFO_ATTR_FORK;
> +}
> +
> +static inline void
> +xfs_rmap_ino_owner(
> +	struct xfs_owner_info	*oi,
> +	xfs_ino_t		ino,
> +	int			whichfork,
> +	xfs_fileoff_t		offset)
> +{
> +	oi->oi_owner = ino;
> +	oi->oi_offset = offset;
> +	oi->oi_flags = 0;
> +	if (whichfork == XFS_ATTR_FORK)
> +		oi->oi_flags |= XFS_OWNER_INFO_ATTR_FORK;
> +}
> +
> +/*
> + * Special owner types.
> + *
> + * Seeing as we only support up to 8EB, we have the upper bit of the owner field
> + * to tell us we have a special owner value. We use these for static metadata
> + * allocated at mkfs/growfs time, as well as for freespace management metadata.
> + */
> +#define XFS_RMAP_OWN_NULL	(-1ULL)	/* No owner, for growfs */
> +#define XFS_RMAP_OWN_UNKNOWN	(-2ULL)	/* Unknown owner, for EFI recovery */
> +#define XFS_RMAP_OWN_FS		(-3ULL)	/* static fs metadata */
> +#define XFS_RMAP_OWN_LOG	(-4ULL)	/* static fs metadata */
> +#define XFS_RMAP_OWN_AG		(-5ULL)	/* AG freespace btree blocks */

How about XFS_RMAP_OWN_AGFL? OWN_AG confuses me into thinking it's for
AG headers, but IIUC that is covered by OWN_FS.

> +#define XFS_RMAP_OWN_INOBT	(-6ULL)	/* Inode btree blocks */
> +#define XFS_RMAP_OWN_INODES	(-7ULL)	/* Inode chunk */
> +#define XFS_RMAP_OWN_MIN	(-8ULL) /* guard */
> +
>  #define	XFS_RMAP_BLOCK(mp) \
>  	(xfs_sb_version_hasfinobt(&((mp)->m_sb)) ? \
>  	 XFS_FIBT_BLOCK(mp) + 1 : \
> diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
> index dbc3e35..1982561 100644
> --- a/fs/xfs/libxfs/xfs_ialloc.c
> +++ b/fs/xfs/libxfs/xfs_ialloc.c
> @@ -615,6 +615,7 @@ xfs_ialloc_ag_alloc(
>  	args.tp = tp;
>  	args.mp = tp->t_mountp;
>  	args.fsbno = NULLFSBLOCK;
> +	xfs_rmap_ag_owner(&args.oinfo, XFS_RMAP_OWN_INODES);
>  
>  #ifdef DEBUG
>  	/* randomly do sparse inode allocations */
> @@ -1825,12 +1826,14 @@ xfs_difree_inode_chunk(
>  	int		nextbit;
>  	xfs_agblock_t	agbno;
>  	int		contigblk;
> +	struct xfs_owner_info	oinfo;
>  	DECLARE_BITMAP(holemask, XFS_INOBT_HOLEMASK_BITS);
> +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_INODES);
>  
>  	if (!xfs_inobt_issparse(rec->ir_holemask)) {
>  		/* not sparse, calculate extent info directly */
>  		xfs_bmap_add_free(mp, dfops, XFS_AGB_TO_FSB(mp, agno, sagbno),
> -				  mp->m_ialloc_blks);
> +				  mp->m_ialloc_blks, &oinfo);
>  		return;
>  	}
>  
> @@ -1874,7 +1877,7 @@ xfs_difree_inode_chunk(
>  		ASSERT(agbno % mp->m_sb.sb_spino_align == 0);
>  		ASSERT(contigblk % mp->m_sb.sb_spino_align == 0);
>  		xfs_bmap_add_free(mp, dfops, XFS_AGB_TO_FSB(mp, agno, agbno),
> -				  contigblk);
> +				  contigblk, &oinfo);
>  
>  		/* reset range to current bit and carry on... */
>  		startidx = endidx = nextbit;
> diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
> index 88da2ad..f9ea86b 100644
> --- a/fs/xfs/libxfs/xfs_ialloc_btree.c
> +++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
> @@ -96,6 +96,7 @@ xfs_inobt_alloc_block(
>  	memset(&args, 0, sizeof(args));
>  	args.tp = cur->bc_tp;
>  	args.mp = cur->bc_mp;
> +	xfs_rmap_ag_owner(&args.oinfo, XFS_RMAP_OWN_INOBT);
>  	args.fsbno = XFS_AGB_TO_FSB(args.mp, cur->bc_private.a.agno, sbno);
>  	args.minlen = 1;
>  	args.maxlen = 1;
> @@ -125,8 +126,12 @@ xfs_inobt_free_block(
>  	struct xfs_btree_cur	*cur,
>  	struct xfs_buf		*bp)
>  {
> +	struct xfs_owner_info	oinfo;
> +
> +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_INOBT);
>  	return xfs_free_extent(cur->bc_tp,
> -			XFS_DADDR_TO_FSB(cur->bc_mp, XFS_BUF_ADDR(bp)), 1);
> +			XFS_DADDR_TO_FSB(cur->bc_mp, XFS_BUF_ADDR(bp)), 1,
> +			&oinfo);
>  }
>  
>  STATIC int
> diff --git a/fs/xfs/xfs_defer_item.c b/fs/xfs/xfs_defer_item.c
> index 127a54e..1c2d556 100644
> --- a/fs/xfs/xfs_defer_item.c
> +++ b/fs/xfs/xfs_defer_item.c
> @@ -99,7 +99,8 @@ xfs_bmap_free_finish_item(
>  	free = container_of(item, struct xfs_bmap_free_item, xbfi_list);
>  	error = xfs_trans_free_extent(tp, done_item,
>  			free->xbfi_startblock,
> -			free->xbfi_blockcount);
> +			free->xbfi_blockcount,
> +			&free->xbfi_oinfo);
>  	kmem_free(free);
>  	return error;
>  }
> diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> index 62162d4..d60bb97 100644
> --- a/fs/xfs/xfs_fsops.c
> +++ b/fs/xfs/xfs_fsops.c
> @@ -436,6 +436,8 @@ xfs_growfs_data_private(
>  	 * There are new blocks in the old last a.g.
>  	 */
>  	if (new) {
> +		struct xfs_owner_info	oinfo;
> +
>  		/*
>  		 * Change the agi length.
>  		 */
> @@ -463,14 +465,20 @@ xfs_growfs_data_private(
>  		       be32_to_cpu(agi->agi_length));
>  
>  		xfs_alloc_log_agf(tp, bp, XFS_AGF_LENGTH);
> +
>  		/*
>  		 * Free the new space.
> +		 *
> +		 * XFS_RMAP_OWN_NULL is used here to tell the rmap btree that
> +		 * this doesn't actually exist in the rmap btree.
>  		 */
> -		error = xfs_free_extent(tp, XFS_AGB_TO_FSB(mp, agno,
> -			be32_to_cpu(agf->agf_length) - new), new);
> -		if (error) {
> +		xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_NULL);
> +		error = xfs_free_extent(tp,
> +				XFS_AGB_TO_FSB(mp, agno,
> +					be32_to_cpu(agf->agf_length) - new),
> +				new, &oinfo);
> +		if (error)
>  			goto error0;
> -		}
>  	}
>  
>  	/*
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index 080b54b..0c41bd2 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -4180,6 +4180,7 @@ xlog_recover_process_efi(
>  	int			error = 0;
>  	xfs_extent_t		*extp;
>  	xfs_fsblock_t		startblock_fsb;
> +	struct xfs_owner_info	oinfo;
>  
>  	ASSERT(!test_bit(XFS_EFI_RECOVERED, &efip->efi_flags));
>  
> @@ -4211,10 +4212,12 @@ xlog_recover_process_efi(
>  		return error;
>  	efdp = xfs_trans_get_efd(tp, efip, efip->efi_format.efi_nextents);
>  
> +	oinfo.oi_owner = 0;

Should this be XFS_RMAP_OWN_UNKNOWN rather than a raw zero?

Brian

>  	for (i = 0; i < efip->efi_format.efi_nextents; i++) {
>  		extp = &(efip->efi_format.efi_extents[i]);
>  		error = xfs_trans_free_extent(tp, efdp, extp->ext_start,
> -					      extp->ext_len);
> +					      extp->ext_len,
> +					      &oinfo);
>  		if (error)
>  			goto abort_error;
>  
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index 9a462e8..f8d363f 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -219,7 +219,7 @@ struct xfs_efd_log_item	*xfs_trans_get_efd(xfs_trans_t *,
>  				  uint);
>  int		xfs_trans_free_extent(struct xfs_trans *,
>  				      struct xfs_efd_log_item *, xfs_fsblock_t,
> -				      xfs_extlen_t);
> +				      xfs_extlen_t, struct xfs_owner_info *);
>  int		xfs_trans_commit(struct xfs_trans *);
>  int		__xfs_trans_roll(struct xfs_trans **, struct xfs_inode *, int *);
>  int		xfs_trans_roll(struct xfs_trans **, struct xfs_inode *);
> diff --git a/fs/xfs/xfs_trans_extfree.c b/fs/xfs/xfs_trans_extfree.c
> index a96ae54..d1b8833 100644
> --- a/fs/xfs/xfs_trans_extfree.c
> +++ b/fs/xfs/xfs_trans_extfree.c
> @@ -118,13 +118,14 @@ xfs_trans_free_extent(
>  	struct xfs_trans	*tp,
>  	struct xfs_efd_log_item	*efdp,
>  	xfs_fsblock_t		start_block,
> -	xfs_extlen_t		ext_len)
> +	xfs_extlen_t		ext_len,
> +	struct xfs_owner_info	*oinfo)
>  {
>  	uint			next_extent;
>  	struct xfs_extent	*extp;
>  	int			error;
>  
> -	error = xfs_free_extent(tp, start_block, ext_len);
> +	error = xfs_free_extent(tp, start_block, ext_len, oinfo);
>  
>  	/*
>  	 * Mark the transaction dirty, even on error. This ensures the
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs


* Re: [PATCH 028/119] xfs: define the on-disk rmap btree format
  2016-06-17  1:20 ` [PATCH 028/119] xfs: define the on-disk rmap btree format Darrick J. Wong
  2016-07-06  4:05   ` Dave Chinner
@ 2016-07-07 18:41   ` Brian Foster
  2016-07-07 19:18     ` Darrick J. Wong
  1 sibling, 1 reply; 236+ messages in thread
From: Brian Foster @ 2016-07-07 18:41 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, vishal.l.verma, Dave Chinner, xfs

On Thu, Jun 16, 2016 at 06:20:52PM -0700, Darrick J. Wong wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Now we have all the surrounding call infrastructure in place, we can
> start filling out the rmap btree implementation. Start with the
> on-disk btree format; add everything needed to read, write and
> manipulate rmap btree blocks. This prepares the way for adding the
> btree operations implementation.
> 
> [darrick: record owner and offset info in rmap btree]
> [darrick: fork, bmbt and unwritten state in rmap btree]
> [darrick: flags are a separate field in xfs_rmap_irec]
> [darrick: calculate maxlevels separately]
> [darrick: move the 'unwritten' bit into unused parts of rm_offset]
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> Reviewed-by: Dave Chinner <dchinner@redhat.com>
> Signed-off-by: Dave Chinner <david@fromorbit.com>
> ---
>  fs/xfs/Makefile                |    1 
>  fs/xfs/libxfs/xfs_btree.c      |    3 +
>  fs/xfs/libxfs/xfs_btree.h      |   18 ++--
>  fs/xfs/libxfs/xfs_format.h     |  140 +++++++++++++++++++++++++++++++
>  fs/xfs/libxfs/xfs_rmap_btree.c |  180 ++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/libxfs/xfs_rmap_btree.h |   32 +++++++
>  fs/xfs/libxfs/xfs_sb.c         |    6 +
>  fs/xfs/libxfs/xfs_shared.h     |    2 
>  fs/xfs/xfs_mount.c             |    2 
>  fs/xfs/xfs_mount.h             |    3 +
>  fs/xfs/xfs_ondisk.h            |    3 +
>  fs/xfs/xfs_trace.h             |    2 
>  12 files changed, 384 insertions(+), 8 deletions(-)
>  create mode 100644 fs/xfs/libxfs/xfs_rmap_btree.c
> 
> 
...
> diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
> new file mode 100644
> index 0000000..7a35c78
> --- /dev/null
> +++ b/fs/xfs/libxfs/xfs_rmap_btree.c
> @@ -0,0 +1,180 @@
...
> +static bool
> +xfs_rmapbt_verify(
> +	struct xfs_buf		*bp)
> +{
> +	struct xfs_mount	*mp = bp->b_target->bt_mount;
> +	struct xfs_btree_block	*block = XFS_BUF_TO_BLOCK(bp);
> +	struct xfs_perag	*pag = bp->b_pag;
> +	unsigned int		level;
> +
> +	/*
> +	 * magic number and level verification
> +	 *
> +	 * During growfs operations, we can't verify the exact level or owner as
> +	 * the perag is not fully initialised and hence not attached to the
> +	 * buffer.  In this case, check against the maximum tree depth.
> +	 *
> +	 * Similarly, during log recovery we will have a perag structure
> +	 * attached, but the agf information will not yet have been initialised
> +	 * from the on disk AGF. Again, we can only check against maximum limits
> +	 * in this case.
> +	 */
> +	if (block->bb_magic != cpu_to_be32(XFS_RMAP_CRC_MAGIC))
> +		return false;
> +
> +	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
> +		return false;
> +	if (!xfs_btree_sblock_v5hdr_verify(bp))
> +		return false;
> +
> +	level = be16_to_cpu(block->bb_level);
> +	if (pag && pag->pagf_init) {
> +		if (level >= pag->pagf_levels[XFS_BTNUM_RMAPi])
> +			return false;
> +	} else if (level >= mp->m_rmap_maxlevels)
> +		return false;

It looks like the above (level >= mp->m_rmap_maxlevels) check could be
independent (rather than an 'else' branch). Otherwise looks good:

Reviewed-by: Brian Foster <bfoster@redhat.com>

> +
> +	return xfs_btree_sblock_verify(bp, mp->m_rmap_mxr[level != 0]);
> +}
> +
> +static void
> +xfs_rmapbt_read_verify(
> +	struct xfs_buf	*bp)
> +{
> +	if (!xfs_btree_sblock_verify_crc(bp))
> +		xfs_buf_ioerror(bp, -EFSBADCRC);
> +	else if (!xfs_rmapbt_verify(bp))
> +		xfs_buf_ioerror(bp, -EFSCORRUPTED);
> +
> +	if (bp->b_error) {
> +		trace_xfs_btree_corrupt(bp, _RET_IP_);
> +		xfs_verifier_error(bp);
> +	}
> +}
> +
> +static void
> +xfs_rmapbt_write_verify(
> +	struct xfs_buf	*bp)
> +{
> +	if (!xfs_rmapbt_verify(bp)) {
> +		trace_xfs_btree_corrupt(bp, _RET_IP_);
> +		xfs_buf_ioerror(bp, -EFSCORRUPTED);
> +		xfs_verifier_error(bp);
> +		return;
> +	}
> +	xfs_btree_sblock_calc_crc(bp);
> +
> +}
> +
> +const struct xfs_buf_ops xfs_rmapbt_buf_ops = {
> +	.name			= "xfs_rmapbt",
> +	.verify_read		= xfs_rmapbt_read_verify,
> +	.verify_write		= xfs_rmapbt_write_verify,
> +};
> +
> +static const struct xfs_btree_ops xfs_rmapbt_ops = {
> +	.rec_len		= sizeof(struct xfs_rmap_rec),
> +	.key_len		= sizeof(struct xfs_rmap_key),
> +
> +	.dup_cursor		= xfs_rmapbt_dup_cursor,
> +	.buf_ops		= &xfs_rmapbt_buf_ops,
> +};
> +
> +/*
> + * Allocate a new allocation btree cursor.
> + */
> +struct xfs_btree_cur *
> +xfs_rmapbt_init_cursor(
> +	struct xfs_mount	*mp,
> +	struct xfs_trans	*tp,
> +	struct xfs_buf		*agbp,
> +	xfs_agnumber_t		agno)
> +{
> +	struct xfs_agf		*agf = XFS_BUF_TO_AGF(agbp);
> +	struct xfs_btree_cur	*cur;
> +
> +	cur = kmem_zone_zalloc(xfs_btree_cur_zone, KM_NOFS);
> +	cur->bc_tp = tp;
> +	cur->bc_mp = mp;
> +	cur->bc_btnum = XFS_BTNUM_RMAP;
> +	cur->bc_flags = XFS_BTREE_CRC_BLOCKS;
> +	cur->bc_blocklog = mp->m_sb.sb_blocklog;
> +	cur->bc_ops = &xfs_rmapbt_ops;
> +	cur->bc_nlevels = be32_to_cpu(agf->agf_levels[XFS_BTNUM_RMAP]);
> +
> +	cur->bc_private.a.agbp = agbp;
> +	cur->bc_private.a.agno = agno;
> +
> +	return cur;
> +}
> +
> +/*
> + * Calculate number of records in an rmap btree block.
> + */
> +int
> +xfs_rmapbt_maxrecs(
> +	struct xfs_mount	*mp,
> +	int			blocklen,
> +	int			leaf)
> +{
> +	blocklen -= XFS_RMAP_BLOCK_LEN;
> +
> +	if (leaf)
> +		return blocklen / sizeof(struct xfs_rmap_rec);
> +	return blocklen /
> +		(sizeof(struct xfs_rmap_key) + sizeof(xfs_rmap_ptr_t));
> +}
> +
> +/* Compute the maximum height of an rmap btree. */
> +void
> +xfs_rmapbt_compute_maxlevels(
> +	struct xfs_mount		*mp)
> +{
> +	mp->m_rmap_maxlevels = xfs_btree_compute_maxlevels(mp,
> +			mp->m_rmap_mnr, mp->m_sb.sb_agblocks);
> +}
> diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
> index a3b8f90..462767f 100644
> --- a/fs/xfs/libxfs/xfs_rmap_btree.h
> +++ b/fs/xfs/libxfs/xfs_rmap_btree.h
> @@ -19,6 +19,38 @@
>  #define	__XFS_RMAP_BTREE_H__
>  
>  struct xfs_buf;
> +struct xfs_btree_cur;
> +struct xfs_mount;
> +
> +/* rmaps only exist on crc enabled filesystems */
> +#define XFS_RMAP_BLOCK_LEN	XFS_BTREE_SBLOCK_CRC_LEN
> +
> +/*
> + * Record, key, and pointer address macros for btree blocks.
> + *
> + * (note that some of these may appear unused, but they are used in userspace)
> + */
> +#define XFS_RMAP_REC_ADDR(block, index) \
> +	((struct xfs_rmap_rec *) \
> +		((char *)(block) + XFS_RMAP_BLOCK_LEN + \
> +		 (((index) - 1) * sizeof(struct xfs_rmap_rec))))
> +
> +#define XFS_RMAP_KEY_ADDR(block, index) \
> +	((struct xfs_rmap_key *) \
> +		((char *)(block) + XFS_RMAP_BLOCK_LEN + \
> +		 ((index) - 1) * sizeof(struct xfs_rmap_key)))
> +
> +#define XFS_RMAP_PTR_ADDR(block, index, maxrecs) \
> +	((xfs_rmap_ptr_t *) \
> +		((char *)(block) + XFS_RMAP_BLOCK_LEN + \
> +		 (maxrecs) * sizeof(struct xfs_rmap_key) + \
> +		 ((index) - 1) * sizeof(xfs_rmap_ptr_t)))
> +
> +struct xfs_btree_cur *xfs_rmapbt_init_cursor(struct xfs_mount *mp,
> +				struct xfs_trans *tp, struct xfs_buf *bp,
> +				xfs_agnumber_t agno);
> +int xfs_rmapbt_maxrecs(struct xfs_mount *mp, int blocklen, int leaf);
> +extern void xfs_rmapbt_compute_maxlevels(struct xfs_mount *mp);
>  
>  int xfs_rmap_alloc(struct xfs_trans *tp, struct xfs_buf *agbp,
>  		   xfs_agnumber_t agno, xfs_agblock_t bno, xfs_extlen_t len,
> diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
> index a544686..f86226b 100644
> --- a/fs/xfs/libxfs/xfs_sb.c
> +++ b/fs/xfs/libxfs/xfs_sb.c
> @@ -37,6 +37,7 @@
>  #include "xfs_alloc_btree.h"
>  #include "xfs_ialloc_btree.h"
>  #include "xfs_log.h"
> +#include "xfs_rmap_btree.h"
>  
>  /*
>   * Physical superblock buffer manipulations. Shared with libxfs in userspace.
> @@ -734,6 +735,11 @@ xfs_sb_mount_common(
>  	mp->m_bmap_dmnr[0] = mp->m_bmap_dmxr[0] / 2;
>  	mp->m_bmap_dmnr[1] = mp->m_bmap_dmxr[1] / 2;
>  
> +	mp->m_rmap_mxr[0] = xfs_rmapbt_maxrecs(mp, sbp->sb_blocksize, 1);
> +	mp->m_rmap_mxr[1] = xfs_rmapbt_maxrecs(mp, sbp->sb_blocksize, 0);
> +	mp->m_rmap_mnr[0] = mp->m_rmap_mxr[0] / 2;
> +	mp->m_rmap_mnr[1] = mp->m_rmap_mxr[1] / 2;
> +
>  	mp->m_bsize = XFS_FSB_TO_BB(mp, 1);
>  	mp->m_ialloc_inos = (int)MAX((__uint16_t)XFS_INODES_PER_CHUNK,
>  					sbp->sb_inopblock);
> diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
> index 16002b5..0c5b30b 100644
> --- a/fs/xfs/libxfs/xfs_shared.h
> +++ b/fs/xfs/libxfs/xfs_shared.h
> @@ -38,6 +38,7 @@ extern const struct xfs_buf_ops xfs_agi_buf_ops;
>  extern const struct xfs_buf_ops xfs_agf_buf_ops;
>  extern const struct xfs_buf_ops xfs_agfl_buf_ops;
>  extern const struct xfs_buf_ops xfs_allocbt_buf_ops;
> +extern const struct xfs_buf_ops xfs_rmapbt_buf_ops;
>  extern const struct xfs_buf_ops xfs_attr3_leaf_buf_ops;
>  extern const struct xfs_buf_ops xfs_attr3_rmt_buf_ops;
>  extern const struct xfs_buf_ops xfs_bmbt_buf_ops;
> @@ -116,6 +117,7 @@ int	xfs_log_calc_minimum_size(struct xfs_mount *);
>  #define	XFS_INO_BTREE_REF	3
>  #define	XFS_ALLOC_BTREE_REF	2
>  #define	XFS_BMAP_BTREE_REF	2
> +#define	XFS_RMAP_BTREE_REF	2
>  #define	XFS_DIR_BTREE_REF	2
>  #define	XFS_INO_REF		2
>  #define	XFS_ATTR_BTREE_REF	1
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index b4153f0..8af1c88 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -42,6 +42,7 @@
>  #include "xfs_trace.h"
>  #include "xfs_icache.h"
>  #include "xfs_sysfs.h"
> +#include "xfs_rmap_btree.h"
>  
>  
>  static DEFINE_MUTEX(xfs_uuid_table_mutex);
> @@ -680,6 +681,7 @@ xfs_mountfs(
>  	xfs_bmap_compute_maxlevels(mp, XFS_DATA_FORK);
>  	xfs_bmap_compute_maxlevels(mp, XFS_ATTR_FORK);
>  	xfs_ialloc_compute_maxlevels(mp);
> +	xfs_rmapbt_compute_maxlevels(mp);
>  
>  	xfs_set_maxicount(mp);
>  
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 0537b1f..0ed0f29 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -116,9 +116,12 @@ typedef struct xfs_mount {
>  	uint			m_bmap_dmnr[2];	/* min bmap btree records */
>  	uint			m_inobt_mxr[2];	/* max inobt btree records */
>  	uint			m_inobt_mnr[2];	/* min inobt btree records */
> +	uint			m_rmap_mxr[2];	/* max rmap btree records */
> +	uint			m_rmap_mnr[2];	/* min rmap btree records */
>  	uint			m_ag_maxlevels;	/* XFS_AG_MAXLEVELS */
>  	uint			m_bm_maxlevels[2]; /* XFS_BM_MAXLEVELS */
>  	uint			m_in_maxlevels;	/* max inobt btree levels. */
> +	uint			m_rmap_maxlevels; /* max rmap btree levels */
>  	xfs_extlen_t		m_ag_prealloc_blocks; /* reserved ag blocks */
>  	struct radix_tree_root	m_perag_tree;	/* per-ag accounting info */
>  	spinlock_t		m_perag_lock;	/* lock for m_perag_tree */
> diff --git a/fs/xfs/xfs_ondisk.h b/fs/xfs/xfs_ondisk.h
> index 0272301..48d544f 100644
> --- a/fs/xfs/xfs_ondisk.h
> +++ b/fs/xfs/xfs_ondisk.h
> @@ -47,11 +47,14 @@ xfs_check_ondisk_structs(void)
>  	XFS_CHECK_STRUCT_SIZE(struct xfs_dsymlink_hdr,		56);
>  	XFS_CHECK_STRUCT_SIZE(struct xfs_inobt_key,		4);
>  	XFS_CHECK_STRUCT_SIZE(struct xfs_inobt_rec,		16);
> +	XFS_CHECK_STRUCT_SIZE(struct xfs_rmap_key,		20);
> +	XFS_CHECK_STRUCT_SIZE(struct xfs_rmap_rec,		24);
>  	XFS_CHECK_STRUCT_SIZE(struct xfs_timestamp,		8);
>  	XFS_CHECK_STRUCT_SIZE(xfs_alloc_key_t,			8);
>  	XFS_CHECK_STRUCT_SIZE(xfs_alloc_ptr_t,			4);
>  	XFS_CHECK_STRUCT_SIZE(xfs_alloc_rec_t,			8);
>  	XFS_CHECK_STRUCT_SIZE(xfs_inobt_ptr_t,			4);
> +	XFS_CHECK_STRUCT_SIZE(xfs_rmap_ptr_t,			4);
>  
>  	/* dir/attr trees */
>  	XFS_CHECK_STRUCT_SIZE(struct xfs_attr3_leaf_hdr,	80);
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index 4872fbd..b4ee9c8 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -2444,6 +2444,8 @@ DECLARE_EVENT_CLASS(xfs_rmap_class,
>  		__entry->owner = oinfo->oi_owner;
>  		__entry->offset = oinfo->oi_offset;
>  		__entry->flags = oinfo->oi_flags;
> +		if (unwritten)
> +			__entry->flags |= XFS_RMAP_UNWRITTEN;
>  	),
>  	TP_printk("dev %d:%d agno %u agbno %u len %u owner %lld offset %llu flags 0x%lx",
>  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> 


* Re: [PATCH 026/119] xfs: add owner field to extent allocation and freeing
  2016-07-07 15:12   ` Brian Foster
@ 2016-07-07 19:09     ` Darrick J. Wong
  2016-07-07 22:55       ` Dave Chinner
  2016-07-08 11:37       ` Brian Foster
  0 siblings, 2 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-07-07 19:09 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, vishal.l.verma, xfs, Dave Chinner

On Thu, Jul 07, 2016 at 11:12:27AM -0400, Brian Foster wrote:
> On Thu, Jun 16, 2016 at 06:20:39PM -0700, Darrick J. Wong wrote:
> > For the rmap btree to work, we have to feed the extent owner
> > information to the allocation and freeing functions. This
> > information is what will end up in the rmap btree that tracks
> > allocated extents. While we technically don't need the owner
> > information when freeing extents, passing it allows us to validate
> > that the extent we are removing from the rmap btree actually
> > belonged to the owner we expected it to belong to.
> > 
> > We also define a special set of owner values for internal metadata
> > that would otherwise have no owner. This allows us to tell the
> > difference between metadata owned by different per-ag btrees, as
> > well as static fs metadata (e.g. AG headers) and internal journal
> > blocks.
> > 
> > There are also a couple of special cases we need to take care of -
> > during EFI recovery, we don't actually know who the original owner
> > was, so we need to pass a wildcard to indicate that we aren't
> > checking the owner for validity. We also need special handling in
> > growfs, as we "free" the space in the last AG when extending it, but
> > because it's new space it has no actual owner...
> > 
> > While touching the xfs_bmap_add_free() function, re-order the
> > parameters to put the struct xfs_mount first.
> > 
> > Extend the owner field to include both the owner type and some sort
> > of index within the owner.  The index field will be used to support
> > reverse mappings when reflink is enabled.
> > 
> > This is based upon a patch originally from Dave Chinner. It has been
> > extended to add more owner information with the intent of helping
> > recovery operations when things go wrong (e.g. offset of user data
> > block in a file).
> > 
> > v2: When we're freeing extents from an EFI, we don't have the owner
> > information available (rmap updates have their own redo items).
> > xfs_free_extent therefore doesn't need to do an rmap update, but the
> > log replay code doesn't signal this correctly.  Fix it so that it
> > does.
> > 
> > [dchinner: de-shout the xfs_rmap_*_owner helpers]
> > [darrick: minor style fixes suggested by Christoph Hellwig]
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > Reviewed-by: Dave Chinner <dchinner@redhat.com>
> > Signed-off-by: Dave Chinner <david@fromorbit.com>
> > ---
> >  fs/xfs/libxfs/xfs_alloc.c        |   11 +++++-
> >  fs/xfs/libxfs/xfs_alloc.h        |    4 ++
> >  fs/xfs/libxfs/xfs_bmap.c         |   17 ++++++++--
> >  fs/xfs/libxfs/xfs_bmap.h         |    4 ++
> >  fs/xfs/libxfs/xfs_bmap_btree.c   |    6 +++-
> >  fs/xfs/libxfs/xfs_format.h       |   65 ++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/libxfs/xfs_ialloc.c       |    7 +++-
> >  fs/xfs/libxfs/xfs_ialloc_btree.c |    7 ++++
> >  fs/xfs/xfs_defer_item.c          |    3 +-
> >  fs/xfs/xfs_fsops.c               |   16 +++++++--
> >  fs/xfs/xfs_log_recover.c         |    5 ++-
> >  fs/xfs/xfs_trans.h               |    2 +
> >  fs/xfs/xfs_trans_extfree.c       |    5 ++-
> >  13 files changed, 131 insertions(+), 21 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> > index fb00042..eed26f9 100644
> > --- a/fs/xfs/libxfs/xfs_alloc.c
> > +++ b/fs/xfs/libxfs/xfs_alloc.c
> > @@ -1596,6 +1596,7 @@ xfs_free_ag_extent(
> >  	xfs_agnumber_t	agno,	/* allocation group number */
> >  	xfs_agblock_t	bno,	/* starting block number */
> >  	xfs_extlen_t	len,	/* length of extent */
> > +	struct xfs_owner_info	*oinfo,	/* extent owner */
> 
> Alignment, here and a couple other places in the patch.

Ok, will have a look at that the next time I go through all the patches.

> >  	int		isfl)	/* set if is freelist blocks - no sb acctg */
> >  {
> >  	xfs_btree_cur_t	*bno_cur;	/* cursor for by-block btree */
> > @@ -2005,13 +2006,15 @@ xfs_alloc_fix_freelist(
> >  	 * back on the free list? Maybe we should only do this when space is
> >  	 * getting low or the AGFL is more than half full?
> >  	 */
> > +	xfs_rmap_ag_owner(&targs.oinfo, XFS_RMAP_OWN_AG);
> >  	while (pag->pagf_flcount > need) {
> >  		struct xfs_buf	*bp;
> >  
> >  		error = xfs_alloc_get_freelist(tp, agbp, &bno, 0);
> >  		if (error)
> >  			goto out_agbp_relse;
> > -		error = xfs_free_ag_extent(tp, agbp, args->agno, bno, 1, 1);
> > +		error = xfs_free_ag_extent(tp, agbp, args->agno, bno, 1,
> > +					   &targs.oinfo, 1);
> >  		if (error)
> >  			goto out_agbp_relse;
> >  		bp = xfs_btree_get_bufs(mp, tp, args->agno, bno, 0);
> > @@ -2021,6 +2024,7 @@ xfs_alloc_fix_freelist(
> >  	memset(&targs, 0, sizeof(targs));
> >  	targs.tp = tp;
> >  	targs.mp = mp;
> > +	xfs_rmap_ag_owner(&targs.oinfo, XFS_RMAP_OWN_AG);
> >  	targs.agbp = agbp;
> >  	targs.agno = args->agno;
> >  	targs.alignment = targs.minlen = targs.prod = targs.isfl = 1;
> > @@ -2711,7 +2715,8 @@ int				/* error */
> >  xfs_free_extent(
> >  	struct xfs_trans	*tp,	/* transaction pointer */
> >  	xfs_fsblock_t		bno,	/* starting block number of extent */
> > -	xfs_extlen_t		len)	/* length of extent */
> > +	xfs_extlen_t		len,	/* length of extent */
> > +	struct xfs_owner_info	*oinfo)	/* extent owner */
> >  {
> >  	struct xfs_mount	*mp = tp->t_mountp;
> >  	struct xfs_buf		*agbp;
> > @@ -2739,7 +2744,7 @@ xfs_free_extent(
> >  			agbno + len <= be32_to_cpu(XFS_BUF_TO_AGF(agbp)->agf_length),
> >  			err);
> >  
> > -	error = xfs_free_ag_extent(tp, agbp, agno, agbno, len, 0);
> > +	error = xfs_free_ag_extent(tp, agbp, agno, agbno, len, oinfo, 0);
> >  	if (error)
> >  		goto err;
> >  
> > diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
> > index 20b54aa..0721a48 100644
> > --- a/fs/xfs/libxfs/xfs_alloc.h
> > +++ b/fs/xfs/libxfs/xfs_alloc.h
> > @@ -123,6 +123,7 @@ typedef struct xfs_alloc_arg {
> >  	char		isfl;		/* set if is freelist blocks - !acctg */
> >  	char		userdata;	/* mask defining userdata treatment */
> >  	xfs_fsblock_t	firstblock;	/* io first block allocated */
> > +	struct xfs_owner_info	oinfo;	/* owner of blocks being allocated */
> >  } xfs_alloc_arg_t;
> >  
> >  /*
> > @@ -210,7 +211,8 @@ int				/* error */
> >  xfs_free_extent(
> >  	struct xfs_trans *tp,	/* transaction pointer */
> >  	xfs_fsblock_t	bno,	/* starting block number of extent */
> > -	xfs_extlen_t	len);	/* length of extent */
> > +	xfs_extlen_t	len,	/* length of extent */
> > +	struct xfs_owner_info	*oinfo);	/* extent owner */
> >  
> >  int				/* error */
> >  xfs_alloc_lookup_ge(
> > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > index 3a6d3e3..2c28f2a 100644
> > --- a/fs/xfs/libxfs/xfs_bmap.c
> > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > @@ -574,7 +574,8 @@ xfs_bmap_add_free(
> >  	struct xfs_mount	*mp,		/* mount point structure */
> >  	struct xfs_defer_ops	*dfops,		/* list of extents */
> >  	xfs_fsblock_t		bno,		/* fs block number of extent */
> > -	xfs_filblks_t		len)		/* length of extent */
> > +	xfs_filblks_t		len,		/* length of extent */
> > +	struct xfs_owner_info	*oinfo)		/* extent owner */
> >  {
> >  	struct xfs_bmap_free_item	*new;		/* new element */
> >  #ifdef DEBUG
> > @@ -593,9 +594,14 @@ xfs_bmap_add_free(
> >  	ASSERT(agbno + len <= mp->m_sb.sb_agblocks);
> >  #endif
> >  	ASSERT(xfs_bmap_free_item_zone != NULL);
> > +
> >  	new = kmem_zone_alloc(xfs_bmap_free_item_zone, KM_SLEEP);
> >  	new->xbfi_startblock = bno;
> >  	new->xbfi_blockcount = (xfs_extlen_t)len;
> > +	if (oinfo)
> > +		memcpy(&new->xbfi_oinfo, oinfo, sizeof(struct xfs_owner_info));
> > +	else
> > +		memset(&new->xbfi_oinfo, 0, sizeof(struct xfs_owner_info));
> 
> How about just using KM_ZERO on the allocation and doing something like
> 'if (oinfo) new->xbfi_oinfo = *oinfo'?
> 
> BTW, what's the use case for a zeroed out oinfo if we explicitly define
> null/unknown owner types?

The two main ways we end up altering the rmapbt are as follows:

1) Alloc/free of AG metadata blocks.  For this use case, the caller (generally
a btree ->alloc_block function) bundles the bnobt and rmapbt updates in the
same transaction by passing ownership info (via this oinfo pointer) to the
alloc/free function.  Passing the "special" owner value XFS_RMAP_OWN_NULL just
checks that there are no rmaps for the given range, which is a spot check
performed by growfs.

2) Map/unmap of file blocks.  For this use case, I must treat map/unmap
separately from alloc/free in order to handle reflink.  Therefore, the map &
unmap functions schedule rmap updates directly (via the deferred ops mechanism)
and the alloc/free functions, if they're called, should not update the rmapbt.
Zeroing out the oinfo indicates this.  However, XFS_RMAP_OWN_UNKNOWN is now
unused, so I think I can overload that, especially since we should never be
writing XFS_RMAP_OWN_UNKNOWN to disk.

I think I can simply create an "xfs_rmap_skip_owner_update()" helper (like the
other xfs_rmap_*_owner functions) to encapsulate this.

if (oinfo)
	new->xbfi_oinfo = *oinfo;
else
	xfs_rmap_skip_owner_update(&new->xbfi_oinfo);

Seems clearer, I hope?

Also, the "Special Case #2: EFIs do not record the owner of the extent, so
when" comment is now wrong and needs to be changed.

"Special Case #2: An owner of XFS_RMAP_OWN_UNKNOWN means 'no rmap update'".
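
[Editor's note: a minimal standalone sketch of the proposed helper. The struct layout and the XFS_RMAP_OWN_UNKNOWN value mirror the patch quoted above; the helper name is Darrick's proposal and had not yet landed at this point in the thread, so treat this as illustrative rather than the final kernel code.]

```c
#include <stdint.h>

/* Illustrative stand-in for struct xfs_owner_info from the patch above. */
typedef uint64_t xfs_fileoff_t;

struct xfs_owner_info {
	uint64_t	oi_owner;
	xfs_fileoff_t	oi_offset;
	unsigned int	oi_flags;
};

/* "Unknown owner" sentinel, as defined in xfs_format.h in this series. */
#define XFS_RMAP_OWN_UNKNOWN	((uint64_t)-2)

/*
 * Hypothetical helper: mark an owner_info so the alloc/free paths know
 * to skip the rmapbt update (map/unmap schedule their own deferred rmap
 * updates, per case #2 above).
 */
static inline void
xfs_rmap_skip_owner_update(struct xfs_owner_info *oi)
{
	oi->oi_owner = XFS_RMAP_OWN_UNKNOWN;
	oi->oi_offset = 0;
	oi->oi_flags = 0;
}
```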

> >  	trace_xfs_bmap_free_defer(mp, XFS_FSB_TO_AGNO(mp, bno), 0,
> >  			XFS_FSB_TO_AGBNO(mp, bno), len);
> >  	xfs_defer_add(dfops, XFS_DEFER_OPS_TYPE_FREE, &new->xbfi_list);
> > @@ -628,6 +634,7 @@ xfs_bmap_btree_to_extents(
> >  	xfs_mount_t		*mp;	/* mount point structure */
> >  	__be64			*pp;	/* ptr to block address */
> >  	struct xfs_btree_block	*rblock;/* root btree block */
> > +	struct xfs_owner_info	oinfo;
> >  
> >  	mp = ip->i_mount;
> >  	ifp = XFS_IFORK_PTR(ip, whichfork);
> > @@ -651,7 +658,8 @@ xfs_bmap_btree_to_extents(
> >  	cblock = XFS_BUF_TO_BLOCK(cbp);
> >  	if ((error = xfs_btree_check_block(cur, cblock, 0, cbp)))
> >  		return error;
> > -	xfs_bmap_add_free(mp, cur->bc_private.b.dfops, cbno, 1);
> > +	xfs_rmap_ino_bmbt_owner(&oinfo, ip->i_ino, whichfork);
> > +	xfs_bmap_add_free(mp, cur->bc_private.b.dfops, cbno, 1, &oinfo);
> >  	ip->i_d.di_nblocks--;
> >  	xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT, -1L);
> >  	xfs_trans_binval(tp, cbp);
> > @@ -732,6 +740,7 @@ xfs_bmap_extents_to_btree(
> >  	memset(&args, 0, sizeof(args));
> >  	args.tp = tp;
> >  	args.mp = mp;
> > +	xfs_rmap_ino_bmbt_owner(&args.oinfo, ip->i_ino, whichfork);
> >  	args.firstblock = *firstblock;
> >  	if (*firstblock == NULLFSBLOCK) {
> >  		args.type = XFS_ALLOCTYPE_START_BNO;
> > @@ -878,6 +887,7 @@ xfs_bmap_local_to_extents(
> >  	memset(&args, 0, sizeof(args));
> >  	args.tp = tp;
> >  	args.mp = ip->i_mount;
> > +	xfs_rmap_ino_owner(&args.oinfo, ip->i_ino, whichfork, 0);
> >  	args.firstblock = *firstblock;
> >  	/*
> >  	 * Allocate a block.  We know we need only one, since the
> > @@ -4839,6 +4849,7 @@ xfs_bmap_del_extent(
> >  		nblks = 0;
> >  		do_fx = 0;
> >  	}
> > +
> >  	/*
> >  	 * Set flag value to use in switch statement.
> >  	 * Left-contig is 2, right-contig is 1.
> > @@ -5026,7 +5037,7 @@ xfs_bmap_del_extent(
> >  	 */
> >  	if (do_fx)
> >  		xfs_bmap_add_free(mp, dfops, del->br_startblock,
> > -			del->br_blockcount);
> > +				  del->br_blockcount, NULL);
> 
> Any reason we don't set the owner here?

(See above.)

> >  	/*
> >  	 * Adjust inode # blocks in the file.
> >  	 */
> > diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
> > index 8c5f530..862ea464 100644
> > --- a/fs/xfs/libxfs/xfs_bmap.h
> > +++ b/fs/xfs/libxfs/xfs_bmap.h
> > @@ -67,6 +67,7 @@ struct xfs_bmap_free_item
> >  	xfs_fsblock_t		xbfi_startblock;/* starting fs block number */
> >  	xfs_extlen_t		xbfi_blockcount;/* number of blocks in extent */
> >  	struct list_head	xbfi_list;
> > +	struct xfs_owner_info	xbfi_oinfo;	/* extent owner */
> >  };
> >  
> >  #define	XFS_BMAP_MAX_NMAP	4
> > @@ -165,7 +166,8 @@ void	xfs_bmap_trace_exlist(struct xfs_inode *ip, xfs_extnum_t cnt,
> >  int	xfs_bmap_add_attrfork(struct xfs_inode *ip, int size, int rsvd);
> >  void	xfs_bmap_local_to_extents_empty(struct xfs_inode *ip, int whichfork);
> >  void	xfs_bmap_add_free(struct xfs_mount *mp, struct xfs_defer_ops *dfops,
> > -			  xfs_fsblock_t bno, xfs_filblks_t len);
> > +			  xfs_fsblock_t bno, xfs_filblks_t len,
> > +			  struct xfs_owner_info *oinfo);
> >  void	xfs_bmap_compute_maxlevels(struct xfs_mount *mp, int whichfork);
> >  int	xfs_bmap_first_unused(struct xfs_trans *tp, struct xfs_inode *ip,
> >  		xfs_extlen_t len, xfs_fileoff_t *unused, int whichfork);
> > diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
> > index 18b5361..3e68f9a 100644
> > --- a/fs/xfs/libxfs/xfs_bmap_btree.c
> > +++ b/fs/xfs/libxfs/xfs_bmap_btree.c
> > @@ -447,6 +447,8 @@ xfs_bmbt_alloc_block(
> >  	args.mp = cur->bc_mp;
> >  	args.fsbno = cur->bc_private.b.firstblock;
> >  	args.firstblock = args.fsbno;
> > +	xfs_rmap_ino_bmbt_owner(&args.oinfo, cur->bc_private.b.ip->i_ino,
> > +			cur->bc_private.b.whichfork);
> >  
> >  	if (args.fsbno == NULLFSBLOCK) {
> >  		args.fsbno = be64_to_cpu(start->l);
> > @@ -526,8 +528,10 @@ xfs_bmbt_free_block(
> >  	struct xfs_inode	*ip = cur->bc_private.b.ip;
> >  	struct xfs_trans	*tp = cur->bc_tp;
> >  	xfs_fsblock_t		fsbno = XFS_DADDR_TO_FSB(mp, XFS_BUF_ADDR(bp));
> > +	struct xfs_owner_info	oinfo;
> >  
> > -	xfs_bmap_add_free(mp, cur->bc_private.b.dfops, fsbno, 1);
> > +	xfs_rmap_ino_bmbt_owner(&oinfo, ip->i_ino, cur->bc_private.b.whichfork);
> > +	xfs_bmap_add_free(mp, cur->bc_private.b.dfops, fsbno, 1, &oinfo);
> >  	ip->i_d.di_nblocks--;
> >  
> >  	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > index b5b0901..97f354f 100644
> > --- a/fs/xfs/libxfs/xfs_format.h
> > +++ b/fs/xfs/libxfs/xfs_format.h
> > @@ -1318,6 +1318,71 @@ typedef __be32 xfs_inobt_ptr_t;
> >   */
> >  #define	XFS_RMAP_CRC_MAGIC	0x524d4233	/* 'RMB3' */
> >  
> > +/*
> > + * Ownership info for an extent.  This is used to create reverse-mapping
> > + * entries.
> > + */
> > +#define XFS_OWNER_INFO_ATTR_FORK	(1 << 0)
> > +#define XFS_OWNER_INFO_BMBT_BLOCK	(1 << 1)
> > +struct xfs_owner_info {
> > +	uint64_t		oi_owner;
> > +	xfs_fileoff_t		oi_offset;
> > +	unsigned int		oi_flags;
> > +};
> > +
> > +static inline void
> > +xfs_rmap_ag_owner(
> > +	struct xfs_owner_info	*oi,
> > +	uint64_t		owner)
> > +{
> > +	oi->oi_owner = owner;
> > +	oi->oi_offset = 0;
> > +	oi->oi_flags = 0;
> > +}
> > +
> > +static inline void
> > +xfs_rmap_ino_bmbt_owner(
> > +	struct xfs_owner_info	*oi,
> > +	xfs_ino_t		ino,
> > +	int			whichfork)
> > +{
> > +	oi->oi_owner = ino;
> > +	oi->oi_offset = 0;
> > +	oi->oi_flags = XFS_OWNER_INFO_BMBT_BLOCK;
> > +	if (whichfork == XFS_ATTR_FORK)
> > +		oi->oi_flags |= XFS_OWNER_INFO_ATTR_FORK;
> > +}
> > +
> > +static inline void
> > +xfs_rmap_ino_owner(
> > +	struct xfs_owner_info	*oi,
> > +	xfs_ino_t		ino,
> > +	int			whichfork,
> > +	xfs_fileoff_t		offset)
> > +{
> > +	oi->oi_owner = ino;
> > +	oi->oi_offset = offset;
> > +	oi->oi_flags = 0;
> > +	if (whichfork == XFS_ATTR_FORK)
> > +		oi->oi_flags |= XFS_OWNER_INFO_ATTR_FORK;
> > +}
> > +
> > +/*
> > + * Special owner types.
> > + *
> > + * Seeing as we only support up to 8EB, we have the upper bit of the owner field
> > + * to tell us we have a special owner value. We use these for static metadata
> > + * allocated at mkfs/growfs time, as well as for freespace management metadata.
> > + */
> > +#define XFS_RMAP_OWN_NULL	(-1ULL)	/* No owner, for growfs */
> > +#define XFS_RMAP_OWN_UNKNOWN	(-2ULL)	/* Unknown owner, for EFI recovery */
> > +#define XFS_RMAP_OWN_FS		(-3ULL)	/* static fs metadata */
> > +#define XFS_RMAP_OWN_LOG	(-4ULL)	/* static fs metadata */
> > +#define XFS_RMAP_OWN_AG		(-5ULL)	/* AG freespace btree blocks */
> 
> How about XFS_RMAP_OWN_AGFL? OWN_AG confuses me into thinking it's for
> AG headers, but IIUC that is covered by OWN_FS.

or _SPACEBT for AG {free,rmap} space btrees?

> > +#define XFS_RMAP_OWN_INOBT	(-6ULL)	/* Inode btree blocks */
> > +#define XFS_RMAP_OWN_INODES	(-7ULL)	/* Inode chunk */
> > +#define XFS_RMAP_OWN_MIN	(-8ULL) /* guard */
> > +
> >  #define	XFS_RMAP_BLOCK(mp) \
> >  	(xfs_sb_version_hasfinobt(&((mp)->m_sb)) ? \
> >  	 XFS_FIBT_BLOCK(mp) + 1 : \
> > diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
> > index dbc3e35..1982561 100644
> > --- a/fs/xfs/libxfs/xfs_ialloc.c
> > +++ b/fs/xfs/libxfs/xfs_ialloc.c
> > @@ -615,6 +615,7 @@ xfs_ialloc_ag_alloc(
> >  	args.tp = tp;
> >  	args.mp = tp->t_mountp;
> >  	args.fsbno = NULLFSBLOCK;
> > +	xfs_rmap_ag_owner(&args.oinfo, XFS_RMAP_OWN_INODES);
> >  
> >  #ifdef DEBUG
> >  	/* randomly do sparse inode allocations */
> > @@ -1825,12 +1826,14 @@ xfs_difree_inode_chunk(
> >  	int		nextbit;
> >  	xfs_agblock_t	agbno;
> >  	int		contigblk;
> > +	struct xfs_owner_info	oinfo;
> >  	DECLARE_BITMAP(holemask, XFS_INOBT_HOLEMASK_BITS);
> > +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_INODES);
> >  
> >  	if (!xfs_inobt_issparse(rec->ir_holemask)) {
> >  		/* not sparse, calculate extent info directly */
> >  		xfs_bmap_add_free(mp, dfops, XFS_AGB_TO_FSB(mp, agno, sagbno),
> > -				  mp->m_ialloc_blks);
> > +				  mp->m_ialloc_blks, &oinfo);
> >  		return;
> >  	}
> >  
> > @@ -1874,7 +1877,7 @@ xfs_difree_inode_chunk(
> >  		ASSERT(agbno % mp->m_sb.sb_spino_align == 0);
> >  		ASSERT(contigblk % mp->m_sb.sb_spino_align == 0);
> >  		xfs_bmap_add_free(mp, dfops, XFS_AGB_TO_FSB(mp, agno, agbno),
> > -				  contigblk);
> > +				  contigblk, &oinfo);
> >  
> >  		/* reset range to current bit and carry on... */
> >  		startidx = endidx = nextbit;
> > diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
> > index 88da2ad..f9ea86b 100644
> > --- a/fs/xfs/libxfs/xfs_ialloc_btree.c
> > +++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
> > @@ -96,6 +96,7 @@ xfs_inobt_alloc_block(
> >  	memset(&args, 0, sizeof(args));
> >  	args.tp = cur->bc_tp;
> >  	args.mp = cur->bc_mp;
> > +	xfs_rmap_ag_owner(&args.oinfo, XFS_RMAP_OWN_INOBT);
> >  	args.fsbno = XFS_AGB_TO_FSB(args.mp, cur->bc_private.a.agno, sbno);
> >  	args.minlen = 1;
> >  	args.maxlen = 1;
> > @@ -125,8 +126,12 @@ xfs_inobt_free_block(
> >  	struct xfs_btree_cur	*cur,
> >  	struct xfs_buf		*bp)
> >  {
> > +	struct xfs_owner_info	oinfo;
> > +
> > +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_INOBT);
> >  	return xfs_free_extent(cur->bc_tp,
> > -			XFS_DADDR_TO_FSB(cur->bc_mp, XFS_BUF_ADDR(bp)), 1);
> > +			XFS_DADDR_TO_FSB(cur->bc_mp, XFS_BUF_ADDR(bp)), 1,
> > +			&oinfo);
> >  }
> >  
> >  STATIC int
> > diff --git a/fs/xfs/xfs_defer_item.c b/fs/xfs/xfs_defer_item.c
> > index 127a54e..1c2d556 100644
> > --- a/fs/xfs/xfs_defer_item.c
> > +++ b/fs/xfs/xfs_defer_item.c
> > @@ -99,7 +99,8 @@ xfs_bmap_free_finish_item(
> >  	free = container_of(item, struct xfs_bmap_free_item, xbfi_list);
> >  	error = xfs_trans_free_extent(tp, done_item,
> >  			free->xbfi_startblock,
> > -			free->xbfi_blockcount);
> > +			free->xbfi_blockcount,
> > +			&free->xbfi_oinfo);
> >  	kmem_free(free);
> >  	return error;
> >  }
> > diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> > index 62162d4..d60bb97 100644
> > --- a/fs/xfs/xfs_fsops.c
> > +++ b/fs/xfs/xfs_fsops.c
> > @@ -436,6 +436,8 @@ xfs_growfs_data_private(
> >  	 * There are new blocks in the old last a.g.
> >  	 */
> >  	if (new) {
> > +		struct xfs_owner_info	oinfo;
> > +
> >  		/*
> >  		 * Change the agi length.
> >  		 */
> > @@ -463,14 +465,20 @@ xfs_growfs_data_private(
> >  		       be32_to_cpu(agi->agi_length));
> >  
> >  		xfs_alloc_log_agf(tp, bp, XFS_AGF_LENGTH);
> > +
> >  		/*
> >  		 * Free the new space.
> > +		 *
> > +		 * XFS_RMAP_OWN_NULL is used here to tell the rmap btree that
> > +		 * this doesn't actually exist in the rmap btree.
> >  		 */
> > -		error = xfs_free_extent(tp, XFS_AGB_TO_FSB(mp, agno,
> > -			be32_to_cpu(agf->agf_length) - new), new);
> > -		if (error) {
> > +		xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_NULL);
> > +		error = xfs_free_extent(tp,
> > +				XFS_AGB_TO_FSB(mp, agno,
> > +					be32_to_cpu(agf->agf_length) - new),
> > +				new, &oinfo);
> > +		if (error)
> >  			goto error0;
> > -		}
> >  	}
> >  
> >  	/*
> > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> > index 080b54b..0c41bd2 100644
> > --- a/fs/xfs/xfs_log_recover.c
> > +++ b/fs/xfs/xfs_log_recover.c
> > @@ -4180,6 +4180,7 @@ xlog_recover_process_efi(
> >  	int			error = 0;
> >  	xfs_extent_t		*extp;
> >  	xfs_fsblock_t		startblock_fsb;
> > +	struct xfs_owner_info	oinfo;
> >  
> >  	ASSERT(!test_bit(XFS_EFI_RECOVERED, &efip->efi_flags));
> >  
> > @@ -4211,10 +4212,12 @@ xlog_recover_process_efi(
> >  		return error;
> >  	efdp = xfs_trans_get_efd(tp, efip, efip->efi_format.efi_nextents);
> >  
> > +	oinfo.oi_owner = 0;
> 
> Should this be XFS_RMAP_OWN_UNKNOWN?

xfs_rmap_skip_owner_update(), but yes.

--D

> 
> Brian
> 
> >  	for (i = 0; i < efip->efi_format.efi_nextents; i++) {
> >  		extp = &(efip->efi_format.efi_extents[i]);
> >  		error = xfs_trans_free_extent(tp, efdp, extp->ext_start,
> > -					      extp->ext_len);
> > +					      extp->ext_len,
> > +					      &oinfo);
> >  		if (error)
> >  			goto abort_error;
> >  
> > diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> > index 9a462e8..f8d363f 100644
> > --- a/fs/xfs/xfs_trans.h
> > +++ b/fs/xfs/xfs_trans.h
> > @@ -219,7 +219,7 @@ struct xfs_efd_log_item	*xfs_trans_get_efd(xfs_trans_t *,
> >  				  uint);
> >  int		xfs_trans_free_extent(struct xfs_trans *,
> >  				      struct xfs_efd_log_item *, xfs_fsblock_t,
> > -				      xfs_extlen_t);
> > +				      xfs_extlen_t, struct xfs_owner_info *);
> >  int		xfs_trans_commit(struct xfs_trans *);
> >  int		__xfs_trans_roll(struct xfs_trans **, struct xfs_inode *, int *);
> >  int		xfs_trans_roll(struct xfs_trans **, struct xfs_inode *);
> > diff --git a/fs/xfs/xfs_trans_extfree.c b/fs/xfs/xfs_trans_extfree.c
> > index a96ae54..d1b8833 100644
> > --- a/fs/xfs/xfs_trans_extfree.c
> > +++ b/fs/xfs/xfs_trans_extfree.c
> > @@ -118,13 +118,14 @@ xfs_trans_free_extent(
> >  	struct xfs_trans	*tp,
> >  	struct xfs_efd_log_item	*efdp,
> >  	xfs_fsblock_t		start_block,
> > -	xfs_extlen_t		ext_len)
> > +	xfs_extlen_t		ext_len,
> > +	struct xfs_owner_info	*oinfo)
> >  {
> >  	uint			next_extent;
> >  	struct xfs_extent	*extp;
> >  	int			error;
> >  
> > -	error = xfs_free_extent(tp, start_block, ext_len);
> > +	error = xfs_free_extent(tp, start_block, ext_len, oinfo);
> >  
> >  	/*
> >  	 * Mark the transaction dirty, even on error. This ensures the
> > 

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 028/119] xfs: define the on-disk rmap btree format
  2016-07-07 18:41   ` Brian Foster
@ 2016-07-07 19:18     ` Darrick J. Wong
  2016-07-07 23:14       ` Dave Chinner
  0 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-07-07 19:18 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, vishal.l.verma, xfs, Dave Chinner

On Thu, Jul 07, 2016 at 02:41:56PM -0400, Brian Foster wrote:
> On Thu, Jun 16, 2016 at 06:20:52PM -0700, Darrick J. Wong wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Now we have all the surrounding call infrastructure in place, we can
> > start filling out the rmap btree implementation. Start with the
> > on-disk btree format; add everything needed to read, write and
> > manipulate rmap btree blocks. This prepares the way for adding the
> > btree operations implementation.
> > 
> > [darrick: record owner and offset info in rmap btree]
> > [darrick: fork, bmbt and unwritten state in rmap btree]
> > [darrick: flags are a separate field in xfs_rmap_irec]
> > [darrick: calculate maxlevels separately]
> > [darrick: move the 'unwritten' bit into unused parts of rm_offset]
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > Reviewed-by: Dave Chinner <dchinner@redhat.com>
> > Signed-off-by: Dave Chinner <david@fromorbit.com>
> > ---
> >  fs/xfs/Makefile                |    1 
> >  fs/xfs/libxfs/xfs_btree.c      |    3 +
> >  fs/xfs/libxfs/xfs_btree.h      |   18 ++--
> >  fs/xfs/libxfs/xfs_format.h     |  140 +++++++++++++++++++++++++++++++
> >  fs/xfs/libxfs/xfs_rmap_btree.c |  180 ++++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/libxfs/xfs_rmap_btree.h |   32 +++++++
> >  fs/xfs/libxfs/xfs_sb.c         |    6 +
> >  fs/xfs/libxfs/xfs_shared.h     |    2 
> >  fs/xfs/xfs_mount.c             |    2 
> >  fs/xfs/xfs_mount.h             |    3 +
> >  fs/xfs/xfs_ondisk.h            |    3 +
> >  fs/xfs/xfs_trace.h             |    2 
> >  12 files changed, 384 insertions(+), 8 deletions(-)
> >  create mode 100644 fs/xfs/libxfs/xfs_rmap_btree.c
> > 
> > 
> ...
> > diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
> > new file mode 100644
> > index 0000000..7a35c78
> > --- /dev/null
> > +++ b/fs/xfs/libxfs/xfs_rmap_btree.c
> > @@ -0,0 +1,180 @@
> ...
> > +static bool
> > +xfs_rmapbt_verify(
> > +	struct xfs_buf		*bp)
> > +{
> > +	struct xfs_mount	*mp = bp->b_target->bt_mount;
> > +	struct xfs_btree_block	*block = XFS_BUF_TO_BLOCK(bp);
> > +	struct xfs_perag	*pag = bp->b_pag;
> > +	unsigned int		level;
> > +
> > +	/*
> > +	 * magic number and level verification
> > +	 *
> > +	 * During growfs operations, we can't verify the exact level or owner as
> > +	 * the perag is not fully initialised and hence not attached to the
> > +	 * buffer.  In this case, check against the maximum tree depth.
> > +	 *
> > +	 * Similarly, during log recovery we will have a perag structure
> > +	 * attached, but the agf information will not yet have been initialised
> > +	 * from the on disk AGF. Again, we can only check against maximum limits
> > +	 * in this case.
> > +	 */
> > +	if (block->bb_magic != cpu_to_be32(XFS_RMAP_CRC_MAGIC))
> > +		return false;
> > +
> > +	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
> > +		return false;
> > +	if (!xfs_btree_sblock_v5hdr_verify(bp))
> > +		return false;
> > +
> > +	level = be16_to_cpu(block->bb_level);
> > +	if (pag && pag->pagf_init) {
> > +		if (level >= pag->pagf_levels[XFS_BTNUM_RMAPi])
> > +			return false;
> > +	} else if (level >= mp->m_rmap_maxlevels)
> > +		return false;
> 
> It looks like the above (level >= mp->m_rmap_maxlevels) check could be
> independent (rather than an 'else). Otherwise looks good:

Hmmm.... at first I wondered, "Shouldn't we have already checked that
pag->pagf_levels[XFS_BTNUM_RMAPi] <= mp->m_rmap_maxlevels?"  But then I
realized that no, we don't do that anywhere.  Nor does the bnobt/cntbt
verifier.  Am I missing something?

I did see that we at least check the AGF/AGI levels to make sure they don't
overflow XFS_BTREE_MAXLEVELS, so we're probably fine here.

--D

> 
> Reviewed-by: Brian Foster <bfoster@redhat.com>
> 
> > +
> > +	return xfs_btree_sblock_verify(bp, mp->m_rmap_mxr[level != 0]);
> > +}
> > +
> > +static void
> > +xfs_rmapbt_read_verify(
> > +	struct xfs_buf	*bp)
> > +{
> > +	if (!xfs_btree_sblock_verify_crc(bp))
> > +		xfs_buf_ioerror(bp, -EFSBADCRC);
> > +	else if (!xfs_rmapbt_verify(bp))
> > +		xfs_buf_ioerror(bp, -EFSCORRUPTED);
> > +
> > +	if (bp->b_error) {
> > +		trace_xfs_btree_corrupt(bp, _RET_IP_);
> > +		xfs_verifier_error(bp);
> > +	}
> > +}
> > +
> > +static void
> > +xfs_rmapbt_write_verify(
> > +	struct xfs_buf	*bp)
> > +{
> > +	if (!xfs_rmapbt_verify(bp)) {
> > +		trace_xfs_btree_corrupt(bp, _RET_IP_);
> > +		xfs_buf_ioerror(bp, -EFSCORRUPTED);
> > +		xfs_verifier_error(bp);
> > +		return;
> > +	}
> > +	xfs_btree_sblock_calc_crc(bp);
> > +
> > +}
> > +
> > +const struct xfs_buf_ops xfs_rmapbt_buf_ops = {
> > +	.name			= "xfs_rmapbt",
> > +	.verify_read		= xfs_rmapbt_read_verify,
> > +	.verify_write		= xfs_rmapbt_write_verify,
> > +};
> > +
> > +static const struct xfs_btree_ops xfs_rmapbt_ops = {
> > +	.rec_len		= sizeof(struct xfs_rmap_rec),
> > +	.key_len		= sizeof(struct xfs_rmap_key),
> > +
> > +	.dup_cursor		= xfs_rmapbt_dup_cursor,
> > +	.buf_ops		= &xfs_rmapbt_buf_ops,
> > +};
> > +
> > +/*
> > + * Allocate a new allocation btree cursor.
> > + */
> > +struct xfs_btree_cur *
> > +xfs_rmapbt_init_cursor(
> > +	struct xfs_mount	*mp,
> > +	struct xfs_trans	*tp,
> > +	struct xfs_buf		*agbp,
> > +	xfs_agnumber_t		agno)
> > +{
> > +	struct xfs_agf		*agf = XFS_BUF_TO_AGF(agbp);
> > +	struct xfs_btree_cur	*cur;
> > +
> > +	cur = kmem_zone_zalloc(xfs_btree_cur_zone, KM_NOFS);
> > +	cur->bc_tp = tp;
> > +	cur->bc_mp = mp;
> > +	cur->bc_btnum = XFS_BTNUM_RMAP;
> > +	cur->bc_flags = XFS_BTREE_CRC_BLOCKS;
> > +	cur->bc_blocklog = mp->m_sb.sb_blocklog;
> > +	cur->bc_ops = &xfs_rmapbt_ops;
> > +	cur->bc_nlevels = be32_to_cpu(agf->agf_levels[XFS_BTNUM_RMAP]);
> > +
> > +	cur->bc_private.a.agbp = agbp;
> > +	cur->bc_private.a.agno = agno;
> > +
> > +	return cur;
> > +}
> > +
> > +/*
> > + * Calculate number of records in an rmap btree block.
> > + */
> > +int
> > +xfs_rmapbt_maxrecs(
> > +	struct xfs_mount	*mp,
> > +	int			blocklen,
> > +	int			leaf)
> > +{
> > +	blocklen -= XFS_RMAP_BLOCK_LEN;
> > +
> > +	if (leaf)
> > +		return blocklen / sizeof(struct xfs_rmap_rec);
> > +	return blocklen /
> > +		(sizeof(struct xfs_rmap_key) + sizeof(xfs_rmap_ptr_t));
> > +}
> > +
> > +/* Compute the maximum height of an rmap btree. */
> > +void
> > +xfs_rmapbt_compute_maxlevels(
> > +	struct xfs_mount		*mp)
> > +{
> > +	mp->m_rmap_maxlevels = xfs_btree_compute_maxlevels(mp,
> > +			mp->m_rmap_mnr, mp->m_sb.sb_agblocks);
> > +}
> > diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
> > index a3b8f90..462767f 100644
> > --- a/fs/xfs/libxfs/xfs_rmap_btree.h
> > +++ b/fs/xfs/libxfs/xfs_rmap_btree.h
> > @@ -19,6 +19,38 @@
> >  #define	__XFS_RMAP_BTREE_H__
> >  
> >  struct xfs_buf;
> > +struct xfs_btree_cur;
> > +struct xfs_mount;
> > +
> > +/* rmaps only exist on crc enabled filesystems */
> > +#define XFS_RMAP_BLOCK_LEN	XFS_BTREE_SBLOCK_CRC_LEN
> > +
> > +/*
> > + * Record, key, and pointer address macros for btree blocks.
> > + *
> > + * (note that some of these may appear unused, but they are used in userspace)
> > + */
> > +#define XFS_RMAP_REC_ADDR(block, index) \
> > +	((struct xfs_rmap_rec *) \
> > +		((char *)(block) + XFS_RMAP_BLOCK_LEN + \
> > +		 (((index) - 1) * sizeof(struct xfs_rmap_rec))))
> > +
> > +#define XFS_RMAP_KEY_ADDR(block, index) \
> > +	((struct xfs_rmap_key *) \
> > +		((char *)(block) + XFS_RMAP_BLOCK_LEN + \
> > +		 ((index) - 1) * sizeof(struct xfs_rmap_key)))
> > +
> > +#define XFS_RMAP_PTR_ADDR(block, index, maxrecs) \
> > +	((xfs_rmap_ptr_t *) \
> > +		((char *)(block) + XFS_RMAP_BLOCK_LEN + \
> > +		 (maxrecs) * sizeof(struct xfs_rmap_key) + \
> > +		 ((index) - 1) * sizeof(xfs_rmap_ptr_t)))
> > +
> > +struct xfs_btree_cur *xfs_rmapbt_init_cursor(struct xfs_mount *mp,
> > +				struct xfs_trans *tp, struct xfs_buf *bp,
> > +				xfs_agnumber_t agno);
> > +int xfs_rmapbt_maxrecs(struct xfs_mount *mp, int blocklen, int leaf);
> > +extern void xfs_rmapbt_compute_maxlevels(struct xfs_mount *mp);
> >  
> >  int xfs_rmap_alloc(struct xfs_trans *tp, struct xfs_buf *agbp,
> >  		   xfs_agnumber_t agno, xfs_agblock_t bno, xfs_extlen_t len,
> > diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
> > index a544686..f86226b 100644
> > --- a/fs/xfs/libxfs/xfs_sb.c
> > +++ b/fs/xfs/libxfs/xfs_sb.c
> > @@ -37,6 +37,7 @@
> >  #include "xfs_alloc_btree.h"
> >  #include "xfs_ialloc_btree.h"
> >  #include "xfs_log.h"
> > +#include "xfs_rmap_btree.h"
> >  
> >  /*
> >   * Physical superblock buffer manipulations. Shared with libxfs in userspace.
> > @@ -734,6 +735,11 @@ xfs_sb_mount_common(
> >  	mp->m_bmap_dmnr[0] = mp->m_bmap_dmxr[0] / 2;
> >  	mp->m_bmap_dmnr[1] = mp->m_bmap_dmxr[1] / 2;
> >  
> > +	mp->m_rmap_mxr[0] = xfs_rmapbt_maxrecs(mp, sbp->sb_blocksize, 1);
> > +	mp->m_rmap_mxr[1] = xfs_rmapbt_maxrecs(mp, sbp->sb_blocksize, 0);
> > +	mp->m_rmap_mnr[0] = mp->m_rmap_mxr[0] / 2;
> > +	mp->m_rmap_mnr[1] = mp->m_rmap_mxr[1] / 2;
> > +
> >  	mp->m_bsize = XFS_FSB_TO_BB(mp, 1);
> >  	mp->m_ialloc_inos = (int)MAX((__uint16_t)XFS_INODES_PER_CHUNK,
> >  					sbp->sb_inopblock);
> > diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
> > index 16002b5..0c5b30b 100644
> > --- a/fs/xfs/libxfs/xfs_shared.h
> > +++ b/fs/xfs/libxfs/xfs_shared.h
> > @@ -38,6 +38,7 @@ extern const struct xfs_buf_ops xfs_agi_buf_ops;
> >  extern const struct xfs_buf_ops xfs_agf_buf_ops;
> >  extern const struct xfs_buf_ops xfs_agfl_buf_ops;
> >  extern const struct xfs_buf_ops xfs_allocbt_buf_ops;
> > +extern const struct xfs_buf_ops xfs_rmapbt_buf_ops;
> >  extern const struct xfs_buf_ops xfs_attr3_leaf_buf_ops;
> >  extern const struct xfs_buf_ops xfs_attr3_rmt_buf_ops;
> >  extern const struct xfs_buf_ops xfs_bmbt_buf_ops;
> > @@ -116,6 +117,7 @@ int	xfs_log_calc_minimum_size(struct xfs_mount *);
> >  #define	XFS_INO_BTREE_REF	3
> >  #define	XFS_ALLOC_BTREE_REF	2
> >  #define	XFS_BMAP_BTREE_REF	2
> > +#define	XFS_RMAP_BTREE_REF	2
> >  #define	XFS_DIR_BTREE_REF	2
> >  #define	XFS_INO_REF		2
> >  #define	XFS_ATTR_BTREE_REF	1
> > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> > index b4153f0..8af1c88 100644
> > --- a/fs/xfs/xfs_mount.c
> > +++ b/fs/xfs/xfs_mount.c
> > @@ -42,6 +42,7 @@
> >  #include "xfs_trace.h"
> >  #include "xfs_icache.h"
> >  #include "xfs_sysfs.h"
> > +#include "xfs_rmap_btree.h"
> >  
> >  
> >  static DEFINE_MUTEX(xfs_uuid_table_mutex);
> > @@ -680,6 +681,7 @@ xfs_mountfs(
> >  	xfs_bmap_compute_maxlevels(mp, XFS_DATA_FORK);
> >  	xfs_bmap_compute_maxlevels(mp, XFS_ATTR_FORK);
> >  	xfs_ialloc_compute_maxlevels(mp);
> > +	xfs_rmapbt_compute_maxlevels(mp);
> >  
> >  	xfs_set_maxicount(mp);
> >  
> > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> > index 0537b1f..0ed0f29 100644
> > --- a/fs/xfs/xfs_mount.h
> > +++ b/fs/xfs/xfs_mount.h
> > @@ -116,9 +116,12 @@ typedef struct xfs_mount {
> >  	uint			m_bmap_dmnr[2];	/* min bmap btree records */
> >  	uint			m_inobt_mxr[2];	/* max inobt btree records */
> >  	uint			m_inobt_mnr[2];	/* min inobt btree records */
> > +	uint			m_rmap_mxr[2];	/* max rmap btree records */
> > +	uint			m_rmap_mnr[2];	/* min rmap btree records */
> >  	uint			m_ag_maxlevels;	/* XFS_AG_MAXLEVELS */
> >  	uint			m_bm_maxlevels[2]; /* XFS_BM_MAXLEVELS */
> >  	uint			m_in_maxlevels;	/* max inobt btree levels. */
> > +	uint			m_rmap_maxlevels; /* max rmap btree levels */
> >  	xfs_extlen_t		m_ag_prealloc_blocks; /* reserved ag blocks */
> >  	struct radix_tree_root	m_perag_tree;	/* per-ag accounting info */
> >  	spinlock_t		m_perag_lock;	/* lock for m_perag_tree */
> > diff --git a/fs/xfs/xfs_ondisk.h b/fs/xfs/xfs_ondisk.h
> > index 0272301..48d544f 100644
> > --- a/fs/xfs/xfs_ondisk.h
> > +++ b/fs/xfs/xfs_ondisk.h
> > @@ -47,11 +47,14 @@ xfs_check_ondisk_structs(void)
> >  	XFS_CHECK_STRUCT_SIZE(struct xfs_dsymlink_hdr,		56);
> >  	XFS_CHECK_STRUCT_SIZE(struct xfs_inobt_key,		4);
> >  	XFS_CHECK_STRUCT_SIZE(struct xfs_inobt_rec,		16);
> > +	XFS_CHECK_STRUCT_SIZE(struct xfs_rmap_key,		20);
> > +	XFS_CHECK_STRUCT_SIZE(struct xfs_rmap_rec,		24);
> >  	XFS_CHECK_STRUCT_SIZE(struct xfs_timestamp,		8);
> >  	XFS_CHECK_STRUCT_SIZE(xfs_alloc_key_t,			8);
> >  	XFS_CHECK_STRUCT_SIZE(xfs_alloc_ptr_t,			4);
> >  	XFS_CHECK_STRUCT_SIZE(xfs_alloc_rec_t,			8);
> >  	XFS_CHECK_STRUCT_SIZE(xfs_inobt_ptr_t,			4);
> > +	XFS_CHECK_STRUCT_SIZE(xfs_rmap_ptr_t,			4);
> >  
> >  	/* dir/attr trees */
> >  	XFS_CHECK_STRUCT_SIZE(struct xfs_attr3_leaf_hdr,	80);
> > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > index 4872fbd..b4ee9c8 100644
> > --- a/fs/xfs/xfs_trace.h
> > +++ b/fs/xfs/xfs_trace.h
> > @@ -2444,6 +2444,8 @@ DECLARE_EVENT_CLASS(xfs_rmap_class,
> >  		__entry->owner = oinfo->oi_owner;
> >  		__entry->offset = oinfo->oi_offset;
> >  		__entry->flags = oinfo->oi_flags;
> > +		if (unwritten)
> > +			__entry->flags |= XFS_RMAP_UNWRITTEN;
> >  	),
> >  	TP_printk("dev %d:%d agno %u agbno %u len %u owner %lld offset %llu flags 0x%lx",
> >  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > 
> > _______________________________________________
> > xfs mailing list
> > xfs@oss.sgi.com
> > http://oss.sgi.com/mailman/listinfo/xfs
> 

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 026/119] xfs: add owner field to extent allocation and freeing
  2016-07-07 19:09     ` Darrick J. Wong
@ 2016-07-07 22:55       ` Dave Chinner
  2016-07-08 11:37       ` Brian Foster
  1 sibling, 0 replies; 236+ messages in thread
From: Dave Chinner @ 2016-07-07 22:55 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Brian Foster, linux-fsdevel, vishal.l.verma, xfs, Dave Chinner

On Thu, Jul 07, 2016 at 12:09:56PM -0700, Darrick J. Wong wrote:
> On Thu, Jul 07, 2016 at 11:12:27AM -0400, Brian Foster wrote:
> > On Thu, Jun 16, 2016 at 06:20:39PM -0700, Darrick J. Wong wrote:
> > > For the rmap btree to work, we have to feed the extent owner
> > > information to the allocation and freeing functions. This
> > > information is what will end up in the rmap btree that tracks
> > > allocated extents. While we technically don't need the owner
> > > information when freeing extents, passing it allows us to validate
> > > that the extent we are removing from the rmap btree actually
> > > belonged to the owner we expected it to belong to.
....
> > > +/*
> > > + * Special owner types.
> > > + *
> > > + * Seeing as we only support up to 8EB, we have the upper bit of the owner field
> > > + * to tell us we have a special owner value. We use these for static metadata
> > > + * allocated at mkfs/growfs time, as well as for freespace management metadata.
> > > + */
> > > +#define XFS_RMAP_OWN_NULL	(-1ULL)	/* No owner, for growfs */
> > > +#define XFS_RMAP_OWN_UNKNOWN	(-2ULL)	/* Unknown owner, for EFI recovery */
> > > +#define XFS_RMAP_OWN_FS		(-3ULL)	/* static fs metadata */
> > > +#define XFS_RMAP_OWN_LOG	(-4ULL)	/* static fs metadata */
> > > +#define XFS_RMAP_OWN_AG		(-5ULL)	/* AG freespace btree blocks */
> > 
> > How about XFS_RMAP_OWN_AGFL? OWN_AG confuses me into thinking it's for
> > AG headers, but IIUC that is covered by OWN_FS.

AG headers are static metadata, laid down by mkfs. They are always
owned by the filesystem, hence the "OWN_FS" name.

> or _SPACEBT for AG {free,rmap} space btrees?

IIRC, the reason I simply named them "Owned by the AG" is that
the space tracking btree blocks are always considered free space. They can
move between the freespace trees and the AGFL without consuming free
space, and it's not trivial to separate their classification into
anything other than "blocks used by the AG but are free space". For
example, in the middle of a transaction that allocates and frees
blocks, the same block can move like this:

 bnobt block -> AGFL -> cntbt block -> AGFL -> rmapbt block

Hence blocks on the AGFL are considered to be the same as
bno/cnt/rmapbt blocks for the purpose of owner identification.
Otherwise we'd have to modify the rmapbt every time we move a block
to/from the AGFL, and that then leads to recursion problems and lots
of unnecessary overhead...

Feel free to change the names, but I don't think we can change owner
classifications of the blocks they represent...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 028/119] xfs: define the on-disk rmap btree format
  2016-07-07 19:18     ` Darrick J. Wong
@ 2016-07-07 23:14       ` Dave Chinner
  2016-07-07 23:58         ` Darrick J. Wong
  0 siblings, 1 reply; 236+ messages in thread
From: Dave Chinner @ 2016-07-07 23:14 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Brian Foster, linux-fsdevel, vishal.l.verma, Dave Chinner, xfs

On Thu, Jul 07, 2016 at 12:18:13PM -0700, Darrick J. Wong wrote:
> On Thu, Jul 07, 2016 at 02:41:56PM -0400, Brian Foster wrote:
> > > +	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
> > > +		return false;
> > > +	if (!xfs_btree_sblock_v5hdr_verify(bp))
> > > +		return false;
> > > +
> > > +	level = be16_to_cpu(block->bb_level);
> > > +	if (pag && pag->pagf_init) {
> > > +		if (level >= pag->pagf_levels[XFS_BTNUM_RMAPi])
> > > +			return false;
> > > +	} else if (level >= mp->m_rmap_maxlevels)
> > > +		return false;
> > 
> > It looks like the above (level >= mp->m_rmap_maxlevels) check could be
> > independent (rather than an 'else'). Otherwise looks good:
> 
> Hmmm.... at first I wondered, "Shouldn't we have already checked that
> pag->pagf_levels[XFS_BTNUM_RMAPi] <= mp->m_rmap_maxlevels?"  But then I
> realized that no, we don't do that anywhere.  Nor does the bnobt/cntbt
> verifier.  Am I missing something?

It should have been range checked when the AGF is first read in
(i.e. in the verifier), in ASSERTS every time xfs_alloc_read_agf()
is called after initialisation, and then every time the verifier is
run on write of the AGF.

> I did see that we at least check the AGF/AGI levels to make sure they don't
> overflow XFS_BTREE_MAXLEVELS, so we're probably fine here.

Precisely - if the AGF verifier doesn't have a max level check in it
for the rmapbt, then we need to add one there.
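Distilled down, the verifier logic under discussion is roughly this
(names and the bound are stand-ins, not the kernel interfaces; this
is just the shape of the check, not the actual verifier):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of the level check: when the per-AG info has been initialised,
 * the block's recorded level must lie strictly below the tree height
 * recorded in the AGF; otherwise fall back to the computed worst-case
 * height for the filesystem geometry.
 */
static bool rmapbt_block_level_ok(unsigned int level, bool pagf_init,
				  unsigned int pagf_level,
				  unsigned int rmap_maxlevels)
{
	if (pagf_init)
		return level < pagf_level;
	return level < rmap_maxlevels;
}
```

The separate range check on the AGF's own level field (against the
global maximum) is what makes it safe for the block verifier to trust
pagf_level here.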

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 028/119] xfs: define the on-disk rmap btree format
  2016-07-07 23:14       ` Dave Chinner
@ 2016-07-07 23:58         ` Darrick J. Wong
  0 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-07-07 23:58 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Brian Foster, linux-fsdevel, vishal.l.verma, Dave Chinner, xfs

On Fri, Jul 08, 2016 at 09:14:55AM +1000, Dave Chinner wrote:
> On Thu, Jul 07, 2016 at 12:18:13PM -0700, Darrick J. Wong wrote:
> > On Thu, Jul 07, 2016 at 02:41:56PM -0400, Brian Foster wrote:
> > > > +	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
> > > > +		return false;
> > > > +	if (!xfs_btree_sblock_v5hdr_verify(bp))
> > > > +		return false;
> > > > +
> > > > +	level = be16_to_cpu(block->bb_level);
> > > > +	if (pag && pag->pagf_init) {
> > > > +		if (level >= pag->pagf_levels[XFS_BTNUM_RMAPi])
> > > > +			return false;
> > > > +	} else if (level >= mp->m_rmap_maxlevels)
> > > > +		return false;
> > > 
> > > It looks like the above (level >= mp->m_rmap_maxlevels) check could be
> > > independent (rather than an 'else'). Otherwise looks good:
> > 
> > Hmmm.... at first I wondered, "Shouldn't we have already checked that
> > pag->pagf_levels[XFS_BTNUM_RMAPi] <= mp->m_rmap_maxlevels?"  But then I
> > realized that no, we don't do that anywhere.  Nor does the bnobt/cntbt
> > verifier.  Am I missing something?

(Yes, I am.)

> It should have been range checked when the AGF is first read in
> (i.e. in the verifier), in ASSERTS every time xfs_alloc_read_agf()
> is called after initialisation, and then every time the verifier is
> run on write of the AGF.

You're right.  I missed that. :(

> > I did see that we at least check the AGF/AGI levels to make sure they don't
> > overflow XFS_BTREE_MAXLEVELS, so we're probably fine here.
> 
> Precisely - if the AGF verifier doesn't have a max level check in it
> for the rmapbt, then we need to add one there.

There's a check there, so we're fine.

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 026/119] xfs: add owner field to extent allocation and freeing
  2016-07-07 19:09     ` Darrick J. Wong
  2016-07-07 22:55       ` Dave Chinner
@ 2016-07-08 11:37       ` Brian Foster
  1 sibling, 0 replies; 236+ messages in thread
From: Brian Foster @ 2016-07-08 11:37 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, vishal.l.verma, Dave Chinner, xfs

On Thu, Jul 07, 2016 at 12:09:56PM -0700, Darrick J. Wong wrote:
> On Thu, Jul 07, 2016 at 11:12:27AM -0400, Brian Foster wrote:
> > On Thu, Jun 16, 2016 at 06:20:39PM -0700, Darrick J. Wong wrote:
> > > For the rmap btree to work, we have to feed the extent owner
> > > information to the allocation and freeing functions. This
> > > information is what will end up in the rmap btree that tracks
> > > allocated extents. While we technically don't need the owner
> > > information when freeing extents, passing it allows us to validate
> > > that the extent we are removing from the rmap btree actually
> > > belonged to the owner we expected it to belong to.
> > > 
> > > We also define a special set of owner values for internal metadata
> > > that would otherwise have no owner. This allows us to tell the
> > > difference between metadata owned by different per-ag btrees, as
> > > well as static fs metadata (e.g. AG headers) and internal journal
> > > blocks.
> > > 
> > > There are also a couple of special cases we need to take care of -
> > > during EFI recovery, we don't actually know who the original owner
> > > was, so we need to pass a wildcard to indicate that we aren't
> > > checking the owner for validity. We also need special handling in
> > > growfs, as we "free" the space in the last AG when extending it, but
> > > because it's new space it has no actual owner...
> > > 
> > > While touching the xfs_bmap_add_free() function, re-order the
> > > parameters to put the struct xfs_mount first.
> > > 
> > > Extend the owner field to include both the owner type and some sort
> > > of index within the owner.  The index field will be used to support
> > > reverse mappings when reflink is enabled.
> > > 
> > > This is based upon a patch originally from Dave Chinner. It has been
> > > extended to add more owner information with the intent of helping
> > > recovery operations when things go wrong (e.g. offset of user data
> > > block in a file).
> > > 
> > > v2: When we're freeing extents from an EFI, we don't have the owner
> > > information available (rmap updates have their own redo items).
> > > xfs_free_extent therefore doesn't need to do an rmap update, but the
> > > log replay code doesn't signal this correctly.  Fix it so that it
> > > does.
> > > 
> > > [dchinner: de-shout the xfs_rmap_*_owner helpers]
> > > [darrick: minor style fixes suggested by Christoph Hellwig]
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > Reviewed-by: Dave Chinner <dchinner@redhat.com>
> > > Signed-off-by: Dave Chinner <david@fromorbit.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_alloc.c        |   11 +++++-
> > >  fs/xfs/libxfs/xfs_alloc.h        |    4 ++
> > >  fs/xfs/libxfs/xfs_bmap.c         |   17 ++++++++--
> > >  fs/xfs/libxfs/xfs_bmap.h         |    4 ++
> > >  fs/xfs/libxfs/xfs_bmap_btree.c   |    6 +++-
> > >  fs/xfs/libxfs/xfs_format.h       |   65 ++++++++++++++++++++++++++++++++++++++
> > >  fs/xfs/libxfs/xfs_ialloc.c       |    7 +++-
> > >  fs/xfs/libxfs/xfs_ialloc_btree.c |    7 ++++
> > >  fs/xfs/xfs_defer_item.c          |    3 +-
> > >  fs/xfs/xfs_fsops.c               |   16 +++++++--
> > >  fs/xfs/xfs_log_recover.c         |    5 ++-
> > >  fs/xfs/xfs_trans.h               |    2 +
> > >  fs/xfs/xfs_trans_extfree.c       |    5 ++-
> > >  13 files changed, 131 insertions(+), 21 deletions(-)
> > > 
> > > 
...
> > > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > > index 3a6d3e3..2c28f2a 100644
> > > --- a/fs/xfs/libxfs/xfs_bmap.c
> > > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > > @@ -574,7 +574,8 @@ xfs_bmap_add_free(
> > >  	struct xfs_mount	*mp,		/* mount point structure */
> > >  	struct xfs_defer_ops	*dfops,		/* list of extents */
> > >  	xfs_fsblock_t		bno,		/* fs block number of extent */
> > > -	xfs_filblks_t		len)		/* length of extent */
> > > +	xfs_filblks_t		len,		/* length of extent */
> > > +	struct xfs_owner_info	*oinfo)		/* extent owner */
> > >  {
> > >  	struct xfs_bmap_free_item	*new;		/* new element */
> > >  #ifdef DEBUG
> > > @@ -593,9 +594,14 @@ xfs_bmap_add_free(
> > >  	ASSERT(agbno + len <= mp->m_sb.sb_agblocks);
> > >  #endif
> > >  	ASSERT(xfs_bmap_free_item_zone != NULL);
> > > +
> > >  	new = kmem_zone_alloc(xfs_bmap_free_item_zone, KM_SLEEP);
> > >  	new->xbfi_startblock = bno;
> > >  	new->xbfi_blockcount = (xfs_extlen_t)len;
> > > +	if (oinfo)
> > > +		memcpy(&new->xbfi_oinfo, oinfo, sizeof(struct xfs_owner_info));
> > > +	else
> > > +		memset(&new->xbfi_oinfo, 0, sizeof(struct xfs_owner_info));
> > 
> > How about just using KM_ZERO on the allocation and doing something like
> > 'if (oinfo) new->xbfi_oinfo = *oinfo'?
> > 
> > BTW, what's the use case for a zeroed out oinfo if we explicitly define
> > null/unknown owner types?
> 
> The two main ways we end up altering the rmapbt are as follows:
> 
> 1) Alloc/free of AG metadata blocks.  For this use case, the caller (generally
> a btree ->alloc_block function) bundles the bnobt and rmapbt updates in the
> same transaction by passing ownership info (via this oinfo pointer) to the
> alloc/free function.  Passing the "special" owner value XFS_RMAP_OWN_NULL just
> checks that there are no rmaps for the given range, which is a spot check
> performed by growfs.
> 
> 2) Map/unmap of file blocks.  For this use case, I must treat map/unmap
> separately from alloc/free in order to handle reflink.  Therefore, the map &
> unmap functions schedule rmap updates directly (via the deferred ops mechanism)
> and the alloc/free functions, if they're called, should not update the rmapbt.
> Zeroing out the oinfo indicates this.  However, XFS_RMAP_OWN_UNKNOWN is now
> unused, so I think I can overload that, especially since we should never be
> writing XFS_RMAP_OWN_UNKNOWN to disk.
> 
> I think I can simply create an "xfs_rmap_skip_owner_update()" helper (like the
> other xfs_rmap_*_owner functions) to encapsulate this.
> 
> if (oinfo)
> 	new->xbfi_oinfo = *oinfo;
> else
> 	xfs_rmap_skip_owner_update(&new->xbfi_oinfo);
> 
> Seems clearer, I hope?
> 

Ok, yup. Thanks for the explanation.
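To make that concrete, the helper described above might be sketched
like so (self-contained: the struct mirrors the patch's
xfs_owner_info with stand-in typedefs, and the helper body is just my
reading of the proposal, not final code):

```c
#include <assert.h>
#include <stdint.h>

#define XFS_RMAP_OWN_UNKNOWN	(-2ULL)	/* Unknown owner, for EFI recovery */

/* xfs_fileoff_t replaced by a plain uint64_t for this sketch */
struct xfs_owner_info {
	uint64_t	oi_owner;
	uint64_t	oi_offset;
	unsigned int	oi_flags;
};

static inline void
xfs_rmap_ag_owner(struct xfs_owner_info *oi, uint64_t owner)
{
	oi->oi_owner = owner;
	oi->oi_offset = 0;
	oi->oi_flags = 0;
}

/*
 * Proposed helper: overload OWN_UNKNOWN (which is never written to
 * disk) to mean "don't update the rmapbt on this alloc/free".
 */
static inline void
xfs_rmap_skip_owner_update(struct xfs_owner_info *oi)
{
	xfs_rmap_ag_owner(oi, XFS_RMAP_OWN_UNKNOWN);
}
```

Callers like xfs_bmap_add_free() would then do
`xfs_rmap_skip_owner_update(&new->xbfi_oinfo)` instead of memset()ing
the oinfo, which documents the intent at the call site.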

> Also, the "Special Case #2: EFIs do not record the owner of the extent, so
> when" comment is now wrong and needs to be changed.
> 
> "Special Case #2: An owner of XFS_RMAP_OWN_UNKNOWN means 'no rmap update'".
> 
> > >  	trace_xfs_bmap_free_defer(mp, XFS_FSB_TO_AGNO(mp, bno), 0,
> > >  			XFS_FSB_TO_AGBNO(mp, bno), len);
> > >  	xfs_defer_add(dfops, XFS_DEFER_OPS_TYPE_FREE, &new->xbfi_list);
...
> > > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > > index b5b0901..97f354f 100644
> > > --- a/fs/xfs/libxfs/xfs_format.h
> > > +++ b/fs/xfs/libxfs/xfs_format.h
> > > @@ -1318,6 +1318,71 @@ typedef __be32 xfs_inobt_ptr_t;
> > >   */
> > >  #define	XFS_RMAP_CRC_MAGIC	0x524d4233	/* 'RMB3' */
> > >  
> > > +/*
> > > + * Ownership info for an extent.  This is used to create reverse-mapping
> > > + * entries.
> > > + */
> > > +#define XFS_OWNER_INFO_ATTR_FORK	(1 << 0)
> > > +#define XFS_OWNER_INFO_BMBT_BLOCK	(1 << 1)
> > > +struct xfs_owner_info {
> > > +	uint64_t		oi_owner;
> > > +	xfs_fileoff_t		oi_offset;
> > > +	unsigned int		oi_flags;
> > > +};
> > > +
> > > +static inline void
> > > +xfs_rmap_ag_owner(
> > > +	struct xfs_owner_info	*oi,
> > > +	uint64_t		owner)
> > > +{
> > > +	oi->oi_owner = owner;
> > > +	oi->oi_offset = 0;
> > > +	oi->oi_flags = 0;
> > > +}
> > > +
> > > +static inline void
> > > +xfs_rmap_ino_bmbt_owner(
> > > +	struct xfs_owner_info	*oi,
> > > +	xfs_ino_t		ino,
> > > +	int			whichfork)
> > > +{
> > > +	oi->oi_owner = ino;
> > > +	oi->oi_offset = 0;
> > > +	oi->oi_flags = XFS_OWNER_INFO_BMBT_BLOCK;
> > > +	if (whichfork == XFS_ATTR_FORK)
> > > +		oi->oi_flags |= XFS_OWNER_INFO_ATTR_FORK;
> > > +}
> > > +
> > > +static inline void
> > > +xfs_rmap_ino_owner(
> > > +	struct xfs_owner_info	*oi,
> > > +	xfs_ino_t		ino,
> > > +	int			whichfork,
> > > +	xfs_fileoff_t		offset)
> > > +{
> > > +	oi->oi_owner = ino;
> > > +	oi->oi_offset = offset;
> > > +	oi->oi_flags = 0;
> > > +	if (whichfork == XFS_ATTR_FORK)
> > > +		oi->oi_flags |= XFS_OWNER_INFO_ATTR_FORK;
> > > +}
> > > +
> > > +/*
> > > + * Special owner types.
> > > + *
> > > + * Seeing as we only support up to 8EB, we have the upper bit of the owner field
> > > + * to tell us we have a special owner value. We use these for static metadata
> > > + * allocated at mkfs/growfs time, as well as for freespace management metadata.
> > > + */
> > > +#define XFS_RMAP_OWN_NULL	(-1ULL)	/* No owner, for growfs */
> > > +#define XFS_RMAP_OWN_UNKNOWN	(-2ULL)	/* Unknown owner, for EFI recovery */
> > > +#define XFS_RMAP_OWN_FS		(-3ULL)	/* static fs metadata */
> > > +#define XFS_RMAP_OWN_LOG	(-4ULL)	/* static fs metadata */
> > > +#define XFS_RMAP_OWN_AG		(-5ULL)	/* AG freespace btree blocks */
> > 
> > How about XFS_RMAP_OWN_AGFL? OWN_AG confuses me into thinking it's for
> > AG headers, but IIUC that is covered by OWN_FS.
> 
> or _SPACEBT for AG {free,rmap} space btrees?
> 

I was thinking that this type only represented free list blocks and that
the mapping would be updated when the block was actually allocated to a
btree. As Dave points out in his followup response, that is not the
case. OWN_AG actually makes more sense to me in that light, so feel free
to disregard this comment.

Brian

> > > +#define XFS_RMAP_OWN_INOBT	(-6ULL)	/* Inode btree blocks */
> > > +#define XFS_RMAP_OWN_INODES	(-7ULL)	/* Inode chunk */
> > > +#define XFS_RMAP_OWN_MIN	(-8ULL) /* guard */
> > > +
> > >  #define	XFS_RMAP_BLOCK(mp) \
> > >  	(xfs_sb_version_hasfinobt(&((mp)->m_sb)) ? \
> > >  	 XFS_FIBT_BLOCK(mp) + 1 : \
> > > diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
> > > index dbc3e35..1982561 100644
> > > --- a/fs/xfs/libxfs/xfs_ialloc.c
> > > +++ b/fs/xfs/libxfs/xfs_ialloc.c
> > > @@ -615,6 +615,7 @@ xfs_ialloc_ag_alloc(
> > >  	args.tp = tp;
> > >  	args.mp = tp->t_mountp;
> > >  	args.fsbno = NULLFSBLOCK;
> > > +	xfs_rmap_ag_owner(&args.oinfo, XFS_RMAP_OWN_INODES);
> > >  
> > >  #ifdef DEBUG
> > >  	/* randomly do sparse inode allocations */
> > > @@ -1825,12 +1826,14 @@ xfs_difree_inode_chunk(
> > >  	int		nextbit;
> > >  	xfs_agblock_t	agbno;
> > >  	int		contigblk;
> > > +	struct xfs_owner_info	oinfo;
> > >  	DECLARE_BITMAP(holemask, XFS_INOBT_HOLEMASK_BITS);
> > > +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_INODES);
> > >  
> > >  	if (!xfs_inobt_issparse(rec->ir_holemask)) {
> > >  		/* not sparse, calculate extent info directly */
> > >  		xfs_bmap_add_free(mp, dfops, XFS_AGB_TO_FSB(mp, agno, sagbno),
> > > -				  mp->m_ialloc_blks);
> > > +				  mp->m_ialloc_blks, &oinfo);
> > >  		return;
> > >  	}
> > >  
> > > @@ -1874,7 +1877,7 @@ xfs_difree_inode_chunk(
> > >  		ASSERT(agbno % mp->m_sb.sb_spino_align == 0);
> > >  		ASSERT(contigblk % mp->m_sb.sb_spino_align == 0);
> > >  		xfs_bmap_add_free(mp, dfops, XFS_AGB_TO_FSB(mp, agno, agbno),
> > > -				  contigblk);
> > > +				  contigblk, &oinfo);
> > >  
> > >  		/* reset range to current bit and carry on... */
> > >  		startidx = endidx = nextbit;
> > > diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
> > > index 88da2ad..f9ea86b 100644
> > > --- a/fs/xfs/libxfs/xfs_ialloc_btree.c
> > > +++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
> > > @@ -96,6 +96,7 @@ xfs_inobt_alloc_block(
> > >  	memset(&args, 0, sizeof(args));
> > >  	args.tp = cur->bc_tp;
> > >  	args.mp = cur->bc_mp;
> > > +	xfs_rmap_ag_owner(&args.oinfo, XFS_RMAP_OWN_INOBT);
> > >  	args.fsbno = XFS_AGB_TO_FSB(args.mp, cur->bc_private.a.agno, sbno);
> > >  	args.minlen = 1;
> > >  	args.maxlen = 1;
> > > @@ -125,8 +126,12 @@ xfs_inobt_free_block(
> > >  	struct xfs_btree_cur	*cur,
> > >  	struct xfs_buf		*bp)
> > >  {
> > > +	struct xfs_owner_info	oinfo;
> > > +
> > > +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_INOBT);
> > >  	return xfs_free_extent(cur->bc_tp,
> > > -			XFS_DADDR_TO_FSB(cur->bc_mp, XFS_BUF_ADDR(bp)), 1);
> > > +			XFS_DADDR_TO_FSB(cur->bc_mp, XFS_BUF_ADDR(bp)), 1,
> > > +			&oinfo);
> > >  }
> > >  
> > >  STATIC int
> > > diff --git a/fs/xfs/xfs_defer_item.c b/fs/xfs/xfs_defer_item.c
> > > index 127a54e..1c2d556 100644
> > > --- a/fs/xfs/xfs_defer_item.c
> > > +++ b/fs/xfs/xfs_defer_item.c
> > > @@ -99,7 +99,8 @@ xfs_bmap_free_finish_item(
> > >  	free = container_of(item, struct xfs_bmap_free_item, xbfi_list);
> > >  	error = xfs_trans_free_extent(tp, done_item,
> > >  			free->xbfi_startblock,
> > > -			free->xbfi_blockcount);
> > > +			free->xbfi_blockcount,
> > > +			&free->xbfi_oinfo);
> > >  	kmem_free(free);
> > >  	return error;
> > >  }
> > > diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> > > index 62162d4..d60bb97 100644
> > > --- a/fs/xfs/xfs_fsops.c
> > > +++ b/fs/xfs/xfs_fsops.c
> > > @@ -436,6 +436,8 @@ xfs_growfs_data_private(
> > >  	 * There are new blocks in the old last a.g.
> > >  	 */
> > >  	if (new) {
> > > +		struct xfs_owner_info	oinfo;
> > > +
> > >  		/*
> > >  		 * Change the agi length.
> > >  		 */
> > > @@ -463,14 +465,20 @@ xfs_growfs_data_private(
> > >  		       be32_to_cpu(agi->agi_length));
> > >  
> > >  		xfs_alloc_log_agf(tp, bp, XFS_AGF_LENGTH);
> > > +
> > >  		/*
> > >  		 * Free the new space.
> > > +		 *
> > > +		 * XFS_RMAP_OWN_NULL is used here to tell the rmap btree that
> > > +		 * this doesn't actually exist in the rmap btree.
> > >  		 */
> > > -		error = xfs_free_extent(tp, XFS_AGB_TO_FSB(mp, agno,
> > > -			be32_to_cpu(agf->agf_length) - new), new);
> > > -		if (error) {
> > > +		xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_NULL);
> > > +		error = xfs_free_extent(tp,
> > > +				XFS_AGB_TO_FSB(mp, agno,
> > > +					be32_to_cpu(agf->agf_length) - new),
> > > +				new, &oinfo);
> > > +		if (error)
> > >  			goto error0;
> > > -		}
> > >  	}
> > >  
> > >  	/*
> > > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> > > index 080b54b..0c41bd2 100644
> > > --- a/fs/xfs/xfs_log_recover.c
> > > +++ b/fs/xfs/xfs_log_recover.c
> > > @@ -4180,6 +4180,7 @@ xlog_recover_process_efi(
> > >  	int			error = 0;
> > >  	xfs_extent_t		*extp;
> > >  	xfs_fsblock_t		startblock_fsb;
> > > +	struct xfs_owner_info	oinfo;
> > >  
> > >  	ASSERT(!test_bit(XFS_EFI_RECOVERED, &efip->efi_flags));
> > >  
> > > @@ -4211,10 +4212,12 @@ xlog_recover_process_efi(
> > >  		return error;
> > >  	efdp = xfs_trans_get_efd(tp, efip, efip->efi_format.efi_nextents);
> > >  
> > > +	oinfo.oi_owner = 0;
> > 
> > Should this be XFS_RMAP_OWN_UNKNOWN?
> 
> xfs_rmap_skip_owner_update(), but yes.
> 
> --D
> 
> > 
> > Brian
> > 
> > >  	for (i = 0; i < efip->efi_format.efi_nextents; i++) {
> > >  		extp = &(efip->efi_format.efi_extents[i]);
> > >  		error = xfs_trans_free_extent(tp, efdp, extp->ext_start,
> > > -					      extp->ext_len);
> > > +					      extp->ext_len,
> > > +					      &oinfo);
> > >  		if (error)
> > >  			goto abort_error;
> > >  
> > > diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> > > index 9a462e8..f8d363f 100644
> > > --- a/fs/xfs/xfs_trans.h
> > > +++ b/fs/xfs/xfs_trans.h
> > > @@ -219,7 +219,7 @@ struct xfs_efd_log_item	*xfs_trans_get_efd(xfs_trans_t *,
> > >  				  uint);
> > >  int		xfs_trans_free_extent(struct xfs_trans *,
> > >  				      struct xfs_efd_log_item *, xfs_fsblock_t,
> > > -				      xfs_extlen_t);
> > > +				      xfs_extlen_t, struct xfs_owner_info *);
> > >  int		xfs_trans_commit(struct xfs_trans *);
> > >  int		__xfs_trans_roll(struct xfs_trans **, struct xfs_inode *, int *);
> > >  int		xfs_trans_roll(struct xfs_trans **, struct xfs_inode *);
> > > diff --git a/fs/xfs/xfs_trans_extfree.c b/fs/xfs/xfs_trans_extfree.c
> > > index a96ae54..d1b8833 100644
> > > --- a/fs/xfs/xfs_trans_extfree.c
> > > +++ b/fs/xfs/xfs_trans_extfree.c
> > > @@ -118,13 +118,14 @@ xfs_trans_free_extent(
> > >  	struct xfs_trans	*tp,
> > >  	struct xfs_efd_log_item	*efdp,
> > >  	xfs_fsblock_t		start_block,
> > > -	xfs_extlen_t		ext_len)
> > > +	xfs_extlen_t		ext_len,
> > > +	struct xfs_owner_info	*oinfo)
> > >  {
> > >  	uint			next_extent;
> > >  	struct xfs_extent	*extp;
> > >  	int			error;
> > >  
> > > -	error = xfs_free_extent(tp, start_block, ext_len);
> > > +	error = xfs_free_extent(tp, start_block, ext_len, oinfo);
> > >  
> > >  	/*
> > >  	 * Mark the transaction dirty, even on error. This ensures the
> > > 
> > 
> 

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 030/119] xfs: rmap btree transaction reservations
  2016-06-17  1:21 ` [PATCH 030/119] xfs: rmap btree transaction reservations Darrick J. Wong
@ 2016-07-08 13:21   ` Brian Foster
  0 siblings, 0 replies; 236+ messages in thread
From: Brian Foster @ 2016-07-08 13:21 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, vishal.l.verma, Dave Chinner, xfs

On Thu, Jun 16, 2016 at 06:21:04PM -0700, Darrick J. Wong wrote:
> The rmap btrees will use the AGFL as the block allocation source, so
> we need to ensure that the transaction reservations reflect the fact
> this tree is modified by allocation and freeing. Hence we need to
> extend all the extent allocation/free reservations used in
> transactions to handle this.
> 
> Note that this also gets rid of the unused XFS_ALLOCFREE_LOG_RES
> macro, as we now do buffer reservations based on the number of
> buffers logged via xfs_calc_buf_res(). Hence we only need the buffer
> count calculation now.
> 
> [darrick: use rmap_maxlevels when calculating log block resv]
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Signed-off-by: Dave Chinner <david@fromorbit.com>
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/libxfs/xfs_trans_resv.c |   58 ++++++++++++++++++++++++++++------------
>  fs/xfs/libxfs/xfs_trans_resv.h |   10 -------
>  2 files changed, 41 insertions(+), 27 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
> index 4c7eb9d..301ef2f 100644
> --- a/fs/xfs/libxfs/xfs_trans_resv.c
> +++ b/fs/xfs/libxfs/xfs_trans_resv.c
> @@ -64,6 +64,30 @@ xfs_calc_buf_res(
>  }
>  
>  /*
> + * Per-extent log reservation for the btree changes involved in freeing or
> + * allocating an extent.  In classic XFS there are two trees that will be
> + * modified (bnobt + cntbt).  With rmap enabled, there is a third tree
> + * (the rmapbt).  The number of blocks reserved is based on the formula:
> + *
> + * num trees * ((2 blocks/level * max depth) - 1)
> + *
> + * Keep in mind that max depth is calculated separately for each type of tree.
> + */
> +static uint
> +xfs_allocfree_log_count(
> +	struct xfs_mount *mp,
> +	uint		num_ops)
> +{
> +	uint		blocks;
> +
> +	blocks = num_ops * 2 * (2 * mp->m_ag_maxlevels - 1);
> +	if (xfs_sb_version_hasrmapbt(&mp->m_sb))
> +		blocks += num_ops * (2 * mp->m_rmap_maxlevels - 1);
> +
> +	return blocks;
> +}
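As a quick standalone sanity check of that formula (tree heights here
are made up for illustration, and the function is a userspace re-typing
of the hunk above, not the kernel code):

```c
#include <assert.h>

/*
 * Each of the two freespace trees contributes (2 * max depth - 1)
 * blocks per operation; with rmap enabled the rmapbt adds the same
 * term computed from its own max depth.
 */
static unsigned int allocfree_log_count(unsigned int num_ops,
					unsigned int ag_maxlevels,
					unsigned int rmap_maxlevels,
					int has_rmapbt)
{
	unsigned int blocks;

	blocks = num_ops * 2 * (2 * ag_maxlevels - 1);	/* bnobt + cntbt */
	if (has_rmapbt)
		blocks += num_ops * (2 * rmap_maxlevels - 1);
	return blocks;
}
```

So for (hypothetical) 5-level trees, one op reserves 18 blocks without
rmap and 27 with it, i.e. the reservation grows by half when the third
tree comes into play.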
> +
> +/*
>   * Logging inodes is really tricksy. They are logged in memory format,
>   * which means that what we write into the log doesn't directly translate into
>   * the amount of space they use on disk.
> @@ -126,7 +150,7 @@ xfs_calc_inode_res(
>   */
>  STATIC uint
>  xfs_calc_finobt_res(
> -	struct xfs_mount 	*mp,
> +	struct xfs_mount	*mp,
>  	int			alloc,
>  	int			modify)
>  {
> @@ -137,7 +161,7 @@ xfs_calc_finobt_res(
>  
>  	res = xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1));
>  	if (alloc)
> -		res += xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 1), 
> +		res += xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
>  					XFS_FSB_TO_B(mp, 1));
>  	if (modify)
>  		res += (uint)XFS_FSB_TO_B(mp, 1);
> @@ -188,10 +212,10 @@ xfs_calc_write_reservation(
>  		     xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK),
>  				      XFS_FSB_TO_B(mp, 1)) +
>  		     xfs_calc_buf_res(3, mp->m_sb.sb_sectsize) +
> -		     xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 2),
> +		     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 2),
>  				      XFS_FSB_TO_B(mp, 1))),
>  		    (xfs_calc_buf_res(5, mp->m_sb.sb_sectsize) +
> -		     xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 2),
> +		     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 2),
>  				      XFS_FSB_TO_B(mp, 1))));
>  }
>  
> @@ -217,10 +241,10 @@ xfs_calc_itruncate_reservation(
>  		     xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK) + 1,
>  				      XFS_FSB_TO_B(mp, 1))),
>  		    (xfs_calc_buf_res(9, mp->m_sb.sb_sectsize) +
> -		     xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 4),
> +		     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 4),
>  				      XFS_FSB_TO_B(mp, 1)) +
>  		    xfs_calc_buf_res(5, 0) +
> -		    xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 1),
> +		    xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
>  				     XFS_FSB_TO_B(mp, 1)) +
>  		    xfs_calc_buf_res(2 + mp->m_ialloc_blks +
>  				     mp->m_in_maxlevels, 0)));
> @@ -247,7 +271,7 @@ xfs_calc_rename_reservation(
>  		     xfs_calc_buf_res(2 * XFS_DIROP_LOG_COUNT(mp),
>  				      XFS_FSB_TO_B(mp, 1))),
>  		    (xfs_calc_buf_res(7, mp->m_sb.sb_sectsize) +
> -		     xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 3),
> +		     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 3),
>  				      XFS_FSB_TO_B(mp, 1))));
>  }
>  
> @@ -286,7 +310,7 @@ xfs_calc_link_reservation(
>  		     xfs_calc_buf_res(XFS_DIROP_LOG_COUNT(mp),
>  				      XFS_FSB_TO_B(mp, 1))),
>  		    (xfs_calc_buf_res(3, mp->m_sb.sb_sectsize) +
> -		     xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 1),
> +		     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
>  				      XFS_FSB_TO_B(mp, 1))));
>  }
>  
> @@ -324,7 +348,7 @@ xfs_calc_remove_reservation(
>  		     xfs_calc_buf_res(XFS_DIROP_LOG_COUNT(mp),
>  				      XFS_FSB_TO_B(mp, 1))),
>  		    (xfs_calc_buf_res(4, mp->m_sb.sb_sectsize) +
> -		     xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 2),
> +		     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 2),
>  				      XFS_FSB_TO_B(mp, 1))));
>  }
>  
> @@ -371,7 +395,7 @@ xfs_calc_create_resv_alloc(
>  		mp->m_sb.sb_sectsize +
>  		xfs_calc_buf_res(mp->m_ialloc_blks, XFS_FSB_TO_B(mp, 1)) +
>  		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
> -		xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 1),
> +		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
>  				 XFS_FSB_TO_B(mp, 1));
>  }
>  
> @@ -399,7 +423,7 @@ xfs_calc_icreate_resv_alloc(
>  	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
>  		mp->m_sb.sb_sectsize +
>  		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
> -		xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 1),
> +		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
>  				 XFS_FSB_TO_B(mp, 1)) +
>  		xfs_calc_finobt_res(mp, 0, 0);
>  }
> @@ -483,7 +507,7 @@ xfs_calc_ifree_reservation(
>  		xfs_calc_buf_res(1, 0) +
>  		xfs_calc_buf_res(2 + mp->m_ialloc_blks +
>  				 mp->m_in_maxlevels, 0) +
> -		xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 1),
> +		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
>  				 XFS_FSB_TO_B(mp, 1)) +
>  		xfs_calc_finobt_res(mp, 0, 1);
>  }
> @@ -513,7 +537,7 @@ xfs_calc_growdata_reservation(
>  	struct xfs_mount	*mp)
>  {
>  	return xfs_calc_buf_res(3, mp->m_sb.sb_sectsize) +
> -		xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 1),
> +		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
>  				 XFS_FSB_TO_B(mp, 1));
>  }
>  
> @@ -535,7 +559,7 @@ xfs_calc_growrtalloc_reservation(
>  		xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK),
>  				 XFS_FSB_TO_B(mp, 1)) +
>  		xfs_calc_inode_res(mp, 1) +
> -		xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 1),
> +		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
>  				 XFS_FSB_TO_B(mp, 1));
>  }
>  
> @@ -611,7 +635,7 @@ xfs_calc_addafork_reservation(
>  		xfs_calc_buf_res(1, mp->m_dir_geo->blksize) +
>  		xfs_calc_buf_res(XFS_DAENTER_BMAP1B(mp, XFS_DATA_FORK) + 1,
>  				 XFS_FSB_TO_B(mp, 1)) +
> -		xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 1),
> +		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
>  				 XFS_FSB_TO_B(mp, 1));
>  }
>  
> @@ -634,7 +658,7 @@ xfs_calc_attrinval_reservation(
>  		    xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK),
>  				     XFS_FSB_TO_B(mp, 1))),
>  		   (xfs_calc_buf_res(9, mp->m_sb.sb_sectsize) +
> -		    xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 4),
> +		    xfs_calc_buf_res(xfs_allocfree_log_count(mp, 4),
>  				     XFS_FSB_TO_B(mp, 1))));
>  }
>  
> @@ -701,7 +725,7 @@ xfs_calc_attrrm_reservation(
>  					XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK)) +
>  		     xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK), 0)),
>  		    (xfs_calc_buf_res(5, mp->m_sb.sb_sectsize) +
> -		     xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 2),
> +		     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 2),
>  				      XFS_FSB_TO_B(mp, 1))));
>  }
>  
> diff --git a/fs/xfs/libxfs/xfs_trans_resv.h b/fs/xfs/libxfs/xfs_trans_resv.h
> index 7978150..0eb46ed 100644
> --- a/fs/xfs/libxfs/xfs_trans_resv.h
> +++ b/fs/xfs/libxfs/xfs_trans_resv.h
> @@ -68,16 +68,6 @@ struct xfs_trans_resv {
>  #define M_RES(mp)	(&(mp)->m_resv)
>  
>  /*
> - * Per-extent log reservation for the allocation btree changes
> - * involved in freeing or allocating an extent.
> - * 2 trees * (2 blocks/level * max depth - 1) * block size
> - */
> -#define	XFS_ALLOCFREE_LOG_RES(mp,nx) \
> -	((nx) * (2 * XFS_FSB_TO_B((mp), 2 * (mp)->m_ag_maxlevels - 1)))
> -#define	XFS_ALLOCFREE_LOG_COUNT(mp,nx) \
> -	((nx) * (2 * (2 * (mp)->m_ag_maxlevels - 1)))
> -
> -/*
>   * Per-directory log reservation for any directory change.
>   * dir blocks: (1 btree block per level + data block + free block) * dblock size
>   * bmap btree: (levels + 2) * max depth * block size
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 031/119] xfs: rmap btree requires more reserved free space
  2016-06-17  1:21 ` [PATCH 031/119] xfs: rmap btree requires more reserved free space Darrick J. Wong
@ 2016-07-08 13:21   ` Brian Foster
  2016-07-13 16:50     ` Darrick J. Wong
  0 siblings, 1 reply; 236+ messages in thread
From: Brian Foster @ 2016-07-08 13:21 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, vishal.l.verma, Dave Chinner, xfs

On Thu, Jun 16, 2016 at 06:21:11PM -0700, Darrick J. Wong wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The rmap btree is allocated from the AGFL, which means we have to
> ensure ENOSPC is reported to userspace before we run out of free
> space in each AG. The last allocation in an AG can cause a full
> height rmap btree split, and that means we have to reserve at least
> this many blocks *in each AG* to be placed on the AGFL at ENOSPC.
> Update the various space calculation functiosn to handle this.

				       functions

> 
> Also, because the macros are now executing conditional code and are called quite
> frequently, convert them to functions that initialise variables in the struct
> xfs_mount, use the new variables everywhere and document the calculations
> better.
> 
> v2: If rmapbt is disabled, it is incorrect to require 1 extra AGFL block
> for the rmapbt (due to the + 1); the entire clause needs to be gated
> on the feature flag.
> 
> v3: Use m_rmap_maxlevels to determine min_free.
> 
> [darrick.wong@oracle.com: don't reserve blocks if !rmap]
> [dchinner@redhat.com: update m_ag_max_usable after growfs]
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> Reviewed-by: Dave Chinner <dchinner@redhat.com>
> Signed-off-by: Dave Chinner <david@fromorbit.com>
> ---
>  fs/xfs/libxfs/xfs_alloc.c |   71 +++++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/libxfs/xfs_alloc.h |   41 +++-----------------------
>  fs/xfs/libxfs/xfs_bmap.c  |    2 +
>  fs/xfs/libxfs/xfs_sb.c    |    2 +
>  fs/xfs/xfs_discard.c      |    2 +
>  fs/xfs/xfs_fsops.c        |    5 ++-
>  fs/xfs/xfs_log_recover.c  |    1 +
>  fs/xfs/xfs_mount.c        |    2 +
>  fs/xfs/xfs_mount.h        |    2 +
>  fs/xfs/xfs_super.c        |    2 +
>  10 files changed, 88 insertions(+), 42 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index 570ca17..4c8ffd4 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -63,6 +63,72 @@ xfs_prealloc_blocks(
>  }
>  
>  /*
> + * In order to avoid ENOSPC-related deadlock caused by out-of-order locking of
> + * AGF buffer (PV 947395), we place constraints on the relationship among
> + * actual allocations for data blocks, freelist blocks, and potential file data
> + * bmap btree blocks. However, these restrictions may result in no actual space
> + * allocated for a delayed extent, for example, a data block in a certain AG is
> + * allocated but there is no additional block for the additional bmap btree
> + * block due to a split of the bmap btree of the file. The result of this may
> + * lead to an infinite loop when the file gets flushed to disk and all delayed
> + * extents need to be actually allocated. To get around this, we explicitly set
> + * aside a few blocks which will not be reserved in delayed allocation.
> + *
> + * The minimum number of needed freelist blocks is 4 fsbs _per AG_ when we are
> + * not using rmap btrees; a potential split of a file's bmap btree requires
> + * 1 fsb, so we set the number of set-aside blocks to 4 + 4*agcount when not
> + * using rmap btrees.
> + *

That's a bit wordy.

> + * When rmap btrees are active, we have to consider that using the last block
> + * in the AG can cause a full height rmap btree split and we need enough blocks
> + * on the AGFL to be able to handle this. That means we have, in addition to
> + * the above consideration, another (2 * mp->m_rmap_maxlevels) - 1 blocks
> + * required to be available to the free list.

I'm probably missing something, but why does a full tree split require 2
blocks per-level (minus 1)? Wouldn't that involve an allocated block per
level (and possibly a new root block)?

Otherwise, the rest looks good to me.

Brian

> + */
> +unsigned int
> +xfs_alloc_set_aside(
> +	struct xfs_mount *mp)
> +{
> +	unsigned int	blocks;
> +
> +	blocks = 4 + (mp->m_sb.sb_agcount * XFS_ALLOC_AGFL_RESERVE);
> +	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
> +		return blocks;
> +	return blocks + (mp->m_sb.sb_agcount * (2 * mp->m_rmap_maxlevels) - 1);
> +}
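
For reference, the same calculation modelled standalone (names are mine, not
the kernel's). One thing worth noting: as posted, the trailing "- 1" applies
once to the filesystem total rather than once per AG, which doesn't quite
match the per-AG wording in the comment above.

```c
#include <assert.h>

#define AGFL_RESERVE	4	/* stand-in for XFS_ALLOC_AGFL_RESERVE */

/* Model of xfs_alloc_set_aside(): blocks excluded from delayed
 * allocation so AGFL refills can't deadlock at ENOSPC. */
static unsigned int
alloc_set_aside(unsigned int agcount, unsigned int rmap_maxlevels,
		int has_rmapbt)
{
	unsigned int blocks = 4 + agcount * AGFL_RESERVE;

	if (!has_rmapbt)
		return blocks;
	/* note: the "- 1" is applied once, not per AG */
	return blocks + (agcount * (2 * rmap_maxlevels) - 1);
}
```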
> +
> +/*
> + * When deciding how much space to allocate out of an AG, we limit the
> + * allocation maximum size to the size of the AG. However, we cannot use all the
> + * blocks in the AG - some are permanently used by metadata. These
> + * blocks are generally:
> + *	- the AG superblock, AGF, AGI and AGFL
> + *	- the AGF (bno and cnt) and AGI btree root blocks, and optionally
> + *	  the AGI free inode and rmap btree root blocks.
> + *	- blocks on the AGFL according to xfs_alloc_set_aside() limits
> + *
> + * The AG headers are sector sized, so the amount of space they take up is
> + * dependent on filesystem geometry. The others are all single blocks.
> + */
> +unsigned int
> +xfs_alloc_ag_max_usable(struct xfs_mount *mp)
> +{
> +	unsigned int	blocks;
> +
> +	blocks = XFS_BB_TO_FSB(mp, XFS_FSS_TO_BB(mp, 4)); /* ag headers */
> +	blocks += XFS_ALLOC_AGFL_RESERVE;
> +	blocks += 3;			/* AGF, AGI btree root blocks */
> +	if (xfs_sb_version_hasfinobt(&mp->m_sb))
> +		blocks++;		/* finobt root block */
> +	if (xfs_sb_version_hasrmapbt(&mp->m_sb)) {
> +		/* rmap root block + full tree split on full AG */
> +		blocks += 1 + (2 * mp->m_ag_maxlevels) - 1;
> +	}
> +
> +	return mp->m_sb.sb_agblocks - blocks;
> +}
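
Modelled the same way (hdr_fsbs stands in for the XFS_BB_TO_FSB()/
XFS_FSS_TO_BB() header conversion). Note that the rmap clause here uses the
alloc btree depth m_ag_maxlevels, where xfs_alloc_set_aside() above used
m_rmap_maxlevels; a sketch, assuming the code as posted:

```c
#include <assert.h>

/* Model of xfs_alloc_ag_max_usable(): AG size minus the blocks
 * permanently consumed by metadata.  hdr_fsbs is the four sector-sized
 * AG headers rounded up to whole filesystem blocks. */
static unsigned int
ag_max_usable(unsigned int agblocks, unsigned int hdr_fsbs,
	      unsigned int maxlevels, int has_finobt, int has_rmapbt)
{
	unsigned int blocks = hdr_fsbs;		/* sb, AGF, AGI, AGFL */

	blocks += 4;				/* AGFL reserve */
	blocks += 3;				/* bno, cnt, inobt roots */
	if (has_finobt)
		blocks++;			/* finobt root */
	if (has_rmapbt)
		blocks += 1 + (2 * maxlevels) - 1; /* root + full split */
	return agblocks - blocks;
}
```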
> +
> +/*
>   * Lookup the record equal to [bno, len] in the btree given by cur.
>   */
>  STATIC int				/* error */
> @@ -1904,6 +1970,11 @@ xfs_alloc_min_freelist(
>  	/* space needed by-size freespace btree */
>  	min_free += min_t(unsigned int, pag->pagf_levels[XFS_BTNUM_CNTi] + 1,
>  				       mp->m_ag_maxlevels);
> +	/* space needed by the reverse mapping btree */
> +	if (xfs_sb_version_hasrmapbt(&mp->m_sb))
> +		min_free += min_t(unsigned int,
> +				  pag->pagf_levels[XFS_BTNUM_RMAPi] + 1,
> +				  mp->m_rmap_maxlevels);
>  
>  	return min_free;
>  }
> diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
> index 0721a48..7b6c66b 100644
> --- a/fs/xfs/libxfs/xfs_alloc.h
> +++ b/fs/xfs/libxfs/xfs_alloc.h
> @@ -56,42 +56,6 @@ typedef unsigned int xfs_alloctype_t;
>  #define	XFS_ALLOC_FLAG_FREEING	0x00000002  /* indicate caller is freeing extents*/
>  
>  /*
> - * In order to avoid ENOSPC-related deadlock caused by
> - * out-of-order locking of AGF buffer (PV 947395), we place
> - * constraints on the relationship among actual allocations for
> - * data blocks, freelist blocks, and potential file data bmap
> - * btree blocks. However, these restrictions may result in no
> - * actual space allocated for a delayed extent, for example, a data
> - * block in a certain AG is allocated but there is no additional
> - * block for the additional bmap btree block due to a split of the
> - * bmap btree of the file. The result of this may lead to an
> - * infinite loop in xfssyncd when the file gets flushed to disk and
> - * all delayed extents need to be actually allocated. To get around
> - * this, we explicitly set aside a few blocks which will not be
> - * reserved in delayed allocation. Considering the minimum number of
> - * needed freelist blocks is 4 fsbs _per AG_, a potential split of file's bmap
> - * btree requires 1 fsb, so we set the number of set-aside blocks
> - * to 4 + 4*agcount.
> - */
> -#define XFS_ALLOC_SET_ASIDE(mp)  (4 + ((mp)->m_sb.sb_agcount * 4))
> -
> -/*
> - * When deciding how much space to allocate out of an AG, we limit the
> - * allocation maximum size to the size the AG. However, we cannot use all the
> - * blocks in the AG - some are permanently used by metadata. These
> - * blocks are generally:
> - *	- the AG superblock, AGF, AGI and AGFL
> - *	- the AGF (bno and cnt) and AGI btree root blocks
> - *	- 4 blocks on the AGFL according to XFS_ALLOC_SET_ASIDE() limits
> - *
> - * The AG headers are sector sized, so the amount of space they take up is
> - * dependent on filesystem geometry. The others are all single blocks.
> - */
> -#define XFS_ALLOC_AG_MAX_USABLE(mp)	\
> -	((mp)->m_sb.sb_agblocks - XFS_BB_TO_FSB(mp, XFS_FSS_TO_BB(mp, 4)) - 7)
> -
> -
> -/*
>   * Argument structure for xfs_alloc routines.
>   * This is turned into a structure to avoid having 20 arguments passed
>   * down several levels of the stack.
> @@ -133,6 +97,11 @@ typedef struct xfs_alloc_arg {
>  #define XFS_ALLOC_INITIAL_USER_DATA	(1 << 1)/* special case start of file */
>  #define XFS_ALLOC_USERDATA_ZERO		(1 << 2)/* zero extent on allocation */
>  
> +/* freespace limit calculations */
> +#define XFS_ALLOC_AGFL_RESERVE	4
> +unsigned int xfs_alloc_set_aside(struct xfs_mount *mp);
> +unsigned int xfs_alloc_ag_max_usable(struct xfs_mount *mp);
> +
>  xfs_extlen_t xfs_alloc_longest_free_extent(struct xfs_mount *mp,
>  		struct xfs_perag *pag, xfs_extlen_t need);
>  unsigned int xfs_alloc_min_freelist(struct xfs_mount *mp,
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 2c28f2a..61c0231 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -3672,7 +3672,7 @@ xfs_bmap_btalloc(
>  	args.fsbno = ap->blkno;
>  
>  	/* Trim the allocation back to the maximum an AG can fit. */
> -	args.maxlen = MIN(ap->length, XFS_ALLOC_AG_MAX_USABLE(mp));
> +	args.maxlen = MIN(ap->length, mp->m_ag_max_usable);
>  	args.firstblock = *ap->firstblock;
>  	blen = 0;
>  	if (nullfb) {
> diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
> index f86226b..59c9f59 100644
> --- a/fs/xfs/libxfs/xfs_sb.c
> +++ b/fs/xfs/libxfs/xfs_sb.c
> @@ -749,6 +749,8 @@ xfs_sb_mount_common(
>  		mp->m_ialloc_min_blks = sbp->sb_spino_align;
>  	else
>  		mp->m_ialloc_min_blks = mp->m_ialloc_blks;
> +	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
> +	mp->m_ag_max_usable = xfs_alloc_ag_max_usable(mp);
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_discard.c b/fs/xfs/xfs_discard.c
> index 272c3f8..4ff499a 100644
> --- a/fs/xfs/xfs_discard.c
> +++ b/fs/xfs/xfs_discard.c
> @@ -179,7 +179,7 @@ xfs_ioc_trim(
>  	 * matter as trimming blocks is an advisory interface.
>  	 */
>  	if (range.start >= XFS_FSB_TO_B(mp, mp->m_sb.sb_dblocks) ||
> -	    range.minlen > XFS_FSB_TO_B(mp, XFS_ALLOC_AG_MAX_USABLE(mp)) ||
> +	    range.minlen > XFS_FSB_TO_B(mp, mp->m_ag_max_usable) ||
>  	    range.len < mp->m_sb.sb_blocksize)
>  		return -EINVAL;
>  
> diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> index 8a85e49..3772f6c 100644
> --- a/fs/xfs/xfs_fsops.c
> +++ b/fs/xfs/xfs_fsops.c
> @@ -583,6 +583,7 @@ xfs_growfs_data_private(
>  	} else
>  		mp->m_maxicount = 0;
>  	xfs_set_low_space_thresholds(mp);
> +	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
>  
>  	/* update secondary superblocks. */
>  	for (agno = 1; agno < nagcount; agno++) {
> @@ -720,7 +721,7 @@ xfs_fs_counts(
>  	cnt->allocino = percpu_counter_read_positive(&mp->m_icount);
>  	cnt->freeino = percpu_counter_read_positive(&mp->m_ifree);
>  	cnt->freedata = percpu_counter_read_positive(&mp->m_fdblocks) -
> -							XFS_ALLOC_SET_ASIDE(mp);
> +						mp->m_alloc_set_aside;
>  
>  	spin_lock(&mp->m_sb_lock);
>  	cnt->freertx = mp->m_sb.sb_frextents;
> @@ -793,7 +794,7 @@ retry:
>  		__int64_t	free;
>  
>  		free = percpu_counter_sum(&mp->m_fdblocks) -
> -							XFS_ALLOC_SET_ASIDE(mp);
> +						mp->m_alloc_set_aside;
>  		if (!free)
>  			goto out; /* ENOSPC and fdblks_delta = 0 */
>  
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index 0c41bd2..b33187b 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -5027,6 +5027,7 @@ xlog_do_recover(
>  		xfs_warn(mp, "Failed post-recovery per-ag init: %d", error);
>  		return error;
>  	}
> +	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
>  
>  	xlog_recover_check_summary(log);
>  
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index 8af1c88..879f3ef 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -1219,7 +1219,7 @@ xfs_mod_fdblocks(
>  		batch = XFS_FDBLOCKS_BATCH;
>  
>  	__percpu_counter_add(&mp->m_fdblocks, delta, batch);
> -	if (__percpu_counter_compare(&mp->m_fdblocks, XFS_ALLOC_SET_ASIDE(mp),
> +	if (__percpu_counter_compare(&mp->m_fdblocks, mp->m_alloc_set_aside,
>  				     XFS_FDBLOCKS_BATCH) >= 0) {
>  		/* we had space! */
>  		return 0;
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 0ed0f29..b36676c 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -123,6 +123,8 @@ typedef struct xfs_mount {
>  	uint			m_in_maxlevels;	/* max inobt btree levels. */
>  	uint			m_rmap_maxlevels; /* max rmap btree levels */
>  	xfs_extlen_t		m_ag_prealloc_blocks; /* reserved ag blocks */
> +	uint			m_alloc_set_aside; /* space we can't use */
> +	uint			m_ag_max_usable; /* max space per AG */
>  	struct radix_tree_root	m_perag_tree;	/* per-ag accounting info */
>  	spinlock_t		m_perag_lock;	/* lock for m_perag_tree */
>  	struct mutex		m_growlock;	/* growfs mutex */
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index bf63f6d..1575849 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1076,7 +1076,7 @@ xfs_fs_statfs(
>  	statp->f_blocks = sbp->sb_dblocks - lsize;
>  	spin_unlock(&mp->m_sb_lock);
>  
> -	statp->f_bfree = fdblocks - XFS_ALLOC_SET_ASIDE(mp);
> +	statp->f_bfree = fdblocks - mp->m_alloc_set_aside;
>  	statp->f_bavail = statp->f_bfree;
>  
>  	fakeinos = statp->f_bfree << sbp->sb_inopblog;
> 

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 032/119] xfs: add rmap btree operations
  2016-06-17  1:21 ` [PATCH 032/119] xfs: add rmap btree operations Darrick J. Wong
@ 2016-07-08 18:33   ` Brian Foster
  2016-07-08 23:53     ` Darrick J. Wong
  0 siblings, 1 reply; 236+ messages in thread
From: Brian Foster @ 2016-07-08 18:33 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, vishal.l.verma, Dave Chinner, xfs

On Thu, Jun 16, 2016 at 06:21:17PM -0700, Darrick J. Wong wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Implement the generic btree operations needed to manipulate rmap
> btree blocks. This is very similar to the per-ag freespace btree
> implementation, and uses the AGFL for allocation and freeing of
> blocks.
> 
> Adapt the rmap btree to store owner offsets within each rmap record,
> and to handle the primary key being redefined as the tuple
> [agblk, owner, offset].  The expansion of the primary key is crucial
> to allowing multiple owners per extent.
> 
> [darrick: adapt the btree ops to deal with offsets]
> [darrick: remove init_rec_from_key]
> [darrick: move unwritten bit to rm_offset]
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> Reviewed-by: Dave Chinner <dchinner@redhat.com>
> Signed-off-by: Dave Chinner <david@fromorbit.com>
> ---
>  fs/xfs/libxfs/xfs_btree.h      |    1 
>  fs/xfs/libxfs/xfs_rmap.c       |   96 ++++++++++++++++
>  fs/xfs/libxfs/xfs_rmap_btree.c |  243 ++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/libxfs/xfs_rmap_btree.h |    9 +
>  fs/xfs/xfs_trace.h             |    3 
>  5 files changed, 352 insertions(+)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> index 90ea2a7..9963c48 100644
> --- a/fs/xfs/libxfs/xfs_btree.h
> +++ b/fs/xfs/libxfs/xfs_btree.h
> @@ -216,6 +216,7 @@ union xfs_btree_irec {
>  	xfs_alloc_rec_incore_t		a;
>  	xfs_bmbt_irec_t			b;
>  	xfs_inobt_rec_incore_t		i;
> +	struct xfs_rmap_irec		r;
>  };
>  
>  /*
> diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
> index d1fd471..c6a5a0b 100644
> --- a/fs/xfs/libxfs/xfs_rmap.c
> +++ b/fs/xfs/libxfs/xfs_rmap.c
> @@ -37,6 +37,102 @@
>  #include "xfs_error.h"
>  #include "xfs_extent_busy.h"
>  
...
> +/*
> + * Update the record referred to by cur to the value given
> + * by [bno, len, owner, offset].
> + * This either works (return 0) or gets an EFSCORRUPTED error.
> + */
> +STATIC int
> +xfs_rmap_update(

This throws an unused warning, but I assume it will be used later.

> +	struct xfs_btree_cur	*cur,
> +	struct xfs_rmap_irec	*irec)
> +{
> +	union xfs_btree_rec	rec;
> +
> +	rec.rmap.rm_startblock = cpu_to_be32(irec->rm_startblock);
> +	rec.rmap.rm_blockcount = cpu_to_be32(irec->rm_blockcount);
> +	rec.rmap.rm_owner = cpu_to_be64(irec->rm_owner);
> +	rec.rmap.rm_offset = cpu_to_be64(
> +			xfs_rmap_irec_offset_pack(irec));
> +	return xfs_btree_update(cur, &rec);
> +}
> +
...
>  int
>  xfs_rmap_free(
>  	struct xfs_trans	*tp,
> diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
> index 7a35c78..c50c725 100644
> --- a/fs/xfs/libxfs/xfs_rmap_btree.c
> +++ b/fs/xfs/libxfs/xfs_rmap_btree.c
...
> @@ -43,6 +68,173 @@ xfs_rmapbt_dup_cursor(
>  			cur->bc_private.a.agbp, cur->bc_private.a.agno);
>  }
>  
> +STATIC void
> +xfs_rmapbt_set_root(
> +	struct xfs_btree_cur	*cur,
> +	union xfs_btree_ptr	*ptr,
> +	int			inc)
> +{
> +	struct xfs_buf		*agbp = cur->bc_private.a.agbp;
> +	struct xfs_agf		*agf = XFS_BUF_TO_AGF(agbp);
> +	xfs_agnumber_t		seqno = be32_to_cpu(agf->agf_seqno);
> +	int			btnum = cur->bc_btnum;
> +	struct xfs_perag	*pag = xfs_perag_get(cur->bc_mp, seqno);
> +
> +	ASSERT(ptr->s != 0);
> +
> +	agf->agf_roots[btnum] = ptr->s;
> +	be32_add_cpu(&agf->agf_levels[btnum], inc);
> +	pag->pagf_levels[btnum] += inc;
> +	xfs_perag_put(pag);
> +
> +	xfs_alloc_log_agf(cur->bc_tp, agbp, XFS_AGF_ROOTS | XFS_AGF_LEVELS);
> +}
> +
> +STATIC int
> +xfs_rmapbt_alloc_block(
> +	struct xfs_btree_cur	*cur,
> +	union xfs_btree_ptr	*start,
> +	union xfs_btree_ptr	*new,
> +	int			*stat)
> +{
> +	int			error;
> +	xfs_agblock_t		bno;
> +
> +	XFS_BTREE_TRACE_CURSOR(cur, XBT_ENTRY);
> +
> +	/* Allocate the new block from the freelist. If we can't, give up.  */
> +	error = xfs_alloc_get_freelist(cur->bc_tp, cur->bc_private.a.agbp,
> +				       &bno, 1);
> +	if (error) {
> +		XFS_BTREE_TRACE_CURSOR(cur, XBT_ERROR);
> +		return error;
> +	}
> +
> +	trace_xfs_rmapbt_alloc_block(cur->bc_mp, cur->bc_private.a.agno,
> +			bno, 1);
> +	if (bno == NULLAGBLOCK) {
> +		XFS_BTREE_TRACE_CURSOR(cur, XBT_EXIT);
> +		*stat = 0;
> +		return 0;
> +	}
> +
> +	xfs_extent_busy_reuse(cur->bc_mp, cur->bc_private.a.agno, bno, 1,
> +			false);
> +
> +	xfs_trans_agbtree_delta(cur->bc_tp, 1);
> +	new->s = cpu_to_be32(bno);
> +
> +	XFS_BTREE_TRACE_CURSOR(cur, XBT_EXIT);
> +	*stat = 1;
> +	return 0;
> +}
> +
> +STATIC int
> +xfs_rmapbt_free_block(
> +	struct xfs_btree_cur	*cur,
> +	struct xfs_buf		*bp)
> +{
> +	struct xfs_buf		*agbp = cur->bc_private.a.agbp;
> +	struct xfs_agf		*agf = XFS_BUF_TO_AGF(agbp);
> +	xfs_agblock_t		bno;
> +	int			error;
> +
> +	bno = xfs_daddr_to_agbno(cur->bc_mp, XFS_BUF_ADDR(bp));
> +	trace_xfs_rmapbt_free_block(cur->bc_mp, cur->bc_private.a.agno,
> +			bno, 1);
> +	error = xfs_alloc_put_freelist(cur->bc_tp, agbp, NULL, bno, 1);
> +	if (error)
> +		return error;
> +
> +	xfs_extent_busy_insert(cur->bc_tp, be32_to_cpu(agf->agf_seqno), bno, 1,
> +			      XFS_EXTENT_BUSY_SKIP_DISCARD);
> +	xfs_trans_agbtree_delta(cur->bc_tp, -1);
> +
> +	xfs_trans_binval(cur->bc_tp, bp);

This is handled in the generic btree code.

> +	return 0;
> +}
> +
...
> @@ -117,12 +309,63 @@ const struct xfs_buf_ops xfs_rmapbt_buf_ops = {
>  	.verify_write		= xfs_rmapbt_write_verify,
>  };
>  
> +#if defined(DEBUG) || defined(XFS_WARN)
> +STATIC int
> +xfs_rmapbt_keys_inorder(
> +	struct xfs_btree_cur	*cur,
> +	union xfs_btree_key	*k1,
> +	union xfs_btree_key	*k2)
> +{
> +	if (be32_to_cpu(k1->rmap.rm_startblock) <
> +	    be32_to_cpu(k2->rmap.rm_startblock))
> +		return 1;
> +	if (be64_to_cpu(k1->rmap.rm_owner) <
> +	    be64_to_cpu(k2->rmap.rm_owner))
> +		return 1;
> +	if (XFS_RMAP_OFF(be64_to_cpu(k1->rmap.rm_offset)) <=
> +	    XFS_RMAP_OFF(be64_to_cpu(k2->rmap.rm_offset)))
> +		return 1;
> +	return 0;

I might just not be familiar enough with the rmapbt ordering rules, but
this doesn't look right. If the rm_startblock values are out of order
(k1 startblock > k2 startblock), but either of the owner or offset
values are in-order, then we call the keys in order. Is that intentional
or should (k1->rmap.rm_startblock > k2->rmap.rm_startblock) always
return 0?

> +}
> +
> +STATIC int
> +xfs_rmapbt_recs_inorder(
> +	struct xfs_btree_cur	*cur,
> +	union xfs_btree_rec	*r1,
> +	union xfs_btree_rec	*r2)
> +{
> +	if (be32_to_cpu(r1->rmap.rm_startblock) <
> +	    be32_to_cpu(r2->rmap.rm_startblock))
> +		return 1;
> +	if (XFS_RMAP_OFF(be64_to_cpu(r1->rmap.rm_offset)) <
> +	    XFS_RMAP_OFF(be64_to_cpu(r2->rmap.rm_offset)))
> +		return 1;
> +	if (be64_to_cpu(r1->rmap.rm_owner) <=
> +	    be64_to_cpu(r2->rmap.rm_owner))
> +		return 1;
> +	return 0;
> +}

Same question here.

Brian
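
To illustrate the concern: a strictly lexicographic check, where a later field
only matters when all earlier fields tie, would look like this userspace
sketch (simplified types; flag packing in rm_offset is ignored):

```c
#include <assert.h>
#include <stdint.h>

struct rmap_key {
	uint32_t	startblock;
	uint64_t	owner;
	uint64_t	offset;
};

/* Returns 1 iff k1 sorts at or before k2, comparing startblock first,
 * then owner, then offset; later fields only break ties. */
static int
rmap_keys_inorder_lex(const struct rmap_key *k1, const struct rmap_key *k2)
{
	if (k1->startblock != k2->startblock)
		return k1->startblock < k2->startblock;
	if (k1->owner != k2->owner)
		return k1->owner < k2->owner;
	return k1->offset <= k2->offset;
}
```

With this version, keys (2,0,0) and (1,5,5) are reported out of order, whereas
the short-circuit version quoted above would report them in order because the
owners happen to compare low-to-high.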

> +#endif	/* DEBUG */
> +
>  static const struct xfs_btree_ops xfs_rmapbt_ops = {
>  	.rec_len		= sizeof(struct xfs_rmap_rec),
>  	.key_len		= sizeof(struct xfs_rmap_key),
>  
>  	.dup_cursor		= xfs_rmapbt_dup_cursor,
> +	.set_root		= xfs_rmapbt_set_root,
> +	.alloc_block		= xfs_rmapbt_alloc_block,
> +	.free_block		= xfs_rmapbt_free_block,
> +	.get_minrecs		= xfs_rmapbt_get_minrecs,
> +	.get_maxrecs		= xfs_rmapbt_get_maxrecs,
> +	.init_key_from_rec	= xfs_rmapbt_init_key_from_rec,
> +	.init_rec_from_cur	= xfs_rmapbt_init_rec_from_cur,
> +	.init_ptr_from_cur	= xfs_rmapbt_init_ptr_from_cur,
> +	.key_diff		= xfs_rmapbt_key_diff,
>  	.buf_ops		= &xfs_rmapbt_buf_ops,
> +#if defined(DEBUG) || defined(XFS_WARN)
> +	.keys_inorder		= xfs_rmapbt_keys_inorder,
> +	.recs_inorder		= xfs_rmapbt_recs_inorder,
> +#endif
>  };
>  
>  /*
> diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
> index 462767f..17fa383 100644
> --- a/fs/xfs/libxfs/xfs_rmap_btree.h
> +++ b/fs/xfs/libxfs/xfs_rmap_btree.h
> @@ -52,6 +52,15 @@ struct xfs_btree_cur *xfs_rmapbt_init_cursor(struct xfs_mount *mp,
>  int xfs_rmapbt_maxrecs(struct xfs_mount *mp, int blocklen, int leaf);
>  extern void xfs_rmapbt_compute_maxlevels(struct xfs_mount *mp);
>  
> +int xfs_rmap_lookup_le(struct xfs_btree_cur *cur, xfs_agblock_t bno,
> +		xfs_extlen_t len, uint64_t owner, uint64_t offset,
> +		unsigned int flags, int *stat);
> +int xfs_rmap_lookup_eq(struct xfs_btree_cur *cur, xfs_agblock_t bno,
> +		xfs_extlen_t len, uint64_t owner, uint64_t offset,
> +		unsigned int flags, int *stat);
> +int xfs_rmap_get_rec(struct xfs_btree_cur *cur, struct xfs_rmap_irec *irec,
> +		int *stat);
> +
>  int xfs_rmap_alloc(struct xfs_trans *tp, struct xfs_buf *agbp,
>  		   xfs_agnumber_t agno, xfs_agblock_t bno, xfs_extlen_t len,
>  		   struct xfs_owner_info *oinfo);
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index b4ee9c8..28bd991 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -2470,6 +2470,9 @@ DEFINE_RMAP_EVENT(xfs_rmap_alloc_extent);
>  DEFINE_RMAP_EVENT(xfs_rmap_alloc_extent_done);
>  DEFINE_RMAP_EVENT(xfs_rmap_alloc_extent_error);
>  
> +DEFINE_BUSY_EVENT(xfs_rmapbt_alloc_block);
> +DEFINE_BUSY_EVENT(xfs_rmapbt_free_block);
> +
>  #endif /* _TRACE_XFS_H */
>  
>  #undef TRACE_INCLUDE_PATH
> 

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 033/119] xfs: support overlapping intervals in the rmap btree
  2016-06-17  1:21 ` [PATCH 033/119] xfs: support overlapping intervals in the rmap btree Darrick J. Wong
@ 2016-07-08 18:33   ` Brian Foster
  2016-07-09  0:14     ` Darrick J. Wong
  0 siblings, 1 reply; 236+ messages in thread
From: Brian Foster @ 2016-07-08 18:33 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, vishal.l.verma, xfs

On Thu, Jun 16, 2016 at 06:21:24PM -0700, Darrick J. Wong wrote:
> Now that the generic btree code supports overlapping intervals, plug
> in the rmap btree to this functionality.  We will need it to find
> potential left neighbors in xfs_rmap_{alloc,free} later in the patch
> set.
> 
> v2: Fix bit manipulation bug when generating high key offset.
> v3: Move unwritten bit to rm_offset.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_rmap_btree.c |   59 +++++++++++++++++++++++++++++++++++++++-
>  fs/xfs/libxfs/xfs_rmap_btree.h |   10 +++++--
>  2 files changed, 66 insertions(+), 3 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
> index c50c725..9adb930 100644
> --- a/fs/xfs/libxfs/xfs_rmap_btree.c
> +++ b/fs/xfs/libxfs/xfs_rmap_btree.c
> @@ -181,6 +181,28 @@ xfs_rmapbt_init_key_from_rec(
>  }
>  
>  STATIC void
> +xfs_rmapbt_init_high_key_from_rec(
> +	union xfs_btree_key	*key,
> +	union xfs_btree_rec	*rec)
> +{
> +	__uint64_t		off;
> +	int			adj;
> +
> +	adj = be32_to_cpu(rec->rmap.rm_blockcount) - 1;
> +

Comments please. I had to stare at this for longer than I care to
admit to grok why it is modifying values. :) One liners along the lines
of "shift the startblock/offset to the highest value to form the high
key" or "don't convert offset for non-inode owners because ..." go a
long way for those not familiar with the code.

With regard to rm_offset, could we just copy it unconditionally here
(should it not be 0)?

> +	key->rmap.rm_startblock = rec->rmap.rm_startblock;
> +	be32_add_cpu(&key->rmap.rm_startblock, adj);
> +	key->rmap.rm_owner = rec->rmap.rm_owner;
> +	key->rmap.rm_offset = rec->rmap.rm_offset;
> +	if (XFS_RMAP_NON_INODE_OWNER(be64_to_cpu(rec->rmap.rm_owner)) ||
> +	    XFS_RMAP_IS_BMBT_BLOCK(be64_to_cpu(rec->rmap.rm_offset)))
> +		return;
> +	off = be64_to_cpu(key->rmap.rm_offset);
> +	off = (XFS_RMAP_OFF(off) + adj) | (off & ~XFS_RMAP_OFF_MASK);
> +	key->rmap.rm_offset = cpu_to_be64(off);
> +}
> +
> +STATIC void
>  xfs_rmapbt_init_rec_from_cur(
>  	struct xfs_btree_cur	*cur,
>  	union xfs_btree_rec	*rec)
> @@ -235,6 +257,38 @@ xfs_rmapbt_key_diff(
>  	return 0;
>  }
>  
> +STATIC __int64_t
> +xfs_rmapbt_diff_two_keys(
> +	struct xfs_btree_cur	*cur,
> +	union xfs_btree_key	*k1,
> +	union xfs_btree_key	*k2)
> +{
> +	struct xfs_rmap_key	*kp1 = &k1->rmap;
> +	struct xfs_rmap_key	*kp2 = &k2->rmap;
> +	__int64_t		d;
> +	__u64			x, y;
> +
> +	d = (__int64_t)be32_to_cpu(kp2->rm_startblock) -
> +		       be32_to_cpu(kp1->rm_startblock);
> +	if (d)
> +		return d;
> +
> +	x = be64_to_cpu(kp2->rm_owner);
> +	y = be64_to_cpu(kp1->rm_owner);
> +	if (x > y)
> +		return 1;
> +	else if (y > x)
> +		return -1;
> +
> +	x = XFS_RMAP_OFF(be64_to_cpu(kp2->rm_offset));
> +	y = XFS_RMAP_OFF(be64_to_cpu(kp1->rm_offset));
> +	if (x > y)
> +		return 1;
> +	else if (y > x)
> +		return -1;
> +	return 0;
> +}
> +
>  static bool
>  xfs_rmapbt_verify(
>  	struct xfs_buf		*bp)
> @@ -350,6 +404,7 @@ xfs_rmapbt_recs_inorder(
>  static const struct xfs_btree_ops xfs_rmapbt_ops = {
>  	.rec_len		= sizeof(struct xfs_rmap_rec),
>  	.key_len		= sizeof(struct xfs_rmap_key),
> +	.flags			= XFS_BTREE_OPS_OVERLAPPING,
>  
>  	.dup_cursor		= xfs_rmapbt_dup_cursor,
>  	.set_root		= xfs_rmapbt_set_root,
> @@ -358,10 +413,12 @@ static const struct xfs_btree_ops xfs_rmapbt_ops = {
>  	.get_minrecs		= xfs_rmapbt_get_minrecs,
>  	.get_maxrecs		= xfs_rmapbt_get_maxrecs,
>  	.init_key_from_rec	= xfs_rmapbt_init_key_from_rec,
> +	.init_high_key_from_rec	= xfs_rmapbt_init_high_key_from_rec,
>  	.init_rec_from_cur	= xfs_rmapbt_init_rec_from_cur,
>  	.init_ptr_from_cur	= xfs_rmapbt_init_ptr_from_cur,
>  	.key_diff		= xfs_rmapbt_key_diff,
>  	.buf_ops		= &xfs_rmapbt_buf_ops,
> +	.diff_two_keys		= xfs_rmapbt_diff_two_keys,
>  #if defined(DEBUG) || defined(XFS_WARN)
>  	.keys_inorder		= xfs_rmapbt_keys_inorder,
>  	.recs_inorder		= xfs_rmapbt_recs_inorder,
> @@ -410,7 +467,7 @@ xfs_rmapbt_maxrecs(
>  	if (leaf)
>  		return blocklen / sizeof(struct xfs_rmap_rec);
>  	return blocklen /
> -		(sizeof(struct xfs_rmap_key) + sizeof(xfs_rmap_ptr_t));
> +		(2 * sizeof(struct xfs_rmap_key) + sizeof(xfs_rmap_ptr_t));

Same here... a one-liner comment reminding us why we have the 2x, please.

>  }
>  
>  /* Compute the maximum height of an rmap btree. */
> diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
> index 17fa383..796071c 100644
> --- a/fs/xfs/libxfs/xfs_rmap_btree.h
> +++ b/fs/xfs/libxfs/xfs_rmap_btree.h
> @@ -38,12 +38,18 @@ struct xfs_mount;
>  #define XFS_RMAP_KEY_ADDR(block, index) \
>  	((struct xfs_rmap_key *) \
>  		((char *)(block) + XFS_RMAP_BLOCK_LEN + \
> -		 ((index) - 1) * sizeof(struct xfs_rmap_key)))
> +		 ((index) - 1) * 2 * sizeof(struct xfs_rmap_key)))
> +
> +#define XFS_RMAP_HIGH_KEY_ADDR(block, index) \
> +	((struct xfs_rmap_key *) \
> +		((char *)(block) + XFS_RMAP_BLOCK_LEN + \
> +		 sizeof(struct xfs_rmap_key) + \
> +		 ((index) - 1) * 2 * sizeof(struct xfs_rmap_key)))
>  

Could this just be 'XFS_RMAP_KEY_ADDR(block, index) + sizeof(struct
xfs_rmap_key)'?

Brian

>  #define XFS_RMAP_PTR_ADDR(block, index, maxrecs) \
>  	((xfs_rmap_ptr_t *) \
>  		((char *)(block) + XFS_RMAP_BLOCK_LEN + \
> -		 (maxrecs) * sizeof(struct xfs_rmap_key) + \
> +		 (maxrecs) * 2 * sizeof(struct xfs_rmap_key) + \
>  		 ((index) - 1) * sizeof(xfs_rmap_ptr_t)))
>  
>  struct xfs_btree_cur *xfs_rmapbt_init_cursor(struct xfs_mount *mp,
> 


* Re: [PATCH 034/119] xfs: teach rmapbt to support interval queries
  2016-06-17  1:21 ` [PATCH 034/119] xfs: teach rmapbt to support interval queries Darrick J. Wong
@ 2016-07-08 18:34   ` Brian Foster
  2016-07-09  0:16     ` Darrick J. Wong
  0 siblings, 1 reply; 236+ messages in thread
From: Brian Foster @ 2016-07-08 18:34 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, vishal.l.verma, xfs

On Thu, Jun 16, 2016 at 06:21:30PM -0700, Darrick J. Wong wrote:
> Now that the generic btree code supports querying all records within a
> range of keys, use that functionality to allow us to ask for all the
> extents mapped to a range of physical blocks.
> 
> v2: Move unwritten bit to rm_offset.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_rmap.c       |   43 ++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/libxfs/xfs_rmap_btree.h |    9 ++++++++
>  2 files changed, 52 insertions(+)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
> index c6a5a0b..0e1721a 100644
> --- a/fs/xfs/libxfs/xfs_rmap.c
> +++ b/fs/xfs/libxfs/xfs_rmap.c
> @@ -184,3 +184,46 @@ out_error:
>  	trace_xfs_rmap_alloc_extent_error(mp, agno, bno, len, false, oinfo);
>  	return error;
>  }
> +
> +struct xfs_rmapbt_query_range_info {
> +	xfs_rmapbt_query_range_fn	fn;
> +	void				*priv;
> +};
> +
> +/* Format btree record and pass to our callback. */
> +STATIC int
> +xfs_rmapbt_query_range_helper(
> +	struct xfs_btree_cur	*cur,
> +	union xfs_btree_rec	*rec,
> +	void			*priv)
> +{
> +	struct xfs_rmapbt_query_range_info	*query = priv;
> +	struct xfs_rmap_irec			irec;
> +	int					error;
> +
> +	error = xfs_rmapbt_btrec_to_irec(rec, &irec);
> +	if (error)
> +		return error;
> +	return query->fn(cur, &irec, query->priv);
> +}
> +
> +/* Find all rmaps between two keys. */
> +int
> +xfs_rmapbt_query_range(
> +	struct xfs_btree_cur		*cur,
> +	struct xfs_rmap_irec		*low_rec,
> +	struct xfs_rmap_irec		*high_rec,
> +	xfs_rmapbt_query_range_fn	fn,
> +	void				*priv)
> +{
> +	union xfs_btree_irec		low_brec;
> +	union xfs_btree_irec		high_brec;
> +	struct xfs_rmapbt_query_range_info	query;
> +
> +	low_brec.r = *low_rec;
> +	high_brec.r = *high_rec;

Some checks or asserts that these are actually in order couldn't hurt.
Otherwise looks good:

Reviewed-by: Brian Foster <bfoster@redhat.com>

> +	query.priv = priv;
> +	query.fn = fn;
> +	return xfs_btree_query_range(cur, &low_brec, &high_brec,
> +			xfs_rmapbt_query_range_helper, &query);
> +}
> diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
> index 796071c..e926c6e 100644
> --- a/fs/xfs/libxfs/xfs_rmap_btree.h
> +++ b/fs/xfs/libxfs/xfs_rmap_btree.h
> @@ -74,4 +74,13 @@ int xfs_rmap_free(struct xfs_trans *tp, struct xfs_buf *agbp,
>  		  xfs_agnumber_t agno, xfs_agblock_t bno, xfs_extlen_t len,
>  		  struct xfs_owner_info *oinfo);
>  
> +typedef int (*xfs_rmapbt_query_range_fn)(
> +	struct xfs_btree_cur	*cur,
> +	struct xfs_rmap_irec	*rec,
> +	void			*priv);
> +
> +int xfs_rmapbt_query_range(struct xfs_btree_cur *cur,
> +		struct xfs_rmap_irec *low_rec, struct xfs_rmap_irec *high_rec,
> +		xfs_rmapbt_query_range_fn fn, void *priv);
> +
>  #endif	/* __XFS_RMAP_BTREE_H__ */
> 


* Re: [PATCH 035/119] xfs: add tracepoints for the rmap functions
  2016-06-17  1:21 ` [PATCH 035/119] xfs: add tracepoints for the rmap functions Darrick J. Wong
@ 2016-07-08 18:34   ` Brian Foster
  0 siblings, 0 replies; 236+ messages in thread
From: Brian Foster @ 2016-07-08 18:34 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, vishal.l.verma, xfs

On Thu, Jun 16, 2016 at 06:21:36PM -0700, Darrick J. Wong wrote:
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_trace.h |   81 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 79 insertions(+), 2 deletions(-)
> 
> 
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index 28bd991..6daafaf 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -2415,8 +2415,6 @@ DEFINE_DEFER_PENDING_EVENT(xfs_defer_pending_cancel);
>  DEFINE_DEFER_PENDING_EVENT(xfs_defer_pending_finish);
>  DEFINE_DEFER_PENDING_EVENT(xfs_defer_pending_abort);
>  
> -DEFINE_MAP_EXTENT_DEFERRED_EVENT(xfs_defer_map_extent);
> -
>  #define DEFINE_BMAP_FREE_DEFERRED_EVENT DEFINE_PHYS_EXTENT_DEFERRED_EVENT
>  DEFINE_BMAP_FREE_DEFERRED_EVENT(xfs_bmap_free_defer);
>  DEFINE_BMAP_FREE_DEFERRED_EVENT(xfs_bmap_free_deferred);
> @@ -2463,6 +2461,36 @@ DEFINE_EVENT(xfs_rmap_class, name, \
>  		 struct xfs_owner_info *oinfo), \
>  	TP_ARGS(mp, agno, agbno, len, unwritten, oinfo))
>  
> +/* simple AG-based error/%ip tracepoint class */
> +DECLARE_EVENT_CLASS(xfs_ag_error_class,
> +	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, int error,
> +		 unsigned long caller_ip),
> +	TP_ARGS(mp, agno, error, caller_ip),
> +	TP_STRUCT__entry(
> +		__field(dev_t, dev)
> +		__field(xfs_agnumber_t, agno)
> +		__field(int, error)
> +		__field(unsigned long, caller_ip)
> +	),
> +	TP_fast_assign(
> +		__entry->dev = mp->m_super->s_dev;
> +		__entry->agno = agno;
> +		__entry->error = error;
> +		__entry->caller_ip = caller_ip;
> +	),
> +	TP_printk("dev %d:%d agno %u error %d caller %ps",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> +		  __entry->agno,
> +		  __entry->error,
> +		  (char *)__entry->caller_ip)
> +);
> +
> +#define DEFINE_AG_ERROR_EVENT(name) \
> +DEFINE_EVENT(xfs_ag_error_class, name, \
> +	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, int error, \
> +		 unsigned long caller_ip), \
> +	TP_ARGS(mp, agno, error, caller_ip))
> +
>  DEFINE_RMAP_EVENT(xfs_rmap_free_extent);
>  DEFINE_RMAP_EVENT(xfs_rmap_free_extent_done);
>  DEFINE_RMAP_EVENT(xfs_rmap_free_extent_error);
> @@ -2470,8 +2498,57 @@ DEFINE_RMAP_EVENT(xfs_rmap_alloc_extent);
>  DEFINE_RMAP_EVENT(xfs_rmap_alloc_extent_done);
>  DEFINE_RMAP_EVENT(xfs_rmap_alloc_extent_error);
>  
> +DECLARE_EVENT_CLASS(xfs_rmapbt_class,
> +	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
> +		 xfs_agblock_t agbno, xfs_extlen_t len,
> +		 uint64_t owner, uint64_t offset, unsigned int flags),
> +	TP_ARGS(mp, agno, agbno, len, owner, offset, flags),
> +	TP_STRUCT__entry(
> +		__field(dev_t, dev)
> +		__field(xfs_agnumber_t, agno)
> +		__field(xfs_agblock_t, agbno)
> +		__field(xfs_extlen_t, len)
> +		__field(uint64_t, owner)
> +		__field(uint64_t, offset)
> +		__field(unsigned int, flags)
> +	),
> +	TP_fast_assign(
> +		__entry->dev = mp->m_super->s_dev;
> +		__entry->agno = agno;
> +		__entry->agbno = agbno;
> +		__entry->len = len;
> +		__entry->owner = owner;
> +		__entry->offset = offset;
> +		__entry->flags = flags;
> +	),
> +	TP_printk("dev %d:%d agno %u agbno %u len %u owner %lld offset %llu flags 0x%x",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> +		  __entry->agno,
> +		  __entry->agbno,
> +		  __entry->len,
> +		  __entry->owner,
> +		  __entry->offset,
> +		  __entry->flags)
> +);
> +#define DEFINE_RMAPBT_EVENT(name) \
> +DEFINE_EVENT(xfs_rmapbt_class, name, \
> +	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, \
> +		 xfs_agblock_t agbno, xfs_extlen_t len, \
> +		 uint64_t owner, uint64_t offset, unsigned int flags), \
> +	TP_ARGS(mp, agno, agbno, len, owner, offset, flags))
> +
> +#define DEFINE_RMAP_DEFERRED_EVENT DEFINE_MAP_EXTENT_DEFERRED_EVENT
> +DEFINE_RMAP_DEFERRED_EVENT(xfs_rmap_defer);
> +DEFINE_RMAP_DEFERRED_EVENT(xfs_rmap_deferred);
> +
>  DEFINE_BUSY_EVENT(xfs_rmapbt_alloc_block);
>  DEFINE_BUSY_EVENT(xfs_rmapbt_free_block);
> +DEFINE_RMAPBT_EVENT(xfs_rmapbt_update);
> +DEFINE_RMAPBT_EVENT(xfs_rmapbt_insert);
> +DEFINE_RMAPBT_EVENT(xfs_rmapbt_delete);
> +DEFINE_AG_ERROR_EVENT(xfs_rmapbt_insert_error);
> +DEFINE_AG_ERROR_EVENT(xfs_rmapbt_delete_error);
> +DEFINE_AG_ERROR_EVENT(xfs_rmapbt_update_error);
>  
>  #endif /* _TRACE_XFS_H */
>  
> 


* Re: [PATCH 032/119] xfs: add rmap btree operations
  2016-07-08 18:33   ` Brian Foster
@ 2016-07-08 23:53     ` Darrick J. Wong
  0 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-07-08 23:53 UTC (permalink / raw)
  To: Brian Foster; +Cc: david, linux-fsdevel, vishal.l.verma, Dave Chinner, xfs

On Fri, Jul 08, 2016 at 02:33:47PM -0400, Brian Foster wrote:
> On Thu, Jun 16, 2016 at 06:21:17PM -0700, Darrick J. Wong wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Implement the generic btree operations needed to manipulate rmap
> > btree blocks. This is very similar to the per-ag freespace btree
> > implementation, and uses the AGFL for allocation and freeing of
> > blocks.
> > 
> > Adapt the rmap btree to store owner offsets within each rmap record,
> > and to handle the primary key being redefined as the tuple
> > [agblk, owner, offset].  The expansion of the primary key is crucial
> > to allowing multiple owners per extent.
> > 
> > [darrick: adapt the btree ops to deal with offsets]
> > [darrick: remove init_rec_from_key]
> > [darrick: move unwritten bit to rm_offset]
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > Reviewed-by: Dave Chinner <dchinner@redhat.com>
> > Signed-off-by: Dave Chinner <david@fromorbit.com>
> > ---
> >  fs/xfs/libxfs/xfs_btree.h      |    1 
> >  fs/xfs/libxfs/xfs_rmap.c       |   96 ++++++++++++++++
> >  fs/xfs/libxfs/xfs_rmap_btree.c |  243 ++++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/libxfs/xfs_rmap_btree.h |    9 +
> >  fs/xfs/xfs_trace.h             |    3 
> >  5 files changed, 352 insertions(+)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> > index 90ea2a7..9963c48 100644
> > --- a/fs/xfs/libxfs/xfs_btree.h
> > +++ b/fs/xfs/libxfs/xfs_btree.h
> > @@ -216,6 +216,7 @@ union xfs_btree_irec {
> >  	xfs_alloc_rec_incore_t		a;
> >  	xfs_bmbt_irec_t			b;
> >  	xfs_inobt_rec_incore_t		i;
> > +	struct xfs_rmap_irec		r;
> >  };
> >  
> >  /*
> > diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
> > index d1fd471..c6a5a0b 100644
> > --- a/fs/xfs/libxfs/xfs_rmap.c
> > +++ b/fs/xfs/libxfs/xfs_rmap.c
> > @@ -37,6 +37,102 @@
> >  #include "xfs_error.h"
> >  #include "xfs_extent_busy.h"
> >  
> ...
> > +/*
> > + * Update the record referred to by cur to the value given
> > + * by [bno, len, owner, offset].
> > + * This either works (return 0) or gets an EFSCORRUPTED error.
> > + */
> > +STATIC int
> > +xfs_rmap_update(
> 
> This throws an unused warning, but I assume it will be used later.

Yes.

> > +	struct xfs_btree_cur	*cur,
> > +	struct xfs_rmap_irec	*irec)
> > +{
> > +	union xfs_btree_rec	rec;
> > +
> > +	rec.rmap.rm_startblock = cpu_to_be32(irec->rm_startblock);
> > +	rec.rmap.rm_blockcount = cpu_to_be32(irec->rm_blockcount);
> > +	rec.rmap.rm_owner = cpu_to_be64(irec->rm_owner);
> > +	rec.rmap.rm_offset = cpu_to_be64(
> > +			xfs_rmap_irec_offset_pack(irec));
> > +	return xfs_btree_update(cur, &rec);
> > +}
> > +
> ...
> >  int
> >  xfs_rmap_free(
> >  	struct xfs_trans	*tp,
> > diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
> > index 7a35c78..c50c725 100644
> > --- a/fs/xfs/libxfs/xfs_rmap_btree.c
> > +++ b/fs/xfs/libxfs/xfs_rmap_btree.c
> ...
> > @@ -43,6 +68,173 @@ xfs_rmapbt_dup_cursor(
> >  			cur->bc_private.a.agbp, cur->bc_private.a.agno);
> >  }
> >  
> > +STATIC void
> > +xfs_rmapbt_set_root(
> > +	struct xfs_btree_cur	*cur,
> > +	union xfs_btree_ptr	*ptr,
> > +	int			inc)
> > +{
> > +	struct xfs_buf		*agbp = cur->bc_private.a.agbp;
> > +	struct xfs_agf		*agf = XFS_BUF_TO_AGF(agbp);
> > +	xfs_agnumber_t		seqno = be32_to_cpu(agf->agf_seqno);
> > +	int			btnum = cur->bc_btnum;
> > +	struct xfs_perag	*pag = xfs_perag_get(cur->bc_mp, seqno);
> > +
> > +	ASSERT(ptr->s != 0);
> > +
> > +	agf->agf_roots[btnum] = ptr->s;
> > +	be32_add_cpu(&agf->agf_levels[btnum], inc);
> > +	pag->pagf_levels[btnum] += inc;
> > +	xfs_perag_put(pag);
> > +
> > +	xfs_alloc_log_agf(cur->bc_tp, agbp, XFS_AGF_ROOTS | XFS_AGF_LEVELS);
> > +}
> > +
> > +STATIC int
> > +xfs_rmapbt_alloc_block(
> > +	struct xfs_btree_cur	*cur,
> > +	union xfs_btree_ptr	*start,
> > +	union xfs_btree_ptr	*new,
> > +	int			*stat)
> > +{
> > +	int			error;
> > +	xfs_agblock_t		bno;
> > +
> > +	XFS_BTREE_TRACE_CURSOR(cur, XBT_ENTRY);
> > +
> > +	/* Allocate the new block from the freelist. If we can't, give up.  */
> > +	error = xfs_alloc_get_freelist(cur->bc_tp, cur->bc_private.a.agbp,
> > +				       &bno, 1);
> > +	if (error) {
> > +		XFS_BTREE_TRACE_CURSOR(cur, XBT_ERROR);
> > +		return error;
> > +	}
> > +
> > +	trace_xfs_rmapbt_alloc_block(cur->bc_mp, cur->bc_private.a.agno,
> > +			bno, 1);
> > +	if (bno == NULLAGBLOCK) {
> > +		XFS_BTREE_TRACE_CURSOR(cur, XBT_EXIT);
> > +		*stat = 0;
> > +		return 0;
> > +	}
> > +
> > +	xfs_extent_busy_reuse(cur->bc_mp, cur->bc_private.a.agno, bno, 1,
> > +			false);
> > +
> > +	xfs_trans_agbtree_delta(cur->bc_tp, 1);
> > +	new->s = cpu_to_be32(bno);
> > +
> > +	XFS_BTREE_TRACE_CURSOR(cur, XBT_EXIT);
> > +	*stat = 1;
> > +	return 0;
> > +}
> > +
> > +STATIC int
> > +xfs_rmapbt_free_block(
> > +	struct xfs_btree_cur	*cur,
> > +	struct xfs_buf		*bp)
> > +{
> > +	struct xfs_buf		*agbp = cur->bc_private.a.agbp;
> > +	struct xfs_agf		*agf = XFS_BUF_TO_AGF(agbp);
> > +	xfs_agblock_t		bno;
> > +	int			error;
> > +
> > +	bno = xfs_daddr_to_agbno(cur->bc_mp, XFS_BUF_ADDR(bp));
> > +	trace_xfs_rmapbt_free_block(cur->bc_mp, cur->bc_private.a.agno,
> > +			bno, 1);
> > +	error = xfs_alloc_put_freelist(cur->bc_tp, agbp, NULL, bno, 1);
> > +	if (error)
> > +		return error;
> > +
> > +	xfs_extent_busy_insert(cur->bc_tp, be32_to_cpu(agf->agf_seqno), bno, 1,
> > +			      XFS_EXTENT_BUSY_SKIP_DISCARD);
> > +	xfs_trans_agbtree_delta(cur->bc_tp, -1);
> > +
> > +	xfs_trans_binval(cur->bc_tp, bp);
> 
> This is handled in the generic btree code.

Oh, right, I noticed that this changed since I started developing
rmap/reflink.  Will change both.

> 
> > +	return 0;
> > +}
> > +
> ...
> > @@ -117,12 +309,63 @@ const struct xfs_buf_ops xfs_rmapbt_buf_ops = {
> >  	.verify_write		= xfs_rmapbt_write_verify,
> >  };
> >  
> > +#if defined(DEBUG) || defined(XFS_WARN)
> > +STATIC int
> > +xfs_rmapbt_keys_inorder(
> > +	struct xfs_btree_cur	*cur,
> > +	union xfs_btree_key	*k1,
> > +	union xfs_btree_key	*k2)
> > +{
> > +	if (be32_to_cpu(k1->rmap.rm_startblock) <
> > +	    be32_to_cpu(k2->rmap.rm_startblock))
> > +		return 1;
> > +	if (be64_to_cpu(k1->rmap.rm_owner) <
> > +	    be64_to_cpu(k2->rmap.rm_owner))
> > +		return 1;
> > +	if (XFS_RMAP_OFF(be64_to_cpu(k1->rmap.rm_offset)) <=
> > +	    XFS_RMAP_OFF(be64_to_cpu(k2->rmap.rm_offset)))
> > +		return 1;
> > +	return 0;
> 
> I might just not be familiar enough with the rmapbt ordering rules, but
> this doesn't look right. If the rm_startblock values are out of order
> (k1 startblock > k2 startblock), but either of the owner or offset
> values are in-order, then we call the keys in order. Is that intentional
> or should (k1->rmap.rm_startblock > k2->rmap.rm_startblock) always
> return 0?

Nope, you are correct about ordering rules.  This is an error.

> > +}
> > +
> > +STATIC int
> > +xfs_rmapbt_recs_inorder(
> > +	struct xfs_btree_cur	*cur,
> > +	union xfs_btree_rec	*r1,
> > +	union xfs_btree_rec	*r2)
> > +{
> > +	if (be32_to_cpu(r1->rmap.rm_startblock) <
> > +	    be32_to_cpu(r2->rmap.rm_startblock))
> > +		return 1;
> > +	if (XFS_RMAP_OFF(be64_to_cpu(r1->rmap.rm_offset)) <
> > +	    XFS_RMAP_OFF(be64_to_cpu(r2->rmap.rm_offset)))
> > +		return 1;
> > +	if (be64_to_cpu(r1->rmap.rm_owner) <=
> > +	    be64_to_cpu(r2->rmap.rm_owner))
> > +		return 1;
> > +	return 0;
> > +}
> 
> Same question here.

Same answer.  Will fix both of these and take a second look at the refcount
versions of these.

--D

> 
> Brian
> 
> > +#endif	/* DEBUG */
> > +
> >  static const struct xfs_btree_ops xfs_rmapbt_ops = {
> >  	.rec_len		= sizeof(struct xfs_rmap_rec),
> >  	.key_len		= sizeof(struct xfs_rmap_key),
> >  
> >  	.dup_cursor		= xfs_rmapbt_dup_cursor,
> > +	.set_root		= xfs_rmapbt_set_root,
> > +	.alloc_block		= xfs_rmapbt_alloc_block,
> > +	.free_block		= xfs_rmapbt_free_block,
> > +	.get_minrecs		= xfs_rmapbt_get_minrecs,
> > +	.get_maxrecs		= xfs_rmapbt_get_maxrecs,
> > +	.init_key_from_rec	= xfs_rmapbt_init_key_from_rec,
> > +	.init_rec_from_cur	= xfs_rmapbt_init_rec_from_cur,
> > +	.init_ptr_from_cur	= xfs_rmapbt_init_ptr_from_cur,
> > +	.key_diff		= xfs_rmapbt_key_diff,
> >  	.buf_ops		= &xfs_rmapbt_buf_ops,
> > +#if defined(DEBUG) || defined(XFS_WARN)
> > +	.keys_inorder		= xfs_rmapbt_keys_inorder,
> > +	.recs_inorder		= xfs_rmapbt_recs_inorder,
> > +#endif
> >  };
> >  
> >  /*
> > diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
> > index 462767f..17fa383 100644
> > --- a/fs/xfs/libxfs/xfs_rmap_btree.h
> > +++ b/fs/xfs/libxfs/xfs_rmap_btree.h
> > @@ -52,6 +52,15 @@ struct xfs_btree_cur *xfs_rmapbt_init_cursor(struct xfs_mount *mp,
> >  int xfs_rmapbt_maxrecs(struct xfs_mount *mp, int blocklen, int leaf);
> >  extern void xfs_rmapbt_compute_maxlevels(struct xfs_mount *mp);
> >  
> > +int xfs_rmap_lookup_le(struct xfs_btree_cur *cur, xfs_agblock_t bno,
> > +		xfs_extlen_t len, uint64_t owner, uint64_t offset,
> > +		unsigned int flags, int *stat);
> > +int xfs_rmap_lookup_eq(struct xfs_btree_cur *cur, xfs_agblock_t bno,
> > +		xfs_extlen_t len, uint64_t owner, uint64_t offset,
> > +		unsigned int flags, int *stat);
> > +int xfs_rmap_get_rec(struct xfs_btree_cur *cur, struct xfs_rmap_irec *irec,
> > +		int *stat);
> > +
> >  int xfs_rmap_alloc(struct xfs_trans *tp, struct xfs_buf *agbp,
> >  		   xfs_agnumber_t agno, xfs_agblock_t bno, xfs_extlen_t len,
> >  		   struct xfs_owner_info *oinfo);
> > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > index b4ee9c8..28bd991 100644
> > --- a/fs/xfs/xfs_trace.h
> > +++ b/fs/xfs/xfs_trace.h
> > @@ -2470,6 +2470,9 @@ DEFINE_RMAP_EVENT(xfs_rmap_alloc_extent);
> >  DEFINE_RMAP_EVENT(xfs_rmap_alloc_extent_done);
> >  DEFINE_RMAP_EVENT(xfs_rmap_alloc_extent_error);
> >  
> > +DEFINE_BUSY_EVENT(xfs_rmapbt_alloc_block);
> > +DEFINE_BUSY_EVENT(xfs_rmapbt_free_block);
> > +
> >  #endif /* _TRACE_XFS_H */
> >  
> >  #undef TRACE_INCLUDE_PATH
> > 


* Re: [PATCH 033/119] xfs: support overlapping intervals in the rmap btree
  2016-07-08 18:33   ` Brian Foster
@ 2016-07-09  0:14     ` Darrick J. Wong
  2016-07-09 13:25       ` Brian Foster
  0 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-07-09  0:14 UTC (permalink / raw)
  To: Brian Foster; +Cc: david, linux-fsdevel, vishal.l.verma, xfs

On Fri, Jul 08, 2016 at 02:33:55PM -0400, Brian Foster wrote:
> On Thu, Jun 16, 2016 at 06:21:24PM -0700, Darrick J. Wong wrote:
> > Now that the generic btree code supports overlapping intervals, plug
> > in the rmap btree to this functionality.  We will need it to find
> > potential left neighbors in xfs_rmap_{alloc,free} later in the patch
> > set.
> > 
> > v2: Fix bit manipulation bug when generating high key offset.
> > v3: Move unwritten bit to rm_offset.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/libxfs/xfs_rmap_btree.c |   59 +++++++++++++++++++++++++++++++++++++++-
> >  fs/xfs/libxfs/xfs_rmap_btree.h |   10 +++++--
> >  2 files changed, 66 insertions(+), 3 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
> > index c50c725..9adb930 100644
> > --- a/fs/xfs/libxfs/xfs_rmap_btree.c
> > +++ b/fs/xfs/libxfs/xfs_rmap_btree.c
> > @@ -181,6 +181,28 @@ xfs_rmapbt_init_key_from_rec(
> >  }
> >  
> >  STATIC void
> > +xfs_rmapbt_init_high_key_from_rec(
> > +	union xfs_btree_key	*key,
> > +	union xfs_btree_rec	*rec)
> > +{
> > +	__uint64_t		off;
> > +	int			adj;
> > +
> > +	adj = be32_to_cpu(rec->rmap.rm_blockcount) - 1;
> > +
> 
> Comments please. I had to stare at this for longer than I care to
> admit to grok why it is modifying values. :) One-liners along the lines
> of "shift the startblock/offset to the highest value to form the high
> key" or "don't convert offset for non-inode owners because ..." go a
> long way for those not familiar with the code.

Fair enough.

/*
 * The high key for a reverse mapping record can be computed by shifting
 * the startblock and offset to the highest value that would still map
 * to that record.  In practice this means that we add blockcount-1 to
 * the startblock for all records, and if the record is for a data/attr
 * fork mapping, we add blockcount-1 to the offset too.
 */

> With regard to rm_offset, could we just copy it unconditionally here
> (should it not be 0)?

No, because one of the rmap operations (once we get to reflink) is to
find any potential left-mappings that we could extend in order to map
in an extent (pblk, owner, lblk) by searching for (pblk-1, owner,
lblk-1).

If the extent we're trying to map is, say, (15, 128, 5) and there's an
existing mapping (10, 128, 0, len=5), we have to be able to compute
the high key of that existing mapping as (14, 128, 4).  We can't
decrement the cursor here because the next record to the left might
be (12, 150, 2, len=1).

(Making that one search reasonably quick is the reason behind the entire
overlapping btree thing.)

> > +	key->rmap.rm_startblock = rec->rmap.rm_startblock;
> > +	be32_add_cpu(&key->rmap.rm_startblock, adj);
> > +	key->rmap.rm_owner = rec->rmap.rm_owner;
> > +	key->rmap.rm_offset = rec->rmap.rm_offset;
> > +	if (XFS_RMAP_NON_INODE_OWNER(be64_to_cpu(rec->rmap.rm_owner)) ||
> > +	    XFS_RMAP_IS_BMBT_BLOCK(be64_to_cpu(rec->rmap.rm_offset)))
> > +		return;
> > +	off = be64_to_cpu(key->rmap.rm_offset);
> > +	off = (XFS_RMAP_OFF(off) + adj) | (off & ~XFS_RMAP_OFF_MASK);
> > +	key->rmap.rm_offset = cpu_to_be64(off);
> > +}
> > +
> > +STATIC void
> >  xfs_rmapbt_init_rec_from_cur(
> >  	struct xfs_btree_cur	*cur,
> >  	union xfs_btree_rec	*rec)
> > @@ -235,6 +257,38 @@ xfs_rmapbt_key_diff(
> >  	return 0;
> >  }
> >  
> > +STATIC __int64_t
> > +xfs_rmapbt_diff_two_keys(
> > +	struct xfs_btree_cur	*cur,
> > +	union xfs_btree_key	*k1,
> > +	union xfs_btree_key	*k2)
> > +{
> > +	struct xfs_rmap_key	*kp1 = &k1->rmap;
> > +	struct xfs_rmap_key	*kp2 = &k2->rmap;
> > +	__int64_t		d;
> > +	__u64			x, y;
> > +
> > +	d = (__int64_t)be32_to_cpu(kp2->rm_startblock) -
> > +		       be32_to_cpu(kp1->rm_startblock);
> > +	if (d)
> > +		return d;
> > +
> > +	x = be64_to_cpu(kp2->rm_owner);
> > +	y = be64_to_cpu(kp1->rm_owner);
> > +	if (x > y)
> > +		return 1;
> > +	else if (y > x)
> > +		return -1;
> > +
> > +	x = XFS_RMAP_OFF(be64_to_cpu(kp2->rm_offset));
> > +	y = XFS_RMAP_OFF(be64_to_cpu(kp1->rm_offset));
> > +	if (x > y)
> > +		return 1;
> > +	else if (y > x)
> > +		return -1;
> > +	return 0;
> > +}
> > +
> >  static bool
> >  xfs_rmapbt_verify(
> >  	struct xfs_buf		*bp)
> > @@ -350,6 +404,7 @@ xfs_rmapbt_recs_inorder(
> >  static const struct xfs_btree_ops xfs_rmapbt_ops = {
> >  	.rec_len		= sizeof(struct xfs_rmap_rec),
> >  	.key_len		= sizeof(struct xfs_rmap_key),
> > +	.flags			= XFS_BTREE_OPS_OVERLAPPING,
> >  
> >  	.dup_cursor		= xfs_rmapbt_dup_cursor,
> >  	.set_root		= xfs_rmapbt_set_root,
> > @@ -358,10 +413,12 @@ static const struct xfs_btree_ops xfs_rmapbt_ops = {
> >  	.get_minrecs		= xfs_rmapbt_get_minrecs,
> >  	.get_maxrecs		= xfs_rmapbt_get_maxrecs,
> >  	.init_key_from_rec	= xfs_rmapbt_init_key_from_rec,
> > +	.init_high_key_from_rec	= xfs_rmapbt_init_high_key_from_rec,
> >  	.init_rec_from_cur	= xfs_rmapbt_init_rec_from_cur,
> >  	.init_ptr_from_cur	= xfs_rmapbt_init_ptr_from_cur,
> >  	.key_diff		= xfs_rmapbt_key_diff,
> >  	.buf_ops		= &xfs_rmapbt_buf_ops,
> > +	.diff_two_keys		= xfs_rmapbt_diff_two_keys,
> >  #if defined(DEBUG) || defined(XFS_WARN)
> >  	.keys_inorder		= xfs_rmapbt_keys_inorder,
> >  	.recs_inorder		= xfs_rmapbt_recs_inorder,
> > @@ -410,7 +467,7 @@ xfs_rmapbt_maxrecs(
> >  	if (leaf)
> >  		return blocklen / sizeof(struct xfs_rmap_rec);
> >  	return blocklen /
> > -		(sizeof(struct xfs_rmap_key) + sizeof(xfs_rmap_ptr_t));
> > +		(2 * sizeof(struct xfs_rmap_key) + sizeof(xfs_rmap_ptr_t));
> 
> Same here... a one-liner comment reminding us why we have the 2x, please.

/*
 * Each btree pointer has two keys representing the lowest and highest
 * keys of all records in the subtree.
 */

> >  }
> >  
> >  /* Compute the maximum height of an rmap btree. */
> > diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
> > index 17fa383..796071c 100644
> > --- a/fs/xfs/libxfs/xfs_rmap_btree.h
> > +++ b/fs/xfs/libxfs/xfs_rmap_btree.h
> > @@ -38,12 +38,18 @@ struct xfs_mount;
> >  #define XFS_RMAP_KEY_ADDR(block, index) \
> >  	((struct xfs_rmap_key *) \
> >  		((char *)(block) + XFS_RMAP_BLOCK_LEN + \
> > -		 ((index) - 1) * sizeof(struct xfs_rmap_key)))
> > +		 ((index) - 1) * 2 * sizeof(struct xfs_rmap_key)))
> > +
> > +#define XFS_RMAP_HIGH_KEY_ADDR(block, index) \
> > +	((struct xfs_rmap_key *) \
> > +		((char *)(block) + XFS_RMAP_BLOCK_LEN + \
> > +		 sizeof(struct xfs_rmap_key) + \
> > +		 ((index) - 1) * 2 * sizeof(struct xfs_rmap_key)))
> >  
> 
> Could this just be 'XFS_RMAP_KEY_ADDR(block, index) + sizeof(struct
> xfs_rmap_key)'?

Yes.

--D

> 
> Brian
> 
> >  #define XFS_RMAP_PTR_ADDR(block, index, maxrecs) \
> >  	((xfs_rmap_ptr_t *) \
> >  		((char *)(block) + XFS_RMAP_BLOCK_LEN + \
> > -		 (maxrecs) * sizeof(struct xfs_rmap_key) + \
> > +		 (maxrecs) * 2 * sizeof(struct xfs_rmap_key) + \
> >  		 ((index) - 1) * sizeof(xfs_rmap_ptr_t)))
> >  
> >  struct xfs_btree_cur *xfs_rmapbt_init_cursor(struct xfs_mount *mp,
> > 


* Re: [PATCH 034/119] xfs: teach rmapbt to support interval queries
  2016-07-08 18:34   ` Brian Foster
@ 2016-07-09  0:16     ` Darrick J. Wong
  2016-07-09 13:25       ` Brian Foster
  0 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-07-09  0:16 UTC (permalink / raw)
  To: Brian Foster; +Cc: david, linux-fsdevel, vishal.l.verma, xfs

On Fri, Jul 08, 2016 at 02:34:03PM -0400, Brian Foster wrote:
> On Thu, Jun 16, 2016 at 06:21:30PM -0700, Darrick J. Wong wrote:
> > Now that the generic btree code supports querying all records within a
> > range of keys, use that functionality to allow us to ask for all the
> > extents mapped to a range of physical blocks.
> > 
> > v2: Move unwritten bit to rm_offset.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/libxfs/xfs_rmap.c       |   43 ++++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/libxfs/xfs_rmap_btree.h |    9 ++++++++
> >  2 files changed, 52 insertions(+)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
> > index c6a5a0b..0e1721a 100644
> > --- a/fs/xfs/libxfs/xfs_rmap.c
> > +++ b/fs/xfs/libxfs/xfs_rmap.c
> > @@ -184,3 +184,46 @@ out_error:
> >  	trace_xfs_rmap_alloc_extent_error(mp, agno, bno, len, false, oinfo);
> >  	return error;
> >  }
> > +
> > +struct xfs_rmapbt_query_range_info {
> > +	xfs_rmapbt_query_range_fn	fn;
> > +	void				*priv;
> > +};
> > +
> > +/* Format btree record and pass to our callback. */
> > +STATIC int
> > +xfs_rmapbt_query_range_helper(
> > +	struct xfs_btree_cur	*cur,
> > +	union xfs_btree_rec	*rec,
> > +	void			*priv)
> > +{
> > +	struct xfs_rmapbt_query_range_info	*query = priv;
> > +	struct xfs_rmap_irec			irec;
> > +	int					error;
> > +
> > +	error = xfs_rmapbt_btrec_to_irec(rec, &irec);
> > +	if (error)
> > +		return error;
> > +	return query->fn(cur, &irec, query->priv);
> > +}
> > +
> > +/* Find all rmaps between two keys. */
> > +int
> > +xfs_rmapbt_query_range(
> > +	struct xfs_btree_cur		*cur,
> > +	struct xfs_rmap_irec		*low_rec,
> > +	struct xfs_rmap_irec		*high_rec,
> > +	xfs_rmapbt_query_range_fn	fn,
> > +	void				*priv)
> > +{
> > +	union xfs_btree_irec		low_brec;
> > +	union xfs_btree_irec		high_brec;
> > +	struct xfs_rmapbt_query_range_info	query;
> > +
> > +	low_brec.r = *low_rec;
> > +	high_brec.r = *high_rec;
> 
> Some checks or asserts that these are actually in order couldn't hurt.
> Otherwise looks good:

Ok.  If low_rec > high_rec then you'll get no results.  I'm not sure
if that's ok or if we should explicitly return -EINVAL for that case?

--D

> 
> Reviewed-by: Brian Foster <bfoster@redhat.com>
> 
> > +	query.priv = priv;
> > +	query.fn = fn;
> > +	return xfs_btree_query_range(cur, &low_brec, &high_brec,
> > +			xfs_rmapbt_query_range_helper, &query);
> > +}
> > diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
> > index 796071c..e926c6e 100644
> > --- a/fs/xfs/libxfs/xfs_rmap_btree.h
> > +++ b/fs/xfs/libxfs/xfs_rmap_btree.h
> > @@ -74,4 +74,13 @@ int xfs_rmap_free(struct xfs_trans *tp, struct xfs_buf *agbp,
> >  		  xfs_agnumber_t agno, xfs_agblock_t bno, xfs_extlen_t len,
> >  		  struct xfs_owner_info *oinfo);
> >  
> > +typedef int (*xfs_rmapbt_query_range_fn)(
> > +	struct xfs_btree_cur	*cur,
> > +	struct xfs_rmap_irec	*rec,
> > +	void			*priv);
> > +
> > +int xfs_rmapbt_query_range(struct xfs_btree_cur *cur,
> > +		struct xfs_rmap_irec *low_rec, struct xfs_rmap_irec *high_rec,
> > +		xfs_rmapbt_query_range_fn fn, void *priv);
> > +
> >  #endif	/* __XFS_RMAP_BTREE_H__ */
> > 

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 033/119] xfs: support overlapping intervals in the rmap btree
  2016-07-09  0:14     ` Darrick J. Wong
@ 2016-07-09 13:25       ` Brian Foster
  0 siblings, 0 replies; 236+ messages in thread
From: Brian Foster @ 2016-07-09 13:25 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, vishal.l.verma, xfs

On Fri, Jul 08, 2016 at 05:14:28PM -0700, Darrick J. Wong wrote:
> On Fri, Jul 08, 2016 at 02:33:55PM -0400, Brian Foster wrote:
> > On Thu, Jun 16, 2016 at 06:21:24PM -0700, Darrick J. Wong wrote:
> > > Now that the generic btree code supports overlapping intervals, plug
> > > in the rmap btree to this functionality.  We will need it to find
> > > potential left neighbors in xfs_rmap_{alloc,free} later in the patch
> > > set.
> > > 
> > > v2: Fix bit manipulation bug when generating high key offset.
> > > v3: Move unwritten bit to rm_offset.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_rmap_btree.c |   59 +++++++++++++++++++++++++++++++++++++++-
> > >  fs/xfs/libxfs/xfs_rmap_btree.h |   10 +++++--
> > >  2 files changed, 66 insertions(+), 3 deletions(-)
> > > 
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
> > > index c50c725..9adb930 100644
> > > --- a/fs/xfs/libxfs/xfs_rmap_btree.c
> > > +++ b/fs/xfs/libxfs/xfs_rmap_btree.c
> > > @@ -181,6 +181,28 @@ xfs_rmapbt_init_key_from_rec(
> > >  }
> > >  
> > >  STATIC void
> > > +xfs_rmapbt_init_high_key_from_rec(
> > > +	union xfs_btree_key	*key,
> > > +	union xfs_btree_rec	*rec)
> > > +{
> > > +	__uint64_t		off;
> > > +	int			adj;
> > > +
> > > +	adj = be32_to_cpu(rec->rmap.rm_blockcount) - 1;
> > > +
> > 
> > Comments please. I had to stare at this longer than I care to
> > admit to grok why it is modifying values. :) One-liners along the lines
> > of "shift the startblock/offset to the highest value to form the high
> > key" or "don't convert offset for non-inode owners because ..." go a
> > long way for those not familiar with the code.
> 
> Fair enough.
> 
> /*
>  * The high key for a reverse mapping record can be computed by shifting
>  * the startblock and offset to the highest value that would still map
>  * to that record.  In practice this means that we add blockcount-1 to
>  * the startblock for all records, and if the record is for a data/attr
>  * fork mapping, we add blockcount-1 to the offset too.
>  */
> 

Sounds good. To be clear, that's even more than what I was asking for.
Just something that calls out a potentially unexpected record
transformation in this context is sufficient. E.g.,

/*
 * caller is asking for high key, transform on-disk start block and
 * offset using blockcount
 */

... (but the above is fine too :).

> > With regard to rm_offset, could we just copy it unconditionally here
> > (should it not be 0)?
> 
> No, because one of the rmap operations (once we get to reflink) is to
> find any potential left-mappings that we could extend in order to map
> in an extent (pblk, owner, lblk) by searching for (pblk-1, owner,
> lblk-1).
> 
> If the extent we're trying to map is, say, (15, 128, 5) and there's an
> existing mapping (10, 128, 0, len=5), we have to be able to compute
> the high key of that existing mapping as (14, 128, 4).  We can't
> decrement the cursor here because the next record to the left might
> be (12, 150, 2, len=1).
> 
> (Making that one search reasonably quick is the reason behind the entire
> overlapping btree thing.)
> 

Ok. Can't say I grok this at the moment, but I'll worry about it when I
have more context on the reflink bits. :)

> > > +	key->rmap.rm_startblock = rec->rmap.rm_startblock;
> > > +	be32_add_cpu(&key->rmap.rm_startblock, adj);
> > > +	key->rmap.rm_owner = rec->rmap.rm_owner;
> > > +	key->rmap.rm_offset = rec->rmap.rm_offset;
> > > +	if (XFS_RMAP_NON_INODE_OWNER(be64_to_cpu(rec->rmap.rm_owner)) ||
> > > +	    XFS_RMAP_IS_BMBT_BLOCK(be64_to_cpu(rec->rmap.rm_offset)))
> > > +		return;
> > > +	off = be64_to_cpu(key->rmap.rm_offset);
> > > +	off = (XFS_RMAP_OFF(off) + adj) | (off & ~XFS_RMAP_OFF_MASK);
> > > +	key->rmap.rm_offset = cpu_to_be64(off);
> > > +}
> > > +
> > > +STATIC void
> > >  xfs_rmapbt_init_rec_from_cur(
> > >  	struct xfs_btree_cur	*cur,
> > >  	union xfs_btree_rec	*rec)
> > > @@ -235,6 +257,38 @@ xfs_rmapbt_key_diff(
> > >  	return 0;
> > >  }
> > >  
> > > +STATIC __int64_t
> > > +xfs_rmapbt_diff_two_keys(
> > > +	struct xfs_btree_cur	*cur,
> > > +	union xfs_btree_key	*k1,
> > > +	union xfs_btree_key	*k2)
> > > +{
> > > +	struct xfs_rmap_key	*kp1 = &k1->rmap;
> > > +	struct xfs_rmap_key	*kp2 = &k2->rmap;
> > > +	__int64_t		d;
> > > +	__u64			x, y;
> > > +
> > > +	d = (__int64_t)be32_to_cpu(kp2->rm_startblock) -
> > > +		       be32_to_cpu(kp1->rm_startblock);
> > > +	if (d)
> > > +		return d;
> > > +
> > > +	x = be64_to_cpu(kp2->rm_owner);
> > > +	y = be64_to_cpu(kp1->rm_owner);
> > > +	if (x > y)
> > > +		return 1;
> > > +	else if (y > x)
> > > +		return -1;
> > > +
> > > +	x = XFS_RMAP_OFF(be64_to_cpu(kp2->rm_offset));
> > > +	y = XFS_RMAP_OFF(be64_to_cpu(kp1->rm_offset));
> > > +	if (x > y)
> > > +		return 1;
> > > +	else if (y > x)
> > > +		return -1;
> > > +	return 0;
> > > +}
> > > +
> > >  static bool
> > >  xfs_rmapbt_verify(
> > >  	struct xfs_buf		*bp)
> > > @@ -350,6 +404,7 @@ xfs_rmapbt_recs_inorder(
> > >  static const struct xfs_btree_ops xfs_rmapbt_ops = {
> > >  	.rec_len		= sizeof(struct xfs_rmap_rec),
> > >  	.key_len		= sizeof(struct xfs_rmap_key),
> > > +	.flags			= XFS_BTREE_OPS_OVERLAPPING,
> > >  
> > >  	.dup_cursor		= xfs_rmapbt_dup_cursor,
> > >  	.set_root		= xfs_rmapbt_set_root,
> > > @@ -358,10 +413,12 @@ static const struct xfs_btree_ops xfs_rmapbt_ops = {
> > >  	.get_minrecs		= xfs_rmapbt_get_minrecs,
> > >  	.get_maxrecs		= xfs_rmapbt_get_maxrecs,
> > >  	.init_key_from_rec	= xfs_rmapbt_init_key_from_rec,
> > > +	.init_high_key_from_rec	= xfs_rmapbt_init_high_key_from_rec,
> > >  	.init_rec_from_cur	= xfs_rmapbt_init_rec_from_cur,
> > >  	.init_ptr_from_cur	= xfs_rmapbt_init_ptr_from_cur,
> > >  	.key_diff		= xfs_rmapbt_key_diff,
> > >  	.buf_ops		= &xfs_rmapbt_buf_ops,
> > > +	.diff_two_keys		= xfs_rmapbt_diff_two_keys,
> > >  #if defined(DEBUG) || defined(XFS_WARN)
> > >  	.keys_inorder		= xfs_rmapbt_keys_inorder,
> > >  	.recs_inorder		= xfs_rmapbt_recs_inorder,
> > > @@ -410,7 +467,7 @@ xfs_rmapbt_maxrecs(
> > >  	if (leaf)
> > >  		return blocklen / sizeof(struct xfs_rmap_rec);
> > >  	return blocklen /
> > > -		(sizeof(struct xfs_rmap_key) + sizeof(xfs_rmap_ptr_t));
> > > +		(2 * sizeof(struct xfs_rmap_key) + sizeof(xfs_rmap_ptr_t));
> > 
> > Same here.. one-liner comment that reminds why we have the 2x please.
> 
> /*
>  * Each btree pointer has two keys representing the lowest and highest
>  * keys of all records in the subtree.
>  */
> 

I'd suggest correlating it with XFS_BTREE_OPS_OVERLAPPING support:

/* double the key size for overlapping trees (2 keys per pointer) */

Thanks!

Brian

> > >  }
> > >  
> > >  /* Compute the maximum height of an rmap btree. */
> > > diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
> > > index 17fa383..796071c 100644
> > > --- a/fs/xfs/libxfs/xfs_rmap_btree.h
> > > +++ b/fs/xfs/libxfs/xfs_rmap_btree.h
> > > @@ -38,12 +38,18 @@ struct xfs_mount;
> > >  #define XFS_RMAP_KEY_ADDR(block, index) \
> > >  	((struct xfs_rmap_key *) \
> > >  		((char *)(block) + XFS_RMAP_BLOCK_LEN + \
> > > -		 ((index) - 1) * sizeof(struct xfs_rmap_key)))
> > > +		 ((index) - 1) * 2 * sizeof(struct xfs_rmap_key)))
> > > +
> > > +#define XFS_RMAP_HIGH_KEY_ADDR(block, index) \
> > > +	((struct xfs_rmap_key *) \
> > > +		((char *)(block) + XFS_RMAP_BLOCK_LEN + \
> > > +		 sizeof(struct xfs_rmap_key) + \
> > > +		 ((index) - 1) * 2 * sizeof(struct xfs_rmap_key)))
> > >  
> > 
> > Could this just be 'XFS_RMAP_KEY_ADDR(block, index) + sizeof(struct
> > xfs_rmap_key)'?
> 
> Yes.
> 
> --D
> 
> > 
> > Brian
> > 
> > >  #define XFS_RMAP_PTR_ADDR(block, index, maxrecs) \
> > >  	((xfs_rmap_ptr_t *) \
> > >  		((char *)(block) + XFS_RMAP_BLOCK_LEN + \
> > > -		 (maxrecs) * sizeof(struct xfs_rmap_key) + \
> > > +		 (maxrecs) * 2 * sizeof(struct xfs_rmap_key) + \
> > >  		 ((index) - 1) * sizeof(xfs_rmap_ptr_t)))
> > >  
> > >  struct xfs_btree_cur *xfs_rmapbt_init_cursor(struct xfs_mount *mp,
> > > 

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 034/119] xfs: teach rmapbt to support interval queries
  2016-07-09  0:16     ` Darrick J. Wong
@ 2016-07-09 13:25       ` Brian Foster
  0 siblings, 0 replies; 236+ messages in thread
From: Brian Foster @ 2016-07-09 13:25 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, vishal.l.verma, xfs

On Fri, Jul 08, 2016 at 05:16:35PM -0700, Darrick J. Wong wrote:
> On Fri, Jul 08, 2016 at 02:34:03PM -0400, Brian Foster wrote:
> > On Thu, Jun 16, 2016 at 06:21:30PM -0700, Darrick J. Wong wrote:
> > > Now that the generic btree code supports querying all records within a
> > > range of keys, use that functionality to allow us to ask for all the
> > > extents mapped to a range of physical blocks.
> > > 
> > > v2: Move unwritten bit to rm_offset.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_rmap.c       |   43 ++++++++++++++++++++++++++++++++++++++++
> > >  fs/xfs/libxfs/xfs_rmap_btree.h |    9 ++++++++
> > >  2 files changed, 52 insertions(+)
> > > 
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
> > > index c6a5a0b..0e1721a 100644
> > > --- a/fs/xfs/libxfs/xfs_rmap.c
> > > +++ b/fs/xfs/libxfs/xfs_rmap.c
> > > @@ -184,3 +184,46 @@ out_error:
> > >  	trace_xfs_rmap_alloc_extent_error(mp, agno, bno, len, false, oinfo);
> > >  	return error;
> > >  }
> > > +
> > > +struct xfs_rmapbt_query_range_info {
> > > +	xfs_rmapbt_query_range_fn	fn;
> > > +	void				*priv;
> > > +};
> > > +
> > > +/* Format btree record and pass to our callback. */
> > > +STATIC int
> > > +xfs_rmapbt_query_range_helper(
> > > +	struct xfs_btree_cur	*cur,
> > > +	union xfs_btree_rec	*rec,
> > > +	void			*priv)
> > > +{
> > > +	struct xfs_rmapbt_query_range_info	*query = priv;
> > > +	struct xfs_rmap_irec			irec;
> > > +	int					error;
> > > +
> > > +	error = xfs_rmapbt_btrec_to_irec(rec, &irec);
> > > +	if (error)
> > > +		return error;
> > > +	return query->fn(cur, &irec, query->priv);
> > > +}
> > > +
> > > +/* Find all rmaps between two keys. */
> > > +int
> > > +xfs_rmapbt_query_range(
> > > +	struct xfs_btree_cur		*cur,
> > > +	struct xfs_rmap_irec		*low_rec,
> > > +	struct xfs_rmap_irec		*high_rec,
> > > +	xfs_rmapbt_query_range_fn	fn,
> > > +	void				*priv)
> > > +{
> > > +	union xfs_btree_irec		low_brec;
> > > +	union xfs_btree_irec		high_brec;
> > > +	struct xfs_rmapbt_query_range_info	query;
> > > +
> > > +	low_brec.r = *low_rec;
> > > +	high_brec.r = *high_rec;
> > 
> > Some checks or asserts that these are actually in order couldn't hurt.
> > Otherwise looks good:
> 
> Ok.  If low_rec > high_rec then you'll get no results.  I'm not sure
> if that's ok or if we should explicitly return -EINVAL for that case?
> 

IMO, it's more robust to short circuit this case one way or another
rather than rely on the search implementation, but I'm happy as long as
there's at least an assert. I have no strong preference really as to
whether it returns an error or 0 and an empty set. If that is truly an
unexpected usage, perhaps it's best to just return an error?

Brian

> --D
> 
> > 
> > Reviewed-by: Brian Foster <bfoster@redhat.com>
> > 
> > > +	query.priv = priv;
> > > +	query.fn = fn;
> > > +	return xfs_btree_query_range(cur, &low_brec, &high_brec,
> > > +			xfs_rmapbt_query_range_helper, &query);
> > > +}
> > > diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
> > > index 796071c..e926c6e 100644
> > > --- a/fs/xfs/libxfs/xfs_rmap_btree.h
> > > +++ b/fs/xfs/libxfs/xfs_rmap_btree.h
> > > @@ -74,4 +74,13 @@ int xfs_rmap_free(struct xfs_trans *tp, struct xfs_buf *agbp,
> > >  		  xfs_agnumber_t agno, xfs_agblock_t bno, xfs_extlen_t len,
> > >  		  struct xfs_owner_info *oinfo);
> > >  
> > > +typedef int (*xfs_rmapbt_query_range_fn)(
> > > +	struct xfs_btree_cur	*cur,
> > > +	struct xfs_rmap_irec	*rec,
> > > +	void			*priv);
> > > +
> > > +int xfs_rmapbt_query_range(struct xfs_btree_cur *cur,
> > > +		struct xfs_rmap_irec *low_rec, struct xfs_rmap_irec *high_rec,
> > > +		xfs_rmapbt_query_range_fn fn, void *priv);
> > > +
> > >  #endif	/* __XFS_RMAP_BTREE_H__ */
> > > 

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 036/119] xfs: add an extent to the rmap btree
  2016-06-17  1:21 ` [PATCH 036/119] xfs: add an extent to the rmap btree Darrick J. Wong
@ 2016-07-11 18:49   ` Brian Foster
  2016-07-11 23:01     ` Darrick J. Wong
  0 siblings, 1 reply; 236+ messages in thread
From: Brian Foster @ 2016-07-11 18:49 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, vishal.l.verma, Dave Chinner, xfs

On Thu, Jun 16, 2016 at 06:21:43PM -0700, Darrick J. Wong wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Now all the btree, free space and transaction infrastructure is in
> place, we can finally add the code to insert reverse mappings to the
> rmap btree. Freeing will be done in a separate patch, so just the
> addition operation can be focussed on here.
> 
> v2: Update alloc function to handle non-shared file data.  Isolate the
> part that makes changes from the part that initializes the rmap
> cursor; this will be useful for deferred updates.
> 
> [darrick: handle owner offsets when adding rmaps]
> [dchinner: remove remaining debug printk statements]
> [darrick: move unwritten bit to rm_offset]
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> Reviewed-by: Dave Chinner <dchinner@redhat.com>
> Signed-off-by: Dave Chinner <david@fromorbit.com>
> ---
>  fs/xfs/libxfs/xfs_rmap.c       |  225 +++++++++++++++++++++++++++++++++++++++-
>  fs/xfs/libxfs/xfs_rmap_btree.h |    1 
>  fs/xfs/xfs_trace.h             |    2 
>  3 files changed, 223 insertions(+), 5 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
> index 0e1721a..196e952 100644
> --- a/fs/xfs/libxfs/xfs_rmap.c
> +++ b/fs/xfs/libxfs/xfs_rmap.c
> @@ -159,6 +159,218 @@ out_error:
>  	return error;
>  }
>  
> +/*
> + * A mergeable rmap should have the same owner, cannot be unwritten, and

Why can't it be unwritten? According to the code, it just looks like the
unwritten state must match between extents..?

> + * must be a bmbt rmap if we're asking about a bmbt rmap.
> + */
> +static bool
> +xfs_rmap_is_mergeable(
> +	struct xfs_rmap_irec	*irec,
> +	uint64_t		owner,
> +	uint64_t		offset,
> +	xfs_extlen_t		len,
> +	unsigned int		flags)
> +{

Also, why are we passing and not using offset and len? Is this modified
later?

One more comment nit below, otherwise looks good:

Reviewed-by: Brian Foster <bfoster@redhat.com>

> +	if (irec->rm_owner == XFS_RMAP_OWN_NULL)
> +		return false;
> +	if (irec->rm_owner != owner)
> +		return false;
> +	if ((flags & XFS_RMAP_UNWRITTEN) ^
> +	    (irec->rm_flags & XFS_RMAP_UNWRITTEN))
> +		return false;
> +	if ((flags & XFS_RMAP_ATTR_FORK) ^
> +	    (irec->rm_flags & XFS_RMAP_ATTR_FORK))
> +		return false;
> +	if ((flags & XFS_RMAP_BMBT_BLOCK) ^
> +	    (irec->rm_flags & XFS_RMAP_BMBT_BLOCK))
> +		return false;
> +	return true;
> +}
> +
> +/*
> + * When we allocate a new block, the first thing we do is add a reference to
> + * the extent in the rmap btree. This takes the form of a [agbno, length,
> + * owner, offset] record.  Flags are encoded in the high bits of the offset
> + * field.
> + */
> +STATIC int
> +__xfs_rmap_alloc(
> +	struct xfs_btree_cur	*cur,
> +	xfs_agblock_t		bno,
> +	xfs_extlen_t		len,
> +	bool			unwritten,
> +	struct xfs_owner_info	*oinfo)
> +{
> +	struct xfs_mount	*mp = cur->bc_mp;
> +	struct xfs_rmap_irec	ltrec;
> +	struct xfs_rmap_irec	gtrec;
> +	int			have_gt;
> +	int			have_lt;
> +	int			error = 0;
> +	int			i;
> +	uint64_t		owner;
> +	uint64_t		offset;
> +	unsigned int		flags = 0;
> +	bool			ignore_off;
> +
> +	xfs_owner_info_unpack(oinfo, &owner, &offset, &flags);
> +	ignore_off = XFS_RMAP_NON_INODE_OWNER(owner) ||
> +			(flags & XFS_RMAP_BMBT_BLOCK);
> +	if (unwritten)
> +		flags |= XFS_RMAP_UNWRITTEN;
> +	trace_xfs_rmap_alloc_extent(mp, cur->bc_private.a.agno, bno, len,
> +			unwritten, oinfo);
> +
> +	/*
> +	 * For the initial lookup, look for and exact match or the left-adjacent

					    an

Brian

> +	 * record for our insertion point. This will also give us the record for
> +	 * start block contiguity tests.
> +	 */
> +	error = xfs_rmap_lookup_le(cur, bno, len, owner, offset, flags,
> +			&have_lt);
> +	if (error)
> +		goto out_error;
> +	XFS_WANT_CORRUPTED_GOTO(mp, have_lt == 1, out_error);
> +
> +	error = xfs_rmap_get_rec(cur, &ltrec, &have_lt);
> +	if (error)
> +		goto out_error;
> +	XFS_WANT_CORRUPTED_GOTO(mp, have_lt == 1, out_error);
> +	trace_xfs_rmap_lookup_le_range_result(cur->bc_mp,
> +			cur->bc_private.a.agno, ltrec.rm_startblock,
> +			ltrec.rm_blockcount, ltrec.rm_owner,
> +			ltrec.rm_offset, ltrec.rm_flags);
> +
> +	if (!xfs_rmap_is_mergeable(&ltrec, owner, offset, len, flags))
> +		have_lt = 0;
> +
> +	XFS_WANT_CORRUPTED_GOTO(mp,
> +		have_lt == 0 ||
> +		ltrec.rm_startblock + ltrec.rm_blockcount <= bno, out_error);
> +
> +	/*
> +	 * Increment the cursor to see if we have a right-adjacent record to our
> +	 * insertion point. This will give us the record for end block
> +	 * contiguity tests.
> +	 */
> +	error = xfs_btree_increment(cur, 0, &have_gt);
> +	if (error)
> +		goto out_error;
> +	if (have_gt) {
> +		error = xfs_rmap_get_rec(cur, &gtrec, &have_gt);
> +		if (error)
> +			goto out_error;
> +		XFS_WANT_CORRUPTED_GOTO(mp, have_gt == 1, out_error);
> +		XFS_WANT_CORRUPTED_GOTO(mp, bno + len <= gtrec.rm_startblock,
> +					out_error);
> +		trace_xfs_rmap_map_gtrec(cur->bc_mp,
> +			cur->bc_private.a.agno, gtrec.rm_startblock,
> +			gtrec.rm_blockcount, gtrec.rm_owner,
> +			gtrec.rm_offset, gtrec.rm_flags);
> +		if (!xfs_rmap_is_mergeable(&gtrec, owner, offset, len, flags))
> +			have_gt = 0;
> +	}
> +
> +	/*
> +	 * Note: cursor currently points one record to the right of ltrec, even
> +	 * if there is no record in the tree to the right.
> +	 */
> +	if (have_lt &&
> +	    ltrec.rm_startblock + ltrec.rm_blockcount == bno &&
> +	    (ignore_off || ltrec.rm_offset + ltrec.rm_blockcount == offset)) {
> +		/*
> +		 * left edge contiguous, merge into left record.
> +		 *
> +		 *       ltbno     ltlen
> +		 * orig:   |ooooooooo|
> +		 * adding:           |aaaaaaaaa|
> +		 * result: |rrrrrrrrrrrrrrrrrrr|
> +		 *                  bno       len
> +		 */
> +		ltrec.rm_blockcount += len;
> +		if (have_gt &&
> +		    bno + len == gtrec.rm_startblock &&
> +		    (ignore_off || offset + len == gtrec.rm_offset) &&
> +		    (unsigned long)ltrec.rm_blockcount + len +
> +				gtrec.rm_blockcount <= XFS_RMAP_LEN_MAX) {
> +			/*
> +			 * right edge also contiguous, delete right record
> +			 * and merge into left record.
> +			 *
> +			 *       ltbno     ltlen    gtbno     gtlen
> +			 * orig:   |ooooooooo|         |ooooooooo|
> +			 * adding:           |aaaaaaaaa|
> +			 * result: |rrrrrrrrrrrrrrrrrrrrrrrrrrrrr|
> +			 */
> +			ltrec.rm_blockcount += gtrec.rm_blockcount;
> +			trace_xfs_rmapbt_delete(mp, cur->bc_private.a.agno,
> +					gtrec.rm_startblock,
> +					gtrec.rm_blockcount,
> +					gtrec.rm_owner,
> +					gtrec.rm_offset,
> +					gtrec.rm_flags);
> +			error = xfs_btree_delete(cur, &i);
> +			if (error)
> +				goto out_error;
> +			XFS_WANT_CORRUPTED_GOTO(mp, i == 1, out_error);
> +		}
> +
> +		/* point the cursor back to the left record and update */
> +		error = xfs_btree_decrement(cur, 0, &have_gt);
> +		if (error)
> +			goto out_error;
> +		error = xfs_rmap_update(cur, &ltrec);
> +		if (error)
> +			goto out_error;
> +	} else if (have_gt &&
> +		   bno + len == gtrec.rm_startblock &&
> +		   (ignore_off || offset + len == gtrec.rm_offset)) {
> +		/*
> +		 * right edge contiguous, merge into right record.
> +		 *
> +		 *                 gtbno     gtlen
> +		 * Orig:             |ooooooooo|
> +		 * adding: |aaaaaaaaa|
> +		 * Result: |rrrrrrrrrrrrrrrrrrr|
> +		 *        bno       len
> +		 */
> +		gtrec.rm_startblock = bno;
> +		gtrec.rm_blockcount += len;
> +		if (!ignore_off)
> +			gtrec.rm_offset = offset;
> +		error = xfs_rmap_update(cur, &gtrec);
> +		if (error)
> +			goto out_error;
> +	} else {
> +		/*
> +		 * no contiguous edge with identical owner, insert
> +		 * new record at current cursor position.
> +		 */
> +		cur->bc_rec.r.rm_startblock = bno;
> +		cur->bc_rec.r.rm_blockcount = len;
> +		cur->bc_rec.r.rm_owner = owner;
> +		cur->bc_rec.r.rm_offset = offset;
> +		cur->bc_rec.r.rm_flags = flags;
> +		trace_xfs_rmapbt_insert(mp, cur->bc_private.a.agno, bno, len,
> +			owner, offset, flags);
> +		error = xfs_btree_insert(cur, &i);
> +		if (error)
> +			goto out_error;
> +		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, out_error);
> +	}
> +
> +	trace_xfs_rmap_alloc_extent_done(mp, cur->bc_private.a.agno, bno, len,
> +			unwritten, oinfo);
> +out_error:
> +	if (error)
> +		trace_xfs_rmap_alloc_extent_error(mp, cur->bc_private.a.agno,
> +				bno, len, unwritten, oinfo);
> +	return error;
> +}
> +
> +/*
> + * Add a reference to an extent in the rmap btree.
> + */
>  int
>  xfs_rmap_alloc(
>  	struct xfs_trans	*tp,
> @@ -169,19 +381,22 @@ xfs_rmap_alloc(
>  	struct xfs_owner_info	*oinfo)
>  {
>  	struct xfs_mount	*mp = tp->t_mountp;
> -	int			error = 0;
> +	struct xfs_btree_cur	*cur;
> +	int			error;
>  
>  	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
>  		return 0;
>  
> -	trace_xfs_rmap_alloc_extent(mp, agno, bno, len, false, oinfo);
> -	if (1)
> +	cur = xfs_rmapbt_init_cursor(mp, tp, agbp, agno);
> +	error = __xfs_rmap_alloc(cur, bno, len, false, oinfo);
> +	if (error)
>  		goto out_error;
> -	trace_xfs_rmap_alloc_extent_done(mp, agno, bno, len, false, oinfo);
> +
> +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
>  	return 0;
>  
>  out_error:
> -	trace_xfs_rmap_alloc_extent_error(mp, agno, bno, len, false, oinfo);
> +	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
>  	return error;
>  }
>  
> diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
> index e926c6e..9d92da5 100644
> --- a/fs/xfs/libxfs/xfs_rmap_btree.h
> +++ b/fs/xfs/libxfs/xfs_rmap_btree.h
> @@ -67,6 +67,7 @@ int xfs_rmap_lookup_eq(struct xfs_btree_cur *cur, xfs_agblock_t bno,
>  int xfs_rmap_get_rec(struct xfs_btree_cur *cur, struct xfs_rmap_irec *irec,
>  		int *stat);
>  
> +/* functions for updating the rmapbt for bmbt blocks and AG btree blocks */
>  int xfs_rmap_alloc(struct xfs_trans *tp, struct xfs_buf *agbp,
>  		   xfs_agnumber_t agno, xfs_agblock_t bno, xfs_extlen_t len,
>  		   struct xfs_owner_info *oinfo);
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index 6daafaf..3ebceb0 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -2549,6 +2549,8 @@ DEFINE_RMAPBT_EVENT(xfs_rmapbt_delete);
>  DEFINE_AG_ERROR_EVENT(xfs_rmapbt_insert_error);
>  DEFINE_AG_ERROR_EVENT(xfs_rmapbt_delete_error);
>  DEFINE_AG_ERROR_EVENT(xfs_rmapbt_update_error);
> +DEFINE_RMAPBT_EVENT(xfs_rmap_lookup_le_range_result);
> +DEFINE_RMAPBT_EVENT(xfs_rmap_map_gtrec);
>  
>  #endif /* _TRACE_XFS_H */
>  
> 

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 037/119] xfs: remove an extent from the rmap btree
  2016-06-17  1:21 ` [PATCH 037/119] xfs: remove an extent from " Darrick J. Wong
@ 2016-07-11 18:49   ` Brian Foster
  0 siblings, 0 replies; 236+ messages in thread
From: Brian Foster @ 2016-07-11 18:49 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, vishal.l.verma, Dave Chinner, xfs

On Thu, Jun 16, 2016 at 06:21:49PM -0700, Darrick J. Wong wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Now that we have records in the rmap btree, we need to remove them
> when extents are freed. This needs to find the relevant record in
> the btree and remove/trim/split it accordingly.
> 
> v2: Update the free function to deal with non-shared file data, and
> isolate the part that does the rmap update from the part that deals
> with cursors.  This will be useful for deferred ops.
> 
> [darrick.wong@oracle.com: make rmap routines handle the enlarged keyspace]
> [dchinner: remove remaining unused debug printks]
> [darrick: fix a bug when growfs in an AG with an rmap ending at EOFS]
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> Reviewed-by: Dave Chinner <dchinner@redhat.com>
> Signed-off-by: Dave Chinner <david@fromorbit.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/libxfs/xfs_rmap.c |  220 +++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 215 insertions(+), 5 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
> index 196e952..1043c63 100644
> --- a/fs/xfs/libxfs/xfs_rmap.c
> +++ b/fs/xfs/libxfs/xfs_rmap.c
> @@ -133,6 +133,212 @@ xfs_rmap_get_rec(
>  	return xfs_rmapbt_btrec_to_irec(rec, irec);
>  }
>  
> +/*
> + * Find the extent in the rmap btree and remove it.
> + *
> + * The record we find should always be an exact match for the extent that we're
> + * looking for, since we insert them into the btree without modification.
> + *
> + * Special Case #1: when growing the filesystem, we "free" an extent when
> + * growing the last AG. This extent is new space and so it is not tracked as
> + * used space in the btree. The growfs code will pass in an owner of
> + * XFS_RMAP_OWN_NULL to indicate that it expects there to be no owner of this
> + * extent. We verify that: the extent lookup must result in a record that does
> + * not overlap.
> + *
> + * Special Case #2: EFIs do not record the owner of the extent, so when
> + * recovering EFIs from the log we pass in XFS_RMAP_OWN_UNKNOWN to tell the rmap
> + * btree to ignore the owner (i.e. wildcard match) so we don't trigger
> + * corruption checks during log recovery.
> + */
> +STATIC int
> +__xfs_rmap_free(
> +	struct xfs_btree_cur	*cur,
> +	xfs_agblock_t		bno,
> +	xfs_extlen_t		len,
> +	bool			unwritten,
> +	struct xfs_owner_info	*oinfo)
> +{
> +	struct xfs_mount	*mp = cur->bc_mp;
> +	struct xfs_rmap_irec	ltrec;
> +	uint64_t		ltoff;
> +	int			error = 0;
> +	int			i;
> +	uint64_t		owner;
> +	uint64_t		offset;
> +	unsigned int		flags;
> +	bool			ignore_off;
> +
> +	xfs_owner_info_unpack(oinfo, &owner, &offset, &flags);
> +	ignore_off = XFS_RMAP_NON_INODE_OWNER(owner) ||
> +			(flags & XFS_RMAP_BMBT_BLOCK);
> +	if (unwritten)
> +		flags |= XFS_RMAP_UNWRITTEN;
> +	trace_xfs_rmap_free_extent(mp, cur->bc_private.a.agno, bno, len,
> +			unwritten, oinfo);
> +
> +	/*
> +	 * We should always have a left record because there's a static record
> +	 * for the AG headers at rm_startblock == 0 created by mkfs/growfs that
> +	 * will not ever be removed from the tree.
> +	 */
> +	error = xfs_rmap_lookup_le(cur, bno, len, owner, offset, flags, &i);
> +	if (error)
> +		goto out_error;
> +	XFS_WANT_CORRUPTED_GOTO(mp, i == 1, out_error);
> +
> +	error = xfs_rmap_get_rec(cur, &ltrec, &i);
> +	if (error)
> +		goto out_error;
> +	XFS_WANT_CORRUPTED_GOTO(mp, i == 1, out_error);
> +	trace_xfs_rmap_lookup_le_range_result(cur->bc_mp,
> +			cur->bc_private.a.agno, ltrec.rm_startblock,
> +			ltrec.rm_blockcount, ltrec.rm_owner,
> +			ltrec.rm_offset, ltrec.rm_flags);
> +	ltoff = ltrec.rm_offset;
> +
> +	/*
> +	 * For growfs, the incoming extent must be beyond the left record we
> +	 * just found as it is new space and won't be used by anyone. This is
> +	 * just a corruption check as we don't actually do anything with this
> +	 * extent.  Note that we need to use >= instead of > because it might
> +	 * be the case that the "left" extent goes all the way to EOFS.
> +	 */
> +	if (owner == XFS_RMAP_OWN_NULL) {
> +		XFS_WANT_CORRUPTED_GOTO(mp, bno >= ltrec.rm_startblock +
> +						ltrec.rm_blockcount, out_error);
> +		goto out_done;
> +	}
> +
> +	/* Make sure the unwritten flag matches. */
> +	XFS_WANT_CORRUPTED_GOTO(mp, (flags & XFS_RMAP_UNWRITTEN) ==
> +			(ltrec.rm_flags & XFS_RMAP_UNWRITTEN), out_error);
> +
> +	/* Make sure the extent we found covers the entire freeing range. */
> +	XFS_WANT_CORRUPTED_GOTO(mp, ltrec.rm_startblock <= bno &&
> +		ltrec.rm_startblock + ltrec.rm_blockcount >=
> +		bno + len, out_error);
> +
> +	/* Make sure the owner matches what we expect to find in the tree. */
> +	XFS_WANT_CORRUPTED_GOTO(mp, owner == ltrec.rm_owner ||
> +				    XFS_RMAP_NON_INODE_OWNER(owner), out_error);
> +
> +	/* Check the offset, if necessary. */
> +	if (!XFS_RMAP_NON_INODE_OWNER(owner)) {
> +		if (flags & XFS_RMAP_BMBT_BLOCK) {
> +			XFS_WANT_CORRUPTED_GOTO(mp,
> +					ltrec.rm_flags & XFS_RMAP_BMBT_BLOCK,
> +					out_error);
> +		} else {
> +			XFS_WANT_CORRUPTED_GOTO(mp,
> +					ltrec.rm_offset <= offset, out_error);
> +			XFS_WANT_CORRUPTED_GOTO(mp,
> +					ltoff + ltrec.rm_blockcount >= offset + len,
> +					out_error);
> +		}
> +	}
> +
> +	if (ltrec.rm_startblock == bno && ltrec.rm_blockcount == len) {
> +		/* exact match, simply remove the record from rmap tree */
> +		trace_xfs_rmapbt_delete(mp, cur->bc_private.a.agno,
> +				ltrec.rm_startblock, ltrec.rm_blockcount,
> +				ltrec.rm_owner, ltrec.rm_offset,
> +				ltrec.rm_flags);
> +		error = xfs_btree_delete(cur, &i);
> +		if (error)
> +			goto out_error;
> +		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, out_error);
> +	} else if (ltrec.rm_startblock == bno) {
> +		/*
> +		 * overlap left hand side of extent: move the start, trim the
> +		 * length and update the current record.
> +		 *
> +		 *       ltbno                ltlen
> +		 * Orig:    |oooooooooooooooooooo|
> +		 * Freeing: |fffffffff|
> +		 * Result:            |rrrrrrrrrr|
> +		 *         bno       len
> +		 */
> +		ltrec.rm_startblock += len;
> +		ltrec.rm_blockcount -= len;
> +		if (!ignore_off)
> +			ltrec.rm_offset += len;
> +		error = xfs_rmap_update(cur, &ltrec);
> +		if (error)
> +			goto out_error;
> +	} else if (ltrec.rm_startblock + ltrec.rm_blockcount == bno + len) {
> +		/*
> +		 * overlap right hand side of extent: trim the length and update
> +		 * the current record.
> +		 *
> +		 *       ltbno                ltlen
> +		 * Orig:    |oooooooooooooooooooo|
> +		 * Freeing:            |fffffffff|
> +		 * Result:  |rrrrrrrrrr|
> +		 *                    bno       len
> +		 */
> +		ltrec.rm_blockcount -= len;
> +		error = xfs_rmap_update(cur, &ltrec);
> +		if (error)
> +			goto out_error;
> +	} else {
> +
> +		/*
> +		 * overlap middle of extent: trim the length of the existing
> +		 * record to the length of the new left-extent size, increment
> +		 * the insertion position so we can insert a new record
> +		 * containing the remaining right-extent space.
> +		 *
> +		 *       ltbno                ltlen
> +		 * Orig:    |oooooooooooooooooooo|
> +		 * Freeing:       |fffffffff|
> +		 * Result:  |rrrrr|         |rrrr|
> +		 *               bno       len
> +		 */
> +		xfs_extlen_t	orig_len = ltrec.rm_blockcount;
> +
> +		ltrec.rm_blockcount = bno - ltrec.rm_startblock;
> +		error = xfs_rmap_update(cur, &ltrec);
> +		if (error)
> +			goto out_error;
> +
> +		error = xfs_btree_increment(cur, 0, &i);
> +		if (error)
> +			goto out_error;
> +
> +		cur->bc_rec.r.rm_startblock = bno + len;
> +		cur->bc_rec.r.rm_blockcount = orig_len - len -
> +						     ltrec.rm_blockcount;
> +		cur->bc_rec.r.rm_owner = ltrec.rm_owner;
> +		if (ignore_off)
> +			cur->bc_rec.r.rm_offset = 0;
> +		else
> +			cur->bc_rec.r.rm_offset = offset + len;
> +		cur->bc_rec.r.rm_flags = flags;
> +		trace_xfs_rmapbt_insert(mp, cur->bc_private.a.agno,
> +				cur->bc_rec.r.rm_startblock,
> +				cur->bc_rec.r.rm_blockcount,
> +				cur->bc_rec.r.rm_owner,
> +				cur->bc_rec.r.rm_offset,
> +				cur->bc_rec.r.rm_flags);
> +		error = xfs_btree_insert(cur, &i);
> +		if (error)
> +			goto out_error;
> +	}
> +
> +out_done:
> +	trace_xfs_rmap_free_extent_done(mp, cur->bc_private.a.agno, bno, len,
> +			unwritten, oinfo);
> +out_error:
> +	if (error)
> +		trace_xfs_rmap_free_extent_error(mp, cur->bc_private.a.agno,
> +				bno, len, unwritten, oinfo);
> +	return error;
> +}
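The four cases above (exact match, left-edge overlap, right-edge overlap, and middle split) reduce to simple interval arithmetic. As a standalone sketch — plain C with simplified, hypothetical types, not the actual XFS structures or btree cursor machinery — the trim logic looks like this:

```c
#include <assert.h>

/* Simplified stand-in for struct xfs_rmap_irec (illustrative only). */
struct irec {
	unsigned long	start;		/* rm_startblock */
	unsigned long	len;		/* rm_blockcount */
};

/*
 * Remove [bno, bno+len) from *rec.  Mirrors the four cases in
 * __xfs_rmap_free: returns the number of records that remain
 * (0 = deleted, 1 = trimmed, 2 = split), writing any right-hand
 * remainder to *right.
 */
static int unmap_extent(struct irec *rec, unsigned long bno,
			unsigned long len, struct irec *right)
{
	unsigned long rec_end = rec->start + rec->len;

	/* The freed range must lie entirely within the record. */
	assert(rec->start <= bno && bno + len <= rec_end);

	if (rec->start == bno && rec->len == len)
		return 0;			/* exact match: delete */

	if (rec->start == bno) {		/* overlap left edge */
		rec->start += len;
		rec->len -= len;
		return 1;
	}

	if (rec_end == bno + len) {		/* overlap right edge */
		rec->len -= len;
		return 1;
	}

	/* middle: trim the left piece, emit the right-hand remainder */
	right->start = bno + len;
	right->len = rec_end - (bno + len);
	rec->len = bno - rec->start;
	return 2;
}
```

In the kernel code the "2 records remain" outcome is what drives the xfs_btree_increment() plus xfs_btree_insert() pair in the middle-overlap branch.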
> +
> +/*
> + * Remove a reference to an extent in the rmap btree.
> + */
>  int
>  xfs_rmap_free(
>  	struct xfs_trans	*tp,
> @@ -143,19 +349,23 @@ xfs_rmap_free(
>  	struct xfs_owner_info	*oinfo)
>  {
>  	struct xfs_mount	*mp = tp->t_mountp;
> -	int			error = 0;
> +	struct xfs_btree_cur	*cur;
> +	int			error;
>  
>  	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
>  		return 0;
>  
> -	trace_xfs_rmap_free_extent(mp, agno, bno, len, false, oinfo);
> -	if (1)
> +	cur = xfs_rmapbt_init_cursor(mp, tp, agbp, agno);
> +
> +	error = __xfs_rmap_free(cur, bno, len, false, oinfo);
> +	if (error)
>  		goto out_error;
> -	trace_xfs_rmap_free_extent_done(mp, agno, bno, len, false, oinfo);
> +
> +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
>  	return 0;
>  
>  out_error:
> -	trace_xfs_rmap_free_extent_error(mp, agno, bno, len, false, oinfo);
> +	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
>  	return error;
>  }
>  
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 036/119] xfs: add an extent to the rmap btree
  2016-07-11 18:49   ` Brian Foster
@ 2016-07-11 23:01     ` Darrick J. Wong
  0 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-07-11 23:01 UTC (permalink / raw)
  To: Brian Foster; +Cc: david, linux-fsdevel, vishal.l.verma, Dave Chinner, xfs

On Mon, Jul 11, 2016 at 02:49:09PM -0400, Brian Foster wrote:
> On Thu, Jun 16, 2016 at 06:21:43PM -0700, Darrick J. Wong wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Now all the btree, free space and transaction infrastructure is in
> > place, we can finally add the code to insert reverse mappings to the
> > rmap btree. Freeing will be done in a separate patch, so just the
> > addition operation can be focussed on here.
> > 
> > v2: Update alloc function to handle non-shared file data.  Isolate the
> > part that makes changes from the part that initializes the rmap
> > cursor; this will be useful for deferred updates.
> > 
> > [darrick: handle owner offsets when adding rmaps]
> > [dchinner: remove remaining debug printk statements]
> > [darrick: move unwritten bit to rm_offset]
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > Reviewed-by: Dave Chinner <dchinner@redhat.com>
> > Signed-off-by: Dave Chinner <david@fromorbit.com>
> > ---
> >  fs/xfs/libxfs/xfs_rmap.c       |  225 +++++++++++++++++++++++++++++++++++++++-
> >  fs/xfs/libxfs/xfs_rmap_btree.h |    1 
> >  fs/xfs/xfs_trace.h             |    2 
> >  3 files changed, 223 insertions(+), 5 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
> > index 0e1721a..196e952 100644
> > --- a/fs/xfs/libxfs/xfs_rmap.c
> > +++ b/fs/xfs/libxfs/xfs_rmap.c
> > @@ -159,6 +159,218 @@ out_error:
> >  	return error;
> >  }
> >  
> > +/*
> > + * A mergeable rmap should have the same owner, cannot be unwritten, and
> 
> Why can't it be unwritten? According to the code, it just looks like the
> unwritten state must match between extents..?

Correct.  The comment needs to be updated.

/*
 * A mergeable rmap must have the same owner and the same values for
 * the unwritten, attr_fork, and bmbt flags.  The startblock and
 * offsets are checked separately.
 */

> 
> > + * must be a bmbt rmap if we're asking about a bmbt rmap.
> > + */
> > +static bool
> > +xfs_rmap_is_mergeable(
> > +	struct xfs_rmap_irec	*irec,
> > +	uint64_t		owner,
> > +	uint64_t		offset,
> > +	xfs_extlen_t		len,
> > +	unsigned int		flags)
> > +{
> 
> Also, why are we passing and not using offset and len? Is this modified
> later?

Actually... offset and len are unnecessary.  len falls out in a later
patch, so I will eliminate both when I clean this up.

> One more comment nit below, otherwise looks good:
> 
> Reviewed-by: Brian Foster <bfoster@redhat.com>
> 
> > +	if (irec->rm_owner == XFS_RMAP_OWN_NULL)
> > +		return false;
> > +	if (irec->rm_owner != owner)
> > +		return false;
> > +	if ((flags & XFS_RMAP_UNWRITTEN) ^
> > +	    (irec->rm_flags & XFS_RMAP_UNWRITTEN))
> > +		return false;
> > +	if ((flags & XFS_RMAP_ATTR_FORK) ^
> > +	    (irec->rm_flags & XFS_RMAP_ATTR_FORK))
> > +		return false;
> > +	if ((flags & XFS_RMAP_BMBT_BLOCK) ^
> > +	    (irec->rm_flags & XFS_RMAP_BMBT_BLOCK))
> > +		return false;
> > +	return true;
> > +}
> > +
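The XOR tests above only require each flag to have the same value on both records — set or clear, but matching. A minimal standalone sketch of that rule (the flag bit values here are hypothetical, not the real XFS_RMAP_* definitions):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits; the real XFS_RMAP_* values are defined elsewhere. */
#define RM_UNWRITTEN	(1U << 0)
#define RM_ATTR_FORK	(1U << 1)
#define RM_BMBT_BLOCK	(1U << 2)

/*
 * Two rmap records can merge only when all three flags agree: XORing
 * the flag words and masking leaves a nonzero value iff any flag
 * differs between the two sides.
 */
static bool flags_mergeable(unsigned int a, unsigned int b)
{
	unsigned int mask = RM_UNWRITTEN | RM_ATTR_FORK | RM_BMBT_BLOCK;

	return ((a ^ b) & mask) == 0;
}
```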
> > +/*
> > + * When we allocate a new block, the first thing we do is add a reference to
> > + * the extent in the rmap btree. This takes the form of a [agbno, length,
> > + * owner, offset] record.  Flags are encoded in the high bits of the offset
> > + * field.
> > + */
> > +STATIC int
> > +__xfs_rmap_alloc(
> > +	struct xfs_btree_cur	*cur,
> > +	xfs_agblock_t		bno,
> > +	xfs_extlen_t		len,
> > +	bool			unwritten,
> > +	struct xfs_owner_info	*oinfo)
> > +{
> > +	struct xfs_mount	*mp = cur->bc_mp;
> > +	struct xfs_rmap_irec	ltrec;
> > +	struct xfs_rmap_irec	gtrec;
> > +	int			have_gt;
> > +	int			have_lt;
> > +	int			error = 0;
> > +	int			i;
> > +	uint64_t		owner;
> > +	uint64_t		offset;
> > +	unsigned int		flags = 0;
> > +	bool			ignore_off;
> > +
> > +	xfs_owner_info_unpack(oinfo, &owner, &offset, &flags);
> > +	ignore_off = XFS_RMAP_NON_INODE_OWNER(owner) ||
> > +			(flags & XFS_RMAP_BMBT_BLOCK);
> > +	if (unwritten)
> > +		flags |= XFS_RMAP_UNWRITTEN;
> > +	trace_xfs_rmap_alloc_extent(mp, cur->bc_private.a.agno, bno, len,
> > +			unwritten, oinfo);
> > +
> > +	/*
> > +	 * For the initial lookup, look for and exact match or the left-adjacent
> 
> 					    an

Noted.  Thanks for catching these!

--D

> 
> Brian
> 
> > +	 * record for our insertion point. This will also give us the record for
> > +	 * start block contiguity tests.
> > +	 */
> > +	error = xfs_rmap_lookup_le(cur, bno, len, owner, offset, flags,
> > +			&have_lt);
> > +	if (error)
> > +		goto out_error;
> > +	XFS_WANT_CORRUPTED_GOTO(mp, have_lt == 1, out_error);
> > +
> > +	error = xfs_rmap_get_rec(cur, &ltrec, &have_lt);
> > +	if (error)
> > +		goto out_error;
> > +	XFS_WANT_CORRUPTED_GOTO(mp, have_lt == 1, out_error);
> > +	trace_xfs_rmap_lookup_le_range_result(cur->bc_mp,
> > +			cur->bc_private.a.agno, ltrec.rm_startblock,
> > +			ltrec.rm_blockcount, ltrec.rm_owner,
> > +			ltrec.rm_offset, ltrec.rm_flags);
> > +
> > +	if (!xfs_rmap_is_mergeable(&ltrec, owner, offset, len, flags))
> > +		have_lt = 0;
> > +
> > +	XFS_WANT_CORRUPTED_GOTO(mp,
> > +		have_lt == 0 ||
> > +		ltrec.rm_startblock + ltrec.rm_blockcount <= bno, out_error);
> > +
> > +	/*
> > +	 * Increment the cursor to see if we have a right-adjacent record to our
> > +	 * insertion point. This will give us the record for end block
> > +	 * contiguity tests.
> > +	 */
> > +	error = xfs_btree_increment(cur, 0, &have_gt);
> > +	if (error)
> > +		goto out_error;
> > +	if (have_gt) {
> > +		error = xfs_rmap_get_rec(cur, &gtrec, &have_gt);
> > +		if (error)
> > +			goto out_error;
> > +		XFS_WANT_CORRUPTED_GOTO(mp, have_gt == 1, out_error);
> > +		XFS_WANT_CORRUPTED_GOTO(mp, bno + len <= gtrec.rm_startblock,
> > +					out_error);
> > +		trace_xfs_rmap_map_gtrec(cur->bc_mp,
> > +			cur->bc_private.a.agno, gtrec.rm_startblock,
> > +			gtrec.rm_blockcount, gtrec.rm_owner,
> > +			gtrec.rm_offset, gtrec.rm_flags);
> > +		if (!xfs_rmap_is_mergeable(&gtrec, owner, offset, len, flags))
> > +			have_gt = 0;
> > +	}
> > +
> > +	/*
> > +	 * Note: cursor currently points one record to the right of ltrec, even
> > +	 * if there is no record in the tree to the right.
> > +	 */
> > +	if (have_lt &&
> > +	    ltrec.rm_startblock + ltrec.rm_blockcount == bno &&
> > +	    (ignore_off || ltrec.rm_offset + ltrec.rm_blockcount == offset)) {
> > +		/*
> > +		 * left edge contiguous, merge into left record.
> > +		 *
> > +		 *       ltbno     ltlen
> > +		 * orig:   |ooooooooo|
> > +		 * adding:           |aaaaaaaaa|
> > +		 * result: |rrrrrrrrrrrrrrrrrrr|
> > +		 *                  bno       len
> > +		 */
> > +		ltrec.rm_blockcount += len;
> > +		if (have_gt &&
> > +		    bno + len == gtrec.rm_startblock &&
> > +		    (ignore_off || offset + len == gtrec.rm_offset) &&
> > +		    (unsigned long)ltrec.rm_blockcount + len +
> > +				gtrec.rm_blockcount <= XFS_RMAP_LEN_MAX) {
> > +			/*
> > +			 * right edge also contiguous, delete right record
> > +			 * and merge into left record.
> > +			 *
> > +			 *       ltbno     ltlen    gtbno     gtlen
> > +			 * orig:   |ooooooooo|         |ooooooooo|
> > +			 * adding:           |aaaaaaaaa|
> > +			 * result: |rrrrrrrrrrrrrrrrrrrrrrrrrrrrr|
> > +			 */
> > +			ltrec.rm_blockcount += gtrec.rm_blockcount;
> > +			trace_xfs_rmapbt_delete(mp, cur->bc_private.a.agno,
> > +					gtrec.rm_startblock,
> > +					gtrec.rm_blockcount,
> > +					gtrec.rm_owner,
> > +					gtrec.rm_offset,
> > +					gtrec.rm_flags);
> > +			error = xfs_btree_delete(cur, &i);
> > +			if (error)
> > +				goto out_error;
> > +			XFS_WANT_CORRUPTED_GOTO(mp, i == 1, out_error);
> > +		}
> > +
> > +		/* point the cursor back to the left record and update */
> > +		error = xfs_btree_decrement(cur, 0, &have_gt);
> > +		if (error)
> > +			goto out_error;
> > +		error = xfs_rmap_update(cur, &ltrec);
> > +		if (error)
> > +			goto out_error;
> > +	} else if (have_gt &&
> > +		   bno + len == gtrec.rm_startblock &&
> > +		   (ignore_off || offset + len == gtrec.rm_offset)) {
> > +		/*
> > +		 * right edge contiguous, merge into right record.
> > +		 *
> > +		 *                 gtbno     gtlen
> > +		 * Orig:             |ooooooooo|
> > +		 * adding: |aaaaaaaaa|
> > +		 * Result: |rrrrrrrrrrrrrrrrrrr|
> > +		 *        bno       len
> > +		 */
> > +		gtrec.rm_startblock = bno;
> > +		gtrec.rm_blockcount += len;
> > +		if (!ignore_off)
> > +			gtrec.rm_offset = offset;
> > +		error = xfs_rmap_update(cur, &gtrec);
> > +		if (error)
> > +			goto out_error;
> > +	} else {
> > +		/*
> > +		 * no contiguous edge with identical owner, insert
> > +		 * new record at current cursor position.
> > +		 */
> > +		cur->bc_rec.r.rm_startblock = bno;
> > +		cur->bc_rec.r.rm_blockcount = len;
> > +		cur->bc_rec.r.rm_owner = owner;
> > +		cur->bc_rec.r.rm_offset = offset;
> > +		cur->bc_rec.r.rm_flags = flags;
> > +		trace_xfs_rmapbt_insert(mp, cur->bc_private.a.agno, bno, len,
> > +			owner, offset, flags);
> > +		error = xfs_btree_insert(cur, &i);
> > +		if (error)
> > +			goto out_error;
> > +		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, out_error);
> > +	}
> > +
> > +	trace_xfs_rmap_alloc_extent_done(mp, cur->bc_private.a.agno, bno, len,
> > +			unwritten, oinfo);
> > +out_error:
> > +	if (error)
> > +		trace_xfs_rmap_alloc_extent_error(mp, cur->bc_private.a.agno,
> > +				bno, len, unwritten, oinfo);
> > +	return error;
> > +}
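As in the free path, the contiguity checks above are interval arithmetic: the left neighbour must end exactly where the new extent begins, and the new extent must end exactly where the right neighbour begins, with the file offsets lining up the same way (unless offsets are ignored for non-inode or bmbt owners). A toy model of just those two predicates — illustrative types only, not XFS code:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified rmap record: AG block, length, file offset. */
struct rec { unsigned long start, len, off; };

/* Left-contiguous: the existing record ends exactly where the new
 * extent starts, and the offsets advance by the same amount. */
static bool left_contig(const struct rec *lt, unsigned long bno,
			unsigned long off)
{
	return lt->start + lt->len == bno && lt->off + lt->len == off;
}

/* Right-contiguous: the new extent ends exactly where the next
 * record starts, with matching offsets. */
static bool right_contig(const struct rec *gt, unsigned long bno,
			 unsigned long len, unsigned long off)
{
	return bno + len == gt->start && off + len == gt->off;
}
```

When both predicates hold, the kernel code merges all three pieces into the left record and deletes the right one; when only one holds, it extends that neighbour; otherwise it inserts a fresh record.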
> > +
> > +/*
> > + * Add a reference to an extent in the rmap btree.
> > + */
> >  int
> >  xfs_rmap_alloc(
> >  	struct xfs_trans	*tp,
> > @@ -169,19 +381,22 @@ xfs_rmap_alloc(
> >  	struct xfs_owner_info	*oinfo)
> >  {
> >  	struct xfs_mount	*mp = tp->t_mountp;
> > -	int			error = 0;
> > +	struct xfs_btree_cur	*cur;
> > +	int			error;
> >  
> >  	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
> >  		return 0;
> >  
> > -	trace_xfs_rmap_alloc_extent(mp, agno, bno, len, false, oinfo);
> > -	if (1)
> > +	cur = xfs_rmapbt_init_cursor(mp, tp, agbp, agno);
> > +	error = __xfs_rmap_alloc(cur, bno, len, false, oinfo);
> > +	if (error)
> >  		goto out_error;
> > -	trace_xfs_rmap_alloc_extent_done(mp, agno, bno, len, false, oinfo);
> > +
> > +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
> >  	return 0;
> >  
> >  out_error:
> > -	trace_xfs_rmap_alloc_extent_error(mp, agno, bno, len, false, oinfo);
> > +	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
> >  	return error;
> >  }
> >  
> > diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
> > index e926c6e..9d92da5 100644
> > --- a/fs/xfs/libxfs/xfs_rmap_btree.h
> > +++ b/fs/xfs/libxfs/xfs_rmap_btree.h
> > @@ -67,6 +67,7 @@ int xfs_rmap_lookup_eq(struct xfs_btree_cur *cur, xfs_agblock_t bno,
> >  int xfs_rmap_get_rec(struct xfs_btree_cur *cur, struct xfs_rmap_irec *irec,
> >  		int *stat);
> >  
> > +/* functions for updating the rmapbt for bmbt blocks and AG btree blocks */
> >  int xfs_rmap_alloc(struct xfs_trans *tp, struct xfs_buf *agbp,
> >  		   xfs_agnumber_t agno, xfs_agblock_t bno, xfs_extlen_t len,
> >  		   struct xfs_owner_info *oinfo);
> > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > index 6daafaf..3ebceb0 100644
> > --- a/fs/xfs/xfs_trace.h
> > +++ b/fs/xfs/xfs_trace.h
> > @@ -2549,6 +2549,8 @@ DEFINE_RMAPBT_EVENT(xfs_rmapbt_delete);
> >  DEFINE_AG_ERROR_EVENT(xfs_rmapbt_insert_error);
> >  DEFINE_AG_ERROR_EVENT(xfs_rmapbt_delete_error);
> >  DEFINE_AG_ERROR_EVENT(xfs_rmapbt_update_error);
> > +DEFINE_RMAPBT_EVENT(xfs_rmap_lookup_le_range_result);
> > +DEFINE_RMAPBT_EVENT(xfs_rmap_map_gtrec);
> >  
> >  #endif /* _TRACE_XFS_H */
> >  
> > 


* Re: [PATCH 031/119] xfs: rmap btree requires more reserved free space
  2016-07-08 13:21   ` Brian Foster
@ 2016-07-13 16:50     ` Darrick J. Wong
  2016-07-13 18:32       ` Brian Foster
  0 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-07-13 16:50 UTC (permalink / raw)
  To: Brian Foster; +Cc: david, linux-fsdevel, vishal.l.verma, Dave Chinner, xfs

On Fri, Jul 08, 2016 at 09:21:55AM -0400, Brian Foster wrote:
> On Thu, Jun 16, 2016 at 06:21:11PM -0700, Darrick J. Wong wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > The rmap btree is allocated from the AGFL, which means we have to
> > ensure ENOSPC is reported to userspace before we run out of free
> > space in each AG. The last allocation in an AG can cause a full
> > height rmap btree split, and that means we have to reserve at least
> > this many blocks *in each AG* to be placed on the AGFL at ENOSPC.
> > Update the various space calculation functiosn to handle this.
> 
> 				       functions
> 
> > 
> > Also, because the macros are now executing conditional code and are called quite
> > frequently, convert them to functions that initialise varaibles in the struct
> > xfs_mount, use the new variables everywhere and document the calculations
> > better.
> > 
> > v2: If rmapbt is disabled, it is incorrect to require 1 extra AGFL block
> > for the rmapbt (due to the + 1); the entire clause needs to be gated
> > on the feature flag.
> > 
> > v3: Use m_rmap_maxlevels to determine min_free.
> > 
> > [darrick.wong@oracle.com: don't reserve blocks if !rmap]
> > [dchinner@redhat.com: update m_ag_max_usable after growfs]
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > Reviewed-by: Dave Chinner <dchinner@redhat.com>
> > Signed-off-by: Dave Chinner <david@fromorbit.com>
> > ---
> >  fs/xfs/libxfs/xfs_alloc.c |   71 +++++++++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/libxfs/xfs_alloc.h |   41 +++-----------------------
> >  fs/xfs/libxfs/xfs_bmap.c  |    2 +
> >  fs/xfs/libxfs/xfs_sb.c    |    2 +
> >  fs/xfs/xfs_discard.c      |    2 +
> >  fs/xfs/xfs_fsops.c        |    5 ++-
> >  fs/xfs/xfs_log_recover.c  |    1 +
> >  fs/xfs/xfs_mount.c        |    2 +
> >  fs/xfs/xfs_mount.h        |    2 +
> >  fs/xfs/xfs_super.c        |    2 +
> >  10 files changed, 88 insertions(+), 42 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> > index 570ca17..4c8ffd4 100644
> > --- a/fs/xfs/libxfs/xfs_alloc.c
> > +++ b/fs/xfs/libxfs/xfs_alloc.c
> > @@ -63,6 +63,72 @@ xfs_prealloc_blocks(
> >  }
> >  
> >  /*
> > + * In order to avoid ENOSPC-related deadlock caused by out-of-order locking of
> > + * AGF buffer (PV 947395), we place constraints on the relationship among
> > + * actual allocations for data blocks, freelist blocks, and potential file data
> > + * bmap btree blocks. However, these restrictions may result in no actual space
> > + * allocated for a delayed extent, for example, a data block in a certain AG is
> > + * allocated but there is no additional block for the additional bmap btree
> > + * block due to a split of the bmap btree of the file. The result of this may
> > + * lead to an infinite loop when the file gets flushed to disk and all delayed
> > + * extents need to be actually allocated. To get around this, we explicitly set
> > + * aside a few blocks which will not be reserved in delayed allocation.
> > + *
> > + * The minimum number of needed freelist blocks is 4 fsbs _per AG_ when we are
> > + * not using rmap btrees a potential split of file's bmap btree requires 1 fsb,
> > + * so we set the number of set-aside blocks to 4 + 4*agcount when not using
> > + * rmap btrees.
> > + *
> 
> That's a bit wordy.

Yikes, that whole thing is a single sentence!

One thing I'm not really sure about is how "a potential split of file's bmap
btree requires 1 fsb" seems to translate to 4 in the actual formula.  I'd
have thought it would be m_bm_maxlevels or something... not just 4.

/* 
 * When rmap is disabled, we need to reserve 4 fsbs _per AG_ for the freelist
 * and 4 more to handle a potential split of the file's bmap btree.
 *
 * When rmap is enabled, we must also be able to handle two rmap btree inserts
 * to record both the file data extent and a new bmbt block.  The bmbt block
 * might not be in the same AG as the file data extent.  In the worst case
 * the bmap btree splits multiple levels and all the new blocks come from
 * different AGs, so set aside enough to handle rmap btree splits in all AGs.
 */

> > + * When rmap btrees are active, we have to consider that using the last block
> > + * in the AG can cause a full height rmap btree split and we need enough blocks
> > + * on the AGFL to be able to handle this. That means we have, in addition to
> > + * the above consideration, another (2 * mp->m_rmap_levels) - 1 blocks required
> > + * to be available to the free list.
> 
> I'm probably missing something, but why does a full tree split require 2
> blocks per-level (minus 1)? Wouldn't that involve an allocated block per
> level (and possibly a new root block)?

The whole rmap clause is wrong. :(

I think we'll be fine with agcount * m_rmap_maxlevels.
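For concreteness, the formulas in this exchange can be compared numerically. A standalone sketch — the values agcount = 4 and m_rmap_maxlevels = 5 below are made-up examples, not anything from the patch: the expression as posted computes agcount * (2 * maxlevels) - 1, where C operator precedence applies the "- 1" once rather than once per AG (Brian's per-AG reading of the comment would give agcount * ((2 * maxlevels) - 1)), while the proposed simplification is just agcount * maxlevels.

```c
#include <assert.h>

#define AGFL_RESERVE	4	/* XFS_ALLOC_AGFL_RESERVE */

/* Set-aside blocks without the rmap btree: 4 + 4 * agcount. */
static unsigned int set_aside_normap(unsigned int agcount)
{
	return 4 + agcount * AGFL_RESERVE;
}

/* Extra set-aside per the patch as posted.  Note the precedence:
 * the "- 1" is applied once in total, not once per AG. */
static unsigned int rmap_extra_patch(unsigned int agcount,
				     unsigned int maxlevels)
{
	return agcount * (2 * maxlevels) - 1;
}

/* Extra set-aside per the proposed simplification above. */
static unsigned int rmap_extra_fixed(unsigned int agcount,
				     unsigned int maxlevels)
{
	return agcount * maxlevels;
}
```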

> Otherwise, the rest looks good to me.

Cool.

<keep going downwards>

> Brian
> 
> > + */
> > +unsigned int
> > +xfs_alloc_set_aside(
> > +	struct xfs_mount *mp)
> > +{
> > +	unsigned int	blocks;
> > +
> > +	blocks = 4 + (mp->m_sb.sb_agcount * XFS_ALLOC_AGFL_RESERVE);
> > +	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
> > +		return blocks;
> > +	return blocks + (mp->m_sb.sb_agcount * (2 * mp->m_rmap_maxlevels) - 1);
> > +}
> > +
> > +/*
> > + * When deciding how much space to allocate out of an AG, we limit the
> > + * allocation maximum size to the size the AG. However, we cannot use all the
> > + * blocks in the AG - some are permanently used by metadata. These
> > + * blocks are generally:
> > + *	- the AG superblock, AGF, AGI and AGFL
> > + *	- the AGF (bno and cnt) and AGI btree root blocks, and optionally
> > + *	  the AGI free inode and rmap btree root blocks.
> > + *	- blocks on the AGFL according to xfs_alloc_set_aside() limits
> > + *
> > + * The AG headers are sector sized, so the amount of space they take up is
> > + * dependent on filesystem geometry. The others are all single blocks.
> > + */
> > +unsigned int
> > +xfs_alloc_ag_max_usable(struct xfs_mount *mp)
> > +{
> > +	unsigned int	blocks;
> > +
> > +	blocks = XFS_BB_TO_FSB(mp, XFS_FSS_TO_BB(mp, 4)); /* ag headers */
> > +	blocks += XFS_ALLOC_AGFL_RESERVE;
> > +	blocks += 3;			/* AGF, AGI btree root blocks */
> > +	if (xfs_sb_version_hasfinobt(&mp->m_sb))
> > +		blocks++;		/* finobt root block */
> > +	if (xfs_sb_version_hasrmapbt(&mp->m_sb)) {
> > +		/* rmap root block + full tree split on full AG */
> > +		blocks += 1 + (2 * mp->m_ag_maxlevels) - 1;

I think this could be blocks++ since we now have AG reservations.

--D

> > +	}
> > +
> > +	return mp->m_sb.sb_agblocks - blocks;
> > +}
> > +
> > +/*
> >   * Lookup the record equal to [bno, len] in the btree given by cur.
> >   */
> >  STATIC int				/* error */
> > @@ -1904,6 +1970,11 @@ xfs_alloc_min_freelist(
> >  	/* space needed by-size freespace btree */
> >  	min_free += min_t(unsigned int, pag->pagf_levels[XFS_BTNUM_CNTi] + 1,
> >  				       mp->m_ag_maxlevels);
> > +	/* space needed reverse mapping used space btree */
> > +	if (xfs_sb_version_hasrmapbt(&mp->m_sb))
> > +		min_free += min_t(unsigned int,
> > +				  pag->pagf_levels[XFS_BTNUM_RMAPi] + 1,
> > +				  mp->m_rmap_maxlevels);
> >  
> >  	return min_free;
> >  }
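The min_t() clamp in the hunk above caps each btree's freelist requirement at that tree's maximum possible height: one block per current level plus one for a potential split, but never more than the worst-case depth. A standalone sketch of the rmapbt term (illustrative only):

```c
#include <assert.h>

#define MIN_T(a, b)	((a) < (b) ? (a) : (b))

/* One freelist block per current rmapbt level plus one for a split,
 * capped at the worst-case tree height. */
static unsigned int rmapbt_min_free(unsigned int cur_levels,
				    unsigned int maxlevels)
{
	return MIN_T(cur_levels + 1, maxlevels);
}
```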
> > diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
> > index 0721a48..7b6c66b 100644
> > --- a/fs/xfs/libxfs/xfs_alloc.h
> > +++ b/fs/xfs/libxfs/xfs_alloc.h
> > @@ -56,42 +56,6 @@ typedef unsigned int xfs_alloctype_t;
> >  #define	XFS_ALLOC_FLAG_FREEING	0x00000002  /* indicate caller is freeing extents*/
> >  
> >  /*
> > - * In order to avoid ENOSPC-related deadlock caused by
> > - * out-of-order locking of AGF buffer (PV 947395), we place
> > - * constraints on the relationship among actual allocations for
> > - * data blocks, freelist blocks, and potential file data bmap
> > - * btree blocks. However, these restrictions may result in no
> > - * actual space allocated for a delayed extent, for example, a data
> > - * block in a certain AG is allocated but there is no additional
> > - * block for the additional bmap btree block due to a split of the
> > - * bmap btree of the file. The result of this may lead to an
> > - * infinite loop in xfssyncd when the file gets flushed to disk and
> > - * all delayed extents need to be actually allocated. To get around
> > - * this, we explicitly set aside a few blocks which will not be
> > - * reserved in delayed allocation. Considering the minimum number of
> > - * needed freelist blocks is 4 fsbs _per AG_, a potential split of file's bmap
> > - * btree requires 1 fsb, so we set the number of set-aside blocks
> > - * to 4 + 4*agcount.
> > - */
> > -#define XFS_ALLOC_SET_ASIDE(mp)  (4 + ((mp)->m_sb.sb_agcount * 4))
> > -
> > -/*
> > - * When deciding how much space to allocate out of an AG, we limit the
> > - * allocation maximum size to the size the AG. However, we cannot use all the
> > - * blocks in the AG - some are permanently used by metadata. These
> > - * blocks are generally:
> > - *	- the AG superblock, AGF, AGI and AGFL
> > - *	- the AGF (bno and cnt) and AGI btree root blocks
> > - *	- 4 blocks on the AGFL according to XFS_ALLOC_SET_ASIDE() limits
> > - *
> > - * The AG headers are sector sized, so the amount of space they take up is
> > - * dependent on filesystem geometry. The others are all single blocks.
> > - */
> > -#define XFS_ALLOC_AG_MAX_USABLE(mp)	\
> > -	((mp)->m_sb.sb_agblocks - XFS_BB_TO_FSB(mp, XFS_FSS_TO_BB(mp, 4)) - 7)
> > -
> > -
> > -/*
> >   * Argument structure for xfs_alloc routines.
> >   * This is turned into a structure to avoid having 20 arguments passed
> >   * down several levels of the stack.
> > @@ -133,6 +97,11 @@ typedef struct xfs_alloc_arg {
> >  #define XFS_ALLOC_INITIAL_USER_DATA	(1 << 1)/* special case start of file */
> >  #define XFS_ALLOC_USERDATA_ZERO		(1 << 2)/* zero extent on allocation */
> >  
> > +/* freespace limit calculations */
> > +#define XFS_ALLOC_AGFL_RESERVE	4
> > +unsigned int xfs_alloc_set_aside(struct xfs_mount *mp);
> > +unsigned int xfs_alloc_ag_max_usable(struct xfs_mount *mp);
> > +
> >  xfs_extlen_t xfs_alloc_longest_free_extent(struct xfs_mount *mp,
> >  		struct xfs_perag *pag, xfs_extlen_t need);
> >  unsigned int xfs_alloc_min_freelist(struct xfs_mount *mp,
> > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > index 2c28f2a..61c0231 100644
> > --- a/fs/xfs/libxfs/xfs_bmap.c
> > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > @@ -3672,7 +3672,7 @@ xfs_bmap_btalloc(
> >  	args.fsbno = ap->blkno;
> >  
> >  	/* Trim the allocation back to the maximum an AG can fit. */
> > -	args.maxlen = MIN(ap->length, XFS_ALLOC_AG_MAX_USABLE(mp));
> > +	args.maxlen = MIN(ap->length, mp->m_ag_max_usable);
> >  	args.firstblock = *ap->firstblock;
> >  	blen = 0;
> >  	if (nullfb) {
> > diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
> > index f86226b..59c9f59 100644
> > --- a/fs/xfs/libxfs/xfs_sb.c
> > +++ b/fs/xfs/libxfs/xfs_sb.c
> > @@ -749,6 +749,8 @@ xfs_sb_mount_common(
> >  		mp->m_ialloc_min_blks = sbp->sb_spino_align;
> >  	else
> >  		mp->m_ialloc_min_blks = mp->m_ialloc_blks;
> > +	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
> > +	mp->m_ag_max_usable = xfs_alloc_ag_max_usable(mp);
> >  }
> >  
> >  /*
> > diff --git a/fs/xfs/xfs_discard.c b/fs/xfs/xfs_discard.c
> > index 272c3f8..4ff499a 100644
> > --- a/fs/xfs/xfs_discard.c
> > +++ b/fs/xfs/xfs_discard.c
> > @@ -179,7 +179,7 @@ xfs_ioc_trim(
> >  	 * matter as trimming blocks is an advisory interface.
> >  	 */
> >  	if (range.start >= XFS_FSB_TO_B(mp, mp->m_sb.sb_dblocks) ||
> > -	    range.minlen > XFS_FSB_TO_B(mp, XFS_ALLOC_AG_MAX_USABLE(mp)) ||
> > +	    range.minlen > XFS_FSB_TO_B(mp, mp->m_ag_max_usable) ||
> >  	    range.len < mp->m_sb.sb_blocksize)
> >  		return -EINVAL;
> >  
> > diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> > index 8a85e49..3772f6c 100644
> > --- a/fs/xfs/xfs_fsops.c
> > +++ b/fs/xfs/xfs_fsops.c
> > @@ -583,6 +583,7 @@ xfs_growfs_data_private(
> >  	} else
> >  		mp->m_maxicount = 0;
> >  	xfs_set_low_space_thresholds(mp);
> > +	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
> >  
> >  	/* update secondary superblocks. */
> >  	for (agno = 1; agno < nagcount; agno++) {
> > @@ -720,7 +721,7 @@ xfs_fs_counts(
> >  	cnt->allocino = percpu_counter_read_positive(&mp->m_icount);
> >  	cnt->freeino = percpu_counter_read_positive(&mp->m_ifree);
> >  	cnt->freedata = percpu_counter_read_positive(&mp->m_fdblocks) -
> > -							XFS_ALLOC_SET_ASIDE(mp);
> > +						mp->m_alloc_set_aside;
> >  
> >  	spin_lock(&mp->m_sb_lock);
> >  	cnt->freertx = mp->m_sb.sb_frextents;
> > @@ -793,7 +794,7 @@ retry:
> >  		__int64_t	free;
> >  
> >  		free = percpu_counter_sum(&mp->m_fdblocks) -
> > -							XFS_ALLOC_SET_ASIDE(mp);
> > +						mp->m_alloc_set_aside;
> >  		if (!free)
> >  			goto out; /* ENOSPC and fdblks_delta = 0 */
> >  
> > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> > index 0c41bd2..b33187b 100644
> > --- a/fs/xfs/xfs_log_recover.c
> > +++ b/fs/xfs/xfs_log_recover.c
> > @@ -5027,6 +5027,7 @@ xlog_do_recover(
> >  		xfs_warn(mp, "Failed post-recovery per-ag init: %d", error);
> >  		return error;
> >  	}
> > +	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
> >  
> >  	xlog_recover_check_summary(log);
> >  
> > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> > index 8af1c88..879f3ef 100644
> > --- a/fs/xfs/xfs_mount.c
> > +++ b/fs/xfs/xfs_mount.c
> > @@ -1219,7 +1219,7 @@ xfs_mod_fdblocks(
> >  		batch = XFS_FDBLOCKS_BATCH;
> >  
> >  	__percpu_counter_add(&mp->m_fdblocks, delta, batch);
> > -	if (__percpu_counter_compare(&mp->m_fdblocks, XFS_ALLOC_SET_ASIDE(mp),
> > +	if (__percpu_counter_compare(&mp->m_fdblocks, mp->m_alloc_set_aside,
> >  				     XFS_FDBLOCKS_BATCH) >= 0) {
> >  		/* we had space! */
> >  		return 0;
> > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> > index 0ed0f29..b36676c 100644
> > --- a/fs/xfs/xfs_mount.h
> > +++ b/fs/xfs/xfs_mount.h
> > @@ -123,6 +123,8 @@ typedef struct xfs_mount {
> >  	uint			m_in_maxlevels;	/* max inobt btree levels. */
> >  	uint			m_rmap_maxlevels; /* max rmap btree levels */
> >  	xfs_extlen_t		m_ag_prealloc_blocks; /* reserved ag blocks */
> > +	uint			m_alloc_set_aside; /* space we can't use */
> > +	uint			m_ag_max_usable; /* max space per AG */
> >  	struct radix_tree_root	m_perag_tree;	/* per-ag accounting info */
> >  	spinlock_t		m_perag_lock;	/* lock for m_perag_tree */
> >  	struct mutex		m_growlock;	/* growfs mutex */
> > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > index bf63f6d..1575849 100644
> > --- a/fs/xfs/xfs_super.c
> > +++ b/fs/xfs/xfs_super.c
> > @@ -1076,7 +1076,7 @@ xfs_fs_statfs(
> >  	statp->f_blocks = sbp->sb_dblocks - lsize;
> >  	spin_unlock(&mp->m_sb_lock);
> >  
> > -	statp->f_bfree = fdblocks - XFS_ALLOC_SET_ASIDE(mp);
> > +	statp->f_bfree = fdblocks - mp->m_alloc_set_aside;
> >  	statp->f_bavail = statp->f_bfree;
> >  
> >  	fakeinos = statp->f_bfree << sbp->sb_inopblog;
> > 

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 038/119] xfs: convert unwritten status of reverse mappings
  2016-06-17  1:21 ` [PATCH 038/119] xfs: convert unwritten status of reverse mappings Darrick J. Wong
  2016-06-30  0:15   ` Darrick J. Wong
@ 2016-07-13 18:27   ` Brian Foster
  2016-07-13 20:43     ` Darrick J. Wong
  1 sibling, 1 reply; 236+ messages in thread
From: Brian Foster @ 2016-07-13 18:27 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, vishal.l.verma, xfs

On Thu, Jun 16, 2016 at 06:21:55PM -0700, Darrick J. Wong wrote:
> Provide a function to convert an unwritten extent to a real one and
> vice versa.
> 
> v2: Move unwritten bit to rm_offset.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

Just a few nits below. Those aside and with Darrick's bc_rec.b ->
bc_rec.r fix:

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/libxfs/xfs_rmap.c |  442 ++++++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_trace.h       |    6 +
>  2 files changed, 448 insertions(+)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
> index 1043c63..53ba14e 100644
> --- a/fs/xfs/libxfs/xfs_rmap.c
> +++ b/fs/xfs/libxfs/xfs_rmap.c
> @@ -610,6 +610,448 @@ out_error:
>  	return error;
>  }
>  
> +#define RMAP_LEFT_CONTIG	(1 << 0)
> +#define RMAP_RIGHT_CONTIG	(1 << 1)
> +#define RMAP_LEFT_FILLING	(1 << 2)
> +#define RMAP_RIGHT_FILLING	(1 << 3)
> +#define RMAP_LEFT_VALID		(1 << 6)
> +#define RMAP_RIGHT_VALID	(1 << 7)
> +
> +#define LEFT		r[0]
> +#define RIGHT		r[1]
> +#define PREV		r[2]
> +#define NEW		r[3]
> +
> +/*
> + * Convert an unwritten extent to a real extent or vice versa.
> + * Does not handle overlapping extents.
> + */
> +STATIC int
> +__xfs_rmap_convert(
> +	struct xfs_btree_cur	*cur,
> +	xfs_agblock_t		bno,
> +	xfs_extlen_t		len,
> +	bool			unwritten,
> +	struct xfs_owner_info	*oinfo)
> +{
...
> +
> +	/*
> +	 * For the initial lookup, look for and exact match or the left-adjacent

Typo:					    an

> +	 * record for our insertion point. This will also give us the record for
> +	 * start block contiguity tests.
> +	 */
> +	error = xfs_rmap_lookup_le(cur, bno, len, owner, offset, oldext, &i);
> +	if (error)
> +		goto done;
> +	XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
> +
...
> +
> +	/*
> +	 * Switch out based on the FILLING and CONTIG state bits.
> +	 */
> +	switch (state & (RMAP_LEFT_FILLING | RMAP_LEFT_CONTIG |
> +			 RMAP_RIGHT_FILLING | RMAP_RIGHT_CONTIG)) {
...
> +	case RMAP_LEFT_FILLING | RMAP_RIGHT_FILLING | RMAP_RIGHT_CONTIG:
> +		/*
> +		 * Setting all of a previous oldext extent to newext.
> +		 * The right neighbor is contiguous, the left is not.
> +		 */
> +		error = xfs_btree_increment(cur, 0, &i);
> +		if (error)
> +			goto done;
> +		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
> +		trace_xfs_rmapbt_delete(mp, cur->bc_private.a.agno,
> +				RIGHT.rm_startblock, RIGHT.rm_blockcount,
> +				RIGHT.rm_owner, RIGHT.rm_offset,
> +				RIGHT.rm_flags);
> +		error = xfs_btree_delete(cur, &i);
> +		if (error)
> +			goto done;
> +		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
> +		error = xfs_btree_decrement(cur, 0, &i);
> +		if (error)
> +			goto done;
> +		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
> +		NEW.rm_startblock = bno;
> +		NEW.rm_owner = owner;
> +		NEW.rm_offset = offset;

		NEW = PREV ?

> +		NEW.rm_blockcount = len + RIGHT.rm_blockcount;
> +		NEW.rm_flags = newext;
> +		error = xfs_rmap_update(cur, &NEW);
> +		if (error)
> +			goto done;
> +		break;
> +
...
>  struct xfs_rmapbt_query_range_info {
>  	xfs_rmapbt_query_range_fn	fn;
>  	void				*priv;
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index 3ebceb0..6466adc 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -2497,6 +2497,10 @@ DEFINE_RMAP_EVENT(xfs_rmap_free_extent_error);
>  DEFINE_RMAP_EVENT(xfs_rmap_alloc_extent);
>  DEFINE_RMAP_EVENT(xfs_rmap_alloc_extent_done);
>  DEFINE_RMAP_EVENT(xfs_rmap_alloc_extent_error);
> +DEFINE_RMAP_EVENT(xfs_rmap_convert);
> +DEFINE_RMAP_EVENT(xfs_rmap_convert_done);
> +DEFINE_AG_ERROR_EVENT(xfs_rmap_convert_error);
> +DEFINE_AG_ERROR_EVENT(xfs_rmap_convert_state);
>  
>  DECLARE_EVENT_CLASS(xfs_rmapbt_class,
>  	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
> @@ -2551,6 +2555,8 @@ DEFINE_AG_ERROR_EVENT(xfs_rmapbt_delete_error);
>  DEFINE_AG_ERROR_EVENT(xfs_rmapbt_update_error);
>  DEFINE_RMAPBT_EVENT(xfs_rmap_lookup_le_range_result);
>  DEFINE_RMAPBT_EVENT(xfs_rmap_map_gtrec);
> +DEFINE_RMAPBT_EVENT(xfs_rmap_convert_gtrec);
> +DEFINE_RMAPBT_EVENT(xfs_rmap_find_left_neighbor_result);

xfs_rmap_convert_ltrec ?

Brian

>  
>  #endif /* _TRACE_XFS_H */
>  
> 

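The FILLING/CONTIG state encoding discussed in the patch above can be sketched in plain C. The flag values are copied from the quoted hunk; `rmap_convert_state()` is a hypothetical helper for illustration only, not kernel code — in `__xfs_rmap_convert()` the bits are set inline from the left/right neighbor lookups.

```c
#include <assert.h>

/* flag values as defined in the quoted patch */
#define RMAP_LEFT_CONTIG	(1 << 0)
#define RMAP_RIGHT_CONTIG	(1 << 1)
#define RMAP_LEFT_FILLING	(1 << 2)
#define RMAP_RIGHT_FILLING	(1 << 3)

/*
 * Hypothetical helper: fold the four neighbor predicates into the
 * state word that the switch in __xfs_rmap_convert() dispatches on.
 */
static unsigned int
rmap_convert_state(int left_contig, int right_contig,
		   int left_filling, int right_filling)
{
	unsigned int state = 0;

	if (left_contig)
		state |= RMAP_LEFT_CONTIG;
	if (right_contig)
		state |= RMAP_RIGHT_CONTIG;
	if (left_filling)
		state |= RMAP_LEFT_FILLING;
	if (right_filling)
		state |= RMAP_RIGHT_FILLING;
	return state;
}
```

For example, the case shown in the quoted hunk — converting all of the previous extent with a contiguous right neighbor but no contiguous left neighbor — corresponds to `rmap_convert_state(0, 1, 1, 1)`, which yields `RMAP_LEFT_FILLING | RMAP_RIGHT_FILLING | RMAP_RIGHT_CONTIG`.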

* Re: [PATCH 039/119] xfs: add rmap btree insert and delete helpers
  2016-06-17  1:22 ` [PATCH 039/119] xfs: add rmap btree insert and delete helpers Darrick J. Wong
@ 2016-07-13 18:28   ` Brian Foster
  2016-07-13 18:37     ` Darrick J. Wong
  0 siblings, 1 reply; 236+ messages in thread
From: Brian Foster @ 2016-07-13 18:28 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, vishal.l.verma, Dave Chinner, xfs

On Thu, Jun 16, 2016 at 06:22:02PM -0700, Darrick J. Wong wrote:
> Add a couple of helper functions to encapsulate rmap btree insert and
> delete operations.  Add tracepoints to the update function.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> Reviewed-by: Dave Chinner <dchinner@redhat.com>
> Signed-off-by: Dave Chinner <david@fromorbit.com>
> ---
>  fs/xfs/libxfs/xfs_rmap.c       |   78 +++++++++++++++++++++++++++++++++++++++-
>  fs/xfs/libxfs/xfs_rmap_btree.h |    3 ++
>  2 files changed, 80 insertions(+), 1 deletion(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
> index 53ba14e..f92eaa1 100644
> --- a/fs/xfs/libxfs/xfs_rmap.c
> +++ b/fs/xfs/libxfs/xfs_rmap.c
> @@ -92,13 +92,89 @@ xfs_rmap_update(
...
> +STATIC int
> +xfs_rmapbt_delete(

This throws an unused warning that persists to the end of the rmap
patches..?

Brian

> +	struct xfs_btree_cur	*rcur,
> +	xfs_agblock_t		agbno,
> +	xfs_extlen_t		len,
> +	uint64_t		owner,
> +	uint64_t		offset,
> +	unsigned int		flags)
> +{
> +	int			i;
> +	int			error;
> +
> +	trace_xfs_rmapbt_delete(rcur->bc_mp, rcur->bc_private.a.agno, agbno,
> +			len, owner, offset, flags);
> +
> +	error = xfs_rmap_lookup_eq(rcur, agbno, len, owner, offset, flags, &i);
> +	if (error)
> +		goto done;
> +	XFS_WANT_CORRUPTED_GOTO(rcur->bc_mp, i == 1, done);
> +
> +	error = xfs_btree_delete(rcur, &i);
> +	if (error)
> +		goto done;
> +	XFS_WANT_CORRUPTED_GOTO(rcur->bc_mp, i == 1, done);
> +done:
> +	if (error)
> +		trace_xfs_rmapbt_delete_error(rcur->bc_mp,
> +				rcur->bc_private.a.agno, error, _RET_IP_);
> +	return error;
>  }
>  
>  static int
> diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
> index 9d92da5..6674340 100644
> --- a/fs/xfs/libxfs/xfs_rmap_btree.h
> +++ b/fs/xfs/libxfs/xfs_rmap_btree.h
> @@ -64,6 +64,9 @@ int xfs_rmap_lookup_le(struct xfs_btree_cur *cur, xfs_agblock_t bno,
>  int xfs_rmap_lookup_eq(struct xfs_btree_cur *cur, xfs_agblock_t bno,
>  		xfs_extlen_t len, uint64_t owner, uint64_t offset,
>  		unsigned int flags, int *stat);
> +int xfs_rmapbt_insert(struct xfs_btree_cur *rcur, xfs_agblock_t agbno,
> +		xfs_extlen_t len, uint64_t owner, uint64_t offset,
> +		unsigned int flags);
>  int xfs_rmap_get_rec(struct xfs_btree_cur *cur, struct xfs_rmap_irec *irec,
>  		int *stat);
>  
> 

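The control flow of `xfs_rmapbt_delete()` in the patch above — look up the record by exact key, verify the cursor found something, delete, verify again — can be modelled on a flat array instead of a real btree. All names here are hypothetical; only the lookup-then-delete pattern with the `stat == 1` corruption checks mirrors the kernel code.

```c
#include <assert.h>
#include <string.h>

struct fake_rec { unsigned long startblock; unsigned long owner; };

struct fake_tree {
	struct fake_rec	recs[8];
	int		nrecs;
};

/* plays the role of xfs_rmap_lookup_eq(): *stat = 1 on an exact hit */
static int
fake_lookup_eq(struct fake_tree *t, unsigned long bno, int *stat)
{
	int i;

	for (i = 0; i < t->nrecs; i++) {
		if (t->recs[i].startblock == bno) {
			*stat = 1;
			return i;	/* index stands in for the cursor */
		}
	}
	*stat = 0;
	return -1;
}

/* plays the role of xfs_rmapbt_delete(): lookup, check, delete */
static int
fake_delete(struct fake_tree *t, unsigned long bno)
{
	int stat, idx;

	idx = fake_lookup_eq(t, bno, &stat);
	if (stat != 1)		/* the XFS_WANT_CORRUPTED_GOTO() check */
		return -1;	/* stands in for an -EFSCORRUPTED exit */
	memmove(&t->recs[idx], &t->recs[idx + 1],
		(size_t)(t->nrecs - idx - 1) * sizeof(t->recs[0]));
	t->nrecs--;
	return 0;
}
```

The point of the double check in the real helper is that a lookup miss (or a failed delete) on a key the caller believes exists indicates on-disk corruption, so the helper fails the operation rather than silently continuing.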

* Re: [PATCH 040/119] xfs: create helpers for mapping, unmapping, and converting file fork extents
  2016-06-17  1:22 ` [PATCH 040/119] xfs: create helpers for mapping, unmapping, and converting file fork extents Darrick J. Wong
@ 2016-07-13 18:28   ` Brian Foster
  2016-07-13 18:47     ` Darrick J. Wong
  0 siblings, 1 reply; 236+ messages in thread
From: Brian Foster @ 2016-07-13 18:28 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, vishal.l.verma, xfs

On Thu, Jun 16, 2016 at 06:22:08PM -0700, Darrick J. Wong wrote:
> Create two helper functions to assist with mapping, unmapping, and
> converting flag status of extents in a file's data/attr forks.  For
> non-shared files we can use the _alloc, _free, and _convert functions;
> when reflink comes these functions will be augmented to deal with
> shared extents.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_rmap.c |   42 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 42 insertions(+)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
> index f92eaa1..76fc5c2 100644
> --- a/fs/xfs/libxfs/xfs_rmap.c
> +++ b/fs/xfs/libxfs/xfs_rmap.c
> @@ -1123,11 +1123,53 @@ done:
>  	return error;
>  }
>  
> +/*
> + * Convert an unwritten extent to a real extent or vice versa.
> + */
> +STATIC int
> +xfs_rmap_convert(
> +	struct xfs_btree_cur	*cur,
> +	xfs_agblock_t		bno,
> +	xfs_extlen_t		len,
> +	bool			unwritten,
> +	struct xfs_owner_info	*oinfo)
> +{
> +	return __xfs_rmap_convert(cur, bno, len, unwritten, oinfo);
> +}
> +

Hmm, these all look like 1-1 mappings and they're static as well. Is the
additional interface for reflink? If so, I think it might be better to
punt this down to where it is really used (reflink).

Brian

>  #undef	NEW
>  #undef	LEFT
>  #undef	RIGHT
>  #undef	PREV
>  
> +/*
> + * Find an extent in the rmap btree and unmap it.
> + */
> +STATIC int
> +xfs_rmap_unmap(
> +	struct xfs_btree_cur	*cur,
> +	xfs_agblock_t		bno,
> +	xfs_extlen_t		len,
> +	bool			unwritten,
> +	struct xfs_owner_info	*oinfo)
> +{
> +	return __xfs_rmap_free(cur, bno, len, unwritten, oinfo);
> +}
> +
> +/*
> + * Find an extent in the rmap btree and map it.
> + */
> +STATIC int
> +xfs_rmap_map(
> +	struct xfs_btree_cur	*cur,
> +	xfs_agblock_t		bno,
> +	xfs_extlen_t		len,
> +	bool			unwritten,
> +	struct xfs_owner_info	*oinfo)
> +{
> +	return __xfs_rmap_alloc(cur, bno, len, unwritten, oinfo);
> +}
> +
>  struct xfs_rmapbt_query_range_info {
>  	xfs_rmapbt_query_range_fn	fn;
>  	void				*priv;
> 

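The layering Brian asks about above — public helpers that delegate 1:1 to internal `__` variants — is a common staging pattern: the wrapper is the stable entry point for callers, and a later series (here, reflink) can add a shared-extent branch inside the wrapper without touching any call site. A minimal illustrative sketch, with hypothetical names:

```c
/* internal worker, analogous to __xfs_rmap_alloc() */
static int
__rmap_map(unsigned long bno, unsigned long len)
{
	(void)bno;
	(void)len;
	return 0;	/* would insert/extend an rmap btree record */
}

/* public helper, analogous to xfs_rmap_map() in the patch above */
static int
rmap_map(unsigned long bno, unsigned long len)
{
	/*
	 * Today this is a pure pass-through; the reflink series can
	 * later branch to a shared-extent path here.
	 */
	return __rmap_map(bno, len);
}
```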

* Re: [PATCH 031/119] xfs: rmap btree requires more reserved free space
  2016-07-13 16:50     ` Darrick J. Wong
@ 2016-07-13 18:32       ` Brian Foster
  2016-07-13 23:50         ` Dave Chinner
  0 siblings, 1 reply; 236+ messages in thread
From: Brian Foster @ 2016-07-13 18:32 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, vishal.l.verma, xfs, Dave Chinner

On Wed, Jul 13, 2016 at 09:50:08AM -0700, Darrick J. Wong wrote:
> On Fri, Jul 08, 2016 at 09:21:55AM -0400, Brian Foster wrote:
> > On Thu, Jun 16, 2016 at 06:21:11PM -0700, Darrick J. Wong wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > The rmap btree is allocated from the AGFL, which means we have to
> > > ensure ENOSPC is reported to userspace before we run out of free
> > > space in each AG. The last allocation in an AG can cause a full
> > > height rmap btree split, and that means we have to reserve at least
> > > this many blocks *in each AG* to be placed on the AGFL at ENOSPC.
> > > Update the various space calculation functiosn to handle this.
> > 
> > 				       functions
> > 
> > > 
> > > Also, because the macros are now executing conditional code and are called quite
> > > frequently, convert them to functions that initialise variables in the struct
> > > xfs_mount, use the new variables everywhere and document the calculations
> > > better.
> > > 
> > > v2: If rmapbt is disabled, it is incorrect to require 1 extra AGFL block
> > > for the rmapbt (due to the + 1); the entire clause needs to be gated
> > > on the feature flag.
> > > 
> > > v3: Use m_rmap_maxlevels to determine min_free.
> > > 
> > > [darrick.wong@oracle.com: don't reserve blocks if !rmap]
> > > [dchinner@redhat.com: update m_ag_max_usable after growfs]
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > Reviewed-by: Dave Chinner <dchinner@redhat.com>
> > > Signed-off-by: Dave Chinner <david@fromorbit.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_alloc.c |   71 +++++++++++++++++++++++++++++++++++++++++++++
> > >  fs/xfs/libxfs/xfs_alloc.h |   41 +++-----------------------
> > >  fs/xfs/libxfs/xfs_bmap.c  |    2 +
> > >  fs/xfs/libxfs/xfs_sb.c    |    2 +
> > >  fs/xfs/xfs_discard.c      |    2 +
> > >  fs/xfs/xfs_fsops.c        |    5 ++-
> > >  fs/xfs/xfs_log_recover.c  |    1 +
> > >  fs/xfs/xfs_mount.c        |    2 +
> > >  fs/xfs/xfs_mount.h        |    2 +
> > >  fs/xfs/xfs_super.c        |    2 +
> > >  10 files changed, 88 insertions(+), 42 deletions(-)
> > > 
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> > > index 570ca17..4c8ffd4 100644
> > > --- a/fs/xfs/libxfs/xfs_alloc.c
> > > +++ b/fs/xfs/libxfs/xfs_alloc.c
> > > @@ -63,6 +63,72 @@ xfs_prealloc_blocks(
> > >  }
> > >  
> > >  /*
> > > + * In order to avoid ENOSPC-related deadlock caused by out-of-order locking of
> > > + * AGF buffer (PV 947395), we place constraints on the relationship among
> > > + * actual allocations for data blocks, freelist blocks, and potential file data
> > > + * bmap btree blocks. However, these restrictions may result in no actual space
> > > + * allocated for a delayed extent, for example, a data block in a certain AG is
> > > + * allocated but there is no additional block for the additional bmap btree
> > > + * block due to a split of the bmap btree of the file. The result of this may
> > > + * lead to an infinite loop when the file gets flushed to disk and all delayed
> > > + * extents need to be actually allocated. To get around this, we explicitly set
> > > + * aside a few blocks which will not be reserved in delayed allocation.
> > > + *
> > > + * The minimum number of needed freelist blocks is 4 fsbs _per AG_ when we are
> > > + * not using rmap btrees a potential split of file's bmap btree requires 1 fsb,
> > > + * so we set the number of set-aside blocks to 4 + 4*agcount when not using
> > > + * rmap btrees.
> > > + *
> > 
> > That's a bit wordy.
> 
> Yikes, that whole thing is a single sentence!
> 
> One thing I'm not really sure about is how "a potential split of file's bmap
> btree requires 1 fsb" seems to translate to 4 in the actual formula.  I'd
> have thought it would be m_bm_maxlevels or something... not just 4.
> 

I'm not sure about that either, tbh.

> /* 
>  * When rmap is disabled, we need to reserve 4 fsbs _per AG_ for the freelist
>  * and 4 more to handle a potential split of the file's bmap btree.
>  *
>  * When rmap is enabled, we must also be able to handle two rmap btree inserts
>  * to record both the file data extent and a new bmbt block.  The bmbt block
>  * might not be in the same AG as the file data extent.  In the worst case
>  * the bmap btree splits multiple levels and all the new blocks come from
>  * different AGs, so set aside enough to handle rmap btree splits in all AGs.
>  */
> 

That sounds much better.

> > > + * When rmap btrees are active, we have to consider that using the last block
> > > + * in the AG can cause a full height rmap btree split and we need enough blocks
> > > + * on the AGFL to be able to handle this. That means we have, in addition to
> > > + * the above consideration, another (2 * mp->m_rmap_levels) - 1 blocks required
> > > + * to be available to the free list.
> > 
> > I'm probably missing something, but why does a full tree split require 2
> > blocks per-level (minus 1)? Wouldn't that involve an allocated block per
> > level (and possibly a new root block)?
> 
> The whole rmap clause is wrong. :(
> 
> I think we'll be fine with agcount * m_rmap_maxlevels.
> 

Ok, that certainly makes more sense.

> > Otherwise, the rest looks good to me.
> 
> Cool.
> 
> <keep going downwards>
> 
> > Brian
> > 
> > > + */
> > > +unsigned int
> > > +xfs_alloc_set_aside(
> > > +	struct xfs_mount *mp)
> > > +{
> > > +	unsigned int	blocks;
> > > +
> > > +	blocks = 4 + (mp->m_sb.sb_agcount * XFS_ALLOC_AGFL_RESERVE);
> > > +	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
> > > +		return blocks;
> > > +	return blocks + (mp->m_sb.sb_agcount * (2 * mp->m_rmap_maxlevels) - 1);
> > > +}
> > > +
> > > +/*
> > > + * When deciding how much space to allocate out of an AG, we limit the
> > > + * allocation maximum size to the size the AG. However, we cannot use all the
> > > + * blocks in the AG - some are permanently used by metadata. These
> > > + * blocks are generally:
> > > + *	- the AG superblock, AGF, AGI and AGFL
> > > + *	- the AGF (bno and cnt) and AGI btree root blocks, and optionally
> > > + *	  the AGI free inode and rmap btree root blocks.
> > > + *	- blocks on the AGFL according to xfs_alloc_set_aside() limits
> > > + *
> > > + * The AG headers are sector sized, so the amount of space they take up is
> > > + * dependent on filesystem geometry. The others are all single blocks.
> > > + */
> > > +unsigned int
> > > +xfs_alloc_ag_max_usable(struct xfs_mount *mp)
> > > +{
> > > +	unsigned int	blocks;
> > > +
> > > +	blocks = XFS_BB_TO_FSB(mp, XFS_FSS_TO_BB(mp, 4)); /* ag headers */
> > > +	blocks += XFS_ALLOC_AGFL_RESERVE;
> > > +	blocks += 3;			/* AGF, AGI btree root blocks */
> > > +	if (xfs_sb_version_hasfinobt(&mp->m_sb))
> > > +		blocks++;		/* finobt root block */
> > > +	if (xfs_sb_version_hasrmapbt(&mp->m_sb)) {
> > > +		/* rmap root block + full tree split on full AG */
> > > +		blocks += 1 + (2 * mp->m_ag_maxlevels) - 1;
> 
> I think this could be blocks++ since we now have AG reservations.
> 

Sounds good.

Brian

> --D
> 
> > > +	}
> > > +
> > > +	return mp->m_sb.sb_agblocks - blocks;
> > > +}
> > > +
> > > +/*
> > >   * Lookup the record equal to [bno, len] in the btree given by cur.
> > >   */
> > >  STATIC int				/* error */
> > > @@ -1904,6 +1970,11 @@ xfs_alloc_min_freelist(
> > >  	/* space needed by-size freespace btree */
> > >  	min_free += min_t(unsigned int, pag->pagf_levels[XFS_BTNUM_CNTi] + 1,
> > >  				       mp->m_ag_maxlevels);
> > > +	/* space needed reverse mapping used space btree */
> > > +	if (xfs_sb_version_hasrmapbt(&mp->m_sb))
> > > +		min_free += min_t(unsigned int,
> > > +				  pag->pagf_levels[XFS_BTNUM_RMAPi] + 1,
> > > +				  mp->m_rmap_maxlevels);
> > >  
> > >  	return min_free;
> > >  }
> > > diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
> > > index 0721a48..7b6c66b 100644
> > > --- a/fs/xfs/libxfs/xfs_alloc.h
> > > +++ b/fs/xfs/libxfs/xfs_alloc.h
> > > @@ -56,42 +56,6 @@ typedef unsigned int xfs_alloctype_t;
> > >  #define	XFS_ALLOC_FLAG_FREEING	0x00000002  /* indicate caller is freeing extents*/
> > >  
> > >  /*
> > > - * In order to avoid ENOSPC-related deadlock caused by
> > > - * out-of-order locking of AGF buffer (PV 947395), we place
> > > - * constraints on the relationship among actual allocations for
> > > - * data blocks, freelist blocks, and potential file data bmap
> > > - * btree blocks. However, these restrictions may result in no
> > > - * actual space allocated for a delayed extent, for example, a data
> > > - * block in a certain AG is allocated but there is no additional
> > > - * block for the additional bmap btree block due to a split of the
> > > - * bmap btree of the file. The result of this may lead to an
> > > - * infinite loop in xfssyncd when the file gets flushed to disk and
> > > - * all delayed extents need to be actually allocated. To get around
> > > - * this, we explicitly set aside a few blocks which will not be
> > > - * reserved in delayed allocation. Considering the minimum number of
> > > - * needed freelist blocks is 4 fsbs _per AG_, a potential split of file's bmap
> > > - * btree requires 1 fsb, so we set the number of set-aside blocks
> > > - * to 4 + 4*agcount.
> > > - */
> > > -#define XFS_ALLOC_SET_ASIDE(mp)  (4 + ((mp)->m_sb.sb_agcount * 4))
> > > -
> > > -/*
> > > - * When deciding how much space to allocate out of an AG, we limit the
> > > - * allocation maximum size to the size the AG. However, we cannot use all the
> > > - * blocks in the AG - some are permanently used by metadata. These
> > > - * blocks are generally:
> > > - *	- the AG superblock, AGF, AGI and AGFL
> > > - *	- the AGF (bno and cnt) and AGI btree root blocks
> > > - *	- 4 blocks on the AGFL according to XFS_ALLOC_SET_ASIDE() limits
> > > - *
> > > - * The AG headers are sector sized, so the amount of space they take up is
> > > - * dependent on filesystem geometry. The others are all single blocks.
> > > - */
> > > -#define XFS_ALLOC_AG_MAX_USABLE(mp)	\
> > > -	((mp)->m_sb.sb_agblocks - XFS_BB_TO_FSB(mp, XFS_FSS_TO_BB(mp, 4)) - 7)
> > > -
> > > -
> > > -/*
> > >   * Argument structure for xfs_alloc routines.
> > >   * This is turned into a structure to avoid having 20 arguments passed
> > >   * down several levels of the stack.
> > > @@ -133,6 +97,11 @@ typedef struct xfs_alloc_arg {
> > >  #define XFS_ALLOC_INITIAL_USER_DATA	(1 << 1)/* special case start of file */
> > >  #define XFS_ALLOC_USERDATA_ZERO		(1 << 2)/* zero extent on allocation */
> > >  
> > > +/* freespace limit calculations */
> > > +#define XFS_ALLOC_AGFL_RESERVE	4
> > > +unsigned int xfs_alloc_set_aside(struct xfs_mount *mp);
> > > +unsigned int xfs_alloc_ag_max_usable(struct xfs_mount *mp);
> > > +
> > >  xfs_extlen_t xfs_alloc_longest_free_extent(struct xfs_mount *mp,
> > >  		struct xfs_perag *pag, xfs_extlen_t need);
> > >  unsigned int xfs_alloc_min_freelist(struct xfs_mount *mp,
> > > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > > index 2c28f2a..61c0231 100644
> > > --- a/fs/xfs/libxfs/xfs_bmap.c
> > > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > > @@ -3672,7 +3672,7 @@ xfs_bmap_btalloc(
> > >  	args.fsbno = ap->blkno;
> > >  
> > >  	/* Trim the allocation back to the maximum an AG can fit. */
> > > -	args.maxlen = MIN(ap->length, XFS_ALLOC_AG_MAX_USABLE(mp));
> > > +	args.maxlen = MIN(ap->length, mp->m_ag_max_usable);
> > >  	args.firstblock = *ap->firstblock;
> > >  	blen = 0;
> > >  	if (nullfb) {
> > > diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
> > > index f86226b..59c9f59 100644
> > > --- a/fs/xfs/libxfs/xfs_sb.c
> > > +++ b/fs/xfs/libxfs/xfs_sb.c
> > > @@ -749,6 +749,8 @@ xfs_sb_mount_common(
> > >  		mp->m_ialloc_min_blks = sbp->sb_spino_align;
> > >  	else
> > >  		mp->m_ialloc_min_blks = mp->m_ialloc_blks;
> > > +	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
> > > +	mp->m_ag_max_usable = xfs_alloc_ag_max_usable(mp);
> > >  }
> > >  
> > >  /*
> > > diff --git a/fs/xfs/xfs_discard.c b/fs/xfs/xfs_discard.c
> > > index 272c3f8..4ff499a 100644
> > > --- a/fs/xfs/xfs_discard.c
> > > +++ b/fs/xfs/xfs_discard.c
> > > @@ -179,7 +179,7 @@ xfs_ioc_trim(
> > >  	 * matter as trimming blocks is an advisory interface.
> > >  	 */
> > >  	if (range.start >= XFS_FSB_TO_B(mp, mp->m_sb.sb_dblocks) ||
> > > -	    range.minlen > XFS_FSB_TO_B(mp, XFS_ALLOC_AG_MAX_USABLE(mp)) ||
> > > +	    range.minlen > XFS_FSB_TO_B(mp, mp->m_ag_max_usable) ||
> > >  	    range.len < mp->m_sb.sb_blocksize)
> > >  		return -EINVAL;
> > >  
> > > diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> > > index 8a85e49..3772f6c 100644
> > > --- a/fs/xfs/xfs_fsops.c
> > > +++ b/fs/xfs/xfs_fsops.c
> > > @@ -583,6 +583,7 @@ xfs_growfs_data_private(
> > >  	} else
> > >  		mp->m_maxicount = 0;
> > >  	xfs_set_low_space_thresholds(mp);
> > > +	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
> > >  
> > >  	/* update secondary superblocks. */
> > >  	for (agno = 1; agno < nagcount; agno++) {
> > > @@ -720,7 +721,7 @@ xfs_fs_counts(
> > >  	cnt->allocino = percpu_counter_read_positive(&mp->m_icount);
> > >  	cnt->freeino = percpu_counter_read_positive(&mp->m_ifree);
> > >  	cnt->freedata = percpu_counter_read_positive(&mp->m_fdblocks) -
> > > -							XFS_ALLOC_SET_ASIDE(mp);
> > > +						mp->m_alloc_set_aside;
> > >  
> > >  	spin_lock(&mp->m_sb_lock);
> > >  	cnt->freertx = mp->m_sb.sb_frextents;
> > > @@ -793,7 +794,7 @@ retry:
> > >  		__int64_t	free;
> > >  
> > >  		free = percpu_counter_sum(&mp->m_fdblocks) -
> > > -							XFS_ALLOC_SET_ASIDE(mp);
> > > +						mp->m_alloc_set_aside;
> > >  		if (!free)
> > >  			goto out; /* ENOSPC and fdblks_delta = 0 */
> > >  
> > > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> > > index 0c41bd2..b33187b 100644
> > > --- a/fs/xfs/xfs_log_recover.c
> > > +++ b/fs/xfs/xfs_log_recover.c
> > > @@ -5027,6 +5027,7 @@ xlog_do_recover(
> > >  		xfs_warn(mp, "Failed post-recovery per-ag init: %d", error);
> > >  		return error;
> > >  	}
> > > +	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
> > >  
> > >  	xlog_recover_check_summary(log);
> > >  
> > > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> > > index 8af1c88..879f3ef 100644
> > > --- a/fs/xfs/xfs_mount.c
> > > +++ b/fs/xfs/xfs_mount.c
> > > @@ -1219,7 +1219,7 @@ xfs_mod_fdblocks(
> > >  		batch = XFS_FDBLOCKS_BATCH;
> > >  
> > >  	__percpu_counter_add(&mp->m_fdblocks, delta, batch);
> > > -	if (__percpu_counter_compare(&mp->m_fdblocks, XFS_ALLOC_SET_ASIDE(mp),
> > > +	if (__percpu_counter_compare(&mp->m_fdblocks, mp->m_alloc_set_aside,
> > >  				     XFS_FDBLOCKS_BATCH) >= 0) {
> > >  		/* we had space! */
> > >  		return 0;
> > > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> > > index 0ed0f29..b36676c 100644
> > > --- a/fs/xfs/xfs_mount.h
> > > +++ b/fs/xfs/xfs_mount.h
> > > @@ -123,6 +123,8 @@ typedef struct xfs_mount {
> > >  	uint			m_in_maxlevels;	/* max inobt btree levels. */
> > >  	uint			m_rmap_maxlevels; /* max rmap btree levels */
> > >  	xfs_extlen_t		m_ag_prealloc_blocks; /* reserved ag blocks */
> > > +	uint			m_alloc_set_aside; /* space we can't use */
> > > +	uint			m_ag_max_usable; /* max space per AG */
> > >  	struct radix_tree_root	m_perag_tree;	/* per-ag accounting info */
> > >  	spinlock_t		m_perag_lock;	/* lock for m_perag_tree */
> > >  	struct mutex		m_growlock;	/* growfs mutex */
> > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > > index bf63f6d..1575849 100644
> > > --- a/fs/xfs/xfs_super.c
> > > +++ b/fs/xfs/xfs_super.c
> > > @@ -1076,7 +1076,7 @@ xfs_fs_statfs(
> > >  	statp->f_blocks = sbp->sb_dblocks - lsize;
> > >  	spin_unlock(&mp->m_sb_lock);
> > >  
> > > -	statp->f_bfree = fdblocks - XFS_ALLOC_SET_ASIDE(mp);
> > > +	statp->f_bfree = fdblocks - mp->m_alloc_set_aside;
> > >  	statp->f_bavail = statp->f_bfree;
> > >  
> > >  	fakeinos = statp->f_bfree << sbp->sb_inopblog;
> > > 

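The set-aside arithmetic debated in this message can be checked with a plain C model of the quoted `xfs_alloc_set_aside()`. The formula below is the one from the patch as posted (including the `agcount * (2 * rmap_maxlevels) - 1` rmapbt term); as the review notes, `agcount * rmap_maxlevels` may be the intended value, so treat this as a sketch of the posted code, not the final calculation.

```c
#include <assert.h>

#define AGFL_RESERVE	4	/* XFS_ALLOC_AGFL_RESERVE in the patch */

/* model of the quoted xfs_alloc_set_aside(), not kernel code */
static unsigned int
alloc_set_aside(unsigned int agcount, unsigned int rmap_maxlevels,
		int has_rmapbt)
{
	/* 4 for a potential bmbt split + per-AG freelist reserve */
	unsigned int blocks = 4 + agcount * AGFL_RESERVE;

	if (!has_rmapbt)
		return blocks;
	/* rmapbt term as posted; under review in this thread */
	return blocks + (agcount * (2 * rmap_maxlevels) - 1);
}
```

For a 4-AG filesystem without rmap this gives 4 + 4*4 = 20 blocks; with rmap and 5 rmapbt levels it gives 20 + (4*10 - 1) = 59 blocks, which shows how quickly the rmapbt term dominates and why the thread scrutinizes it.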

* Re: [PATCH 039/119] xfs: add rmap btree insert and delete helpers
  2016-07-13 18:28   ` Brian Foster
@ 2016-07-13 18:37     ` Darrick J. Wong
  2016-07-13 18:42       ` Brian Foster
  0 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-07-13 18:37 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, vishal.l.verma, xfs, Dave Chinner

On Wed, Jul 13, 2016 at 02:28:13PM -0400, Brian Foster wrote:
> On Thu, Jun 16, 2016 at 06:22:02PM -0700, Darrick J. Wong wrote:
> > Add a couple of helper functions to encapsulate rmap btree insert and
> > delete operations.  Add tracepoints to the update function.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > Reviewed-by: Dave Chinner <dchinner@redhat.com>
> > Signed-off-by: Dave Chinner <david@fromorbit.com>
> > ---
> >  fs/xfs/libxfs/xfs_rmap.c       |   78 +++++++++++++++++++++++++++++++++++++++-
> >  fs/xfs/libxfs/xfs_rmap_btree.h |    3 ++
> >  2 files changed, 80 insertions(+), 1 deletion(-)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
> > index 53ba14e..f92eaa1 100644
> > --- a/fs/xfs/libxfs/xfs_rmap.c
> > +++ b/fs/xfs/libxfs/xfs_rmap.c
> > @@ -92,13 +92,89 @@ xfs_rmap_update(
> ...
> > +STATIC int
> > +xfs_rmapbt_delete(
> 
> This throws an unused warning that persists to the end of the rmap
> patches..?

Oh, yeah, we don't need it until "xfs: use interval query for rmap alloc
operations on shared files".  Will move.

--D

> 
> Brian
> 
> > +	struct xfs_btree_cur	*rcur,
> > +	xfs_agblock_t		agbno,
> > +	xfs_extlen_t		len,
> > +	uint64_t		owner,
> > +	uint64_t		offset,
> > +	unsigned int		flags)
> > +{
> > +	int			i;
> > +	int			error;
> > +
> > +	trace_xfs_rmapbt_delete(rcur->bc_mp, rcur->bc_private.a.agno, agbno,
> > +			len, owner, offset, flags);
> > +
> > +	error = xfs_rmap_lookup_eq(rcur, agbno, len, owner, offset, flags, &i);
> > +	if (error)
> > +		goto done;
> > +	XFS_WANT_CORRUPTED_GOTO(rcur->bc_mp, i == 1, done);
> > +
> > +	error = xfs_btree_delete(rcur, &i);
> > +	if (error)
> > +		goto done;
> > +	XFS_WANT_CORRUPTED_GOTO(rcur->bc_mp, i == 1, done);
> > +done:
> > +	if (error)
> > +		trace_xfs_rmapbt_delete_error(rcur->bc_mp,
> > +				rcur->bc_private.a.agno, error, _RET_IP_);
> > +	return error;
> >  }
> >  
> >  static int
> > diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
> > index 9d92da5..6674340 100644
> > --- a/fs/xfs/libxfs/xfs_rmap_btree.h
> > +++ b/fs/xfs/libxfs/xfs_rmap_btree.h
> > @@ -64,6 +64,9 @@ int xfs_rmap_lookup_le(struct xfs_btree_cur *cur, xfs_agblock_t bno,
> >  int xfs_rmap_lookup_eq(struct xfs_btree_cur *cur, xfs_agblock_t bno,
> >  		xfs_extlen_t len, uint64_t owner, uint64_t offset,
> >  		unsigned int flags, int *stat);
> > +int xfs_rmapbt_insert(struct xfs_btree_cur *rcur, xfs_agblock_t agbno,
> > +		xfs_extlen_t len, uint64_t owner, uint64_t offset,
> > +		unsigned int flags);
> >  int xfs_rmap_get_rec(struct xfs_btree_cur *cur, struct xfs_rmap_irec *irec,
> >  		int *stat);
> >  
> > 


* Re: [PATCH 039/119] xfs: add rmap btree insert and delete helpers
  2016-07-13 18:37     ` Darrick J. Wong
@ 2016-07-13 18:42       ` Brian Foster
  0 siblings, 0 replies; 236+ messages in thread
From: Brian Foster @ 2016-07-13 18:42 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, vishal.l.verma, xfs, Dave Chinner

On Wed, Jul 13, 2016 at 11:37:04AM -0700, Darrick J. Wong wrote:
> On Wed, Jul 13, 2016 at 02:28:13PM -0400, Brian Foster wrote:
> > On Thu, Jun 16, 2016 at 06:22:02PM -0700, Darrick J. Wong wrote:
> > > Add a couple of helper functions to encapsulate rmap btree insert and
> > > delete operations.  Add tracepoints to the update function.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > Reviewed-by: Dave Chinner <dchinner@redhat.com>
> > > Signed-off-by: Dave Chinner <david@fromorbit.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_rmap.c       |   78 +++++++++++++++++++++++++++++++++++++++-
> > >  fs/xfs/libxfs/xfs_rmap_btree.h |    3 ++
> > >  2 files changed, 80 insertions(+), 1 deletion(-)
> > > 
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
> > > index 53ba14e..f92eaa1 100644
> > > --- a/fs/xfs/libxfs/xfs_rmap.c
> > > +++ b/fs/xfs/libxfs/xfs_rmap.c
> > > @@ -92,13 +92,89 @@ xfs_rmap_update(
> > ...
> > > +STATIC int
> > > +xfs_rmapbt_delete(
> > 
> > This throws an unused warning that persists to the end of the rmap
> > patches..?
> 
> Oh, yeah, we don't need it until "xfs: use interval query for rmap alloc
> operations on shared files".  Will move.
> 

Ok, with that snipped out:

Reviewed-by: Brian Foster <bfoster@redhat.com>

> --D
> 
> > 
> > Brian
> > 
> > > +	struct xfs_btree_cur	*rcur,
> > > +	xfs_agblock_t		agbno,
> > > +	xfs_extlen_t		len,
> > > +	uint64_t		owner,
> > > +	uint64_t		offset,
> > > +	unsigned int		flags)
> > > +{
> > > +	int			i;
> > > +	int			error;
> > > +
> > > +	trace_xfs_rmapbt_delete(rcur->bc_mp, rcur->bc_private.a.agno, agbno,
> > > +			len, owner, offset, flags);
> > > +
> > > +	error = xfs_rmap_lookup_eq(rcur, agbno, len, owner, offset, flags, &i);
> > > +	if (error)
> > > +		goto done;
> > > +	XFS_WANT_CORRUPTED_GOTO(rcur->bc_mp, i == 1, done);
> > > +
> > > +	error = xfs_btree_delete(rcur, &i);
> > > +	if (error)
> > > +		goto done;
> > > +	XFS_WANT_CORRUPTED_GOTO(rcur->bc_mp, i == 1, done);
> > > +done:
> > > +	if (error)
> > > +		trace_xfs_rmapbt_delete_error(rcur->bc_mp,
> > > +				rcur->bc_private.a.agno, error, _RET_IP_);
> > > +	return error;
> > >  }
> > >  
> > >  static int
> > > diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
> > > index 9d92da5..6674340 100644
> > > --- a/fs/xfs/libxfs/xfs_rmap_btree.h
> > > +++ b/fs/xfs/libxfs/xfs_rmap_btree.h
> > > @@ -64,6 +64,9 @@ int xfs_rmap_lookup_le(struct xfs_btree_cur *cur, xfs_agblock_t bno,
> > >  int xfs_rmap_lookup_eq(struct xfs_btree_cur *cur, xfs_agblock_t bno,
> > >  		xfs_extlen_t len, uint64_t owner, uint64_t offset,
> > >  		unsigned int flags, int *stat);
> > > +int xfs_rmapbt_insert(struct xfs_btree_cur *rcur, xfs_agblock_t agbno,
> > > +		xfs_extlen_t len, uint64_t owner, uint64_t offset,
> > > +		unsigned int flags);
> > >  int xfs_rmap_get_rec(struct xfs_btree_cur *cur, struct xfs_rmap_irec *irec,
> > >  		int *stat);
> > >  
> > > 


* Re: [PATCH 040/119] xfs: create helpers for mapping, unmapping, and converting file fork extents
  2016-07-13 18:28   ` Brian Foster
@ 2016-07-13 18:47     ` Darrick J. Wong
  2016-07-13 23:54       ` Dave Chinner
  0 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-07-13 18:47 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, vishal.l.verma, xfs

On Wed, Jul 13, 2016 at 02:28:25PM -0400, Brian Foster wrote:
> On Thu, Jun 16, 2016 at 06:22:08PM -0700, Darrick J. Wong wrote:
> > Create two helper functions to assist with mapping, unmapping, and
> > converting flag status of extents in a file's data/attr forks.  For
> > non-shared files we can use the _alloc, _free, and _convert functions;
> > when reflink comes these functions will be augmented to deal with
> > shared extents.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/libxfs/xfs_rmap.c |   42 ++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 42 insertions(+)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
> > index f92eaa1..76fc5c2 100644
> > --- a/fs/xfs/libxfs/xfs_rmap.c
> > +++ b/fs/xfs/libxfs/xfs_rmap.c
> > @@ -1123,11 +1123,53 @@ done:
> >  	return error;
> >  }
> >  
> > +/*
> > + * Convert an unwritten extent to a real extent or vice versa.
> > + */
> > +STATIC int
> > +xfs_rmap_convert(
> > +	struct xfs_btree_cur	*cur,
> > +	xfs_agblock_t		bno,
> > +	xfs_extlen_t		len,
> > +	bool			unwritten,
> > +	struct xfs_owner_info	*oinfo)
> > +{
> > +	return __xfs_rmap_convert(cur, bno, len, unwritten, oinfo);
> > +}
> > +
> 
> Hmm, these all look like 1-1 mappings and they're static as well. Is the
> additional interface for reflink? If so, I think it might be better to
> punt this down to where it is really used (reflink).

Originally they were, but since the only caller of these functions is
_rmap_finish_one, this whole patch can drop out.

Later on in reflink, map/unmap/convert for reflinked files get totally
separate "shared" variants, along with corresponding RUI type codes.

Speaking of which, the shared and non-shared alloc/free/convert
functions are at a high level the same.  Each function has 8-10 places
where they differ (mostly in which btree functions they call) and I
wondered -- should I refactor them into a single megafunction that
takes a bunch of function pointers?  It's a little unwieldy to have
so much to pass in, but on the other hand we wouldn't have to maintain
two versions of basically the same code.
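
To make that concrete, here's a rough sketch of the ops-table approach
(every name and signature below is invented for illustration -- it is not
the actual XFS API, and the real callbacks would take a btree cursor and
full rmap records rather than bare integers):

```c
#include <stdint.h>

/* Hypothetical per-variant callback table: the shared and non-shared
 * rmap code would each supply their own btree routines here. */
struct rmap_btree_ops {
	int (*insert)(uint64_t bno, uint64_t len);
	int (*remove)(uint64_t bno, uint64_t len);
};

/* Stub "non-shared" backend standing in for the real btree calls. */
static int plain_inserts;
static int plain_insert(uint64_t bno, uint64_t len)
{
	(void)bno; (void)len;
	plain_inserts++;
	return 0;
}
static int plain_remove(uint64_t bno, uint64_t len)
{
	(void)bno; (void)len;
	return 0;
}
static const struct rmap_btree_ops plain_ops = {
	.insert = plain_insert,
	.remove = plain_remove,
};

/* One copy of the high-level map/unmap logic, parameterized by the ops
 * table; the 8-10 differing spots become indirect calls. */
int rmap_map_extent(const struct rmap_btree_ops *ops,
		    uint64_t bno, uint64_t len)
{
	if (len == 0)
		return -22;	/* common validation, shared by both variants */
	return ops->insert(bno, len);
}

int rmap_unmap_extent(const struct rmap_btree_ops *ops,
		      uint64_t bno, uint64_t len)
{
	if (len == 0)
		return -22;
	return ops->remove(bno, len);
}
```

A reflink-aware variant would just point the table at interval-query-based
routines while the surrounding contiguity logic stays in one place.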

--D

> 
> Brian
> 
> >  #undef	NEW
> >  #undef	LEFT
> >  #undef	RIGHT
> >  #undef	PREV
> >  
> > +/*
> > + * Find an extent in the rmap btree and unmap it.
> > + */
> > +STATIC int
> > +xfs_rmap_unmap(
> > +	struct xfs_btree_cur	*cur,
> > +	xfs_agblock_t		bno,
> > +	xfs_extlen_t		len,
> > +	bool			unwritten,
> > +	struct xfs_owner_info	*oinfo)
> > +{
> > +	return __xfs_rmap_free(cur, bno, len, unwritten, oinfo);
> > +}
> > +
> > +/*
> > + * Find an extent in the rmap btree and map it.
> > + */
> > +STATIC int
> > +xfs_rmap_map(
> > +	struct xfs_btree_cur	*cur,
> > +	xfs_agblock_t		bno,
> > +	xfs_extlen_t		len,
> > +	bool			unwritten,
> > +	struct xfs_owner_info	*oinfo)
> > +{
> > +	return __xfs_rmap_alloc(cur, bno, len, unwritten, oinfo);
> > +}
> > +
> >  struct xfs_rmapbt_query_range_info {
> >  	xfs_rmapbt_query_range_fn	fn;
> >  	void				*priv;
> > 


* Re: [PATCH 038/119] xfs: convert unwritten status of reverse mappings
  2016-07-13 18:27   ` Brian Foster
@ 2016-07-13 20:43     ` Darrick J. Wong
  0 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-07-13 20:43 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, vishal.l.verma, xfs

On Wed, Jul 13, 2016 at 02:27:55PM -0400, Brian Foster wrote:
> On Thu, Jun 16, 2016 at 06:21:55PM -0700, Darrick J. Wong wrote:
> > Provide a function to convert an unwritten extent to a real one and
> > vice versa.
> > 
> > v2: Move unwritten bit to rm_offset.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> 
> Just a few nits below. Those aside and with Darrick's bc_rec.b ->
> bc_rec.r fix:
> 
> Reviewed-by: Brian Foster <bfoster@redhat.com>
> 
> >  fs/xfs/libxfs/xfs_rmap.c |  442 ++++++++++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/xfs_trace.h       |    6 +
> >  2 files changed, 448 insertions(+)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
> > index 1043c63..53ba14e 100644
> > --- a/fs/xfs/libxfs/xfs_rmap.c
> > +++ b/fs/xfs/libxfs/xfs_rmap.c
> > @@ -610,6 +610,448 @@ out_error:
> >  	return error;
> >  }
> >  
> > +#define RMAP_LEFT_CONTIG	(1 << 0)
> > +#define RMAP_RIGHT_CONTIG	(1 << 1)
> > +#define RMAP_LEFT_FILLING	(1 << 2)
> > +#define RMAP_RIGHT_FILLING	(1 << 3)
> > +#define RMAP_LEFT_VALID		(1 << 6)
> > +#define RMAP_RIGHT_VALID	(1 << 7)
> > +
> > +#define LEFT		r[0]
> > +#define RIGHT		r[1]
> > +#define PREV		r[2]
> > +#define NEW		r[3]
> > +
> > +/*
> > + * Convert an unwritten extent to a real extent or vice versa.
> > + * Does not handle overlapping extents.
> > + */
> > +STATIC int
> > +__xfs_rmap_convert(
> > +	struct xfs_btree_cur	*cur,
> > +	xfs_agblock_t		bno,
> > +	xfs_extlen_t		len,
> > +	bool			unwritten,
> > +	struct xfs_owner_info	*oinfo)
> > +{
> ...
> > +
> > +	/*
> > +	 * For the initial lookup, look for and exact match or the left-adjacent
> 
> Typo:					    an
> 
> > +	 * record for our insertion point. This will also give us the record for
> > +	 * start block contiguity tests.
> > +	 */
> > +	error = xfs_rmap_lookup_le(cur, bno, len, owner, offset, oldext, &i);
> > +	if (error)
> > +		goto done;
> > +	XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
> > +
> ...
> > +
> > +	/*
> > +	 * Switch out based on the FILLING and CONTIG state bits.
> > +	 */
> > +	switch (state & (RMAP_LEFT_FILLING | RMAP_LEFT_CONTIG |
> > +			 RMAP_RIGHT_FILLING | RMAP_RIGHT_CONTIG)) {
> ...
> > +	case RMAP_LEFT_FILLING | RMAP_RIGHT_FILLING | RMAP_RIGHT_CONTIG:
> > +		/*
> > +		 * Setting all of a previous oldext extent to newext.
> > +		 * The right neighbor is contiguous, the left is not.
> > +		 */
> > +		error = xfs_btree_increment(cur, 0, &i);
> > +		if (error)
> > +			goto done;
> > +		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
> > +		trace_xfs_rmapbt_delete(mp, cur->bc_private.a.agno,
> > +				RIGHT.rm_startblock, RIGHT.rm_blockcount,
> > +				RIGHT.rm_owner, RIGHT.rm_offset,
> > +				RIGHT.rm_flags);
> > +		error = xfs_btree_delete(cur, &i);
> > +		if (error)
> > +			goto done;
> > +		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
> > +		error = xfs_btree_decrement(cur, 0, &i);
> > +		if (error)
> > +			goto done;
> > +		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
> > +		NEW.rm_startblock = bno;
> > +		NEW.rm_owner = owner;
> > +		NEW.rm_offset = offset;
> 
> 		NEW = PREV ?
> 
> > +		NEW.rm_blockcount = len + RIGHT.rm_blockcount;
> > +		NEW.rm_flags = newext;
> > +		error = xfs_rmap_update(cur, &NEW);
> > +		if (error)
> > +			goto done;
> > +		break;
> > +
> ...
> >  struct xfs_rmapbt_query_range_info {
> >  	xfs_rmapbt_query_range_fn	fn;
> >  	void				*priv;
> > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > index 3ebceb0..6466adc 100644
> > --- a/fs/xfs/xfs_trace.h
> > +++ b/fs/xfs/xfs_trace.h
> > @@ -2497,6 +2497,10 @@ DEFINE_RMAP_EVENT(xfs_rmap_free_extent_error);
> >  DEFINE_RMAP_EVENT(xfs_rmap_alloc_extent);
> >  DEFINE_RMAP_EVENT(xfs_rmap_alloc_extent_done);
> >  DEFINE_RMAP_EVENT(xfs_rmap_alloc_extent_error);
> > +DEFINE_RMAP_EVENT(xfs_rmap_convert);
> > +DEFINE_RMAP_EVENT(xfs_rmap_convert_done);
> > +DEFINE_AG_ERROR_EVENT(xfs_rmap_convert_error);
> > +DEFINE_AG_ERROR_EVENT(xfs_rmap_convert_state);
> >  
> >  DECLARE_EVENT_CLASS(xfs_rmapbt_class,
> >  	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
> > @@ -2551,6 +2555,8 @@ DEFINE_AG_ERROR_EVENT(xfs_rmapbt_delete_error);
> >  DEFINE_AG_ERROR_EVENT(xfs_rmapbt_update_error);
> >  DEFINE_RMAPBT_EVENT(xfs_rmap_lookup_le_range_result);
> >  DEFINE_RMAPBT_EVENT(xfs_rmap_map_gtrec);
> > +DEFINE_RMAPBT_EVENT(xfs_rmap_convert_gtrec);
> > +DEFINE_RMAPBT_EVENT(xfs_rmap_find_left_neighbor_result);
> 
> xfs_rmap_convert_ltrec ?

Originally there were an xfs_rmap_convert_ltrec and an xfs_map_convert_ltrec.
Then I had to create a real "find left extent" helper function for the
reflink versions of map/convert, and that became
xfs_rmap_find_left_neighbor*, with its own tracepoint.

It seemed silly to have different tracepoints for "here's what I found when
I went looking for a left-adjacent extent", so the non-reflink versions of
map/convert simply started (ab)using the xfs_rmap_find_left_neighbor_result
tracepoint, even though the non-shared versions open-code btree cursor
manipulation without doing a lookup.

I could refactor the whole mess to have functions to find the left and right
neighbors in shared and not-shared mode, but I find it easier to keep track
of the cursor manipulation if they all stay in one function.

Oh.  Or I could just change the xfs_rmap_*_gtrec tracepoints into
xfs_rmap_find_right_neighbor_result.

Yeah, I'll do that, since we already have a tracepoint at the top of the
function telling us what we're doing.

--D

> 
> Brian
> 
> >  
> >  #endif /* _TRACE_XFS_H */
> >  
> > 


* Re: [PATCH 006/119] xfs: port differences from xfsprogs libxfs
  2016-06-20  0:21   ` Dave Chinner
@ 2016-07-13 23:39     ` Darrick J. Wong
  0 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-07-13 23:39 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, vishal.l.verma, xfs

On Mon, Jun 20, 2016 at 10:21:07AM +1000, Dave Chinner wrote:
> On Thu, Jun 16, 2016 at 06:18:30PM -0700, Darrick J. Wong wrote:
> > Port various differences between xfsprogs and the kernel.  This
> > cleans up both so that we can develop rmap and reflink on the
> > same libxfs code.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Nak. I'm essentially trying to keep the little hacks needed in 
> userspace out of the kernel libxfs tree. We quite regularly get
> people scanning the kernel tree and trying to remove things like
> exported function prototypes that are not used in kernel space,
> so the headers in userspace carry those simply to prevent people
> continually sending kernel patches that we have to look at and then
> ignore...

Fair enough; I merely diff'd the two libxfs trees and figured I'd remove
all the differences so I could develop atop libxfs code that was as close
to identical as possible. :)

> > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> > index 99b077c..58bdca7 100644
> > --- a/fs/xfs/libxfs/xfs_alloc.c
> > +++ b/fs/xfs/libxfs/xfs_alloc.c
> > @@ -2415,7 +2415,9 @@ xfs_alloc_read_agf(
> >  			be32_to_cpu(agf->agf_levels[XFS_BTNUM_CNTi]);
> >  		spin_lock_init(&pag->pagb_lock);
> >  		pag->pagb_count = 0;
> > +#ifdef __KERNEL__
> >  		pag->pagb_tree = RB_ROOT;
> > +#endif
> >  		pag->pagf_init = 1;
> >  	}
> >  #ifdef DEBUG
> 
> e.g. this is an indication that reminds us that there is
> functionality in the libxfs kernel tree that isn't in userspace...
> 
> > diff --git a/fs/xfs/libxfs/xfs_attr_leaf.h b/fs/xfs/libxfs/xfs_attr_leaf.h
> > index 4f2aed0..8ef420a 100644
> > --- a/fs/xfs/libxfs/xfs_attr_leaf.h
> > +++ b/fs/xfs/libxfs/xfs_attr_leaf.h
> > @@ -51,7 +51,7 @@ int	xfs_attr_shortform_getvalue(struct xfs_da_args *args);
> >  int	xfs_attr_shortform_to_leaf(struct xfs_da_args *args);
> >  int	xfs_attr_shortform_remove(struct xfs_da_args *args);
> >  int	xfs_attr_shortform_allfit(struct xfs_buf *bp, struct xfs_inode *dp);
> > -int	xfs_attr_shortform_bytesfit(xfs_inode_t *dp, int bytes);
> > +int	xfs_attr_shortform_bytesfit(struct xfs_inode *dp, int bytes);
> >  void	xfs_attr_fork_remove(struct xfs_inode *ip, struct xfs_trans *tp);
> 
> Things like this are fine...

Ok.

> >  
> >  /*
> > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > index 932381c..499e980 100644
> > --- a/fs/xfs/libxfs/xfs_bmap.c
> > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > @@ -1425,7 +1425,7 @@ xfs_bmap_search_multi_extents(
> >   * Else, *lastxp will be set to the index of the found
> >   * entry; *gotp will contain the entry.
> >   */
> > -STATIC xfs_bmbt_rec_host_t *                 /* pointer to found extent entry */
> > +xfs_bmbt_rec_host_t *                 /* pointer to found extent entry */
> >  xfs_bmap_search_extents(
> >  	xfs_inode_t     *ip,            /* incore inode pointer */
> >  	xfs_fileoff_t   bno,            /* block number searched for */
> > diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
> > index 423a34e..79e3ebe 100644
> > --- a/fs/xfs/libxfs/xfs_bmap.h
> > +++ b/fs/xfs/libxfs/xfs_bmap.h
> > @@ -231,4 +231,10 @@ int	xfs_bmap_shift_extents(struct xfs_trans *tp, struct xfs_inode *ip,
> >  		int num_exts);
> >  int	xfs_bmap_split_extent(struct xfs_inode *ip, xfs_fileoff_t split_offset);
> >  
> > +struct xfs_bmbt_rec_host *
> > +	xfs_bmap_search_extents(struct xfs_inode *ip, xfs_fileoff_t bno,
> > +				int fork, int *eofp, xfs_extnum_t *lastxp,
> > +				struct xfs_bmbt_irec *gotp,
> > +				struct xfs_bmbt_irec *prevp);
> > +
> >  #endif	/* __XFS_BMAP_H__ */
> 
> But these are the sort of "clean up the kernel patches" that I was
> refering to. If there's a user in kernel space, then fine, otherwise
> it doesn't hurt to keep it only in userspace. There are relatively
> few of these....
> 
> > diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> > index 1f88e1c..105979d 100644
> > --- a/fs/xfs/libxfs/xfs_btree.c
> > +++ b/fs/xfs/libxfs/xfs_btree.c
> > @@ -2532,6 +2532,7 @@ error0:
> >  	return error;
> >  }
> >  
> > +#ifdef __KERNEL__
> >  struct xfs_btree_split_args {
> >  	struct xfs_btree_cur	*cur;
> >  	int			level;
> > @@ -2609,6 +2610,9 @@ xfs_btree_split(
> >  	destroy_work_on_stack(&args.work);
> >  	return args.result;
> >  }
> > +#else /* !KERNEL */
> > +#define xfs_btree_split	__xfs_btree_split
> > +#endif
> 
> Same again -this is 4 lines of code that is userspace only. It's a
> tiny amount compared to the original difference that these
> kernel-only stack splits required, and so not a huge issue.

Will drop these two.

> > --- a/fs/xfs/libxfs/xfs_dquot_buf.c
> > +++ b/fs/xfs/libxfs/xfs_dquot_buf.c
> > @@ -31,10 +31,16 @@
> >  #include "xfs_cksum.h"
> >  #include "xfs_trace.h"
> >  
> > +/*
> > + * XXX: kernel implementation causes ndquots calc to go real
> > + * bad. Just leaving the existing userspace calc here right now.
> > + */
> >  int
> >  xfs_calc_dquots_per_chunk(
> >  	unsigned int		nbblks)	/* basic block units */
> >  {
> > +#ifdef __KERNEL__
> > +	/* kernel code that goes wrong in userspace! */
> >  	unsigned int	ndquots;
> >  
> >  	ASSERT(nbblks > 0);
> > @@ -42,6 +48,10 @@ xfs_calc_dquots_per_chunk(
> >  	do_div(ndquots, sizeof(xfs_dqblk_t));
> >  
> >  	return ndquots;
> > +#else
> > +	ASSERT(nbblks > 0);
> > +	return BBTOB(nbblks) / sizeof(xfs_dqblk_t);
> > +#endif
> >  }
> 
> This is a clear case that we need to fix the code to be
> correct for both kernel and userspace without modification, not
> propagate the userspace hack back into the kernel code.

I /think/ it does this because libxfs/libxfs_priv.h's __do_div expects
to be passed a pointer to an unsigned long long (which is later dereferenced
and used as an unsigned long), whereas ndquots is an int?

I'm not sure why we need do_div either, since AFAICT we only ever process
quota in chunks of 1 FSB, for which 32-bit division should be fine.
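
For illustration only, the arithmetic in question reduces to something like
the sketch below; written with an explicit 64-bit dividend, the same
expression compiles identically in kernel and userspace, with no do_div()
(and no pointer-width mismatch in its userspace emulation) involved.  The
512-byte basic block and 136-byte on-disk dquot size are hard-coded
assumptions here, standing in for BBTOB() and sizeof(xfs_dqblk_t):

```c
#include <stdint.h>

/* Assumed constants, standing in for BBTOB(1) and sizeof(xfs_dqblk_t). */
#define BASIC_BLOCK_BYTES	512u
#define DQBLK_BYTES		136u

/*
 * Illustrative dquots-per-chunk calculation: widening the dividend to
 * 64 bits explicitly makes plain division safe in both environments,
 * sidestepping do_div() entirely.
 */
static inline uint32_t calc_dquots_per_chunk(uint32_t nbblks)
{
	return (uint32_t)(((uint64_t)nbblks * BASIC_BLOCK_BYTES) /
			  DQBLK_BYTES);
}
```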

> > diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> > index 9d9559e..794fa66 100644
> > --- a/fs/xfs/libxfs/xfs_inode_buf.c
> > +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> > @@ -56,6 +56,17 @@ xfs_inobp_check(
> >  }
> >  #endif
> >  
> > +bool
> > +xfs_dinode_good_version(
> > +	struct xfs_mount *mp,
> > +	__u8		version)
> > +{
> > +	if (xfs_sb_version_hascrc(&mp->m_sb))
> > +		return version == 3;
> > +
> > +	return version == 1 || version == 2;
> > +}
> 
> This xfs_dinode_good_version() change needs to be a separate patch

Ok.

> >  void	xfs_inobp_check(struct xfs_mount *, struct xfs_buf *);
> >  #else
> > diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
> > index e8f49c0..e5baba3 100644
> > --- a/fs/xfs/libxfs/xfs_log_format.h
> > +++ b/fs/xfs/libxfs/xfs_log_format.h
> > @@ -462,8 +462,8 @@ static inline uint xfs_log_dinode_size(int version)
> >  typedef struct xfs_buf_log_format {
> >  	unsigned short	blf_type;	/* buf log item type indicator */
> >  	unsigned short	blf_size;	/* size of this item */
> > -	ushort		blf_flags;	/* misc state */
> > -	ushort		blf_len;	/* number of blocks in this buf */
> > +	unsigned short	blf_flags;	/* misc state */
> > +	unsigned short	blf_len;	/* number of blocks in this buf */
> >  	__int64_t	blf_blkno;	/* starting blkno of this buf */
> >  	unsigned int	blf_map_size;	/* used size of data bitmap in words */
> >  	unsigned int	blf_data_map[XFS_BLF_DATAMAP_SIZE]; /* dirty bitmap */
> 
> The removal of ushort/uint from the kernel code needs to be a
> separate patch that addresses all the users, not just the couple in
> shared headers....

Ok.

> > diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
> > index 12ca867..09d6fd0 100644
> > --- a/fs/xfs/libxfs/xfs_sb.c
> > +++ b/fs/xfs/libxfs/xfs_sb.c
> > @@ -261,6 +261,7 @@ xfs_mount_validate_sb(
> >  	/*
> >  	 * Until this is fixed only page-sized or smaller data blocks work.
> >  	 */
> > +#ifdef __KERNEL__
> >  	if (unlikely(sbp->sb_blocksize > PAGE_SIZE)) {
> >  		xfs_warn(mp,
> >  		"File system with blocksize %d bytes. "
> > @@ -268,6 +269,7 @@ xfs_mount_validate_sb(
> >  				sbp->sb_blocksize, PAGE_SIZE);
> >  		return -ENOSYS;
> >  	}
> > +#endif
> >  
> >  	/*
> >  	 * Currently only very few inode sizes are supported.
> > @@ -291,10 +293,12 @@ xfs_mount_validate_sb(
> >  		return -EFBIG;
> >  	}
> >  
> > +#ifdef __KERNEL__
> >  	if (check_inprogress && sbp->sb_inprogress) {
> >  		xfs_warn(mp, "Offline file system operation in progress!");
> >  		return -EFSCORRUPTED;
> >  	}
> > +#endif
> >  	return 0;
> >  }
> 
> Again, I don't think this needs to be propagated back into the
> kernel code...

Will drop.

--D
> 
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH 031/119] xfs: rmap btree requires more reserved free space
  2016-07-13 18:32       ` Brian Foster
@ 2016-07-13 23:50         ` Dave Chinner
  0 siblings, 0 replies; 236+ messages in thread
From: Dave Chinner @ 2016-07-13 23:50 UTC (permalink / raw)
  To: Brian Foster
  Cc: Darrick J. Wong, linux-fsdevel, vishal.l.verma, Dave Chinner, xfs

On Wed, Jul 13, 2016 at 02:32:17PM -0400, Brian Foster wrote:
> On Wed, Jul 13, 2016 at 09:50:08AM -0700, Darrick J. Wong wrote:
> > On Fri, Jul 08, 2016 at 09:21:55AM -0400, Brian Foster wrote:
> > > >  /*
> > > > + * In order to avoid ENOSPC-related deadlock caused by out-of-order locking of
> > > > + * AGF buffer (PV 947395), we place constraints on the relationship among
> > > > + * actual allocations for data blocks, freelist blocks, and potential file data
> > > > + * bmap btree blocks. However, these restrictions may result in no actual space
> > > > + * allocated for a delayed extent, for example, a data block in a certain AG is
> > > > + * allocated but there is no additional block for the additional bmap btree
> > > > + * block due to a split of the bmap btree of the file. The result of this may
> > > > + * lead to an infinite loop when the file gets flushed to disk and all delayed
> > > > + * extents need to be actually allocated. To get around this, we explicitly set
> > > > + * aside a few blocks which will not be reserved in delayed allocation.
> > > > + *
> > > > + * The minimum number of needed freelist blocks is 4 fsbs _per AG_ when we are
> > > > + * not using rmap btrees a potential split of file's bmap btree requires 1 fsb,
> > > > + * so we set the number of set-aside blocks to 4 + 4*agcount when not using
> > > > + * rmap btrees.
> > > > + *
> > > 
> > > That's a bit wordy.
> > 
> > Yikes, that whole thing is a single sentence!
> > 
> > One thing I'm not really sure about is how "a potential split of file's bmap
> > btree requires 1 fsb" seems to translate to 4 in the actual formula.  I'd
> > have thought it would be m_bm_maxlevels or something... not just 4.
> > 
> 
> I'm not sure about that either, tbh.

So, a trip down memory lane. 

<wavy line dissolve>

Back in 2006, I fixed a bug that changed XFS_ALLOC_SET_ASIDE from
a  fixed value of 8 blocks to 4 blocks + 4 AGFL blocks per AG in
commit 4be536de ("[XFS] Prevent free space
oversubscription and xfssyncd looping."). The original value of
8 was for 4 blocks for the bmbt split, and 4 blocks from the current
AG for the AGFL (commit message explains the reason this was a
problem (Yay for writing good commit messages 10 years ago!)). The
original comment text was:

- * reserved in delayed allocation. Considering the minimum number of
- * needed freelist blocks is 4 fsbs, a potential split of file's bmap
- * btree requires 1 fsb, so we set the number of set-aside blocks to 8.
-*/

So we need to go back further. We have an obvious git log search
target in the comment (PV#947395), and that points to:

commit d210a28cd851082cec9b282443f8cc0e6fc09830
Author: Yingping Lu <yingping@sgi.com>
Date:   Fri Jun 9 14:55:18 2006 +1000

    [XFS] In actual allocation of file system blocks and freeing extents, the
    transaction within each such operation may involve multiple locking of AGF
    buffer. While the freeing extent function has sorted the extents based on
    AGF number before entering into transaction, however, when the file system
    space is very limited, the allocation of space would try every AGF to get
    space allocated, this could potentially cause out-of-order locking, thus
    deadlock could happen. This fix mitigates the scarce space for allocation
    by setting aside a few blocks without reservation, and avoid deadlock by
    maintaining ascending order of AGF locking.
    
    SGI-PV: 947395
    SGI-Modid: xfs-linux-melb:xfs-kern:210801a
    
    Signed-off-by: Yingping Lu <yingping@sgi.com>
    Signed-off-by: Nathan Scott <nathans@sgi.com>

Which tells us nothing about why 1 fsb for the bmbt split was
actually reserved as 4fsbs. IIRC, I ended up having to find and fix
the problem because Yingping left SGI soon after this fix was made,
and at the time nobody understood or could work out why that was
done. It worked, however, so we left it that way, and just fixed the
per-ag reservation problem this bug fix had.


> > /* 
> >  * When rmap is disabled, we need to reserve 4 fsbs _per AG_ for the freelist
> >  * and 4 more to handle a potential split of the file's bmap btree.

As such, I'm not sure that is any more correct than the original
comment.

Looking back on this now with 10 years more time working on XFS, my
suspicion is that a single level bmap btree split requires 1
block to be allocated, but that allocation will call
xfs_alloc_fix_freelist() to refill the freelist to the minimum
(which is 4 blocks), and so we need at least 4 blocks for the
allocation to succeed (4 blocks for the freelist fill, and if we are
at ENOSPC then the bmap btree block will be allocated from the
AGFL).

Whether the value of 4 is correct or not for this purpose is just a
guess based on the per-ag AGFL requirements, so my only comment
right now is: it's worked for 10 years, so let's not change it until
there's at least some evidence that it is wrong.
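
In code form, the set-aside being discussed is just the following (a
sketch only -- the real macro at the time was XFS_ALLOC_SET_ASIDE(mp),
and this helper name is invented):

```c
#include <stdint.h>

/*
 * Sketch of the pre-rmap set-aside: 4 blocks per AG to keep each AGFL
 * at its minimum length, plus 4 more so a bmbt split near ENOSPC --
 * 1 block for the new btree block, allocated from an AGFL that
 * xfs_alloc_fix_freelist() has just refilled -- can still succeed.
 */
static inline uint32_t alloc_set_aside(uint32_t agcount)
{
	return 4 + 4 * agcount;
}
```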

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 040/119] xfs: create helpers for mapping, unmapping, and converting file fork extents
  2016-07-13 18:47     ` Darrick J. Wong
@ 2016-07-13 23:54       ` Dave Chinner
  2016-07-13 23:55         ` Darrick J. Wong
  0 siblings, 1 reply; 236+ messages in thread
From: Dave Chinner @ 2016-07-13 23:54 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, linux-fsdevel, vishal.l.verma, xfs

On Wed, Jul 13, 2016 at 11:47:50AM -0700, Darrick J. Wong wrote:
> On Wed, Jul 13, 2016 at 02:28:25PM -0400, Brian Foster wrote:
> > On Thu, Jun 16, 2016 at 06:22:08PM -0700, Darrick J. Wong wrote:
> > > Create two helper functions to assist with mapping, unmapping, and
> > > converting flag status of extents in a file's data/attr forks.  For
> > > non-shared files we can use the _alloc, _free, and _convert functions;
> > > when reflink comes these functions will be augmented to deal with
> > > shared extents.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_rmap.c |   42 ++++++++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 42 insertions(+)
> > > 
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
> > > index f92eaa1..76fc5c2 100644
> > > --- a/fs/xfs/libxfs/xfs_rmap.c
> > > +++ b/fs/xfs/libxfs/xfs_rmap.c
> > > @@ -1123,11 +1123,53 @@ done:
> > >  	return error;
> > >  }
> > >  
> > > +/*
> > > + * Convert an unwritten extent to a real extent or vice versa.
> > > + */
> > > +STATIC int
> > > +xfs_rmap_convert(
> > > +	struct xfs_btree_cur	*cur,
> > > +	xfs_agblock_t		bno,
> > > +	xfs_extlen_t		len,
> > > +	bool			unwritten,
> > > +	struct xfs_owner_info	*oinfo)
> > > +{
> > > +	return __xfs_rmap_convert(cur, bno, len, unwritten, oinfo);
> > > +}
> > > +
> > 
> > Hmm, these all look like 1-1 mappings and they're static as well. Is the
> > additional interface for reflink? If so, I think it might be better to
> > punt this down to where it is really used (reflink).
> 
> Originally they were, but since the only caller of these functions is
> _rmap_finish_one, this whole patch can drop out.
> 
> Later on in reflink, map/unmap/convert for reflinked files get totally
> separate "shared" variants, along with corresponding RUI type codes.
> 
> Speaking of which, the shared and non-shared alloc/free/convert
> functions are at a high level the same.  Each function has 8-10 places
> where they differ (mostly in which btree functions they call) and I
> wondered -- should I refactor them into a single megafunction that
> takes a bunch of function pointers?

Use an ops structure containing function pointers. But that can be
done once the code is merged - it doesn't need to be done right
away.

> It's a little unwieldly to have
> so much to pass in, but on the other hand we wouldn't have to maintain
> two versions of basically the same code.

An ops structure fixes that problem.


Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 040/119] xfs: create helpers for mapping, unmapping, and converting file fork extents
  2016-07-13 23:54       ` Dave Chinner
@ 2016-07-13 23:55         ` Darrick J. Wong
  0 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-07-13 23:55 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Brian Foster, linux-fsdevel, vishal.l.verma, xfs

On Thu, Jul 14, 2016 at 09:54:08AM +1000, Dave Chinner wrote:
> On Wed, Jul 13, 2016 at 11:47:50AM -0700, Darrick J. Wong wrote:
> > On Wed, Jul 13, 2016 at 02:28:25PM -0400, Brian Foster wrote:
> > > On Thu, Jun 16, 2016 at 06:22:08PM -0700, Darrick J. Wong wrote:
> > > > Create two helper functions to assist with mapping, unmapping, and
> > > > converting flag status of extents in a file's data/attr forks.  For
> > > > non-shared files we can use the _alloc, _free, and _convert functions;
> > > > when reflink comes these functions will be augmented to deal with
> > > > shared extents.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > ---
> > > >  fs/xfs/libxfs/xfs_rmap.c |   42 ++++++++++++++++++++++++++++++++++++++++++
> > > >  1 file changed, 42 insertions(+)
> > > > 
> > > > 
> > > > diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
> > > > index f92eaa1..76fc5c2 100644
> > > > --- a/fs/xfs/libxfs/xfs_rmap.c
> > > > +++ b/fs/xfs/libxfs/xfs_rmap.c
> > > > @@ -1123,11 +1123,53 @@ done:
> > > >  	return error;
> > > >  }
> > > >  
> > > > +/*
> > > > + * Convert an unwritten extent to a real extent or vice versa.
> > > > + */
> > > > +STATIC int
> > > > +xfs_rmap_convert(
> > > > +	struct xfs_btree_cur	*cur,
> > > > +	xfs_agblock_t		bno,
> > > > +	xfs_extlen_t		len,
> > > > +	bool			unwritten,
> > > > +	struct xfs_owner_info	*oinfo)
> > > > +{
> > > > +	return __xfs_rmap_convert(cur, bno, len, unwritten, oinfo);
> > > > +}
> > > > +
> > > 
> > > Hmm, these all look like 1-1 mappings and they're static as well. Is the
> > > additional interface for reflink? If so, I think it might be better to
> > > punt this down to where it is really used (reflink).
> > 
> > Originally they were, but since the only caller of these functions is
> > _rmap_finish_one, this whole patch can drop out.
> > 
> > Later on in reflink, map/unmap/convert for reflinked files get totally
> > separate "shared" variants, along with corresponding RUI type codes.
> > 
> > Speaking of which, the shared and non-shared alloc/free/convert
> > functions are at a high level the same.  Each function has 8-10 places
> > where they differ (mostly in which btree functions they call) and I
> > wondered -- should I refactor them into a single megafunction that
> > takes a bunch of function pointers?
> 
> Use an ops structure containing function pointers. But that can be
> done once the code is merged - it doesn't need to be done right
> away.
> 
> > It's a little unwieldly to have
> > so much to pass in, but on the other hand we wouldn't have to maintain
> > two versions of basically the same code.
> 
> An ops structure fixes that problem.

I actually did mean an ops structure; passing in function pointers as
args is wayyyy fugly... but I'd considered that making a bunch of small
functions + a struct might not be much better. :)
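[Editorial sketch of the ops-structure shape under discussion. All names here (struct rmap_ops, rmap_update(), the callbacks) are hypothetical stand-ins, not code from the patchset:]

```c
#include <assert.h>

/*
 * Hypothetical sketch: collect the btree callbacks that differ between
 * the shared and non-shared rmap update paths into one ops structure,
 * so a single "megafunction" can drive both variants.  None of these
 * names come from the patchset.
 */
struct rmap_ops {
	int (*insert)(int bno, int len);	/* differs per variant */
	int (*remove)(int bno, int len);
};

static int plain_insert(int bno, int len) { return bno + len; }
static int plain_remove(int bno, int len) { return bno; }

/* Non-shared variant: one static ops table instead of 8-10 fn args. */
static const struct rmap_ops nonshared_ops = {
	.insert	= plain_insert,
	.remove	= plain_remove,
};

/* The common function takes the ops table, not loose pointers. */
static int rmap_update(const struct rmap_ops *ops, int bno, int len)
{
	return ops->insert(bno, len);
}
```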

--D

> 
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 041/119] xfs: create rmap update intent log items
  2016-06-17  1:22 ` [PATCH 041/119] xfs: create rmap update intent log items Darrick J. Wong
@ 2016-07-15 18:33   ` Brian Foster
  2016-07-16  7:10     ` Darrick J. Wong
  0 siblings, 1 reply; 236+ messages in thread
From: Brian Foster @ 2016-07-15 18:33 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, vishal.l.verma, xfs

On Thu, Jun 16, 2016 at 06:22:14PM -0700, Darrick J. Wong wrote:
> Create rmap update intent/done log items to record redo information in
> the log.  Because we need to roll transactions between updating the
> bmbt mapping and updating the reverse mapping, we also have to track
> the status of the metadata updates that will be recorded in the
> post-roll transactions, just in case we crash before committing the
> final transaction.  This mechanism enables log recovery to finish what
> was already started.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

A couple nits below, otherwise looks good:

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/Makefile                |    1 
>  fs/xfs/libxfs/xfs_log_format.h |   67 ++++++
>  fs/xfs/libxfs/xfs_rmap_btree.h |   19 ++
>  fs/xfs/xfs_rmap_item.c         |  459 ++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_rmap_item.h         |  100 +++++++++
>  fs/xfs/xfs_super.c             |   21 ++
>  6 files changed, 665 insertions(+), 2 deletions(-)
>  create mode 100644 fs/xfs/xfs_rmap_item.c
>  create mode 100644 fs/xfs/xfs_rmap_item.h
> 
> 
> diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
> index 2de8c20..8ae0a10 100644
> --- a/fs/xfs/Makefile
> +++ b/fs/xfs/Makefile
> @@ -104,6 +104,7 @@ xfs-y				+= xfs_log.o \
>  				   xfs_extfree_item.o \
>  				   xfs_icreate_item.o \
>  				   xfs_inode_item.o \
> +				   xfs_rmap_item.o \
>  				   xfs_log_recover.o \
>  				   xfs_trans_ail.o \
>  				   xfs_trans_buf.o \
> diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
> index e5baba3..b9627b7 100644
> --- a/fs/xfs/libxfs/xfs_log_format.h
> +++ b/fs/xfs/libxfs/xfs_log_format.h
> @@ -110,7 +110,9 @@ static inline uint xlog_get_cycle(char *ptr)
>  #define XLOG_REG_TYPE_COMMIT		18
>  #define XLOG_REG_TYPE_TRANSHDR		19
>  #define XLOG_REG_TYPE_ICREATE		20
> -#define XLOG_REG_TYPE_MAX		20
> +#define XLOG_REG_TYPE_RUI_FORMAT	21
> +#define XLOG_REG_TYPE_RUD_FORMAT	22
> +#define XLOG_REG_TYPE_MAX		22
>  
>  /*
>   * Flags to log operation header
> @@ -227,6 +229,8 @@ typedef struct xfs_trans_header {
>  #define	XFS_LI_DQUOT		0x123d
>  #define	XFS_LI_QUOTAOFF		0x123e
>  #define	XFS_LI_ICREATE		0x123f
> +#define	XFS_LI_RUI		0x1240	/* rmap update intent */
> +#define	XFS_LI_RUD		0x1241	/* rmap update done */
>  
>  #define XFS_LI_TYPE_DESC \
>  	{ XFS_LI_EFI,		"XFS_LI_EFI" }, \
> @@ -236,7 +240,9 @@ typedef struct xfs_trans_header {
>  	{ XFS_LI_BUF,		"XFS_LI_BUF" }, \
>  	{ XFS_LI_DQUOT,		"XFS_LI_DQUOT" }, \
>  	{ XFS_LI_QUOTAOFF,	"XFS_LI_QUOTAOFF" }, \
> -	{ XFS_LI_ICREATE,	"XFS_LI_ICREATE" }
> +	{ XFS_LI_ICREATE,	"XFS_LI_ICREATE" }, \
> +	{ XFS_LI_RUI,		"XFS_LI_RUI" }, \
> +	{ XFS_LI_RUD,		"XFS_LI_RUD" }
>  
>  /*
>   * Inode Log Item Format definitions.
> @@ -604,6 +610,63 @@ typedef struct xfs_efd_log_format_64 {
>  } xfs_efd_log_format_64_t;
>  
>  /*
> + * RUI/RUD (reverse mapping) log format definitions
> + */
> +struct xfs_map_extent {
> +	__uint64_t		me_owner;
> +	__uint64_t		me_startblock;
> +	__uint64_t		me_startoff;
> +	__uint32_t		me_len;
> +	__uint32_t		me_flags;
> +};
> +
> +/* rmap me_flags: upper bits are flags, lower byte is type code */
> +#define XFS_RMAP_EXTENT_MAP		1
> +#define XFS_RMAP_EXTENT_MAP_SHARED	2
> +#define XFS_RMAP_EXTENT_UNMAP		3
> +#define XFS_RMAP_EXTENT_UNMAP_SHARED	4
> +#define XFS_RMAP_EXTENT_CONVERT		5
> +#define XFS_RMAP_EXTENT_CONVERT_SHARED	6
> +#define XFS_RMAP_EXTENT_ALLOC		7
> +#define XFS_RMAP_EXTENT_FREE		8
> +#define XFS_RMAP_EXTENT_TYPE_MASK	0xFF

I assume all of the _SHARED stuff defined here and throughout is not
used until reflink.. (not that big of a deal if it's a PITA to remove).

> +
> +#define XFS_RMAP_EXTENT_ATTR_FORK	(1U << 31)
> +#define XFS_RMAP_EXTENT_BMBT_BLOCK	(1U << 30)
> +#define XFS_RMAP_EXTENT_UNWRITTEN	(1U << 29)
> +
> +#define XFS_RMAP_EXTENT_FLAGS		(XFS_RMAP_EXTENT_TYPE_MASK | \
> +					 XFS_RMAP_EXTENT_ATTR_FORK | \
> +					 XFS_RMAP_EXTENT_BMBT_BLOCK | \
> +					 XFS_RMAP_EXTENT_UNWRITTEN)
> +
> +/*
> + * This is the structure used to lay out an rui log item in the
> + * log.  The rui_extents field is a variable size array whose
> + * size is given by rui_nextents.
> + */
> +struct xfs_rui_log_format {
> +	__uint16_t		rui_type;	/* rui log item type */
> +	__uint16_t		rui_size;	/* size of this item */
> +	__uint32_t		rui_nextents;	/* # extents to rmap */
> +	__uint64_t		rui_id;		/* rui identifier */
> +	struct xfs_map_extent	rui_extents[1];	/* array of extents to rmap */
> +};
> +
> +/*
> + * This is the structure used to lay out an rud log item in the
> + * log.  The rud_extents array is a variable size array whose
> + * size is given by rud_nextents.
> + */
> +struct xfs_rud_log_format {
> +	__uint16_t		rud_type;	/* rud log item type */
> +	__uint16_t		rud_size;	/* size of this item */
> +	__uint32_t		rud_nextents;	/* # of extents rmapped */
> +	__uint64_t		rud_rui_id;	/* id of corresponding rui */
> +	struct xfs_map_extent	rud_extents[1];	/* array of extents rmapped */
> +};
> +
> +/*
>   * Dquot Log format definitions.
>   *
>   * The first two fields must be the type and size fitting into
...
> diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c
> new file mode 100644
> index 0000000..91a3b2c
> --- /dev/null
> +++ b/fs/xfs/xfs_rmap_item.c
> @@ -0,0 +1,459 @@
...
> +/*
> + * Copy an RUI format buffer from the given buf, and into the destination
> + * RUI format structure.  The RUI/RUD items were designed not to need any
> + * special alignment handling.
> + */
> +int
> +xfs_rui_copy_format(
> +	struct xfs_log_iovec		*buf,
> +	struct xfs_rui_log_format	*dst_rui_fmt)
> +{
> +	struct xfs_rui_log_format	*src_rui_fmt;
> +	uint				len;
> +
> +	src_rui_fmt = buf->i_addr;
> +	len = sizeof(struct xfs_rui_log_format) +
> +			(src_rui_fmt->rui_nextents - 1) *
> +			sizeof(struct xfs_map_extent);
> +
> +	if (buf->i_len == len) {
> +		memcpy((char *)dst_rui_fmt, (char *)src_rui_fmt, len);
> +		return 0;
> +	}
> +	return -EFSCORRUPTED;

I'd switch this around since we don't have the mess that
xfs_efi_copy_format() has to deal with. E.g.,

	if (buf->i_len != len)
		return -EFSCORRUPTED;

	memcpy(..);
	return 0;

Brian

> +}
> +
> +/*
> + * Freeing the RUI requires that we remove it from the AIL if it has already
> + * been placed there. However, the RUI may not yet have been placed in the AIL
> + * when called by xfs_rui_release() from RUD processing due to the ordering of
> + * committed vs unpin operations in bulk insert operations. Hence the reference
> + * count to ensure only the last caller frees the RUI.
> + */
> +void
> +xfs_rui_release(
> +	struct xfs_rui_log_item	*ruip)
> +{
> +	if (atomic_dec_and_test(&ruip->rui_refcount)) {
> +		xfs_trans_ail_remove(&ruip->rui_item, SHUTDOWN_LOG_IO_ERROR);
> +		xfs_rui_item_free(ruip);
> +	}
> +}
> +
> +static inline struct xfs_rud_log_item *RUD_ITEM(struct xfs_log_item *lip)
> +{
> +	return container_of(lip, struct xfs_rud_log_item, rud_item);
> +}
> +
> +STATIC void
> +xfs_rud_item_free(struct xfs_rud_log_item *rudp)
> +{
> +	if (rudp->rud_format.rud_nextents > XFS_RUD_MAX_FAST_EXTENTS)
> +		kmem_free(rudp);
> +	else
> +		kmem_zone_free(xfs_rud_zone, rudp);
> +}
> +
> +/*
> + * This returns the number of iovecs needed to log the given rud item.
> + * We only need 1 iovec for an rud item.  It just logs the rud_log_format
> + * structure.
> + */
> +static inline int
> +xfs_rud_item_sizeof(
> +	struct xfs_rud_log_item	*rudp)
> +{
> +	return sizeof(struct xfs_rud_log_format) +
> +			(rudp->rud_format.rud_nextents - 1) *
> +			sizeof(struct xfs_map_extent);
> +}
> +
> +STATIC void
> +xfs_rud_item_size(
> +	struct xfs_log_item	*lip,
> +	int			*nvecs,
> +	int			*nbytes)
> +{
> +	*nvecs += 1;
> +	*nbytes += xfs_rud_item_sizeof(RUD_ITEM(lip));
> +}
> +
> +/*
> + * This is called to fill in the vector of log iovecs for the
> + * given rud log item. We use only 1 iovec, and we point that
> + * at the rud_log_format structure embedded in the rud item.
> + * It is at this point that we assert that all of the extent
> + * slots in the rud item have been filled.
> + */
> +STATIC void
> +xfs_rud_item_format(
> +	struct xfs_log_item	*lip,
> +	struct xfs_log_vec	*lv)
> +{
> +	struct xfs_rud_log_item	*rudp = RUD_ITEM(lip);
> +	struct xfs_log_iovec	*vecp = NULL;
> +
> +	ASSERT(rudp->rud_next_extent == rudp->rud_format.rud_nextents);
> +
> +	rudp->rud_format.rud_type = XFS_LI_RUD;
> +	rudp->rud_format.rud_size = 1;
> +
> +	xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_RUD_FORMAT, &rudp->rud_format,
> +			xfs_rud_item_sizeof(rudp));
> +}
> +
> +/*
> + * Pinning has no meaning for an rud item, so just return.
> + */
> +STATIC void
> +xfs_rud_item_pin(
> +	struct xfs_log_item	*lip)
> +{
> +}
> +
> +/*
> + * Since pinning has no meaning for an rud item, unpinning does
> + * not either.
> + */
> +STATIC void
> +xfs_rud_item_unpin(
> +	struct xfs_log_item	*lip,
> +	int			remove)
> +{
> +}
> +
> +/*
> + * There isn't much you can do to push on an rud item.  It is simply stuck
> + * waiting for the log to be flushed to disk.
> + */
> +STATIC uint
> +xfs_rud_item_push(
> +	struct xfs_log_item	*lip,
> +	struct list_head	*buffer_list)
> +{
> +	return XFS_ITEM_PINNED;
> +}
> +
> +/*
> + * The RUD is either committed or aborted if the transaction is cancelled. If
> + * the transaction is cancelled, drop our reference to the RUI and free the
> + * RUD.
> + */
> +STATIC void
> +xfs_rud_item_unlock(
> +	struct xfs_log_item	*lip)
> +{
> +	struct xfs_rud_log_item	*rudp = RUD_ITEM(lip);
> +
> +	if (lip->li_flags & XFS_LI_ABORTED) {
> +		xfs_rui_release(rudp->rud_ruip);
> +		xfs_rud_item_free(rudp);
> +	}
> +}
> +
> +/*
> + * When the rud item is committed to disk, all we need to do is delete our
> + * reference to our partner rui item and then free ourselves. Since we're
> + * freeing ourselves we must return -1 to keep the transaction code from
> + * further referencing this item.
> + */
> +STATIC xfs_lsn_t
> +xfs_rud_item_committed(
> +	struct xfs_log_item	*lip,
> +	xfs_lsn_t		lsn)
> +{
> +	struct xfs_rud_log_item	*rudp = RUD_ITEM(lip);
> +
> +	/*
> +	 * Drop the RUI reference regardless of whether the RUD has been
> +	 * aborted. Once the RUD transaction is constructed, it is the sole
> +	 * responsibility of the RUD to release the RUI (even if the RUI is
> +	 * aborted due to log I/O error).
> +	 */
> +	xfs_rui_release(rudp->rud_ruip);
> +	xfs_rud_item_free(rudp);
> +
> +	return (xfs_lsn_t)-1;
> +}
> +
> +/*
> + * The RUD dependency tracking op doesn't do squat.  It can't because
> + * it doesn't know where the free extent is coming from.  The dependency
> + * tracking has to be handled by the "enclosing" metadata object.  For
> + * example, for inodes, the inode is locked throughout the extent freeing
> + * so the dependency should be recorded there.
> + */
> +STATIC void
> +xfs_rud_item_committing(
> +	struct xfs_log_item	*lip,
> +	xfs_lsn_t		lsn)
> +{
> +}
> +
> +/*
> + * This is the ops vector shared by all rud log items.
> + */
> +static const struct xfs_item_ops xfs_rud_item_ops = {
> +	.iop_size	= xfs_rud_item_size,
> +	.iop_format	= xfs_rud_item_format,
> +	.iop_pin	= xfs_rud_item_pin,
> +	.iop_unpin	= xfs_rud_item_unpin,
> +	.iop_unlock	= xfs_rud_item_unlock,
> +	.iop_committed	= xfs_rud_item_committed,
> +	.iop_push	= xfs_rud_item_push,
> +	.iop_committing = xfs_rud_item_committing,
> +};
> +
> +/*
> + * Allocate and initialize an rud item with the given number of extents.
> + */
> +struct xfs_rud_log_item *
> +xfs_rud_init(
> +	struct xfs_mount		*mp,
> +	struct xfs_rui_log_item		*ruip,
> +	uint				nextents)
> +
> +{
> +	struct xfs_rud_log_item	*rudp;
> +	uint			size;
> +
> +	ASSERT(nextents > 0);
> +	if (nextents > XFS_RUD_MAX_FAST_EXTENTS) {
> +		size = (uint)(sizeof(struct xfs_rud_log_item) +
> +			((nextents - 1) * sizeof(struct xfs_map_extent)));
> +		rudp = kmem_zalloc(size, KM_SLEEP);
> +	} else {
> +		rudp = kmem_zone_zalloc(xfs_rud_zone, KM_SLEEP);
> +	}
> +
> +	xfs_log_item_init(mp, &rudp->rud_item, XFS_LI_RUD, &xfs_rud_item_ops);
> +	rudp->rud_ruip = ruip;
> +	rudp->rud_format.rud_nextents = nextents;
> +	rudp->rud_format.rud_rui_id = ruip->rui_format.rui_id;
> +
> +	return rudp;
> +}
> diff --git a/fs/xfs/xfs_rmap_item.h b/fs/xfs/xfs_rmap_item.h
> new file mode 100644
> index 0000000..bd36ab5
> --- /dev/null
> +++ b/fs/xfs/xfs_rmap_item.h
> @@ -0,0 +1,100 @@
> +/*
> + * Copyright (C) 2016 Oracle.  All Rights Reserved.
> + *
> + * Author: Darrick J. Wong <darrick.wong@oracle.com>
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version 2
> + * of the License, or (at your option) any later version.
> + *
> + * This program is distributed in the hope that it would be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write the Free Software Foundation,
> + * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
> + */
> +#ifndef	__XFS_RMAP_ITEM_H__
> +#define	__XFS_RMAP_ITEM_H__
> +
> +/*
> + * There are (currently) three pairs of rmap btree redo item types: map, unmap,
> + * and convert.  The common abbreviations for these are RUI (rmap update
> + * intent) and RUD (rmap update done).  The redo item type is encoded in the
> + * flags field of each xfs_map_extent.
> + *
> + * *I items should be recorded in the *first* of a series of rolled
> + * transactions, and the *D items should be recorded in the same transaction
> + * that records the associated rmapbt updates.  Typically, the first
> + * transaction will record a bmbt update, followed by some number of
> + * transactions containing rmapbt updates, and finally transactions with any
> + * bnobt/cntbt updates.
> + *
> + * Should the system crash after the commit of the first transaction but
> + * before the commit of the final transaction in a series, log recovery will
> + * use the redo information recorded by the intent items to replay the
> + * (rmapbt/bnobt/cntbt) metadata updates in the non-first transaction.
> + */
> +
> +/* kernel only RUI/RUD definitions */
> +
> +struct xfs_mount;
> +struct kmem_zone;
> +
> +/*
> + * Max number of extents in fast allocation path.
> + */
> +#define	XFS_RUI_MAX_FAST_EXTENTS	16
> +
> +/*
> + * Define RUI flag bits. Manipulated by set/clear/test_bit operators.
> + */
> +#define	XFS_RUI_RECOVERED		1
> +
> +/*
> + * This is the "rmap update intent" log item.  It is used to log the fact that
> + * some reverse mappings need to change.  It is used in conjunction with the
> + * "rmap update done" log item described below.
> + *
> + * These log items follow the same rules as struct xfs_efi_log_item; see the
> + * comments about that structure (in xfs_extfree_item.h) for more details.
> + */
> +struct xfs_rui_log_item {
> +	struct xfs_log_item		rui_item;
> +	atomic_t			rui_refcount;
> +	atomic_t			rui_next_extent;
> +	unsigned long			rui_flags;	/* misc flags */
> +	struct xfs_rui_log_format	rui_format;
> +};
> +
> +/*
> + * This is the "rmap update done" log item.  It is used to log the fact that
> + * some rmapbt updates mentioned in an earlier rui item have been performed.
> + */
> +struct xfs_rud_log_item {
> +	struct xfs_log_item		rud_item;
> +	struct xfs_rui_log_item		*rud_ruip;
> +	uint				rud_next_extent;
> +	struct xfs_rud_log_format	rud_format;
> +};
> +
> +/*
> + * Max number of extents in fast allocation path.
> + */
> +#define	XFS_RUD_MAX_FAST_EXTENTS	16
> +
> +extern struct kmem_zone	*xfs_rui_zone;
> +extern struct kmem_zone	*xfs_rud_zone;
> +
> +struct xfs_rui_log_item *xfs_rui_init(struct xfs_mount *, uint);
> +struct xfs_rud_log_item *xfs_rud_init(struct xfs_mount *,
> +		struct xfs_rui_log_item *, uint);
> +int xfs_rui_copy_format(struct xfs_log_iovec *buf,
> +		struct xfs_rui_log_format *dst_rui_fmt);
> +void xfs_rui_item_free(struct xfs_rui_log_item *);
> +void xfs_rui_release(struct xfs_rui_log_item *);
> +
> +#endif	/* __XFS_RMAP_ITEM_H__ */
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 1575849..a8300e4 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -47,6 +47,7 @@
>  #include "xfs_sysfs.h"
>  #include "xfs_ondisk.h"
>  #include "xfs_defer.h"
> +#include "xfs_rmap_item.h"
>  
>  #include <linux/namei.h>
>  #include <linux/init.h>
> @@ -1762,8 +1763,26 @@ xfs_init_zones(void)
>  	if (!xfs_icreate_zone)
>  		goto out_destroy_ili_zone;
>  
> +	xfs_rud_zone = kmem_zone_init((sizeof(struct xfs_rud_log_item) +
> +			((XFS_RUD_MAX_FAST_EXTENTS - 1) *
> +				 sizeof(struct xfs_map_extent))),
> +			"xfs_rud_item");
> +	if (!xfs_rud_zone)
> +		goto out_destroy_icreate_zone;
> +
> +	xfs_rui_zone = kmem_zone_init((sizeof(struct xfs_rui_log_item) +
> +			((XFS_RUI_MAX_FAST_EXTENTS - 1) *
> +				sizeof(struct xfs_map_extent))),
> +			"xfs_rui_item");
> +	if (!xfs_rui_zone)
> +		goto out_destroy_rud_zone;
> +
>  	return 0;
>  
> + out_destroy_rud_zone:
> +	kmem_zone_destroy(xfs_rud_zone);
> + out_destroy_icreate_zone:
> +	kmem_zone_destroy(xfs_icreate_zone);
>   out_destroy_ili_zone:
>  	kmem_zone_destroy(xfs_ili_zone);
>   out_destroy_inode_zone:
> @@ -1802,6 +1821,8 @@ xfs_destroy_zones(void)
>  	 * destroy caches.
>  	 */
>  	rcu_barrier();
> +	kmem_zone_destroy(xfs_rui_zone);
> +	kmem_zone_destroy(xfs_rud_zone);
>  	kmem_zone_destroy(xfs_icreate_zone);
>  	kmem_zone_destroy(xfs_ili_zone);
>  	kmem_zone_destroy(xfs_inode_zone);
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 042/119] xfs: log rmap intent items
  2016-06-17  1:22 ` [PATCH 042/119] xfs: log rmap intent items Darrick J. Wong
@ 2016-07-15 18:33   ` Brian Foster
  2016-07-16  7:34     ` Darrick J. Wong
  0 siblings, 1 reply; 236+ messages in thread
From: Brian Foster @ 2016-07-15 18:33 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, vishal.l.verma, xfs

On Thu, Jun 16, 2016 at 06:22:21PM -0700, Darrick J. Wong wrote:
> Provide a mechanism for higher levels to create RUI/RUD items, submit
> them to the log, and a stub function to deal with recovered RUI items.
> These parts will be connected to the rmapbt in a later patch.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

The commit log makes no mention of log recovery.. perhaps this should be
split in two?

>  fs/xfs/Makefile          |    1 
>  fs/xfs/xfs_log_recover.c |  344 +++++++++++++++++++++++++++++++++++++++++++++-
>  fs/xfs/xfs_trans.h       |   17 ++
>  fs/xfs/xfs_trans_rmap.c  |  235 +++++++++++++++++++++++++++++++
>  4 files changed, 589 insertions(+), 8 deletions(-)
>  create mode 100644 fs/xfs/xfs_trans_rmap.c
> 
> 
> diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
> index 8ae0a10..1980110 100644
> --- a/fs/xfs/Makefile
> +++ b/fs/xfs/Makefile
> @@ -110,6 +110,7 @@ xfs-y				+= xfs_log.o \
>  				   xfs_trans_buf.o \
>  				   xfs_trans_extfree.o \
>  				   xfs_trans_inode.o \
> +				   xfs_trans_rmap.o \
>  
>  # optional features
>  xfs-$(CONFIG_XFS_QUOTA)		+= xfs_dquot.o \
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index b33187b..c9fe0c4 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -44,6 +44,7 @@
>  #include "xfs_bmap_btree.h"
>  #include "xfs_error.h"
>  #include "xfs_dir2.h"
> +#include "xfs_rmap_item.h"
>  
>  #define BLK_AVG(blk1, blk2)	((blk1+blk2) >> 1)
>  
> @@ -1912,6 +1913,8 @@ xlog_recover_reorder_trans(
>  		case XFS_LI_QUOTAOFF:
>  		case XFS_LI_EFD:
>  		case XFS_LI_EFI:
> +		case XFS_LI_RUI:
> +		case XFS_LI_RUD:
>  			trace_xfs_log_recover_item_reorder_tail(log,
>  							trans, item, pass);
>  			list_move_tail(&item->ri_list, &inode_list);
> @@ -3416,6 +3419,101 @@ xlog_recover_efd_pass2(
>  }
>  
>  /*
> + * This routine is called to create an in-core extent rmap update
> + * item from the rui format structure which was logged on disk.
> + * It allocates an in-core rui, copies the extents from the format
> + * structure into it, and adds the rui to the AIL with the given
> + * LSN.
> + */
> +STATIC int
> +xlog_recover_rui_pass2(
> +	struct xlog			*log,
> +	struct xlog_recover_item	*item,
> +	xfs_lsn_t			lsn)
> +{
> +	int				error;
> +	struct xfs_mount		*mp = log->l_mp;
> +	struct xfs_rui_log_item		*ruip;
> +	struct xfs_rui_log_format	*rui_formatp;
> +
> +	rui_formatp = item->ri_buf[0].i_addr;
> +
> +	ruip = xfs_rui_init(mp, rui_formatp->rui_nextents);
> +	error = xfs_rui_copy_format(&item->ri_buf[0], &ruip->rui_format);
> +	if (error) {
> +		xfs_rui_item_free(ruip);
> +		return error;
> +	}
> +	atomic_set(&ruip->rui_next_extent, rui_formatp->rui_nextents);
> +
> +	spin_lock(&log->l_ailp->xa_lock);
> +	/*
> +	 * The RUI has two references. One for the RUD and one for RUI to ensure
> +	 * it makes it into the AIL. Insert the RUI into the AIL directly and
> +	 * drop the RUI reference. Note that xfs_trans_ail_update() drops the
> +	 * AIL lock.
> +	 */
> +	xfs_trans_ail_update(log->l_ailp, &ruip->rui_item, lsn);
> +	xfs_rui_release(ruip);
> +	return 0;
> +}
> +
> +
> +/*
> + * This routine is called when an RUD format structure is found in a committed
> + * transaction in the log. Its purpose is to cancel the corresponding RUI if it
> + * was still in the log. To do this it searches the AIL for the RUI with an id
> + * equal to that in the RUD format structure. If we find it we drop the RUD
> + * reference, which removes the RUI from the AIL and frees it.
> + */
> +STATIC int
> +xlog_recover_rud_pass2(
> +	struct xlog			*log,
> +	struct xlog_recover_item	*item)
> +{
> +	struct xfs_rud_log_format	*rud_formatp;
> +	struct xfs_rui_log_item		*ruip = NULL;
> +	struct xfs_log_item		*lip;
> +	__uint64_t			rui_id;
> +	struct xfs_ail_cursor		cur;
> +	struct xfs_ail			*ailp = log->l_ailp;
> +
> +	rud_formatp = item->ri_buf[0].i_addr;
> +	ASSERT(item->ri_buf[0].i_len == (sizeof(struct xfs_rud_log_format) +
> +			((rud_formatp->rud_nextents - 1) *
> +			sizeof(struct xfs_map_extent))));
> +	rui_id = rud_formatp->rud_rui_id;
> +
> +	/*
> +	 * Search for the RUI with the id in the RUD format structure in the
> +	 * AIL.
> +	 */
> +	spin_lock(&ailp->xa_lock);
> +	lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
> +	while (lip != NULL) {
> +		if (lip->li_type == XFS_LI_RUI) {
> +			ruip = (struct xfs_rui_log_item *)lip;
> +			if (ruip->rui_format.rui_id == rui_id) {
> +				/*
> +				 * Drop the RUD reference to the RUI. This
> +				 * removes the RUI from the AIL and frees it.
> +				 */
> +				spin_unlock(&ailp->xa_lock);
> +				xfs_rui_release(ruip);
> +				spin_lock(&ailp->xa_lock);
> +				break;
> +			}
> +		}
> +		lip = xfs_trans_ail_cursor_next(ailp, &cur);
> +	}
> +
> +	xfs_trans_ail_cursor_done(&cur);
> +	spin_unlock(&ailp->xa_lock);
> +
> +	return 0;
> +}
> +
> +/*
>   * This routine is called when an inode create format structure is found in a
>   * committed transaction in the log.  Its purpose is to initialise the inodes
>   * being allocated on disk. This requires us to get inode cluster buffers that
> @@ -3640,6 +3738,8 @@ xlog_recover_ra_pass2(
>  	case XFS_LI_EFI:
>  	case XFS_LI_EFD:
>  	case XFS_LI_QUOTAOFF:
> +	case XFS_LI_RUI:
> +	case XFS_LI_RUD:
>  	default:
>  		break;
>  	}
> @@ -3663,6 +3763,8 @@ xlog_recover_commit_pass1(
>  	case XFS_LI_EFD:
>  	case XFS_LI_DQUOT:
>  	case XFS_LI_ICREATE:
> +	case XFS_LI_RUI:
> +	case XFS_LI_RUD:
>  		/* nothing to do in pass 1 */
>  		return 0;
>  	default:
> @@ -3693,6 +3795,10 @@ xlog_recover_commit_pass2(
>  		return xlog_recover_efi_pass2(log, item, trans->r_lsn);
>  	case XFS_LI_EFD:
>  		return xlog_recover_efd_pass2(log, item);
> +	case XFS_LI_RUI:
> +		return xlog_recover_rui_pass2(log, item, trans->r_lsn);
> +	case XFS_LI_RUD:
> +		return xlog_recover_rud_pass2(log, item);
>  	case XFS_LI_DQUOT:
>  		return xlog_recover_dquot_pass2(log, buffer_list, item,
>  						trans->r_lsn);
> @@ -4165,6 +4271,18 @@ xlog_recover_process_data(
>  	return 0;
>  }
>  
> +/* Is this log item a deferred action intent? */
> +static inline bool xlog_item_is_intent(struct xfs_log_item *lip)
> +{
> +	switch (lip->li_type) {
> +	case XFS_LI_EFI:
> +	case XFS_LI_RUI:
> +		return true;
> +	default:
> +		return false;
> +	}
> +}
> +
>  /*
>   * Process an extent free intent item that was recovered from
>   * the log.  We need to free the extents that it describes.
> @@ -4265,17 +4383,23 @@ xlog_recover_process_efis(
>  	lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
>  	while (lip != NULL) {
>  		/*
> -		 * We're done when we see something other than an EFI.
> -		 * There should be no EFIs left in the AIL now.
> +		 * We're done when we see something other than an intent.
> +		 * There should be no intents left in the AIL now.
>  		 */
> -		if (lip->li_type != XFS_LI_EFI) {
> +		if (!xlog_item_is_intent(lip)) {
>  #ifdef DEBUG
>  			for (; lip; lip = xfs_trans_ail_cursor_next(ailp, &cur))
> -				ASSERT(lip->li_type != XFS_LI_EFI);
> +				ASSERT(!xlog_item_is_intent(lip));
>  #endif
>  			break;
>  		}
>  
> +		/* Skip anything that isn't an EFI */
> +		if (lip->li_type != XFS_LI_EFI) {
> +			lip = xfs_trans_ail_cursor_next(ailp, &cur);
> +			continue;
> +		}
> +

Hmm, so previously this function used the existence of any non-EFI item
as an end of traversal marker, since the freeing operations add more
items to the AIL. It's not immediately clear to me whether this is just
an efficiency thing or a potential problem, but I wonder if we should
grab the last item and use that or its lsn as an end of list marker.

At the very least we need to update the comment at the top of the
function wrt the current behavior.
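[Editorial sketch of the end-of-list-marker alternative floated above. The types and names are purely illustrative; the real code would record the LSN of the AIL tail before processing and compare against it while walking:]

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Illustrative sketch, not kernel code: record the highest LSN present
 * when recovery starts and stop the walk once past it, so items that
 * the processing appends (at higher LSNs) are never visited.
 */
struct item {
	long	lsn;
	bool	is_intent;
};

static int process_intents(const struct item *items, int n, long end_lsn)
{
	int done = 0;

	for (int i = 0; i < n; i++) {
		if (items[i].lsn > end_lsn)
			break;		/* appended during recovery */
		if (items[i].is_intent)
			done++;		/* stand-in for processing one EFI/RUI */
	}
	return done;
}
```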

>  		/*
>  		 * Skip EFIs that we've already processed.
>  		 */
> @@ -4320,14 +4444,20 @@ xlog_recover_cancel_efis(
>  		 * We're done when we see something other than an EFI.
>  		 * There should be no EFIs left in the AIL now.
>  		 */

Need to update this comment as for process_efis()...

> -		if (lip->li_type != XFS_LI_EFI) {
> +		if (!xlog_item_is_intent(lip)) {
>  #ifdef DEBUG
>  			for (; lip; lip = xfs_trans_ail_cursor_next(ailp, &cur))
> -				ASSERT(lip->li_type != XFS_LI_EFI);
> +				ASSERT(!xlog_item_is_intent(lip));
>  #endif
>  			break;
>  		}
>  
> +		/* Skip anything that isn't an EFI */
> +		if (lip->li_type != XFS_LI_EFI) {
> +			lip = xfs_trans_ail_cursor_next(ailp, &cur);
> +			continue;
> +		}
> +
>  		efip = container_of(lip, struct xfs_efi_log_item, efi_item);
>  
>  		spin_unlock(&ailp->xa_lock);
> @@ -4343,6 +4473,190 @@ xlog_recover_cancel_efis(
>  }
>  
>  /*
> + * Process an rmap update intent item that was recovered from the log.
> + * We need to update the rmapbt.
> + */
> +STATIC int
> +xlog_recover_process_rui(
> +	struct xfs_mount		*mp,
> +	struct xfs_rui_log_item		*ruip)
> +{
> +	int				i;
> +	int				error = 0;
> +	struct xfs_map_extent		*rmap;
> +	xfs_fsblock_t			startblock_fsb;
> +	bool				op_ok;
> +
> +	ASSERT(!test_bit(XFS_RUI_RECOVERED, &ruip->rui_flags));
> +
> +	/*
> +	 * First check the validity of the extents described by the
> +	 * RUI.  If any are bad, then assume that all are bad and
> +	 * just toss the RUI.
> +	 */
> +	for (i = 0; i < ruip->rui_format.rui_nextents; i++) {
> +		rmap = &(ruip->rui_format.rui_extents[i]);
> +		startblock_fsb = XFS_BB_TO_FSB(mp,
> +				   XFS_FSB_TO_DADDR(mp, rmap->me_startblock));
> +		switch (rmap->me_flags & XFS_RMAP_EXTENT_TYPE_MASK) {
> +		case XFS_RMAP_EXTENT_MAP:
> +		case XFS_RMAP_EXTENT_MAP_SHARED:
> +		case XFS_RMAP_EXTENT_UNMAP:
> +		case XFS_RMAP_EXTENT_UNMAP_SHARED:
> +		case XFS_RMAP_EXTENT_CONVERT:
> +		case XFS_RMAP_EXTENT_CONVERT_SHARED:
> +		case XFS_RMAP_EXTENT_ALLOC:
> +		case XFS_RMAP_EXTENT_FREE:
> +			op_ok = true;
> +			break;
> +		default:
> +			op_ok = false;
> +			break;
> +		}
> +		if (!op_ok || (startblock_fsb == 0) ||
> +		    (rmap->me_len == 0) ||
> +		    (startblock_fsb >= mp->m_sb.sb_dblocks) ||
> +		    (rmap->me_len >= mp->m_sb.sb_agblocks) ||
> +		    (rmap->me_flags & ~XFS_RMAP_EXTENT_FLAGS)) {
> +			/*
> +			 * This will pull the RUI from the AIL and
> +			 * free the memory associated with it.
> +			 */
> +			set_bit(XFS_RUI_RECOVERED, &ruip->rui_flags);
> +			xfs_rui_release(ruip);
> +			return -EIO;
> +		}
> +	}
> +
> +	/* XXX: do nothing for now */
> +	set_bit(XFS_RUI_RECOVERED, &ruip->rui_flags);
> +	xfs_rui_release(ruip);
> +	return error;
> +}
> +
> +/*
> + * When this is called, all of the RUIs which did not have
> + * corresponding RUDs should be in the AIL.  What we do now
> + * is update the rmaps associated with each one.
> + *
> + * Since we process the RUIs in normal transactions, they
> + * will be removed at some point after the commit.  This prevents
> + * us from just walking down the list processing each one.
> + * We'll use a flag in the RUI to skip those that we've already
> + * processed and use the AIL iteration mechanism's generation
> + * count to try to speed this up at least a bit.
> + *
> + * When we start, we know that the RUIs are the only things in
> + * the AIL.  As we process them, however, other items are added
> + * to the AIL.  Since everything added to the AIL must come after
> + * everything already in the AIL, we stop processing as soon as
> + * we see something other than an RUI in the AIL.
> + */
> +STATIC int
> +xlog_recover_process_ruis(
> +	struct xlog		*log)
> +{
> +	struct xfs_log_item	*lip;
> +	struct xfs_rui_log_item	*ruip;
> +	int			error = 0;
> +	struct xfs_ail_cursor	cur;
> +	struct xfs_ail		*ailp;
> +
> +	ailp = log->l_ailp;
> +	spin_lock(&ailp->xa_lock);
> +	lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
> +	while (lip != NULL) {
> +		/*
> +		 * We're done when we see something other than an intent.
> +		 * There should be no intents left in the AIL now.
> +		 */
> +		if (!xlog_item_is_intent(lip)) {
> +#ifdef DEBUG
> +			for (; lip; lip = xfs_trans_ail_cursor_next(ailp, &cur))
> +				ASSERT(!xlog_item_is_intent(lip));
> +#endif
> +			break;
> +		}
> +
> +		/* Skip anything that isn't an RUI */
> +		if (lip->li_type != XFS_LI_RUI) {
> +			lip = xfs_trans_ail_cursor_next(ailp, &cur);
> +			continue;
> +		}
> +
> +		/*
> +		 * Skip RUIs that we've already processed.
> +		 */
> +		ruip = container_of(lip, struct xfs_rui_log_item, rui_item);
> +		if (test_bit(XFS_RUI_RECOVERED, &ruip->rui_flags)) {
> +			lip = xfs_trans_ail_cursor_next(ailp, &cur);
> +			continue;
> +		}
> +
> +		spin_unlock(&ailp->xa_lock);
> +		error = xlog_recover_process_rui(log->l_mp, ruip);
> +		spin_lock(&ailp->xa_lock);
> +		if (error)
> +			goto out;
> +		lip = xfs_trans_ail_cursor_next(ailp, &cur);
> +	}
> +out:
> +	xfs_trans_ail_cursor_done(&cur);
> +	spin_unlock(&ailp->xa_lock);
> +	return error;
> +}
> +
> +/*
> + * A cancel occurs when the mount has failed and we're bailing out. Release all
> + * pending RUIs so they don't pin the AIL.
> + */
> +STATIC int
> +xlog_recover_cancel_ruis(
> +	struct xlog		*log)
> +{
> +	struct xfs_log_item	*lip;
> +	struct xfs_rui_log_item	*ruip;
> +	int			error = 0;
> +	struct xfs_ail_cursor	cur;
> +	struct xfs_ail		*ailp;
> +
> +	ailp = log->l_ailp;
> +	spin_lock(&ailp->xa_lock);
> +	lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
> +	while (lip != NULL) {
> +		/*
> +		 * We're done when we see something other than an RUI.
> +		 * There should be no RUIs left in the AIL now.
> +		 */
> +		if (!xlog_item_is_intent(lip)) {
> +#ifdef DEBUG
> +			for (; lip; lip = xfs_trans_ail_cursor_next(ailp, &cur))
> +				ASSERT(!xlog_item_is_intent(lip));
> +#endif
> +			break;
> +		}
> +
> +		/* Skip anything that isn't an RUI */
> +		if (lip->li_type != XFS_LI_RUI) {
> +			lip = xfs_trans_ail_cursor_next(ailp, &cur);
> +			continue;
> +		}
> +
> +		ruip = container_of(lip, struct xfs_rui_log_item, rui_item);
> +
> +		spin_unlock(&ailp->xa_lock);
> +		xfs_rui_release(ruip);
> +		spin_lock(&ailp->xa_lock);
> +
> +		lip = xfs_trans_ail_cursor_next(ailp, &cur);
> +	}
> +
> +	xfs_trans_ail_cursor_done(&cur);
> +	spin_unlock(&ailp->xa_lock);
> +	return error;
> +}

How about we combine this and cancel_efis() into a cancel_intents()
function so we only have to make one pass? It looks like the only
difference is the item-specific release call.
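A combined pass could dispatch on li_type for the release call and share everything else. The sketch below models the idea outside the kernel: the structures, type names, and release functions are simplified stand-ins (not the real xfs_log_item/AIL API), and the AIL is modeled as a singly linked list just to show the one-pass shape.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical item types standing in for XFS_LI_RUI / XFS_LI_EFI. */
enum li_type { LI_RUI, LI_EFI, LI_OTHER };

struct log_item {
	enum li_type		li_type;
	int			released;	/* test hook: did release run? */
	struct log_item		*li_next;
};

static int item_is_intent(struct log_item *lip)
{
	return lip->li_type == LI_RUI || lip->li_type == LI_EFI;
}

/* Stand-ins for xfs_rui_release() / xfs_efi_release(). */
static void rui_release(struct log_item *lip) { lip->released = 1; }
static void efi_release(struct log_item *lip) { lip->released = 1; }

/*
 * One cancellation pass over the AIL that handles every intent type,
 * dispatching to the item-specific release call -- which is the only
 * per-type difference between the two existing cancel functions.
 */
static void cancel_intents(struct log_item *ail)
{
	struct log_item	*lip;

	for (lip = ail; lip != NULL; lip = lip->li_next) {
		/* Intents sort first; stop at the first non-intent item. */
		if (!item_is_intent(lip))
			break;
		switch (lip->li_type) {
		case LI_RUI:
			rui_release(lip);
			break;
		case LI_EFI:
			efi_release(lip);
			break;
		default:
			break;
		}
	}
}
```

The per-type switch is the only part that grows when a new intent type (e.g. for refcount updates later in the series) is added, which is the maintenance win over one walk per type.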

> +
> +/*
>   * This routine performs a transaction to null out a bad inode pointer
>   * in an agi unlinked inode hash bucket.
>   */
> @@ -5144,11 +5458,19 @@ xlog_recover_finish(
>  	 */
>  	if (log->l_flags & XLOG_RECOVERY_NEEDED) {
>  		int	error;
> +
> +		error = xlog_recover_process_ruis(log);
> +		if (error) {
> +			xfs_alert(log->l_mp, "Failed to recover RUIs");
> +			return error;
> +		}
> +
>  		error = xlog_recover_process_efis(log);
>  		if (error) {
>  			xfs_alert(log->l_mp, "Failed to recover EFIs");
>  			return error;
>  		}
> +

Is the order important here in any way (e.g., RUIs before EFIs)? If so,
it might be a good idea to call it out.

>  		/*
>  		 * Sync the log to get all the EFIs out of the AIL.
>  		 * This isn't absolutely necessary, but it helps in
> @@ -5176,9 +5498,15 @@ xlog_recover_cancel(
>  	struct xlog	*log)
>  {
>  	int		error = 0;
> +	int		err2;
>  
> -	if (log->l_flags & XLOG_RECOVERY_NEEDED)
> -		error = xlog_recover_cancel_efis(log);
> +	if (log->l_flags & XLOG_RECOVERY_NEEDED) {
> +		error = xlog_recover_cancel_ruis(log);
> +
> +		err2 = xlog_recover_cancel_efis(log);
> +		if (err2 && !error)
> +			error = err2;
> +	}
>  
>  	return error;
>  }
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index f8d363f..c48be63 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -235,4 +235,21 @@ void		xfs_trans_buf_copy_type(struct xfs_buf *dst_bp,
>  extern kmem_zone_t	*xfs_trans_zone;
>  extern kmem_zone_t	*xfs_log_item_desc_zone;
>  
> +enum xfs_rmap_intent_type;
> +
> +struct xfs_rui_log_item *xfs_trans_get_rui(struct xfs_trans *tp, uint nextents);
> +void xfs_trans_log_start_rmap_update(struct xfs_trans *tp,
> +		struct xfs_rui_log_item *ruip, enum xfs_rmap_intent_type type,
> +		__uint64_t owner, int whichfork, xfs_fileoff_t startoff,
> +		xfs_fsblock_t startblock, xfs_filblks_t blockcount,
> +		xfs_exntst_t state);
> +
> +struct xfs_rud_log_item *xfs_trans_get_rud(struct xfs_trans *tp,
> +		struct xfs_rui_log_item *ruip, uint nextents);
> +int xfs_trans_log_finish_rmap_update(struct xfs_trans *tp,
> +		struct xfs_rud_log_item *rudp, enum xfs_rmap_intent_type type,
> +		__uint64_t owner, int whichfork, xfs_fileoff_t startoff,
> +		xfs_fsblock_t startblock, xfs_filblks_t blockcount,
> +		xfs_exntst_t state);
> +
>  #endif	/* __XFS_TRANS_H__ */
> diff --git a/fs/xfs/xfs_trans_rmap.c b/fs/xfs/xfs_trans_rmap.c
> new file mode 100644
> index 0000000..b55a725
> --- /dev/null
> +++ b/fs/xfs/xfs_trans_rmap.c
> @@ -0,0 +1,235 @@
> +/*
> + * Copyright (C) 2016 Oracle.  All Rights Reserved.
> + *
> + * Author: Darrick J. Wong <darrick.wong@oracle.com>
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version 2
> + * of the License, or (at your option) any later version.
> + *
> + * This program is distributed in the hope that it would be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write the Free Software Foundation,
> + * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
> + */
> +#include "xfs.h"
> +#include "xfs_fs.h"
> +#include "xfs_shared.h"
> +#include "xfs_format.h"
> +#include "xfs_log_format.h"
> +#include "xfs_trans_resv.h"
> +#include "xfs_mount.h"
> +#include "xfs_defer.h"
> +#include "xfs_trans.h"
> +#include "xfs_trans_priv.h"
> +#include "xfs_rmap_item.h"
> +#include "xfs_alloc.h"
> +#include "xfs_rmap_btree.h"
> +
> +/*
> + * This routine is called to allocate an "rmap update intent"
> + * log item that will hold nextents worth of extents.  The
> + * caller must use all nextents extents, because we are not
> + * flexible about this at all.
> + */
> +struct xfs_rui_log_item *
> +xfs_trans_get_rui(
> +	struct xfs_trans		*tp,
> +	uint				nextents)
> +{
> +	struct xfs_rui_log_item		*ruip;
> +
> +	ASSERT(tp != NULL);
> +	ASSERT(nextents > 0);
> +
> +	ruip = xfs_rui_init(tp->t_mountp, nextents);
> +	ASSERT(ruip != NULL);
> +
> +	/*
> +	 * Get a log_item_desc to point at the new item.
> +	 */
> +	xfs_trans_add_item(tp, &ruip->rui_item);
> +	return ruip;
> +}
> +
> +/*
> + * This routine is called to indicate that the described
> + * extent is to be logged as needing to be freed.  It should
> + * be called once for each extent to be freed.
> + */

Stale comment.

> +void
> +xfs_trans_log_start_rmap_update(
> +	struct xfs_trans		*tp,
> +	struct xfs_rui_log_item		*ruip,
> +	enum xfs_rmap_intent_type	type,
> +	__uint64_t			owner,
> +	int				whichfork,
> +	xfs_fileoff_t			startoff,
> +	xfs_fsblock_t			startblock,
> +	xfs_filblks_t			blockcount,
> +	xfs_exntst_t			state)
> +{
> +	uint				next_extent;
> +	struct xfs_map_extent		*rmap;
> +
> +	tp->t_flags |= XFS_TRANS_DIRTY;
> +	ruip->rui_item.li_desc->lid_flags |= XFS_LID_DIRTY;
> +
> +	/*
> +	 * atomic_inc_return gives us the value after the increment;
> +	 * we want to use it as an array index so we need to subtract 1 from
> +	 * it.
> +	 */
> +	next_extent = atomic_inc_return(&ruip->rui_next_extent) - 1;
> +	ASSERT(next_extent < ruip->rui_format.rui_nextents);
> +	rmap = &(ruip->rui_format.rui_extents[next_extent]);
> +	rmap->me_owner = owner;
> +	rmap->me_startblock = startblock;
> +	rmap->me_startoff = startoff;
> +	rmap->me_len = blockcount;
> +	rmap->me_flags = 0;
> +	if (state == XFS_EXT_UNWRITTEN)
> +		rmap->me_flags |= XFS_RMAP_EXTENT_UNWRITTEN;
> +	if (whichfork == XFS_ATTR_FORK)
> +		rmap->me_flags |= XFS_RMAP_EXTENT_ATTR_FORK;
> +	switch (type) {
> +	case XFS_RMAP_MAP:
> +		rmap->me_flags |= XFS_RMAP_EXTENT_MAP;
> +		break;
> +	case XFS_RMAP_MAP_SHARED:
> +		rmap->me_flags |= XFS_RMAP_EXTENT_MAP_SHARED;
> +		break;
> +	case XFS_RMAP_UNMAP:
> +		rmap->me_flags |= XFS_RMAP_EXTENT_UNMAP;
> +		break;
> +	case XFS_RMAP_UNMAP_SHARED:
> +		rmap->me_flags |= XFS_RMAP_EXTENT_UNMAP_SHARED;
> +		break;
> +	case XFS_RMAP_CONVERT:
> +		rmap->me_flags |= XFS_RMAP_EXTENT_CONVERT;
> +		break;
> +	case XFS_RMAP_CONVERT_SHARED:
> +		rmap->me_flags |= XFS_RMAP_EXTENT_CONVERT_SHARED;
> +		break;
> +	case XFS_RMAP_ALLOC:
> +		rmap->me_flags |= XFS_RMAP_EXTENT_ALLOC;
> +		break;
> +	case XFS_RMAP_FREE:
> +		rmap->me_flags |= XFS_RMAP_EXTENT_FREE;
> +		break;
> +	default:
> +		ASSERT(0);
> +	}

Between here and the finish function, it looks like we could use a
helper to convert the state and whatnot to extent flags.

> +}
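The helper suggested above might fold the intent type, fork, and extent state into me_flags in one place, shared by the start and finish paths. This is only a sketch: the enum members mirror the patch but the numeric flag values are illustrative stand-ins (the real XFS_RMAP_EXTENT_* constants live in the log format headers), and the _SHARED variants are omitted for brevity.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-ins for the XFS_RMAP_EXTENT_* flag values. */
#define RMAP_EXTENT_MAP			1u
#define RMAP_EXTENT_UNMAP		3u
#define RMAP_EXTENT_CONVERT		5u
#define RMAP_EXTENT_ALLOC		7u
#define RMAP_EXTENT_FREE		8u
#define RMAP_EXTENT_ATTR_FORK		(1u << 31)
#define RMAP_EXTENT_UNWRITTEN		(1u << 30)

enum rmap_intent_type {
	RMAP_MAP, RMAP_UNMAP, RMAP_CONVERT, RMAP_ALLOC, RMAP_FREE,
};
enum { DATA_FORK, ATTR_FORK };
enum { EXT_NORM, EXT_UNWRITTEN };

/*
 * Hypothetical helper: compute the me_flags word for a map_extent record
 * from the intent type, fork, and extent state, so the start and finish
 * functions don't each carry a copy of the switch.
 */
static uint32_t rmap_extent_flags(enum rmap_intent_type type,
				  int whichfork, int state)
{
	uint32_t	flags = 0;

	switch (type) {
	case RMAP_MAP:		flags = RMAP_EXTENT_MAP; break;
	case RMAP_UNMAP:	flags = RMAP_EXTENT_UNMAP; break;
	case RMAP_CONVERT:	flags = RMAP_EXTENT_CONVERT; break;
	case RMAP_ALLOC:	flags = RMAP_EXTENT_ALLOC; break;
	case RMAP_FREE:		flags = RMAP_EXTENT_FREE; break;
	}
	if (whichfork == ATTR_FORK)
		flags |= RMAP_EXTENT_ATTR_FORK;
	if (state == EXT_UNWRITTEN)
		flags |= RMAP_EXTENT_UNWRITTEN;
	return flags;
}
```

Both xfs_trans_log_start_rmap_update() and xfs_trans_log_finish_rmap_update() would then reduce their duplicated blocks to a single `rmap->me_flags = rmap_extent_flags(type, whichfork, state);`.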
> +
> +
> +/*
> + * This routine is called to allocate an "extent free done"
> + * log item that will hold nextents worth of extents.  The
> + * caller must use all nextents extents, because we are not
> + * flexible about this at all.
> + */

Comment needs updating.

Brian

> +struct xfs_rud_log_item *
> +xfs_trans_get_rud(
> +	struct xfs_trans		*tp,
> +	struct xfs_rui_log_item		*ruip,
> +	uint				nextents)
> +{
> +	struct xfs_rud_log_item		*rudp;
> +
> +	ASSERT(tp != NULL);
> +	ASSERT(nextents > 0);
> +
> +	rudp = xfs_rud_init(tp->t_mountp, ruip, nextents);
> +	ASSERT(rudp != NULL);
> +
> +	/*
> +	 * Get a log_item_desc to point at the new item.
> +	 */
> +	xfs_trans_add_item(tp, &rudp->rud_item);
> +	return rudp;
> +}
> +
> +/*
> + * Finish an rmap update and log it to the RUD. Note that the transaction is
> + * marked dirty regardless of whether the rmap update succeeds or fails to
> + * support the RUI/RUD lifecycle rules.
> + */
> +int
> +xfs_trans_log_finish_rmap_update(
> +	struct xfs_trans		*tp,
> +	struct xfs_rud_log_item		*rudp,
> +	enum xfs_rmap_intent_type	type,
> +	__uint64_t			owner,
> +	int				whichfork,
> +	xfs_fileoff_t			startoff,
> +	xfs_fsblock_t			startblock,
> +	xfs_filblks_t			blockcount,
> +	xfs_exntst_t			state)
> +{
> +	uint				next_extent;
> +	struct xfs_map_extent		*rmap;
> +	int				error;
> +
> +	/* XXX: actually finish the rmap update here */
> +	error = -EFSCORRUPTED;
> +
> +	/*
> +	 * Mark the transaction dirty, even on error. This ensures the
> +	 * transaction is aborted, which:
> +	 *
> +	 * 1.) releases the RUI and frees the RUD
> +	 * 2.) shuts down the filesystem
> +	 */
> +	tp->t_flags |= XFS_TRANS_DIRTY;
> +	rudp->rud_item.li_desc->lid_flags |= XFS_LID_DIRTY;
> +
> +	next_extent = rudp->rud_next_extent;
> +	ASSERT(next_extent < rudp->rud_format.rud_nextents);
> +	rmap = &(rudp->rud_format.rud_extents[next_extent]);
> +	rmap->me_owner = owner;
> +	rmap->me_startblock = startblock;
> +	rmap->me_startoff = startoff;
> +	rmap->me_len = blockcount;
> +	rmap->me_flags = 0;
> +	if (state == XFS_EXT_UNWRITTEN)
> +		rmap->me_flags |= XFS_RMAP_EXTENT_UNWRITTEN;
> +	if (whichfork == XFS_ATTR_FORK)
> +		rmap->me_flags |= XFS_RMAP_EXTENT_ATTR_FORK;
> +	switch (type) {
> +	case XFS_RMAP_MAP:
> +		rmap->me_flags |= XFS_RMAP_EXTENT_MAP;
> +		break;
> +	case XFS_RMAP_MAP_SHARED:
> +		rmap->me_flags |= XFS_RMAP_EXTENT_MAP_SHARED;
> +		break;
> +	case XFS_RMAP_UNMAP:
> +		rmap->me_flags |= XFS_RMAP_EXTENT_UNMAP;
> +		break;
> +	case XFS_RMAP_UNMAP_SHARED:
> +		rmap->me_flags |= XFS_RMAP_EXTENT_UNMAP_SHARED;
> +		break;
> +	case XFS_RMAP_CONVERT:
> +		rmap->me_flags |= XFS_RMAP_EXTENT_CONVERT;
> +		break;
> +	case XFS_RMAP_CONVERT_SHARED:
> +		rmap->me_flags |= XFS_RMAP_EXTENT_CONVERT_SHARED;
> +		break;
> +	case XFS_RMAP_ALLOC:
> +		rmap->me_flags |= XFS_RMAP_EXTENT_ALLOC;
> +		break;
> +	case XFS_RMAP_FREE:
> +		rmap->me_flags |= XFS_RMAP_EXTENT_FREE;
> +		break;
> +	default:
> +		ASSERT(0);
> +	}
> +	rudp->rud_next_extent++;
> +
> +	return error;
> +}
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 043/119] xfs: enable the xfs_defer mechanism to process rmaps to update
  2016-06-17  1:22 ` [PATCH 043/119] xfs: enable the xfs_defer mechanism to process rmaps to update Darrick J. Wong
@ 2016-07-15 18:33   ` Brian Foster
  0 siblings, 0 replies; 236+ messages in thread
From: Brian Foster @ 2016-07-15 18:33 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, vishal.l.verma, xfs

On Thu, Jun 16, 2016 at 06:22:27PM -0700, Darrick J. Wong wrote:
> Connect the xfs_defer mechanism with the pieces that we'll need to
> handle deferred rmap updates.  We'll wire up the existing code to
> our new deferred mechanism later.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

Looks sane, same general comment here as for the bmap stuff way back
when it was hooked up to the defer bits (i.e., tightening up the
interface a bit and pushing stuff down into xfs_trans_extfree.c iirc).

Brian

>  fs/xfs/libxfs/xfs_defer.h |    1 
>  fs/xfs/xfs_defer_item.c   |  124 +++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 125 insertions(+)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
> index 743fc32..920642e62 100644
> --- a/fs/xfs/libxfs/xfs_defer.h
> +++ b/fs/xfs/libxfs/xfs_defer.h
> @@ -51,6 +51,7 @@ struct xfs_defer_pending {
>   * find all the space it needs.
>   */
>  enum xfs_defer_ops_type {
> +	XFS_DEFER_OPS_TYPE_RMAP,
>  	XFS_DEFER_OPS_TYPE_FREE,
>  	XFS_DEFER_OPS_TYPE_MAX,
>  };
> diff --git a/fs/xfs/xfs_defer_item.c b/fs/xfs/xfs_defer_item.c
> index 1c2d556..dbd10fc 100644
> --- a/fs/xfs/xfs_defer_item.c
> +++ b/fs/xfs/xfs_defer_item.c
> @@ -31,6 +31,8 @@
>  #include "xfs_trace.h"
>  #include "xfs_bmap.h"
>  #include "xfs_extfree_item.h"
> +#include "xfs_rmap_btree.h"
> +#include "xfs_rmap_item.h"
>  
>  /* Extent Freeing */
>  
> @@ -136,11 +138,133 @@ const struct xfs_defer_op_type xfs_extent_free_defer_type = {
>  	.cancel_item	= xfs_bmap_free_cancel_item,
>  };
>  
> +/* Reverse Mapping */
> +
> +/* Sort rmap intents by AG. */
> +static int
> +xfs_rmap_update_diff_items(
> +	void				*priv,
> +	struct list_head		*a,
> +	struct list_head		*b)
> +{
> +	struct xfs_mount		*mp = priv;
> +	struct xfs_rmap_intent		*ra;
> +	struct xfs_rmap_intent		*rb;
> +
> +	ra = container_of(a, struct xfs_rmap_intent, ri_list);
> +	rb = container_of(b, struct xfs_rmap_intent, ri_list);
> +	return  XFS_FSB_TO_AGNO(mp, ra->ri_bmap.br_startblock) -
> +		XFS_FSB_TO_AGNO(mp, rb->ri_bmap.br_startblock);
> +}
> +
> +/* Get an RUI. */
> +STATIC void *
> +xfs_rmap_update_create_intent(
> +	struct xfs_trans		*tp,
> +	unsigned int			count)
> +{
> +	return xfs_trans_get_rui(tp, count);
> +}
> +
> +/* Log rmap updates in the intent item. */
> +STATIC void
> +xfs_rmap_update_log_item(
> +	struct xfs_trans		*tp,
> +	void				*intent,
> +	struct list_head		*item)
> +{
> +	struct xfs_rmap_intent		*rmap;
> +
> +	rmap = container_of(item, struct xfs_rmap_intent, ri_list);
> +	xfs_trans_log_start_rmap_update(tp, intent, rmap->ri_type,
> +			rmap->ri_owner, rmap->ri_whichfork,
> +			rmap->ri_bmap.br_startoff,
> +			rmap->ri_bmap.br_startblock,
> +			rmap->ri_bmap.br_blockcount,
> +			rmap->ri_bmap.br_state);
> +}
> +
> +/* Get an RUD so we can process all the deferred rmap updates. */
> +STATIC void *
> +xfs_rmap_update_create_done(
> +	struct xfs_trans		*tp,
> +	void				*intent,
> +	unsigned int			count)
> +{
> +	return xfs_trans_get_rud(tp, intent, count);
> +}
> +
> +/* Process a deferred rmap update. */
> +STATIC int
> +xfs_rmap_update_finish_item(
> +	struct xfs_trans		*tp,
> +	struct xfs_defer_ops		*dop,
> +	struct list_head		*item,
> +	void				*done_item,
> +	void				**state)
> +{
> +	struct xfs_rmap_intent		*rmap;
> +	int				error;
> +
> +	rmap = container_of(item, struct xfs_rmap_intent, ri_list);
> +	error = xfs_trans_log_finish_rmap_update(tp, done_item,
> +			rmap->ri_type,
> +			rmap->ri_owner, rmap->ri_whichfork,
> +			rmap->ri_bmap.br_startoff,
> +			rmap->ri_bmap.br_startblock,
> +			rmap->ri_bmap.br_blockcount,
> +			rmap->ri_bmap.br_state);
> +	kmem_free(rmap);
> +	return error;
> +}
> +
> +/* Clean up after processing deferred rmaps. */
> +STATIC void
> +xfs_rmap_update_finish_cleanup(
> +	struct xfs_trans	*tp,
> +	void			*state,
> +	int			error)
> +{
> +}
> +
> +/* Abort all pending RUIs. */
> +STATIC void
> +xfs_rmap_update_abort_intent(
> +	void				*intent)
> +{
> +	xfs_rui_release(intent);
> +}
> +
> +/* Cancel a deferred rmap update. */
> +STATIC void
> +xfs_rmap_update_cancel_item(
> +	struct list_head		*item)
> +{
> +	struct xfs_rmap_intent		*rmap;
> +
> +	rmap = container_of(item, struct xfs_rmap_intent, ri_list);
> +	kmem_free(rmap);
> +}
> +
> +const struct xfs_defer_op_type xfs_rmap_update_defer_type = {
> +	.type		= XFS_DEFER_OPS_TYPE_RMAP,
> +	.max_items	= XFS_RUI_MAX_FAST_EXTENTS,
> +	.diff_items	= xfs_rmap_update_diff_items,
> +	.create_intent	= xfs_rmap_update_create_intent,
> +	.abort_intent	= xfs_rmap_update_abort_intent,
> +	.log_item	= xfs_rmap_update_log_item,
> +	.create_done	= xfs_rmap_update_create_done,
> +	.finish_item	= xfs_rmap_update_finish_item,
> +	.finish_cleanup = xfs_rmap_update_finish_cleanup,
> +	.cancel_item	= xfs_rmap_update_cancel_item,
> +};
> +
>  /* Deferred Item Initialization */
>  
>  /* Initialize the deferred operation types. */
>  void
>  xfs_defer_init_types(void)
>  {
> +	xfs_defer_init_op_type(&xfs_rmap_update_defer_type);
>  	xfs_defer_init_op_type(&xfs_extent_free_defer_type);
>  }
> 


* Re: [PATCH 044/119] xfs: propagate bmap updates to rmapbt
  2016-06-17  1:22 ` [PATCH 044/119] xfs: propagate bmap updates to rmapbt Darrick J. Wong
@ 2016-07-15 18:33   ` Brian Foster
  2016-07-16  7:26     ` Darrick J. Wong
  0 siblings, 1 reply; 236+ messages in thread
From: Brian Foster @ 2016-07-15 18:33 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, vishal.l.verma, xfs

On Thu, Jun 16, 2016 at 06:22:34PM -0700, Darrick J. Wong wrote:
> When we map, unmap, or convert an extent in a file's data or attr
> fork, schedule a respective update in the rmapbt.  Previous versions
> of this patch required a 1:1 correspondence between bmap and rmap,
> but this is no longer true.
> 
> v2: Remove the 1:1 correspondence requirement now that we have the
> ability to make interval queries against the rmapbt.  Update the
> commit message to reflect the broad restructuring of this patch.
> Fix the bmap shift code to adjust the rmaps correctly.
> 
> v3: Use the deferred operations code to handle redo operations
> atomically and deadlock free.  Plumb in all five rmap actions
> (map, unmap, convert extent, alloc, free); we'll use the first
> three now for file data, and reflink will want the last two.
> Add an error injection site to test log recovery.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_bmap.c       |   56 ++++++++-
>  fs/xfs/libxfs/xfs_rmap.c       |  252 ++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/libxfs/xfs_rmap_btree.h |   24 ++++
>  fs/xfs/xfs_bmap_util.c         |    1 
>  fs/xfs/xfs_defer_item.c        |    6 +
>  fs/xfs/xfs_error.h             |    4 -
>  fs/xfs/xfs_log_recover.c       |   56 +++++++++
>  fs/xfs/xfs_trans.h             |    3 
>  fs/xfs/xfs_trans_rmap.c        |    7 +
>  9 files changed, 393 insertions(+), 16 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 61c0231..507fd74 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -46,6 +46,7 @@
>  #include "xfs_symlink.h"
>  #include "xfs_attr_leaf.h"
>  #include "xfs_filestream.h"
> +#include "xfs_rmap_btree.h"
>  
>  
>  kmem_zone_t		*xfs_bmap_free_item_zone;
> @@ -2178,6 +2179,11 @@ xfs_bmap_add_extent_delay_real(
>  		ASSERT(0);
>  	}
>  
> +	/* add reverse mapping */
> +	error = xfs_rmap_map_extent(mp, bma->dfops, bma->ip, whichfork, new);
> +	if (error)
> +		goto done;
> +
>  	/* convert to a btree if necessary */
>  	if (xfs_bmap_needs_btree(bma->ip, whichfork)) {
>  		int	tmp_logflags;	/* partial log flag return val */
> @@ -2714,6 +2720,11 @@ xfs_bmap_add_extent_unwritten_real(
>  		ASSERT(0);
>  	}
>  
> +	/* update reverse mappings */
> +	error = xfs_rmap_convert_extent(mp, dfops, ip, XFS_DATA_FORK, new);
> +	if (error)
> +		goto done;
> +
>  	/* convert to a btree if necessary */
>  	if (xfs_bmap_needs_btree(ip, XFS_DATA_FORK)) {
>  		int	tmp_logflags;	/* partial log flag return val */
> @@ -3106,6 +3117,11 @@ xfs_bmap_add_extent_hole_real(
>  		break;
>  	}
>  
> +	/* add reverse mapping */
> +	error = xfs_rmap_map_extent(mp, bma->dfops, bma->ip, whichfork, new);
> +	if (error)
> +		goto done;
> +
>  	/* convert to a btree if necessary */
>  	if (xfs_bmap_needs_btree(bma->ip, whichfork)) {
>  		int	tmp_logflags;	/* partial log flag return val */
> @@ -5032,6 +5048,14 @@ xfs_bmap_del_extent(
>  		++*idx;
>  		break;
>  	}
> +
> +	/* remove reverse mapping */
> +	if (!delay) {
> +		error = xfs_rmap_unmap_extent(mp, dfops, ip, whichfork, del);
> +		if (error)
> +			goto done;
> +	}
> +
>  	/*
>  	 * If we need to, add to list of extents to delete.
>  	 */
> @@ -5569,7 +5593,8 @@ xfs_bmse_shift_one(
>  	struct xfs_bmbt_rec_host	*gotp,
>  	struct xfs_btree_cur		*cur,
>  	int				*logflags,
> -	enum shift_direction		direction)
> +	enum shift_direction		direction,
> +	struct xfs_defer_ops		*dfops)
>  {
>  	struct xfs_ifork		*ifp;
>  	struct xfs_mount		*mp;
> @@ -5617,9 +5642,13 @@ xfs_bmse_shift_one(
>  		/* check whether to merge the extent or shift it down */
>  		if (xfs_bmse_can_merge(&adj_irec, &got,
>  				       offset_shift_fsb)) {
> -			return xfs_bmse_merge(ip, whichfork, offset_shift_fsb,
> -					      *current_ext, gotp, adj_irecp,
> -					      cur, logflags);
> +			error = xfs_bmse_merge(ip, whichfork, offset_shift_fsb,
> +					       *current_ext, gotp, adj_irecp,
> +					       cur, logflags);
> +			if (error)
> +				return error;
> +			adj_irec = got;
> +			goto update_rmap;
>  		}
>  	} else {
>  		startoff = got.br_startoff + offset_shift_fsb;
> @@ -5656,9 +5685,10 @@ update_current_ext:
>  		(*current_ext)--;
>  	xfs_bmbt_set_startoff(gotp, startoff);
>  	*logflags |= XFS_ILOG_CORE;
> +	adj_irec = got;
>  	if (!cur) {
>  		*logflags |= XFS_ILOG_DEXT;
> -		return 0;
> +		goto update_rmap;
>  	}
>  
>  	error = xfs_bmbt_lookup_eq(cur, got.br_startoff, got.br_startblock,
> @@ -5668,8 +5698,18 @@ update_current_ext:
>  	XFS_WANT_CORRUPTED_RETURN(mp, i == 1);
>  
>  	got.br_startoff = startoff;
> -	return xfs_bmbt_update(cur, got.br_startoff, got.br_startblock,
> -			       got.br_blockcount, got.br_state);
> +	error = xfs_bmbt_update(cur, got.br_startoff, got.br_startblock,
> +			got.br_blockcount, got.br_state);
> +	if (error)
> +		return error;
> +
> +update_rmap:
> +	/* update reverse mapping */
> +	error = xfs_rmap_unmap_extent(mp, dfops, ip, whichfork, &adj_irec);
> +	if (error)
> +		return error;
> +	adj_irec.br_startoff = startoff;
> +	return xfs_rmap_map_extent(mp, dfops, ip, whichfork, &adj_irec);
>  }
>  
>  /*
> @@ -5797,7 +5837,7 @@ xfs_bmap_shift_extents(
>  	while (nexts++ < num_exts) {
>  		error = xfs_bmse_shift_one(ip, whichfork, offset_shift_fsb,
>  					   &current_ext, gotp, cur, &logflags,
> -					   direction);
> +					   direction, dfops);
>  		if (error)
>  			goto del_cursor;
>  		/*
> diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
> index 76fc5c2..f179ea4 100644
> --- a/fs/xfs/libxfs/xfs_rmap.c
> +++ b/fs/xfs/libxfs/xfs_rmap.c
> @@ -36,6 +36,8 @@
>  #include "xfs_trace.h"
>  #include "xfs_error.h"
>  #include "xfs_extent_busy.h"
> +#include "xfs_bmap.h"
> +#include "xfs_inode.h"
>  
>  /*
>   * Lookup the first record less than or equal to [bno, len, owner, offset]
> @@ -1212,3 +1214,253 @@ xfs_rmapbt_query_range(
>  	return xfs_btree_query_range(cur, &low_brec, &high_brec,
>  			xfs_rmapbt_query_range_helper, &query);
>  }
> +
> +/* Clean up after calling xfs_rmap_finish_one. */
> +void
> +xfs_rmap_finish_one_cleanup(
> +	struct xfs_trans	*tp,
> +	struct xfs_btree_cur	*rcur,
> +	int			error)
> +{
> +	struct xfs_buf		*agbp;
> +
> +	if (rcur == NULL)
> +		return;
> +	agbp = rcur->bc_private.a.agbp;
> +	xfs_btree_del_cursor(rcur, error ? XFS_BTREE_ERROR : XFS_BTREE_NOERROR);
> +	xfs_trans_brelse(tp, agbp);

Why unconditionally release the agbp (and not just on error)?

> +}
> +
> +/*
> + * Process one of the deferred rmap operations.  We pass back the
> + * btree cursor to maintain our lock on the rmapbt between calls.
> + * This saves time and eliminates a buffer deadlock between the
> + * superblock and the AGF because we'll always grab them in the same
> + * order.
> + */
> +int
> +xfs_rmap_finish_one(
> +	struct xfs_trans		*tp,
> +	enum xfs_rmap_intent_type	type,
> +	__uint64_t			owner,
> +	int				whichfork,
> +	xfs_fileoff_t			startoff,
> +	xfs_fsblock_t			startblock,
> +	xfs_filblks_t			blockcount,
> +	xfs_exntst_t			state,
> +	struct xfs_btree_cur		**pcur)
> +{
> +	struct xfs_mount		*mp = tp->t_mountp;
> +	struct xfs_btree_cur		*rcur;
> +	struct xfs_buf			*agbp = NULL;
> +	int				error = 0;
> +	xfs_agnumber_t			agno;
> +	struct xfs_owner_info		oinfo;
> +	xfs_agblock_t			bno;
> +	bool				unwritten;
> +
> +	agno = XFS_FSB_TO_AGNO(mp, startblock);
> +	ASSERT(agno != NULLAGNUMBER);
> +	bno = XFS_FSB_TO_AGBNO(mp, startblock);
> +
> +	trace_xfs_rmap_deferred(mp, agno, type, bno, owner, whichfork,
> +			startoff, blockcount, state);
> +
> +	if (XFS_TEST_ERROR(false, mp,
> +			XFS_ERRTAG_RMAP_FINISH_ONE,
> +			XFS_RANDOM_RMAP_FINISH_ONE))
> +		return -EIO;
> +
> +	/*
> +	 * If we haven't gotten a cursor or the cursor AG doesn't match
> +	 * the startblock, get one now.
> +	 */
> +	rcur = *pcur;
> +	if (rcur != NULL && rcur->bc_private.a.agno != agno) {
> +		xfs_rmap_finish_one_cleanup(tp, rcur, 0);
> +		rcur = NULL;
> +		*pcur = NULL;
> +	}
> +	if (rcur == NULL) {
> +		error = xfs_free_extent_fix_freelist(tp, agno, &agbp);

Comment? Why is this here? (Maybe we should rename that function while
we're at it..)

> +		if (error)
> +			return error;
> +		if (!agbp)
> +			return -EFSCORRUPTED;
> +
> +		rcur = xfs_rmapbt_init_cursor(mp, tp, agbp, agno);
> +		if (!rcur) {
> +			error = -ENOMEM;
> +			goto out_cur;
> +		}
> +	}
> +	*pcur = rcur;
> +
> +	xfs_rmap_ino_owner(&oinfo, owner, whichfork, startoff);
> +	unwritten = state == XFS_EXT_UNWRITTEN;
> +	bno = XFS_FSB_TO_AGBNO(rcur->bc_mp, startblock);
> +
> +	switch (type) {
> +	case XFS_RMAP_MAP:
> +		error = xfs_rmap_map(rcur, bno, blockcount, unwritten, &oinfo);
> +		break;
> +	case XFS_RMAP_UNMAP:
> +		error = xfs_rmap_unmap(rcur, bno, blockcount, unwritten,
> +				&oinfo);
> +		break;
> +	case XFS_RMAP_CONVERT:
> +		error = xfs_rmap_convert(rcur, bno, blockcount, !unwritten,
> +				&oinfo);
> +		break;
> +	case XFS_RMAP_ALLOC:
> +		error = __xfs_rmap_alloc(rcur, bno, blockcount, unwritten,
> +				&oinfo);
> +		break;
> +	case XFS_RMAP_FREE:
> +		error = __xfs_rmap_free(rcur, bno, blockcount, unwritten,
> +				&oinfo);
> +		break;
> +	default:
> +		ASSERT(0);
> +		error = -EFSCORRUPTED;
> +	}
> +	return error;
> +
> +out_cur:
> +	xfs_trans_brelse(tp, agbp);
> +
> +	return error;
> +}
> +
> +/*
> + * Record a rmap intent; the list is kept sorted first by AG and then by
> + * increasing age.
> + */
> +static int
> +__xfs_rmap_add(
> +	struct xfs_mount	*mp,
> +	struct xfs_defer_ops	*dfops,
> +	struct xfs_rmap_intent	*ri)
> +{
> +	struct xfs_rmap_intent	*new;
> +
> +	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
> +		return 0;
> +
> +	trace_xfs_rmap_defer(mp, XFS_FSB_TO_AGNO(mp, ri->ri_bmap.br_startblock),
> +			ri->ri_type,
> +			XFS_FSB_TO_AGBNO(mp, ri->ri_bmap.br_startblock),
> +			ri->ri_owner, ri->ri_whichfork,
> +			ri->ri_bmap.br_startoff,
> +			ri->ri_bmap.br_blockcount,
> +			ri->ri_bmap.br_state);
> +
> +	new = kmem_zalloc(sizeof(struct xfs_rmap_intent), KM_SLEEP | KM_NOFS);
> +	*new = *ri;
> +
> +	xfs_defer_add(dfops, XFS_DEFER_OPS_TYPE_RMAP, &new->ri_list);
> +	return 0;
> +}
> +
> +/* Map an extent into a file. */
> +int
> +xfs_rmap_map_extent(
> +	struct xfs_mount	*mp,
> +	struct xfs_defer_ops	*dfops,
> +	struct xfs_inode	*ip,
> +	int			whichfork,
> +	struct xfs_bmbt_irec	*PREV)
> +{
> +	struct xfs_rmap_intent	ri;
> +
> +	ri.ri_type = XFS_RMAP_MAP;
> +	ri.ri_owner = ip->i_ino;
> +	ri.ri_whichfork = whichfork;
> +	ri.ri_bmap = *PREV;
> +

I think we should probably initialize ri_list as well (maybe turn this
into an xfs_rmap_init helper).

Also, for some reason it feels to me like the _hasrmapbt() feature check
should be up at this level (or higher), rather than buried in
__xfs_rmap_add(). I don't feel too strongly about that if others think
differently, however.
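An init helper along the lines suggested above could be the one place that fills every field, including ri_list, so no caller can forget it. The sketch below uses simplified stand-in structures (not the real kernel xfs_rmap_intent/list_head definitions), and the helper name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-ins for the kernel structures. */
struct bmbt_irec {
	uint64_t	br_startoff;
	uint64_t	br_startblock;
	uint64_t	br_blockcount;
	int		br_state;
};
struct list_head {
	struct list_head	*next, *prev;
};
struct rmap_intent {
	struct list_head	ri_list;
	int			ri_type;
	uint64_t		ri_owner;
	int			ri_whichfork;
	struct bmbt_irec	ri_bmap;
};

/* Point an empty list at itself, like the kernel's INIT_LIST_HEAD(). */
static void init_list_head(struct list_head *l)
{
	l->next = l->prev = l;
}

/*
 * Hypothetical xfs_rmap_init()-style helper: initialize every field of
 * an on-stack intent, ri_list included, before it is handed to
 * __xfs_rmap_add().
 */
static void rmap_intent_init(struct rmap_intent *ri, int type,
			     uint64_t owner, int whichfork,
			     const struct bmbt_irec *irec)
{
	init_list_head(&ri->ri_list);
	ri->ri_type = type;
	ri->ri_owner = owner;
	ri->ri_whichfork = whichfork;
	ri->ri_bmap = *irec;
}
```

Each of xfs_rmap_map_extent(), xfs_rmap_unmap_extent(), and the rest would then be a one-line init call plus the __xfs_rmap_add().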

> +	return __xfs_rmap_add(mp, dfops, &ri);
> +}
> +
> +/* Unmap an extent out of a file. */
> +int
> +xfs_rmap_unmap_extent(
> +	struct xfs_mount	*mp,
> +	struct xfs_defer_ops	*dfops,
> +	struct xfs_inode	*ip,
> +	int			whichfork,
> +	struct xfs_bmbt_irec	*PREV)
> +{
> +	struct xfs_rmap_intent	ri;
> +
> +	ri.ri_type = XFS_RMAP_UNMAP;
> +	ri.ri_owner = ip->i_ino;
> +	ri.ri_whichfork = whichfork;
> +	ri.ri_bmap = *PREV;
> +
> +	return __xfs_rmap_add(mp, dfops, &ri);
> +}
> +
> +/* Convert a data fork extent from unwritten to real or vice versa. */
> +int
> +xfs_rmap_convert_extent(
> +	struct xfs_mount	*mp,
> +	struct xfs_defer_ops	*dfops,
> +	struct xfs_inode	*ip,
> +	int			whichfork,
> +	struct xfs_bmbt_irec	*PREV)
> +{
> +	struct xfs_rmap_intent	ri;
> +
> +	ri.ri_type = XFS_RMAP_CONVERT;
> +	ri.ri_owner = ip->i_ino;
> +	ri.ri_whichfork = whichfork;
> +	ri.ri_bmap = *PREV;
> +
> +	return __xfs_rmap_add(mp, dfops, &ri);
> +}
> +
> +/* Schedule the creation of an rmap for non-file data. */
> +int
> +xfs_rmap_alloc_defer(

xfs_rmap_[alloc|free]_extent() like the others..?

Brian 

> +	struct xfs_mount	*mp,
> +	struct xfs_defer_ops	*dfops,
> +	xfs_agnumber_t		agno,
> +	xfs_agblock_t		bno,
> +	xfs_extlen_t		len,
> +	__uint64_t		owner)
> +{
> +	struct xfs_rmap_intent	ri;
> +
> +	ri.ri_type = XFS_RMAP_ALLOC;
> +	ri.ri_owner = owner;
> +	ri.ri_whichfork = XFS_DATA_FORK;
> +	ri.ri_bmap.br_startblock = XFS_AGB_TO_FSB(mp, agno, bno);
> +	ri.ri_bmap.br_blockcount = len;
> +	ri.ri_bmap.br_startoff = 0;
> +	ri.ri_bmap.br_state = XFS_EXT_NORM;
> +
> +	return __xfs_rmap_add(mp, dfops, &ri);
> +}
> +
> +/* Schedule the deletion of an rmap for non-file data. */
> +int
> +xfs_rmap_free_defer(
> +	struct xfs_mount	*mp,
> +	struct xfs_defer_ops	*dfops,
> +	xfs_agnumber_t		agno,
> +	xfs_agblock_t		bno,
> +	xfs_extlen_t		len,
> +	__uint64_t		owner)
> +{
> +	struct xfs_rmap_intent	ri;
> +
> +	ri.ri_type = XFS_RMAP_FREE;
> +	ri.ri_owner = owner;
> +	ri.ri_whichfork = XFS_DATA_FORK;
> +	ri.ri_bmap.br_startblock = XFS_AGB_TO_FSB(mp, agno, bno);
> +	ri.ri_bmap.br_blockcount = len;
> +	ri.ri_bmap.br_startoff = 0;
> +	ri.ri_bmap.br_state = XFS_EXT_NORM;
> +
> +	return __xfs_rmap_add(mp, dfops, &ri);
> +}
> diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
> index aff60dc..5df406e 100644
> --- a/fs/xfs/libxfs/xfs_rmap_btree.h
> +++ b/fs/xfs/libxfs/xfs_rmap_btree.h
> @@ -106,4 +106,28 @@ struct xfs_rmap_intent {
>  	struct xfs_bmbt_irec			ri_bmap;
>  };
>  
> +/* functions for updating the rmapbt based on bmbt map/unmap operations */
> +int xfs_rmap_map_extent(struct xfs_mount *mp, struct xfs_defer_ops *dfops,
> +		struct xfs_inode *ip, int whichfork,
> +		struct xfs_bmbt_irec *imap);
> +int xfs_rmap_unmap_extent(struct xfs_mount *mp, struct xfs_defer_ops *dfops,
> +		struct xfs_inode *ip, int whichfork,
> +		struct xfs_bmbt_irec *imap);
> +int xfs_rmap_convert_extent(struct xfs_mount *mp, struct xfs_defer_ops *dfops,
> +		struct xfs_inode *ip, int whichfork,
> +		struct xfs_bmbt_irec *imap);
> +int xfs_rmap_alloc_defer(struct xfs_mount *mp, struct xfs_defer_ops *dfops,
> +		xfs_agnumber_t agno, xfs_agblock_t bno, xfs_extlen_t len,
> +		__uint64_t owner);
> +int xfs_rmap_free_defer(struct xfs_mount *mp, struct xfs_defer_ops *dfops,
> +		xfs_agnumber_t agno, xfs_agblock_t bno, xfs_extlen_t len,
> +		__uint64_t owner);
> +
> +void xfs_rmap_finish_one_cleanup(struct xfs_trans *tp,
> +		struct xfs_btree_cur *rcur, int error);
> +int xfs_rmap_finish_one(struct xfs_trans *tp, enum xfs_rmap_intent_type type,
> +		__uint64_t owner, int whichfork, xfs_fileoff_t startoff,
> +		xfs_fsblock_t startblock, xfs_filblks_t blockcount,
> +		xfs_exntst_t state, struct xfs_btree_cur **pcur);
> +
>  #endif	/* __XFS_RMAP_BTREE_H__ */
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index 62d194e..450fd49 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -41,6 +41,7 @@
>  #include "xfs_trace.h"
>  #include "xfs_icache.h"
>  #include "xfs_log.h"
> +#include "xfs_rmap_btree.h"
>  
>  /* Kernel only BMAP related definitions and functions */
>  
> diff --git a/fs/xfs/xfs_defer_item.c b/fs/xfs/xfs_defer_item.c
> index dbd10fc..9ed060d 100644
> --- a/fs/xfs/xfs_defer_item.c
> +++ b/fs/xfs/xfs_defer_item.c
> @@ -213,7 +213,8 @@ xfs_rmap_update_finish_item(
>  			rmap->ri_bmap.br_startoff,
>  			rmap->ri_bmap.br_startblock,
>  			rmap->ri_bmap.br_blockcount,
> -			rmap->ri_bmap.br_state);
> +			rmap->ri_bmap.br_state,
> +			(struct xfs_btree_cur **)state);
>  	kmem_free(rmap);
>  	return error;
>  }
> @@ -225,6 +226,9 @@ xfs_rmap_update_finish_cleanup(
>  	void			*state,
>  	int			error)
>  {
> +	struct xfs_btree_cur	*rcur = state;
> +
> +	xfs_rmap_finish_one_cleanup(tp, rcur, error);
>  }
>  
>  /* Abort all pending RUIs. */
> diff --git a/fs/xfs/xfs_error.h b/fs/xfs/xfs_error.h
> index ee4680e..6bc614c 100644
> --- a/fs/xfs/xfs_error.h
> +++ b/fs/xfs/xfs_error.h
> @@ -91,7 +91,8 @@ extern void xfs_verifier_error(struct xfs_buf *bp);
>  #define XFS_ERRTAG_DIOWRITE_IOERR			20
>  #define XFS_ERRTAG_BMAPIFORMAT				21
>  #define XFS_ERRTAG_FREE_EXTENT				22
> -#define XFS_ERRTAG_MAX					23
> +#define XFS_ERRTAG_RMAP_FINISH_ONE			23
> +#define XFS_ERRTAG_MAX					24
>  
>  /*
>   * Random factors for above tags, 1 means always, 2 means 1/2 time, etc.
> @@ -119,6 +120,7 @@ extern void xfs_verifier_error(struct xfs_buf *bp);
>  #define XFS_RANDOM_DIOWRITE_IOERR			(XFS_RANDOM_DEFAULT/10)
>  #define	XFS_RANDOM_BMAPIFORMAT				XFS_RANDOM_DEFAULT
>  #define XFS_RANDOM_FREE_EXTENT				1
> +#define XFS_RANDOM_RMAP_FINISH_ONE			1
>  
>  #ifdef DEBUG
>  extern int xfs_error_test_active;
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index c9fe0c4..f7f9635 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -45,6 +45,7 @@
>  #include "xfs_error.h"
>  #include "xfs_dir2.h"
>  #include "xfs_rmap_item.h"
> +#include "xfs_rmap_btree.h"
>  
>  #define BLK_AVG(blk1, blk2)	((blk1+blk2) >> 1)
>  
> @@ -4486,6 +4487,12 @@ xlog_recover_process_rui(
>  	struct xfs_map_extent		*rmap;
>  	xfs_fsblock_t			startblock_fsb;
>  	bool				op_ok;
> +	struct xfs_rud_log_item		*rudp;
> +	enum xfs_rmap_intent_type	type;
> +	int				whichfork;
> +	xfs_exntst_t			state;
> +	struct xfs_trans		*tp;
> +	struct xfs_btree_cur		*rcur = NULL;
>  
>  	ASSERT(!test_bit(XFS_RUI_RECOVERED, &ruip->rui_flags));
>  
> @@ -4528,9 +4535,54 @@ xlog_recover_process_rui(
>  		}
>  	}
>  
> -	/* XXX: do nothing for now */
> +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
> +	if (error)
> +		return error;
> +	rudp = xfs_trans_get_rud(tp, ruip, ruip->rui_format.rui_nextents);
> +
> +	for (i = 0; i < ruip->rui_format.rui_nextents; i++) {
> +		rmap = &(ruip->rui_format.rui_extents[i]);
> +		state = (rmap->me_flags & XFS_RMAP_EXTENT_UNWRITTEN) ?
> +				XFS_EXT_UNWRITTEN : XFS_EXT_NORM;
> +		whichfork = (rmap->me_flags & XFS_RMAP_EXTENT_ATTR_FORK) ?
> +				XFS_ATTR_FORK : XFS_DATA_FORK;
> +		switch (rmap->me_flags & XFS_RMAP_EXTENT_TYPE_MASK) {
> +		case XFS_RMAP_EXTENT_MAP:
> +			type = XFS_RMAP_MAP;
> +			break;
> +		case XFS_RMAP_EXTENT_UNMAP:
> +			type = XFS_RMAP_UNMAP;
> +			break;
> +		case XFS_RMAP_EXTENT_CONVERT:
> +			type = XFS_RMAP_CONVERT;
> +			break;
> +		case XFS_RMAP_EXTENT_ALLOC:
> +			type = XFS_RMAP_ALLOC;
> +			break;
> +		case XFS_RMAP_EXTENT_FREE:
> +			type = XFS_RMAP_FREE;
> +			break;
> +		default:
> +			error = -EFSCORRUPTED;
> +			goto abort_error;
> +		}
> +		error = xfs_trans_log_finish_rmap_update(tp, rudp, type,
> +				rmap->me_owner, whichfork,
> +				rmap->me_startoff, rmap->me_startblock,
> +				rmap->me_len, state, &rcur);
> +		if (error)
> +			goto abort_error;
> +
> +	}
> +
> +	xfs_rmap_finish_one_cleanup(tp, rcur, error);
>  	set_bit(XFS_RUI_RECOVERED, &ruip->rui_flags);
> -	xfs_rui_release(ruip);
> +	error = xfs_trans_commit(tp);
> +	return error;
> +
> +abort_error:
> +	xfs_rmap_finish_one_cleanup(tp, rcur, error);
> +	xfs_trans_cancel(tp);
>  	return error;
>  }
>  
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index c48be63..f59d934 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -244,12 +244,13 @@ void xfs_trans_log_start_rmap_update(struct xfs_trans *tp,
>  		xfs_fsblock_t startblock, xfs_filblks_t blockcount,
>  		xfs_exntst_t state);
>  
> +struct xfs_btree_cur;
>  struct xfs_rud_log_item *xfs_trans_get_rud(struct xfs_trans *tp,
>  		struct xfs_rui_log_item *ruip, uint nextents);
>  int xfs_trans_log_finish_rmap_update(struct xfs_trans *tp,
>  		struct xfs_rud_log_item *rudp, enum xfs_rmap_intent_type type,
>  		__uint64_t owner, int whichfork, xfs_fileoff_t startoff,
>  		xfs_fsblock_t startblock, xfs_filblks_t blockcount,
> -		xfs_exntst_t state);
> +		xfs_exntst_t state, struct xfs_btree_cur **pcur);
>  
>  #endif	/* __XFS_TRANS_H__ */
> diff --git a/fs/xfs/xfs_trans_rmap.c b/fs/xfs/xfs_trans_rmap.c
> index b55a725..0c0df18 100644
> --- a/fs/xfs/xfs_trans_rmap.c
> +++ b/fs/xfs/xfs_trans_rmap.c
> @@ -170,14 +170,15 @@ xfs_trans_log_finish_rmap_update(
>  	xfs_fileoff_t			startoff,
>  	xfs_fsblock_t			startblock,
>  	xfs_filblks_t			blockcount,
> -	xfs_exntst_t			state)
> +	xfs_exntst_t			state,
> +	struct xfs_btree_cur		**pcur)
>  {
>  	uint				next_extent;
>  	struct xfs_map_extent		*rmap;
>  	int				error;
>  
> -	/* XXX: actually finish the rmap update here */
> -	error = -EFSCORRUPTED;
> +	error = xfs_rmap_finish_one(tp, type, owner, whichfork, startoff,
> +			startblock, blockcount, state, pcur);
>  
>  	/*
>  	 * Mark the transaction dirty, even on error. This ensures the
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 041/119] xfs: create rmap update intent log items
  2016-07-15 18:33   ` Brian Foster
@ 2016-07-16  7:10     ` Darrick J. Wong
  0 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-07-16  7:10 UTC (permalink / raw)
  To: Brian Foster; +Cc: david, linux-fsdevel, vishal.l.verma, xfs

On Fri, Jul 15, 2016 at 02:33:41PM -0400, Brian Foster wrote:
> On Thu, Jun 16, 2016 at 06:22:14PM -0700, Darrick J. Wong wrote:
> > Create rmap update intent/done log items to record redo information in
> > the log.  Because we need to roll transactions between updating the
> > bmbt mapping and updating the reverse mapping, we also have to track
> > the status of the metadata updates that will be recorded in the
> > post-roll transactions, just in case we crash before committing the
> > final transaction.  This mechanism enables log recovery to finish what
> > was already started.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> 
> A couple nits below, otherwise looks good:
> 
> Reviewed-by: Brian Foster <bfoster@redhat.com>
> 
> >  fs/xfs/Makefile                |    1 
> >  fs/xfs/libxfs/xfs_log_format.h |   67 ++++++
> >  fs/xfs/libxfs/xfs_rmap_btree.h |   19 ++
> >  fs/xfs/xfs_rmap_item.c         |  459 ++++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/xfs_rmap_item.h         |  100 +++++++++
> >  fs/xfs/xfs_super.c             |   21 ++
> >  6 files changed, 665 insertions(+), 2 deletions(-)
> >  create mode 100644 fs/xfs/xfs_rmap_item.c
> >  create mode 100644 fs/xfs/xfs_rmap_item.h
> > 
> > 
> > diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
> > index 2de8c20..8ae0a10 100644
> > --- a/fs/xfs/Makefile
> > +++ b/fs/xfs/Makefile
> > @@ -104,6 +104,7 @@ xfs-y				+= xfs_log.o \
> >  				   xfs_extfree_item.o \
> >  				   xfs_icreate_item.o \
> >  				   xfs_inode_item.o \
> > +				   xfs_rmap_item.o \
> >  				   xfs_log_recover.o \
> >  				   xfs_trans_ail.o \
> >  				   xfs_trans_buf.o \
> > diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
> > index e5baba3..b9627b7 100644
> > --- a/fs/xfs/libxfs/xfs_log_format.h
> > +++ b/fs/xfs/libxfs/xfs_log_format.h
> > @@ -110,7 +110,9 @@ static inline uint xlog_get_cycle(char *ptr)
> >  #define XLOG_REG_TYPE_COMMIT		18
> >  #define XLOG_REG_TYPE_TRANSHDR		19
> >  #define XLOG_REG_TYPE_ICREATE		20
> > -#define XLOG_REG_TYPE_MAX		20
> > +#define XLOG_REG_TYPE_RUI_FORMAT	21
> > +#define XLOG_REG_TYPE_RUD_FORMAT	22
> > +#define XLOG_REG_TYPE_MAX		22
> >  
> >  /*
> >   * Flags to log operation header
> > @@ -227,6 +229,8 @@ typedef struct xfs_trans_header {
> >  #define	XFS_LI_DQUOT		0x123d
> >  #define	XFS_LI_QUOTAOFF		0x123e
> >  #define	XFS_LI_ICREATE		0x123f
> > +#define	XFS_LI_RUI		0x1240	/* rmap update intent */
> > +#define	XFS_LI_RUD		0x1241
> >  
> >  #define XFS_LI_TYPE_DESC \
> >  	{ XFS_LI_EFI,		"XFS_LI_EFI" }, \
> > @@ -236,7 +240,9 @@ typedef struct xfs_trans_header {
> >  	{ XFS_LI_BUF,		"XFS_LI_BUF" }, \
> >  	{ XFS_LI_DQUOT,		"XFS_LI_DQUOT" }, \
> >  	{ XFS_LI_QUOTAOFF,	"XFS_LI_QUOTAOFF" }, \
> > -	{ XFS_LI_ICREATE,	"XFS_LI_ICREATE" }
> > +	{ XFS_LI_ICREATE,	"XFS_LI_ICREATE" }, \
> > +	{ XFS_LI_RUI,		"XFS_LI_RUI" }, \
> > +	{ XFS_LI_RUD,		"XFS_LI_RUD" }
> >  
> >  /*
> >   * Inode Log Item Format definitions.
> > @@ -604,6 +610,63 @@ typedef struct xfs_efd_log_format_64 {
> >  } xfs_efd_log_format_64_t;
> >  
> >  /*
> > + * RUI/RUD (reverse mapping) log format definitions
> > + */
> > +struct xfs_map_extent {
> > +	__uint64_t		me_owner;
> > +	__uint64_t		me_startblock;
> > +	__uint64_t		me_startoff;
> > +	__uint32_t		me_len;
> > +	__uint32_t		me_flags;
> > +};
> > +
> > +/* rmap me_flags: upper bits are flags, lower byte is type code */
> > +#define XFS_RMAP_EXTENT_MAP		1
> > +#define XFS_RMAP_EXTENT_MAP_SHARED	2
> > +#define XFS_RMAP_EXTENT_UNMAP		3
> > +#define XFS_RMAP_EXTENT_UNMAP_SHARED	4
> > +#define XFS_RMAP_EXTENT_CONVERT		5
> > +#define XFS_RMAP_EXTENT_CONVERT_SHARED	6
> > +#define XFS_RMAP_EXTENT_ALLOC		7
> > +#define XFS_RMAP_EXTENT_FREE		8
> > +#define XFS_RMAP_EXTENT_TYPE_MASK	0xFF
> 
> I assume all of the _SHARED stuff defined here and throughout is not
> used until reflink.. (not that big of a deal if it's a PITA to remove).

Yep, these are for reflink.

> > +
> > +#define XFS_RMAP_EXTENT_ATTR_FORK	(1U << 31)
> > +#define XFS_RMAP_EXTENT_BMBT_BLOCK	(1U << 30)
> > +#define XFS_RMAP_EXTENT_UNWRITTEN	(1U << 29)
> > +
> > +#define XFS_RMAP_EXTENT_FLAGS		(XFS_RMAP_EXTENT_TYPE_MASK | \
> > +					 XFS_RMAP_EXTENT_ATTR_FORK | \
> > +					 XFS_RMAP_EXTENT_BMBT_BLOCK | \
> > +					 XFS_RMAP_EXTENT_UNWRITTEN)
> > +
> > +/*
> > + * This is the structure used to lay out an rui log item in the
> > + * log.  The rui_extents field is a variable size array whose
> > + * size is given by rui_nextents.
> > + */
> > +struct xfs_rui_log_format {
> > +	__uint16_t		rui_type;	/* rui log item type */
> > +	__uint16_t		rui_size;	/* size of this item */
> > +	__uint32_t		rui_nextents;	/* # extents to free */
> > +	__uint64_t		rui_id;		/* rui identifier */
> > +	struct xfs_map_extent	rui_extents[1];	/* array of extents to rmap */
> > +};
> > +
> > +/*
> > + * This is the structure used to lay out an rud log item in the
> > + * log.  The rud_extents array is a variable size array whose
> > + * size is given by rud_nextents;
> > + */
> > +struct xfs_rud_log_format {
> > +	__uint16_t		rud_type;	/* rud log item type */
> > +	__uint16_t		rud_size;	/* size of this item */
> > +	__uint32_t		rud_nextents;	/* # of extents freed */
> > +	__uint64_t		rud_rui_id;	/* id of corresponding rui */
> > +	struct xfs_map_extent	rud_extents[1];	/* array of extents rmapped */
> > +};
> > +
> > +/*
> >   * Dquot Log format definitions.
> >   *
> >   * The first two fields must be the type and size fitting into
> ...
> > diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c
> > new file mode 100644
> > index 0000000..91a3b2c
> > --- /dev/null
> > +++ b/fs/xfs/xfs_rmap_item.c
> > @@ -0,0 +1,459 @@
> ...
> > +/*
> > + * Copy an RUI format buffer from the given buf, and into the destination
> > + * RUI format structure.  The RUI/RUD items were designed not to need any
> > + * special alignment handling.
> > + */
> > +int
> > +xfs_rui_copy_format(
> > +	struct xfs_log_iovec		*buf,
> > +	struct xfs_rui_log_format	*dst_rui_fmt)
> > +{
> > +	struct xfs_rui_log_format	*src_rui_fmt;
> > +	uint				len;
> > +
> > +	src_rui_fmt = buf->i_addr;
> > +	len = sizeof(struct xfs_rui_log_format) +
> > +			(src_rui_fmt->rui_nextents - 1) *
> > +			sizeof(struct xfs_map_extent);
> > +
> > +	if (buf->i_len == len) {
> > +		memcpy((char *)dst_rui_fmt, (char *)src_rui_fmt, len);
> > +		return 0;
> > +	}
> > +	return -EFSCORRUPTED;
> 
> I'd switch this around since we don't have the mess that
> xfs_efi_copy_format() has to deal with. E.g.,
> 
> 	if (buf->i_len != len)
> 		return -EFSCORRUPTED;
> 
> 	memcpy(..);
> 	return 0;

Will do.

--D

> 
> Brian
> 
> > +}
> > +
> > +/*
> > + * Freeing the RUI requires that we remove it from the AIL if it has already
> > + * been placed there. However, the RUI may not yet have been placed in the AIL
> > + * when called by xfs_rui_release() from RUD processing due to the ordering of
> > + * committed vs unpin operations in bulk insert operations. Hence the reference
> > + * count to ensure only the last caller frees the RUI.
> > + */
> > +void
> > +xfs_rui_release(
> > +	struct xfs_rui_log_item	*ruip)
> > +{
> > +	if (atomic_dec_and_test(&ruip->rui_refcount)) {
> > +		xfs_trans_ail_remove(&ruip->rui_item, SHUTDOWN_LOG_IO_ERROR);
> > +		xfs_rui_item_free(ruip);
> > +	}
> > +}
> > +
> > +static inline struct xfs_rud_log_item *RUD_ITEM(struct xfs_log_item *lip)
> > +{
> > +	return container_of(lip, struct xfs_rud_log_item, rud_item);
> > +}
> > +
> > +STATIC void
> > +xfs_rud_item_free(struct xfs_rud_log_item *rudp)
> > +{
> > +	if (rudp->rud_format.rud_nextents > XFS_RUD_MAX_FAST_EXTENTS)
> > +		kmem_free(rudp);
> > +	else
> > +		kmem_zone_free(xfs_rud_zone, rudp);
> > +}
> > +
> > +/*
> > + * This returns the number of iovecs needed to log the given rud item.
> > + * We only need 1 iovec for an rud item.  It just logs the rud_log_format
> > + * structure.
> > + */
> > +static inline int
> > +xfs_rud_item_sizeof(
> > +	struct xfs_rud_log_item	*rudp)
> > +{
> > +	return sizeof(struct xfs_rud_log_format) +
> > +			(rudp->rud_format.rud_nextents - 1) *
> > +			sizeof(struct xfs_map_extent);
> > +}
> > +
> > +STATIC void
> > +xfs_rud_item_size(
> > +	struct xfs_log_item	*lip,
> > +	int			*nvecs,
> > +	int			*nbytes)
> > +{
> > +	*nvecs += 1;
> > +	*nbytes += xfs_rud_item_sizeof(RUD_ITEM(lip));
> > +}
> > +
> > +/*
> > + * This is called to fill in the vector of log iovecs for the
> > + * given rud log item. We use only 1 iovec, and we point that
> > + * at the rud_log_format structure embedded in the rud item.
> > + * It is at this point that we assert that all of the extent
> > + * slots in the rud item have been filled.
> > + */
> > +STATIC void
> > +xfs_rud_item_format(
> > +	struct xfs_log_item	*lip,
> > +	struct xfs_log_vec	*lv)
> > +{
> > +	struct xfs_rud_log_item	*rudp = RUD_ITEM(lip);
> > +	struct xfs_log_iovec	*vecp = NULL;
> > +
> > +	ASSERT(rudp->rud_next_extent == rudp->rud_format.rud_nextents);
> > +
> > +	rudp->rud_format.rud_type = XFS_LI_RUD;
> > +	rudp->rud_format.rud_size = 1;
> > +
> > +	xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_RUD_FORMAT, &rudp->rud_format,
> > +			xfs_rud_item_sizeof(rudp));
> > +}
> > +
> > +/*
> > + * Pinning has no meaning for an rud item, so just return.
> > + */
> > +STATIC void
> > +xfs_rud_item_pin(
> > +	struct xfs_log_item	*lip)
> > +{
> > +}
> > +
> > +/*
> > + * Since pinning has no meaning for an rud item, unpinning does
> > + * not either.
> > + */
> > +STATIC void
> > +xfs_rud_item_unpin(
> > +	struct xfs_log_item	*lip,
> > +	int			remove)
> > +{
> > +}
> > +
> > +/*
> > + * There isn't much you can do to push on an rud item.  It is simply stuck
> > + * waiting for the log to be flushed to disk.
> > + */
> > +STATIC uint
> > +xfs_rud_item_push(
> > +	struct xfs_log_item	*lip,
> > +	struct list_head	*buffer_list)
> > +{
> > +	return XFS_ITEM_PINNED;
> > +}
> > +
> > +/*
> > + * The RUD is either committed or aborted if the transaction is cancelled. If
> > + * the transaction is cancelled, drop our reference to the RUI and free the
> > + * RUD.
> > + */
> > +STATIC void
> > +xfs_rud_item_unlock(
> > +	struct xfs_log_item	*lip)
> > +{
> > +	struct xfs_rud_log_item	*rudp = RUD_ITEM(lip);
> > +
> > +	if (lip->li_flags & XFS_LI_ABORTED) {
> > +		xfs_rui_release(rudp->rud_ruip);
> > +		xfs_rud_item_free(rudp);
> > +	}
> > +}
> > +
> > +/*
> > + * When the rud item is committed to disk, all we need to do is delete our
> > + * reference to our partner rui item and then free ourselves. Since we're
> > + * freeing ourselves we must return -1 to keep the transaction code from
> > + * further referencing this item.
> > + */
> > +STATIC xfs_lsn_t
> > +xfs_rud_item_committed(
> > +	struct xfs_log_item	*lip,
> > +	xfs_lsn_t		lsn)
> > +{
> > +	struct xfs_rud_log_item	*rudp = RUD_ITEM(lip);
> > +
> > +	/*
> > +	 * Drop the RUI reference regardless of whether the RUD has been
> > +	 * aborted. Once the RUD transaction is constructed, it is the sole
> > +	 * responsibility of the RUD to release the RUI (even if the RUI is
> > +	 * aborted due to log I/O error).
> > +	 */
> > +	xfs_rui_release(rudp->rud_ruip);
> > +	xfs_rud_item_free(rudp);
> > +
> > +	return (xfs_lsn_t)-1;
> > +}
> > +
> > +/*
> > + * The RUD dependency tracking op doesn't do squat.  It can't because
> > + * it doesn't know where the free extent is coming from.  The dependency
> > + * tracking has to be handled by the "enclosing" metadata object.  For
> > + * example, for inodes, the inode is locked throughout the extent freeing
> > + * so the dependency should be recorded there.
> > + */
> > +STATIC void
> > +xfs_rud_item_committing(
> > +	struct xfs_log_item	*lip,
> > +	xfs_lsn_t		lsn)
> > +{
> > +}
> > +
> > +/*
> > + * This is the ops vector shared by all rud log items.
> > + */
> > +static const struct xfs_item_ops xfs_rud_item_ops = {
> > +	.iop_size	= xfs_rud_item_size,
> > +	.iop_format	= xfs_rud_item_format,
> > +	.iop_pin	= xfs_rud_item_pin,
> > +	.iop_unpin	= xfs_rud_item_unpin,
> > +	.iop_unlock	= xfs_rud_item_unlock,
> > +	.iop_committed	= xfs_rud_item_committed,
> > +	.iop_push	= xfs_rud_item_push,
> > +	.iop_committing = xfs_rud_item_committing,
> > +};
> > +
> > +/*
> > + * Allocate and initialize an rud item with the given number of extents.
> > + */
> > +struct xfs_rud_log_item *
> > +xfs_rud_init(
> > +	struct xfs_mount		*mp,
> > +	struct xfs_rui_log_item		*ruip,
> > +	uint				nextents)
> > +
> > +{
> > +	struct xfs_rud_log_item	*rudp;
> > +	uint			size;
> > +
> > +	ASSERT(nextents > 0);
> > +	if (nextents > XFS_RUD_MAX_FAST_EXTENTS) {
> > +		size = (uint)(sizeof(struct xfs_rud_log_item) +
> > +			((nextents - 1) * sizeof(struct xfs_map_extent)));
> > +		rudp = kmem_zalloc(size, KM_SLEEP);
> > +	} else {
> > +		rudp = kmem_zone_zalloc(xfs_rud_zone, KM_SLEEP);
> > +	}
> > +
> > +	xfs_log_item_init(mp, &rudp->rud_item, XFS_LI_RUD, &xfs_rud_item_ops);
> > +	rudp->rud_ruip = ruip;
> > +	rudp->rud_format.rud_nextents = nextents;
> > +	rudp->rud_format.rud_rui_id = ruip->rui_format.rui_id;
> > +
> > +	return rudp;
> > +}
> > diff --git a/fs/xfs/xfs_rmap_item.h b/fs/xfs/xfs_rmap_item.h
> > new file mode 100644
> > index 0000000..bd36ab5
> > --- /dev/null
> > +++ b/fs/xfs/xfs_rmap_item.h
> > @@ -0,0 +1,100 @@
> > +/*
> > + * Copyright (C) 2016 Oracle.  All Rights Reserved.
> > + *
> > + * Author: Darrick J. Wong <darrick.wong@oracle.com>
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU General Public License
> > + * as published by the Free Software Foundation; either version 2
> > + * of the License, or (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it would be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public License
> > + * along with this program; if not, write the Free Software Foundation,
> > + * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
> > + */
> > +#ifndef	__XFS_RMAP_ITEM_H__
> > +#define	__XFS_RMAP_ITEM_H__
> > +
> > +/*
> > + * There are (currently) three pairs of rmap btree redo item types: map, unmap,
> > + * and convert.  The common abbreviations for these are RUI (rmap update
> > + * intent) and RUD (rmap update done).  The redo item type is encoded in the
> > + * flags field of each xfs_map_extent.
> > + *
> > + * *I items should be recorded in the *first* of a series of rolled
> > + * transactions, and the *D items should be recorded in the same transaction
> > + * that records the associated rmapbt updates.  Typically, the first
> > + * transaction will record a bmbt update, followed by some number of
> > + * transactions containing rmapbt updates, and finally transactions with any
> > + * bnobt/cntbt updates.
> > + *
> > + * Should the system crash after the commit of the first transaction but
> > + * before the commit of the final transaction in a series, log recovery will
> > + * use the redo information recorded by the intent items to replay the
> > + * (rmapbt/bnobt/cntbt) metadata updates in the non-first transaction.
> > + */
> > +
> > +/* kernel only RUI/RUD definitions */
> > +
> > +struct xfs_mount;
> > +struct kmem_zone;
> > +
> > +/*
> > + * Max number of extents in fast allocation path.
> > + */
> > +#define	XFS_RUI_MAX_FAST_EXTENTS	16
> > +
> > +/*
> > + * Define RUI flag bits. Manipulated by set/clear/test_bit operators.
> > + */
> > +#define	XFS_RUI_RECOVERED		1
> > +
> > +/*
> > + * This is the "rmap update intent" log item.  It is used to log the fact that
> > + * some reverse mappings need to change.  It is used in conjunction with the
> > + * "rmap update done" log item described below.
> > + *
> > + * These log items follow the same rules as struct xfs_efi_log_item; see the
> > + * comments about that structure (in xfs_extfree_item.h) for more details.
> > + */
> > +struct xfs_rui_log_item {
> > +	struct xfs_log_item		rui_item;
> > +	atomic_t			rui_refcount;
> > +	atomic_t			rui_next_extent;
> > +	unsigned long			rui_flags;	/* misc flags */
> > +	struct xfs_rui_log_format	rui_format;
> > +};
> > +
> > +/*
> > + * This is the "rmap update done" log item.  It is used to log the fact that
> > + * some rmapbt updates mentioned in an earlier rui item have been performed.
> > + */
> > +struct xfs_rud_log_item {
> > +	struct xfs_log_item		rud_item;
> > +	struct xfs_rui_log_item		*rud_ruip;
> > +	uint				rud_next_extent;
> > +	struct xfs_rud_log_format	rud_format;
> > +};
> > +
> > +/*
> > + * Max number of extents in fast allocation path.
> > + */
> > +#define	XFS_RUD_MAX_FAST_EXTENTS	16
> > +
> > +extern struct kmem_zone	*xfs_rui_zone;
> > +extern struct kmem_zone	*xfs_rud_zone;
> > +
> > +struct xfs_rui_log_item *xfs_rui_init(struct xfs_mount *, uint);
> > +struct xfs_rud_log_item *xfs_rud_init(struct xfs_mount *,
> > +		struct xfs_rui_log_item *, uint);
> > +int xfs_rui_copy_format(struct xfs_log_iovec *buf,
> > +		struct xfs_rui_log_format *dst_rui_fmt);
> > +void xfs_rui_item_free(struct xfs_rui_log_item *);
> > +void xfs_rui_release(struct xfs_rui_log_item *);
> > +
> > +#endif	/* __XFS_RMAP_ITEM_H__ */
> > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > index 1575849..a8300e4 100644
> > --- a/fs/xfs/xfs_super.c
> > +++ b/fs/xfs/xfs_super.c
> > @@ -47,6 +47,7 @@
> >  #include "xfs_sysfs.h"
> >  #include "xfs_ondisk.h"
> >  #include "xfs_defer.h"
> > +#include "xfs_rmap_item.h"
> >  
> >  #include <linux/namei.h>
> >  #include <linux/init.h>
> > @@ -1762,8 +1763,26 @@ xfs_init_zones(void)
> >  	if (!xfs_icreate_zone)
> >  		goto out_destroy_ili_zone;
> >  
> > +	xfs_rud_zone = kmem_zone_init((sizeof(struct xfs_rud_log_item) +
> > +			((XFS_RUD_MAX_FAST_EXTENTS - 1) *
> > +				 sizeof(struct xfs_map_extent))),
> > +			"xfs_rud_item");
> > +	if (!xfs_rud_zone)
> > +		goto out_destroy_icreate_zone;
> > +
> > +	xfs_rui_zone = kmem_zone_init((sizeof(struct xfs_rui_log_item) +
> > +			((XFS_RUI_MAX_FAST_EXTENTS - 1) *
> > +				sizeof(struct xfs_map_extent))),
> > +			"xfs_rui_item");
> > +	if (!xfs_rui_zone)
> > +		goto out_destroy_rud_zone;
> > +
> >  	return 0;
> >  
> > + out_destroy_rud_zone:
> > +	kmem_zone_destroy(xfs_rud_zone);
> > + out_destroy_icreate_zone:
> > +	kmem_zone_destroy(xfs_icreate_zone);
> >   out_destroy_ili_zone:
> >  	kmem_zone_destroy(xfs_ili_zone);
> >   out_destroy_inode_zone:
> > @@ -1802,6 +1821,8 @@ xfs_destroy_zones(void)
> >  	 * destroy caches.
> >  	 */
> >  	rcu_barrier();
> > +	kmem_zone_destroy(xfs_rui_zone);
> > +	kmem_zone_destroy(xfs_rud_zone);
> >  	kmem_zone_destroy(xfs_icreate_zone);
> >  	kmem_zone_destroy(xfs_ili_zone);
> >  	kmem_zone_destroy(xfs_inode_zone);
> > 

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 044/119] xfs: propagate bmap updates to rmapbt
  2016-07-15 18:33   ` Brian Foster
@ 2016-07-16  7:26     ` Darrick J. Wong
  2016-07-18  1:21       ` Dave Chinner
  2016-07-18 12:55       ` Brian Foster
  0 siblings, 2 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-07-16  7:26 UTC (permalink / raw)
  To: Brian Foster; +Cc: david, linux-fsdevel, vishal.l.verma, xfs

On Fri, Jul 15, 2016 at 02:33:56PM -0400, Brian Foster wrote:
> On Thu, Jun 16, 2016 at 06:22:34PM -0700, Darrick J. Wong wrote:
> > When we map, unmap, or convert an extent in a file's data or attr
> > fork, schedule a respective update in the rmapbt.  Previous versions
> > of this patch required a 1:1 correspondence between bmap and rmap,
> > but this is no longer true.
> > 
> > v2: Remove the 1:1 correspondence requirement now that we have the
> > ability to make interval queries against the rmapbt.  Update the
> > commit message to reflect the broad restructuring of this patch.
> > Fix the bmap shift code to adjust the rmaps correctly.
> > 
> > v3: Use the deferred operations code to handle redo operations
> > atomically and deadlock free.  Plumb in all five rmap actions
> > (map, unmap, convert extent, alloc, free); we'll use the first
> > three now for file data, and reflink will want the last two.
> > Add an error injection site to test log recovery.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/libxfs/xfs_bmap.c       |   56 ++++++++-
> >  fs/xfs/libxfs/xfs_rmap.c       |  252 ++++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/libxfs/xfs_rmap_btree.h |   24 ++++
> >  fs/xfs/xfs_bmap_util.c         |    1 
> >  fs/xfs/xfs_defer_item.c        |    6 +
> >  fs/xfs/xfs_error.h             |    4 -
> >  fs/xfs/xfs_log_recover.c       |   56 +++++++++
> >  fs/xfs/xfs_trans.h             |    3 
> >  fs/xfs/xfs_trans_rmap.c        |    7 +
> >  9 files changed, 393 insertions(+), 16 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > index 61c0231..507fd74 100644
> > --- a/fs/xfs/libxfs/xfs_bmap.c
> > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > @@ -46,6 +46,7 @@
> >  #include "xfs_symlink.h"
> >  #include "xfs_attr_leaf.h"
> >  #include "xfs_filestream.h"
> > +#include "xfs_rmap_btree.h"
> >  
> >  
> >  kmem_zone_t		*xfs_bmap_free_item_zone;
> > @@ -2178,6 +2179,11 @@ xfs_bmap_add_extent_delay_real(
> >  		ASSERT(0);
> >  	}
> >  
> > +	/* add reverse mapping */
> > +	error = xfs_rmap_map_extent(mp, bma->dfops, bma->ip, whichfork, new);
> > +	if (error)
> > +		goto done;
> > +
> >  	/* convert to a btree if necessary */
> >  	if (xfs_bmap_needs_btree(bma->ip, whichfork)) {
> >  		int	tmp_logflags;	/* partial log flag return val */
> > @@ -2714,6 +2720,11 @@ xfs_bmap_add_extent_unwritten_real(
> >  		ASSERT(0);
> >  	}
> >  
> > +	/* update reverse mappings */
> > +	error = xfs_rmap_convert_extent(mp, dfops, ip, XFS_DATA_FORK, new);
> > +	if (error)
> > +		goto done;
> > +
> >  	/* convert to a btree if necessary */
> >  	if (xfs_bmap_needs_btree(ip, XFS_DATA_FORK)) {
> >  		int	tmp_logflags;	/* partial log flag return val */
> > @@ -3106,6 +3117,11 @@ xfs_bmap_add_extent_hole_real(
> >  		break;
> >  	}
> >  
> > +	/* add reverse mapping */
> > +	error = xfs_rmap_map_extent(mp, bma->dfops, bma->ip, whichfork, new);
> > +	if (error)
> > +		goto done;
> > +
> >  	/* convert to a btree if necessary */
> >  	if (xfs_bmap_needs_btree(bma->ip, whichfork)) {
> >  		int	tmp_logflags;	/* partial log flag return val */
> > @@ -5032,6 +5048,14 @@ xfs_bmap_del_extent(
> >  		++*idx;
> >  		break;
> >  	}
> > +
> > +	/* remove reverse mapping */
> > +	if (!delay) {
> > +		error = xfs_rmap_unmap_extent(mp, dfops, ip, whichfork, del);
> > +		if (error)
> > +			goto done;
> > +	}
> > +
> >  	/*
> >  	 * If we need to, add to list of extents to delete.
> >  	 */
> > @@ -5569,7 +5593,8 @@ xfs_bmse_shift_one(
> >  	struct xfs_bmbt_rec_host	*gotp,
> >  	struct xfs_btree_cur		*cur,
> >  	int				*logflags,
> > -	enum shift_direction		direction)
> > +	enum shift_direction		direction,
> > +	struct xfs_defer_ops		*dfops)
> >  {
> >  	struct xfs_ifork		*ifp;
> >  	struct xfs_mount		*mp;
> > @@ -5617,9 +5642,13 @@ xfs_bmse_shift_one(
> >  		/* check whether to merge the extent or shift it down */
> >  		if (xfs_bmse_can_merge(&adj_irec, &got,
> >  				       offset_shift_fsb)) {
> > -			return xfs_bmse_merge(ip, whichfork, offset_shift_fsb,
> > -					      *current_ext, gotp, adj_irecp,
> > -					      cur, logflags);
> > +			error = xfs_bmse_merge(ip, whichfork, offset_shift_fsb,
> > +					       *current_ext, gotp, adj_irecp,
> > +					       cur, logflags);
> > +			if (error)
> > +				return error;
> > +			adj_irec = got;
> > +			goto update_rmap;
> >  		}
> >  	} else {
> >  		startoff = got.br_startoff + offset_shift_fsb;
> > @@ -5656,9 +5685,10 @@ update_current_ext:
> >  		(*current_ext)--;
> >  	xfs_bmbt_set_startoff(gotp, startoff);
> >  	*logflags |= XFS_ILOG_CORE;
> > +	adj_irec = got;
> >  	if (!cur) {
> >  		*logflags |= XFS_ILOG_DEXT;
> > -		return 0;
> > +		goto update_rmap;
> >  	}
> >  
> >  	error = xfs_bmbt_lookup_eq(cur, got.br_startoff, got.br_startblock,
> > @@ -5668,8 +5698,18 @@ update_current_ext:
> >  	XFS_WANT_CORRUPTED_RETURN(mp, i == 1);
> >  
> >  	got.br_startoff = startoff;
> > -	return xfs_bmbt_update(cur, got.br_startoff, got.br_startblock,
> > -			       got.br_blockcount, got.br_state);
> > +	error = xfs_bmbt_update(cur, got.br_startoff, got.br_startblock,
> > +			got.br_blockcount, got.br_state);
> > +	if (error)
> > +		return error;
> > +
> > +update_rmap:
> > +	/* update reverse mapping */
> > +	error = xfs_rmap_unmap_extent(mp, dfops, ip, whichfork, &adj_irec);
> > +	if (error)
> > +		return error;
> > +	adj_irec.br_startoff = startoff;
> > +	return xfs_rmap_map_extent(mp, dfops, ip, whichfork, &adj_irec);
> >  }
> >  
> >  /*
> > @@ -5797,7 +5837,7 @@ xfs_bmap_shift_extents(
> >  	while (nexts++ < num_exts) {
> >  		error = xfs_bmse_shift_one(ip, whichfork, offset_shift_fsb,
> >  					   &current_ext, gotp, cur, &logflags,
> > -					   direction);
> > +					   direction, dfops);
> >  		if (error)
> >  			goto del_cursor;
> >  		/*
> > diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
> > index 76fc5c2..f179ea4 100644
> > --- a/fs/xfs/libxfs/xfs_rmap.c
> > +++ b/fs/xfs/libxfs/xfs_rmap.c
> > @@ -36,6 +36,8 @@
> >  #include "xfs_trace.h"
> >  #include "xfs_error.h"
> >  #include "xfs_extent_busy.h"
> > +#include "xfs_bmap.h"
> > +#include "xfs_inode.h"
> >  
> >  /*
> >   * Lookup the first record less than or equal to [bno, len, owner, offset]
> > @@ -1212,3 +1214,253 @@ xfs_rmapbt_query_range(
> >  	return xfs_btree_query_range(cur, &low_brec, &high_brec,
> >  			xfs_rmapbt_query_range_helper, &query);
> >  }
> > +
> > +/* Clean up after calling xfs_rmap_finish_one. */
> > +void
> > +xfs_rmap_finish_one_cleanup(
> > +	struct xfs_trans	*tp,
> > +	struct xfs_btree_cur	*rcur,
> > +	int			error)
> > +{
> > +	struct xfs_buf		*agbp;
> > +
> > +	if (rcur == NULL)
> > +		return;
> > +	agbp = rcur->bc_private.a.agbp;
> > +	xfs_btree_del_cursor(rcur, error ? XFS_BTREE_ERROR : XFS_BTREE_NOERROR);
> > +	xfs_trans_brelse(tp, agbp);
> 
> Why unconditionally release the agbp (and not just on error)?

We grabbed the agbp (er, the AGF buffer) to construct the rmapbt cursor, so we
have to release it after the cursor is deleted, regardless of whether or not
there's an error.

> > +}
> > +
> > +/*
> > + * Process one of the deferred rmap operations.  We pass back the
> > + * btree cursor to maintain our lock on the rmapbt between calls.
> > + * This saves time and eliminates a buffer deadlock between the
> > + * superblock and the AGF because we'll always grab them in the same
> > + * order.
> > + */
> > +int
> > +xfs_rmap_finish_one(
> > +	struct xfs_trans		*tp,
> > +	enum xfs_rmap_intent_type	type,
> > +	__uint64_t			owner,
> > +	int				whichfork,
> > +	xfs_fileoff_t			startoff,
> > +	xfs_fsblock_t			startblock,
> > +	xfs_filblks_t			blockcount,
> > +	xfs_exntst_t			state,
> > +	struct xfs_btree_cur		**pcur)
> > +{
> > +	struct xfs_mount		*mp = tp->t_mountp;
> > +	struct xfs_btree_cur		*rcur;
> > +	struct xfs_buf			*agbp = NULL;
> > +	int				error = 0;
> > +	xfs_agnumber_t			agno;
> > +	struct xfs_owner_info		oinfo;
> > +	xfs_agblock_t			bno;
> > +	bool				unwritten;
> > +
> > +	agno = XFS_FSB_TO_AGNO(mp, startblock);
> > +	ASSERT(agno != NULLAGNUMBER);
> > +	bno = XFS_FSB_TO_AGBNO(mp, startblock);
> > +
> > +	trace_xfs_rmap_deferred(mp, agno, type, bno, owner, whichfork,
> > +			startoff, blockcount, state);
> > +
> > +	if (XFS_TEST_ERROR(false, mp,
> > +			XFS_ERRTAG_RMAP_FINISH_ONE,
> > +			XFS_RANDOM_RMAP_FINISH_ONE))
> > +		return -EIO;
> > +
> > +	/*
> > +	 * If we haven't gotten a cursor or the cursor AG doesn't match
> > +	 * the startblock, get one now.
> > +	 */
> > +	rcur = *pcur;
> > +	if (rcur != NULL && rcur->bc_private.a.agno != agno) {
> > +		xfs_rmap_finish_one_cleanup(tp, rcur, 0);
> > +		rcur = NULL;
> > +		*pcur = NULL;
> > +	}
> > +	if (rcur == NULL) {
> > +		error = xfs_free_extent_fix_freelist(tp, agno, &agbp);
> 
> Comment? Why is this here? (Maybe we should rename that function while
> we're at it..)

/*
 * Ensure the freelist is of a sufficient length to provide for any btree
 * splits that could happen when we make changes to the rmapbt.
 */

(I don't know why the function has that name; Dave supplied it.)

> > +		if (error)
> > +			return error;
> > +		if (!agbp)
> > +			return -EFSCORRUPTED;
> > +
> > +		rcur = xfs_rmapbt_init_cursor(mp, tp, agbp, agno);
> > +		if (!rcur) {
> > +			error = -ENOMEM;
> > +			goto out_cur;
> > +		}
> > +	}
> > +	*pcur = rcur;
> > +
> > +	xfs_rmap_ino_owner(&oinfo, owner, whichfork, startoff);
> > +	unwritten = state == XFS_EXT_UNWRITTEN;
> > +	bno = XFS_FSB_TO_AGBNO(rcur->bc_mp, startblock);
> > +
> > +	switch (type) {
> > +	case XFS_RMAP_MAP:
> > +		error = xfs_rmap_map(rcur, bno, blockcount, unwritten, &oinfo);
> > +		break;
> > +	case XFS_RMAP_UNMAP:
> > +		error = xfs_rmap_unmap(rcur, bno, blockcount, unwritten,
> > +				&oinfo);
> > +		break;
> > +	case XFS_RMAP_CONVERT:
> > +		error = xfs_rmap_convert(rcur, bno, blockcount, !unwritten,
> > +				&oinfo);
> > +		break;
> > +	case XFS_RMAP_ALLOC:
> > +		error = __xfs_rmap_alloc(rcur, bno, blockcount, unwritten,
> > +				&oinfo);
> > +		break;
> > +	case XFS_RMAP_FREE:
> > +		error = __xfs_rmap_free(rcur, bno, blockcount, unwritten,
> > +				&oinfo);
> > +		break;
> > +	default:
> > +		ASSERT(0);
> > +		error = -EFSCORRUPTED;
> > +	}
> > +	return error;
> > +
> > +out_cur:
> > +	xfs_trans_brelse(tp, agbp);
> > +
> > +	return error;
> > +}
> > +
> > +/*
> > + * Record a rmap intent; the list is kept sorted first by AG and then by
> > + * increasing age.
> > + */
> > +static int
> > +__xfs_rmap_add(
> > +	struct xfs_mount	*mp,
> > +	struct xfs_defer_ops	*dfops,
> > +	struct xfs_rmap_intent	*ri)
> > +{
> > +	struct xfs_rmap_intent	*new;
> > +
> > +	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
> > +		return 0;
> > +
> > +	trace_xfs_rmap_defer(mp, XFS_FSB_TO_AGNO(mp, ri->ri_bmap.br_startblock),
> > +			ri->ri_type,
> > +			XFS_FSB_TO_AGBNO(mp, ri->ri_bmap.br_startblock),
> > +			ri->ri_owner, ri->ri_whichfork,
> > +			ri->ri_bmap.br_startoff,
> > +			ri->ri_bmap.br_blockcount,
> > +			ri->ri_bmap.br_state);
> > +
> > +	new = kmem_zalloc(sizeof(struct xfs_rmap_intent), KM_SLEEP | KM_NOFS);
> > +	*new = *ri;
> > +
> > +	xfs_defer_add(dfops, XFS_DEFER_OPS_TYPE_RMAP, &new->ri_list);
> > +	return 0;
> > +}
> > +
> > +/* Map an extent into a file. */
> > +int
> > +xfs_rmap_map_extent(
> > +	struct xfs_mount	*mp,
> > +	struct xfs_defer_ops	*dfops,
> > +	struct xfs_inode	*ip,
> > +	int			whichfork,
> > +	struct xfs_bmbt_irec	*PREV)
> > +{
> > +	struct xfs_rmap_intent	ri;
> > +
> > +	ri.ri_type = XFS_RMAP_MAP;
> > +	ri.ri_owner = ip->i_ino;
> > +	ri.ri_whichfork = whichfork;
> > +	ri.ri_bmap = *PREV;
> > +
> 
> I think we should probably initialize ri_list as well (maybe turn this
> into an xfs_rmap_init helper).

__xfs_rmap_add calls xfs_defer_add, which calls list_add_tail, which
initializes ri_list.  We could probably make an _rmap_init helper that
allocates the structure, have the _rmap_*_extent functions fill out the new
intent, and then have the _rmap_add function pass it to _defer_add, which I
think is what you're getting at.
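
In other words, something like this untested sketch (xfs_rmap_init is a
made-up name, and I've hoisted the feature check up here too):

/* Allocate a zeroed intent for the caller to fill out. */
static struct xfs_rmap_intent *
xfs_rmap_init(void)
{
	return kmem_zalloc(sizeof(struct xfs_rmap_intent),
			KM_SLEEP | KM_NOFS);
}

int
xfs_rmap_map_extent(
	struct xfs_mount	*mp,
	struct xfs_defer_ops	*dfops,
	struct xfs_inode	*ip,
	int			whichfork,
	struct xfs_bmbt_irec	*PREV)
{
	struct xfs_rmap_intent	*ri;

	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
		return 0;

	ri = xfs_rmap_init();
	ri->ri_type = XFS_RMAP_MAP;
	ri->ri_owner = ip->i_ino;
	ri->ri_whichfork = whichfork;
	ri->ri_bmap = *PREV;

	return __xfs_rmap_add(mp, dfops, ri);
}

with __xfs_rmap_add reduced to the tracepoint and the _defer_add call.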

> Also, for some reason it feels to me like the _hasrmapbt() feature check
> should be up at this level (or higher), rather than buried in
> __xfs_rmap_add(). I don't feel too strongly about that if others think
> differently, however.

<shrug> It probably ought to be in the higher level function.

> > +	return __xfs_rmap_add(mp, dfops, &ri);
> > +}
> > +
> > +/* Unmap an extent out of a file. */
> > +int
> > +xfs_rmap_unmap_extent(
> > +	struct xfs_mount	*mp,
> > +	struct xfs_defer_ops	*dfops,
> > +	struct xfs_inode	*ip,
> > +	int			whichfork,
> > +	struct xfs_bmbt_irec	*PREV)
> > +{
> > +	struct xfs_rmap_intent	ri;
> > +
> > +	ri.ri_type = XFS_RMAP_UNMAP;
> > +	ri.ri_owner = ip->i_ino;
> > +	ri.ri_whichfork = whichfork;
> > +	ri.ri_bmap = *PREV;
> > +
> > +	return __xfs_rmap_add(mp, dfops, &ri);
> > +}
> > +
> > +/* Convert a data fork extent from unwritten to real or vice versa. */
> > +int
> > +xfs_rmap_convert_extent(
> > +	struct xfs_mount	*mp,
> > +	struct xfs_defer_ops	*dfops,
> > +	struct xfs_inode	*ip,
> > +	int			whichfork,
> > +	struct xfs_bmbt_irec	*PREV)
> > +{
> > +	struct xfs_rmap_intent	ri;
> > +
> > +	ri.ri_type = XFS_RMAP_CONVERT;
> > +	ri.ri_owner = ip->i_ino;
> > +	ri.ri_whichfork = whichfork;
> > +	ri.ri_bmap = *PREV;
> > +
> > +	return __xfs_rmap_add(mp, dfops, &ri);
> > +}
> > +
> > +/* Schedule the creation of an rmap for non-file data. */
> > +int
> > +xfs_rmap_alloc_defer(
> 
> xfs_rmap_[alloc|free]_extent() like the others..?

Yeah.  The naming has shifted a bit over the past few revisions.

--D

> 
> Brian 
> 
> > +	struct xfs_mount	*mp,
> > +	struct xfs_defer_ops	*dfops,
> > +	xfs_agnumber_t		agno,
> > +	xfs_agblock_t		bno,
> > +	xfs_extlen_t		len,
> > +	__uint64_t		owner)
> > +{
> > +	struct xfs_rmap_intent	ri;
> > +
> > +	ri.ri_type = XFS_RMAP_ALLOC;
> > +	ri.ri_owner = owner;
> > +	ri.ri_whichfork = XFS_DATA_FORK;
> > +	ri.ri_bmap.br_startblock = XFS_AGB_TO_FSB(mp, agno, bno);
> > +	ri.ri_bmap.br_blockcount = len;
> > +	ri.ri_bmap.br_startoff = 0;
> > +	ri.ri_bmap.br_state = XFS_EXT_NORM;
> > +
> > +	return __xfs_rmap_add(mp, dfops, &ri);
> > +}
> > +
> > +/* Schedule the deletion of an rmap for non-file data. */
> > +int
> > +xfs_rmap_free_defer(
> > +	struct xfs_mount	*mp,
> > +	struct xfs_defer_ops	*dfops,
> > +	xfs_agnumber_t		agno,
> > +	xfs_agblock_t		bno,
> > +	xfs_extlen_t		len,
> > +	__uint64_t		owner)
> > +{
> > +	struct xfs_rmap_intent	ri;
> > +
> > +	ri.ri_type = XFS_RMAP_FREE;
> > +	ri.ri_owner = owner;
> > +	ri.ri_whichfork = XFS_DATA_FORK;
> > +	ri.ri_bmap.br_startblock = XFS_AGB_TO_FSB(mp, agno, bno);
> > +	ri.ri_bmap.br_blockcount = len;
> > +	ri.ri_bmap.br_startoff = 0;
> > +	ri.ri_bmap.br_state = XFS_EXT_NORM;
> > +
> > +	return __xfs_rmap_add(mp, dfops, &ri);
> > +}
> > diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
> > index aff60dc..5df406e 100644
> > --- a/fs/xfs/libxfs/xfs_rmap_btree.h
> > +++ b/fs/xfs/libxfs/xfs_rmap_btree.h
> > @@ -106,4 +106,28 @@ struct xfs_rmap_intent {
> >  	struct xfs_bmbt_irec			ri_bmap;
> >  };
> >  
> > +/* functions for updating the rmapbt based on bmbt map/unmap operations */
> > +int xfs_rmap_map_extent(struct xfs_mount *mp, struct xfs_defer_ops *dfops,
> > +		struct xfs_inode *ip, int whichfork,
> > +		struct xfs_bmbt_irec *imap);
> > +int xfs_rmap_unmap_extent(struct xfs_mount *mp, struct xfs_defer_ops *dfops,
> > +		struct xfs_inode *ip, int whichfork,
> > +		struct xfs_bmbt_irec *imap);
> > +int xfs_rmap_convert_extent(struct xfs_mount *mp, struct xfs_defer_ops *dfops,
> > +		struct xfs_inode *ip, int whichfork,
> > +		struct xfs_bmbt_irec *imap);
> > +int xfs_rmap_alloc_defer(struct xfs_mount *mp, struct xfs_defer_ops *dfops,
> > +		xfs_agnumber_t agno, xfs_agblock_t bno, xfs_extlen_t len,
> > +		__uint64_t owner);
> > +int xfs_rmap_free_defer(struct xfs_mount *mp, struct xfs_defer_ops *dfops,
> > +		xfs_agnumber_t agno, xfs_agblock_t bno, xfs_extlen_t len,
> > +		__uint64_t owner);
> > +
> > +void xfs_rmap_finish_one_cleanup(struct xfs_trans *tp,
> > +		struct xfs_btree_cur *rcur, int error);
> > +int xfs_rmap_finish_one(struct xfs_trans *tp, enum xfs_rmap_intent_type type,
> > +		__uint64_t owner, int whichfork, xfs_fileoff_t startoff,
> > +		xfs_fsblock_t startblock, xfs_filblks_t blockcount,
> > +		xfs_exntst_t state, struct xfs_btree_cur **pcur);
> > +
> >  #endif	/* __XFS_RMAP_BTREE_H__ */
> > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> > index 62d194e..450fd49 100644
> > --- a/fs/xfs/xfs_bmap_util.c
> > +++ b/fs/xfs/xfs_bmap_util.c
> > @@ -41,6 +41,7 @@
> >  #include "xfs_trace.h"
> >  #include "xfs_icache.h"
> >  #include "xfs_log.h"
> > +#include "xfs_rmap_btree.h"
> >  
> >  /* Kernel only BMAP related definitions and functions */
> >  
> > diff --git a/fs/xfs/xfs_defer_item.c b/fs/xfs/xfs_defer_item.c
> > index dbd10fc..9ed060d 100644
> > --- a/fs/xfs/xfs_defer_item.c
> > +++ b/fs/xfs/xfs_defer_item.c
> > @@ -213,7 +213,8 @@ xfs_rmap_update_finish_item(
> >  			rmap->ri_bmap.br_startoff,
> >  			rmap->ri_bmap.br_startblock,
> >  			rmap->ri_bmap.br_blockcount,
> > -			rmap->ri_bmap.br_state);
> > +			rmap->ri_bmap.br_state,
> > +			(struct xfs_btree_cur **)state);
> >  	kmem_free(rmap);
> >  	return error;
> >  }
> > @@ -225,6 +226,9 @@ xfs_rmap_update_finish_cleanup(
> >  	void			*state,
> >  	int			error)
> >  {
> > +	struct xfs_btree_cur	*rcur = state;
> > +
> > +	xfs_rmap_finish_one_cleanup(tp, rcur, error);
> >  }
> >  
> >  /* Abort all pending RUIs. */
> > diff --git a/fs/xfs/xfs_error.h b/fs/xfs/xfs_error.h
> > index ee4680e..6bc614c 100644
> > --- a/fs/xfs/xfs_error.h
> > +++ b/fs/xfs/xfs_error.h
> > @@ -91,7 +91,8 @@ extern void xfs_verifier_error(struct xfs_buf *bp);
> >  #define XFS_ERRTAG_DIOWRITE_IOERR			20
> >  #define XFS_ERRTAG_BMAPIFORMAT				21
> >  #define XFS_ERRTAG_FREE_EXTENT				22
> > -#define XFS_ERRTAG_MAX					23
> > +#define XFS_ERRTAG_RMAP_FINISH_ONE			23
> > +#define XFS_ERRTAG_MAX					24
> >  
> >  /*
> >   * Random factors for above tags, 1 means always, 2 means 1/2 time, etc.
> > @@ -119,6 +120,7 @@ extern void xfs_verifier_error(struct xfs_buf *bp);
> >  #define XFS_RANDOM_DIOWRITE_IOERR			(XFS_RANDOM_DEFAULT/10)
> >  #define	XFS_RANDOM_BMAPIFORMAT				XFS_RANDOM_DEFAULT
> >  #define XFS_RANDOM_FREE_EXTENT				1
> > +#define XFS_RANDOM_RMAP_FINISH_ONE			1
> >  
> >  #ifdef DEBUG
> >  extern int xfs_error_test_active;
> > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> > index c9fe0c4..f7f9635 100644
> > --- a/fs/xfs/xfs_log_recover.c
> > +++ b/fs/xfs/xfs_log_recover.c
> > @@ -45,6 +45,7 @@
> >  #include "xfs_error.h"
> >  #include "xfs_dir2.h"
> >  #include "xfs_rmap_item.h"
> > +#include "xfs_rmap_btree.h"
> >  
> >  #define BLK_AVG(blk1, blk2)	((blk1+blk2) >> 1)
> >  
> > @@ -4486,6 +4487,12 @@ xlog_recover_process_rui(
> >  	struct xfs_map_extent		*rmap;
> >  	xfs_fsblock_t			startblock_fsb;
> >  	bool				op_ok;
> > +	struct xfs_rud_log_item		*rudp;
> > +	enum xfs_rmap_intent_type	type;
> > +	int				whichfork;
> > +	xfs_exntst_t			state;
> > +	struct xfs_trans		*tp;
> > +	struct xfs_btree_cur		*rcur = NULL;
> >  
> >  	ASSERT(!test_bit(XFS_RUI_RECOVERED, &ruip->rui_flags));
> >  
> > @@ -4528,9 +4535,54 @@ xlog_recover_process_rui(
> >  		}
> >  	}
> >  
> > -	/* XXX: do nothing for now */
> > +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
> > +	if (error)
> > +		return error;
> > +	rudp = xfs_trans_get_rud(tp, ruip, ruip->rui_format.rui_nextents);
> > +
> > +	for (i = 0; i < ruip->rui_format.rui_nextents; i++) {
> > +		rmap = &(ruip->rui_format.rui_extents[i]);
> > +		state = (rmap->me_flags & XFS_RMAP_EXTENT_UNWRITTEN) ?
> > +				XFS_EXT_UNWRITTEN : XFS_EXT_NORM;
> > +		whichfork = (rmap->me_flags & XFS_RMAP_EXTENT_ATTR_FORK) ?
> > +				XFS_ATTR_FORK : XFS_DATA_FORK;
> > +		switch (rmap->me_flags & XFS_RMAP_EXTENT_TYPE_MASK) {
> > +		case XFS_RMAP_EXTENT_MAP:
> > +			type = XFS_RMAP_MAP;
> > +			break;
> > +		case XFS_RMAP_EXTENT_UNMAP:
> > +			type = XFS_RMAP_UNMAP;
> > +			break;
> > +		case XFS_RMAP_EXTENT_CONVERT:
> > +			type = XFS_RMAP_CONVERT;
> > +			break;
> > +		case XFS_RMAP_EXTENT_ALLOC:
> > +			type = XFS_RMAP_ALLOC;
> > +			break;
> > +		case XFS_RMAP_EXTENT_FREE:
> > +			type = XFS_RMAP_FREE;
> > +			break;
> > +		default:
> > +			error = -EFSCORRUPTED;
> > +			goto abort_error;
> > +		}
> > +		error = xfs_trans_log_finish_rmap_update(tp, rudp, type,
> > +				rmap->me_owner, whichfork,
> > +				rmap->me_startoff, rmap->me_startblock,
> > +				rmap->me_len, state, &rcur);
> > +		if (error)
> > +			goto abort_error;
> > +
> > +	}
> > +
> > +	xfs_rmap_finish_one_cleanup(tp, rcur, error);
> >  	set_bit(XFS_RUI_RECOVERED, &ruip->rui_flags);
> > -	xfs_rui_release(ruip);
> > +	error = xfs_trans_commit(tp);
> > +	return error;
> > +
> > +abort_error:
> > +	xfs_rmap_finish_one_cleanup(tp, rcur, error);
> > +	xfs_trans_cancel(tp);
> >  	return error;
> >  }
> >  
> > diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> > index c48be63..f59d934 100644
> > --- a/fs/xfs/xfs_trans.h
> > +++ b/fs/xfs/xfs_trans.h
> > @@ -244,12 +244,13 @@ void xfs_trans_log_start_rmap_update(struct xfs_trans *tp,
> >  		xfs_fsblock_t startblock, xfs_filblks_t blockcount,
> >  		xfs_exntst_t state);
> >  
> > +struct xfs_btree_cur;
> >  struct xfs_rud_log_item *xfs_trans_get_rud(struct xfs_trans *tp,
> >  		struct xfs_rui_log_item *ruip, uint nextents);
> >  int xfs_trans_log_finish_rmap_update(struct xfs_trans *tp,
> >  		struct xfs_rud_log_item *rudp, enum xfs_rmap_intent_type type,
> >  		__uint64_t owner, int whichfork, xfs_fileoff_t startoff,
> >  		xfs_fsblock_t startblock, xfs_filblks_t blockcount,
> > -		xfs_exntst_t state);
> > +		xfs_exntst_t state, struct xfs_btree_cur **pcur);
> >  
> >  #endif	/* __XFS_TRANS_H__ */
> > diff --git a/fs/xfs/xfs_trans_rmap.c b/fs/xfs/xfs_trans_rmap.c
> > index b55a725..0c0df18 100644
> > --- a/fs/xfs/xfs_trans_rmap.c
> > +++ b/fs/xfs/xfs_trans_rmap.c
> > @@ -170,14 +170,15 @@ xfs_trans_log_finish_rmap_update(
> >  	xfs_fileoff_t			startoff,
> >  	xfs_fsblock_t			startblock,
> >  	xfs_filblks_t			blockcount,
> > -	xfs_exntst_t			state)
> > +	xfs_exntst_t			state,
> > +	struct xfs_btree_cur		**pcur)
> >  {
> >  	uint				next_extent;
> >  	struct xfs_map_extent		*rmap;
> >  	int				error;
> >  
> > -	/* XXX: actually finish the rmap update here */
> > -	error = -EFSCORRUPTED;
> > +	error = xfs_rmap_finish_one(tp, type, owner, whichfork, startoff,
> > +			startblock, blockcount, state, pcur);
> >  
> >  	/*
> >  	 * Mark the transaction dirty, even on error. This ensures the
> > 
> > _______________________________________________
> > xfs mailing list
> > xfs@oss.sgi.com
> > http://oss.sgi.com/mailman/listinfo/xfs


* Re: [PATCH 042/119] xfs: log rmap intent items
  2016-07-15 18:33   ` Brian Foster
@ 2016-07-16  7:34     ` Darrick J. Wong
  2016-07-18 12:55       ` Brian Foster
  0 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-07-16  7:34 UTC (permalink / raw)
  To: Brian Foster; +Cc: david, linux-fsdevel, vishal.l.verma, xfs

On Fri, Jul 15, 2016 at 02:33:46PM -0400, Brian Foster wrote:
> On Thu, Jun 16, 2016 at 06:22:21PM -0700, Darrick J. Wong wrote:
> > Provide a mechanism for higher levels to create RUI/RUD items, submit
> > them to the log, and a stub function to deal with recovered RUI items.
> > These parts will be connected to the rmapbt in a later patch.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> 
> The commit log makes no mention of log recovery.. perhaps this should be
> split in two?
> 
> >  fs/xfs/Makefile          |    1 
> >  fs/xfs/xfs_log_recover.c |  344 +++++++++++++++++++++++++++++++++++++++++++++-
> >  fs/xfs/xfs_trans.h       |   17 ++
> >  fs/xfs/xfs_trans_rmap.c  |  235 +++++++++++++++++++++++++++++++
> >  4 files changed, 589 insertions(+), 8 deletions(-)
> >  create mode 100644 fs/xfs/xfs_trans_rmap.c
> > 
> > 
> > diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
> > index 8ae0a10..1980110 100644
> > --- a/fs/xfs/Makefile
> > +++ b/fs/xfs/Makefile
> > @@ -110,6 +110,7 @@ xfs-y				+= xfs_log.o \
> >  				   xfs_trans_buf.o \
> >  				   xfs_trans_extfree.o \
> >  				   xfs_trans_inode.o \
> > +				   xfs_trans_rmap.o \
> >  
> >  # optional features
> >  xfs-$(CONFIG_XFS_QUOTA)		+= xfs_dquot.o \
> > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> > index b33187b..c9fe0c4 100644
> > --- a/fs/xfs/xfs_log_recover.c
> > +++ b/fs/xfs/xfs_log_recover.c
> > @@ -44,6 +44,7 @@
> >  #include "xfs_bmap_btree.h"
> >  #include "xfs_error.h"
> >  #include "xfs_dir2.h"
> > +#include "xfs_rmap_item.h"
> >  
> >  #define BLK_AVG(blk1, blk2)	((blk1+blk2) >> 1)
> >  
> > @@ -1912,6 +1913,8 @@ xlog_recover_reorder_trans(
> >  		case XFS_LI_QUOTAOFF:
> >  		case XFS_LI_EFD:
> >  		case XFS_LI_EFI:
> > +		case XFS_LI_RUI:
> > +		case XFS_LI_RUD:
> >  			trace_xfs_log_recover_item_reorder_tail(log,
> >  							trans, item, pass);
> >  			list_move_tail(&item->ri_list, &inode_list);
> > @@ -3416,6 +3419,101 @@ xlog_recover_efd_pass2(
> >  }
> >  
> >  /*
> > + * This routine is called to create an in-core extent rmap update
> > + * item from the rui format structure which was logged on disk.
> > + * It allocates an in-core rui, copies the extents from the format
> > + * structure into it, and adds the rui to the AIL with the given
> > + * LSN.
> > + */
> > +STATIC int
> > +xlog_recover_rui_pass2(
> > +	struct xlog			*log,
> > +	struct xlog_recover_item	*item,
> > +	xfs_lsn_t			lsn)
> > +{
> > +	int				error;
> > +	struct xfs_mount		*mp = log->l_mp;
> > +	struct xfs_rui_log_item		*ruip;
> > +	struct xfs_rui_log_format	*rui_formatp;
> > +
> > +	rui_formatp = item->ri_buf[0].i_addr;
> > +
> > +	ruip = xfs_rui_init(mp, rui_formatp->rui_nextents);
> > +	error = xfs_rui_copy_format(&item->ri_buf[0], &ruip->rui_format);
> > +	if (error) {
> > +		xfs_rui_item_free(ruip);
> > +		return error;
> > +	}
> > +	atomic_set(&ruip->rui_next_extent, rui_formatp->rui_nextents);
> > +
> > +	spin_lock(&log->l_ailp->xa_lock);
> > +	/*
> > +	 * The RUI has two references. One for the RUD and one for RUI to ensure
> > +	 * it makes it into the AIL. Insert the RUI into the AIL directly and
> > +	 * drop the RUI reference. Note that xfs_trans_ail_update() drops the
> > +	 * AIL lock.
> > +	 */
> > +	xfs_trans_ail_update(log->l_ailp, &ruip->rui_item, lsn);
> > +	xfs_rui_release(ruip);
> > +	return 0;
> > +}
> > +
> > +
> > +/*
> > + * This routine is called when an RUD format structure is found in a committed
> > + * transaction in the log. Its purpose is to cancel the corresponding RUI if it
> > + * was still in the log. To do this it searches the AIL for the RUI with an id
> > + * equal to that in the RUD format structure. If we find it we drop the RUD
> > + * reference, which removes the RUI from the AIL and frees it.
> > + */
> > +STATIC int
> > +xlog_recover_rud_pass2(
> > +	struct xlog			*log,
> > +	struct xlog_recover_item	*item)
> > +{
> > +	struct xfs_rud_log_format	*rud_formatp;
> > +	struct xfs_rui_log_item		*ruip = NULL;
> > +	struct xfs_log_item		*lip;
> > +	__uint64_t			rui_id;
> > +	struct xfs_ail_cursor		cur;
> > +	struct xfs_ail			*ailp = log->l_ailp;
> > +
> > +	rud_formatp = item->ri_buf[0].i_addr;
> > +	ASSERT(item->ri_buf[0].i_len == (sizeof(struct xfs_rud_log_format) +
> > +			((rud_formatp->rud_nextents - 1) *
> > +			sizeof(struct xfs_map_extent))));
> > +	rui_id = rud_formatp->rud_rui_id;
> > +
> > +	/*
> > +	 * Search for the RUI with the id in the RUD format structure in the
> > +	 * AIL.
> > +	 */
> > +	spin_lock(&ailp->xa_lock);
> > +	lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
> > +	while (lip != NULL) {
> > +		if (lip->li_type == XFS_LI_RUI) {
> > +			ruip = (struct xfs_rui_log_item *)lip;
> > +			if (ruip->rui_format.rui_id == rui_id) {
> > +				/*
> > +				 * Drop the RUD reference to the RUI. This
> > +				 * removes the RUI from the AIL and frees it.
> > +				 */
> > +				spin_unlock(&ailp->xa_lock);
> > +				xfs_rui_release(ruip);
> > +				spin_lock(&ailp->xa_lock);
> > +				break;
> > +			}
> > +		}
> > +		lip = xfs_trans_ail_cursor_next(ailp, &cur);
> > +	}
> > +
> > +	xfs_trans_ail_cursor_done(&cur);
> > +	spin_unlock(&ailp->xa_lock);
> > +
> > +	return 0;
> > +}
> > +
> > +/*
> >   * This routine is called when an inode create format structure is found in a
> >   * committed transaction in the log.  It's purpose is to initialise the inodes
> >   * being allocated on disk. This requires us to get inode cluster buffers that
> > @@ -3640,6 +3738,8 @@ xlog_recover_ra_pass2(
> >  	case XFS_LI_EFI:
> >  	case XFS_LI_EFD:
> >  	case XFS_LI_QUOTAOFF:
> > +	case XFS_LI_RUI:
> > +	case XFS_LI_RUD:
> >  	default:
> >  		break;
> >  	}
> > @@ -3663,6 +3763,8 @@ xlog_recover_commit_pass1(
> >  	case XFS_LI_EFD:
> >  	case XFS_LI_DQUOT:
> >  	case XFS_LI_ICREATE:
> > +	case XFS_LI_RUI:
> > +	case XFS_LI_RUD:
> >  		/* nothing to do in pass 1 */
> >  		return 0;
> >  	default:
> > @@ -3693,6 +3795,10 @@ xlog_recover_commit_pass2(
> >  		return xlog_recover_efi_pass2(log, item, trans->r_lsn);
> >  	case XFS_LI_EFD:
> >  		return xlog_recover_efd_pass2(log, item);
> > +	case XFS_LI_RUI:
> > +		return xlog_recover_rui_pass2(log, item, trans->r_lsn);
> > +	case XFS_LI_RUD:
> > +		return xlog_recover_rud_pass2(log, item);
> >  	case XFS_LI_DQUOT:
> >  		return xlog_recover_dquot_pass2(log, buffer_list, item,
> >  						trans->r_lsn);
> > @@ -4165,6 +4271,18 @@ xlog_recover_process_data(
> >  	return 0;
> >  }
> >  
> > +/* Is this log item a deferred action intent? */
> > +static inline bool xlog_item_is_intent(struct xfs_log_item *lip)
> > +{
> > +	switch (lip->li_type) {
> > +	case XFS_LI_EFI:
> > +	case XFS_LI_RUI:
> > +		return true;
> > +	default:
> > +		return false;
> > +	}
> > +}
> > +
> >  /*
> >   * Process an extent free intent item that was recovered from
> >   * the log.  We need to free the extents that it describes.
> > @@ -4265,17 +4383,23 @@ xlog_recover_process_efis(
> >  	lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
> >  	while (lip != NULL) {
> >  		/*
> > -		 * We're done when we see something other than an EFI.
> > -		 * There should be no EFIs left in the AIL now.
> > +		 * We're done when we see something other than an intent.
> > +		 * There should be no intents left in the AIL now.
> >  		 */
> > -		if (lip->li_type != XFS_LI_EFI) {
> > +		if (!xlog_item_is_intent(lip)) {
> >  #ifdef DEBUG
> >  			for (; lip; lip = xfs_trans_ail_cursor_next(ailp, &cur))
> > -				ASSERT(lip->li_type != XFS_LI_EFI);
> > +				ASSERT(!xlog_item_is_intent(lip));
> >  #endif
> >  			break;
> >  		}
> >  
> > +		/* Skip anything that isn't an EFI */
> > +		if (lip->li_type != XFS_LI_EFI) {
> > +			lip = xfs_trans_ail_cursor_next(ailp, &cur);
> > +			continue;
> > +		}
> > +
> 
> Hmm, so previously this function used the existence of any non-EFI item
> as an end of traversal marker, since the freeing operations add more
> items to the AIL. It's not immediately clear to me whether this is just
> an efficiency thing or a potential problem, but I wonder if we should
> grab the last item and use that or its lsn as an end of list marker.

FWIW, I designed all this under the impression that it was safe to stop
looking for intent items once we found something that wasn't one, since all
the new items generated during log recovery are added after the existing AIL
items, and therefore there was no problem.

> At the very least we need to update the comment at the top of the
> function wrt to the current behavior.

Oops, missed that, yeah.

> >  		/*
> >  		 * Skip EFIs that we've already processed.
> >  		 */
> > @@ -4320,14 +4444,20 @@ xlog_recover_cancel_efis(
> >  		 * We're done when we see something other than an EFI.
> >  		 * There should be no EFIs left in the AIL now.
> >  		 */
> 
> Need to update this comment as for process_efis()...

Yep.

> > -		if (lip->li_type != XFS_LI_EFI) {
> > +		if (!xlog_item_is_intent(lip)) {
> >  #ifdef DEBUG
> >  			for (; lip; lip = xfs_trans_ail_cursor_next(ailp, &cur))
> > -				ASSERT(lip->li_type != XFS_LI_EFI);
> > +				ASSERT(!xlog_item_is_intent(lip));
> >  #endif
> >  			break;
> >  		}
> >  
> > +		/* Skip anything that isn't an EFI */
> > +		if (lip->li_type != XFS_LI_EFI) {
> > +			lip = xfs_trans_ail_cursor_next(ailp, &cur);
> > +			continue;
> > +		}
> > +
> >  		efip = container_of(lip, struct xfs_efi_log_item, efi_item);
> >  
> >  		spin_unlock(&ailp->xa_lock);
> > @@ -4343,6 +4473,190 @@ xlog_recover_cancel_efis(
> >  }
> >  
> >  /*
> > + * Process an rmap update intent item that was recovered from the log.
> > + * We need to update the rmapbt.
> > + */
> > +STATIC int
> > +xlog_recover_process_rui(
> > +	struct xfs_mount		*mp,
> > +	struct xfs_rui_log_item		*ruip)
> > +{
> > +	int				i;
> > +	int				error = 0;
> > +	struct xfs_map_extent		*rmap;
> > +	xfs_fsblock_t			startblock_fsb;
> > +	bool				op_ok;
> > +
> > +	ASSERT(!test_bit(XFS_RUI_RECOVERED, &ruip->rui_flags));
> > +
> > +	/*
> > +	 * First check the validity of the extents described by the
> > +	 * RUI.  If any are bad, then assume that all are bad and
> > +	 * just toss the RUI.
> > +	 */
> > +	for (i = 0; i < ruip->rui_format.rui_nextents; i++) {
> > +		rmap = &(ruip->rui_format.rui_extents[i]);
> > +		startblock_fsb = XFS_BB_TO_FSB(mp,
> > +				   XFS_FSB_TO_DADDR(mp, rmap->me_startblock));
> > +		switch (rmap->me_flags & XFS_RMAP_EXTENT_TYPE_MASK) {
> > +		case XFS_RMAP_EXTENT_MAP:
> > +		case XFS_RMAP_EXTENT_MAP_SHARED:
> > +		case XFS_RMAP_EXTENT_UNMAP:
> > +		case XFS_RMAP_EXTENT_UNMAP_SHARED:
> > +		case XFS_RMAP_EXTENT_CONVERT:
> > +		case XFS_RMAP_EXTENT_CONVERT_SHARED:
> > +		case XFS_RMAP_EXTENT_ALLOC:
> > +		case XFS_RMAP_EXTENT_FREE:
> > +			op_ok = true;
> > +			break;
> > +		default:
> > +			op_ok = false;
> > +			break;
> > +		}
> > +		if (!op_ok || (startblock_fsb == 0) ||
> > +		    (rmap->me_len == 0) ||
> > +		    (startblock_fsb >= mp->m_sb.sb_dblocks) ||
> > +		    (rmap->me_len >= mp->m_sb.sb_agblocks) ||
> > +		    (rmap->me_flags & ~XFS_RMAP_EXTENT_FLAGS)) {
> > +			/*
> > +			 * This will pull the RUI from the AIL and
> > +			 * free the memory associated with it.
> > +			 */
> > +			set_bit(XFS_RUI_RECOVERED, &ruip->rui_flags);
> > +			xfs_rui_release(ruip);
> > +			return -EIO;
> > +		}
> > +	}
> > +
> > +	/* XXX: do nothing for now */
> > +	set_bit(XFS_RUI_RECOVERED, &ruip->rui_flags);
> > +	xfs_rui_release(ruip);
> > +	return error;
> > +}
> > +
> > +/*
> > + * When this is called, all of the RUIs which did not have
> > + * corresponding RUDs should be in the AIL.  What we do now
> > + * is update the rmaps associated with each one.
> > + *
> > + * Since we process the RUIs in normal transactions, they
> > + * will be removed at some point after the commit.  This prevents
> > + * us from just walking down the list processing each one.
> > + * We'll use a flag in the RUI to skip those that we've already
> > + * processed and use the AIL iteration mechanism's generation
> > + * count to try to speed this up at least a bit.
> > + *
> > + * When we start, we know that the RUIs are the only things in
> > + * the AIL.  As we process them, however, other items are added
> > + * to the AIL.  Since everything added to the AIL must come after
> > + * everything already in the AIL, we stop processing as soon as
> > + * we see something other than an RUI in the AIL.
> > + */
> > +STATIC int
> > +xlog_recover_process_ruis(
> > +	struct xlog		*log)
> > +{
> > +	struct xfs_log_item	*lip;
> > +	struct xfs_rui_log_item	*ruip;
> > +	int			error = 0;
> > +	struct xfs_ail_cursor	cur;
> > +	struct xfs_ail		*ailp;
> > +
> > +	ailp = log->l_ailp;
> > +	spin_lock(&ailp->xa_lock);
> > +	lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
> > +	while (lip != NULL) {
> > +		/*
> > +		 * We're done when we see something other than an intent.
> > +		 * There should be no intents left in the AIL now.
> > +		 */
> > +		if (!xlog_item_is_intent(lip)) {
> > +#ifdef DEBUG
> > +			for (; lip; lip = xfs_trans_ail_cursor_next(ailp, &cur))
> > +				ASSERT(!xlog_item_is_intent(lip));
> > +#endif
> > +			break;
> > +		}
> > +
> > +		/* Skip anything that isn't an RUI */
> > +		if (lip->li_type != XFS_LI_RUI) {
> > +			lip = xfs_trans_ail_cursor_next(ailp, &cur);
> > +			continue;
> > +		}
> > +
> > +		/*
> > +		 * Skip RUIs that we've already processed.
> > +		 */
> > +		ruip = container_of(lip, struct xfs_rui_log_item, rui_item);
> > +		if (test_bit(XFS_RUI_RECOVERED, &ruip->rui_flags)) {
> > +			lip = xfs_trans_ail_cursor_next(ailp, &cur);
> > +			continue;
> > +		}
> > +
> > +		spin_unlock(&ailp->xa_lock);
> > +		error = xlog_recover_process_rui(log->l_mp, ruip);
> > +		spin_lock(&ailp->xa_lock);
> > +		if (error)
> > +			goto out;
> > +		lip = xfs_trans_ail_cursor_next(ailp, &cur);
> > +	}
> > +out:
> > +	xfs_trans_ail_cursor_done(&cur);
> > +	spin_unlock(&ailp->xa_lock);
> > +	return error;
> > +}
> > +
> > +/*
> > + * A cancel occurs when the mount has failed and we're bailing out. Release all
> > + * pending RUIs so they don't pin the AIL.
> > + */
> > +STATIC int
> > +xlog_recover_cancel_ruis(
> > +	struct xlog		*log)
> > +{
> > +	struct xfs_log_item	*lip;
> > +	struct xfs_rui_log_item	*ruip;
> > +	int			error = 0;
> > +	struct xfs_ail_cursor	cur;
> > +	struct xfs_ail		*ailp;
> > +
> > +	ailp = log->l_ailp;
> > +	spin_lock(&ailp->xa_lock);
> > +	lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
> > +	while (lip != NULL) {
> > +		/*
> > +		 * We're done when we see something other than an RUI.
> > +		 * There should be no RUIs left in the AIL now.
> > +		 */
> > +		if (!xlog_item_is_intent(lip)) {
> > +#ifdef DEBUG
> > +			for (; lip; lip = xfs_trans_ail_cursor_next(ailp, &cur))
> > +				ASSERT(!xlog_item_is_intent(lip));
> > +#endif
> > +			break;
> > +		}
> > +
> > +		/* Skip anything that isn't an RUI */
> > +		if (lip->li_type != XFS_LI_RUI) {
> > +			lip = xfs_trans_ail_cursor_next(ailp, &cur);
> > +			continue;
> > +		}
> > +
> > +		ruip = container_of(lip, struct xfs_rui_log_item, rui_item);
> > +
> > +		spin_unlock(&ailp->xa_lock);
> > +		xfs_rui_release(ruip);
> > +		spin_lock(&ailp->xa_lock);
> > +
> > +		lip = xfs_trans_ail_cursor_next(ailp, &cur);
> > +	}
> > +
> > +	xfs_trans_ail_cursor_done(&cur);
> > +	spin_unlock(&ailp->xa_lock);
> > +	return error;
> > +}
> 
> How about we combine this and cancel_efis() into a cancel_intents()
> function so we only have to make one pass? It looks like the only
> difference is the item-specific release call.

Yeah, sounds like a good refactor.
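
As a rough illustration of the combined pass, here's a self-contained
user-space model; the item types, the `released` flag, and the dispatch
below are stand-ins for the kernel's AIL cursor walk and the
xfs_efi_release()/xfs_rui_release() calls, not the actual structures:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins for the kernel's log item types. */
enum item_type { ITEM_EFI, ITEM_RUI, ITEM_INODE };

struct log_item {
	enum item_type	 type;
	int		 released;	/* models the type-specific release */
	struct log_item	*next;
};

static int item_is_intent(const struct log_item *lip)
{
	return lip->type == ITEM_EFI || lip->type == ITEM_RUI;
}

/*
 * One pass over the (modeled) AIL: stop at the first non-intent item,
 * and dispatch the item-specific release for each intent found.
 */
static void cancel_intents(struct log_item *head)
{
	struct log_item *lip;

	for (lip = head; lip != NULL; lip = lip->next) {
		if (!item_is_intent(lip))
			break;		/* no intents past this point */
		switch (lip->type) {
		case ITEM_EFI:		/* xfs_efi_release() in the kernel */
		case ITEM_RUI:		/* xfs_rui_release() in the kernel */
			lip->released = 1;
			break;
		default:
			break;
		}
	}
}
```

The real function would keep the xa_lock/cursor handling from the two
existing loops; the switch is the only per-type piece that survives the
merge.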

> > +
> > +/*
> >   * This routine performs a transaction to null out a bad inode pointer
> >   * in an agi unlinked inode hash bucket.
> >   */
> > @@ -5144,11 +5458,19 @@ xlog_recover_finish(
> >  	 */
> >  	if (log->l_flags & XLOG_RECOVERY_NEEDED) {
> >  		int	error;
> > +
> > +		error = xlog_recover_process_ruis(log);
> > +		if (error) {
> > +			xfs_alert(log->l_mp, "Failed to recover RUIs");
> > +			return error;
> > +		}
> > +
> >  		error = xlog_recover_process_efis(log);
> >  		if (error) {
> >  			xfs_alert(log->l_mp, "Failed to recover EFIs");
> >  			return error;
> >  		}
> > +
> 
> Is the order important here in any way (e.g., RUIs before EFIs)? If so,
> it might be a good idea to call it out.

AFAIK the intent items within a particular type have to be replayed in
order, but between types, there isn't a problem with the current code.

That said, I'd also been wondering if it made more sense to iterate the
list of items /once/ and actually replay items in order.  Less iteration
and the order of replayed items matches the log order much more closely.
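
Modeled in the same hedged way (made-up item types; the recorded `order`
array stands in for actually replaying each EFI/RUI), the single-pass
idea looks like:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins; LI_DONE models the first non-intent item. */
enum li_type { LI_EFI, LI_RUI, LI_DONE };

struct li {
	enum li_type	 type;
	struct li	*next;
};

/*
 * Walk the AIL once, replaying each intent in the order it appears,
 * instead of making one full pass per intent type.  Returns the
 * number of intents replayed; 'order' records the replay sequence.
 */
static int replay_intents_in_order(struct li *head, enum li_type *order,
				   int max)
{
	struct li *lip;
	int n = 0;

	for (lip = head; lip != NULL; lip = lip->next) {
		if (lip->type == LI_DONE)
			break;		/* new items sort after intents */
		if (n < max)
			order[n++] = lip->type;	/* "replay" the intent */
	}
	return n;
}
```

For a log containing RUI, EFI, RUI, a per-type two-pass scheme replays
{RUI, RUI, EFI}; the single pass preserves the log interleaving.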

> >  		/*
> >  		 * Sync the log to get all the EFIs out of the AIL.
> >  		 * This isn't absolutely necessary, but it helps in
> > @@ -5176,9 +5498,15 @@ xlog_recover_cancel(
> >  	struct xlog	*log)
> >  {
> >  	int		error = 0;
> > +	int		err2;
> >  
> > -	if (log->l_flags & XLOG_RECOVERY_NEEDED)
> > -		error = xlog_recover_cancel_efis(log);
> > +	if (log->l_flags & XLOG_RECOVERY_NEEDED) {
> > +		error = xlog_recover_cancel_ruis(log);
> > +
> > +		err2 = xlog_recover_cancel_efis(log);
> > +		if (err2 && !error)
> > +			error = err2;
> > +	}
> >  
> >  	return error;
> >  }
> > diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> > index f8d363f..c48be63 100644
> > --- a/fs/xfs/xfs_trans.h
> > +++ b/fs/xfs/xfs_trans.h
> > @@ -235,4 +235,21 @@ void		xfs_trans_buf_copy_type(struct xfs_buf *dst_bp,
> >  extern kmem_zone_t	*xfs_trans_zone;
> >  extern kmem_zone_t	*xfs_log_item_desc_zone;
> >  
> > +enum xfs_rmap_intent_type;
> > +
> > +struct xfs_rui_log_item *xfs_trans_get_rui(struct xfs_trans *tp, uint nextents);
> > +void xfs_trans_log_start_rmap_update(struct xfs_trans *tp,
> > +		struct xfs_rui_log_item *ruip, enum xfs_rmap_intent_type type,
> > +		__uint64_t owner, int whichfork, xfs_fileoff_t startoff,
> > +		xfs_fsblock_t startblock, xfs_filblks_t blockcount,
> > +		xfs_exntst_t state);
> > +
> > +struct xfs_rud_log_item *xfs_trans_get_rud(struct xfs_trans *tp,
> > +		struct xfs_rui_log_item *ruip, uint nextents);
> > +int xfs_trans_log_finish_rmap_update(struct xfs_trans *tp,
> > +		struct xfs_rud_log_item *rudp, enum xfs_rmap_intent_type type,
> > +		__uint64_t owner, int whichfork, xfs_fileoff_t startoff,
> > +		xfs_fsblock_t startblock, xfs_filblks_t blockcount,
> > +		xfs_exntst_t state);
> > +
> >  #endif	/* __XFS_TRANS_H__ */
> > diff --git a/fs/xfs/xfs_trans_rmap.c b/fs/xfs/xfs_trans_rmap.c
> > new file mode 100644
> > index 0000000..b55a725
> > --- /dev/null
> > +++ b/fs/xfs/xfs_trans_rmap.c
> > @@ -0,0 +1,235 @@
> > +/*
> > + * Copyright (C) 2016 Oracle.  All Rights Reserved.
> > + *
> > + * Author: Darrick J. Wong <darrick.wong@oracle.com>
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU General Public License
> > + * as published by the Free Software Foundation; either version 2
> > + * of the License, or (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it would be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public License
> > + * along with this program; if not, write the Free Software Foundation,
> > + * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
> > + */
> > +#include "xfs.h"
> > +#include "xfs_fs.h"
> > +#include "xfs_shared.h"
> > +#include "xfs_format.h"
> > +#include "xfs_log_format.h"
> > +#include "xfs_trans_resv.h"
> > +#include "xfs_mount.h"
> > +#include "xfs_defer.h"
> > +#include "xfs_trans.h"
> > +#include "xfs_trans_priv.h"
> > +#include "xfs_rmap_item.h"
> > +#include "xfs_alloc.h"
> > +#include "xfs_rmap_btree.h"
> > +
> > +/*
> > + * This routine is called to allocate an "rmap update intent"
> > + * log item that will hold nextents worth of extents.  The
> > + * caller must use all nextents extents, because we are not
> > + * flexible about this at all.
> > + */
> > +struct xfs_rui_log_item *
> > +xfs_trans_get_rui(
> > +	struct xfs_trans		*tp,
> > +	uint				nextents)
> > +{
> > +	struct xfs_rui_log_item		*ruip;
> > +
> > +	ASSERT(tp != NULL);
> > +	ASSERT(nextents > 0);
> > +
> > +	ruip = xfs_rui_init(tp->t_mountp, nextents);
> > +	ASSERT(ruip != NULL);
> > +
> > +	/*
> > +	 * Get a log_item_desc to point at the new item.
> > +	 */
> > +	xfs_trans_add_item(tp, &ruip->rui_item);
> > +	return ruip;
> > +}
> > +
> > +/*
> > + * This routine is called to indicate that the described
> > + * extent is to be logged as needing to be freed.  It should
> > + * be called once for each extent to be freed.
> > + */
> 
> Stale comment.

<nod>

> > +void
> > +xfs_trans_log_start_rmap_update(
> > +	struct xfs_trans		*tp,
> > +	struct xfs_rui_log_item		*ruip,
> > +	enum xfs_rmap_intent_type	type,
> > +	__uint64_t			owner,
> > +	int				whichfork,
> > +	xfs_fileoff_t			startoff,
> > +	xfs_fsblock_t			startblock,
> > +	xfs_filblks_t			blockcount,
> > +	xfs_exntst_t			state)
> > +{
> > +	uint				next_extent;
> > +	struct xfs_map_extent		*rmap;
> > +
> > +	tp->t_flags |= XFS_TRANS_DIRTY;
> > +	ruip->rui_item.li_desc->lid_flags |= XFS_LID_DIRTY;
> > +
> > +	/*
> > +	 * atomic_inc_return gives us the value after the increment;
> > +	 * we want to use it as an array index so we need to subtract 1 from
> > +	 * it.
> > +	 */
> > +	next_extent = atomic_inc_return(&ruip->rui_next_extent) - 1;
> > +	ASSERT(next_extent < ruip->rui_format.rui_nextents);
> > +	rmap = &(ruip->rui_format.rui_extents[next_extent]);
> > +	rmap->me_owner = owner;
> > +	rmap->me_startblock = startblock;
> > +	rmap->me_startoff = startoff;
> > +	rmap->me_len = blockcount;
> > +	rmap->me_flags = 0;
> > +	if (state == XFS_EXT_UNWRITTEN)
> > +		rmap->me_flags |= XFS_RMAP_EXTENT_UNWRITTEN;
> > +	if (whichfork == XFS_ATTR_FORK)
> > +		rmap->me_flags |= XFS_RMAP_EXTENT_ATTR_FORK;
> > +	switch (type) {
> > +	case XFS_RMAP_MAP:
> > +		rmap->me_flags |= XFS_RMAP_EXTENT_MAP;
> > +		break;
> > +	case XFS_RMAP_MAP_SHARED:
> > +		rmap->me_flags |= XFS_RMAP_EXTENT_MAP_SHARED;
> > +		break;
> > +	case XFS_RMAP_UNMAP:
> > +		rmap->me_flags |= XFS_RMAP_EXTENT_UNMAP;
> > +		break;
> > +	case XFS_RMAP_UNMAP_SHARED:
> > +		rmap->me_flags |= XFS_RMAP_EXTENT_UNMAP_SHARED;
> > +		break;
> > +	case XFS_RMAP_CONVERT:
> > +		rmap->me_flags |= XFS_RMAP_EXTENT_CONVERT;
> > +		break;
> > +	case XFS_RMAP_CONVERT_SHARED:
> > +		rmap->me_flags |= XFS_RMAP_EXTENT_CONVERT_SHARED;
> > +		break;
> > +	case XFS_RMAP_ALLOC:
> > +		rmap->me_flags |= XFS_RMAP_EXTENT_ALLOC;
> > +		break;
> > +	case XFS_RMAP_FREE:
> > +		rmap->me_flags |= XFS_RMAP_EXTENT_FREE;
> > +		break;
> > +	default:
> > +		ASSERT(0);
> > +	}
> 
> Between here and the finish function, it looks like we could use a
> helper to convert the state and whatnot to extent flags.

Ok.
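
Hoisted out of the two call sites, the helper might look like the sketch
below; the enum and the flag bits are illustrative stand-ins for the
XFS_RMAP_* definitions (values made up here), with the _SHARED variants
folded out for brevity:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-ins for the kernel's definitions. */
enum rmap_intent_type { RMAP_MAP, RMAP_UNMAP, RMAP_CONVERT,
			RMAP_ALLOC, RMAP_FREE };

#define EXT_UNWRITTEN		1	/* models xfs_exntst_t state */
#define ATTR_FORK		1	/* models XFS_ATTR_FORK */

#define FLAG_UNWRITTEN		(1U << 0)
#define FLAG_ATTR_FORK		(1U << 1)
#define FLAG_MAP		(1U << 2)
#define FLAG_UNMAP		(1U << 3)
#define FLAG_CONVERT		(1U << 4)
#define FLAG_ALLOC		(1U << 5)
#define FLAG_FREE		(1U << 6)

/* Fold intent type, fork, and extent state into one me_flags word. */
static uint32_t rmap_extent_flags(enum rmap_intent_type type,
				  int whichfork, int state)
{
	uint32_t flags = 0;

	if (state == EXT_UNWRITTEN)
		flags |= FLAG_UNWRITTEN;
	if (whichfork == ATTR_FORK)
		flags |= FLAG_ATTR_FORK;
	switch (type) {
	case RMAP_MAP:		flags |= FLAG_MAP;	break;
	case RMAP_UNMAP:	flags |= FLAG_UNMAP;	break;
	case RMAP_CONVERT:	flags |= FLAG_CONVERT;	break;
	case RMAP_ALLOC:	flags |= FLAG_ALLOC;	break;
	case RMAP_FREE:		flags |= FLAG_FREE;	break;
	}
	return flags;
}
```

Both xfs_trans_log_start_rmap_update() and
xfs_trans_log_finish_rmap_update() could then fill rmap->me_flags with
one call instead of carrying the duplicated switch.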

> > +}
> > +
> > +
> > +/*
> > + * This routine is called to allocate an "extent free done"
> > + * log item that will hold nextents worth of extents.  The
> > + * caller must use all nextents extents, because we are not
> > + * flexible about this at all.
> > + */
> 
> Comment needs updating.

Ok.

> Brian
> 
> > +struct xfs_rud_log_item *
> > +xfs_trans_get_rud(
> > +	struct xfs_trans		*tp,
> > +	struct xfs_rui_log_item		*ruip,
> > +	uint				nextents)
> > +{
> > +	struct xfs_rud_log_item		*rudp;
> > +
> > +	ASSERT(tp != NULL);
> > +	ASSERT(nextents > 0);
> > +
> > +	rudp = xfs_rud_init(tp->t_mountp, ruip, nextents);
> > +	ASSERT(rudp != NULL);
> > +
> > +	/*
> > +	 * Get a log_item_desc to point at the new item.
> > +	 */
> > +	xfs_trans_add_item(tp, &rudp->rud_item);
> > +	return rudp;
> > +}
> > +
> > +/*
> > + * Finish an rmap update and log it to the RUD. Note that the transaction is
> > + * marked dirty regardless of whether the rmap update succeeds or fails to
> > + * support the RUI/RUD lifecycle rules.
> > + */
> > +int
> > +xfs_trans_log_finish_rmap_update(
> > +	struct xfs_trans		*tp,
> > +	struct xfs_rud_log_item		*rudp,
> > +	enum xfs_rmap_intent_type	type,
> > +	__uint64_t			owner,
> > +	int				whichfork,
> > +	xfs_fileoff_t			startoff,
> > +	xfs_fsblock_t			startblock,
> > +	xfs_filblks_t			blockcount,
> > +	xfs_exntst_t			state)
> > +{
> > +	uint				next_extent;
> > +	struct xfs_map_extent		*rmap;
> > +	int				error;
> > +
> > +	/* XXX: actually finish the rmap update here */
> > +	error = -EFSCORRUPTED;
> > +
> > +	/*
> > +	 * Mark the transaction dirty, even on error. This ensures the
> > +	 * transaction is aborted, which:
> > +	 *
> > +	 * 1.) releases the RUI and frees the RUD
> > +	 * 2.) shuts down the filesystem
> > +	 */
> > +	tp->t_flags |= XFS_TRANS_DIRTY;
> > +	rudp->rud_item.li_desc->lid_flags |= XFS_LID_DIRTY;
> > +
> > +	next_extent = rudp->rud_next_extent;
> > +	ASSERT(next_extent < rudp->rud_format.rud_nextents);
> > +	rmap = &(rudp->rud_format.rud_extents[next_extent]);
> > +	rmap->me_owner = owner;
> > +	rmap->me_startblock = startblock;
> > +	rmap->me_startoff = startoff;
> > +	rmap->me_len = blockcount;
> > +	rmap->me_flags = 0;
> > +	if (state == XFS_EXT_UNWRITTEN)
> > +		rmap->me_flags |= XFS_RMAP_EXTENT_UNWRITTEN;
> > +	if (whichfork == XFS_ATTR_FORK)
> > +		rmap->me_flags |= XFS_RMAP_EXTENT_ATTR_FORK;
> > +	switch (type) {
> > +	case XFS_RMAP_MAP:
> > +		rmap->me_flags |= XFS_RMAP_EXTENT_MAP;
> > +		break;
> > +	case XFS_RMAP_MAP_SHARED:
> > +		rmap->me_flags |= XFS_RMAP_EXTENT_MAP_SHARED;
> > +		break;
> > +	case XFS_RMAP_UNMAP:
> > +		rmap->me_flags |= XFS_RMAP_EXTENT_UNMAP;
> > +		break;
> > +	case XFS_RMAP_UNMAP_SHARED:
> > +		rmap->me_flags |= XFS_RMAP_EXTENT_UNMAP_SHARED;
> > +		break;
> > +	case XFS_RMAP_CONVERT:
> > +		rmap->me_flags |= XFS_RMAP_EXTENT_CONVERT;
> > +		break;
> > +	case XFS_RMAP_CONVERT_SHARED:
> > +		rmap->me_flags |= XFS_RMAP_EXTENT_CONVERT_SHARED;
> > +		break;
> > +	case XFS_RMAP_ALLOC:
> > +		rmap->me_flags |= XFS_RMAP_EXTENT_ALLOC;
> > +		break;
> > +	case XFS_RMAP_FREE:
> > +		rmap->me_flags |= XFS_RMAP_EXTENT_FREE;
> > +		break;
> > +	default:
> > +		ASSERT(0);
> > +	}
> > +	rudp->rud_next_extent++;
> > +
> > +	return error;
> > +}
> > 
> > _______________________________________________
> > xfs mailing list
> > xfs@oss.sgi.com
> > http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 044/119] xfs: propagate bmap updates to rmapbt
  2016-07-16  7:26     ` Darrick J. Wong
@ 2016-07-18  1:21       ` Dave Chinner
  2016-07-18 12:56         ` Brian Foster
  2016-07-18 12:55       ` Brian Foster
  1 sibling, 1 reply; 236+ messages in thread
From: Dave Chinner @ 2016-07-18  1:21 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, linux-fsdevel, vishal.l.verma, xfs

On Sat, Jul 16, 2016 at 12:26:21AM -0700, Darrick J. Wong wrote:
> On Fri, Jul 15, 2016 at 02:33:56PM -0400, Brian Foster wrote:
> > On Thu, Jun 16, 2016 at 06:22:34PM -0700, Darrick J. Wong wrote:
> > > When we map, unmap, or convert an extent in a file's data or attr
> > > fork, schedule a respective update in the rmapbt.  Previous versions
> > > of this patch required a 1:1 correspondence between bmap and rmap,
> > > but this is no longer true.
> > > 
> > > v2: Remove the 1:1 correspondence requirement now that we have the
> > > ability to make interval queries against the rmapbt.  Update the
> > > commit message to reflect the broad restructuring of this patch.
> > > Fix the bmap shift code to adjust the rmaps correctly.
> > > 
> > > v3: Use the deferred operations code to handle redo operations
> > > atomically and deadlock free.  Plumb in all five rmap actions
> > > (map, unmap, convert extent, alloc, free); we'll use the first
> > > three now for file data, and reflink will want the last two.
> > > Add an error injection site to test log recovery.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
.....
> > > + * superblock and the AGF because we'll always grab them in the same
> > > + * order.
> > > + */
> > > +int
> > > +xfs_rmap_finish_one(
> > > +	struct xfs_trans		*tp,
> > > +	enum xfs_rmap_intent_type	type,
> > > +	__uint64_t			owner,
> > > +	int				whichfork,
> > > +	xfs_fileoff_t			startoff,
> > > +	xfs_fsblock_t			startblock,
> > > +	xfs_filblks_t			blockcount,
> > > +	xfs_exntst_t			state,
> > > +	struct xfs_btree_cur		**pcur)
> > > +{
> > > +	struct xfs_mount		*mp = tp->t_mountp;
> > > +	struct xfs_btree_cur		*rcur;
> > > +	struct xfs_buf			*agbp = NULL;
> > > +	int				error = 0;
> > > +	xfs_agnumber_t			agno;
> > > +	struct xfs_owner_info		oinfo;
> > > +	xfs_agblock_t			bno;
> > > +	bool				unwritten;
> > > +
> > > +	agno = XFS_FSB_TO_AGNO(mp, startblock);
> > > +	ASSERT(agno != NULLAGNUMBER);
> > > +	bno = XFS_FSB_TO_AGBNO(mp, startblock);
> > > +
> > > +	trace_xfs_rmap_deferred(mp, agno, type, bno, owner, whichfork,
> > > +			startoff, blockcount, state);
> > > +
> > > +	if (XFS_TEST_ERROR(false, mp,
> > > +			XFS_ERRTAG_RMAP_FINISH_ONE,
> > > +			XFS_RANDOM_RMAP_FINISH_ONE))
> > > +		return -EIO;
> > > +
> > > +	/*
> > > +	 * If we haven't gotten a cursor or the cursor AG doesn't match
> > > +	 * the startblock, get one now.
> > > +	 */
> > > +	rcur = *pcur;
> > > +	if (rcur != NULL && rcur->bc_private.a.agno != agno) {
> > > +		xfs_rmap_finish_one_cleanup(tp, rcur, 0);
> > > +		rcur = NULL;
> > > +		*pcur = NULL;
> > > +	}
> > > +	if (rcur == NULL) {
> > > +		error = xfs_free_extent_fix_freelist(tp, agno, &agbp);
> > 
> > Comment? Why is this here? (Maybe we should rename that function while
> > we're at it..)
> 
> /*
>  * Ensure the freelist is of a sufficient length to provide for any btree
>  * splits that could happen when we make changes to the rmapbt.
>  */
> 
> (I don't know why the function has that name; Dave supplied it.)

I named it that way because it was common code factored out of
xfs_free_extent() for use by multiple callers on the extent freeing
side of things. Feel free to name it differently if you can think of
something more appropriate.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 042/119] xfs: log rmap intent items
  2016-07-16  7:34     ` Darrick J. Wong
@ 2016-07-18 12:55       ` Brian Foster
  2016-07-19 17:10         ` Darrick J. Wong
  0 siblings, 1 reply; 236+ messages in thread
From: Brian Foster @ 2016-07-18 12:55 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

On Sat, Jul 16, 2016 at 12:34:09AM -0700, Darrick J. Wong wrote:
> On Fri, Jul 15, 2016 at 02:33:46PM -0400, Brian Foster wrote:
> > On Thu, Jun 16, 2016 at 06:22:21PM -0700, Darrick J. Wong wrote:
> > > Provide a mechanism for higher levels to create RUI/RUD items, submit
> > > them to the log, and a stub function to deal with recovered RUI items.
> > > These parts will be connected to the rmapbt in a later patch.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > 
> > The commit log makes no mention of log recovery.. perhaps this should be
> > split in two?
> > 
> > >  fs/xfs/Makefile          |    1 
> > >  fs/xfs/xfs_log_recover.c |  344 +++++++++++++++++++++++++++++++++++++++++++++-
> > >  fs/xfs/xfs_trans.h       |   17 ++
> > >  fs/xfs/xfs_trans_rmap.c  |  235 +++++++++++++++++++++++++++++++
> > >  4 files changed, 589 insertions(+), 8 deletions(-)
> > >  create mode 100644 fs/xfs/xfs_trans_rmap.c
> > > 
> > > 
> > > diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
> > > index 8ae0a10..1980110 100644
> > > --- a/fs/xfs/Makefile
> > > +++ b/fs/xfs/Makefile
> > > @@ -110,6 +110,7 @@ xfs-y				+= xfs_log.o \
> > >  				   xfs_trans_buf.o \
> > >  				   xfs_trans_extfree.o \
> > >  				   xfs_trans_inode.o \
> > > +				   xfs_trans_rmap.o \
> > >  
> > >  # optional features
> > >  xfs-$(CONFIG_XFS_QUOTA)		+= xfs_dquot.o \
> > > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> > > index b33187b..c9fe0c4 100644
> > > --- a/fs/xfs/xfs_log_recover.c
> > > +++ b/fs/xfs/xfs_log_recover.c
...
> > > @@ -4265,17 +4383,23 @@ xlog_recover_process_efis(
> > >  	lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
> > >  	while (lip != NULL) {
> > >  		/*
> > > -		 * We're done when we see something other than an EFI.
> > > -		 * There should be no EFIs left in the AIL now.
> > > +		 * We're done when we see something other than an intent.
> > > +		 * There should be no intents left in the AIL now.
> > >  		 */
> > > -		if (lip->li_type != XFS_LI_EFI) {
> > > +		if (!xlog_item_is_intent(lip)) {
> > >  #ifdef DEBUG
> > >  			for (; lip; lip = xfs_trans_ail_cursor_next(ailp, &cur))
> > > -				ASSERT(lip->li_type != XFS_LI_EFI);
> > > +				ASSERT(!xlog_item_is_intent(lip));
> > >  #endif
> > >  			break;
> > >  		}
> > >  
> > > +		/* Skip anything that isn't an EFI */
> > > +		if (lip->li_type != XFS_LI_EFI) {
> > > +			lip = xfs_trans_ail_cursor_next(ailp, &cur);
> > > +			continue;
> > > +		}
> > > +
> > 
> > Hmm, so previously this function used the existence of any non-EFI item
> > as an end of traversal marker, since the freeing operations add more
> > items to the AIL. It's not immediately clear to me whether this is just
> > an efficiency thing or a potential problem, but I wonder if we should
> > grab the last item and use that or its lsn as an end of list marker.
> 
> FWIW I designed all this under the impression that it was safe to stop looking
> for intent items once we found something that wasn't an intent item because all
> the new items generated during log recovery came after, and therefore there was
> no problem.
> 

Ok. To be clear, are you saying that any new intents should follow
non-intent items? If so, that sounds... reasonable (perhaps a little
landmine-ish :P).

> > At the very least we need to update the comment at the top of the
> > function wrt to the current behavior.
> 
> Oops, missed that, yeah.
> 
> > >  		/*
> > >  		 * Skip EFIs that we've already processed.
> > >  		 */
...
> > > @@ -5144,11 +5458,19 @@ xlog_recover_finish(
> > >  	 */
> > >  	if (log->l_flags & XLOG_RECOVERY_NEEDED) {
> > >  		int	error;
> > > +
> > > +		error = xlog_recover_process_ruis(log);
> > > +		if (error) {
> > > +			xfs_alert(log->l_mp, "Failed to recover RUIs");
> > > +			return error;
> > > +		}
> > > +
> > >  		error = xlog_recover_process_efis(log);
> > >  		if (error) {
> > >  			xfs_alert(log->l_mp, "Failed to recover EFIs");
> > >  			return error;
> > >  		}
> > > +
> > 
> > Is the order important here in any way (e.g., RUIs before EFIs)? If so,
> > it might be a good idea to call it out.
> 
> AFAIK the intent items within a particular type have to be replayed in
> order, but between types, there isn't a problem with the current code.
> 
> That said, I'd also been wondering if it made more sense to iterate the
> list of items /once/ and actually replay items in order.  Less iteration
> and the order of replayed items matches the log order much more closely.
> 

That sounds like a nice idea to me. There might actually be some room
for consolidation between the RUI/EFI recovered bits and whatnot, but
only if it makes things more clean and simple.

Brian

> > >  		/*
> > >  		 * Sync the log to get all the EFIs out of the AIL.
> > >  		 * This isn't absolutely necessary, but it helps in
> > > @@ -5176,9 +5498,15 @@ xlog_recover_cancel(
> > >  	struct xlog	*log)
> > >  {
> > >  	int		error = 0;
> > > +	int		err2;
> > >  
> > > -	if (log->l_flags & XLOG_RECOVERY_NEEDED)
> > > -		error = xlog_recover_cancel_efis(log);
> > > +	if (log->l_flags & XLOG_RECOVERY_NEEDED) {
> > > +		error = xlog_recover_cancel_ruis(log);
> > > +
> > > +		err2 = xlog_recover_cancel_efis(log);
> > > +		if (err2 && !error)
> > > +			error = err2;
> > > +	}
> > >  
> > >  	return error;
> > >  }
> > > diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> > > index f8d363f..c48be63 100644
> > > --- a/fs/xfs/xfs_trans.h
> > > +++ b/fs/xfs/xfs_trans.h
> > > @@ -235,4 +235,21 @@ void		xfs_trans_buf_copy_type(struct xfs_buf *dst_bp,
> > >  extern kmem_zone_t	*xfs_trans_zone;
> > >  extern kmem_zone_t	*xfs_log_item_desc_zone;
> > >  
> > > +enum xfs_rmap_intent_type;
> > > +
> > > +struct xfs_rui_log_item *xfs_trans_get_rui(struct xfs_trans *tp, uint nextents);
> > > +void xfs_trans_log_start_rmap_update(struct xfs_trans *tp,
> > > +		struct xfs_rui_log_item *ruip, enum xfs_rmap_intent_type type,
> > > +		__uint64_t owner, int whichfork, xfs_fileoff_t startoff,
> > > +		xfs_fsblock_t startblock, xfs_filblks_t blockcount,
> > > +		xfs_exntst_t state);
> > > +
> > > +struct xfs_rud_log_item *xfs_trans_get_rud(struct xfs_trans *tp,
> > > +		struct xfs_rui_log_item *ruip, uint nextents);
> > > +int xfs_trans_log_finish_rmap_update(struct xfs_trans *tp,
> > > +		struct xfs_rud_log_item *rudp, enum xfs_rmap_intent_type type,
> > > +		__uint64_t owner, int whichfork, xfs_fileoff_t startoff,
> > > +		xfs_fsblock_t startblock, xfs_filblks_t blockcount,
> > > +		xfs_exntst_t state);
> > > +
> > >  #endif	/* __XFS_TRANS_H__ */
> > > diff --git a/fs/xfs/xfs_trans_rmap.c b/fs/xfs/xfs_trans_rmap.c
> > > new file mode 100644
> > > index 0000000..b55a725
> > > --- /dev/null
> > > +++ b/fs/xfs/xfs_trans_rmap.c
> > > @@ -0,0 +1,235 @@
> > > +/*
> > > + * Copyright (C) 2016 Oracle.  All Rights Reserved.
> > > + *
> > > + * Author: Darrick J. Wong <darrick.wong@oracle.com>
> > > + *
> > > + * This program is free software; you can redistribute it and/or
> > > + * modify it under the terms of the GNU General Public License
> > > + * as published by the Free Software Foundation; either version 2
> > > + * of the License, or (at your option) any later version.
> > > + *
> > > + * This program is distributed in the hope that it would be useful,
> > > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > > + * GNU General Public License for more details.
> > > + *
> > > + * You should have received a copy of the GNU General Public License
> > > + * along with this program; if not, write the Free Software Foundation,
> > > + * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
> > > + */
> > > +#include "xfs.h"
> > > +#include "xfs_fs.h"
> > > +#include "xfs_shared.h"
> > > +#include "xfs_format.h"
> > > +#include "xfs_log_format.h"
> > > +#include "xfs_trans_resv.h"
> > > +#include "xfs_mount.h"
> > > +#include "xfs_defer.h"
> > > +#include "xfs_trans.h"
> > > +#include "xfs_trans_priv.h"
> > > +#include "xfs_rmap_item.h"
> > > +#include "xfs_alloc.h"
> > > +#include "xfs_rmap_btree.h"
> > > +
> > > +/*
> > > + * This routine is called to allocate an "rmap update intent"
> > > + * log item that will hold nextents worth of extents.  The
> > > + * caller must use all nextents extents, because we are not
> > > + * flexible about this at all.
> > > + */
> > > +struct xfs_rui_log_item *
> > > +xfs_trans_get_rui(
> > > +	struct xfs_trans		*tp,
> > > +	uint				nextents)
> > > +{
> > > +	struct xfs_rui_log_item		*ruip;
> > > +
> > > +	ASSERT(tp != NULL);
> > > +	ASSERT(nextents > 0);
> > > +
> > > +	ruip = xfs_rui_init(tp->t_mountp, nextents);
> > > +	ASSERT(ruip != NULL);
> > > +
> > > +	/*
> > > +	 * Get a log_item_desc to point at the new item.
> > > +	 */
> > > +	xfs_trans_add_item(tp, &ruip->rui_item);
> > > +	return ruip;
> > > +}
> > > +
> > > +/*
> > > + * This routine is called to indicate that the described
> > > + * extent is to be logged as needing to be freed.  It should
> > > + * be called once for each extent to be freed.
> > > + */
> > 
> > Stale comment.
> 
> <nod>
> 
> > > +void
> > > +xfs_trans_log_start_rmap_update(
> > > +	struct xfs_trans		*tp,
> > > +	struct xfs_rui_log_item		*ruip,
> > > +	enum xfs_rmap_intent_type	type,
> > > +	__uint64_t			owner,
> > > +	int				whichfork,
> > > +	xfs_fileoff_t			startoff,
> > > +	xfs_fsblock_t			startblock,
> > > +	xfs_filblks_t			blockcount,
> > > +	xfs_exntst_t			state)
> > > +{
> > > +	uint				next_extent;
> > > +	struct xfs_map_extent		*rmap;
> > > +
> > > +	tp->t_flags |= XFS_TRANS_DIRTY;
> > > +	ruip->rui_item.li_desc->lid_flags |= XFS_LID_DIRTY;
> > > +
> > > +	/*
> > > +	 * atomic_inc_return gives us the value after the increment;
> > > +	 * we want to use it as an array index so we need to subtract 1 from
> > > +	 * it.
> > > +	 */
> > > +	next_extent = atomic_inc_return(&ruip->rui_next_extent) - 1;
> > > +	ASSERT(next_extent < ruip->rui_format.rui_nextents);
> > > +	rmap = &(ruip->rui_format.rui_extents[next_extent]);
> > > +	rmap->me_owner = owner;
> > > +	rmap->me_startblock = startblock;
> > > +	rmap->me_startoff = startoff;
> > > +	rmap->me_len = blockcount;
> > > +	rmap->me_flags = 0;
> > > +	if (state == XFS_EXT_UNWRITTEN)
> > > +		rmap->me_flags |= XFS_RMAP_EXTENT_UNWRITTEN;
> > > +	if (whichfork == XFS_ATTR_FORK)
> > > +		rmap->me_flags |= XFS_RMAP_EXTENT_ATTR_FORK;
> > > +	switch (type) {
> > > +	case XFS_RMAP_MAP:
> > > +		rmap->me_flags |= XFS_RMAP_EXTENT_MAP;
> > > +		break;
> > > +	case XFS_RMAP_MAP_SHARED:
> > > +		rmap->me_flags |= XFS_RMAP_EXTENT_MAP_SHARED;
> > > +		break;
> > > +	case XFS_RMAP_UNMAP:
> > > +		rmap->me_flags |= XFS_RMAP_EXTENT_UNMAP;
> > > +		break;
> > > +	case XFS_RMAP_UNMAP_SHARED:
> > > +		rmap->me_flags |= XFS_RMAP_EXTENT_UNMAP_SHARED;
> > > +		break;
> > > +	case XFS_RMAP_CONVERT:
> > > +		rmap->me_flags |= XFS_RMAP_EXTENT_CONVERT;
> > > +		break;
> > > +	case XFS_RMAP_CONVERT_SHARED:
> > > +		rmap->me_flags |= XFS_RMAP_EXTENT_CONVERT_SHARED;
> > > +		break;
> > > +	case XFS_RMAP_ALLOC:
> > > +		rmap->me_flags |= XFS_RMAP_EXTENT_ALLOC;
> > > +		break;
> > > +	case XFS_RMAP_FREE:
> > > +		rmap->me_flags |= XFS_RMAP_EXTENT_FREE;
> > > +		break;
> > > +	default:
> > > +		ASSERT(0);
> > > +	}
> > 
> > Between here and the finish function, it looks like we could use a
> > helper to convert the state and whatnot to extent flags.
> 
> Ok.
> 
> > > +}
> > > +
> > > +
> > > +/*
> > > + * This routine is called to allocate an "extent free done"
> > > + * log item that will hold nextents worth of extents.  The
> > > + * caller must use all nextents extents, because we are not
> > > + * flexible about this at all.
> > > + */
> > 
> > Comment needs updating.
> 
> Ok.
> 
> > Brian
> > 
> > > +struct xfs_rud_log_item *
> > > +xfs_trans_get_rud(
> > > +	struct xfs_trans		*tp,
> > > +	struct xfs_rui_log_item		*ruip,
> > > +	uint				nextents)
> > > +{
> > > +	struct xfs_rud_log_item		*rudp;
> > > +
> > > +	ASSERT(tp != NULL);
> > > +	ASSERT(nextents > 0);
> > > +
> > > +	rudp = xfs_rud_init(tp->t_mountp, ruip, nextents);
> > > +	ASSERT(rudp != NULL);
> > > +
> > > +	/*
> > > +	 * Get a log_item_desc to point at the new item.
> > > +	 */
> > > +	xfs_trans_add_item(tp, &rudp->rud_item);
> > > +	return rudp;
> > > +}
> > > +
> > > +/*
> > > + * Finish an rmap update and log it to the RUD. Note that the transaction is
> > > + * marked dirty regardless of whether the rmap update succeeds or fails to
> > > + * support the RUI/RUD lifecycle rules.
> > > + */
> > > +int
> > > +xfs_trans_log_finish_rmap_update(
> > > +	struct xfs_trans		*tp,
> > > +	struct xfs_rud_log_item		*rudp,
> > > +	enum xfs_rmap_intent_type	type,
> > > +	__uint64_t			owner,
> > > +	int				whichfork,
> > > +	xfs_fileoff_t			startoff,
> > > +	xfs_fsblock_t			startblock,
> > > +	xfs_filblks_t			blockcount,
> > > +	xfs_exntst_t			state)
> > > +{
> > > +	uint				next_extent;
> > > +	struct xfs_map_extent		*rmap;
> > > +	int				error;
> > > +
> > > +	/* XXX: actually finish the rmap update here */
> > > +	error = -EFSCORRUPTED;
> > > +
> > > +	/*
> > > +	 * Mark the transaction dirty, even on error. This ensures the
> > > +	 * transaction is aborted, which:
> > > +	 *
> > > +	 * 1.) releases the RUI and frees the RUD
> > > +	 * 2.) shuts down the filesystem
> > > +	 */
> > > +	tp->t_flags |= XFS_TRANS_DIRTY;
> > > +	rudp->rud_item.li_desc->lid_flags |= XFS_LID_DIRTY;
> > > +
> > > +	next_extent = rudp->rud_next_extent;
> > > +	ASSERT(next_extent < rudp->rud_format.rud_nextents);
> > > +	rmap = &(rudp->rud_format.rud_extents[next_extent]);
> > > +	rmap->me_owner = owner;
> > > +	rmap->me_startblock = startblock;
> > > +	rmap->me_startoff = startoff;
> > > +	rmap->me_len = blockcount;
> > > +	rmap->me_flags = 0;
> > > +	if (state == XFS_EXT_UNWRITTEN)
> > > +		rmap->me_flags |= XFS_RMAP_EXTENT_UNWRITTEN;
> > > +	if (whichfork == XFS_ATTR_FORK)
> > > +		rmap->me_flags |= XFS_RMAP_EXTENT_ATTR_FORK;
> > > +	switch (type) {
> > > +	case XFS_RMAP_MAP:
> > > +		rmap->me_flags |= XFS_RMAP_EXTENT_MAP;
> > > +		break;
> > > +	case XFS_RMAP_MAP_SHARED:
> > > +		rmap->me_flags |= XFS_RMAP_EXTENT_MAP_SHARED;
> > > +		break;
> > > +	case XFS_RMAP_UNMAP:
> > > +		rmap->me_flags |= XFS_RMAP_EXTENT_UNMAP;
> > > +		break;
> > > +	case XFS_RMAP_UNMAP_SHARED:
> > > +		rmap->me_flags |= XFS_RMAP_EXTENT_UNMAP_SHARED;
> > > +		break;
> > > +	case XFS_RMAP_CONVERT:
> > > +		rmap->me_flags |= XFS_RMAP_EXTENT_CONVERT;
> > > +		break;
> > > +	case XFS_RMAP_CONVERT_SHARED:
> > > +		rmap->me_flags |= XFS_RMAP_EXTENT_CONVERT_SHARED;
> > > +		break;
> > > +	case XFS_RMAP_ALLOC:
> > > +		rmap->me_flags |= XFS_RMAP_EXTENT_ALLOC;
> > > +		break;
> > > +	case XFS_RMAP_FREE:
> > > +		rmap->me_flags |= XFS_RMAP_EXTENT_FREE;
> > > +		break;
> > > +	default:
> > > +		ASSERT(0);
> > > +	}
> > > +	rudp->rud_next_extent++;
> > > +
> > > +	return error;
> > > +}
> > > 


* Re: [PATCH 044/119] xfs: propagate bmap updates to rmapbt
  2016-07-16  7:26     ` Darrick J. Wong
  2016-07-18  1:21       ` Dave Chinner
@ 2016-07-18 12:55       ` Brian Foster
  2016-07-19  1:53         ` Darrick J. Wong
  1 sibling, 1 reply; 236+ messages in thread
From: Brian Foster @ 2016-07-18 12:55 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

On Sat, Jul 16, 2016 at 12:26:21AM -0700, Darrick J. Wong wrote:
> On Fri, Jul 15, 2016 at 02:33:56PM -0400, Brian Foster wrote:
> > On Thu, Jun 16, 2016 at 06:22:34PM -0700, Darrick J. Wong wrote:
> > > When we map, unmap, or convert an extent in a file's data or attr
> > > fork, schedule a respective update in the rmapbt.  Previous versions
> > > of this patch required a 1:1 correspondence between bmap and rmap,
> > > but this is no longer true.
> > > 
> > > v2: Remove the 1:1 correspondence requirement now that we have the
> > > ability to make interval queries against the rmapbt.  Update the
> > > commit message to reflect the broad restructuring of this patch.
> > > Fix the bmap shift code to adjust the rmaps correctly.
> > > 
> > > v3: Use the deferred operations code to handle redo operations
> > > atomically and deadlock free.  Plumb in all five rmap actions
> > > (map, unmap, convert extent, alloc, free); we'll use the first
> > > three now for file data, and reflink will want the last two.
> > > Add an error injection site to test log recovery.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_bmap.c       |   56 ++++++++-
> > >  fs/xfs/libxfs/xfs_rmap.c       |  252 ++++++++++++++++++++++++++++++++++++++++
> > >  fs/xfs/libxfs/xfs_rmap_btree.h |   24 ++++
> > >  fs/xfs/xfs_bmap_util.c         |    1 
> > >  fs/xfs/xfs_defer_item.c        |    6 +
> > >  fs/xfs/xfs_error.h             |    4 -
> > >  fs/xfs/xfs_log_recover.c       |   56 +++++++++
> > >  fs/xfs/xfs_trans.h             |    3 
> > >  fs/xfs/xfs_trans_rmap.c        |    7 +
> > >  9 files changed, 393 insertions(+), 16 deletions(-)
> > > 
> > > 
...
> > > diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
> > > index 76fc5c2..f179ea4 100644
> > > --- a/fs/xfs/libxfs/xfs_rmap.c
> > > +++ b/fs/xfs/libxfs/xfs_rmap.c
> > > @@ -36,6 +36,8 @@
> > >  #include "xfs_trace.h"
> > >  #include "xfs_error.h"
> > >  #include "xfs_extent_busy.h"
> > > +#include "xfs_bmap.h"
> > > +#include "xfs_inode.h"
> > >  
> > >  /*
> > >   * Lookup the first record less than or equal to [bno, len, owner, offset]
> > > @@ -1212,3 +1214,253 @@ xfs_rmapbt_query_range(
> > >  	return xfs_btree_query_range(cur, &low_brec, &high_brec,
> > >  			xfs_rmapbt_query_range_helper, &query);
> > >  }
> > > +
> > > +/* Clean up after calling xfs_rmap_finish_one. */
> > > +void
> > > +xfs_rmap_finish_one_cleanup(
> > > +	struct xfs_trans	*tp,
> > > +	struct xfs_btree_cur	*rcur,
> > > +	int			error)
> > > +{
> > > +	struct xfs_buf		*agbp;
> > > +
> > > +	if (rcur == NULL)
> > > +		return;
> > > +	agbp = rcur->bc_private.a.agbp;
> > > +	xfs_btree_del_cursor(rcur, error ? XFS_BTREE_ERROR : XFS_BTREE_NOERROR);
> > > +	xfs_trans_brelse(tp, agbp);
> > 
> > Why unconditionally release the agbp (and not just on error)?
> 
> We grabbed the agbp (er, AGF buffer) to construct the rmapbt cursor, so we have
> to free it after the cursor is deleted regardless of whether or not there's an
> error.
> 

It looks like it's attached to the transaction via xfs_trans_read_*(),
which means it will be released properly on transaction commit. I don't
think it's necessarily a bug because xfs_trans_brelse() bails out when
the item is dirty, but it looks like a departure from how this is used
elsewhere throughout XFS (when no modifications are made or otherwise as
an error condition cleanup). E.g., see the similar pattern in
xfs_free_extent().

Maybe I'm missing something.. was there a known issue that required this
call, or had it always been there?

> > > +}
> > > +
> > > +/*
> > > + * Process one of the deferred rmap operations.  We pass back the
> > > + * btree cursor to maintain our lock on the rmapbt between calls.
> > > + * This saves time and eliminates a buffer deadlock between the
> > > + * superblock and the AGF because we'll always grab them in the same
> > > + * order.
> > > + */
> > > +int
> > > +xfs_rmap_finish_one(
> > > +	struct xfs_trans		*tp,
> > > +	enum xfs_rmap_intent_type	type,
> > > +	__uint64_t			owner,
> > > +	int				whichfork,
> > > +	xfs_fileoff_t			startoff,
> > > +	xfs_fsblock_t			startblock,
> > > +	xfs_filblks_t			blockcount,
> > > +	xfs_exntst_t			state,
> > > +	struct xfs_btree_cur		**pcur)
> > > +{
> > > +	struct xfs_mount		*mp = tp->t_mountp;
> > > +	struct xfs_btree_cur		*rcur;
> > > +	struct xfs_buf			*agbp = NULL;
> > > +	int				error = 0;
> > > +	xfs_agnumber_t			agno;
> > > +	struct xfs_owner_info		oinfo;
> > > +	xfs_agblock_t			bno;
> > > +	bool				unwritten;
> > > +
> > > +	agno = XFS_FSB_TO_AGNO(mp, startblock);
> > > +	ASSERT(agno != NULLAGNUMBER);
> > > +	bno = XFS_FSB_TO_AGBNO(mp, startblock);
> > > +
> > > +	trace_xfs_rmap_deferred(mp, agno, type, bno, owner, whichfork,
> > > +			startoff, blockcount, state);
> > > +
> > > +	if (XFS_TEST_ERROR(false, mp,
> > > +			XFS_ERRTAG_RMAP_FINISH_ONE,
> > > +			XFS_RANDOM_RMAP_FINISH_ONE))
> > > +		return -EIO;
> > > +
> > > +	/*
> > > +	 * If we haven't gotten a cursor or the cursor AG doesn't match
> > > +	 * the startblock, get one now.
> > > +	 */
> > > +	rcur = *pcur;
> > > +	if (rcur != NULL && rcur->bc_private.a.agno != agno) {
> > > +		xfs_rmap_finish_one_cleanup(tp, rcur, 0);
> > > +		rcur = NULL;
> > > +		*pcur = NULL;
> > > +	}
> > > +	if (rcur == NULL) {
> > > +		error = xfs_free_extent_fix_freelist(tp, agno, &agbp);
> > 
> > Comment? Why is this here? (Maybe we should rename that function while
> > we're at it..)
> 
> /*
>  * Ensure the freelist is of a sufficient length to provide for any btree
>  * splits that could happen when we make changes to the rmapbt.
>  */
> 
> (I don't know why the function has that name; Dave supplied it.)
> 
> > > +		if (error)
> > > +			return error;
> > > +		if (!agbp)
> > > +			return -EFSCORRUPTED;
> > > +
> > > +		rcur = xfs_rmapbt_init_cursor(mp, tp, agbp, agno);
> > > +		if (!rcur) {
> > > +			error = -ENOMEM;
> > > +			goto out_cur;
> > > +		}
> > > +	}
> > > +	*pcur = rcur;
> > > +
> > > +	xfs_rmap_ino_owner(&oinfo, owner, whichfork, startoff);
> > > +	unwritten = state == XFS_EXT_UNWRITTEN;
> > > +	bno = XFS_FSB_TO_AGBNO(rcur->bc_mp, startblock);
> > > +
> > > +	switch (type) {
> > > +	case XFS_RMAP_MAP:
> > > +		error = xfs_rmap_map(rcur, bno, blockcount, unwritten, &oinfo);
> > > +		break;
> > > +	case XFS_RMAP_UNMAP:
> > > +		error = xfs_rmap_unmap(rcur, bno, blockcount, unwritten,
> > > +				&oinfo);
> > > +		break;
> > > +	case XFS_RMAP_CONVERT:
> > > +		error = xfs_rmap_convert(rcur, bno, blockcount, !unwritten,
> > > +				&oinfo);
> > > +		break;
> > > +	case XFS_RMAP_ALLOC:
> > > +		error = __xfs_rmap_alloc(rcur, bno, blockcount, unwritten,
> > > +				&oinfo);
> > > +		break;
> > > +	case XFS_RMAP_FREE:
> > > +		error = __xfs_rmap_free(rcur, bno, blockcount, unwritten,
> > > +				&oinfo);
> > > +		break;
> > > +	default:
> > > +		ASSERT(0);
> > > +		error = -EFSCORRUPTED;
> > > +	}
> > > +	return error;
> > > +
> > > +out_cur:
> > > +	xfs_trans_brelse(tp, agbp);
> > > +
> > > +	return error;
> > > +}
> > > +
> > > +/*
> > > + * Record a rmap intent; the list is kept sorted first by AG and then by
> > > + * increasing age.
> > > + */
> > > +static int
> > > +__xfs_rmap_add(
> > > +	struct xfs_mount	*mp,
> > > +	struct xfs_defer_ops	*dfops,
> > > +	struct xfs_rmap_intent	*ri)
> > > +{
> > > +	struct xfs_rmap_intent	*new;
> > > +
> > > +	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
> > > +		return 0;
> > > +
> > > +	trace_xfs_rmap_defer(mp, XFS_FSB_TO_AGNO(mp, ri->ri_bmap.br_startblock),
> > > +			ri->ri_type,
> > > +			XFS_FSB_TO_AGBNO(mp, ri->ri_bmap.br_startblock),
> > > +			ri->ri_owner, ri->ri_whichfork,
> > > +			ri->ri_bmap.br_startoff,
> > > +			ri->ri_bmap.br_blockcount,
> > > +			ri->ri_bmap.br_state);
> > > +
> > > +	new = kmem_zalloc(sizeof(struct xfs_rmap_intent), KM_SLEEP | KM_NOFS);
> > > +	*new = *ri;
> > > +
> > > +	xfs_defer_add(dfops, XFS_DEFER_OPS_TYPE_RMAP, &new->ri_list);
> > > +	return 0;
> > > +}
> > > +
> > > +/* Map an extent into a file. */
> > > +int
> > > +xfs_rmap_map_extent(
> > > +	struct xfs_mount	*mp,
> > > +	struct xfs_defer_ops	*dfops,
> > > +	struct xfs_inode	*ip,
> > > +	int			whichfork,
> > > +	struct xfs_bmbt_irec	*PREV)
> > > +{
> > > +	struct xfs_rmap_intent	ri;
> > > +
> > > +	ri.ri_type = XFS_RMAP_MAP;
> > > +	ri.ri_owner = ip->i_ino;
> > > +	ri.ri_whichfork = whichfork;
> > > +	ri.ri_bmap = *PREV;
> > > +
> > 
> > I think we should probably initialize ri_list as well (maybe turn this
> > into an xfs_rmap_init helper).
> 
> __xfs_rmap_add calls xfs_defer_add, which calls list_add_tail, which
> initializes ri_list.  Could probably just make an _rmap_init helper that
> allocates the structure, then have _rmap_*_extent fill out the new intent, and
> make the _rmap_add function pass it to _defer_add, which I think is what you're
> getting at.
> 

I didn't mean to suggest it was a bug. It's more of a defensive thing
than anything.

Brian

> > Also, for some reason it feels to me like the _hasrmapbt() feature check
> > should be up at this level (or higher), rather than buried in
> > __xfs_rmap_add(). I don't feel too strongly about that if others think
> > differently, however.
> 
> <shrug> It probably ought to be in the higher level function.
> 
> > > +	return __xfs_rmap_add(mp, dfops, &ri);
> > > +}
> > > +
> > > +/* Unmap an extent out of a file. */
> > > +int
> > > +xfs_rmap_unmap_extent(
> > > +	struct xfs_mount	*mp,
> > > +	struct xfs_defer_ops	*dfops,
> > > +	struct xfs_inode	*ip,
> > > +	int			whichfork,
> > > +	struct xfs_bmbt_irec	*PREV)
> > > +{
> > > +	struct xfs_rmap_intent	ri;
> > > +
> > > +	ri.ri_type = XFS_RMAP_UNMAP;
> > > +	ri.ri_owner = ip->i_ino;
> > > +	ri.ri_whichfork = whichfork;
> > > +	ri.ri_bmap = *PREV;
> > > +
> > > +	return __xfs_rmap_add(mp, dfops, &ri);
> > > +}
> > > +
> > > +/* Convert a data fork extent from unwritten to real or vice versa. */
> > > +int
> > > +xfs_rmap_convert_extent(
> > > +	struct xfs_mount	*mp,
> > > +	struct xfs_defer_ops	*dfops,
> > > +	struct xfs_inode	*ip,
> > > +	int			whichfork,
> > > +	struct xfs_bmbt_irec	*PREV)
> > > +{
> > > +	struct xfs_rmap_intent	ri;
> > > +
> > > +	ri.ri_type = XFS_RMAP_CONVERT;
> > > +	ri.ri_owner = ip->i_ino;
> > > +	ri.ri_whichfork = whichfork;
> > > +	ri.ri_bmap = *PREV;
> > > +
> > > +	return __xfs_rmap_add(mp, dfops, &ri);
> > > +}
> > > +
> > > +/* Schedule the creation of an rmap for non-file data. */
> > > +int
> > > +xfs_rmap_alloc_defer(
> > 
> > xfs_rmap_[alloc|free]_extent() like the others..?
> 
> Yeah.  The naming has shifted a bit over the past few revisions.
> 
> --D
> 
> > 
> > Brian 
> > 
> > > +	struct xfs_mount	*mp,
> > > +	struct xfs_defer_ops	*dfops,
> > > +	xfs_agnumber_t		agno,
> > > +	xfs_agblock_t		bno,
> > > +	xfs_extlen_t		len,
> > > +	__uint64_t		owner)
> > > +{
> > > +	struct xfs_rmap_intent	ri;
> > > +
> > > +	ri.ri_type = XFS_RMAP_ALLOC;
> > > +	ri.ri_owner = owner;
> > > +	ri.ri_whichfork = XFS_DATA_FORK;
> > > +	ri.ri_bmap.br_startblock = XFS_AGB_TO_FSB(mp, agno, bno);
> > > +	ri.ri_bmap.br_blockcount = len;
> > > +	ri.ri_bmap.br_startoff = 0;
> > > +	ri.ri_bmap.br_state = XFS_EXT_NORM;
> > > +
> > > +	return __xfs_rmap_add(mp, dfops, &ri);
> > > +}
> > > +
> > > +/* Schedule the deletion of an rmap for non-file data. */
> > > +int
> > > +xfs_rmap_free_defer(
> > > +	struct xfs_mount	*mp,
> > > +	struct xfs_defer_ops	*dfops,
> > > +	xfs_agnumber_t		agno,
> > > +	xfs_agblock_t		bno,
> > > +	xfs_extlen_t		len,
> > > +	__uint64_t		owner)
> > > +{
> > > +	struct xfs_rmap_intent	ri;
> > > +
> > > +	ri.ri_type = XFS_RMAP_FREE;
> > > +	ri.ri_owner = owner;
> > > +	ri.ri_whichfork = XFS_DATA_FORK;
> > > +	ri.ri_bmap.br_startblock = XFS_AGB_TO_FSB(mp, agno, bno);
> > > +	ri.ri_bmap.br_blockcount = len;
> > > +	ri.ri_bmap.br_startoff = 0;
> > > +	ri.ri_bmap.br_state = XFS_EXT_NORM;
> > > +
> > > +	return __xfs_rmap_add(mp, dfops, &ri);
> > > +}
> > > diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
> > > index aff60dc..5df406e 100644
> > > --- a/fs/xfs/libxfs/xfs_rmap_btree.h
> > > +++ b/fs/xfs/libxfs/xfs_rmap_btree.h
> > > @@ -106,4 +106,28 @@ struct xfs_rmap_intent {
> > >  	struct xfs_bmbt_irec			ri_bmap;
> > >  };
> > >  
> > > +/* functions for updating the rmapbt based on bmbt map/unmap operations */
> > > +int xfs_rmap_map_extent(struct xfs_mount *mp, struct xfs_defer_ops *dfops,
> > > +		struct xfs_inode *ip, int whichfork,
> > > +		struct xfs_bmbt_irec *imap);
> > > +int xfs_rmap_unmap_extent(struct xfs_mount *mp, struct xfs_defer_ops *dfops,
> > > +		struct xfs_inode *ip, int whichfork,
> > > +		struct xfs_bmbt_irec *imap);
> > > +int xfs_rmap_convert_extent(struct xfs_mount *mp, struct xfs_defer_ops *dfops,
> > > +		struct xfs_inode *ip, int whichfork,
> > > +		struct xfs_bmbt_irec *imap);
> > > +int xfs_rmap_alloc_defer(struct xfs_mount *mp, struct xfs_defer_ops *dfops,
> > > +		xfs_agnumber_t agno, xfs_agblock_t bno, xfs_extlen_t len,
> > > +		__uint64_t owner);
> > > +int xfs_rmap_free_defer(struct xfs_mount *mp, struct xfs_defer_ops *dfops,
> > > +		xfs_agnumber_t agno, xfs_agblock_t bno, xfs_extlen_t len,
> > > +		__uint64_t owner);
> > > +
> > > +void xfs_rmap_finish_one_cleanup(struct xfs_trans *tp,
> > > +		struct xfs_btree_cur *rcur, int error);
> > > +int xfs_rmap_finish_one(struct xfs_trans *tp, enum xfs_rmap_intent_type type,
> > > +		__uint64_t owner, int whichfork, xfs_fileoff_t startoff,
> > > +		xfs_fsblock_t startblock, xfs_filblks_t blockcount,
> > > +		xfs_exntst_t state, struct xfs_btree_cur **pcur);
> > > +
> > >  #endif	/* __XFS_RMAP_BTREE_H__ */
> > > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> > > index 62d194e..450fd49 100644
> > > --- a/fs/xfs/xfs_bmap_util.c
> > > +++ b/fs/xfs/xfs_bmap_util.c
> > > @@ -41,6 +41,7 @@
> > >  #include "xfs_trace.h"
> > >  #include "xfs_icache.h"
> > >  #include "xfs_log.h"
> > > +#include "xfs_rmap_btree.h"
> > >  
> > >  /* Kernel only BMAP related definitions and functions */
> > >  
> > > diff --git a/fs/xfs/xfs_defer_item.c b/fs/xfs/xfs_defer_item.c
> > > index dbd10fc..9ed060d 100644
> > > --- a/fs/xfs/xfs_defer_item.c
> > > +++ b/fs/xfs/xfs_defer_item.c
> > > @@ -213,7 +213,8 @@ xfs_rmap_update_finish_item(
> > >  			rmap->ri_bmap.br_startoff,
> > >  			rmap->ri_bmap.br_startblock,
> > >  			rmap->ri_bmap.br_blockcount,
> > > -			rmap->ri_bmap.br_state);
> > > +			rmap->ri_bmap.br_state,
> > > +			(struct xfs_btree_cur **)state);
> > >  	kmem_free(rmap);
> > >  	return error;
> > >  }
> > > @@ -225,6 +226,9 @@ xfs_rmap_update_finish_cleanup(
> > >  	void			*state,
> > >  	int			error)
> > >  {
> > > +	struct xfs_btree_cur	*rcur = state;
> > > +
> > > +	xfs_rmap_finish_one_cleanup(tp, rcur, error);
> > >  }
> > >  
> > >  /* Abort all pending RUIs. */
> > > diff --git a/fs/xfs/xfs_error.h b/fs/xfs/xfs_error.h
> > > index ee4680e..6bc614c 100644
> > > --- a/fs/xfs/xfs_error.h
> > > +++ b/fs/xfs/xfs_error.h
> > > @@ -91,7 +91,8 @@ extern void xfs_verifier_error(struct xfs_buf *bp);
> > >  #define XFS_ERRTAG_DIOWRITE_IOERR			20
> > >  #define XFS_ERRTAG_BMAPIFORMAT				21
> > >  #define XFS_ERRTAG_FREE_EXTENT				22
> > > -#define XFS_ERRTAG_MAX					23
> > > +#define XFS_ERRTAG_RMAP_FINISH_ONE			23
> > > +#define XFS_ERRTAG_MAX					24
> > >  
> > >  /*
> > >   * Random factors for above tags, 1 means always, 2 means 1/2 time, etc.
> > > @@ -119,6 +120,7 @@ extern void xfs_verifier_error(struct xfs_buf *bp);
> > >  #define XFS_RANDOM_DIOWRITE_IOERR			(XFS_RANDOM_DEFAULT/10)
> > >  #define	XFS_RANDOM_BMAPIFORMAT				XFS_RANDOM_DEFAULT
> > >  #define XFS_RANDOM_FREE_EXTENT				1
> > > +#define XFS_RANDOM_RMAP_FINISH_ONE			1
> > >  
> > >  #ifdef DEBUG
> > >  extern int xfs_error_test_active;
> > > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> > > index c9fe0c4..f7f9635 100644
> > > --- a/fs/xfs/xfs_log_recover.c
> > > +++ b/fs/xfs/xfs_log_recover.c
> > > @@ -45,6 +45,7 @@
> > >  #include "xfs_error.h"
> > >  #include "xfs_dir2.h"
> > >  #include "xfs_rmap_item.h"
> > > +#include "xfs_rmap_btree.h"
> > >  
> > >  #define BLK_AVG(blk1, blk2)	((blk1+blk2) >> 1)
> > >  
> > > @@ -4486,6 +4487,12 @@ xlog_recover_process_rui(
> > >  	struct xfs_map_extent		*rmap;
> > >  	xfs_fsblock_t			startblock_fsb;
> > >  	bool				op_ok;
> > > +	struct xfs_rud_log_item		*rudp;
> > > +	enum xfs_rmap_intent_type	type;
> > > +	int				whichfork;
> > > +	xfs_exntst_t			state;
> > > +	struct xfs_trans		*tp;
> > > +	struct xfs_btree_cur		*rcur = NULL;
> > >  
> > >  	ASSERT(!test_bit(XFS_RUI_RECOVERED, &ruip->rui_flags));
> > >  
> > > @@ -4528,9 +4535,54 @@ xlog_recover_process_rui(
> > >  		}
> > >  	}
> > >  
> > > -	/* XXX: do nothing for now */
> > > +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
> > > +	if (error)
> > > +		return error;
> > > +	rudp = xfs_trans_get_rud(tp, ruip, ruip->rui_format.rui_nextents);
> > > +
> > > +	for (i = 0; i < ruip->rui_format.rui_nextents; i++) {
> > > +		rmap = &(ruip->rui_format.rui_extents[i]);
> > > +		state = (rmap->me_flags & XFS_RMAP_EXTENT_UNWRITTEN) ?
> > > +				XFS_EXT_UNWRITTEN : XFS_EXT_NORM;
> > > +		whichfork = (rmap->me_flags & XFS_RMAP_EXTENT_ATTR_FORK) ?
> > > +				XFS_ATTR_FORK : XFS_DATA_FORK;
> > > +		switch (rmap->me_flags & XFS_RMAP_EXTENT_TYPE_MASK) {
> > > +		case XFS_RMAP_EXTENT_MAP:
> > > +			type = XFS_RMAP_MAP;
> > > +			break;
> > > +		case XFS_RMAP_EXTENT_UNMAP:
> > > +			type = XFS_RMAP_UNMAP;
> > > +			break;
> > > +		case XFS_RMAP_EXTENT_CONVERT:
> > > +			type = XFS_RMAP_CONVERT;
> > > +			break;
> > > +		case XFS_RMAP_EXTENT_ALLOC:
> > > +			type = XFS_RMAP_ALLOC;
> > > +			break;
> > > +		case XFS_RMAP_EXTENT_FREE:
> > > +			type = XFS_RMAP_FREE;
> > > +			break;
> > > +		default:
> > > +			error = -EFSCORRUPTED;
> > > +			goto abort_error;
> > > +		}
> > > +		error = xfs_trans_log_finish_rmap_update(tp, rudp, type,
> > > +				rmap->me_owner, whichfork,
> > > +				rmap->me_startoff, rmap->me_startblock,
> > > +				rmap->me_len, state, &rcur);
> > > +		if (error)
> > > +			goto abort_error;
> > > +
> > > +	}
> > > +
> > > +	xfs_rmap_finish_one_cleanup(tp, rcur, error);
> > >  	set_bit(XFS_RUI_RECOVERED, &ruip->rui_flags);
> > > -	xfs_rui_release(ruip);
> > > +	error = xfs_trans_commit(tp);
> > > +	return error;
> > > +
> > > +abort_error:
> > > +	xfs_rmap_finish_one_cleanup(tp, rcur, error);
> > > +	xfs_trans_cancel(tp);
> > >  	return error;
> > >  }
> > >  
> > > diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> > > index c48be63..f59d934 100644
> > > --- a/fs/xfs/xfs_trans.h
> > > +++ b/fs/xfs/xfs_trans.h
> > > @@ -244,12 +244,13 @@ void xfs_trans_log_start_rmap_update(struct xfs_trans *tp,
> > >  		xfs_fsblock_t startblock, xfs_filblks_t blockcount,
> > >  		xfs_exntst_t state);
> > >  
> > > +struct xfs_btree_cur;
> > >  struct xfs_rud_log_item *xfs_trans_get_rud(struct xfs_trans *tp,
> > >  		struct xfs_rui_log_item *ruip, uint nextents);
> > >  int xfs_trans_log_finish_rmap_update(struct xfs_trans *tp,
> > >  		struct xfs_rud_log_item *rudp, enum xfs_rmap_intent_type type,
> > >  		__uint64_t owner, int whichfork, xfs_fileoff_t startoff,
> > >  		xfs_fsblock_t startblock, xfs_filblks_t blockcount,
> > > -		xfs_exntst_t state);
> > > +		xfs_exntst_t state, struct xfs_btree_cur **pcur);
> > >  
> > >  #endif	/* __XFS_TRANS_H__ */
> > > diff --git a/fs/xfs/xfs_trans_rmap.c b/fs/xfs/xfs_trans_rmap.c
> > > index b55a725..0c0df18 100644
> > > --- a/fs/xfs/xfs_trans_rmap.c
> > > +++ b/fs/xfs/xfs_trans_rmap.c
> > > @@ -170,14 +170,15 @@ xfs_trans_log_finish_rmap_update(
> > >  	xfs_fileoff_t			startoff,
> > >  	xfs_fsblock_t			startblock,
> > >  	xfs_filblks_t			blockcount,
> > > -	xfs_exntst_t			state)
> > > +	xfs_exntst_t			state,
> > > +	struct xfs_btree_cur		**pcur)
> > >  {
> > >  	uint				next_extent;
> > >  	struct xfs_map_extent		*rmap;
> > >  	int				error;
> > >  
> > > -	/* XXX: actually finish the rmap update here */
> > > -	error = -EFSCORRUPTED;
> > > +	error = xfs_rmap_finish_one(tp, type, owner, whichfork, startoff,
> > > +			startblock, blockcount, state, pcur);
> > >  
> > >  	/*
> > >  	 * Mark the transaction dirty, even on error. This ensures the
> > > 


* Re: [PATCH 044/119] xfs: propagate bmap updates to rmapbt
  2016-07-18  1:21       ` Dave Chinner
@ 2016-07-18 12:56         ` Brian Foster
  0 siblings, 0 replies; 236+ messages in thread
From: Brian Foster @ 2016-07-18 12:56 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, linux-fsdevel, vishal.l.verma, xfs

On Mon, Jul 18, 2016 at 11:21:22AM +1000, Dave Chinner wrote:
> On Sat, Jul 16, 2016 at 12:26:21AM -0700, Darrick J. Wong wrote:
> > On Fri, Jul 15, 2016 at 02:33:56PM -0400, Brian Foster wrote:
> > > On Thu, Jun 16, 2016 at 06:22:34PM -0700, Darrick J. Wong wrote:
> > > > When we map, unmap, or convert an extent in a file's data or attr
> > > > fork, schedule a respective update in the rmapbt.  Previous versions
> > > > of this patch required a 1:1 correspondence between bmap and rmap,
> > > > but this is no longer true.
> > > > 
> > > > v2: Remove the 1:1 correspondence requirement now that we have the
> > > > ability to make interval queries against the rmapbt.  Update the
> > > > commit message to reflect the broad restructuring of this patch.
> > > > Fix the bmap shift code to adjust the rmaps correctly.
> > > > 
> > > > v3: Use the deferred operations code to handle redo operations
> > > > atomically and deadlock free.  Plumb in all five rmap actions
> > > > (map, unmap, convert extent, alloc, free); we'll use the first
> > > > three now for file data, and reflink will want the last two.
> > > > Add an error injection site to test log recovery.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> .....
> > > > + * superblock and the AGF because we'll always grab them in the same
> > > > + * order.
> > > > + */
> > > > +int
> > > > +xfs_rmap_finish_one(
> > > > +	struct xfs_trans		*tp,
> > > > +	enum xfs_rmap_intent_type	type,
> > > > +	__uint64_t			owner,
> > > > +	int				whichfork,
> > > > +	xfs_fileoff_t			startoff,
> > > > +	xfs_fsblock_t			startblock,
> > > > +	xfs_filblks_t			blockcount,
> > > > +	xfs_exntst_t			state,
> > > > +	struct xfs_btree_cur		**pcur)
> > > > +{
> > > > +	struct xfs_mount		*mp = tp->t_mountp;
> > > > +	struct xfs_btree_cur		*rcur;
> > > > +	struct xfs_buf			*agbp = NULL;
> > > > +	int				error = 0;
> > > > +	xfs_agnumber_t			agno;
> > > > +	struct xfs_owner_info		oinfo;
> > > > +	xfs_agblock_t			bno;
> > > > +	bool				unwritten;
> > > > +
> > > > +	agno = XFS_FSB_TO_AGNO(mp, startblock);
> > > > +	ASSERT(agno != NULLAGNUMBER);
> > > > +	bno = XFS_FSB_TO_AGBNO(mp, startblock);
> > > > +
> > > > +	trace_xfs_rmap_deferred(mp, agno, type, bno, owner, whichfork,
> > > > +			startoff, blockcount, state);
> > > > +
> > > > +	if (XFS_TEST_ERROR(false, mp,
> > > > +			XFS_ERRTAG_RMAP_FINISH_ONE,
> > > > +			XFS_RANDOM_RMAP_FINISH_ONE))
> > > > +		return -EIO;
> > > > +
> > > > +	/*
> > > > +	 * If we haven't gotten a cursor or the cursor AG doesn't match
> > > > +	 * the startblock, get one now.
> > > > +	 */
> > > > +	rcur = *pcur;
> > > > +	if (rcur != NULL && rcur->bc_private.a.agno != agno) {
> > > > +		xfs_rmap_finish_one_cleanup(tp, rcur, 0);
> > > > +		rcur = NULL;
> > > > +		*pcur = NULL;
> > > > +	}
> > > > +	if (rcur == NULL) {
> > > > +		error = xfs_free_extent_fix_freelist(tp, agno, &agbp);
> > > 
> > > Comment? Why is this here? (Maybe we should rename that function while
> > > we're at it..)
> > 
> > /*
> >  * Ensure the freelist is of a sufficient length to provide for any btree
> >  * splits that could happen when we make changes to the rmapbt.
> >  */
> > 
> > (I don't know why the function has that name; Dave supplied it.)
> 
> I named it that way because it was common code factored out of
> xfs_free_extent() for use by multiple callers on the extent freeing
> side of things. Feel free to name it differently if you can think of
> something more appropriate.
> 

Right, that's why it stood out to me. I don't feel too strongly about
it; perhaps xfs_fix_freelist()? xfs_agf_fix_freelist()?

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


* Re: [PATCH 045/119] xfs: add rmap btree geometry feature flag
  2016-06-17  1:22 ` [PATCH 045/119] xfs: add rmap btree geometry feature flag Darrick J. Wong
@ 2016-07-18 13:34   ` Brian Foster
  0 siblings, 0 replies; 236+ messages in thread
From: Brian Foster @ 2016-07-18 13:34 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, vishal.l.verma, Dave Chinner, xfs

On Thu, Jun 16, 2016 at 06:22:40PM -0700, Darrick J. Wong wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> So xfs_info and other userspace utilities know the filesystem is
> using this feature.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/libxfs/xfs_fs.h |    1 +
>  fs/xfs/xfs_fsops.c     |    4 +++-
>  2 files changed, 4 insertions(+), 1 deletion(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
> index f5ec9c5..7945505 100644
> --- a/fs/xfs/libxfs/xfs_fs.h
> +++ b/fs/xfs/libxfs/xfs_fs.h
> @@ -206,6 +206,7 @@ typedef struct xfs_fsop_resblks {
>  #define XFS_FSOP_GEOM_FLAGS_FTYPE	0x10000	/* inode directory types */
>  #define XFS_FSOP_GEOM_FLAGS_FINOBT	0x20000	/* free inode btree */
>  #define XFS_FSOP_GEOM_FLAGS_SPINODES	0x40000	/* sparse inode chunks	*/
> +#define XFS_FSOP_GEOM_FLAGS_RMAPBT	0x80000	/* Reverse mapping btree */
>  
>  /*
>   * Minimum and maximum sizes need for growth checks.
> diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> index 3772f6c..5980d5c 100644
> --- a/fs/xfs/xfs_fsops.c
> +++ b/fs/xfs/xfs_fsops.c
> @@ -105,7 +105,9 @@ xfs_fs_geometry(
>  			(xfs_sb_version_hasfinobt(&mp->m_sb) ?
>  				XFS_FSOP_GEOM_FLAGS_FINOBT : 0) |
>  			(xfs_sb_version_hassparseinodes(&mp->m_sb) ?
> -				XFS_FSOP_GEOM_FLAGS_SPINODES : 0);
> +				XFS_FSOP_GEOM_FLAGS_SPINODES : 0) |
> +			(xfs_sb_version_hasrmapbt(&mp->m_sb) ?
> +				XFS_FSOP_GEOM_FLAGS_RMAPBT : 0);
>  		geo->logsectsize = xfs_sb_version_hassector(&mp->m_sb) ?
>  				mp->m_sb.sb_logsectsize : BBSIZE;
>  		geo->rtsectsize = mp->m_sb.sb_blocksize;
> 


* Re: [PATCH 046/119] xfs: add rmap btree block detection to log recovery
  2016-06-17  1:22 ` [PATCH 046/119] xfs: add rmap btree block detection to log recovery Darrick J. Wong
@ 2016-07-18 13:34   ` Brian Foster
  0 siblings, 0 replies; 236+ messages in thread
From: Brian Foster @ 2016-07-18 13:34 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, vishal.l.verma, Dave Chinner, xfs

On Thu, Jun 16, 2016 at 06:22:46PM -0700, Darrick J. Wong wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> So such blocks can be correctly identified and have their operations
> structutes attached to validate recovery has not resulted in a

  structures					not?

> correct block.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_log_recover.c |    4 ++++
>  1 file changed, 4 insertions(+)
> 
> 
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index f7f9635..dbfbc26 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -2233,6 +2233,7 @@ xlog_recover_get_buf_lsn(
>  	case XFS_ABTC_CRC_MAGIC:
>  	case XFS_ABTB_MAGIC:
>  	case XFS_ABTC_MAGIC:
> +	case XFS_RMAP_CRC_MAGIC:
>  	case XFS_IBT_CRC_MAGIC:
>  	case XFS_IBT_MAGIC: {
>  		struct xfs_btree_block *btb = blk;
> @@ -2401,6 +2402,9 @@ xlog_recover_validate_buf_type(
>  		case XFS_BMAP_MAGIC:
>  			bp->b_ops = &xfs_bmbt_buf_ops;
>  			break;
> +		case XFS_RMAP_CRC_MAGIC:
> +			bp->b_ops = &xfs_rmapbt_buf_ops;
> +			break;
>  		default:
>  			xfs_warn(mp, "Bad btree block magic!");
>  			ASSERT(0);
> 


* Re: [PATCH 047/119] xfs: disable XFS_IOC_SWAPEXT when rmap btree is enabled
  2016-06-17  1:22 ` [PATCH 047/119] xfs: disable XFS_IOC_SWAPEXT when rmap btree is enabled Darrick J. Wong
@ 2016-07-18 13:34   ` Brian Foster
  2016-07-18 16:18     ` Darrick J. Wong
  0 siblings, 1 reply; 236+ messages in thread
From: Brian Foster @ 2016-07-18 13:34 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, vishal.l.verma, Dave Chinner, xfs

On Thu, Jun 16, 2016 at 06:22:53PM -0700, Darrick J. Wong wrote:
> Swapping extents between two inodes requires the owner to be updated
> in the rmap tree for all the extents that are swapped. This code
> does not yet exist, so switch off the XFS_IOC_SWAPEXT ioctl until
> support has been implemented. This will need to be done before the
> rmap btree code can have the experimental tag removed.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> [darrick.wong@oracle.com: fix extent swapping when rmap enabled]
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_bmap_util.c |    4 ++++
>  1 file changed, 4 insertions(+)
> 
> 
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index 450fd49..8666873 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -1618,6 +1618,10 @@ xfs_swap_extents(
>  	__uint64_t	tmp;
>  	int		lock_flags;
>  
> +	/* XXX: we can't do this with rmap, will fix later */
> +	if (xfs_sb_version_hasrmapbt(&mp->m_sb))
> +		return -EOPNOTSUPP;
> +
>  	tempifp = kmem_alloc(sizeof(xfs_ifork_t), KM_MAYFAIL);
>  	if (!tempifp) {
>  		error = -ENOMEM;
> 


* Re: [PATCH 048/119] xfs: don't update rmapbt when fixing agfl
  2016-06-17  1:22 ` [PATCH 048/119] xfs: don't update rmapbt when fixing agfl Darrick J. Wong
@ 2016-07-18 13:34   ` Brian Foster
  2016-07-18 15:53     ` Darrick J. Wong
  0 siblings, 1 reply; 236+ messages in thread
From: Brian Foster @ 2016-07-18 13:34 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, vishal.l.verma, xfs

On Thu, Jun 16, 2016 at 06:22:59PM -0700, Darrick J. Wong wrote:
> Allow a caller of xfs_alloc_fix_freelist to disable rmapbt updates
> when fixing the AG freelist.  xfs_repair needs this during phase 5
> to be able to adjust the freelist while it's reconstructing the rmap
> btree; the missing entries will be added back at the very end of
> phase 5 once the AGFL contents settle down.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_alloc.c |   40 ++++++++++++++++++++++++++--------------
>  fs/xfs/libxfs/xfs_alloc.h |    3 +++
>  2 files changed, 29 insertions(+), 14 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index 4c8ffd4..6eabab1 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -2092,26 +2092,38 @@ xfs_alloc_fix_freelist(
>  	 * anything other than extra overhead when we need to put more blocks
>  	 * back on the free list? Maybe we should only do this when space is
>  	 * getting low or the AGFL is more than half full?
> +	 *
> +	 * The NOSHRINK flag prevents the AGFL from being shrunk if it's too
> +	 * big; the NORMAP flag prevents AGFL expand/shrink operations from
> +	 * updating the rmapbt.  Both flags are used in xfs_repair while we're
> +	 * rebuilding the rmapbt, and neither are used by the kernel.  They're
> +	 * both required to ensure that rmaps are correctly recorded for the
> +	 * regenerated AGFL, bnobt, and cntbt.  See repair/phase5.c and
> +	 * repair/rmap.c in xfsprogs for details.
>  	 */
> -	xfs_rmap_ag_owner(&targs.oinfo, XFS_RMAP_OWN_AG);
> -	while (pag->pagf_flcount > need) {
> -		struct xfs_buf	*bp;
> +	memset(&targs, 0, sizeof(targs));
> +	if (!(flags & XFS_ALLOC_FLAG_NOSHRINK)) {
> +		if (!(flags & XFS_ALLOC_FLAG_NORMAP))
> +			xfs_rmap_ag_owner(&targs.oinfo, XFS_RMAP_OWN_AG);

Can we get away with setting targs.oinfo once rather than here and
below? If so, I think something like the following might clean this up a
bit and save some indentation:

	memset(targs, 0, ...);
	if (!(flags & NORMAP))
		xfs_rmap_ag_owner(...);
	while (!(flags & NOSHRINK) &&
	       flcount > need) {
		...
	}
	...

Hm?

Brian

> +		while (pag->pagf_flcount > need) {
> +			struct xfs_buf	*bp;
>  
> -		error = xfs_alloc_get_freelist(tp, agbp, &bno, 0);
> -		if (error)
> -			goto out_agbp_relse;
> -		error = xfs_free_ag_extent(tp, agbp, args->agno, bno, 1,
> -					   &targs.oinfo, 1);
> -		if (error)
> -			goto out_agbp_relse;
> -		bp = xfs_btree_get_bufs(mp, tp, args->agno, bno, 0);
> -		xfs_trans_binval(tp, bp);
> +			error = xfs_alloc_get_freelist(tp, agbp, &bno, 0);
> +			if (error)
> +				goto out_agbp_relse;
> +			error = xfs_free_ag_extent(tp, agbp, args->agno, bno, 1,
> +						   &targs.oinfo, 1);
> +			if (error)
> +				goto out_agbp_relse;
> +			bp = xfs_btree_get_bufs(mp, tp, args->agno, bno, 0);
> +			xfs_trans_binval(tp, bp);
> +		}
>  	}
>  
> -	memset(&targs, 0, sizeof(targs));
>  	targs.tp = tp;
>  	targs.mp = mp;
> -	xfs_rmap_ag_owner(&targs.oinfo, XFS_RMAP_OWN_AG);
> +	if (!(flags & XFS_ALLOC_FLAG_NORMAP))
> +		xfs_rmap_ag_owner(&targs.oinfo, XFS_RMAP_OWN_AG);
>  	targs.agbp = agbp;
>  	targs.agno = args->agno;
>  	targs.alignment = targs.minlen = targs.prod = targs.isfl = 1;
> diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
> index 7b6c66b..7b9e67e 100644
> --- a/fs/xfs/libxfs/xfs_alloc.h
> +++ b/fs/xfs/libxfs/xfs_alloc.h
> @@ -54,6 +54,9 @@ typedef unsigned int xfs_alloctype_t;
>   */
>  #define	XFS_ALLOC_FLAG_TRYLOCK	0x00000001  /* use trylock for buffer locking */
>  #define	XFS_ALLOC_FLAG_FREEING	0x00000002  /* indicate caller is freeing extents*/
> +#define	XFS_ALLOC_FLAG_NORMAP	0x00000004  /* don't modify the rmapbt */
> +#define	XFS_ALLOC_FLAG_NOSHRINK	0x00000008  /* don't shrink the freelist */
> +
>  
>  /*
>   * Argument structure for xfs_alloc routines.
> 

* Re: [PATCH 049/119] xfs: enable the rmap btree functionality
  2016-06-17  1:23 ` [PATCH 049/119] xfs: enable the rmap btree functionality Darrick J. Wong
@ 2016-07-18 13:34   ` Brian Foster
  0 siblings, 0 replies; 236+ messages in thread
From: Brian Foster @ 2016-07-18 13:34 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, vishal.l.verma, Dave Chinner, xfs

On Thu, Jun 16, 2016 at 06:23:05PM -0700, Darrick J. Wong wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Add the feature flag to the supported matrix so that the kernel can
> mount and use rmap btree enabled filesystems
> 
> v2: Move the EXPERIMENTAL message to fill_super so it only prints once.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> [darrick.wong@oracle.com: move the experimental tag]
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/libxfs/xfs_format.h |    3 ++-
>  fs/xfs/xfs_super.c         |    4 ++++
>  2 files changed, 6 insertions(+), 1 deletion(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> index 6efc7a3..1b08237 100644
> --- a/fs/xfs/libxfs/xfs_format.h
> +++ b/fs/xfs/libxfs/xfs_format.h
> @@ -457,7 +457,8 @@ xfs_sb_has_compat_feature(
>  #define XFS_SB_FEAT_RO_COMPAT_FINOBT   (1 << 0)		/* free inode btree */
>  #define XFS_SB_FEAT_RO_COMPAT_RMAPBT   (1 << 1)		/* reverse map btree */
>  #define XFS_SB_FEAT_RO_COMPAT_ALL \
> -		(XFS_SB_FEAT_RO_COMPAT_FINOBT)
> +		(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
> +		 XFS_SB_FEAT_RO_COMPAT_RMAPBT)
>  #define XFS_SB_FEAT_RO_COMPAT_UNKNOWN	~XFS_SB_FEAT_RO_COMPAT_ALL
>  static inline bool
>  xfs_sb_has_ro_compat_feature(
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index a8300e4..9328821 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1571,6 +1571,10 @@ xfs_fs_fill_super(
>  		xfs_alert(mp,
>  	"EXPERIMENTAL sparse inode feature enabled. Use at your own risk!");
>  
> +	if (xfs_sb_version_hasrmapbt(&mp->m_sb))
> +		xfs_alert(mp,
> +	"EXPERIMENTAL reverse mapping btree feature enabled. Use at your own risk!");
> +
>  	error = xfs_mountfs(mp);
>  	if (error)
>  		goto out_filestream_unmount;
> 

* Re: [PATCH 048/119] xfs: don't update rmapbt when fixing agfl
  2016-07-18 13:34   ` Brian Foster
@ 2016-07-18 15:53     ` Darrick J. Wong
  0 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-07-18 15:53 UTC (permalink / raw)
  To: Brian Foster; +Cc: david, linux-fsdevel, vishal.l.verma, xfs

On Mon, Jul 18, 2016 at 09:34:34AM -0400, Brian Foster wrote:
> On Thu, Jun 16, 2016 at 06:22:59PM -0700, Darrick J. Wong wrote:
> > Allow a caller of xfs_alloc_fix_freelist to disable rmapbt updates
> > when fixing the AG freelist.  xfs_repair needs this during phase 5
> > to be able to adjust the freelist while it's reconstructing the rmap
> > btree; the missing entries will be added back at the very end of
> > phase 5 once the AGFL contents settle down.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/libxfs/xfs_alloc.c |   40 ++++++++++++++++++++++++++--------------
> >  fs/xfs/libxfs/xfs_alloc.h |    3 +++
> >  2 files changed, 29 insertions(+), 14 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> > index 4c8ffd4..6eabab1 100644
> > --- a/fs/xfs/libxfs/xfs_alloc.c
> > +++ b/fs/xfs/libxfs/xfs_alloc.c
> > @@ -2092,26 +2092,38 @@ xfs_alloc_fix_freelist(
> >  	 * anything other than extra overhead when we need to put more blocks
> >  	 * back on the free list? Maybe we should only do this when space is
> >  	 * getting low or the AGFL is more than half full?
> > +	 *
> > +	 * The NOSHRINK flag prevents the AGFL from being shrunk if it's too
> > +	 * big; the NORMAP flag prevents AGFL expand/shrink operations from
> > +	 * updating the rmapbt.  Both flags are used in xfs_repair while we're
> > +	 * rebuilding the rmapbt, and neither are used by the kernel.  They're
> > +	 * both required to ensure that rmaps are correctly recorded for the
> > +	 * regenerated AGFL, bnobt, and cntbt.  See repair/phase5.c and
> > +	 * repair/rmap.c in xfsprogs for details.
> >  	 */
> > -	xfs_rmap_ag_owner(&targs.oinfo, XFS_RMAP_OWN_AG);
> > -	while (pag->pagf_flcount > need) {
> > -		struct xfs_buf	*bp;
> > +	memset(&targs, 0, sizeof(targs));
> > +	if (!(flags & XFS_ALLOC_FLAG_NOSHRINK)) {
> > +		if (!(flags & XFS_ALLOC_FLAG_NORMAP))
> > +			xfs_rmap_ag_owner(&targs.oinfo, XFS_RMAP_OWN_AG);
> 
> Can we get away with setting targs.oinfo once rather than here and
> below? If so, I think something like the following might clean this up a
> bit and save some indentation:
> 
> 	memset(targs, 0, ...);
> 	if (!(flags & NORMAP))
> 		xfs_rmap_ag_owner(...);
> 	while (!(flags & NOSHRINK) &&
> 	       flcount > need) {
> 		...
> 	}
> 	...
> 
> Hm?

Yeah, I think that is the case.  In the end it'll look like:

memset(targs, 0...);
if (flags & NORMAP)
	xfs_rmap_skip_update(&targs.oinfo);
else
	xfs_rmap_ag_owner(&targs.oinfo...);
while (!(flags & NOSHRINK) && flcount > need) {
	...
}

--D

> 
> Brian
> 
> > +		while (pag->pagf_flcount > need) {
> > +			struct xfs_buf	*bp;
> >  
> > -		error = xfs_alloc_get_freelist(tp, agbp, &bno, 0);
> > -		if (error)
> > -			goto out_agbp_relse;
> > -		error = xfs_free_ag_extent(tp, agbp, args->agno, bno, 1,
> > -					   &targs.oinfo, 1);
> > -		if (error)
> > -			goto out_agbp_relse;
> > -		bp = xfs_btree_get_bufs(mp, tp, args->agno, bno, 0);
> > -		xfs_trans_binval(tp, bp);
> > +			error = xfs_alloc_get_freelist(tp, agbp, &bno, 0);
> > +			if (error)
> > +				goto out_agbp_relse;
> > +			error = xfs_free_ag_extent(tp, agbp, args->agno, bno, 1,
> > +						   &targs.oinfo, 1);
> > +			if (error)
> > +				goto out_agbp_relse;
> > +			bp = xfs_btree_get_bufs(mp, tp, args->agno, bno, 0);
> > +			xfs_trans_binval(tp, bp);
> > +		}
> >  	}
> >  
> > -	memset(&targs, 0, sizeof(targs));
> >  	targs.tp = tp;
> >  	targs.mp = mp;
> > -	xfs_rmap_ag_owner(&targs.oinfo, XFS_RMAP_OWN_AG);
> > +	if (!(flags & XFS_ALLOC_FLAG_NORMAP))
> > +		xfs_rmap_ag_owner(&targs.oinfo, XFS_RMAP_OWN_AG);
> >  	targs.agbp = agbp;
> >  	targs.agno = args->agno;
> >  	targs.alignment = targs.minlen = targs.prod = targs.isfl = 1;
> > diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
> > index 7b6c66b..7b9e67e 100644
> > --- a/fs/xfs/libxfs/xfs_alloc.h
> > +++ b/fs/xfs/libxfs/xfs_alloc.h
> > @@ -54,6 +54,9 @@ typedef unsigned int xfs_alloctype_t;
> >   */
> >  #define	XFS_ALLOC_FLAG_TRYLOCK	0x00000001  /* use trylock for buffer locking */
> >  #define	XFS_ALLOC_FLAG_FREEING	0x00000002  /* indicate caller is freeing extents*/
> > +#define	XFS_ALLOC_FLAG_NORMAP	0x00000004  /* don't modify the rmapbt */
> > +#define	XFS_ALLOC_FLAG_NOSHRINK	0x00000008  /* don't shrink the freelist */
> > +
> >  
> >  /*
> >   * Argument structure for xfs_alloc routines.
> > 

* Re: [PATCH 047/119] xfs: disable XFS_IOC_SWAPEXT when rmap btree is enabled
  2016-07-18 13:34   ` Brian Foster
@ 2016-07-18 16:18     ` Darrick J. Wong
  0 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-07-18 16:18 UTC (permalink / raw)
  To: Brian Foster; +Cc: david, linux-fsdevel, vishal.l.verma, Dave Chinner, xfs

On Mon, Jul 18, 2016 at 09:34:28AM -0400, Brian Foster wrote:
> On Thu, Jun 16, 2016 at 06:22:53PM -0700, Darrick J. Wong wrote:
> > Swapping extents between two inodes requires the owner to be updated
> > in the rmap tree for all the extents that are swapped. This code
> > does not yet exist, so switch off the XFS_IOC_SWAPEXT ioctl until
> > support has been implemented. This will need to be done before the
> > rmap btree code can have the experimental tag removed.

"This functionality will be provided in a (much) later patch, as the rmap
implementation uses a few parts of the reflink functionality to accomplish its
means."

> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > [darrick.wong@oracle.com: fix extent swapping when rmap enabled]

"[darrick: update commit log]"

--D

> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> 
> Reviewed-by: Brian Foster <bfoster@redhat.com>
> 
> >  fs/xfs/xfs_bmap_util.c |    4 ++++
> >  1 file changed, 4 insertions(+)
> > 
> > 
> > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> > index 450fd49..8666873 100644
> > --- a/fs/xfs/xfs_bmap_util.c
> > +++ b/fs/xfs/xfs_bmap_util.c
> > @@ -1618,6 +1618,10 @@ xfs_swap_extents(
> >  	__uint64_t	tmp;
> >  	int		lock_flags;
> >  
> > +	/* XXX: we can't do this with rmap, will fix later */
> > +	if (xfs_sb_version_hasrmapbt(&mp->m_sb))
> > +		return -EOPNOTSUPP;
> > +
> >  	tempifp = kmem_alloc(sizeof(xfs_ifork_t), KM_MAYFAIL);
> >  	if (!tempifp) {
> >  		error = -ENOMEM;
> > 

* Re: [PATCH 044/119] xfs: propagate bmap updates to rmapbt
  2016-07-18 12:55       ` Brian Foster
@ 2016-07-19  1:53         ` Darrick J. Wong
  2016-07-19 11:37           ` Brian Foster
  0 siblings, 1 reply; 236+ messages in thread
From: Darrick J. Wong @ 2016-07-19  1:53 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, vishal.l.verma, xfs

On Mon, Jul 18, 2016 at 08:55:29AM -0400, Brian Foster wrote:
> On Sat, Jul 16, 2016 at 12:26:21AM -0700, Darrick J. Wong wrote:
> > On Fri, Jul 15, 2016 at 02:33:56PM -0400, Brian Foster wrote:
> > > On Thu, Jun 16, 2016 at 06:22:34PM -0700, Darrick J. Wong wrote:
> > > > When we map, unmap, or convert an extent in a file's data or attr
> > > > fork, schedule a respective update in the rmapbt.  Previous versions
> > > > of this patch required a 1:1 correspondence between bmap and rmap,
> > > > but this is no longer true.
> > > > 
> > > > v2: Remove the 1:1 correspondence requirement now that we have the
> > > > ability to make interval queries against the rmapbt.  Update the
> > > > commit message to reflect the broad restructuring of this patch.
> > > > Fix the bmap shift code to adjust the rmaps correctly.
> > > > 
> > > > v3: Use the deferred operations code to handle redo operations
> > > > atomically and deadlock free.  Plumb in all five rmap actions
> > > > (map, unmap, convert extent, alloc, free); we'll use the first
> > > > three now for file data, and reflink will want the last two.
> > > > Add an error injection site to test log recovery.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > ---
> > > >  fs/xfs/libxfs/xfs_bmap.c       |   56 ++++++++-
> > > >  fs/xfs/libxfs/xfs_rmap.c       |  252 ++++++++++++++++++++++++++++++++++++++++
> > > >  fs/xfs/libxfs/xfs_rmap_btree.h |   24 ++++
> > > >  fs/xfs/xfs_bmap_util.c         |    1 
> > > >  fs/xfs/xfs_defer_item.c        |    6 +
> > > >  fs/xfs/xfs_error.h             |    4 -
> > > >  fs/xfs/xfs_log_recover.c       |   56 +++++++++
> > > >  fs/xfs/xfs_trans.h             |    3 
> > > >  fs/xfs/xfs_trans_rmap.c        |    7 +
> > > >  9 files changed, 393 insertions(+), 16 deletions(-)
> > > > 
> > > > 
> ...
> > > > diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
> > > > index 76fc5c2..f179ea4 100644
> > > > --- a/fs/xfs/libxfs/xfs_rmap.c
> > > > +++ b/fs/xfs/libxfs/xfs_rmap.c
> > > > @@ -36,6 +36,8 @@
> > > >  #include "xfs_trace.h"
> > > >  #include "xfs_error.h"
> > > >  #include "xfs_extent_busy.h"
> > > > +#include "xfs_bmap.h"
> > > > +#include "xfs_inode.h"
> > > >  
> > > >  /*
> > > >   * Lookup the first record less than or equal to [bno, len, owner, offset]
> > > > @@ -1212,3 +1214,253 @@ xfs_rmapbt_query_range(
> > > >  	return xfs_btree_query_range(cur, &low_brec, &high_brec,
> > > >  			xfs_rmapbt_query_range_helper, &query);
> > > >  }
> > > > +
> > > > +/* Clean up after calling xfs_rmap_finish_one. */
> > > > +void
> > > > +xfs_rmap_finish_one_cleanup(
> > > > +	struct xfs_trans	*tp,
> > > > +	struct xfs_btree_cur	*rcur,
> > > > +	int			error)
> > > > +{
> > > > +	struct xfs_buf		*agbp;
> > > > +
> > > > +	if (rcur == NULL)
> > > > +		return;
> > > > +	agbp = rcur->bc_private.a.agbp;
> > > > +	xfs_btree_del_cursor(rcur, error ? XFS_BTREE_ERROR : XFS_BTREE_NOERROR);
> > > > +	xfs_trans_brelse(tp, agbp);
> > > 
> > > Why unconditionally release the agbp (and not just on error)?
> > 
> > We grabbed the agbp (er, AGF buffer) to construct the rmapbt cursor, so we have
> > to free it after the cursor is deleted regardless of whether or not there's an
> > error.
> > 
> 
> It looks like it's attached to the transaction via xfs_trans_read_*(),
> which means it will be released properly on transaction commit. I don't
> think it's necessarily a bug because xfs_trans_brelse() bails out when
> the item is dirty, but it looks like a departure from how this is used
> elsewhere throughout XFS (when no modifications are made or otherwise as
> an error condition cleanup). E.g., see the similar pattern in
> xfs_free_extent().
> 
> Maybe I'm missing something.. was there a known issue that required this
> call, or had it always been there?

/me finally figures out that you're just wondering why I brelse the agbp
even when there *isn't* an error.  Yes, that's unnecessary, will change it
tomorrow.

--D

> 
> > > > +}
> > > > +
> > > > +/*
> > > > + * Process one of the deferred rmap operations.  We pass back the
> > > > + * btree cursor to maintain our lock on the rmapbt between calls.
> > > > + * This saves time and eliminates a buffer deadlock between the
> > > > + * superblock and the AGF because we'll always grab them in the same
> > > > + * order.
> > > > + */
> > > > +int
> > > > +xfs_rmap_finish_one(
> > > > +	struct xfs_trans		*tp,
> > > > +	enum xfs_rmap_intent_type	type,
> > > > +	__uint64_t			owner,
> > > > +	int				whichfork,
> > > > +	xfs_fileoff_t			startoff,
> > > > +	xfs_fsblock_t			startblock,
> > > > +	xfs_filblks_t			blockcount,
> > > > +	xfs_exntst_t			state,
> > > > +	struct xfs_btree_cur		**pcur)
> > > > +{
> > > > +	struct xfs_mount		*mp = tp->t_mountp;
> > > > +	struct xfs_btree_cur		*rcur;
> > > > +	struct xfs_buf			*agbp = NULL;
> > > > +	int				error = 0;
> > > > +	xfs_agnumber_t			agno;
> > > > +	struct xfs_owner_info		oinfo;
> > > > +	xfs_agblock_t			bno;
> > > > +	bool				unwritten;
> > > > +
> > > > +	agno = XFS_FSB_TO_AGNO(mp, startblock);
> > > > +	ASSERT(agno != NULLAGNUMBER);
> > > > +	bno = XFS_FSB_TO_AGBNO(mp, startblock);
> > > > +
> > > > +	trace_xfs_rmap_deferred(mp, agno, type, bno, owner, whichfork,
> > > > +			startoff, blockcount, state);
> > > > +
> > > > +	if (XFS_TEST_ERROR(false, mp,
> > > > +			XFS_ERRTAG_RMAP_FINISH_ONE,
> > > > +			XFS_RANDOM_RMAP_FINISH_ONE))
> > > > +		return -EIO;
> > > > +
> > > > +	/*
> > > > +	 * If we haven't gotten a cursor or the cursor AG doesn't match
> > > > +	 * the startblock, get one now.
> > > > +	 */
> > > > +	rcur = *pcur;
> > > > +	if (rcur != NULL && rcur->bc_private.a.agno != agno) {
> > > > +		xfs_rmap_finish_one_cleanup(tp, rcur, 0);
> > > > +		rcur = NULL;
> > > > +		*pcur = NULL;
> > > > +	}
> > > > +	if (rcur == NULL) {
> > > > +		error = xfs_free_extent_fix_freelist(tp, agno, &agbp);
> > > 
> > > Comment? Why is this here? (Maybe we should rename that function while
> > > we're at it..)
> > 
> > /*
> >  * Ensure the freelist is of a sufficient length to provide for any btree
> >  * splits that could happen when we make changes to the rmapbt.
> >  */
> > 
> > (I don't know why the function has that name; Dave supplied it.)
> > 
> > > > +		if (error)
> > > > +			return error;
> > > > +		if (!agbp)
> > > > +			return -EFSCORRUPTED;
> > > > +
> > > > +		rcur = xfs_rmapbt_init_cursor(mp, tp, agbp, agno);
> > > > +		if (!rcur) {
> > > > +			error = -ENOMEM;
> > > > +			goto out_cur;
> > > > +		}
> > > > +	}
> > > > +	*pcur = rcur;
> > > > +
> > > > +	xfs_rmap_ino_owner(&oinfo, owner, whichfork, startoff);
> > > > +	unwritten = state == XFS_EXT_UNWRITTEN;
> > > > +	bno = XFS_FSB_TO_AGBNO(rcur->bc_mp, startblock);
> > > > +
> > > > +	switch (type) {
> > > > +	case XFS_RMAP_MAP:
> > > > +		error = xfs_rmap_map(rcur, bno, blockcount, unwritten, &oinfo);
> > > > +		break;
> > > > +	case XFS_RMAP_UNMAP:
> > > > +		error = xfs_rmap_unmap(rcur, bno, blockcount, unwritten,
> > > > +				&oinfo);
> > > > +		break;
> > > > +	case XFS_RMAP_CONVERT:
> > > > +		error = xfs_rmap_convert(rcur, bno, blockcount, !unwritten,
> > > > +				&oinfo);
> > > > +		break;
> > > > +	case XFS_RMAP_ALLOC:
> > > > +		error = __xfs_rmap_alloc(rcur, bno, blockcount, unwritten,
> > > > +				&oinfo);
> > > > +		break;
> > > > +	case XFS_RMAP_FREE:
> > > > +		error = __xfs_rmap_free(rcur, bno, blockcount, unwritten,
> > > > +				&oinfo);
> > > > +		break;
> > > > +	default:
> > > > +		ASSERT(0);
> > > > +		error = -EFSCORRUPTED;
> > > > +	}
> > > > +	return error;
> > > > +
> > > > +out_cur:
> > > > +	xfs_trans_brelse(tp, agbp);
> > > > +
> > > > +	return error;
> > > > +}
> > > > +
> > > > +/*
> > > > + * Record a rmap intent; the list is kept sorted first by AG and then by
> > > > + * increasing age.
> > > > + */
> > > > +static int
> > > > +__xfs_rmap_add(
> > > > +	struct xfs_mount	*mp,
> > > > +	struct xfs_defer_ops	*dfops,
> > > > +	struct xfs_rmap_intent	*ri)
> > > > +{
> > > > +	struct xfs_rmap_intent	*new;
> > > > +
> > > > +	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
> > > > +		return 0;
> > > > +
> > > > +	trace_xfs_rmap_defer(mp, XFS_FSB_TO_AGNO(mp, ri->ri_bmap.br_startblock),
> > > > +			ri->ri_type,
> > > > +			XFS_FSB_TO_AGBNO(mp, ri->ri_bmap.br_startblock),
> > > > +			ri->ri_owner, ri->ri_whichfork,
> > > > +			ri->ri_bmap.br_startoff,
> > > > +			ri->ri_bmap.br_blockcount,
> > > > +			ri->ri_bmap.br_state);
> > > > +
> > > > +	new = kmem_zalloc(sizeof(struct xfs_rmap_intent), KM_SLEEP | KM_NOFS);
> > > > +	*new = *ri;
> > > > +
> > > > +	xfs_defer_add(dfops, XFS_DEFER_OPS_TYPE_RMAP, &new->ri_list);
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +/* Map an extent into a file. */
> > > > +int
> > > > +xfs_rmap_map_extent(
> > > > +	struct xfs_mount	*mp,
> > > > +	struct xfs_defer_ops	*dfops,
> > > > +	struct xfs_inode	*ip,
> > > > +	int			whichfork,
> > > > +	struct xfs_bmbt_irec	*PREV)
> > > > +{
> > > > +	struct xfs_rmap_intent	ri;
> > > > +
> > > > +	ri.ri_type = XFS_RMAP_MAP;
> > > > +	ri.ri_owner = ip->i_ino;
> > > > +	ri.ri_whichfork = whichfork;
> > > > +	ri.ri_bmap = *PREV;
> > > > +
> > > 
> > > I think we should probably initialize ri_list as well (maybe turn this
> > > into an xfs_rmap_init helper).
> > 
> > __xfs_rmap_add calls xfs_defer_add, which calls list_add_tail, which
> > initializes ri_list.  Could probably just make an _rmap_init helper that
> > allocates the structure, then have _rmap_*_extent fill out the new intent, and
> > make the _rmap_add function pass it to _defer_add, which I think is what you're
> > getting at.
> > 
> 
> I didn't mean to suggest it was a bug. It's more of a defensive thing
> than anything.

Oh, sure, it's not a bug at all, but it is a little goofy to initialize
a stack variable, then allocate a slab object, copy the stack variable's
contents into it, and push the result out for later processing.

(The dangers of repeatedly revising one's code. :))

--D

> 
> Brian
> 
> > > Also, for some reason it feels to me like the _hasrmapbt() feature check
> > > should be up at this level (or higher), rather than buried in
> > > __xfs_rmap_add(). I don't feel too strongly about that if others think
> > > differently, however.
> > 
> > <shrug> It probably ought to be in the higher level function.
> > 
> > > > +	return __xfs_rmap_add(mp, dfops, &ri);
> > > > +}
> > > > +
> > > > +/* Unmap an extent out of a file. */
> > > > +int
> > > > +xfs_rmap_unmap_extent(
> > > > +	struct xfs_mount	*mp,
> > > > +	struct xfs_defer_ops	*dfops,
> > > > +	struct xfs_inode	*ip,
> > > > +	int			whichfork,
> > > > +	struct xfs_bmbt_irec	*PREV)
> > > > +{
> > > > +	struct xfs_rmap_intent	ri;
> > > > +
> > > > +	ri.ri_type = XFS_RMAP_UNMAP;
> > > > +	ri.ri_owner = ip->i_ino;
> > > > +	ri.ri_whichfork = whichfork;
> > > > +	ri.ri_bmap = *PREV;
> > > > +
> > > > +	return __xfs_rmap_add(mp, dfops, &ri);
> > > > +}
> > > > +
> > > > +/* Convert a data fork extent from unwritten to real or vice versa. */
> > > > +int
> > > > +xfs_rmap_convert_extent(
> > > > +	struct xfs_mount	*mp,
> > > > +	struct xfs_defer_ops	*dfops,
> > > > +	struct xfs_inode	*ip,
> > > > +	int			whichfork,
> > > > +	struct xfs_bmbt_irec	*PREV)
> > > > +{
> > > > +	struct xfs_rmap_intent	ri;
> > > > +
> > > > +	ri.ri_type = XFS_RMAP_CONVERT;
> > > > +	ri.ri_owner = ip->i_ino;
> > > > +	ri.ri_whichfork = whichfork;
> > > > +	ri.ri_bmap = *PREV;
> > > > +
> > > > +	return __xfs_rmap_add(mp, dfops, &ri);
> > > > +}
> > > > +
> > > > +/* Schedule the creation of an rmap for non-file data. */
> > > > +int
> > > > +xfs_rmap_alloc_defer(
> > > 
> > > xfs_rmap_[alloc|free]_extent() like the others..?
> > 
> > Yeah.  The naming has shifted a bit over the past few revisions.
> > 
> > --D
> > 
> > > 
> > > Brian 
> > > 
> > > > +	struct xfs_mount	*mp,
> > > > +	struct xfs_defer_ops	*dfops,
> > > > +	xfs_agnumber_t		agno,
> > > > +	xfs_agblock_t		bno,
> > > > +	xfs_extlen_t		len,
> > > > +	__uint64_t		owner)
> > > > +{
> > > > +	struct xfs_rmap_intent	ri;
> > > > +
> > > > +	ri.ri_type = XFS_RMAP_ALLOC;
> > > > +	ri.ri_owner = owner;
> > > > +	ri.ri_whichfork = XFS_DATA_FORK;
> > > > +	ri.ri_bmap.br_startblock = XFS_AGB_TO_FSB(mp, agno, bno);
> > > > +	ri.ri_bmap.br_blockcount = len;
> > > > +	ri.ri_bmap.br_startoff = 0;
> > > > +	ri.ri_bmap.br_state = XFS_EXT_NORM;
> > > > +
> > > > +	return __xfs_rmap_add(mp, dfops, &ri);
> > > > +}
> > > > +
> > > > +/* Schedule the deletion of an rmap for non-file data. */
> > > > +int
> > > > +xfs_rmap_free_defer(
> > > > +	struct xfs_mount	*mp,
> > > > +	struct xfs_defer_ops	*dfops,
> > > > +	xfs_agnumber_t		agno,
> > > > +	xfs_agblock_t		bno,
> > > > +	xfs_extlen_t		len,
> > > > +	__uint64_t		owner)
> > > > +{
> > > > +	struct xfs_rmap_intent	ri;
> > > > +
> > > > +	ri.ri_type = XFS_RMAP_FREE;
> > > > +	ri.ri_owner = owner;
> > > > +	ri.ri_whichfork = XFS_DATA_FORK;
> > > > +	ri.ri_bmap.br_startblock = XFS_AGB_TO_FSB(mp, agno, bno);
> > > > +	ri.ri_bmap.br_blockcount = len;
> > > > +	ri.ri_bmap.br_startoff = 0;
> > > > +	ri.ri_bmap.br_state = XFS_EXT_NORM;
> > > > +
> > > > +	return __xfs_rmap_add(mp, dfops, &ri);
> > > > +}
> > > > diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
> > > > index aff60dc..5df406e 100644
> > > > --- a/fs/xfs/libxfs/xfs_rmap_btree.h
> > > > +++ b/fs/xfs/libxfs/xfs_rmap_btree.h
> > > > @@ -106,4 +106,28 @@ struct xfs_rmap_intent {
> > > >  	struct xfs_bmbt_irec			ri_bmap;
> > > >  };
> > > >  
> > > > +/* functions for updating the rmapbt based on bmbt map/unmap operations */
> > > > +int xfs_rmap_map_extent(struct xfs_mount *mp, struct xfs_defer_ops *dfops,
> > > > +		struct xfs_inode *ip, int whichfork,
> > > > +		struct xfs_bmbt_irec *imap);
> > > > +int xfs_rmap_unmap_extent(struct xfs_mount *mp, struct xfs_defer_ops *dfops,
> > > > +		struct xfs_inode *ip, int whichfork,
> > > > +		struct xfs_bmbt_irec *imap);
> > > > +int xfs_rmap_convert_extent(struct xfs_mount *mp, struct xfs_defer_ops *dfops,
> > > > +		struct xfs_inode *ip, int whichfork,
> > > > +		struct xfs_bmbt_irec *imap);
> > > > +int xfs_rmap_alloc_defer(struct xfs_mount *mp, struct xfs_defer_ops *dfops,
> > > > +		xfs_agnumber_t agno, xfs_agblock_t bno, xfs_extlen_t len,
> > > > +		__uint64_t owner);
> > > > +int xfs_rmap_free_defer(struct xfs_mount *mp, struct xfs_defer_ops *dfops,
> > > > +		xfs_agnumber_t agno, xfs_agblock_t bno, xfs_extlen_t len,
> > > > +		__uint64_t owner);
> > > > +
> > > > +void xfs_rmap_finish_one_cleanup(struct xfs_trans *tp,
> > > > +		struct xfs_btree_cur *rcur, int error);
> > > > +int xfs_rmap_finish_one(struct xfs_trans *tp, enum xfs_rmap_intent_type type,
> > > > +		__uint64_t owner, int whichfork, xfs_fileoff_t startoff,
> > > > +		xfs_fsblock_t startblock, xfs_filblks_t blockcount,
> > > > +		xfs_exntst_t state, struct xfs_btree_cur **pcur);
> > > > +
> > > >  #endif	/* __XFS_RMAP_BTREE_H__ */
> > > > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> > > > index 62d194e..450fd49 100644
> > > > --- a/fs/xfs/xfs_bmap_util.c
> > > > +++ b/fs/xfs/xfs_bmap_util.c
> > > > @@ -41,6 +41,7 @@
> > > >  #include "xfs_trace.h"
> > > >  #include "xfs_icache.h"
> > > >  #include "xfs_log.h"
> > > > +#include "xfs_rmap_btree.h"
> > > >  
> > > >  /* Kernel only BMAP related definitions and functions */
> > > >  
> > > > diff --git a/fs/xfs/xfs_defer_item.c b/fs/xfs/xfs_defer_item.c
> > > > index dbd10fc..9ed060d 100644
> > > > --- a/fs/xfs/xfs_defer_item.c
> > > > +++ b/fs/xfs/xfs_defer_item.c
> > > > @@ -213,7 +213,8 @@ xfs_rmap_update_finish_item(
> > > >  			rmap->ri_bmap.br_startoff,
> > > >  			rmap->ri_bmap.br_startblock,
> > > >  			rmap->ri_bmap.br_blockcount,
> > > > -			rmap->ri_bmap.br_state);
> > > > +			rmap->ri_bmap.br_state,
> > > > +			(struct xfs_btree_cur **)state);
> > > >  	kmem_free(rmap);
> > > >  	return error;
> > > >  }
> > > > @@ -225,6 +226,9 @@ xfs_rmap_update_finish_cleanup(
> > > >  	void			*state,
> > > >  	int			error)
> > > >  {
> > > > +	struct xfs_btree_cur	*rcur = state;
> > > > +
> > > > +	xfs_rmap_finish_one_cleanup(tp, rcur, error);
> > > >  }
> > > >  
> > > >  /* Abort all pending RUIs. */
> > > > diff --git a/fs/xfs/xfs_error.h b/fs/xfs/xfs_error.h
> > > > index ee4680e..6bc614c 100644
> > > > --- a/fs/xfs/xfs_error.h
> > > > +++ b/fs/xfs/xfs_error.h
> > > > @@ -91,7 +91,8 @@ extern void xfs_verifier_error(struct xfs_buf *bp);
> > > >  #define XFS_ERRTAG_DIOWRITE_IOERR			20
> > > >  #define XFS_ERRTAG_BMAPIFORMAT				21
> > > >  #define XFS_ERRTAG_FREE_EXTENT				22
> > > > -#define XFS_ERRTAG_MAX					23
> > > > +#define XFS_ERRTAG_RMAP_FINISH_ONE			23
> > > > +#define XFS_ERRTAG_MAX					24
> > > >  
> > > >  /*
> > > >   * Random factors for above tags, 1 means always, 2 means 1/2 time, etc.
> > > > @@ -119,6 +120,7 @@ extern void xfs_verifier_error(struct xfs_buf *bp);
> > > >  #define XFS_RANDOM_DIOWRITE_IOERR			(XFS_RANDOM_DEFAULT/10)
> > > >  #define	XFS_RANDOM_BMAPIFORMAT				XFS_RANDOM_DEFAULT
> > > >  #define XFS_RANDOM_FREE_EXTENT				1
> > > > +#define XFS_RANDOM_RMAP_FINISH_ONE			1
> > > >  
> > > >  #ifdef DEBUG
> > > >  extern int xfs_error_test_active;
> > > > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> > > > index c9fe0c4..f7f9635 100644
> > > > --- a/fs/xfs/xfs_log_recover.c
> > > > +++ b/fs/xfs/xfs_log_recover.c
> > > > @@ -45,6 +45,7 @@
> > > >  #include "xfs_error.h"
> > > >  #include "xfs_dir2.h"
> > > >  #include "xfs_rmap_item.h"
> > > > +#include "xfs_rmap_btree.h"
> > > >  
> > > >  #define BLK_AVG(blk1, blk2)	((blk1+blk2) >> 1)
> > > >  
> > > > @@ -4486,6 +4487,12 @@ xlog_recover_process_rui(
> > > >  	struct xfs_map_extent		*rmap;
> > > >  	xfs_fsblock_t			startblock_fsb;
> > > >  	bool				op_ok;
> > > > +	struct xfs_rud_log_item		*rudp;
> > > > +	enum xfs_rmap_intent_type	type;
> > > > +	int				whichfork;
> > > > +	xfs_exntst_t			state;
> > > > +	struct xfs_trans		*tp;
> > > > +	struct xfs_btree_cur		*rcur = NULL;
> > > >  
> > > >  	ASSERT(!test_bit(XFS_RUI_RECOVERED, &ruip->rui_flags));
> > > >  
> > > > @@ -4528,9 +4535,54 @@ xlog_recover_process_rui(
> > > >  		}
> > > >  	}
> > > >  
> > > > -	/* XXX: do nothing for now */
> > > > +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
> > > > +	if (error)
> > > > +		return error;
> > > > +	rudp = xfs_trans_get_rud(tp, ruip, ruip->rui_format.rui_nextents);
> > > > +
> > > > +	for (i = 0; i < ruip->rui_format.rui_nextents; i++) {
> > > > +		rmap = &(ruip->rui_format.rui_extents[i]);
> > > > +		state = (rmap->me_flags & XFS_RMAP_EXTENT_UNWRITTEN) ?
> > > > +				XFS_EXT_UNWRITTEN : XFS_EXT_NORM;
> > > > +		whichfork = (rmap->me_flags & XFS_RMAP_EXTENT_ATTR_FORK) ?
> > > > +				XFS_ATTR_FORK : XFS_DATA_FORK;
> > > > +		switch (rmap->me_flags & XFS_RMAP_EXTENT_TYPE_MASK) {
> > > > +		case XFS_RMAP_EXTENT_MAP:
> > > > +			type = XFS_RMAP_MAP;
> > > > +			break;
> > > > +		case XFS_RMAP_EXTENT_UNMAP:
> > > > +			type = XFS_RMAP_UNMAP;
> > > > +			break;
> > > > +		case XFS_RMAP_EXTENT_CONVERT:
> > > > +			type = XFS_RMAP_CONVERT;
> > > > +			break;
> > > > +		case XFS_RMAP_EXTENT_ALLOC:
> > > > +			type = XFS_RMAP_ALLOC;
> > > > +			break;
> > > > +		case XFS_RMAP_EXTENT_FREE:
> > > > +			type = XFS_RMAP_FREE;
> > > > +			break;
> > > > +		default:
> > > > +			error = -EFSCORRUPTED;
> > > > +			goto abort_error;
> > > > +		}
> > > > +		error = xfs_trans_log_finish_rmap_update(tp, rudp, type,
> > > > +				rmap->me_owner, whichfork,
> > > > +				rmap->me_startoff, rmap->me_startblock,
> > > > +				rmap->me_len, state, &rcur);
> > > > +		if (error)
> > > > +			goto abort_error;
> > > > +
> > > > +	}
> > > > +
> > > > +	xfs_rmap_finish_one_cleanup(tp, rcur, error);
> > > >  	set_bit(XFS_RUI_RECOVERED, &ruip->rui_flags);
> > > > -	xfs_rui_release(ruip);
> > > > +	error = xfs_trans_commit(tp);
> > > > +	return error;
> > > > +
> > > > +abort_error:
> > > > +	xfs_rmap_finish_one_cleanup(tp, rcur, error);
> > > > +	xfs_trans_cancel(tp);
> > > >  	return error;
> > > >  }
> > > >  
> > > > diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> > > > index c48be63..f59d934 100644
> > > > --- a/fs/xfs/xfs_trans.h
> > > > +++ b/fs/xfs/xfs_trans.h
> > > > @@ -244,12 +244,13 @@ void xfs_trans_log_start_rmap_update(struct xfs_trans *tp,
> > > >  		xfs_fsblock_t startblock, xfs_filblks_t blockcount,
> > > >  		xfs_exntst_t state);
> > > >  
> > > > +struct xfs_btree_cur;
> > > >  struct xfs_rud_log_item *xfs_trans_get_rud(struct xfs_trans *tp,
> > > >  		struct xfs_rui_log_item *ruip, uint nextents);
> > > >  int xfs_trans_log_finish_rmap_update(struct xfs_trans *tp,
> > > >  		struct xfs_rud_log_item *rudp, enum xfs_rmap_intent_type type,
> > > >  		__uint64_t owner, int whichfork, xfs_fileoff_t startoff,
> > > >  		xfs_fsblock_t startblock, xfs_filblks_t blockcount,
> > > > -		xfs_exntst_t state);
> > > > +		xfs_exntst_t state, struct xfs_btree_cur **pcur);
> > > >  
> > > >  #endif	/* __XFS_TRANS_H__ */
> > > > diff --git a/fs/xfs/xfs_trans_rmap.c b/fs/xfs/xfs_trans_rmap.c
> > > > index b55a725..0c0df18 100644
> > > > --- a/fs/xfs/xfs_trans_rmap.c
> > > > +++ b/fs/xfs/xfs_trans_rmap.c
> > > > @@ -170,14 +170,15 @@ xfs_trans_log_finish_rmap_update(
> > > >  	xfs_fileoff_t			startoff,
> > > >  	xfs_fsblock_t			startblock,
> > > >  	xfs_filblks_t			blockcount,
> > > > -	xfs_exntst_t			state)
> > > > +	xfs_exntst_t			state,
> > > > +	struct xfs_btree_cur		**pcur)
> > > >  {
> > > >  	uint				next_extent;
> > > >  	struct xfs_map_extent		*rmap;
> > > >  	int				error;
> > > >  
> > > > -	/* XXX: actually finish the rmap update here */
> > > > -	error = -EFSCORRUPTED;
> > > > +	error = xfs_rmap_finish_one(tp, type, owner, whichfork, startoff,
> > > > +			startblock, blockcount, state, pcur);
> > > >  
> > > >  	/*
> > > >  	 * Mark the transaction dirty, even on error. This ensures the
> > > > 
> > > > _______________________________________________
> > > > xfs mailing list
> > > > xfs@oss.sgi.com
> > > > http://oss.sgi.com/mailman/listinfo/xfs
> > 

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 044/119] xfs: propagate bmap updates to rmapbt
  2016-07-19  1:53         ` Darrick J. Wong
@ 2016-07-19 11:37           ` Brian Foster
  0 siblings, 0 replies; 236+ messages in thread
From: Brian Foster @ 2016-07-19 11:37 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, vishal.l.verma, xfs

On Mon, Jul 18, 2016 at 06:53:41PM -0700, Darrick J. Wong wrote:
> On Mon, Jul 18, 2016 at 08:55:29AM -0400, Brian Foster wrote:
> > On Sat, Jul 16, 2016 at 12:26:21AM -0700, Darrick J. Wong wrote:
> > > On Fri, Jul 15, 2016 at 02:33:56PM -0400, Brian Foster wrote:
> > > > On Thu, Jun 16, 2016 at 06:22:34PM -0700, Darrick J. Wong wrote:
> > > > > When we map, unmap, or convert an extent in a file's data or attr
> > > > > fork, schedule a respective update in the rmapbt.  Previous versions
> > > > > of this patch required a 1:1 correspondence between bmap and rmap,
> > > > > but this is no longer true.
> > > > > 
> > > > > v2: Remove the 1:1 correspondence requirement now that we have the
> > > > > ability to make interval queries against the rmapbt.  Update the
> > > > > commit message to reflect the broad restructuring of this patch.
> > > > > Fix the bmap shift code to adjust the rmaps correctly.
> > > > > 
> > > > > v3: Use the deferred operations code to handle redo operations
> > > > > atomically and deadlock free.  Plumb in all five rmap actions
> > > > > (map, unmap, convert extent, alloc, free); we'll use the first
> > > > > three now for file data, and reflink will want the last two.
> > > > > Add an error injection site to test log recovery.
> > > > > 
> > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > ---
> > > > >  fs/xfs/libxfs/xfs_bmap.c       |   56 ++++++++-
> > > > >  fs/xfs/libxfs/xfs_rmap.c       |  252 ++++++++++++++++++++++++++++++++++++++++
> > > > >  fs/xfs/libxfs/xfs_rmap_btree.h |   24 ++++
> > > > >  fs/xfs/xfs_bmap_util.c         |    1 
> > > > >  fs/xfs/xfs_defer_item.c        |    6 +
> > > > >  fs/xfs/xfs_error.h             |    4 -
> > > > >  fs/xfs/xfs_log_recover.c       |   56 +++++++++
> > > > >  fs/xfs/xfs_trans.h             |    3 
> > > > >  fs/xfs/xfs_trans_rmap.c        |    7 +
> > > > >  9 files changed, 393 insertions(+), 16 deletions(-)
> > > > > 
> > > > > 
> > ...
> > > > > diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
> > > > > index 76fc5c2..f179ea4 100644
> > > > > --- a/fs/xfs/libxfs/xfs_rmap.c
> > > > > +++ b/fs/xfs/libxfs/xfs_rmap.c
> > > > > @@ -36,6 +36,8 @@
> > > > >  #include "xfs_trace.h"
> > > > >  #include "xfs_error.h"
> > > > >  #include "xfs_extent_busy.h"
> > > > > +#include "xfs_bmap.h"
> > > > > +#include "xfs_inode.h"
> > > > >  
> > > > >  /*
> > > > >   * Lookup the first record less than or equal to [bno, len, owner, offset]
> > > > > @@ -1212,3 +1214,253 @@ xfs_rmapbt_query_range(
> > > > >  	return xfs_btree_query_range(cur, &low_brec, &high_brec,
> > > > >  			xfs_rmapbt_query_range_helper, &query);
> > > > >  }
> > > > > +
> > > > > +/* Clean up after calling xfs_rmap_finish_one. */
> > > > > +void
> > > > > +xfs_rmap_finish_one_cleanup(
> > > > > +	struct xfs_trans	*tp,
> > > > > +	struct xfs_btree_cur	*rcur,
> > > > > +	int			error)
> > > > > +{
> > > > > +	struct xfs_buf		*agbp;
> > > > > +
> > > > > +	if (rcur == NULL)
> > > > > +		return;
> > > > > +	agbp = rcur->bc_private.a.agbp;
> > > > > +	xfs_btree_del_cursor(rcur, error ? XFS_BTREE_ERROR : XFS_BTREE_NOERROR);
> > > > > +	xfs_trans_brelse(tp, agbp);
> > > > 
> > > > Why unconditionally release the agbp (and not just on error)?
> > > 
> > > We grabbed the agbp (er, AGF buffer) to construct the rmapbt cursor, so we have
> > > to free it after the cursor is deleted regardless of whether or not there's an
> > > error.
> > > 
> > 
> > It looks like it's attached to the transaction via xfs_trans_read_*(),
> > which means it will be released properly on transaction commit. I don't
> > think it's necessarily a bug because xfs_trans_brelse() bails out when
> > the item is dirty, but it looks like a departure from how this is used
> > elsewhere throughout XFS (when no modifications are made or otherwise as
> > an error condition cleanup). E.g., see the similar pattern in
> > xfs_free_extent().
> > 
> > Maybe I'm missing something.. was there a known issue that required this
> > call, or had it always been there?
> 
> /me finally figures out that you're just wondering why I brelse the agbp
> even when there *isn't* an error.  Yes, that's unnecessary, will change it
> tomorrow.
> 

Ok!

> --D
> 
> > 
> > > > > +}
> > > > > +
> > > > > +/*
> > > > > + * Process one of the deferred rmap operations.  We pass back the
> > > > > + * btree cursor to maintain our lock on the rmapbt between calls.
> > > > > + * This saves time and eliminates a buffer deadlock between the
> > > > > + * superblock and the AGF because we'll always grab them in the same
> > > > > + * order.
> > > > > + */
> > > > > +int
> > > > > +xfs_rmap_finish_one(
> > > > > +	struct xfs_trans		*tp,
> > > > > +	enum xfs_rmap_intent_type	type,
> > > > > +	__uint64_t			owner,
> > > > > +	int				whichfork,
> > > > > +	xfs_fileoff_t			startoff,
> > > > > +	xfs_fsblock_t			startblock,
> > > > > +	xfs_filblks_t			blockcount,
> > > > > +	xfs_exntst_t			state,
> > > > > +	struct xfs_btree_cur		**pcur)
> > > > > +{
> > > > > +	struct xfs_mount		*mp = tp->t_mountp;
> > > > > +	struct xfs_btree_cur		*rcur;
> > > > > +	struct xfs_buf			*agbp = NULL;
> > > > > +	int				error = 0;
> > > > > +	xfs_agnumber_t			agno;
> > > > > +	struct xfs_owner_info		oinfo;
> > > > > +	xfs_agblock_t			bno;
> > > > > +	bool				unwritten;
> > > > > +
> > > > > +	agno = XFS_FSB_TO_AGNO(mp, startblock);
> > > > > +	ASSERT(agno != NULLAGNUMBER);
> > > > > +	bno = XFS_FSB_TO_AGBNO(mp, startblock);
> > > > > +
> > > > > +	trace_xfs_rmap_deferred(mp, agno, type, bno, owner, whichfork,
> > > > > +			startoff, blockcount, state);
> > > > > +
> > > > > +	if (XFS_TEST_ERROR(false, mp,
> > > > > +			XFS_ERRTAG_RMAP_FINISH_ONE,
> > > > > +			XFS_RANDOM_RMAP_FINISH_ONE))
> > > > > +		return -EIO;
> > > > > +
> > > > > +	/*
> > > > > +	 * If we haven't gotten a cursor or the cursor AG doesn't match
> > > > > +	 * the startblock, get one now.
> > > > > +	 */
> > > > > +	rcur = *pcur;
> > > > > +	if (rcur != NULL && rcur->bc_private.a.agno != agno) {
> > > > > +		xfs_rmap_finish_one_cleanup(tp, rcur, 0);
> > > > > +		rcur = NULL;
> > > > > +		*pcur = NULL;
> > > > > +	}
> > > > > +	if (rcur == NULL) {
> > > > > +		error = xfs_free_extent_fix_freelist(tp, agno, &agbp);
> > > > 
> > > > Comment? Why is this here? (Maybe we should rename that function while
> > > > we're at it..)
> > > 
> > > /*
> > >  * Ensure the freelist is of a sufficient length to provide for any btree
> > >  * splits that could happen when we make changes to the rmapbt.
> > >  */
> > > 
> > > (I don't know why the function has that name; Dave supplied it.)
> > > 
> > > > > +		if (error)
> > > > > +			return error;
> > > > > +		if (!agbp)
> > > > > +			return -EFSCORRUPTED;
> > > > > +
> > > > > +		rcur = xfs_rmapbt_init_cursor(mp, tp, agbp, agno);
> > > > > +		if (!rcur) {
> > > > > +			error = -ENOMEM;
> > > > > +			goto out_cur;
> > > > > +		}
> > > > > +	}
> > > > > +	*pcur = rcur;
> > > > > +
> > > > > +	xfs_rmap_ino_owner(&oinfo, owner, whichfork, startoff);
> > > > > +	unwritten = state == XFS_EXT_UNWRITTEN;
> > > > > +	bno = XFS_FSB_TO_AGBNO(rcur->bc_mp, startblock);
> > > > > +
> > > > > +	switch (type) {
> > > > > +	case XFS_RMAP_MAP:
> > > > > +		error = xfs_rmap_map(rcur, bno, blockcount, unwritten, &oinfo);
> > > > > +		break;
> > > > > +	case XFS_RMAP_UNMAP:
> > > > > +		error = xfs_rmap_unmap(rcur, bno, blockcount, unwritten,
> > > > > +				&oinfo);
> > > > > +		break;
> > > > > +	case XFS_RMAP_CONVERT:
> > > > > +		error = xfs_rmap_convert(rcur, bno, blockcount, !unwritten,
> > > > > +				&oinfo);
> > > > > +		break;
> > > > > +	case XFS_RMAP_ALLOC:
> > > > > +		error = __xfs_rmap_alloc(rcur, bno, blockcount, unwritten,
> > > > > +				&oinfo);
> > > > > +		break;
> > > > > +	case XFS_RMAP_FREE:
> > > > > +		error = __xfs_rmap_free(rcur, bno, blockcount, unwritten,
> > > > > +				&oinfo);
> > > > > +		break;
> > > > > +	default:
> > > > > +		ASSERT(0);
> > > > > +		error = -EFSCORRUPTED;
> > > > > +	}
> > > > > +	return error;
> > > > > +
> > > > > +out_cur:
> > > > > +	xfs_trans_brelse(tp, agbp);
> > > > > +
> > > > > +	return error;
> > > > > +}
> > > > > +
> > > > > +/*
> > > > > + * Record a rmap intent; the list is kept sorted first by AG and then by
> > > > > + * increasing age.
> > > > > + */
> > > > > +static int
> > > > > +__xfs_rmap_add(
> > > > > +	struct xfs_mount	*mp,
> > > > > +	struct xfs_defer_ops	*dfops,
> > > > > +	struct xfs_rmap_intent	*ri)
> > > > > +{
> > > > > +	struct xfs_rmap_intent	*new;
> > > > > +
> > > > > +	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
> > > > > +		return 0;
> > > > > +
> > > > > +	trace_xfs_rmap_defer(mp, XFS_FSB_TO_AGNO(mp, ri->ri_bmap.br_startblock),
> > > > > +			ri->ri_type,
> > > > > +			XFS_FSB_TO_AGBNO(mp, ri->ri_bmap.br_startblock),
> > > > > +			ri->ri_owner, ri->ri_whichfork,
> > > > > +			ri->ri_bmap.br_startoff,
> > > > > +			ri->ri_bmap.br_blockcount,
> > > > > +			ri->ri_bmap.br_state);
> > > > > +
> > > > > +	new = kmem_zalloc(sizeof(struct xfs_rmap_intent), KM_SLEEP | KM_NOFS);
> > > > > +	*new = *ri;
> > > > > +
> > > > > +	xfs_defer_add(dfops, XFS_DEFER_OPS_TYPE_RMAP, &new->ri_list);
> > > > > +	return 0;
> > > > > +}
> > > > > +
> > > > > +/* Map an extent into a file. */
> > > > > +int
> > > > > +xfs_rmap_map_extent(
> > > > > +	struct xfs_mount	*mp,
> > > > > +	struct xfs_defer_ops	*dfops,
> > > > > +	struct xfs_inode	*ip,
> > > > > +	int			whichfork,
> > > > > +	struct xfs_bmbt_irec	*PREV)
> > > > > +{
> > > > > +	struct xfs_rmap_intent	ri;
> > > > > +
> > > > > +	ri.ri_type = XFS_RMAP_MAP;
> > > > > +	ri.ri_owner = ip->i_ino;
> > > > > +	ri.ri_whichfork = whichfork;
> > > > > +	ri.ri_bmap = *PREV;
> > > > > +
> > > > 
> > > > I think we should probably initialize ri_list as well (maybe turn this
> > > > into an xfs_rmap_init helper).
> > > 
> > > __xfs_rmap_add calls xfs_defer_add, which calls list_add_tail, which
> > > initializes ri_list.  Could probably just make an _rmap_init helper that
> > > allocates the structure, then have _rmap_*_extent fill out the new intent, and
> > > make the _rmap_add function pass it to _defer_add, which I think is what you're
> > > getting at.
> > > 
> > 
> > I didn't mean to suggest it was a bug. It's more of a defensive thing
> > than anything.
> 
> Oh, sure, it's not a bug at all, but it is a little goofy to initialize
> a stack variable, then allocate a slab object and copy the stack variable's
> contents into the slab object and then push it out for later processing.
> 

Perhaps.. but that seems irrelevant to me. What gives me pause is that
we basically pass off stack cruft to another "subsystem" in the fs.
E.g., we do the following in __xfs_rmap_add():

	...
	new = kmem_zalloc(...);
	*new = *ri;

	xfs_defer_add(..., &new->ri_list);
	return 0;

So the separate memory object is irrelevant. All I'm basically saying is
I think we should pass initialized content across this kind of boundary.
AFAICT, it should be perfectly sane to ASSERT(list_empty(li)) in
xfs_defer_add(), for example, which might help prevent new callers from
erroneously reusing items, etc. (now that I look again, we might want to
error check 'type' as well since it indexes an array).

So it isn't currently a bug, but it's an easily avoidable landmine IMO.

Brian

> (The dangers of repeatedly revising one's code. :))
> 
> --D
> 
> > 
> > Brian
> > 
> > > > Also, for some reason it feels to me like the _hasrmapbt() feature check
> > > > should be up at this level (or higher), rather than buried in
> > > > __xfs_rmap_add(). I don't feel too strongly about that if others think
> > > > differently, however.
> > > 
> > > <shrug> It probably ought to be in the higher level function.
> > > 
> > > > > +	return __xfs_rmap_add(mp, dfops, &ri);
> > > > > +}
> > > > > +
> > > > > +/* Unmap an extent out of a file. */
> > > > > +int
> > > > > +xfs_rmap_unmap_extent(
> > > > > +	struct xfs_mount	*mp,
> > > > > +	struct xfs_defer_ops	*dfops,
> > > > > +	struct xfs_inode	*ip,
> > > > > +	int			whichfork,
> > > > > +	struct xfs_bmbt_irec	*PREV)
> > > > > +{
> > > > > +	struct xfs_rmap_intent	ri;
> > > > > +
> > > > > +	ri.ri_type = XFS_RMAP_UNMAP;
> > > > > +	ri.ri_owner = ip->i_ino;
> > > > > +	ri.ri_whichfork = whichfork;
> > > > > +	ri.ri_bmap = *PREV;
> > > > > +
> > > > > +	return __xfs_rmap_add(mp, dfops, &ri);
> > > > > +}
> > > > > +
> > > > > +/* Convert a data fork extent from unwritten to real or vice versa. */
> > > > > +int
> > > > > +xfs_rmap_convert_extent(
> > > > > +	struct xfs_mount	*mp,
> > > > > +	struct xfs_defer_ops	*dfops,
> > > > > +	struct xfs_inode	*ip,
> > > > > +	int			whichfork,
> > > > > +	struct xfs_bmbt_irec	*PREV)
> > > > > +{
> > > > > +	struct xfs_rmap_intent	ri;
> > > > > +
> > > > > +	ri.ri_type = XFS_RMAP_CONVERT;
> > > > > +	ri.ri_owner = ip->i_ino;
> > > > > +	ri.ri_whichfork = whichfork;
> > > > > +	ri.ri_bmap = *PREV;
> > > > > +
> > > > > +	return __xfs_rmap_add(mp, dfops, &ri);
> > > > > +}
> > > > > +
> > > > > +/* Schedule the creation of an rmap for non-file data. */
> > > > > +int
> > > > > +xfs_rmap_alloc_defer(
> > > > 
> > > > xfs_rmap_[alloc|free]_extent() like the others..?
> > > 
> > > Yeah.  The naming has shifted a bit over the past few revisions.
> > > 
> > > --D
> > > 
> > > > 
> > > > Brian 
> > > > 
> > > > > +	struct xfs_mount	*mp,
> > > > > +	struct xfs_defer_ops	*dfops,
> > > > > +	xfs_agnumber_t		agno,
> > > > > +	xfs_agblock_t		bno,
> > > > > +	xfs_extlen_t		len,
> > > > > +	__uint64_t		owner)
> > > > > +{
> > > > > +	struct xfs_rmap_intent	ri;
> > > > > +
> > > > > +	ri.ri_type = XFS_RMAP_ALLOC;
> > > > > +	ri.ri_owner = owner;
> > > > > +	ri.ri_whichfork = XFS_DATA_FORK;
> > > > > +	ri.ri_bmap.br_startblock = XFS_AGB_TO_FSB(mp, agno, bno);
> > > > > +	ri.ri_bmap.br_blockcount = len;
> > > > > +	ri.ri_bmap.br_startoff = 0;
> > > > > +	ri.ri_bmap.br_state = XFS_EXT_NORM;
> > > > > +
> > > > > +	return __xfs_rmap_add(mp, dfops, &ri);
> > > > > +}
> > > > > +
> > > > > +/* Schedule the deletion of an rmap for non-file data. */
> > > > > +int
> > > > > +xfs_rmap_free_defer(
> > > > > +	struct xfs_mount	*mp,
> > > > > +	struct xfs_defer_ops	*dfops,
> > > > > +	xfs_agnumber_t		agno,
> > > > > +	xfs_agblock_t		bno,
> > > > > +	xfs_extlen_t		len,
> > > > > +	__uint64_t		owner)
> > > > > +{
> > > > > +	struct xfs_rmap_intent	ri;
> > > > > +
> > > > > +	ri.ri_type = XFS_RMAP_FREE;
> > > > > +	ri.ri_owner = owner;
> > > > > +	ri.ri_whichfork = XFS_DATA_FORK;
> > > > > +	ri.ri_bmap.br_startblock = XFS_AGB_TO_FSB(mp, agno, bno);
> > > > > +	ri.ri_bmap.br_blockcount = len;
> > > > > +	ri.ri_bmap.br_startoff = 0;
> > > > > +	ri.ri_bmap.br_state = XFS_EXT_NORM;
> > > > > +
> > > > > +	return __xfs_rmap_add(mp, dfops, &ri);
> > > > > +}
> > > > > diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
> > > > > index aff60dc..5df406e 100644
> > > > > --- a/fs/xfs/libxfs/xfs_rmap_btree.h
> > > > > +++ b/fs/xfs/libxfs/xfs_rmap_btree.h
> > > > > @@ -106,4 +106,28 @@ struct xfs_rmap_intent {
> > > > >  	struct xfs_bmbt_irec			ri_bmap;
> > > > >  };
> > > > >  
> > > > > +/* functions for updating the rmapbt based on bmbt map/unmap operations */
> > > > > +int xfs_rmap_map_extent(struct xfs_mount *mp, struct xfs_defer_ops *dfops,
> > > > > +		struct xfs_inode *ip, int whichfork,
> > > > > +		struct xfs_bmbt_irec *imap);
> > > > > +int xfs_rmap_unmap_extent(struct xfs_mount *mp, struct xfs_defer_ops *dfops,
> > > > > +		struct xfs_inode *ip, int whichfork,
> > > > > +		struct xfs_bmbt_irec *imap);
> > > > > +int xfs_rmap_convert_extent(struct xfs_mount *mp, struct xfs_defer_ops *dfops,
> > > > > +		struct xfs_inode *ip, int whichfork,
> > > > > +		struct xfs_bmbt_irec *imap);
> > > > > +int xfs_rmap_alloc_defer(struct xfs_mount *mp, struct xfs_defer_ops *dfops,
> > > > > +		xfs_agnumber_t agno, xfs_agblock_t bno, xfs_extlen_t len,
> > > > > +		__uint64_t owner);
> > > > > +int xfs_rmap_free_defer(struct xfs_mount *mp, struct xfs_defer_ops *dfops,
> > > > > +		xfs_agnumber_t agno, xfs_agblock_t bno, xfs_extlen_t len,
> > > > > +		__uint64_t owner);
> > > > > +
> > > > > +void xfs_rmap_finish_one_cleanup(struct xfs_trans *tp,
> > > > > +		struct xfs_btree_cur *rcur, int error);
> > > > > +int xfs_rmap_finish_one(struct xfs_trans *tp, enum xfs_rmap_intent_type type,
> > > > > +		__uint64_t owner, int whichfork, xfs_fileoff_t startoff,
> > > > > +		xfs_fsblock_t startblock, xfs_filblks_t blockcount,
> > > > > +		xfs_exntst_t state, struct xfs_btree_cur **pcur);
> > > > > +
> > > > >  #endif	/* __XFS_RMAP_BTREE_H__ */
> > > > > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> > > > > index 62d194e..450fd49 100644
> > > > > --- a/fs/xfs/xfs_bmap_util.c
> > > > > +++ b/fs/xfs/xfs_bmap_util.c
> > > > > @@ -41,6 +41,7 @@
> > > > >  #include "xfs_trace.h"
> > > > >  #include "xfs_icache.h"
> > > > >  #include "xfs_log.h"
> > > > > +#include "xfs_rmap_btree.h"
> > > > >  
> > > > >  /* Kernel only BMAP related definitions and functions */
> > > > >  
> > > > > diff --git a/fs/xfs/xfs_defer_item.c b/fs/xfs/xfs_defer_item.c
> > > > > index dbd10fc..9ed060d 100644
> > > > > --- a/fs/xfs/xfs_defer_item.c
> > > > > +++ b/fs/xfs/xfs_defer_item.c
> > > > > @@ -213,7 +213,8 @@ xfs_rmap_update_finish_item(
> > > > >  			rmap->ri_bmap.br_startoff,
> > > > >  			rmap->ri_bmap.br_startblock,
> > > > >  			rmap->ri_bmap.br_blockcount,
> > > > > -			rmap->ri_bmap.br_state);
> > > > > +			rmap->ri_bmap.br_state,
> > > > > +			(struct xfs_btree_cur **)state);
> > > > >  	kmem_free(rmap);
> > > > >  	return error;
> > > > >  }
> > > > > @@ -225,6 +226,9 @@ xfs_rmap_update_finish_cleanup(
> > > > >  	void			*state,
> > > > >  	int			error)
> > > > >  {
> > > > > +	struct xfs_btree_cur	*rcur = state;
> > > > > +
> > > > > +	xfs_rmap_finish_one_cleanup(tp, rcur, error);
> > > > >  }
> > > > >  
> > > > >  /* Abort all pending RUIs. */
> > > > > diff --git a/fs/xfs/xfs_error.h b/fs/xfs/xfs_error.h
> > > > > index ee4680e..6bc614c 100644
> > > > > --- a/fs/xfs/xfs_error.h
> > > > > +++ b/fs/xfs/xfs_error.h
> > > > > @@ -91,7 +91,8 @@ extern void xfs_verifier_error(struct xfs_buf *bp);
> > > > >  #define XFS_ERRTAG_DIOWRITE_IOERR			20
> > > > >  #define XFS_ERRTAG_BMAPIFORMAT				21
> > > > >  #define XFS_ERRTAG_FREE_EXTENT				22
> > > > > -#define XFS_ERRTAG_MAX					23
> > > > > +#define XFS_ERRTAG_RMAP_FINISH_ONE			23
> > > > > +#define XFS_ERRTAG_MAX					24
> > > > >  
> > > > >  /*
> > > > >   * Random factors for above tags, 1 means always, 2 means 1/2 time, etc.
> > > > > @@ -119,6 +120,7 @@ extern void xfs_verifier_error(struct xfs_buf *bp);
> > > > >  #define XFS_RANDOM_DIOWRITE_IOERR			(XFS_RANDOM_DEFAULT/10)
> > > > >  #define	XFS_RANDOM_BMAPIFORMAT				XFS_RANDOM_DEFAULT
> > > > >  #define XFS_RANDOM_FREE_EXTENT				1
> > > > > +#define XFS_RANDOM_RMAP_FINISH_ONE			1
> > > > >  
> > > > >  #ifdef DEBUG
> > > > >  extern int xfs_error_test_active;
> > > > > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> > > > > index c9fe0c4..f7f9635 100644
> > > > > --- a/fs/xfs/xfs_log_recover.c
> > > > > +++ b/fs/xfs/xfs_log_recover.c
> > > > > @@ -45,6 +45,7 @@
> > > > >  #include "xfs_error.h"
> > > > >  #include "xfs_dir2.h"
> > > > >  #include "xfs_rmap_item.h"
> > > > > +#include "xfs_rmap_btree.h"
> > > > >  
> > > > >  #define BLK_AVG(blk1, blk2)	((blk1+blk2) >> 1)
> > > > >  
> > > > > @@ -4486,6 +4487,12 @@ xlog_recover_process_rui(
> > > > >  	struct xfs_map_extent		*rmap;
> > > > >  	xfs_fsblock_t			startblock_fsb;
> > > > >  	bool				op_ok;
> > > > > +	struct xfs_rud_log_item		*rudp;
> > > > > +	enum xfs_rmap_intent_type	type;
> > > > > +	int				whichfork;
> > > > > +	xfs_exntst_t			state;
> > > > > +	struct xfs_trans		*tp;
> > > > > +	struct xfs_btree_cur		*rcur = NULL;
> > > > >  
> > > > >  	ASSERT(!test_bit(XFS_RUI_RECOVERED, &ruip->rui_flags));
> > > > >  
> > > > > @@ -4528,9 +4535,54 @@ xlog_recover_process_rui(
> > > > >  		}
> > > > >  	}
> > > > >  
> > > > > -	/* XXX: do nothing for now */
> > > > > +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
> > > > > +	if (error)
> > > > > +		return error;
> > > > > +	rudp = xfs_trans_get_rud(tp, ruip, ruip->rui_format.rui_nextents);
> > > > > +
> > > > > +	for (i = 0; i < ruip->rui_format.rui_nextents; i++) {
> > > > > +		rmap = &(ruip->rui_format.rui_extents[i]);
> > > > > +		state = (rmap->me_flags & XFS_RMAP_EXTENT_UNWRITTEN) ?
> > > > > +				XFS_EXT_UNWRITTEN : XFS_EXT_NORM;
> > > > > +		whichfork = (rmap->me_flags & XFS_RMAP_EXTENT_ATTR_FORK) ?
> > > > > +				XFS_ATTR_FORK : XFS_DATA_FORK;
> > > > > +		switch (rmap->me_flags & XFS_RMAP_EXTENT_TYPE_MASK) {
> > > > > +		case XFS_RMAP_EXTENT_MAP:
> > > > > +			type = XFS_RMAP_MAP;
> > > > > +			break;
> > > > > +		case XFS_RMAP_EXTENT_UNMAP:
> > > > > +			type = XFS_RMAP_UNMAP;
> > > > > +			break;
> > > > > +		case XFS_RMAP_EXTENT_CONVERT:
> > > > > +			type = XFS_RMAP_CONVERT;
> > > > > +			break;
> > > > > +		case XFS_RMAP_EXTENT_ALLOC:
> > > > > +			type = XFS_RMAP_ALLOC;
> > > > > +			break;
> > > > > +		case XFS_RMAP_EXTENT_FREE:
> > > > > +			type = XFS_RMAP_FREE;
> > > > > +			break;
> > > > > +		default:
> > > > > +			error = -EFSCORRUPTED;
> > > > > +			goto abort_error;
> > > > > +		}
> > > > > +		error = xfs_trans_log_finish_rmap_update(tp, rudp, type,
> > > > > +				rmap->me_owner, whichfork,
> > > > > +				rmap->me_startoff, rmap->me_startblock,
> > > > > +				rmap->me_len, state, &rcur);
> > > > > +		if (error)
> > > > > +			goto abort_error;
> > > > > +
> > > > > +	}
> > > > > +
> > > > > +	xfs_rmap_finish_one_cleanup(tp, rcur, error);
> > > > >  	set_bit(XFS_RUI_RECOVERED, &ruip->rui_flags);
> > > > > -	xfs_rui_release(ruip);
> > > > > +	error = xfs_trans_commit(tp);
> > > > > +	return error;
> > > > > +
> > > > > +abort_error:
> > > > > +	xfs_rmap_finish_one_cleanup(tp, rcur, error);
> > > > > +	xfs_trans_cancel(tp);
> > > > >  	return error;
> > > > >  }
> > > > >  
> > > > > diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> > > > > index c48be63..f59d934 100644
> > > > > --- a/fs/xfs/xfs_trans.h
> > > > > +++ b/fs/xfs/xfs_trans.h
> > > > > @@ -244,12 +244,13 @@ void xfs_trans_log_start_rmap_update(struct xfs_trans *tp,
> > > > >  		xfs_fsblock_t startblock, xfs_filblks_t blockcount,
> > > > >  		xfs_exntst_t state);
> > > > >  
> > > > > +struct xfs_btree_cur;
> > > > >  struct xfs_rud_log_item *xfs_trans_get_rud(struct xfs_trans *tp,
> > > > >  		struct xfs_rui_log_item *ruip, uint nextents);
> > > > >  int xfs_trans_log_finish_rmap_update(struct xfs_trans *tp,
> > > > >  		struct xfs_rud_log_item *rudp, enum xfs_rmap_intent_type type,
> > > > >  		__uint64_t owner, int whichfork, xfs_fileoff_t startoff,
> > > > >  		xfs_fsblock_t startblock, xfs_filblks_t blockcount,
> > > > > -		xfs_exntst_t state);
> > > > > +		xfs_exntst_t state, struct xfs_btree_cur **pcur);
> > > > >  
> > > > >  #endif	/* __XFS_TRANS_H__ */
> > > > > diff --git a/fs/xfs/xfs_trans_rmap.c b/fs/xfs/xfs_trans_rmap.c
> > > > > index b55a725..0c0df18 100644
> > > > > --- a/fs/xfs/xfs_trans_rmap.c
> > > > > +++ b/fs/xfs/xfs_trans_rmap.c
> > > > > @@ -170,14 +170,15 @@ xfs_trans_log_finish_rmap_update(
> > > > >  	xfs_fileoff_t			startoff,
> > > > >  	xfs_fsblock_t			startblock,
> > > > >  	xfs_filblks_t			blockcount,
> > > > > -	xfs_exntst_t			state)
> > > > > +	xfs_exntst_t			state,
> > > > > +	struct xfs_btree_cur		**pcur)
> > > > >  {
> > > > >  	uint				next_extent;
> > > > >  	struct xfs_map_extent		*rmap;
> > > > >  	int				error;
> > > > >  
> > > > > -	/* XXX: actually finish the rmap update here */
> > > > > -	error = -EFSCORRUPTED;
> > > > > +	error = xfs_rmap_finish_one(tp, type, owner, whichfork, startoff,
> > > > > +			startblock, blockcount, state, pcur);
> > > > >  
> > > > >  	/*
> > > > >  	 * Mark the transaction dirty, even on error. This ensures the
> > > > > 
> > > 
> 

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 042/119] xfs: log rmap intent items
  2016-07-18 12:55       ` Brian Foster
@ 2016-07-19 17:10         ` Darrick J. Wong
  0 siblings, 0 replies; 236+ messages in thread
From: Darrick J. Wong @ 2016-07-19 17:10 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, vishal.l.verma, xfs

On Mon, Jul 18, 2016 at 08:55:02AM -0400, Brian Foster wrote:
> On Sat, Jul 16, 2016 at 12:34:09AM -0700, Darrick J. Wong wrote:
> > On Fri, Jul 15, 2016 at 02:33:46PM -0400, Brian Foster wrote:
> > > On Thu, Jun 16, 2016 at 06:22:21PM -0700, Darrick J. Wong wrote:
> > > > Provide a mechanism for higher levels to create RUI/RUD items, submit
> > > > them to the log, and a stub function to deal with recovered RUI items.
> > > > These parts will be connected to the rmapbt in a later patch.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > ---
> > > 
> > > The commit log makes no mention of log recovery.. perhaps this should be
> > > split in two?
> > > 
> > > >  fs/xfs/Makefile          |    1 
> > > >  fs/xfs/xfs_log_recover.c |  344 +++++++++++++++++++++++++++++++++++++++++++++-
> > > >  fs/xfs/xfs_trans.h       |   17 ++
> > > >  fs/xfs/xfs_trans_rmap.c  |  235 +++++++++++++++++++++++++++++++
> > > >  4 files changed, 589 insertions(+), 8 deletions(-)
> > > >  create mode 100644 fs/xfs/xfs_trans_rmap.c
> > > > 
> > > > 
> > > > diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
> > > > index 8ae0a10..1980110 100644
> > > > --- a/fs/xfs/Makefile
> > > > +++ b/fs/xfs/Makefile
> > > > @@ -110,6 +110,7 @@ xfs-y				+= xfs_log.o \
> > > >  				   xfs_trans_buf.o \
> > > >  				   xfs_trans_extfree.o \
> > > >  				   xfs_trans_inode.o \
> > > > +				   xfs_trans_rmap.o \
> > > >  
> > > >  # optional features
> > > >  xfs-$(CONFIG_XFS_QUOTA)		+= xfs_dquot.o \
> > > > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> > > > index b33187b..c9fe0c4 100644
> > > > --- a/fs/xfs/xfs_log_recover.c
> > > > +++ b/fs/xfs/xfs_log_recover.c
> ...
> > > > @@ -4265,17 +4383,23 @@ xlog_recover_process_efis(
> > > >  	lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
> > > >  	while (lip != NULL) {
> > > >  		/*
> > > > -		 * We're done when we see something other than an EFI.
> > > > -		 * There should be no EFIs left in the AIL now.
> > > > +		 * We're done when we see something other than an intent.
> > > > +		 * There should be no intents left in the AIL now.
> > > >  		 */
> > > > -		if (lip->li_type != XFS_LI_EFI) {
> > > > +		if (!xlog_item_is_intent(lip)) {
> > > >  #ifdef DEBUG
> > > >  			for (; lip; lip = xfs_trans_ail_cursor_next(ailp, &cur))
> > > > -				ASSERT(lip->li_type != XFS_LI_EFI);
> > > > +				ASSERT(!xlog_item_is_intent(lip));
> > > >  #endif
> > > >  			break;
> > > >  		}
> > > >  
> > > > +		/* Skip anything that isn't an EFI */
> > > > +		if (lip->li_type != XFS_LI_EFI) {
> > > > +			lip = xfs_trans_ail_cursor_next(ailp, &cur);
> > > > +			continue;
> > > > +		}
> > > > +
> > > 
> > > Hmm, so previously this function used the existence of any non-EFI item
> > > as an end of traversal marker, since the freeing operations add more
> > > items to the AIL. It's not immediately clear to me whether this is just
> > > an efficiency thing or a potential problem, but I wonder if we should
> > > grab the last item and use that or its lsn as an end of list marker.
> > 
> > FWIW I designed all this under the impression that it was safe to stop looking
> > for intent items once we found something that wasn't an intent item because all
> > the new items generated during log recovery came after, and therefore there was
> > no problem.
> > 
> 
> Ok. To be clear, are you saying that any new intents should follow
> non-intent items? If so, that sounds... reasonable (perhaps a little
> landmine-ish :P).

I've refactored the redo item processing into a single function
xlog_recover_process_intents, and will put in an assert to check that each redo
item's LSN is not larger than whatever LSN(curr_cycle, curr_block) is at the
start of intent processing.  That'll hopefully catch any case where we
accidentally stray into new intent items.

Looks like everything still passes with the review refactoring, so I'll start
integrating the last of those changes into the patchset proper.

--D

> > > At the very least we need to update the comment at the top of the
> > > function wrt to the current behavior.
> > 
> > Oops, missed that, yeah.
> > 
> > > >  		/*
> > > >  		 * Skip EFIs that we've already processed.
> > > >  		 */
> ...
> > > > @@ -5144,11 +5458,19 @@ xlog_recover_finish(
> > > >  	 */
> > > >  	if (log->l_flags & XLOG_RECOVERY_NEEDED) {
> > > >  		int	error;
> > > > +
> > > > +		error = xlog_recover_process_ruis(log);
> > > > +		if (error) {
> > > > +			xfs_alert(log->l_mp, "Failed to recover RUIs");
> > > > +			return error;
> > > > +		}
> > > > +
> > > >  		error = xlog_recover_process_efis(log);
> > > >  		if (error) {
> > > >  			xfs_alert(log->l_mp, "Failed to recover EFIs");
> > > >  			return error;
> > > >  		}
> > > > +
> > > 
> > > Is the order important here in any way (e.g., RUIs before EFIs)? If so,
> > > it might be a good idea to call it out.
> > 
> > AFAIK the intent items within a particular type have to be replayed in
> > order, but between types, there isn't a problem with the current code.
> > 
> > That said, I'd also been wondering if it made more sense to iterate the
> > list of items /once/ and actually replay items in order.  Less iteration
> > and the order of replayed items matches the log order much more closely.
> > 
> 
> That sounds like a nice idea to me. There might actually be some room
> for consolidation between the RUI/EFI recovered bits and whatnot, but
> only if it makes things more clean and simple.
> 
> Brian
> 
> > > >  		/*
> > > >  		 * Sync the log to get all the EFIs out of the AIL.
> > > >  		 * This isn't absolutely necessary, but it helps in
> > > > @@ -5176,9 +5498,15 @@ xlog_recover_cancel(
> > > >  	struct xlog	*log)
> > > >  {
> > > >  	int		error = 0;
> > > > +	int		err2;
> > > >  
> > > > -	if (log->l_flags & XLOG_RECOVERY_NEEDED)
> > > > -		error = xlog_recover_cancel_efis(log);
> > > > +	if (log->l_flags & XLOG_RECOVERY_NEEDED) {
> > > > +		error = xlog_recover_cancel_ruis(log);
> > > > +
> > > > +		err2 = xlog_recover_cancel_efis(log);
> > > > +		if (err2 && !error)
> > > > +			error = err2;
> > > > +	}
> > > >  
> > > >  	return error;
> > > >  }
> > > > diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> > > > index f8d363f..c48be63 100644
> > > > --- a/fs/xfs/xfs_trans.h
> > > > +++ b/fs/xfs/xfs_trans.h
> > > > @@ -235,4 +235,21 @@ void		xfs_trans_buf_copy_type(struct xfs_buf *dst_bp,
> > > >  extern kmem_zone_t	*xfs_trans_zone;
> > > >  extern kmem_zone_t	*xfs_log_item_desc_zone;
> > > >  
> > > > +enum xfs_rmap_intent_type;
> > > > +
> > > > +struct xfs_rui_log_item *xfs_trans_get_rui(struct xfs_trans *tp, uint nextents);
> > > > +void xfs_trans_log_start_rmap_update(struct xfs_trans *tp,
> > > > +		struct xfs_rui_log_item *ruip, enum xfs_rmap_intent_type type,
> > > > +		__uint64_t owner, int whichfork, xfs_fileoff_t startoff,
> > > > +		xfs_fsblock_t startblock, xfs_filblks_t blockcount,
> > > > +		xfs_exntst_t state);
> > > > +
> > > > +struct xfs_rud_log_item *xfs_trans_get_rud(struct xfs_trans *tp,
> > > > +		struct xfs_rui_log_item *ruip, uint nextents);
> > > > +int xfs_trans_log_finish_rmap_update(struct xfs_trans *tp,
> > > > +		struct xfs_rud_log_item *rudp, enum xfs_rmap_intent_type type,
> > > > +		__uint64_t owner, int whichfork, xfs_fileoff_t startoff,
> > > > +		xfs_fsblock_t startblock, xfs_filblks_t blockcount,
> > > > +		xfs_exntst_t state);
> > > > +
> > > >  #endif	/* __XFS_TRANS_H__ */
> > > > diff --git a/fs/xfs/xfs_trans_rmap.c b/fs/xfs/xfs_trans_rmap.c
> > > > new file mode 100644
> > > > index 0000000..b55a725
> > > > --- /dev/null
> > > > +++ b/fs/xfs/xfs_trans_rmap.c
> > > > @@ -0,0 +1,235 @@
> > > > +/*
> > > > + * Copyright (C) 2016 Oracle.  All Rights Reserved.
> > > > + *
> > > > + * Author: Darrick J. Wong <darrick.wong@oracle.com>
> > > > + *
> > > > + * This program is free software; you can redistribute it and/or
> > > > + * modify it under the terms of the GNU General Public License
> > > > + * as published by the Free Software Foundation; either version 2
> > > > + * of the License, or (at your option) any later version.
> > > > + *
> > > > + * This program is distributed in the hope that it would be useful,
> > > > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > > > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > > > + * GNU General Public License for more details.
> > > > + *
> > > > + * You should have received a copy of the GNU General Public License
> > > > + * along with this program; if not, write the Free Software Foundation,
> > > > + * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
> > > > + */
> > > > +#include "xfs.h"
> > > > +#include "xfs_fs.h"
> > > > +#include "xfs_shared.h"
> > > > +#include "xfs_format.h"
> > > > +#include "xfs_log_format.h"
> > > > +#include "xfs_trans_resv.h"
> > > > +#include "xfs_mount.h"
> > > > +#include "xfs_defer.h"
> > > > +#include "xfs_trans.h"
> > > > +#include "xfs_trans_priv.h"
> > > > +#include "xfs_rmap_item.h"
> > > > +#include "xfs_alloc.h"
> > > > +#include "xfs_rmap_btree.h"
> > > > +
> > > > +/*
> > > > + * This routine is called to allocate an "rmap update intent"
> > > > + * log item that will hold nextents worth of extents.  The
> > > > + * caller must use all nextents extents, because we are not
> > > > + * flexible about this at all.
> > > > + */
> > > > +struct xfs_rui_log_item *
> > > > +xfs_trans_get_rui(
> > > > +	struct xfs_trans		*tp,
> > > > +	uint				nextents)
> > > > +{
> > > > +	struct xfs_rui_log_item		*ruip;
> > > > +
> > > > +	ASSERT(tp != NULL);
> > > > +	ASSERT(nextents > 0);
> > > > +
> > > > +	ruip = xfs_rui_init(tp->t_mountp, nextents);
> > > > +	ASSERT(ruip != NULL);
> > > > +
> > > > +	/*
> > > > +	 * Get a log_item_desc to point at the new item.
> > > > +	 */
> > > > +	xfs_trans_add_item(tp, &ruip->rui_item);
> > > > +	return ruip;
> > > > +}
> > > > +
> > > > +/*
> > > > + * This routine is called to indicate that the described
> > > > + * extent is to be logged as needing to be freed.  It should
> > > > + * be called once for each extent to be freed.
> > > > + */
> > > 
> > > Stale comment.
> > 
> > <nod>
> > 
> > > > +void
> > > > +xfs_trans_log_start_rmap_update(
> > > > +	struct xfs_trans		*tp,
> > > > +	struct xfs_rui_log_item		*ruip,
> > > > +	enum xfs_rmap_intent_type	type,
> > > > +	__uint64_t			owner,
> > > > +	int				whichfork,
> > > > +	xfs_fileoff_t			startoff,
> > > > +	xfs_fsblock_t			startblock,
> > > > +	xfs_filblks_t			blockcount,
> > > > +	xfs_exntst_t			state)
> > > > +{
> > > > +	uint				next_extent;
> > > > +	struct xfs_map_extent		*rmap;
> > > > +
> > > > +	tp->t_flags |= XFS_TRANS_DIRTY;
> > > > +	ruip->rui_item.li_desc->lid_flags |= XFS_LID_DIRTY;
> > > > +
> > > > +	/*
> > > > +	 * atomic_inc_return gives us the value after the increment;
> > > > +	 * we want to use it as an array index so we need to subtract 1 from
> > > > +	 * it.
> > > > +	 */
> > > > +	next_extent = atomic_inc_return(&ruip->rui_next_extent) - 1;
> > > > +	ASSERT(next_extent < ruip->rui_format.rui_nextents);
> > > > +	rmap = &(ruip->rui_format.rui_extents[next_extent]);
> > > > +	rmap->me_owner = owner;
> > > > +	rmap->me_startblock = startblock;
> > > > +	rmap->me_startoff = startoff;
> > > > +	rmap->me_len = blockcount;
> > > > +	rmap->me_flags = 0;
> > > > +	if (state == XFS_EXT_UNWRITTEN)
> > > > +		rmap->me_flags |= XFS_RMAP_EXTENT_UNWRITTEN;
> > > > +	if (whichfork == XFS_ATTR_FORK)
> > > > +		rmap->me_flags |= XFS_RMAP_EXTENT_ATTR_FORK;
> > > > +	switch (type) {
> > > > +	case XFS_RMAP_MAP:
> > > > +		rmap->me_flags |= XFS_RMAP_EXTENT_MAP;
> > > > +		break;
> > > > +	case XFS_RMAP_MAP_SHARED:
> > > > +		rmap->me_flags |= XFS_RMAP_EXTENT_MAP_SHARED;
> > > > +		break;
> > > > +	case XFS_RMAP_UNMAP:
> > > > +		rmap->me_flags |= XFS_RMAP_EXTENT_UNMAP;
> > > > +		break;
> > > > +	case XFS_RMAP_UNMAP_SHARED:
> > > > +		rmap->me_flags |= XFS_RMAP_EXTENT_UNMAP_SHARED;
> > > > +		break;
> > > > +	case XFS_RMAP_CONVERT:
> > > > +		rmap->me_flags |= XFS_RMAP_EXTENT_CONVERT;
> > > > +		break;
> > > > +	case XFS_RMAP_CONVERT_SHARED:
> > > > +		rmap->me_flags |= XFS_RMAP_EXTENT_CONVERT_SHARED;
> > > > +		break;
> > > > +	case XFS_RMAP_ALLOC:
> > > > +		rmap->me_flags |= XFS_RMAP_EXTENT_ALLOC;
> > > > +		break;
> > > > +	case XFS_RMAP_FREE:
> > > > +		rmap->me_flags |= XFS_RMAP_EXTENT_FREE;
> > > > +		break;
> > > > +	default:
> > > > +		ASSERT(0);
> > > > +	}
> > > 
> > > Between here and the finish function, it looks like we could use a
> > > helper to convert the state and whatnot to extent flags.
> > 
> > Ok.
> > 
> > > > +}
> > > > +
> > > > +
> > > > +/*
> > > > + * This routine is called to allocate an "extent free done"
> > > > + * log item that will hold nextents worth of extents.  The
> > > > + * caller must use all nextents extents, because we are not
> > > > + * flexible about this at all.
> > > > + */
> > > 
> > > Comment needs updating.
> > 
> > Ok.
> > 
> > > Brian
> > > 
> > > > +struct xfs_rud_log_item *
> > > > +xfs_trans_get_rud(
> > > > +	struct xfs_trans		*tp,
> > > > +	struct xfs_rui_log_item		*ruip,
> > > > +	uint				nextents)
> > > > +{
> > > > +	struct xfs_rud_log_item		*rudp;
> > > > +
> > > > +	ASSERT(tp != NULL);
> > > > +	ASSERT(nextents > 0);
> > > > +
> > > > +	rudp = xfs_rud_init(tp->t_mountp, ruip, nextents);
> > > > +	ASSERT(rudp != NULL);
> > > > +
> > > > +	/*
> > > > +	 * Get a log_item_desc to point at the new item.
> > > > +	 */
> > > > +	xfs_trans_add_item(tp, &rudp->rud_item);
> > > > +	return rudp;
> > > > +}
> > > > +
> > > > +/*
> > > > + * Finish an rmap update and log it to the RUD. Note that the transaction is
> > > > + * marked dirty regardless of whether the rmap update succeeds or fails to
> > > > + * support the RUI/RUD lifecycle rules.
> > > > + */
> > > > +int
> > > > +xfs_trans_log_finish_rmap_update(
> > > > +	struct xfs_trans		*tp,
> > > > +	struct xfs_rud_log_item		*rudp,
> > > > +	enum xfs_rmap_intent_type	type,
> > > > +	__uint64_t			owner,
> > > > +	int				whichfork,
> > > > +	xfs_fileoff_t			startoff,
> > > > +	xfs_fsblock_t			startblock,
> > > > +	xfs_filblks_t			blockcount,
> > > > +	xfs_exntst_t			state)
> > > > +{
> > > > +	uint				next_extent;
> > > > +	struct xfs_map_extent		*rmap;
> > > > +	int				error;
> > > > +
> > > > +	/* XXX: actually finish the rmap update here */
> > > > +	error = -EFSCORRUPTED;
> > > > +
> > > > +	/*
> > > > +	 * Mark the transaction dirty, even on error. This ensures the
> > > > +	 * transaction is aborted, which:
> > > > +	 *
> > > > +	 * 1.) releases the RUI and frees the RUD
> > > > +	 * 2.) shuts down the filesystem
> > > > +	 */
> > > > +	tp->t_flags |= XFS_TRANS_DIRTY;
> > > > +	rudp->rud_item.li_desc->lid_flags |= XFS_LID_DIRTY;
> > > > +
> > > > +	next_extent = rudp->rud_next_extent;
> > > > +	ASSERT(next_extent < rudp->rud_format.rud_nextents);
> > > > +	rmap = &(rudp->rud_format.rud_extents[next_extent]);
> > > > +	rmap->me_owner = owner;
> > > > +	rmap->me_startblock = startblock;
> > > > +	rmap->me_startoff = startoff;
> > > > +	rmap->me_len = blockcount;
> > > > +	rmap->me_flags = 0;
> > > > +	if (state == XFS_EXT_UNWRITTEN)
> > > > +		rmap->me_flags |= XFS_RMAP_EXTENT_UNWRITTEN;
> > > > +	if (whichfork == XFS_ATTR_FORK)
> > > > +		rmap->me_flags |= XFS_RMAP_EXTENT_ATTR_FORK;
> > > > +	switch (type) {
> > > > +	case XFS_RMAP_MAP:
> > > > +		rmap->me_flags |= XFS_RMAP_EXTENT_MAP;
> > > > +		break;
> > > > +	case XFS_RMAP_MAP_SHARED:
> > > > +		rmap->me_flags |= XFS_RMAP_EXTENT_MAP_SHARED;
> > > > +		break;
> > > > +	case XFS_RMAP_UNMAP:
> > > > +		rmap->me_flags |= XFS_RMAP_EXTENT_UNMAP;
> > > > +		break;
> > > > +	case XFS_RMAP_UNMAP_SHARED:
> > > > +		rmap->me_flags |= XFS_RMAP_EXTENT_UNMAP_SHARED;
> > > > +		break;
> > > > +	case XFS_RMAP_CONVERT:
> > > > +		rmap->me_flags |= XFS_RMAP_EXTENT_CONVERT;
> > > > +		break;
> > > > +	case XFS_RMAP_CONVERT_SHARED:
> > > > +		rmap->me_flags |= XFS_RMAP_EXTENT_CONVERT_SHARED;
> > > > +		break;
> > > > +	case XFS_RMAP_ALLOC:
> > > > +		rmap->me_flags |= XFS_RMAP_EXTENT_ALLOC;
> > > > +		break;
> > > > +	case XFS_RMAP_FREE:
> > > > +		rmap->me_flags |= XFS_RMAP_EXTENT_FREE;
> > > > +		break;
> > > > +	default:
> > > > +		ASSERT(0);
> > > > +	}
> > > > +	rudp->rud_next_extent++;
> > > > +
> > > > +	return error;
> > > > +}
> > > > 
> > > > _______________________________________________
> > > > xfs mailing list
> > > > xfs@oss.sgi.com
> > > > http://oss.sgi.com/mailman/listinfo/xfs


end of thread, other threads:[~2016-07-19 17:10 UTC | newest]

Thread overview: 236+ messages (download: mbox.gz / follow: Atom feed)
2016-06-17  1:17 [PATCH v6 000/119] xfs: add reverse mapping, reflink, dedupe, and online scrub support Darrick J. Wong
2016-06-17  1:17 ` [PATCH 001/119] vfs: fix return type of ioctl_file_dedupe_range Darrick J. Wong
2016-06-17 11:32   ` Christoph Hellwig
2016-06-28 19:19     ` Darrick J. Wong
2016-06-17  1:18 ` [PATCH 002/119] vfs: support FS_XFLAG_REFLINK and FS_XFLAG_COWEXTSIZE Darrick J. Wong
2016-06-17 11:41   ` Christoph Hellwig
2016-06-17 12:16     ` Brian Foster
2016-06-17 15:06       ` Christoph Hellwig
2016-06-17 16:54       ` Darrick J. Wong
2016-06-17 17:38         ` Brian Foster
2016-06-17  1:18 ` [PATCH 003/119] xfs: check offsets of variable length structures Darrick J. Wong
2016-06-17 11:33   ` Christoph Hellwig
2016-06-17 17:34   ` Brian Foster
2016-06-18 18:01     ` Darrick J. Wong
2016-06-20 12:38       ` Brian Foster
2016-06-17  1:18 ` [PATCH 004/119] xfs: enable buffer deadlock postmortem diagnosis via ftrace Darrick J. Wong
2016-06-17 11:34   ` Christoph Hellwig
2016-06-21  0:47     ` Dave Chinner
2016-06-17  1:18 ` [PATCH 005/119] xfs: check for a valid error_tag in errortag_add Darrick J. Wong
2016-06-17 11:34   ` Christoph Hellwig
2016-06-17  1:18 ` [PATCH 006/119] xfs: port differences from xfsprogs libxfs Darrick J. Wong
2016-06-17 15:06   ` Christoph Hellwig
2016-06-20  0:21   ` Dave Chinner
2016-07-13 23:39     ` Darrick J. Wong
2016-06-17  1:18 ` [PATCH 007/119] xfs: rearrange xfs_bmap_add_free parameters Darrick J. Wong
2016-06-17 11:39   ` Christoph Hellwig
2016-06-17  1:18 ` [PATCH 008/119] xfs: separate freelist fixing into a separate helper Darrick J. Wong
2016-06-17 11:52   ` Christoph Hellwig
2016-06-21  0:48     ` Dave Chinner
2016-06-21  1:40   ` Dave Chinner
2016-06-17  1:18 ` [PATCH 009/119] xfs: convert list of extents to free into a regular list Darrick J. Wong
2016-06-17 11:59   ` Christoph Hellwig
2016-06-18 20:15     ` Darrick J. Wong
2016-06-21  0:57       ` Dave Chinner
2016-06-17  1:18 ` [PATCH 010/119] xfs: create a standard btree size calculator code Darrick J. Wong
2016-06-20 14:31   ` Brian Foster
2016-06-20 19:34     ` Darrick J. Wong
2016-06-17  1:19 ` [PATCH 011/119] xfs: refactor btree maxlevels computation Darrick J. Wong
2016-06-20 14:31   ` Brian Foster
2016-06-20 18:23     ` Darrick J. Wong
2016-06-17  1:19 ` [PATCH 012/119] xfs: during btree split, save new block key & ptr for future insertion Darrick J. Wong
2016-06-21 13:00   ` Brian Foster
2016-06-27 22:30     ` Darrick J. Wong
2016-06-28 12:31       ` Brian Foster
2016-06-17  1:19 ` [PATCH 013/119] xfs: support btrees with overlapping intervals for keys Darrick J. Wong
2016-06-22 15:17   ` Brian Foster
2016-06-28  3:26     ` Darrick J. Wong
2016-06-28 12:32       ` Brian Foster
2016-06-28 17:36         ` Darrick J. Wong
2016-07-06  4:59   ` Dave Chinner
2016-07-06  8:09     ` Darrick J. Wong
2016-06-17  1:19 ` [PATCH 014/119] xfs: introduce interval queries on btrees Darrick J. Wong
2016-06-22 15:18   ` Brian Foster
2016-06-27 21:07     ` Darrick J. Wong
2016-06-28 12:32       ` Brian Foster
2016-06-28 16:29         ` Darrick J. Wong
2016-06-17  1:19 ` [PATCH 015/119] xfs: refactor btree owner change into a separate visit-blocks function Darrick J. Wong
2016-06-23 17:19   ` Brian Foster
2016-06-17  1:19 ` [PATCH 016/119] xfs: move deferred operations into a separate file Darrick J. Wong
2016-06-27 13:14   ` Brian Foster
2016-06-27 19:14     ` Darrick J. Wong
2016-06-28 12:32       ` Brian Foster
2016-06-28 18:51         ` Darrick J. Wong
2016-06-17  1:19 ` [PATCH 017/119] xfs: add tracepoints for the deferred ops mechanism Darrick J. Wong
2016-06-27 13:15   ` Brian Foster
2016-06-17  1:19 ` [PATCH 018/119] xfs: enable the xfs_defer mechanism to process extents to free Darrick J. Wong
2016-06-27 13:15   ` Brian Foster
2016-06-27 21:41     ` Darrick J. Wong
2016-06-27 22:00       ` Darrick J. Wong
2016-06-28 12:32         ` Brian Foster
2016-06-28 16:33           ` Darrick J. Wong
2016-06-17  1:19 ` [PATCH 019/119] xfs: rework xfs_bmap_free callers to use xfs_defer_ops Darrick J. Wong
2016-06-17  1:20 ` [PATCH 020/119] xfs: change xfs_bmap_{finish, cancel, init, free} -> xfs_defer_* Darrick J. Wong
2016-06-30  0:11   ` Darrick J. Wong
2016-06-17  1:20 ` [PATCH 021/119] xfs: rename flist/free_list to dfops Darrick J. Wong
2016-06-17  1:20 ` [PATCH 022/119] xfs: add tracepoints and error injection for deferred extent freeing Darrick J. Wong
2016-06-17  1:20 ` [PATCH 023/119] xfs: introduce rmap btree definitions Darrick J. Wong
2016-06-30 17:32   ` Brian Foster
2016-06-17  1:20 ` [PATCH 024/119] xfs: add rmap btree stats infrastructure Darrick J. Wong
2016-06-30 17:32   ` Brian Foster
2016-06-17  1:20 ` [PATCH 025/119] xfs: rmap btree add more reserved blocks Darrick J. Wong
2016-06-30 17:32   ` Brian Foster
2016-06-17  1:20 ` [PATCH 026/119] xfs: add owner field to extent allocation and freeing Darrick J. Wong
2016-07-06  4:01   ` Dave Chinner
2016-07-06  6:44     ` Darrick J. Wong
2016-07-07 15:12   ` Brian Foster
2016-07-07 19:09     ` Darrick J. Wong
2016-07-07 22:55       ` Dave Chinner
2016-07-08 11:37       ` Brian Foster
2016-06-17  1:20 ` [PATCH 027/119] xfs: introduce rmap extent operation stubs Darrick J. Wong
2016-06-17  1:20 ` [PATCH 028/119] xfs: define the on-disk rmap btree format Darrick J. Wong
2016-07-06  4:05   ` Dave Chinner
2016-07-06  6:44     ` Darrick J. Wong
2016-07-07 18:41   ` Brian Foster
2016-07-07 19:18     ` Darrick J. Wong
2016-07-07 23:14       ` Dave Chinner
2016-07-07 23:58         ` Darrick J. Wong
2016-06-17  1:20 ` [PATCH 029/119] xfs: add rmap btree growfs support Darrick J. Wong
2016-06-17  1:21 ` [PATCH 030/119] xfs: rmap btree transaction reservations Darrick J. Wong
2016-07-08 13:21   ` Brian Foster
2016-06-17  1:21 ` [PATCH 031/119] xfs: rmap btree requires more reserved free space Darrick J. Wong
2016-07-08 13:21   ` Brian Foster
2016-07-13 16:50     ` Darrick J. Wong
2016-07-13 18:32       ` Brian Foster
2016-07-13 23:50         ` Dave Chinner
2016-06-17  1:21 ` [PATCH 032/119] xfs: add rmap btree operations Darrick J. Wong
2016-07-08 18:33   ` Brian Foster
2016-07-08 23:53     ` Darrick J. Wong
2016-06-17  1:21 ` [PATCH 033/119] xfs: support overlapping intervals in the rmap btree Darrick J. Wong
2016-07-08 18:33   ` Brian Foster
2016-07-09  0:14     ` Darrick J. Wong
2016-07-09 13:25       ` Brian Foster
2016-06-17  1:21 ` [PATCH 034/119] xfs: teach rmapbt to support interval queries Darrick J. Wong
2016-07-08 18:34   ` Brian Foster
2016-07-09  0:16     ` Darrick J. Wong
2016-07-09 13:25       ` Brian Foster
2016-06-17  1:21 ` [PATCH 035/119] xfs: add tracepoints for the rmap functions Darrick J. Wong
2016-07-08 18:34   ` Brian Foster
2016-06-17  1:21 ` [PATCH 036/119] xfs: add an extent to the rmap btree Darrick J. Wong
2016-07-11 18:49   ` Brian Foster
2016-07-11 23:01     ` Darrick J. Wong
2016-06-17  1:21 ` [PATCH 037/119] xfs: remove an extent from " Darrick J. Wong
2016-07-11 18:49   ` Brian Foster
2016-06-17  1:21 ` [PATCH 038/119] xfs: convert unwritten status of reverse mappings Darrick J. Wong
2016-06-30  0:15   ` Darrick J. Wong
2016-07-13 18:27   ` Brian Foster
2016-07-13 20:43     ` Darrick J. Wong
2016-06-17  1:22 ` [PATCH 039/119] xfs: add rmap btree insert and delete helpers Darrick J. Wong
2016-07-13 18:28   ` Brian Foster
2016-07-13 18:37     ` Darrick J. Wong
2016-07-13 18:42       ` Brian Foster
2016-06-17  1:22 ` [PATCH 040/119] xfs: create helpers for mapping, unmapping, and converting file fork extents Darrick J. Wong
2016-07-13 18:28   ` Brian Foster
2016-07-13 18:47     ` Darrick J. Wong
2016-07-13 23:54       ` Dave Chinner
2016-07-13 23:55         ` Darrick J. Wong
2016-06-17  1:22 ` [PATCH 041/119] xfs: create rmap update intent log items Darrick J. Wong
2016-07-15 18:33   ` Brian Foster
2016-07-16  7:10     ` Darrick J. Wong
2016-06-17  1:22 ` [PATCH 042/119] xfs: log rmap intent items Darrick J. Wong
2016-07-15 18:33   ` Brian Foster
2016-07-16  7:34     ` Darrick J. Wong
2016-07-18 12:55       ` Brian Foster
2016-07-19 17:10         ` Darrick J. Wong
2016-06-17  1:22 ` [PATCH 043/119] xfs: enable the xfs_defer mechanism to process rmaps to update Darrick J. Wong
2016-07-15 18:33   ` Brian Foster
2016-06-17  1:22 ` [PATCH 044/119] xfs: propagate bmap updates to rmapbt Darrick J. Wong
2016-07-15 18:33   ` Brian Foster
2016-07-16  7:26     ` Darrick J. Wong
2016-07-18  1:21       ` Dave Chinner
2016-07-18 12:56         ` Brian Foster
2016-07-18 12:55       ` Brian Foster
2016-07-19  1:53         ` Darrick J. Wong
2016-07-19 11:37           ` Brian Foster
2016-06-17  1:22 ` [PATCH 045/119] xfs: add rmap btree geometry feature flag Darrick J. Wong
2016-07-18 13:34   ` Brian Foster
2016-06-17  1:22 ` [PATCH 046/119] xfs: add rmap btree block detection to log recovery Darrick J. Wong
2016-07-18 13:34   ` Brian Foster
2016-06-17  1:22 ` [PATCH 047/119] xfs: disable XFS_IOC_SWAPEXT when rmap btree is enabled Darrick J. Wong
2016-07-18 13:34   ` Brian Foster
2016-07-18 16:18     ` Darrick J. Wong
2016-06-17  1:22 ` [PATCH 048/119] xfs: don't update rmapbt when fixing agfl Darrick J. Wong
2016-07-18 13:34   ` Brian Foster
2016-07-18 15:53     ` Darrick J. Wong
2016-06-17  1:23 ` [PATCH 049/119] xfs: enable the rmap btree functionality Darrick J. Wong
2016-07-18 13:34   ` Brian Foster
2016-06-17  1:23 ` [PATCH 050/119] xfs: count the blocks in a btree Darrick J. Wong
2016-06-17  1:23 ` [PATCH 051/119] xfs: introduce tracepoints for AG reservation code Darrick J. Wong
2016-06-17  1:23 ` [PATCH 052/119] xfs: set up per-AG free space reservations Darrick J. Wong
2016-06-17  1:23 ` [PATCH 053/119] xfs: define tracepoints for refcount btree activities Darrick J. Wong
2016-06-17  1:23 ` [PATCH 054/119] xfs: introduce refcount btree definitions Darrick J. Wong
2016-06-17  1:23 ` [PATCH 055/119] xfs: add refcount btree stats infrastructure Darrick J. Wong
2016-06-17  1:23 ` [PATCH 056/119] xfs: refcount btree add more reserved blocks Darrick J. Wong
2016-06-17  1:23 ` [PATCH 057/119] xfs: define the on-disk refcount btree format Darrick J. Wong
2016-06-17  1:24 ` [PATCH 058/119] xfs: add refcount btree support to growfs Darrick J. Wong
2016-06-17  1:24 ` [PATCH 059/119] xfs: account for the refcount btree in the alloc/free log reservation Darrick J. Wong
2016-06-17  1:24 ` [PATCH 060/119] xfs: add refcount btree operations Darrick J. Wong
2016-06-17  1:24 ` [PATCH 061/119] xfs: create refcount update intent log items Darrick J. Wong
2016-06-17  1:24 ` [PATCH 062/119] xfs: log refcount intent items Darrick J. Wong
2016-06-17  1:24 ` [PATCH 063/119] xfs: adjust refcount of an extent of blocks in refcount btree Darrick J. Wong
2016-06-17  1:24 ` [PATCH 064/119] xfs: connect refcount adjust functions to upper layers Darrick J. Wong
2016-06-17  1:24 ` [PATCH 065/119] xfs: adjust refcount when unmapping file blocks Darrick J. Wong
2016-06-17  1:24 ` [PATCH 066/119] xfs: add refcount btree block detection to log recovery Darrick J. Wong
2016-06-17  1:25 ` [PATCH 067/119] xfs: refcount btree requires more reserved space Darrick J. Wong
2016-06-17  1:25 ` [PATCH 068/119] xfs: introduce reflink utility functions Darrick J. Wong
2016-06-17  1:25 ` [PATCH 069/119] xfs: create bmbt update intent log items Darrick J. Wong
2016-06-17  1:25 ` [PATCH 070/119] xfs: log bmap intent items Darrick J. Wong
2016-06-17  1:25 ` [PATCH 071/119] xfs: map an inode's offset to an exact physical block Darrick J. Wong
2016-06-17  1:25 ` [PATCH 072/119] xfs: implement deferred bmbt map/unmap operations Darrick J. Wong
2016-06-17  1:25 ` [PATCH 073/119] xfs: return work remaining at the end of a bunmapi operation Darrick J. Wong
2016-06-17  1:25 ` [PATCH 074/119] xfs: define tracepoints for reflink activities Darrick J. Wong
2016-06-17  1:25 ` [PATCH 075/119] xfs: add reflink feature flag to geometry Darrick J. Wong
2016-06-17  1:25 ` [PATCH 076/119] xfs: don't allow reflinked dir/dev/fifo/socket/pipe files Darrick J. Wong
2016-06-17  1:26 ` [PATCH 077/119] xfs: introduce the CoW fork Darrick J. Wong
2016-06-17  1:26 ` [PATCH 078/119] xfs: support bmapping delalloc extents in " Darrick J. Wong
2016-06-17  1:26 ` [PATCH 079/119] xfs: create delalloc extents in " Darrick J. Wong
2016-06-17  1:26 ` [PATCH 080/119] xfs: support allocating delayed " Darrick J. Wong
2016-06-17  1:26 ` [PATCH 081/119] xfs: allocate " Darrick J. Wong
2016-06-17  1:26 ` [PATCH 082/119] xfs: support removing extents from " Darrick J. Wong
2016-06-17  1:26 ` [PATCH 083/119] xfs: move mappings from cow fork to data fork after copy-write Darrick J. Wong
2016-06-17  1:26 ` [PATCH 084/119] xfs: implement CoW for directio writes Darrick J. Wong
2016-06-17  1:26 ` [PATCH 085/119] xfs: copy-on-write reflinked blocks when zeroing ranges of blocks Darrick J. Wong
2016-06-17  1:27 ` [PATCH 086/119] xfs: cancel CoW reservations and clear inode reflink flag when freeing blocks Darrick J. Wong
2016-06-17  1:27 ` [PATCH 087/119] xfs: cancel pending CoW reservations when destroying inodes Darrick J. Wong
2016-06-17  1:27 ` [PATCH 088/119] xfs: store in-progress CoW allocations in the refcount btree Darrick J. Wong
2016-06-17  1:27 ` [PATCH 089/119] xfs: reflink extents from one file to another Darrick J. Wong
2016-06-17  1:27 ` [PATCH 090/119] xfs: add clone file and clone range vfs functions Darrick J. Wong
2016-06-17  1:27 ` [PATCH 091/119] xfs: add dedupe range vfs function Darrick J. Wong
2016-06-17  1:27 ` [PATCH 092/119] xfs: teach get_bmapx and fiemap about shared extents and the CoW fork Darrick J. Wong
2016-06-17  1:27 ` [PATCH 093/119] xfs: swap inode reflink flags when swapping inode extents Darrick J. Wong
2016-06-17  1:27 ` [PATCH 094/119] xfs: unshare a range of blocks via fallocate Darrick J. Wong
2016-06-17  1:28 ` [PATCH 095/119] xfs: CoW shared EOF block when truncating file Darrick J. Wong
2016-06-17  1:28 ` [PATCH 096/119] xfs: support FS_XFLAG_REFLINK on reflink filesystems Darrick J. Wong
2016-06-17  1:28 ` [PATCH 097/119] xfs: create a separate cow extent size hint for the allocator Darrick J. Wong
2016-06-17  1:28 ` [PATCH 098/119] xfs: preallocate blocks for worst-case btree expansion Darrick J. Wong
2016-06-17  1:28 ` [PATCH 099/119] xfs: don't allow reflink when the AG is low on space Darrick J. Wong
2016-06-17  1:28 ` [PATCH 100/119] xfs: try other AGs to allocate a BMBT block Darrick J. Wong
2016-06-17  1:28 ` [PATCH 101/119] xfs: promote buffered writes to CoW when cowextsz is set Darrick J. Wong
2016-06-17  1:28 ` [PATCH 102/119] xfs: garbage collect old cowextsz reservations Darrick J. Wong
2016-06-17  1:28 ` [PATCH 103/119] xfs: provide switch to force filesystem to copy-on-write all the time Darrick J. Wong
2016-06-17  1:29 ` [PATCH 104/119] xfs: increase log reservations for reflink Darrick J. Wong
2016-06-17  1:29 ` [PATCH 105/119] xfs: use interval query for rmap alloc operations on shared files Darrick J. Wong
2016-06-17  1:29 ` [PATCH 106/119] xfs: convert unwritten status of reverse mappings for " Darrick J. Wong
2016-06-17  1:29 ` [PATCH 107/119] xfs: set a default CoW extent size of 32 blocks Darrick J. Wong
2016-06-17  1:29 ` [PATCH 108/119] xfs: don't allow realtime and reflinked files to mix Darrick J. Wong
2016-06-17  1:29 ` [PATCH 109/119] xfs: don't mix reflink and DAX mode for now Darrick J. Wong
2016-06-17  1:29 ` [PATCH 110/119] xfs: fail ->bmap for reflink inodes Darrick J. Wong
2016-06-17  1:29 ` [PATCH 111/119] xfs: recognize the reflink feature bit Darrick J. Wong
2016-06-17  1:29 ` [PATCH 112/119] xfs: introduce the XFS_IOC_GETFSMAPX ioctl Darrick J. Wong
2016-06-17  1:30 ` [PATCH 113/119] xfs: scrub btree records and pointers while querying Darrick J. Wong
2016-06-17  1:30 ` [PATCH 114/119] xfs: create sysfs hooks to scrub various files Darrick J. Wong
2016-06-17  1:30 ` [PATCH 115/119] xfs: support scrubbing free space btrees Darrick J. Wong
2016-06-17  1:30 ` [PATCH 116/119] xfs: support scrubbing inode btrees Darrick J. Wong
2016-06-17  1:30 ` [PATCH 117/119] xfs: support scrubbing rmap btree Darrick J. Wong
2016-06-17  1:30 ` [PATCH 118/119] xfs: support scrubbing refcount btree Darrick J. Wong
2016-06-17  1:30 ` [PATCH 119/119] xfs: add btree scrub tracepoints Darrick J. Wong
