* [PATCH v10 00/63] xfs: add reflink and dedupe support
@ 2016-09-30  3:05 Darrick J. Wong
  2016-09-30  3:05 ` [PATCH 01/63] vfs: support FS_XFLAG_COWEXTSIZE and get/set of CoW extent size hint Darrick J. Wong
                   ` (63 more replies)
  0 siblings, 64 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:05 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel

Hi all,

This is the tenth revision of a patchset that adds support to the XFS
kernel for mapping multiple file logical blocks to the same physical
block (reflink/deduplication).  There shouldn't be any incompatible on-disk
format changes, pending a thorough review of the patches within.

NOTE: This patchset contains all the review fixes from the v9 posting,
and no other changes.  If you were in the middle of reviewing patches,
you can continue from wherever you left off.

The reflink implementation features a simple per-AG b+tree containing
tuples of (physical block, blockcount, refcount) with the key being
the physical block.  Copy on Write (CoW) is implemented by creating a
separate CoW extent mapping fork and using the existing delayed
allocation mechanism to try to allocate as large a replacement
extent as possible before committing the new data to media.  A CoW
extent size hint allows administrators to influence the size of the
replacement extents, and certain writes can be "promoted" to CoW when
it would be advantageous to reduce fragmentation.  The userspace
interfaces to reflink and dedupe are the VFS FICLONE, FICLONERANGE, and
FIDEDUPERANGE ioctls, which were previously private to btrfs.

Next comes the reference count B+tree, which tracks the reference
counts of shared extents (refcount > 1) and extents being used to
stage a copy-on-write operation (refcount == 1).  We define new log
redo item pairs both for refcount updates and for inode fork updates;
these plug into the deferred ops framework created for the reverse
mapping patches.

After that comes the reflink code, which handles the actual
copy-on-write behavior that is required for block sharing; and
connections to the VFS file ops for reflink, dedupe, and
copy_file_range.

At the very end of the patchset is a reimplementation of the extent
swap code that uses the reverse-mapping and block-mapping deferred ops
to provide xfs_swap_extent on filesystems with reverse mapping enabled.

The patches contained within are also available as git trees.  The
kernel patches[1] apply against Dave Chinner's for-next branch, and for
user tools you'll want to use the xfsprogs dev branch[2].  I do not
intend to mail out userspace patches until after the merge window and
libxfs-apply resync happens.  The clone group xfstests should all pass.

This is an extraordinary way to eat your data.  Enjoy! 
Comments and questions are, as always, welcome.

--D

[1] https://github.com/djwong/linux/tree/for-dave-for-4.9-8
[2] https://github.com/djwong/xfsprogs/tree/for-dave-for-4.9-8


* [PATCH 01/63] vfs: support FS_XFLAG_COWEXTSIZE and get/set of CoW extent size hint
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
@ 2016-09-30  3:05 ` Darrick J. Wong
  2016-09-30  3:05 ` [PATCH 02/63] vfs: add a FALLOC_FL_UNSHARE mode to fallocate to unshare a range of blocks Darrick J. Wong
                   ` (62 subsequent siblings)
  63 siblings, 0 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:05 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel, Christoph Hellwig

Introduce XFLAGs for the new XFS CoW extent size hint, and actually
plumb the CoW extent size hint into the fsxattr structure.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 include/uapi/linux/fs.h |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)


diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 3b00f7c..cf39876 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -157,7 +157,8 @@ struct fsxattr {
 	__u32		fsx_extsize;	/* extsize field value (get/set)*/
 	__u32		fsx_nextents;	/* nextents field value (get)	*/
 	__u32		fsx_projid;	/* project identifier (get/set) */
-	unsigned char	fsx_pad[12];
+	__u32		fsx_cowextsize;	/* CoW extsize field value (get/set)*/
+	unsigned char	fsx_pad[8];
 };
 
 /*
@@ -178,6 +179,7 @@ struct fsxattr {
 #define FS_XFLAG_NODEFRAG	0x00002000	/* do not defragment */
 #define FS_XFLAG_FILESTREAM	0x00004000	/* use filestream allocator */
 #define FS_XFLAG_DAX		0x00008000	/* use DAX for IO */
+#define FS_XFLAG_COWEXTSIZE	0x00010000	/* CoW extent size allocator hint */
 #define FS_XFLAG_HASATTR	0x80000000	/* no DIFLAG for this	*/
 
 /* the read-only stuff doesn't really belong here, but any other place is



* [PATCH 02/63] vfs: add a FALLOC_FL_UNSHARE mode to fallocate to unshare a range of blocks
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
  2016-09-30  3:05 ` [PATCH 01/63] vfs: support FS_XFLAG_COWEXTSIZE and get/set of CoW extent size hint Darrick J. Wong
@ 2016-09-30  3:05 ` Darrick J. Wong
  2016-09-30  7:08   ` Christoph Hellwig
  2016-09-30  3:05 ` [PATCH 03/63] xfs: return an error when an inline directory is too small Darrick J. Wong
                   ` (61 subsequent siblings)
  63 siblings, 1 reply; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:05 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel

Add a new fallocate mode flag that explicitly unshares blocks on
filesystems that support such features.  The new flag can only
be used with an allocate-mode fallocate call.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/open.c                   |    5 +++++
 include/linux/falloc.h      |    3 ++-
 include/uapi/linux/falloc.h |   18 ++++++++++++++++++
 3 files changed, 25 insertions(+), 1 deletion(-)


diff --git a/fs/open.c b/fs/open.c
index 4fd6e25..d58525d 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -256,6 +256,11 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 	    (mode & ~FALLOC_FL_INSERT_RANGE))
 		return -EINVAL;
 
+	/* Unshare range should only be used with allocate mode. */
+	if ((mode & FALLOC_FL_UNSHARE_RANGE) &&
+	    (mode & ~(FALLOC_FL_UNSHARE_RANGE | FALLOC_FL_KEEP_SIZE)))
+		return -EINVAL;
+
 	if (!(file->f_mode & FMODE_WRITE))
 		return -EBADF;
 
diff --git a/include/linux/falloc.h b/include/linux/falloc.h
index 9961110..7494dc6 100644
--- a/include/linux/falloc.h
+++ b/include/linux/falloc.h
@@ -25,6 +25,7 @@ struct space_resv {
 					 FALLOC_FL_PUNCH_HOLE |		\
 					 FALLOC_FL_COLLAPSE_RANGE |	\
 					 FALLOC_FL_ZERO_RANGE |		\
-					 FALLOC_FL_INSERT_RANGE)
+					 FALLOC_FL_INSERT_RANGE |	\
+					 FALLOC_FL_UNSHARE_RANGE)
 
 #endif /* _FALLOC_H_ */
diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
index 3e445a7..b075f60 100644
--- a/include/uapi/linux/falloc.h
+++ b/include/uapi/linux/falloc.h
@@ -58,4 +58,22 @@
  */
 #define FALLOC_FL_INSERT_RANGE		0x20
 
+/*
+ * FALLOC_FL_UNSHARE_RANGE is used to unshare shared blocks within the
+ * file size without overwriting any existing data. The purpose of this
+ * call is to preemptively reallocate any blocks that are subject to
+ * copy-on-write.
+ *
+ * Different filesystems may implement different limitations on the
+ * granularity of the operation. Most will limit operations to filesystem
+ * block size boundaries, but this boundary may be larger or smaller
+ * depending on the filesystem and/or the configuration of the filesystem
+ * or file.
+ *
+ * This flag can only be used with allocate-mode fallocate, which is
+ * to say that it cannot be used with the punch, zero, collapse, or
+ * insert range modes.
+ */
+#define FALLOC_FL_UNSHARE_RANGE		0x40
+
 #endif /* _UAPI_FALLOC_H_ */



* [PATCH 03/63] xfs: return an error when an inline directory is too small
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
  2016-09-30  3:05 ` [PATCH 01/63] vfs: support FS_XFLAG_COWEXTSIZE and get/set of CoW extent size hint Darrick J. Wong
  2016-09-30  3:05 ` [PATCH 02/63] vfs: add a FALLOC_FL_UNSHARE mode to fallocate to unshare a range of blocks Darrick J. Wong
@ 2016-09-30  3:05 ` Darrick J. Wong
  2016-09-30  3:06 ` [PATCH 04/63] xfs: define tracepoints for refcount btree activities Darrick J. Wong
                   ` (60 subsequent siblings)
  63 siblings, 0 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:05 UTC (permalink / raw)
  To: david, darrick.wong
  Cc: linux-xfs, linux-fsdevel, Brian Foster, Jan Kara, Christoph Hellwig

If the size of an inline directory is so small that it doesn't
even cover the required header size, return an error to userspace
instead of ASSERTing and returning 0 like everything's ok.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reported-by: Jan Kara <jack@suse.cz>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_dir2_readdir.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c
index f44f799..2981698 100644
--- a/fs/xfs/xfs_dir2_readdir.c
+++ b/fs/xfs/xfs_dir2_readdir.c
@@ -84,7 +84,8 @@ xfs_dir2_sf_getdents(
 
 	sfp = (xfs_dir2_sf_hdr_t *)dp->i_df.if_u1.if_data;
 
-	ASSERT(dp->i_d.di_size >= xfs_dir2_sf_hdr_size(sfp->i8count));
+	if (dp->i_d.di_size < xfs_dir2_sf_hdr_size(sfp->i8count))
+		return -EFSCORRUPTED;
 
 	/*
 	 * If the block number in the offset is out of range, we're done.



* [PATCH 04/63] xfs: define tracepoints for refcount btree activities
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (2 preceding siblings ...)
  2016-09-30  3:05 ` [PATCH 03/63] xfs: return an error when an inline directory is too small Darrick J. Wong
@ 2016-09-30  3:06 ` Darrick J. Wong
  2016-09-30  3:06 ` [PATCH 05/63] xfs: introduce refcount btree definitions Darrick J. Wong
                   ` (59 subsequent siblings)
  63 siblings, 0 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:06 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel, Christoph Hellwig

Define all the tracepoints we need to inspect the refcount btree
runtime operation.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_trace.h |  301 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 301 insertions(+)


diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index c6b2b1d..8446338 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -40,6 +40,16 @@ struct xfs_inode_log_format;
 struct xfs_bmbt_irec;
 struct xfs_btree_cur;
 
+#ifndef XFS_REFCOUNT_IREC_PLACEHOLDER
+#define XFS_REFCOUNT_IREC_PLACEHOLDER
+/* Placeholder definition to avoid breaking bisectability. */
+struct xfs_refcount_irec {
+	xfs_agblock_t	rc_startblock;	/* starting block number */
+	xfs_extlen_t	rc_blockcount;	/* count of blocks in extent */
+	xfs_nlink_t	rc_refcount;	/* number of extent owners */
+};
+#endif
+
 DECLARE_EVENT_CLASS(xfs_attr_list_class,
 	TP_PROTO(struct xfs_attr_list_context *ctx),
 	TP_ARGS(ctx),
@@ -2640,6 +2650,297 @@ DEFINE_AG_RESV_EVENT(xfs_ag_resv_needed);
 DEFINE_AG_ERROR_EVENT(xfs_ag_resv_free_error);
 DEFINE_AG_ERROR_EVENT(xfs_ag_resv_init_error);
 
+/* refcount tracepoint classes */
+
+/* reuse the discard trace class for agbno/aglen-based traces */
+#define DEFINE_AG_EXTENT_EVENT(name) DEFINE_DISCARD_EVENT(name)
+
+/* ag btree lookup tracepoint class */
+#define XFS_AG_BTREE_CMP_FORMAT_STR \
+	{ XFS_LOOKUP_EQ,	"eq" }, \
+	{ XFS_LOOKUP_LE,	"le" }, \
+	{ XFS_LOOKUP_GE,	"ge" }
+DECLARE_EVENT_CLASS(xfs_ag_btree_lookup_class,
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
+		 xfs_agblock_t agbno, xfs_lookup_t dir),
+	TP_ARGS(mp, agno, agbno, dir),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_agblock_t, agbno)
+		__field(xfs_lookup_t, dir)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->agno = agno;
+		__entry->agbno = agbno;
+		__entry->dir = dir;
+	),
+	TP_printk("dev %d:%d agno %u agbno %u cmp %s(%d)\n",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->agno,
+		  __entry->agbno,
+		  __print_symbolic(__entry->dir, XFS_AG_BTREE_CMP_FORMAT_STR),
+		  __entry->dir)
+)
+
+#define DEFINE_AG_BTREE_LOOKUP_EVENT(name) \
+DEFINE_EVENT(xfs_ag_btree_lookup_class, name, \
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, \
+		 xfs_agblock_t agbno, xfs_lookup_t dir), \
+	TP_ARGS(mp, agno, agbno, dir))
+
+/* single-rcext tracepoint class */
+DECLARE_EVENT_CLASS(xfs_refcount_extent_class,
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
+		 struct xfs_refcount_irec *irec),
+	TP_ARGS(mp, agno, irec),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_agblock_t, startblock)
+		__field(xfs_extlen_t, blockcount)
+		__field(xfs_nlink_t, refcount)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->agno = agno;
+		__entry->startblock = irec->rc_startblock;
+		__entry->blockcount = irec->rc_blockcount;
+		__entry->refcount = irec->rc_refcount;
+	),
+	TP_printk("dev %d:%d agno %u agbno %u len %u refcount %u\n",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->agno,
+		  __entry->startblock,
+		  __entry->blockcount,
+		  __entry->refcount)
+)
+
+#define DEFINE_REFCOUNT_EXTENT_EVENT(name) \
+DEFINE_EVENT(xfs_refcount_extent_class, name, \
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, \
+		 struct xfs_refcount_irec *irec), \
+	TP_ARGS(mp, agno, irec))
+
+/* single-rcext and an agbno tracepoint class */
+DECLARE_EVENT_CLASS(xfs_refcount_extent_at_class,
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
+		 struct xfs_refcount_irec *irec, xfs_agblock_t agbno),
+	TP_ARGS(mp, agno, irec, agbno),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_agblock_t, startblock)
+		__field(xfs_extlen_t, blockcount)
+		__field(xfs_nlink_t, refcount)
+		__field(xfs_agblock_t, agbno)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->agno = agno;
+		__entry->startblock = irec->rc_startblock;
+		__entry->blockcount = irec->rc_blockcount;
+		__entry->refcount = irec->rc_refcount;
+		__entry->agbno = agbno;
+	),
+	TP_printk("dev %d:%d agno %u agbno %u len %u refcount %u @ agbno %u\n",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->agno,
+		  __entry->startblock,
+		  __entry->blockcount,
+		  __entry->refcount,
+		  __entry->agbno)
+)
+
+#define DEFINE_REFCOUNT_EXTENT_AT_EVENT(name) \
+DEFINE_EVENT(xfs_refcount_extent_at_class, name, \
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, \
+		 struct xfs_refcount_irec *irec, xfs_agblock_t agbno), \
+	TP_ARGS(mp, agno, irec, agbno))
+
+/* double-rcext tracepoint class */
+DECLARE_EVENT_CLASS(xfs_refcount_double_extent_class,
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
+		 struct xfs_refcount_irec *i1, struct xfs_refcount_irec *i2),
+	TP_ARGS(mp, agno, i1, i2),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_agblock_t, i1_startblock)
+		__field(xfs_extlen_t, i1_blockcount)
+		__field(xfs_nlink_t, i1_refcount)
+		__field(xfs_agblock_t, i2_startblock)
+		__field(xfs_extlen_t, i2_blockcount)
+		__field(xfs_nlink_t, i2_refcount)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->agno = agno;
+		__entry->i1_startblock = i1->rc_startblock;
+		__entry->i1_blockcount = i1->rc_blockcount;
+		__entry->i1_refcount = i1->rc_refcount;
+		__entry->i2_startblock = i2->rc_startblock;
+		__entry->i2_blockcount = i2->rc_blockcount;
+		__entry->i2_refcount = i2->rc_refcount;
+	),
+	TP_printk("dev %d:%d agno %u agbno %u len %u refcount %u -- "
+		  "agbno %u len %u refcount %u\n",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->agno,
+		  __entry->i1_startblock,
+		  __entry->i1_blockcount,
+		  __entry->i1_refcount,
+		  __entry->i2_startblock,
+		  __entry->i2_blockcount,
+		  __entry->i2_refcount)
+)
+
+#define DEFINE_REFCOUNT_DOUBLE_EXTENT_EVENT(name) \
+DEFINE_EVENT(xfs_refcount_double_extent_class, name, \
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, \
+		 struct xfs_refcount_irec *i1, struct xfs_refcount_irec *i2), \
+	TP_ARGS(mp, agno, i1, i2))
+
+/* double-rcext and an agbno tracepoint class */
+DECLARE_EVENT_CLASS(xfs_refcount_double_extent_at_class,
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
+		 struct xfs_refcount_irec *i1, struct xfs_refcount_irec *i2,
+		 xfs_agblock_t agbno),
+	TP_ARGS(mp, agno, i1, i2, agbno),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_agblock_t, i1_startblock)
+		__field(xfs_extlen_t, i1_blockcount)
+		__field(xfs_nlink_t, i1_refcount)
+		__field(xfs_agblock_t, i2_startblock)
+		__field(xfs_extlen_t, i2_blockcount)
+		__field(xfs_nlink_t, i2_refcount)
+		__field(xfs_agblock_t, agbno)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->agno = agno;
+		__entry->i1_startblock = i1->rc_startblock;
+		__entry->i1_blockcount = i1->rc_blockcount;
+		__entry->i1_refcount = i1->rc_refcount;
+		__entry->i2_startblock = i2->rc_startblock;
+		__entry->i2_blockcount = i2->rc_blockcount;
+		__entry->i2_refcount = i2->rc_refcount;
+		__entry->agbno = agbno;
+	),
+	TP_printk("dev %d:%d agno %u agbno %u len %u refcount %u -- "
+		  "agbno %u len %u refcount %u @ agbno %u\n",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->agno,
+		  __entry->i1_startblock,
+		  __entry->i1_blockcount,
+		  __entry->i1_refcount,
+		  __entry->i2_startblock,
+		  __entry->i2_blockcount,
+		  __entry->i2_refcount,
+		  __entry->agbno)
+)
+
+#define DEFINE_REFCOUNT_DOUBLE_EXTENT_AT_EVENT(name) \
+DEFINE_EVENT(xfs_refcount_double_extent_at_class, name, \
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, \
+		 struct xfs_refcount_irec *i1, struct xfs_refcount_irec *i2, \
+		 xfs_agblock_t agbno), \
+	TP_ARGS(mp, agno, i1, i2, agbno))
+
+/* triple-rcext tracepoint class */
+DECLARE_EVENT_CLASS(xfs_refcount_triple_extent_class,
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
+		 struct xfs_refcount_irec *i1, struct xfs_refcount_irec *i2,
+		 struct xfs_refcount_irec *i3),
+	TP_ARGS(mp, agno, i1, i2, i3),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_agblock_t, i1_startblock)
+		__field(xfs_extlen_t, i1_blockcount)
+		__field(xfs_nlink_t, i1_refcount)
+		__field(xfs_agblock_t, i2_startblock)
+		__field(xfs_extlen_t, i2_blockcount)
+		__field(xfs_nlink_t, i2_refcount)
+		__field(xfs_agblock_t, i3_startblock)
+		__field(xfs_extlen_t, i3_blockcount)
+		__field(xfs_nlink_t, i3_refcount)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->agno = agno;
+		__entry->i1_startblock = i1->rc_startblock;
+		__entry->i1_blockcount = i1->rc_blockcount;
+		__entry->i1_refcount = i1->rc_refcount;
+		__entry->i2_startblock = i2->rc_startblock;
+		__entry->i2_blockcount = i2->rc_blockcount;
+		__entry->i2_refcount = i2->rc_refcount;
+		__entry->i3_startblock = i3->rc_startblock;
+		__entry->i3_blockcount = i3->rc_blockcount;
+		__entry->i3_refcount = i3->rc_refcount;
+	),
+	TP_printk("dev %d:%d agno %u agbno %u len %u refcount %u -- "
+		  "agbno %u len %u refcount %u -- "
+		  "agbno %u len %u refcount %u\n",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->agno,
+		  __entry->i1_startblock,
+		  __entry->i1_blockcount,
+		  __entry->i1_refcount,
+		  __entry->i2_startblock,
+		  __entry->i2_blockcount,
+		  __entry->i2_refcount,
+		  __entry->i3_startblock,
+		  __entry->i3_blockcount,
+		  __entry->i3_refcount)
+);
+
+#define DEFINE_REFCOUNT_TRIPLE_EXTENT_EVENT(name) \
+DEFINE_EVENT(xfs_refcount_triple_extent_class, name, \
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, \
+		 struct xfs_refcount_irec *i1, struct xfs_refcount_irec *i2, \
+		 struct xfs_refcount_irec *i3), \
+	TP_ARGS(mp, agno, i1, i2, i3))
+
+/* refcount btree tracepoints */
+DEFINE_BUSY_EVENT(xfs_refcountbt_alloc_block);
+DEFINE_BUSY_EVENT(xfs_refcountbt_free_block);
+DEFINE_AG_BTREE_LOOKUP_EVENT(xfs_refcount_lookup);
+DEFINE_REFCOUNT_EXTENT_EVENT(xfs_refcount_get);
+DEFINE_REFCOUNT_EXTENT_EVENT(xfs_refcount_update);
+DEFINE_REFCOUNT_EXTENT_EVENT(xfs_refcount_insert);
+DEFINE_REFCOUNT_EXTENT_EVENT(xfs_refcount_delete);
+DEFINE_AG_ERROR_EVENT(xfs_refcount_insert_error);
+DEFINE_AG_ERROR_EVENT(xfs_refcount_delete_error);
+DEFINE_AG_ERROR_EVENT(xfs_refcount_update_error);
+
+/* refcount adjustment tracepoints */
+DEFINE_AG_EXTENT_EVENT(xfs_refcount_increase);
+DEFINE_AG_EXTENT_EVENT(xfs_refcount_decrease);
+DEFINE_REFCOUNT_TRIPLE_EXTENT_EVENT(xfs_refcount_merge_center_extents);
+DEFINE_REFCOUNT_EXTENT_EVENT(xfs_refcount_modify_extent);
+DEFINE_REFCOUNT_EXTENT_AT_EVENT(xfs_refcount_split_extent);
+DEFINE_REFCOUNT_DOUBLE_EXTENT_EVENT(xfs_refcount_merge_left_extent);
+DEFINE_REFCOUNT_DOUBLE_EXTENT_EVENT(xfs_refcount_merge_right_extent);
+DEFINE_REFCOUNT_DOUBLE_EXTENT_AT_EVENT(xfs_refcount_find_left_extent);
+DEFINE_REFCOUNT_DOUBLE_EXTENT_AT_EVENT(xfs_refcount_find_right_extent);
+DEFINE_AG_ERROR_EVENT(xfs_refcount_adjust_error);
+DEFINE_AG_ERROR_EVENT(xfs_refcount_merge_center_extents_error);
+DEFINE_AG_ERROR_EVENT(xfs_refcount_modify_extent_error);
+DEFINE_AG_ERROR_EVENT(xfs_refcount_split_extent_error);
+DEFINE_AG_ERROR_EVENT(xfs_refcount_merge_left_extent_error);
+DEFINE_AG_ERROR_EVENT(xfs_refcount_merge_right_extent_error);
+DEFINE_AG_ERROR_EVENT(xfs_refcount_find_left_extent_error);
+DEFINE_AG_ERROR_EVENT(xfs_refcount_find_right_extent_error);
+
+/* reflink helpers */
+DEFINE_AG_EXTENT_EVENT(xfs_refcount_find_shared);
+DEFINE_AG_EXTENT_EVENT(xfs_refcount_find_shared_result);
+DEFINE_AG_ERROR_EVENT(xfs_refcount_find_shared_error);
+
 #endif /* _TRACE_XFS_H */
 
 #undef TRACE_INCLUDE_PATH



* [PATCH 05/63] xfs: introduce refcount btree definitions
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (3 preceding siblings ...)
  2016-09-30  3:06 ` [PATCH 04/63] xfs: define tracepoints for refcount btree activities Darrick J. Wong
@ 2016-09-30  3:06 ` Darrick J. Wong
  2016-09-30  3:06 ` [PATCH 06/63] xfs: refcount btree add more reserved blocks Darrick J. Wong
                   ` (58 subsequent siblings)
  63 siblings, 0 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:06 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel, Christoph Hellwig

Add the new refcount btree definitions to the per-AG structures.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/libxfs/xfs_alloc.c      |    5 +++++
 fs/xfs/libxfs/xfs_btree.c      |    5 +++--
 fs/xfs/libxfs/xfs_btree.h      |    4 ++++
 fs/xfs/libxfs/xfs_format.h     |   31 +++++++++++++++++++++++++++----
 fs/xfs/libxfs/xfs_rmap_btree.c |   22 ++++++++++++++++++++--
 fs/xfs/libxfs/xfs_types.h      |    2 +-
 fs/xfs/xfs_inode.h             |    5 +++++
 fs/xfs/xfs_mount.h             |    3 +++
 fs/xfs/xfs_pnfs.c              |    7 +++++++
 fs/xfs/xfs_stats.c             |    1 +
 fs/xfs/xfs_stats.h             |   18 +++++++++++++++++-
 11 files changed, 93 insertions(+), 10 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index ca75dc9..275d345 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2458,6 +2458,10 @@ xfs_agf_verify(
 	    be32_to_cpu(agf->agf_btreeblks) > be32_to_cpu(agf->agf_length))
 		return false;
 
+	if (xfs_sb_version_hasreflink(&mp->m_sb) &&
+	    be32_to_cpu(agf->agf_refcount_level) > XFS_BTREE_MAXLEVELS)
+		return false;
+
 	return true;;
 
 }
@@ -2578,6 +2582,7 @@ xfs_alloc_read_agf(
 			be32_to_cpu(agf->agf_levels[XFS_BTNUM_CNTi]);
 		pag->pagf_levels[XFS_BTNUM_RMAPi] =
 			be32_to_cpu(agf->agf_levels[XFS_BTNUM_RMAPi]);
+		pag->pagf_refcount_level = be32_to_cpu(agf->agf_refcount_level);
 		spin_lock_init(&pag->pagb_lock);
 		pag->pagb_count = 0;
 		pag->pagb_tree = RB_ROOT;
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index aa1752f..f8bab9b 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -45,9 +45,10 @@ kmem_zone_t	*xfs_btree_cur_zone;
  */
 static const __uint32_t xfs_magics[2][XFS_BTNUM_MAX] = {
 	{ XFS_ABTB_MAGIC, XFS_ABTC_MAGIC, 0, XFS_BMAP_MAGIC, XFS_IBT_MAGIC,
-	  XFS_FIBT_MAGIC },
+	  XFS_FIBT_MAGIC, 0 },
 	{ XFS_ABTB_CRC_MAGIC, XFS_ABTC_CRC_MAGIC, XFS_RMAP_CRC_MAGIC,
-	  XFS_BMAP_CRC_MAGIC, XFS_IBT_CRC_MAGIC, XFS_FIBT_CRC_MAGIC }
+	  XFS_BMAP_CRC_MAGIC, XFS_IBT_CRC_MAGIC, XFS_FIBT_CRC_MAGIC,
+	  XFS_REFC_CRC_MAGIC }
 };
 #define xfs_btree_magic(cur) \
 	xfs_magics[!!((cur)->bc_flags & XFS_BTREE_CRC_BLOCKS)][cur->bc_btnum]
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index 3f8556a..e7ef1d9 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -72,6 +72,7 @@ union xfs_btree_rec {
 #define	XFS_BTNUM_INO	((xfs_btnum_t)XFS_BTNUM_INOi)
 #define	XFS_BTNUM_FINO	((xfs_btnum_t)XFS_BTNUM_FINOi)
 #define	XFS_BTNUM_RMAP	((xfs_btnum_t)XFS_BTNUM_RMAPi)
+#define	XFS_BTNUM_REFC	((xfs_btnum_t)XFS_BTNUM_REFCi)
 
 /*
  * For logging record fields.
@@ -105,6 +106,7 @@ do {    \
 	case XFS_BTNUM_INO: __XFS_BTREE_STATS_INC(__mp, ibt, stat); break; \
 	case XFS_BTNUM_FINO: __XFS_BTREE_STATS_INC(__mp, fibt, stat); break; \
 	case XFS_BTNUM_RMAP: __XFS_BTREE_STATS_INC(__mp, rmap, stat); break; \
+	case XFS_BTNUM_REFC: __XFS_BTREE_STATS_INC(__mp, refcbt, stat); break; \
 	case XFS_BTNUM_MAX: ASSERT(0); /* fucking gcc */ ; break;	\
 	}       \
 } while (0)
@@ -127,6 +129,8 @@ do {    \
 		__XFS_BTREE_STATS_ADD(__mp, fibt, stat, val); break; \
 	case XFS_BTNUM_RMAP:	\
 		__XFS_BTREE_STATS_ADD(__mp, rmap, stat, val); break; \
+	case XFS_BTNUM_REFC:	\
+		__XFS_BTREE_STATS_ADD(__mp, refcbt, stat, val); break; \
 	case XFS_BTNUM_MAX: ASSERT(0); /* fucking gcc */ ; break; \
 	}       \
 } while (0)
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 270fb5c..57c52a6 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -456,6 +456,7 @@ xfs_sb_has_compat_feature(
 
 #define XFS_SB_FEAT_RO_COMPAT_FINOBT   (1 << 0)		/* free inode btree */
 #define XFS_SB_FEAT_RO_COMPAT_RMAPBT   (1 << 1)		/* reverse map btree */
+#define XFS_SB_FEAT_RO_COMPAT_REFLINK  (1 << 2)		/* reflinked files */
 #define XFS_SB_FEAT_RO_COMPAT_ALL \
 		(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
 		 XFS_SB_FEAT_RO_COMPAT_RMAPBT)
@@ -546,6 +547,12 @@ static inline bool xfs_sb_version_hasrmapbt(struct xfs_sb *sbp)
 		(sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_RMAPBT);
 }
 
+static inline bool xfs_sb_version_hasreflink(struct xfs_sb *sbp)
+{
+	return XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5 &&
+		(sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_REFLINK);
+}
+
 /*
  * end of superblock version macros
  */
@@ -641,14 +648,17 @@ typedef struct xfs_agf {
 	uuid_t		agf_uuid;	/* uuid of filesystem */
 
 	__be32		agf_rmap_blocks;	/* rmapbt blocks used */
-	__be32		agf_padding;		/* padding */
+	__be32		agf_refcount_blocks;	/* refcountbt blocks used */
+
+	__be32		agf_refcount_root;	/* refcount tree root block */
+	__be32		agf_refcount_level;	/* refcount btree levels */
 
 	/*
 	 * reserve some contiguous space for future logged fields before we add
 	 * the unlogged fields. This makes the range logging via flags and
 	 * structure offsets much simpler.
 	 */
-	__be64		agf_spare64[15];
+	__be64		agf_spare64[14];
 
 	/* unlogged fields, written during buffer writeback. */
 	__be64		agf_lsn;	/* last write sequence */
@@ -1041,9 +1051,14 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
  * 16 bits of the XFS_XFLAG_s range.
  */
 #define XFS_DIFLAG2_DAX_BIT	0	/* use DAX for this inode */
+#define XFS_DIFLAG2_REFLINK_BIT	1	/* file's blocks may be shared */
+#define XFS_DIFLAG2_COWEXTSIZE_BIT   2  /* copy on write extent size hint */
 #define XFS_DIFLAG2_DAX		(1 << XFS_DIFLAG2_DAX_BIT)
+#define XFS_DIFLAG2_REFLINK     (1 << XFS_DIFLAG2_REFLINK_BIT)
+#define XFS_DIFLAG2_COWEXTSIZE  (1 << XFS_DIFLAG2_COWEXTSIZE_BIT)
 
-#define XFS_DIFLAG2_ANY		(XFS_DIFLAG2_DAX)
+#define XFS_DIFLAG2_ANY \
+	(XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE)
 
 /*
  * Inode number format:
@@ -1353,7 +1368,8 @@ struct xfs_owner_info {
 #define XFS_RMAP_OWN_AG		(-5ULL)	/* AG freespace btree blocks */
 #define XFS_RMAP_OWN_INOBT	(-6ULL)	/* Inode btree blocks */
 #define XFS_RMAP_OWN_INODES	(-7ULL)	/* Inode chunk */
-#define XFS_RMAP_OWN_MIN	(-8ULL) /* guard */
+#define XFS_RMAP_OWN_REFC	(-8ULL) /* refcount tree */
+#define XFS_RMAP_OWN_MIN	(-9ULL) /* guard */
 
 #define XFS_RMAP_NON_INODE_OWNER(owner)	(!!((owner) & (1ULL << 63)))
 
@@ -1434,6 +1450,13 @@ typedef __be32 xfs_rmap_ptr_t;
 	 XFS_IBT_BLOCK(mp) + 1)
 
 /*
+ * Reference Count Btree format definitions
+ *
+ */
+#define	XFS_REFC_CRC_MAGIC	0x52334643	/* 'R3FC' */
+
+
+/*
  * BMAP Btree format definitions
  *
  * This includes both the root block definition that sits inside an inode fork
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
index 17b8eeb..9c0585e 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.c
+++ b/fs/xfs/libxfs/xfs_rmap_btree.c
@@ -512,6 +512,24 @@ void
 xfs_rmapbt_compute_maxlevels(
 	struct xfs_mount		*mp)
 {
-	mp->m_rmap_maxlevels = xfs_btree_compute_maxlevels(mp,
-			mp->m_rmap_mnr, mp->m_sb.sb_agblocks);
+	/*
+	 * On a non-reflink filesystem, the maximum number of rmap
+	 * records is the number of blocks in the AG, hence the max
+	 * rmapbt height is log_$maxrecs($agblocks).  However, with
+	 * reflink each AG block can have up to 2^32 (per the refcount
+	 * record format) owners, which means that theoretically we
+	 * could face up to 2^64 rmap records.
+	 *
+	 * That effectively means that the max rmapbt height must be
+	 * XFS_BTREE_MAXLEVELS.  "Fortunately" we'll run out of AG
+	 * blocks to feed the rmapbt long before the rmapbt reaches
+	 * maximum height.  The reflink code uses ag_resv_critical to
+	 * disallow reflinking when less than 10% of the per-AG metadata
+	 * block reservation since the fallback is a regular file copy.
+	 */
+	if (xfs_sb_version_hasreflink(&mp->m_sb))
+		mp->m_rmap_maxlevels = XFS_BTREE_MAXLEVELS;
+	else
+		mp->m_rmap_maxlevels = xfs_btree_compute_maxlevels(mp,
+				mp->m_rmap_mnr, mp->m_sb.sb_agblocks);
 }
diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
index 3d50364..be7b6de 100644
--- a/fs/xfs/libxfs/xfs_types.h
+++ b/fs/xfs/libxfs/xfs_types.h
@@ -109,7 +109,7 @@ typedef enum {
 
 typedef enum {
 	XFS_BTNUM_BNOi, XFS_BTNUM_CNTi, XFS_BTNUM_RMAPi, XFS_BTNUM_BMAPi,
-	XFS_BTNUM_INOi, XFS_BTNUM_FINOi, XFS_BTNUM_MAX
+	XFS_BTNUM_INOi, XFS_BTNUM_FINOi, XFS_BTNUM_REFCi, XFS_BTNUM_MAX
 } xfs_btnum_t;
 
 struct xfs_name {
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 8f30d25..a8658e6 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -202,6 +202,11 @@ xfs_get_initial_prid(struct xfs_inode *dp)
 	return XFS_PROJID_DEFAULT;
 }
 
+static inline bool xfs_is_reflink_inode(struct xfs_inode *ip)
+{
+	return ip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK;
+}
+
 /*
  * In-core inode flags.
  */
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 041d949..8fab496 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -399,6 +399,9 @@ typedef struct xfs_perag {
 	struct xfs_ag_resv	pag_meta_resv;
 	/* Blocks reserved for just AGFL-based metadata. */
 	struct xfs_ag_resv	pag_agfl_resv;
+
+	/* refcount btree root level */
+	__uint8_t		pagf_refcount_level;
 } xfs_perag_t;
 
 static inline struct xfs_ag_resv *
diff --git a/fs/xfs/xfs_pnfs.c b/fs/xfs/xfs_pnfs.c
index 0f14b2e..93a7aaf 100644
--- a/fs/xfs/xfs_pnfs.c
+++ b/fs/xfs/xfs_pnfs.c
@@ -114,6 +114,13 @@ xfs_fs_map_blocks(
 		return -ENXIO;
 
 	/*
+	 * The pNFS block layout spec actually supports reflink-like
+	 * functionality, but the Linux pNFS server doesn't implement it yet.
+	 */
+	if (xfs_is_reflink_inode(ip))
+		return -ENXIO;
+
+	/*
 	 * Lock out any other I/O before we flush and invalidate the pagecache,
 	 * and then hand out a layout to the remote system.  This is very
 	 * similar to direct I/O, except that the synchronization is much more
diff --git a/fs/xfs/xfs_stats.c b/fs/xfs/xfs_stats.c
index 6e812fe0..12d48cd 100644
--- a/fs/xfs/xfs_stats.c
+++ b/fs/xfs/xfs_stats.c
@@ -62,6 +62,7 @@ int xfs_stats_format(struct xfsstats __percpu *stats, char *buf)
 		{ "ibt2",		XFSSTAT_END_IBT_V2		},
 		{ "fibt2",		XFSSTAT_END_FIBT_V2		},
 		{ "rmapbt",		XFSSTAT_END_RMAP_V2		},
+		{ "refcntbt",		XFSSTAT_END_REFCOUNT		},
 		/* we print both series of quota information together */
 		{ "qm",			XFSSTAT_END_QM			},
 	};
diff --git a/fs/xfs/xfs_stats.h b/fs/xfs/xfs_stats.h
index 657865f..79ad2e6 100644
--- a/fs/xfs/xfs_stats.h
+++ b/fs/xfs/xfs_stats.h
@@ -213,7 +213,23 @@ struct xfsstats {
 	__uint32_t		xs_rmap_2_alloc;
 	__uint32_t		xs_rmap_2_free;
 	__uint32_t		xs_rmap_2_moves;
-#define XFSSTAT_END_XQMSTAT		(XFSSTAT_END_RMAP_V2+6)
+#define XFSSTAT_END_REFCOUNT		(XFSSTAT_END_RMAP_V2 + 15)
+	__uint32_t		xs_refcbt_2_lookup;
+	__uint32_t		xs_refcbt_2_compare;
+	__uint32_t		xs_refcbt_2_insrec;
+	__uint32_t		xs_refcbt_2_delrec;
+	__uint32_t		xs_refcbt_2_newroot;
+	__uint32_t		xs_refcbt_2_killroot;
+	__uint32_t		xs_refcbt_2_increment;
+	__uint32_t		xs_refcbt_2_decrement;
+	__uint32_t		xs_refcbt_2_lshift;
+	__uint32_t		xs_refcbt_2_rshift;
+	__uint32_t		xs_refcbt_2_split;
+	__uint32_t		xs_refcbt_2_join;
+	__uint32_t		xs_refcbt_2_alloc;
+	__uint32_t		xs_refcbt_2_free;
+	__uint32_t		xs_refcbt_2_moves;
+#define XFSSTAT_END_XQMSTAT		(XFSSTAT_END_REFCOUNT + 6)
 	__uint32_t		xs_qm_dqreclaims;
 	__uint32_t		xs_qm_dqreclaim_misses;
 	__uint32_t		xs_qm_dquot_dups;


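The rmapbt maxlevels logic at the top of this patch can be sketched in Python. This is a hedged mirror of the shape of xfs_btree_compute_maxlevels() as described in the comment; the minrecs values and the XFS_BTREE_MAXLEVELS stand-in below are illustrative assumptions, not taken from a real filesystem:

```python
# Hedged sketch (not kernel code): mirrors the shape of
# xfs_btree_compute_maxlevels(), which derives a worst-case btree
# height from the minimum records per block at each level.
XFS_BTREE_MAXLEVELS = 9   # assumed stand-in for the kernel constant

def compute_maxlevels(minrecs, record_count):
    """minrecs[0]: min leaf records/block; minrecs[1]: min node records/block."""
    maxblocks = -(-record_count // minrecs[0])   # ceiling division
    level = 1
    while maxblocks > 1:
        maxblocks = -(-maxblocks // minrecs[1])
        level += 1
    return level

def rmap_maxlevels(has_reflink, minrecs, agblocks):
    # With reflink, each AG block can have up to 2^32 owners, so rmap
    # records are no longer bounded by agblocks; use the hard cap.
    if has_reflink:
        return XFS_BTREE_MAXLEVELS
    return compute_maxlevels(minrecs, agblocks)
```

For example, with hypothetical minrecs of (168, 252) and a 2M-block AG, a non-reflink filesystem needs only 3 levels, while a reflink filesystem must pessimistically assume the cap.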
^ permalink raw reply related	[flat|nested] 187+ messages in thread

* [PATCH 06/63] xfs: refcount btree add more reserved blocks
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (4 preceding siblings ...)
  2016-09-30  3:06 ` [PATCH 05/63] xfs: introduce refcount btree definitions Darrick J. Wong
@ 2016-09-30  3:06 ` Darrick J. Wong
  2016-09-30  3:06 ` [PATCH 07/63] xfs: define the on-disk refcount btree format Darrick J. Wong
                   ` (57 subsequent siblings)
  63 siblings, 0 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:06 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel, Christoph Hellwig

Since XFS reserves a small amount of space in each AG as the minimum
free space needed for an operation, reserve some more space in case we
touch the refcount btree.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/libxfs/xfs_alloc.c  |   13 +++++++++++++
 fs/xfs/libxfs/xfs_format.h |    2 ++
 2 files changed, 15 insertions(+)


diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 275d345..aa0e1ca 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -52,10 +52,23 @@ STATIC int xfs_alloc_ag_vextent_size(xfs_alloc_arg_t *);
 STATIC int xfs_alloc_ag_vextent_small(xfs_alloc_arg_t *,
 		xfs_btree_cur_t *, xfs_agblock_t *, xfs_extlen_t *, int *);
 
+unsigned int
+xfs_refc_block(
+	struct xfs_mount	*mp)
+{
+	if (xfs_sb_version_hasrmapbt(&mp->m_sb))
+		return XFS_RMAP_BLOCK(mp) + 1;
+	if (xfs_sb_version_hasfinobt(&mp->m_sb))
+		return XFS_FIBT_BLOCK(mp) + 1;
+	return XFS_IBT_BLOCK(mp) + 1;
+}
+
 xfs_extlen_t
 xfs_prealloc_blocks(
 	struct xfs_mount	*mp)
 {
+	if (xfs_sb_version_hasreflink(&mp->m_sb))
+		return xfs_refc_block(mp) + 1;
 	if (xfs_sb_version_hasrmapbt(&mp->m_sb))
 		return XFS_RMAP_BLOCK(mp) + 1;
 	if (xfs_sb_version_hasfinobt(&mp->m_sb))
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 57c52a6..622055b 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -1455,6 +1455,8 @@ typedef __be32 xfs_rmap_ptr_t;
  */
 #define	XFS_REFC_CRC_MAGIC	0x52334643	/* 'R3FC' */
 
+unsigned int xfs_refc_block(struct xfs_mount *mp);
+
 
 /*
  * BMAP Btree format definitions


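The fallback chain in xfs_refc_block() and xfs_prealloc_blocks() above can be sketched as follows. The block numbers in the example layout are hypothetical placeholders, not the real XFS_*_BLOCK() values (a real AG begins with the superblock, AGF, AGI, AGFL, and free-space btree roots):

```python
# Hedged sketch of the static AG layout chain above, using a
# placeholder layout dict in place of the XFS_*_BLOCK() macros.

def refc_block(layout, has_rmapbt, has_finobt):
    # the refcount btree root lands right after the newest root present
    if has_rmapbt:
        return layout["rmap"] + 1
    if has_finobt:
        return layout["fino"] + 1
    return layout["ino"] + 1

def prealloc_blocks(layout, has_reflink, has_rmapbt, has_finobt):
    # first AG block available for general allocation
    if has_reflink:
        return refc_block(layout, has_rmapbt, has_finobt) + 1
    if has_rmapbt:
        return layout["rmap"] + 1
    if has_finobt:
        return layout["fino"] + 1
    return layout["ino"] + 1
```

With all optional btrees enabled, each new feature pushes the first allocatable block one further into the AG.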

* [PATCH 07/63] xfs: define the on-disk refcount btree format
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (5 preceding siblings ...)
  2016-09-30  3:06 ` [PATCH 06/63] xfs: refcount btree add more reserved blocks Darrick J. Wong
@ 2016-09-30  3:06 ` Darrick J. Wong
  2016-09-30  3:06 ` [PATCH 08/63] xfs: add refcount btree support to growfs Darrick J. Wong
                   ` (56 subsequent siblings)
  63 siblings, 0 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:06 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel, Christoph Hellwig

Start constructing the refcount btree implementation by establishing
the on-disk format and everything needed to read, write, and
manipulate the refcount btree blocks.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/Makefile                    |    1 
 fs/xfs/libxfs/xfs_btree.c          |    3 +
 fs/xfs/libxfs/xfs_btree.h          |   12 ++
 fs/xfs/libxfs/xfs_format.h         |   36 +++++++
 fs/xfs/libxfs/xfs_refcount_btree.c |  178 ++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_refcount_btree.h |   67 ++++++++++++++
 fs/xfs/libxfs/xfs_sb.c             |    9 ++
 fs/xfs/libxfs/xfs_shared.h         |    2 
 fs/xfs/libxfs/xfs_trans_resv.c     |    2 
 fs/xfs/libxfs/xfs_trans_resv.h     |    1 
 fs/xfs/xfs_mount.c                 |    2 
 fs/xfs/xfs_mount.h                 |    3 +
 fs/xfs/xfs_ondisk.h                |    3 +
 fs/xfs/xfs_trace.h                 |   11 --
 14 files changed, 319 insertions(+), 11 deletions(-)
 create mode 100644 fs/xfs/libxfs/xfs_refcount_btree.c
 create mode 100644 fs/xfs/libxfs/xfs_refcount_btree.h


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 584e87e..8d749f2 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -55,6 +55,7 @@ xfs-y				+= $(addprefix libxfs/, \
 				   xfs_ag_resv.o \
 				   xfs_rmap.o \
 				   xfs_rmap_btree.o \
+				   xfs_refcount_btree.o \
 				   xfs_sb.o \
 				   xfs_symlink_remote.o \
 				   xfs_trans_resv.o \
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index f8bab9b..5c8e6f2 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -1217,6 +1217,9 @@ xfs_btree_set_refs(
 	case XFS_BTNUM_RMAP:
 		xfs_buf_set_ref(bp, XFS_RMAP_BTREE_REF);
 		break;
+	case XFS_BTNUM_REFC:
+		xfs_buf_set_ref(bp, XFS_REFC_BTREE_REF);
+		break;
 	default:
 		ASSERT(0);
 	}
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index e7ef1d9..c2b01d1 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -49,6 +49,7 @@ union xfs_btree_key {
 	struct xfs_inobt_key		inobt;
 	struct xfs_rmap_key		rmap;
 	struct xfs_rmap_key		__rmap_bigkey[2];
+	struct xfs_refcount_key		refc;
 };
 
 union xfs_btree_rec {
@@ -57,6 +58,7 @@ union xfs_btree_rec {
 	struct xfs_alloc_rec		alloc;
 	struct xfs_inobt_rec		inobt;
 	struct xfs_rmap_rec		rmap;
+	struct xfs_refcount_rec		refc;
 };
 
 /*
@@ -221,6 +223,15 @@ union xfs_btree_irec {
 	struct xfs_bmbt_irec		b;
 	struct xfs_inobt_rec_incore	i;
 	struct xfs_rmap_irec		r;
+	struct xfs_refcount_irec	rc;
+};
+
+/* Per-AG btree private information. */
+union xfs_btree_cur_private {
+	struct {
+		unsigned long	nr_ops;		/* # record updates */
+		int		shape_changes;	/* # of extent splits */
+	} refc;
 };
 
 /*
@@ -247,6 +258,7 @@ typedef struct xfs_btree_cur
 			struct xfs_buf	*agbp;	/* agf/agi buffer pointer */
 			struct xfs_defer_ops *dfops;	/* deferred updates */
 			xfs_agnumber_t	agno;	/* ag number */
+			union xfs_btree_cur_private	priv;
 		} a;
 		struct {			/* needed for BMAP */
 			struct xfs_inode *ip;	/* pointer to our inode */
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 622055b..97c74f4 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -1457,6 +1457,42 @@ typedef __be32 xfs_rmap_ptr_t;
 
 unsigned int xfs_refc_block(struct xfs_mount *mp);
 
+/*
+ * Data record/key structure
+ *
+ * Each record associates a range of physical blocks (starting at
+ * rc_startblock and ending rc_blockcount blocks later) with a reference
+ * count (rc_refcount).  Extents that are being used to stage a copy on
+ * write (CoW) operation are recorded in the refcount btree with a
+ * refcount of 1.  All other records must have a refcount > 1 and must
+ * track an extent mapped only by file data forks.
+ *
+ * Extents with a single owner (attributes, metadata, non-shared file
+ * data) are not tracked here.  Free space is also not tracked here.
+ * This is consistent with pre-reflink XFS.
+ */
+struct xfs_refcount_rec {
+	__be32		rc_startblock;	/* starting block number */
+	__be32		rc_blockcount;	/* count of blocks */
+	__be32		rc_refcount;	/* number of inodes linked here */
+};
+
+struct xfs_refcount_key {
+	__be32		rc_startblock;	/* starting block number */
+};
+
+struct xfs_refcount_irec {
+	xfs_agblock_t	rc_startblock;	/* starting block number */
+	xfs_extlen_t	rc_blockcount;	/* count of blocks */
+	xfs_nlink_t	rc_refcount;	/* number of inodes linked here */
+};
+
+#define MAXREFCOUNT	((xfs_nlink_t)~0U)
+#define MAXREFCEXTLEN	((xfs_extlen_t)~0U)
+
+/* btree pointer type */
+typedef __be32 xfs_refcount_ptr_t;
+
 
 /*
  * BMAP Btree format definitions
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
new file mode 100644
index 0000000..359cf0c
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_refcount_btree.c
@@ -0,0 +1,178 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_sb.h"
+#include "xfs_mount.h"
+#include "xfs_btree.h"
+#include "xfs_bmap.h"
+#include "xfs_refcount_btree.h"
+#include "xfs_alloc.h"
+#include "xfs_error.h"
+#include "xfs_trace.h"
+#include "xfs_cksum.h"
+#include "xfs_trans.h"
+#include "xfs_bit.h"
+
+static struct xfs_btree_cur *
+xfs_refcountbt_dup_cursor(
+	struct xfs_btree_cur	*cur)
+{
+	return xfs_refcountbt_init_cursor(cur->bc_mp, cur->bc_tp,
+			cur->bc_private.a.agbp, cur->bc_private.a.agno,
+			cur->bc_private.a.dfops);
+}
+
+STATIC bool
+xfs_refcountbt_verify(
+	struct xfs_buf		*bp)
+{
+	struct xfs_mount	*mp = bp->b_target->bt_mount;
+	struct xfs_btree_block	*block = XFS_BUF_TO_BLOCK(bp);
+	struct xfs_perag	*pag = bp->b_pag;
+	unsigned int		level;
+
+	if (block->bb_magic != cpu_to_be32(XFS_REFC_CRC_MAGIC))
+		return false;
+
+	if (!xfs_sb_version_hasreflink(&mp->m_sb))
+		return false;
+	if (!xfs_btree_sblock_v5hdr_verify(bp))
+		return false;
+
+	level = be16_to_cpu(block->bb_level);
+	if (pag && pag->pagf_init) {
+		if (level >= pag->pagf_refcount_level)
+			return false;
+	} else if (level >= mp->m_refc_maxlevels)
+		return false;
+
+	return xfs_btree_sblock_verify(bp, mp->m_refc_mxr[level != 0]);
+}
+
+STATIC void
+xfs_refcountbt_read_verify(
+	struct xfs_buf	*bp)
+{
+	if (!xfs_btree_sblock_verify_crc(bp))
+		xfs_buf_ioerror(bp, -EFSBADCRC);
+	else if (!xfs_refcountbt_verify(bp))
+		xfs_buf_ioerror(bp, -EFSCORRUPTED);
+
+	if (bp->b_error) {
+		trace_xfs_btree_corrupt(bp, _RET_IP_);
+		xfs_verifier_error(bp);
+	}
+}
+
+STATIC void
+xfs_refcountbt_write_verify(
+	struct xfs_buf	*bp)
+{
+	if (!xfs_refcountbt_verify(bp)) {
+		trace_xfs_btree_corrupt(bp, _RET_IP_);
+		xfs_buf_ioerror(bp, -EFSCORRUPTED);
+		xfs_verifier_error(bp);
+		return;
+	}
+	xfs_btree_sblock_calc_crc(bp);
+
+}
+
+const struct xfs_buf_ops xfs_refcountbt_buf_ops = {
+	.name			= "xfs_refcountbt",
+	.verify_read		= xfs_refcountbt_read_verify,
+	.verify_write		= xfs_refcountbt_write_verify,
+};
+
+static const struct xfs_btree_ops xfs_refcountbt_ops = {
+	.rec_len		= sizeof(struct xfs_refcount_rec),
+	.key_len		= sizeof(struct xfs_refcount_key),
+
+	.dup_cursor		= xfs_refcountbt_dup_cursor,
+	.buf_ops		= &xfs_refcountbt_buf_ops,
+};
+
+/*
+ * Allocate a new refcount btree cursor.
+ */
+struct xfs_btree_cur *
+xfs_refcountbt_init_cursor(
+	struct xfs_mount	*mp,
+	struct xfs_trans	*tp,
+	struct xfs_buf		*agbp,
+	xfs_agnumber_t		agno,
+	struct xfs_defer_ops	*dfops)
+{
+	struct xfs_agf		*agf = XFS_BUF_TO_AGF(agbp);
+	struct xfs_btree_cur	*cur;
+
+	ASSERT(agno != NULLAGNUMBER);
+	ASSERT(agno < mp->m_sb.sb_agcount);
+	cur = kmem_zone_zalloc(xfs_btree_cur_zone, KM_NOFS);
+
+	cur->bc_tp = tp;
+	cur->bc_mp = mp;
+	cur->bc_btnum = XFS_BTNUM_REFC;
+	cur->bc_blocklog = mp->m_sb.sb_blocklog;
+	cur->bc_ops = &xfs_refcountbt_ops;
+
+	cur->bc_nlevels = be32_to_cpu(agf->agf_refcount_level);
+
+	cur->bc_private.a.agbp = agbp;
+	cur->bc_private.a.agno = agno;
+	cur->bc_private.a.dfops = dfops;
+	cur->bc_flags |= XFS_BTREE_CRC_BLOCKS;
+
+	cur->bc_private.a.priv.refc.nr_ops = 0;
+	cur->bc_private.a.priv.refc.shape_changes = 0;
+
+	return cur;
+}
+
+/*
+ * Calculate the number of records in a refcount btree block.
+ */
+int
+xfs_refcountbt_maxrecs(
+	struct xfs_mount	*mp,
+	int			blocklen,
+	bool			leaf)
+{
+	blocklen -= XFS_REFCOUNT_BLOCK_LEN;
+
+	if (leaf)
+		return blocklen / sizeof(struct xfs_refcount_rec);
+	return blocklen / (sizeof(struct xfs_refcount_key) +
+			   sizeof(xfs_refcount_ptr_t));
+}
+
+/* Compute the maximum height of a refcount btree. */
+void
+xfs_refcountbt_compute_maxlevels(
+	struct xfs_mount		*mp)
+{
+	mp->m_refc_maxlevels = xfs_btree_compute_maxlevels(mp,
+			mp->m_refc_mnr, mp->m_sb.sb_agblocks);
+}
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.h b/fs/xfs/libxfs/xfs_refcount_btree.h
new file mode 100644
index 0000000..9e9ad7c
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_refcount_btree.h
@@ -0,0 +1,67 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#ifndef __XFS_REFCOUNT_BTREE_H__
+#define	__XFS_REFCOUNT_BTREE_H__
+
+/*
+ * Reference Count Btree on-disk structures
+ */
+
+struct xfs_buf;
+struct xfs_btree_cur;
+struct xfs_mount;
+
+/*
+ * Btree block header size
+ */
+#define XFS_REFCOUNT_BLOCK_LEN	XFS_BTREE_SBLOCK_CRC_LEN
+
+/*
+ * Record, key, and pointer address macros for btree blocks.
+ *
+ * (note that some of these may appear unused, but they are used in userspace)
+ */
+#define XFS_REFCOUNT_REC_ADDR(block, index) \
+	((struct xfs_refcount_rec *) \
+		((char *)(block) + \
+		 XFS_REFCOUNT_BLOCK_LEN + \
+		 (((index) - 1) * sizeof(struct xfs_refcount_rec))))
+
+#define XFS_REFCOUNT_KEY_ADDR(block, index) \
+	((struct xfs_refcount_key *) \
+		((char *)(block) + \
+		 XFS_REFCOUNT_BLOCK_LEN + \
+		 ((index) - 1) * sizeof(struct xfs_refcount_key)))
+
+#define XFS_REFCOUNT_PTR_ADDR(block, index, maxrecs) \
+	((xfs_refcount_ptr_t *) \
+		((char *)(block) + \
+		 XFS_REFCOUNT_BLOCK_LEN + \
+		 (maxrecs) * sizeof(struct xfs_refcount_key) + \
+		 ((index) - 1) * sizeof(xfs_refcount_ptr_t)))
+
+extern struct xfs_btree_cur *xfs_refcountbt_init_cursor(struct xfs_mount *mp,
+		struct xfs_trans *tp, struct xfs_buf *agbp, xfs_agnumber_t agno,
+		struct xfs_defer_ops *dfops);
+extern int xfs_refcountbt_maxrecs(struct xfs_mount *mp, int blocklen,
+		bool leaf);
+extern void xfs_refcountbt_compute_maxlevels(struct xfs_mount *mp);
+
+#endif	/* __XFS_REFCOUNT_BTREE_H__ */
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index 4aecc5f..a70aec9 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -38,6 +38,8 @@
 #include "xfs_ialloc_btree.h"
 #include "xfs_log.h"
 #include "xfs_rmap_btree.h"
+#include "xfs_bmap.h"
+#include "xfs_refcount_btree.h"
 
 /*
  * Physical superblock buffer manipulations. Shared with libxfs in userspace.
@@ -737,6 +739,13 @@ xfs_sb_mount_common(
 	mp->m_rmap_mnr[0] = mp->m_rmap_mxr[0] / 2;
 	mp->m_rmap_mnr[1] = mp->m_rmap_mxr[1] / 2;
 
+	mp->m_refc_mxr[0] = xfs_refcountbt_maxrecs(mp, sbp->sb_blocksize,
+			true);
+	mp->m_refc_mxr[1] = xfs_refcountbt_maxrecs(mp, sbp->sb_blocksize,
+			false);
+	mp->m_refc_mnr[0] = mp->m_refc_mxr[0] / 2;
+	mp->m_refc_mnr[1] = mp->m_refc_mxr[1] / 2;
+
 	mp->m_bsize = XFS_FSB_TO_BB(mp, 1);
 	mp->m_ialloc_inos = (int)MAX((__uint16_t)XFS_INODES_PER_CHUNK,
 					sbp->sb_inopblock);
diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
index 0c5b30b..c6f4eb4 100644
--- a/fs/xfs/libxfs/xfs_shared.h
+++ b/fs/xfs/libxfs/xfs_shared.h
@@ -39,6 +39,7 @@ extern const struct xfs_buf_ops xfs_agf_buf_ops;
 extern const struct xfs_buf_ops xfs_agfl_buf_ops;
 extern const struct xfs_buf_ops xfs_allocbt_buf_ops;
 extern const struct xfs_buf_ops xfs_rmapbt_buf_ops;
+extern const struct xfs_buf_ops xfs_refcountbt_buf_ops;
 extern const struct xfs_buf_ops xfs_attr3_leaf_buf_ops;
 extern const struct xfs_buf_ops xfs_attr3_rmt_buf_ops;
 extern const struct xfs_buf_ops xfs_bmbt_buf_ops;
@@ -122,6 +123,7 @@ int	xfs_log_calc_minimum_size(struct xfs_mount *);
 #define	XFS_INO_REF		2
 #define	XFS_ATTR_BTREE_REF	1
 #define	XFS_DQUOT_REF		1
+#define	XFS_REFC_BTREE_REF	1
 
 /*
  * Flags for xfs_trans_ichgtime().
diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
index 301ef2f..7c840e1 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.c
+++ b/fs/xfs/libxfs/xfs_trans_resv.c
@@ -73,7 +73,7 @@ xfs_calc_buf_res(
  *
  * Keep in mind that max depth is calculated separately for each type of tree.
  */
-static uint
+uint
 xfs_allocfree_log_count(
 	struct xfs_mount *mp,
 	uint		num_ops)
diff --git a/fs/xfs/libxfs/xfs_trans_resv.h b/fs/xfs/libxfs/xfs_trans_resv.h
index 0eb46ed..36a1511 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.h
+++ b/fs/xfs/libxfs/xfs_trans_resv.h
@@ -102,5 +102,6 @@ struct xfs_trans_resv {
 #define	XFS_ATTRRM_LOG_COUNT		3
 
 void xfs_trans_resv_calc(struct xfs_mount *mp, struct xfs_trans_resv *resp);
+uint xfs_allocfree_log_count(struct xfs_mount *mp, uint num_ops);
 
 #endif	/* __XFS_TRANS_RESV_H__ */
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 56e85a6..3f64615 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -43,6 +43,7 @@
 #include "xfs_icache.h"
 #include "xfs_sysfs.h"
 #include "xfs_rmap_btree.h"
+#include "xfs_refcount_btree.h"
 
 
 static DEFINE_MUTEX(xfs_uuid_table_mutex);
@@ -684,6 +685,7 @@ xfs_mountfs(
 	xfs_bmap_compute_maxlevels(mp, XFS_ATTR_FORK);
 	xfs_ialloc_compute_maxlevels(mp);
 	xfs_rmapbt_compute_maxlevels(mp);
+	xfs_refcountbt_compute_maxlevels(mp);
 
 	xfs_set_maxicount(mp);
 
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 8fab496..0be14a7 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -124,10 +124,13 @@ typedef struct xfs_mount {
 	uint			m_inobt_mnr[2];	/* min inobt btree records */
 	uint			m_rmap_mxr[2];	/* max rmap btree records */
 	uint			m_rmap_mnr[2];	/* min rmap btree records */
+	uint			m_refc_mxr[2];	/* max refc btree records */
+	uint			m_refc_mnr[2];	/* min refc btree records */
 	uint			m_ag_maxlevels;	/* XFS_AG_MAXLEVELS */
 	uint			m_bm_maxlevels[2]; /* XFS_BM_MAXLEVELS */
 	uint			m_in_maxlevels;	/* max inobt btree levels. */
 	uint			m_rmap_maxlevels; /* max rmap btree levels */
+	uint			m_refc_maxlevels; /* max refcount btree level */
 	xfs_extlen_t		m_ag_prealloc_blocks; /* reserved ag blocks */
 	uint			m_alloc_set_aside; /* space we can't use */
 	uint			m_ag_max_usable; /* max space per AG */
diff --git a/fs/xfs/xfs_ondisk.h b/fs/xfs/xfs_ondisk.h
index 69e2986..0c381d7 100644
--- a/fs/xfs/xfs_ondisk.h
+++ b/fs/xfs/xfs_ondisk.h
@@ -49,6 +49,8 @@ xfs_check_ondisk_structs(void)
 	XFS_CHECK_STRUCT_SIZE(struct xfs_dsymlink_hdr,		56);
 	XFS_CHECK_STRUCT_SIZE(struct xfs_inobt_key,		4);
 	XFS_CHECK_STRUCT_SIZE(struct xfs_inobt_rec,		16);
+	XFS_CHECK_STRUCT_SIZE(struct xfs_refcount_key,		4);
+	XFS_CHECK_STRUCT_SIZE(struct xfs_refcount_rec,		12);
 	XFS_CHECK_STRUCT_SIZE(struct xfs_rmap_key,		20);
 	XFS_CHECK_STRUCT_SIZE(struct xfs_rmap_rec,		24);
 	XFS_CHECK_STRUCT_SIZE(struct xfs_timestamp,		8);
@@ -56,6 +58,7 @@ xfs_check_ondisk_structs(void)
 	XFS_CHECK_STRUCT_SIZE(xfs_alloc_ptr_t,			4);
 	XFS_CHECK_STRUCT_SIZE(xfs_alloc_rec_t,			8);
 	XFS_CHECK_STRUCT_SIZE(xfs_inobt_ptr_t,			4);
+	XFS_CHECK_STRUCT_SIZE(xfs_refcount_ptr_t,		4);
 	XFS_CHECK_STRUCT_SIZE(xfs_rmap_ptr_t,			4);
 
 	/* dir/attr trees */
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 8446338..c7b9853 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -39,16 +39,7 @@ struct xfs_buf_log_format;
 struct xfs_inode_log_format;
 struct xfs_bmbt_irec;
 struct xfs_btree_cur;
-
-#ifndef XFS_REFCOUNT_IREC_PLACEHOLDER
-#define XFS_REFCOUNT_IREC_PLACEHOLDER
-/* Placeholder definition to avoid breaking bisectability. */
-struct xfs_refcount_irec {
-	xfs_agblock_t	rc_startblock;	/* starting block number */
-	xfs_extlen_t	rc_blockcount;	/* count of free blocks */
-	xfs_nlink_t	rc_refcount;	/* number of inodes linked here */
-};
-#endif
+struct xfs_refcount_irec;
 
 DECLARE_EVENT_CLASS(xfs_attr_list_class,
 	TP_PROTO(struct xfs_attr_list_context *ctx),


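Given the on-disk sizes this patch defines (12-byte records, 4-byte keys and pointers, verified in xfs_ondisk.h), the per-block fanout computed by xfs_refcountbt_maxrecs() works out as in this sketch; the 56-byte header length is an assumption matching XFS_BTREE_SBLOCK_CRC_LEN:

```python
# Hedged sketch of xfs_refcountbt_maxrecs(): per-block fanout given
# the on-disk sizes defined in this patch.
XFS_REFCOUNT_BLOCK_LEN = 56   # assumed short-form btree header (CRC)
REC_SIZE = 12                 # struct xfs_refcount_rec: 3 x __be32
KEY_SIZE = 4                  # struct xfs_refcount_key: 1 x __be32
PTR_SIZE = 4                  # xfs_refcount_ptr_t: __be32

def refcountbt_maxrecs(blocklen, leaf):
    blocklen -= XFS_REFCOUNT_BLOCK_LEN
    if leaf:
        return blocklen // REC_SIZE           # whole records
    return blocklen // (KEY_SIZE + PTR_SIZE)  # key/pointer pairs
```

On a 4096-byte block this gives 336 leaf records and 505 key/pointer pairs per node; xfs_sb_mount_common() then sets the minimum counts to half of these maximums.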

* [PATCH 08/63] xfs: add refcount btree support to growfs
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (6 preceding siblings ...)
  2016-09-30  3:06 ` [PATCH 07/63] xfs: define the on-disk refcount btree format Darrick J. Wong
@ 2016-09-30  3:06 ` Darrick J. Wong
  2016-09-30  3:06 ` [PATCH 09/63] xfs: account for the refcount btree in the alloc/free log reservation Darrick J. Wong
                   ` (55 subsequent siblings)
  63 siblings, 0 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:06 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel, Christoph Hellwig

Modify the growfs code to initialize new refcount btree blocks.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_fsops.c |   39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)


diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 94ac06f..4b4059b 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -259,6 +259,12 @@ xfs_growfs_data_private(
 		agf->agf_longest = cpu_to_be32(tmpsize);
 		if (xfs_sb_version_hascrc(&mp->m_sb))
 			uuid_copy(&agf->agf_uuid, &mp->m_sb.sb_meta_uuid);
+		if (xfs_sb_version_hasreflink(&mp->m_sb)) {
+			agf->agf_refcount_root = cpu_to_be32(
+					xfs_refc_block(mp));
+			agf->agf_refcount_level = cpu_to_be32(1);
+			agf->agf_refcount_blocks = cpu_to_be32(1);
+		}
 
 		error = xfs_bwrite(bp);
 		xfs_buf_relse(bp);
@@ -450,6 +456,17 @@ xfs_growfs_data_private(
 			rrec->rm_offset = 0;
 			be16_add_cpu(&block->bb_numrecs, 1);
 
+			/* account for refc btree root */
+			if (xfs_sb_version_hasreflink(&mp->m_sb)) {
+				rrec = XFS_RMAP_REC_ADDR(block, 5);
+				rrec->rm_startblock = cpu_to_be32(
+						xfs_refc_block(mp));
+				rrec->rm_blockcount = cpu_to_be32(1);
+				rrec->rm_owner = cpu_to_be64(XFS_RMAP_OWN_REFC);
+				rrec->rm_offset = 0;
+				be16_add_cpu(&block->bb_numrecs, 1);
+			}
+
 			error = xfs_bwrite(bp);
 			xfs_buf_relse(bp);
 			if (error)
@@ -507,6 +524,28 @@ xfs_growfs_data_private(
 				goto error0;
 		}
 
+		/*
+		 * refcount btree root block
+		 */
+		if (xfs_sb_version_hasreflink(&mp->m_sb)) {
+			bp = xfs_growfs_get_hdr_buf(mp,
+				XFS_AGB_TO_DADDR(mp, agno, xfs_refc_block(mp)),
+				BTOBB(mp->m_sb.sb_blocksize), 0,
+				&xfs_refcountbt_buf_ops);
+			if (!bp) {
+				error = -ENOMEM;
+				goto error0;
+			}
+
+			xfs_btree_init_block(mp, bp, XFS_REFC_CRC_MAGIC,
+					     0, 0, agno,
+					     XFS_BTREE_CRC_BLOCKS);
+
+			error = xfs_bwrite(bp);
+			xfs_buf_relse(bp);
+			if (error)
+				goto error0;
+		}
 	}
 	xfs_trans_agblocks_delta(tp, nfree);
 	/*



* [PATCH 09/63] xfs: account for the refcount btree in the alloc/free log reservation
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (7 preceding siblings ...)
  2016-09-30  3:06 ` [PATCH 08/63] xfs: add refcount btree support to growfs Darrick J. Wong
@ 2016-09-30  3:06 ` Darrick J. Wong
  2016-09-30  3:06 ` [PATCH 10/63] xfs: add refcount btree operations Darrick J. Wong
                   ` (54 subsequent siblings)
  63 siblings, 0 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:06 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel, Christoph Hellwig

Every time we allocate or free a data extent, we might need to split
the refcount btree.  Reserve some blocks in the transaction to handle
this possibility.  Even though the deferred refcount code can roll the
transaction to avoid overloading a single transaction, we can still
exceed the reservation.

Certain pathological workloads (1k blocks, no cowextsize hint, random
directio writes), cause a perfect storm wherein a refcount adjustment
of a large range of blocks causes full tree splits in two separate
extents in two separate refcount tree blocks; allocating new refcount
tree blocks causes rmap btree splits; and all the allocation activity
causes the freespace btrees to split, blowing the reservation.

(Reproduced by generic/167 over NFS atop XFS)

Signed-off-by: Christoph Hellwig <hch@lst.de>
[darrick.wong@oracle.com: add commit message]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_trans_resv.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
index 7c840e1..a59838f 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.c
+++ b/fs/xfs/libxfs/xfs_trans_resv.c
@@ -67,7 +67,8 @@ xfs_calc_buf_res(
  * Per-extent log reservation for the btree changes involved in freeing or
  * allocating an extent.  In classic XFS there were two trees that will be
  * modified (bnobt + cntbt).  With rmap enabled, there are three trees
- * (rmapbt).  The number of blocks reserved is based on the formula:
+ * (rmapbt).  With reflink, there are four trees (refcountbt).  The number of
+ * blocks reserved is based on the formula:
  *
  * num trees * ((2 blocks/level * max depth) - 1)
  *
@@ -83,6 +84,8 @@ xfs_allocfree_log_count(
 	blocks = num_ops * 2 * (2 * mp->m_ag_maxlevels - 1);
 	if (xfs_sb_version_hasrmapbt(&mp->m_sb))
 		blocks += num_ops * (2 * mp->m_rmap_maxlevels - 1);
+	if (xfs_sb_version_hasreflink(&mp->m_sb))
+		blocks += num_ops * (2 * mp->m_refc_maxlevels - 1);
 
 	return blocks;
 }


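The reservation formula in the comment above, num trees * ((2 blocks/level * max depth) - 1), can be sketched as follows; the tree depths used in the example are illustrative, not taken from a real filesystem:

```python
# Hedged sketch of the per-extent alloc/free log reservation formula:
#   num trees * ((2 blocks/level * max depth) - 1)

def allocfree_log_count(num_ops, ag_maxlevels,
                        rmap_maxlevels=0, refc_maxlevels=0):
    # bnobt + cntbt always change, hence the leading factor of 2;
    # rmapbt and refcountbt terms apply only when those features exist
    blocks = num_ops * 2 * (2 * ag_maxlevels - 1)
    if rmap_maxlevels:
        blocks += num_ops * (2 * rmap_maxlevels - 1)
    if refc_maxlevels:
        blocks += num_ops * (2 * refc_maxlevels - 1)
    return blocks
```

For one operation with free-space btrees 3 levels deep, the classic reservation is 10 blocks; hypothetical rmapbt (5 levels) and refcountbt (3 levels) terms raise it to 19 and 24 blocks respectively.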

* [PATCH 10/63] xfs: add refcount btree operations
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (8 preceding siblings ...)
  2016-09-30  3:06 ` [PATCH 09/63] xfs: account for the refcount btree in the alloc/free log reservation Darrick J. Wong
@ 2016-09-30  3:06 ` Darrick J. Wong
  2016-09-30  3:06 ` [PATCH 11/63] xfs: create refcount update intent log items Darrick J. Wong
                   ` (53 subsequent siblings)
  63 siblings, 0 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:06 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel, Christoph Hellwig

Implement the generic btree operations required to manipulate refcount
btree blocks.  The implementation is similar to the bmapbt, though it
will only allocate and free blocks from the AG.

Since the refcount root and level fields are separate from the
existing roots and levels array, they need a separate logging flag.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
[hch: fix logging of AGF refcount btree fields]
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
v2: Remove init_rec_from_key since we no longer need it, and add
tracepoints when refcount btree operations fail.
---
 fs/xfs/Makefile                    |    1 
 fs/xfs/libxfs/xfs_alloc.c          |    3 +
 fs/xfs/libxfs/xfs_format.h         |   10 +-
 fs/xfs/libxfs/xfs_refcount.c       |  177 ++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_refcount.h       |   30 +++++
 fs/xfs/libxfs/xfs_refcount_btree.c |  211 ++++++++++++++++++++++++++++++++++++
 6 files changed, 430 insertions(+), 2 deletions(-)
 create mode 100644 fs/xfs/libxfs/xfs_refcount.c
 create mode 100644 fs/xfs/libxfs/xfs_refcount.h


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 8d749f2..98b2427 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -55,6 +55,7 @@ xfs-y				+= $(addprefix libxfs/, \
 				   xfs_ag_resv.o \
 				   xfs_rmap.o \
 				   xfs_rmap_btree.o \
+				   xfs_refcount.o \
 				   xfs_refcount_btree.o \
 				   xfs_sb.o \
 				   xfs_symlink_remote.o \
diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index aa0e1ca..be7e3fc 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2334,6 +2334,9 @@ xfs_alloc_log_agf(
 		offsetof(xfs_agf_t, agf_btreeblks),
 		offsetof(xfs_agf_t, agf_uuid),
 		offsetof(xfs_agf_t, agf_rmap_blocks),
+		offsetof(xfs_agf_t, agf_refcount_blocks),
+		offsetof(xfs_agf_t, agf_refcount_root),
+		offsetof(xfs_agf_t, agf_refcount_level),
 		/* needed so that we don't log the whole rest of the structure: */
 		offsetof(xfs_agf_t, agf_spare64),
 		sizeof(xfs_agf_t)
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 97c74f4..8b826102df 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -684,8 +684,11 @@ typedef struct xfs_agf {
 #define	XFS_AGF_BTREEBLKS	0x00000800
 #define	XFS_AGF_UUID		0x00001000
 #define	XFS_AGF_RMAP_BLOCKS	0x00002000
-#define	XFS_AGF_SPARE64		0x00004000
-#define	XFS_AGF_NUM_BITS	15
+#define	XFS_AGF_REFCOUNT_BLOCKS	0x00004000
+#define	XFS_AGF_REFCOUNT_ROOT	0x00008000
+#define	XFS_AGF_REFCOUNT_LEVEL	0x00010000
+#define	XFS_AGF_SPARE64		0x00020000
+#define	XFS_AGF_NUM_BITS	18
 #define	XFS_AGF_ALL_BITS	((1 << XFS_AGF_NUM_BITS) - 1)
 
 #define XFS_AGF_FLAGS \
@@ -703,6 +706,9 @@ typedef struct xfs_agf {
 	{ XFS_AGF_BTREEBLKS,	"BTREEBLKS" }, \
 	{ XFS_AGF_UUID,		"UUID" }, \
 	{ XFS_AGF_RMAP_BLOCKS,	"RMAP_BLOCKS" }, \
+	{ XFS_AGF_REFCOUNT_BLOCKS,	"REFCOUNT_BLOCKS" }, \
+	{ XFS_AGF_REFCOUNT_ROOT,	"REFCOUNT_ROOT" }, \
+	{ XFS_AGF_REFCOUNT_LEVEL,	"REFCOUNT_LEVEL" }, \
 	{ XFS_AGF_SPARE64,	"SPARE64" }
 
 /* disk block (xfs_daddr_t) in the AG */
diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
new file mode 100644
index 0000000..de13406
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_refcount.c
@@ -0,0 +1,177 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_sb.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bmap.h"
+#include "xfs_refcount_btree.h"
+#include "xfs_alloc.h"
+#include "xfs_error.h"
+#include "xfs_trace.h"
+#include "xfs_cksum.h"
+#include "xfs_trans.h"
+#include "xfs_bit.h"
+#include "xfs_refcount.h"
+
+/*
+ * Look up the first record less than or equal to bno in the btree
+ * given by cur.
+ */
+int
+xfs_refcount_lookup_le(
+	struct xfs_btree_cur	*cur,
+	xfs_agblock_t		bno,
+	int			*stat)
+{
+	trace_xfs_refcount_lookup(cur->bc_mp, cur->bc_private.a.agno, bno,
+			XFS_LOOKUP_LE);
+	cur->bc_rec.rc.rc_startblock = bno;
+	cur->bc_rec.rc.rc_blockcount = 0;
+	return xfs_btree_lookup(cur, XFS_LOOKUP_LE, stat);
+}
+
+/*
+ * Look up the first record greater than or equal to bno in the btree
+ * given by cur.
+ */
+int
+xfs_refcount_lookup_ge(
+	struct xfs_btree_cur	*cur,
+	xfs_agblock_t		bno,
+	int			*stat)
+{
+	trace_xfs_refcount_lookup(cur->bc_mp, cur->bc_private.a.agno, bno,
+			XFS_LOOKUP_GE);
+	cur->bc_rec.rc.rc_startblock = bno;
+	cur->bc_rec.rc.rc_blockcount = 0;
+	return xfs_btree_lookup(cur, XFS_LOOKUP_GE, stat);
+}
+
+/*
+ * Get the data from the pointed-to record.
+ */
+int
+xfs_refcount_get_rec(
+	struct xfs_btree_cur		*cur,
+	struct xfs_refcount_irec	*irec,
+	int				*stat)
+{
+	union xfs_btree_rec	*rec;
+	int			error;
+
+	error = xfs_btree_get_rec(cur, &rec, stat);
+	if (!error && *stat == 1) {
+		irec->rc_startblock = be32_to_cpu(rec->refc.rc_startblock);
+		irec->rc_blockcount = be32_to_cpu(rec->refc.rc_blockcount);
+		irec->rc_refcount = be32_to_cpu(rec->refc.rc_refcount);
+		trace_xfs_refcount_get(cur->bc_mp, cur->bc_private.a.agno,
+				irec);
+	}
+	return error;
+}
+
+/*
+ * Update the record referred to by cur to the value given
+ * by [bno, len, refcount].
+ * This either works (return 0) or gets an EFSCORRUPTED error.
+ */
+STATIC int
+xfs_refcount_update(
+	struct xfs_btree_cur		*cur,
+	struct xfs_refcount_irec	*irec)
+{
+	union xfs_btree_rec	rec;
+	int			error;
+
+	trace_xfs_refcount_update(cur->bc_mp, cur->bc_private.a.agno, irec);
+	rec.refc.rc_startblock = cpu_to_be32(irec->rc_startblock);
+	rec.refc.rc_blockcount = cpu_to_be32(irec->rc_blockcount);
+	rec.refc.rc_refcount = cpu_to_be32(irec->rc_refcount);
+	error = xfs_btree_update(cur, &rec);
+	if (error)
+		trace_xfs_refcount_update_error(cur->bc_mp,
+				cur->bc_private.a.agno, error, _RET_IP_);
+	return error;
+}
+
+/*
+ * Insert a new record with the value given by [bno, len, refcount]
+ * at the current cursor position.
+ * This either works (return 0) or gets an EFSCORRUPTED error.
+ */
+STATIC int
+xfs_refcount_insert(
+	struct xfs_btree_cur		*cur,
+	struct xfs_refcount_irec	*irec,
+	int				*i)
+{
+	int				error;
+
+	trace_xfs_refcount_insert(cur->bc_mp, cur->bc_private.a.agno, irec);
+	cur->bc_rec.rc.rc_startblock = irec->rc_startblock;
+	cur->bc_rec.rc.rc_blockcount = irec->rc_blockcount;
+	cur->bc_rec.rc.rc_refcount = irec->rc_refcount;
+	error = xfs_btree_insert(cur, i);
+	if (error)
+		goto out_error;
+	XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, *i == 1, out_error);
+out_error:
+	if (error)
+		trace_xfs_refcount_insert_error(cur->bc_mp,
+				cur->bc_private.a.agno, error, _RET_IP_);
+	return error;
+}
+
+/*
+ * Remove the record referred to by cur, then set the pointer to the spot
+ * where the record could be re-inserted, in case we want to increment or
+ * decrement the cursor.
+ * This either works (return 0) or gets an EFSCORRUPTED error.
+ */
+STATIC int
+xfs_refcount_delete(
+	struct xfs_btree_cur	*cur,
+	int			*i)
+{
+	struct xfs_refcount_irec	irec;
+	int			found_rec;
+	int			error;
+
+	error = xfs_refcount_get_rec(cur, &irec, &found_rec);
+	if (error)
+		goto out_error;
+	XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1, out_error);
+	trace_xfs_refcount_delete(cur->bc_mp, cur->bc_private.a.agno, &irec);
+	error = xfs_btree_delete(cur, i);
+	if (error)
+		goto out_error;
+	XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, *i == 1, out_error);
+	error = xfs_refcount_lookup_ge(cur, irec.rc_startblock, &found_rec);
+out_error:
+	if (error)
+		trace_xfs_refcount_delete_error(cur->bc_mp,
+				cur->bc_private.a.agno, error, _RET_IP_);
+	return error;
+}
diff --git a/fs/xfs/libxfs/xfs_refcount.h b/fs/xfs/libxfs/xfs_refcount.h
new file mode 100644
index 0000000..4dc335a
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_refcount.h
@@ -0,0 +1,30 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#ifndef __XFS_REFCOUNT_H__
+#define __XFS_REFCOUNT_H__
+
+extern int xfs_refcount_lookup_le(struct xfs_btree_cur *cur,
+		xfs_agblock_t bno, int *stat);
+extern int xfs_refcount_lookup_ge(struct xfs_btree_cur *cur,
+		xfs_agblock_t bno, int *stat);
+extern int xfs_refcount_get_rec(struct xfs_btree_cur *cur,
+		struct xfs_refcount_irec *irec, int *stat);
+
+#endif	/* __XFS_REFCOUNT_H__ */
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
index 359cf0c..81d58b0 100644
--- a/fs/xfs/libxfs/xfs_refcount_btree.c
+++ b/fs/xfs/libxfs/xfs_refcount_btree.c
@@ -34,6 +34,7 @@
 #include "xfs_cksum.h"
 #include "xfs_trans.h"
 #include "xfs_bit.h"
+#include "xfs_rmap.h"
 
 static struct xfs_btree_cur *
 xfs_refcountbt_dup_cursor(
@@ -44,6 +45,178 @@ xfs_refcountbt_dup_cursor(
 			cur->bc_private.a.dfops);
 }
 
+STATIC void
+xfs_refcountbt_set_root(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_ptr	*ptr,
+	int			inc)
+{
+	struct xfs_buf		*agbp = cur->bc_private.a.agbp;
+	struct xfs_agf		*agf = XFS_BUF_TO_AGF(agbp);
+	xfs_agnumber_t		seqno = be32_to_cpu(agf->agf_seqno);
+	struct xfs_perag	*pag = xfs_perag_get(cur->bc_mp, seqno);
+
+	ASSERT(ptr->s != 0);
+
+	agf->agf_refcount_root = ptr->s;
+	be32_add_cpu(&agf->agf_refcount_level, inc);
+	pag->pagf_refcount_level += inc;
+	xfs_perag_put(pag);
+
+	xfs_alloc_log_agf(cur->bc_tp, agbp,
+			XFS_AGF_REFCOUNT_ROOT | XFS_AGF_REFCOUNT_LEVEL);
+}
+
+STATIC int
+xfs_refcountbt_alloc_block(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_ptr	*start,
+	union xfs_btree_ptr	*new,
+	int			*stat)
+{
+	struct xfs_buf		*agbp = cur->bc_private.a.agbp;
+	struct xfs_agf		*agf = XFS_BUF_TO_AGF(agbp);
+	struct xfs_alloc_arg	args;		/* block allocation args */
+	int			error;		/* error return value */
+
+	memset(&args, 0, sizeof(args));
+	args.tp = cur->bc_tp;
+	args.mp = cur->bc_mp;
+	args.type = XFS_ALLOCTYPE_NEAR_BNO;
+	args.fsbno = XFS_AGB_TO_FSB(cur->bc_mp, cur->bc_private.a.agno,
+			xfs_refc_block(args.mp));
+	args.firstblock = args.fsbno;
+	xfs_rmap_ag_owner(&args.oinfo, XFS_RMAP_OWN_REFC);
+	args.minlen = args.maxlen = args.prod = 1;
+
+	error = xfs_alloc_vextent(&args);
+	if (error)
+		goto out_error;
+	trace_xfs_refcountbt_alloc_block(cur->bc_mp, cur->bc_private.a.agno,
+			args.agbno, 1);
+	if (args.fsbno == NULLFSBLOCK) {
+		XFS_BTREE_TRACE_CURSOR(cur, XBT_EXIT);
+		*stat = 0;
+		return 0;
+	}
+	ASSERT(args.agno == cur->bc_private.a.agno);
+	ASSERT(args.len == 1);
+
+	new->s = cpu_to_be32(args.agbno);
+	be32_add_cpu(&agf->agf_refcount_blocks, 1);
+	xfs_alloc_log_agf(cur->bc_tp, agbp, XFS_AGF_REFCOUNT_BLOCKS);
+
+	XFS_BTREE_TRACE_CURSOR(cur, XBT_EXIT);
+	*stat = 1;
+	return 0;
+
+out_error:
+	XFS_BTREE_TRACE_CURSOR(cur, XBT_ERROR);
+	return error;
+}
+
+STATIC int
+xfs_refcountbt_free_block(
+	struct xfs_btree_cur	*cur,
+	struct xfs_buf		*bp)
+{
+	struct xfs_mount	*mp = cur->bc_mp;
+	struct xfs_buf		*agbp = cur->bc_private.a.agbp;
+	struct xfs_agf		*agf = XFS_BUF_TO_AGF(agbp);
+	xfs_fsblock_t		fsbno = XFS_DADDR_TO_FSB(mp, XFS_BUF_ADDR(bp));
+	struct xfs_owner_info	oinfo;
+
+	trace_xfs_refcountbt_free_block(cur->bc_mp, cur->bc_private.a.agno,
+			XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno), 1);
+	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_REFC);
+	be32_add_cpu(&agf->agf_refcount_blocks, -1);
+	xfs_alloc_log_agf(cur->bc_tp, agbp, XFS_AGF_REFCOUNT_BLOCKS);
+	xfs_bmap_add_free(mp, cur->bc_private.a.dfops, fsbno, 1,
+			&oinfo);
+
+	return 0;
+}
+
+STATIC int
+xfs_refcountbt_get_minrecs(
+	struct xfs_btree_cur	*cur,
+	int			level)
+{
+	return cur->bc_mp->m_refc_mnr[level != 0];
+}
+
+STATIC int
+xfs_refcountbt_get_maxrecs(
+	struct xfs_btree_cur	*cur,
+	int			level)
+{
+	return cur->bc_mp->m_refc_mxr[level != 0];
+}
+
+STATIC void
+xfs_refcountbt_init_key_from_rec(
+	union xfs_btree_key	*key,
+	union xfs_btree_rec	*rec)
+{
+	key->refc.rc_startblock = rec->refc.rc_startblock;
+}
+
+STATIC void
+xfs_refcountbt_init_high_key_from_rec(
+	union xfs_btree_key	*key,
+	union xfs_btree_rec	*rec)
+{
+	__u32			x;
+
+	x = be32_to_cpu(rec->refc.rc_startblock);
+	x += be32_to_cpu(rec->refc.rc_blockcount) - 1;
+	key->refc.rc_startblock = cpu_to_be32(x);
+}
+
+STATIC void
+xfs_refcountbt_init_rec_from_cur(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_rec	*rec)
+{
+	rec->refc.rc_startblock = cpu_to_be32(cur->bc_rec.rc.rc_startblock);
+	rec->refc.rc_blockcount = cpu_to_be32(cur->bc_rec.rc.rc_blockcount);
+	rec->refc.rc_refcount = cpu_to_be32(cur->bc_rec.rc.rc_refcount);
+}
+
+STATIC void
+xfs_refcountbt_init_ptr_from_cur(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_ptr	*ptr)
+{
+	struct xfs_agf		*agf = XFS_BUF_TO_AGF(cur->bc_private.a.agbp);
+
+	ASSERT(cur->bc_private.a.agno == be32_to_cpu(agf->agf_seqno));
+	ASSERT(agf->agf_refcount_root != 0);
+
+	ptr->s = agf->agf_refcount_root;
+}
+
+STATIC __int64_t
+xfs_refcountbt_key_diff(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_key	*key)
+{
+	struct xfs_refcount_irec	*rec = &cur->bc_rec.rc;
+	struct xfs_refcount_key		*kp = &key->refc;
+
+	return (__int64_t)be32_to_cpu(kp->rc_startblock) - rec->rc_startblock;
+}
+
+STATIC __int64_t
+xfs_refcountbt_diff_two_keys(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_key	*k1,
+	union xfs_btree_key	*k2)
+{
+	return (__int64_t)be32_to_cpu(k1->refc.rc_startblock) -
+			  be32_to_cpu(k2->refc.rc_startblock);
+}
+
 STATIC bool
 xfs_refcountbt_verify(
 	struct xfs_buf		*bp)
@@ -106,12 +279,50 @@ const struct xfs_buf_ops xfs_refcountbt_buf_ops = {
 	.verify_write		= xfs_refcountbt_write_verify,
 };
 
+#if defined(DEBUG) || defined(XFS_WARN)
+STATIC int
+xfs_refcountbt_keys_inorder(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_key	*k1,
+	union xfs_btree_key	*k2)
+{
+	return be32_to_cpu(k1->refc.rc_startblock) <
+	       be32_to_cpu(k2->refc.rc_startblock);
+}
+
+STATIC int
+xfs_refcountbt_recs_inorder(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_rec	*r1,
+	union xfs_btree_rec	*r2)
+{
+	return  be32_to_cpu(r1->refc.rc_startblock) +
+		be32_to_cpu(r1->refc.rc_blockcount) <=
+		be32_to_cpu(r2->refc.rc_startblock);
+}
+#endif
+
 static const struct xfs_btree_ops xfs_refcountbt_ops = {
 	.rec_len		= sizeof(struct xfs_refcount_rec),
 	.key_len		= sizeof(struct xfs_refcount_key),
 
 	.dup_cursor		= xfs_refcountbt_dup_cursor,
+	.set_root		= xfs_refcountbt_set_root,
+	.alloc_block		= xfs_refcountbt_alloc_block,
+	.free_block		= xfs_refcountbt_free_block,
+	.get_minrecs		= xfs_refcountbt_get_minrecs,
+	.get_maxrecs		= xfs_refcountbt_get_maxrecs,
+	.init_key_from_rec	= xfs_refcountbt_init_key_from_rec,
+	.init_high_key_from_rec	= xfs_refcountbt_init_high_key_from_rec,
+	.init_rec_from_cur	= xfs_refcountbt_init_rec_from_cur,
+	.init_ptr_from_cur	= xfs_refcountbt_init_ptr_from_cur,
+	.key_diff		= xfs_refcountbt_key_diff,
 	.buf_ops		= &xfs_refcountbt_buf_ops,
+	.diff_two_keys		= xfs_refcountbt_diff_two_keys,
+#if defined(DEBUG) || defined(XFS_WARN)
+	.keys_inorder		= xfs_refcountbt_keys_inorder,
+	.recs_inorder		= xfs_refcountbt_recs_inorder,
+#endif
 };
 
 /*


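The ordering and key-comparison callbacks wired into xfs_refcountbt_ops above boil down to simple interval arithmetic on (startblock, blockcount, refcount) records. A standalone, host-endian sketch of that arithmetic (struct and helper names here are simplified stand-ins, not the kernel's):

```c
#include <stdint.h>

/* Host-endian stand-in for struct xfs_refcount_irec; a simplified model
 * of the comparison helpers in xfs_refcount_btree.c, for illustration. */
struct refc_irec {
	uint32_t rc_startblock;
	uint32_t rc_blockcount;
	uint32_t rc_refcount;
};

/* Mirrors xfs_refcountbt_key_diff: negative when the search key sorts
 * before the record, zero on an exact startblock match. */
static int64_t refc_key_diff(const struct refc_irec *rec, uint32_t key_start)
{
	return (int64_t)key_start - rec->rc_startblock;
}

/* Mirrors xfs_refcountbt_recs_inorder: adjacent records must not overlap,
 * i.e. r1's end (exclusive) must not pass r2's start. */
static int refc_recs_inorder(const struct refc_irec *r1,
			     const struct refc_irec *r2)
{
	return r1->rc_startblock + r1->rc_blockcount <= r2->rc_startblock;
}

/* Mirrors xfs_refcountbt_init_high_key_from_rec: the last block covered
 * by the record, used for overlapping-interval queries. */
static uint32_t refc_high_key(const struct refc_irec *rec)
{
	return rec->rc_startblock + rec->rc_blockcount - 1;
}
```

Because the key is just the startblock, key_diff and diff_two_keys reduce to a signed 64-bit subtraction of two 32-bit block numbers, which cannot overflow.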

* [PATCH 11/63] xfs: create refcount update intent log items
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (9 preceding siblings ...)
  2016-09-30  3:06 ` [PATCH 10/63] xfs: add refcount btree operations Darrick J. Wong
@ 2016-09-30  3:06 ` Darrick J. Wong
  2016-09-30  3:06 ` [PATCH 12/63] xfs: log refcount intent items Darrick J. Wong
                   ` (52 subsequent siblings)
  63 siblings, 0 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:06 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel, Christoph Hellwig

Create refcount update intent/done log items to record redo
information in the log.  Because we need to roll transactions between
updating the bmbt mapping and updating the refcount btree, we also
have to track the status of the metadata updates that will be recorded
in the post-roll transactions, just in case we crash before committing
the final transaction.  This mechanism enables log recovery to finish
what was already started.

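The crash-safety scheme described above hinges on the intent item holding two references: one for the committing transaction (dropped when the done item commits) and one for the log (dropped at unpin), with only the last holder freeing the item. A toy model of that lifecycle, loosely mirroring xfs_cui_init()/xfs_cui_release(); the names are illustrative only:

```c
/* Toy model of the CUI two-reference lifecycle; not kernel code. */
struct toy_cui {
	int refcount;	/* stands in for atomic_t cui_refcount */
	int freed;	/* set once the last reference is dropped */
};

/* Mirrors xfs_cui_init() taking an initial refcount of 2: one
 * reference for the creating transaction, one for the log. */
static void toy_cui_init(struct toy_cui *cuip)
{
	cuip->refcount = 2;
	cuip->freed = 0;
}

/* Mirrors xfs_cui_release(): only the final release frees the item,
 * so it does not matter whether CUD processing or log unpin runs
 * first -- whichever comes second performs the free. */
static void toy_cui_release(struct toy_cui *cuip)
{
	if (--cuip->refcount == 0)
		cuip->freed = 1;
}
```

In the real code the decrement is atomic_dec_and_test() and the free path also removes the item from the AIL; the refcounting is what makes the committed-vs-unpin ordering race harmless.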
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/Makefile                |    1 
 fs/xfs/libxfs/xfs_log_format.h |   58 ++++++
 fs/xfs/xfs_refcount_item.c     |  383 ++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_refcount_item.h     |  100 ++++++++++
 fs/xfs/xfs_super.c             |   18 ++
 5 files changed, 558 insertions(+), 2 deletions(-)
 create mode 100644 fs/xfs/xfs_refcount_item.c
 create mode 100644 fs/xfs/xfs_refcount_item.h


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 98b2427..d6429fd 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -106,6 +106,7 @@ xfs-y				+= xfs_log.o \
 				   xfs_extfree_item.o \
 				   xfs_icreate_item.o \
 				   xfs_inode_item.o \
+				   xfs_refcount_item.o \
 				   xfs_rmap_item.o \
 				   xfs_log_recover.o \
 				   xfs_trans_ail.o \
diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
index fc5eef8..3659f04 100644
--- a/fs/xfs/libxfs/xfs_log_format.h
+++ b/fs/xfs/libxfs/xfs_log_format.h
@@ -112,7 +112,9 @@ static inline uint xlog_get_cycle(char *ptr)
 #define XLOG_REG_TYPE_ICREATE		20
 #define XLOG_REG_TYPE_RUI_FORMAT	21
 #define XLOG_REG_TYPE_RUD_FORMAT	22
-#define XLOG_REG_TYPE_MAX		22
+#define XLOG_REG_TYPE_CUI_FORMAT	23
+#define XLOG_REG_TYPE_CUD_FORMAT	24
+#define XLOG_REG_TYPE_MAX		24
 
 /*
  * Flags to log operation header
@@ -231,6 +233,8 @@ typedef struct xfs_trans_header {
 #define	XFS_LI_ICREATE		0x123f
 #define	XFS_LI_RUI		0x1240	/* rmap update intent */
 #define	XFS_LI_RUD		0x1241
+#define	XFS_LI_CUI		0x1242	/* refcount update intent */
+#define	XFS_LI_CUD		0x1243
 
 #define XFS_LI_TYPE_DESC \
 	{ XFS_LI_EFI,		"XFS_LI_EFI" }, \
@@ -242,7 +246,9 @@ typedef struct xfs_trans_header {
 	{ XFS_LI_QUOTAOFF,	"XFS_LI_QUOTAOFF" }, \
 	{ XFS_LI_ICREATE,	"XFS_LI_ICREATE" }, \
 	{ XFS_LI_RUI,		"XFS_LI_RUI" }, \
-	{ XFS_LI_RUD,		"XFS_LI_RUD" }
+	{ XFS_LI_RUD,		"XFS_LI_RUD" }, \
+	{ XFS_LI_CUI,		"XFS_LI_CUI" }, \
+	{ XFS_LI_CUD,		"XFS_LI_CUD" }
 
 /*
  * Inode Log Item Format definitions.
@@ -671,6 +677,54 @@ struct xfs_rud_log_format {
 };
 
 /*
+ * CUI/CUD (refcount update) log format definitions
+ */
+struct xfs_phys_extent {
+	__uint64_t		pe_startblock;
+	__uint32_t		pe_len;
+	__uint32_t		pe_flags;
+};
+
+/* refcount pe_flags: upper bits are flags, lower byte is type code */
+/* Type codes are taken directly from enum xfs_refcount_intent_type. */
+#define XFS_REFCOUNT_EXTENT_TYPE_MASK	0xFF
+
+#define XFS_REFCOUNT_EXTENT_FLAGS	(XFS_REFCOUNT_EXTENT_TYPE_MASK)
+
+/*
+ * This is the structure used to lay out a cui log item in the
+ * log.  The cui_extents field is a variable size array whose
+ * size is given by cui_nextents.
+ */
+struct xfs_cui_log_format {
+	__uint16_t		cui_type;	/* cui log item type */
+	__uint16_t		cui_size;	/* size of this item */
+	__uint32_t		cui_nextents;	/* # extents to update */
+	__uint64_t		cui_id;		/* cui identifier */
+	struct xfs_phys_extent	cui_extents[];	/* array of extents */
+};
+
+static inline size_t
+xfs_cui_log_format_sizeof(
+	unsigned int		nr)
+{
+	return sizeof(struct xfs_cui_log_format) +
+			nr * sizeof(struct xfs_phys_extent);
+}
+
+/*
+ * This is the structure used to lay out a cud log item in the
+ * log.  Unlike the cui, the cud carries no extent array; it only
+ * records the id of the cui whose work it completes.
+ */
+struct xfs_cud_log_format {
+	__uint16_t		cud_type;	/* cud log item type */
+	__uint16_t		cud_size;	/* size of this item */
+	__uint32_t		__pad;
+	__uint64_t		cud_cui_id;	/* id of corresponding cui */
+};
+
+/*
  * Dquot Log format definitions.
  *
  * The first two fields must be the type and size fitting into
diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
new file mode 100644
index 0000000..f9ad055
--- /dev/null
+++ b/fs/xfs/xfs_refcount_item.c
@@ -0,0 +1,383 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_trans.h"
+#include "xfs_trans_priv.h"
+#include "xfs_buf_item.h"
+#include "xfs_refcount_item.h"
+#include "xfs_log.h"
+
+
+kmem_zone_t	*xfs_cui_zone;
+kmem_zone_t	*xfs_cud_zone;
+
+static inline struct xfs_cui_log_item *CUI_ITEM(struct xfs_log_item *lip)
+{
+	return container_of(lip, struct xfs_cui_log_item, cui_item);
+}
+
+void
+xfs_cui_item_free(
+	struct xfs_cui_log_item	*cuip)
+{
+	if (cuip->cui_format.cui_nextents > XFS_CUI_MAX_FAST_EXTENTS)
+		kmem_free(cuip);
+	else
+		kmem_zone_free(xfs_cui_zone, cuip);
+}
+
+STATIC void
+xfs_cui_item_size(
+	struct xfs_log_item	*lip,
+	int			*nvecs,
+	int			*nbytes)
+{
+	struct xfs_cui_log_item	*cuip = CUI_ITEM(lip);
+
+	*nvecs += 1;
+	*nbytes += xfs_cui_log_format_sizeof(cuip->cui_format.cui_nextents);
+}
+
+/*
+ * This is called to fill in the vector of log iovecs for the
+ * given cui log item. We use only 1 iovec, and we point that
+ * at the cui_log_format structure embedded in the cui item.
+ * It is at this point that we assert that all of the extent
+ * slots in the cui item have been filled.
+ */
+STATIC void
+xfs_cui_item_format(
+	struct xfs_log_item	*lip,
+	struct xfs_log_vec	*lv)
+{
+	struct xfs_cui_log_item	*cuip = CUI_ITEM(lip);
+	struct xfs_log_iovec	*vecp = NULL;
+
+	ASSERT(atomic_read(&cuip->cui_next_extent) ==
+			cuip->cui_format.cui_nextents);
+
+	cuip->cui_format.cui_type = XFS_LI_CUI;
+	cuip->cui_format.cui_size = 1;
+
+	xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_CUI_FORMAT, &cuip->cui_format,
+			xfs_cui_log_format_sizeof(cuip->cui_format.cui_nextents));
+}
+
+/*
+ * Pinning has no meaning for a cui item, so just return.
+ */
+STATIC void
+xfs_cui_item_pin(
+	struct xfs_log_item	*lip)
+{
+}
+
+/*
+ * The unpin operation is the last place a CUI is manipulated in the log. It is
+ * either inserted in the AIL or aborted in the event of a log I/O error. In
+ * either case, the CUI transaction has been successfully committed to make it
+ * this far. Therefore, we expect whoever committed the CUI to either construct
+ * and commit the CUD or drop the CUD's reference in the event of error. Simply
+ * drop the log's CUI reference now that the log is done with it.
+ */
+STATIC void
+xfs_cui_item_unpin(
+	struct xfs_log_item	*lip,
+	int			remove)
+{
+	struct xfs_cui_log_item	*cuip = CUI_ITEM(lip);
+
+	xfs_cui_release(cuip);
+}
+
+/*
+ * CUI items have no locking or pushing.  However, since CUIs are pulled from
+ * the AIL when their corresponding CUDs are committed to disk, their situation
+ * is very similar to being pinned.  Return XFS_ITEM_PINNED so that the caller
+ * will eventually flush the log.  This should help in getting the CUI out of
+ * the AIL.
+ */
+STATIC uint
+xfs_cui_item_push(
+	struct xfs_log_item	*lip,
+	struct list_head	*buffer_list)
+{
+	return XFS_ITEM_PINNED;
+}
+
+/*
+ * The CUI has been either committed or aborted if the transaction has been
+ * cancelled. If the transaction was cancelled, a CUD isn't going to be
+ * constructed and thus we free the CUI here directly.
+ */
+STATIC void
+xfs_cui_item_unlock(
+	struct xfs_log_item	*lip)
+{
+	if (lip->li_flags & XFS_LI_ABORTED)
+		xfs_cui_item_free(CUI_ITEM(lip));
+}
+
+/*
+ * The CUI is logged only once and cannot be moved in the log, so simply return
+ * the lsn at which it's been logged.
+ */
+STATIC xfs_lsn_t
+xfs_cui_item_committed(
+	struct xfs_log_item	*lip,
+	xfs_lsn_t		lsn)
+{
+	return lsn;
+}
+
+/*
+ * The CUI dependency tracking op doesn't do squat.  It can't because
+ * it doesn't know which extents a refcount update touches.  The
+ * dependency tracking has to be handled by the "enclosing" metadata
+ * object.  For example, for inodes, the inode is locked throughout the
+ * refcount update, so the dependency should be recorded there.
+ */
+STATIC void
+xfs_cui_item_committing(
+	struct xfs_log_item	*lip,
+	xfs_lsn_t		lsn)
+{
+}
+
+/*
+ * This is the ops vector shared by all cui log items.
+ */
+static const struct xfs_item_ops xfs_cui_item_ops = {
+	.iop_size	= xfs_cui_item_size,
+	.iop_format	= xfs_cui_item_format,
+	.iop_pin	= xfs_cui_item_pin,
+	.iop_unpin	= xfs_cui_item_unpin,
+	.iop_unlock	= xfs_cui_item_unlock,
+	.iop_committed	= xfs_cui_item_committed,
+	.iop_push	= xfs_cui_item_push,
+	.iop_committing = xfs_cui_item_committing,
+};
+
+/*
+ * Allocate and initialize a cui item with the given number of extents.
+ */
+struct xfs_cui_log_item *
+xfs_cui_init(
+	struct xfs_mount		*mp,
+	uint				nextents)
+
+{
+	struct xfs_cui_log_item		*cuip;
+
+	ASSERT(nextents > 0);
+	if (nextents > XFS_CUI_MAX_FAST_EXTENTS)
+		cuip = kmem_zalloc(xfs_cui_log_item_sizeof(nextents),
+				KM_SLEEP);
+	else
+		cuip = kmem_zone_zalloc(xfs_cui_zone, KM_SLEEP);
+
+	xfs_log_item_init(mp, &cuip->cui_item, XFS_LI_CUI, &xfs_cui_item_ops);
+	cuip->cui_format.cui_nextents = nextents;
+	cuip->cui_format.cui_id = (uintptr_t)(void *)cuip;
+	atomic_set(&cuip->cui_next_extent, 0);
+	atomic_set(&cuip->cui_refcount, 2);
+
+	return cuip;
+}
+
+/*
+ * Freeing the CUI requires that we remove it from the AIL if it has already
+ * been placed there. However, the CUI may not yet have been placed in the AIL
+ * when called by xfs_cui_release() from CUD processing due to the ordering of
+ * committed vs unpin operations in bulk insert operations. Hence the reference
+ * count to ensure only the last caller frees the CUI.
+ */
+void
+xfs_cui_release(
+	struct xfs_cui_log_item	*cuip)
+{
+	if (atomic_dec_and_test(&cuip->cui_refcount)) {
+		xfs_trans_ail_remove(&cuip->cui_item, SHUTDOWN_LOG_IO_ERROR);
+		xfs_cui_item_free(cuip);
+	}
+}
+
+static inline struct xfs_cud_log_item *CUD_ITEM(struct xfs_log_item *lip)
+{
+	return container_of(lip, struct xfs_cud_log_item, cud_item);
+}
+
+STATIC void
+xfs_cud_item_size(
+	struct xfs_log_item	*lip,
+	int			*nvecs,
+	int			*nbytes)
+{
+	*nvecs += 1;
+	*nbytes += sizeof(struct xfs_cud_log_format);
+}
+
+/*
+ * This is called to fill in the vector of log iovecs for the
+ * given cud log item. We use only 1 iovec, and we point that
+ * at the cud_log_format structure embedded in the cud item.
+ * Unlike the cui, the cud carries no extent array, so there is
+ * nothing further to fill in here.
+ */
+STATIC void
+xfs_cud_item_format(
+	struct xfs_log_item	*lip,
+	struct xfs_log_vec	*lv)
+{
+	struct xfs_cud_log_item	*cudp = CUD_ITEM(lip);
+	struct xfs_log_iovec	*vecp = NULL;
+
+	cudp->cud_format.cud_type = XFS_LI_CUD;
+	cudp->cud_format.cud_size = 1;
+
+	xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_CUD_FORMAT, &cudp->cud_format,
+			sizeof(struct xfs_cud_log_format));
+}
+
+/*
+ * Pinning has no meaning for a cud item, so just return.
+ */
+STATIC void
+xfs_cud_item_pin(
+	struct xfs_log_item	*lip)
+{
+}
+
+/*
+ * Since pinning has no meaning for a cud item, unpinning does
+ * not either.
+ */
+STATIC void
+xfs_cud_item_unpin(
+	struct xfs_log_item	*lip,
+	int			remove)
+{
+}
+
+/*
+ * There isn't much you can do to push on a cud item.  It is simply stuck
+ * waiting for the log to be flushed to disk.
+ */
+STATIC uint
+xfs_cud_item_push(
+	struct xfs_log_item	*lip,
+	struct list_head	*buffer_list)
+{
+	return XFS_ITEM_PINNED;
+}
+
+/*
+ * The CUD is either committed or aborted if the transaction is cancelled. If
+ * the transaction is cancelled, drop our reference to the CUI and free the
+ * CUD.
+ */
+STATIC void
+xfs_cud_item_unlock(
+	struct xfs_log_item	*lip)
+{
+	struct xfs_cud_log_item	*cudp = CUD_ITEM(lip);
+
+	if (lip->li_flags & XFS_LI_ABORTED) {
+		xfs_cui_release(cudp->cud_cuip);
+		kmem_zone_free(xfs_cud_zone, cudp);
+	}
+}
+
+/*
+ * When the cud item is committed to disk, all we need to do is delete our
+ * reference to our partner cui item and then free ourselves. Since we're
+ * freeing ourselves we must return -1 to keep the transaction code from
+ * further referencing this item.
+ */
+STATIC xfs_lsn_t
+xfs_cud_item_committed(
+	struct xfs_log_item	*lip,
+	xfs_lsn_t		lsn)
+{
+	struct xfs_cud_log_item	*cudp = CUD_ITEM(lip);
+
+	/*
+	 * Drop the CUI reference regardless of whether the CUD has been
+	 * aborted. Once the CUD transaction is constructed, it is the sole
+	 * responsibility of the CUD to release the CUI (even if the CUI is
+	 * aborted due to log I/O error).
+	 */
+	xfs_cui_release(cudp->cud_cuip);
+	kmem_zone_free(xfs_cud_zone, cudp);
+
+	return (xfs_lsn_t)-1;
+}
+
+/*
+ * The CUD dependency tracking op doesn't do squat.  It can't because
+ * it doesn't know which extents a refcount update touches.  The
+ * dependency tracking has to be handled by the "enclosing" metadata
+ * object.  For example, for inodes, the inode is locked throughout the
+ * refcount update, so the dependency should be recorded there.
+ */
+STATIC void
+xfs_cud_item_committing(
+	struct xfs_log_item	*lip,
+	xfs_lsn_t		lsn)
+{
+}
+
+/*
+ * This is the ops vector shared by all cud log items.
+ */
+static const struct xfs_item_ops xfs_cud_item_ops = {
+	.iop_size	= xfs_cud_item_size,
+	.iop_format	= xfs_cud_item_format,
+	.iop_pin	= xfs_cud_item_pin,
+	.iop_unpin	= xfs_cud_item_unpin,
+	.iop_unlock	= xfs_cud_item_unlock,
+	.iop_committed	= xfs_cud_item_committed,
+	.iop_push	= xfs_cud_item_push,
+	.iop_committing = xfs_cud_item_committing,
+};
+
+/*
+ * Allocate and initialize a cud item paired with the given cui item.
+ */
+struct xfs_cud_log_item *
+xfs_cud_init(
+	struct xfs_mount		*mp,
+	struct xfs_cui_log_item		*cuip)
+
+{
+	struct xfs_cud_log_item	*cudp;
+
+	cudp = kmem_zone_zalloc(xfs_cud_zone, KM_SLEEP);
+	xfs_log_item_init(mp, &cudp->cud_item, XFS_LI_CUD, &xfs_cud_item_ops);
+	cudp->cud_cuip = cuip;
+	cudp->cud_format.cud_cui_id = cuip->cui_format.cui_id;
+
+	return cudp;
+}
diff --git a/fs/xfs/xfs_refcount_item.h b/fs/xfs/xfs_refcount_item.h
new file mode 100644
index 0000000..34b6f7a
--- /dev/null
+++ b/fs/xfs/xfs_refcount_item.h
@@ -0,0 +1,100 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#ifndef	__XFS_REFCOUNT_ITEM_H__
+#define	__XFS_REFCOUNT_ITEM_H__
+
+/*
+ * There are (currently) two pairs of refcount btree redo item types:
+ * increase and decrease.  The log items for these are CUI (refcount
+ * update intent) and CUD (refcount update done).  The redo item type
+ * is encoded in the flags field of each xfs_phys_extent.
+ *
+ * *I items should be recorded in the *first* of a series of rolled
+ * transactions, and the *D items should be recorded in the same
+ * transaction that records the associated refcountbt updates.
+ *
+ * Should the system crash after the commit of the first transaction
+ * but before the commit of the final transaction in a series, log
+ * recovery will use the redo information recorded by the intent items
+ * to replay the refcountbt metadata updates.
+ */
+
+/* kernel only CUI/CUD definitions */
+
+struct xfs_mount;
+struct kmem_zone;
+
+/*
+ * Max number of extents in fast allocation path.
+ */
+#define	XFS_CUI_MAX_FAST_EXTENTS	16
+
+/*
+ * Define CUI flag bits. Manipulated by set/clear/test_bit operators.
+ */
+#define	XFS_CUI_RECOVERED		1
+
+/*
+ * This is the "refcount update intent" log item.  It is used to log
+ * the fact that some refcount updates need to be processed.  It is used in
+ * conjunction with the "refcount update done" log item described
+ * below.
+ *
+ * These log items follow the same rules as struct xfs_efi_log_item;
+ * see the comments about that structure (in xfs_extfree_item.h) for
+ * more details.
+ */
+struct xfs_cui_log_item {
+	struct xfs_log_item		cui_item;
+	atomic_t			cui_refcount;
+	atomic_t			cui_next_extent;
+	unsigned long			cui_flags;	/* misc flags */
+	struct xfs_cui_log_format	cui_format;
+};
+
+static inline size_t
+xfs_cui_log_item_sizeof(
+	unsigned int		nr)
+{
+	return offsetof(struct xfs_cui_log_item, cui_format) +
+			xfs_cui_log_format_sizeof(nr);
+}
+
+/*
+ * This is the "refcount update done" log item.  It is used to log the
+ * fact that some refcountbt updates mentioned in an earlier cui item
+ * have been performed.
+ */
+struct xfs_cud_log_item {
+	struct xfs_log_item		cud_item;
+	struct xfs_cui_log_item		*cud_cuip;
+	struct xfs_cud_log_format	cud_format;
+};
+
+extern struct kmem_zone	*xfs_cui_zone;
+extern struct kmem_zone	*xfs_cud_zone;
+
+struct xfs_cui_log_item *xfs_cui_init(struct xfs_mount *, uint);
+struct xfs_cud_log_item *xfs_cud_init(struct xfs_mount *,
+		struct xfs_cui_log_item *);
+void xfs_cui_item_free(struct xfs_cui_log_item *);
+void xfs_cui_release(struct xfs_cui_log_item *);
+
+#endif	/* __XFS_REFCOUNT_ITEM_H__ */
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 2d092f9..abe69c6 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -47,6 +47,7 @@
 #include "xfs_sysfs.h"
 #include "xfs_ondisk.h"
 #include "xfs_rmap_item.h"
+#include "xfs_refcount_item.h"
 
 #include <linux/namei.h>
 #include <linux/init.h>
@@ -1788,8 +1789,23 @@ xfs_init_zones(void)
 	if (!xfs_rui_zone)
 		goto out_destroy_rud_zone;
 
+	xfs_cud_zone = kmem_zone_init(sizeof(struct xfs_cud_log_item),
+			"xfs_cud_item");
+	if (!xfs_cud_zone)
+		goto out_destroy_rui_zone;
+
+	xfs_cui_zone = kmem_zone_init(
+			xfs_cui_log_item_sizeof(XFS_CUI_MAX_FAST_EXTENTS),
+			"xfs_cui_item");
+	if (!xfs_cui_zone)
+		goto out_destroy_cud_zone;
+
 	return 0;
 
+ out_destroy_cud_zone:
+	kmem_zone_destroy(xfs_cud_zone);
+ out_destroy_rui_zone:
+	kmem_zone_destroy(xfs_rui_zone);
  out_destroy_rud_zone:
 	kmem_zone_destroy(xfs_rud_zone);
  out_destroy_icreate_zone:
@@ -1832,6 +1848,8 @@ xfs_destroy_zones(void)
 	 * destroy caches.
 	 */
 	rcu_barrier();
+	kmem_zone_destroy(xfs_cui_zone);
+	kmem_zone_destroy(xfs_cud_zone);
 	kmem_zone_destroy(xfs_rui_zone);
 	kmem_zone_destroy(xfs_rud_zone);
 	kmem_zone_destroy(xfs_icreate_zone);


^ permalink raw reply related	[flat|nested] 187+ messages in thread

* [PATCH 12/63] xfs: log refcount intent items
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (10 preceding siblings ...)
  2016-09-30  3:06 ` [PATCH 11/63] xfs: create refcount update intent log items Darrick J. Wong
@ 2016-09-30  3:06 ` Darrick J. Wong
  2016-09-30  3:06 ` [PATCH 13/63] xfs: adjust refcount of an extent of blocks in refcount btree Darrick J. Wong
                   ` (51 subsequent siblings)
  63 siblings, 0 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:06 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel, Christoph Hellwig

Provide a mechanism for higher levels to create CUI/CUD items and
submit them to the log, plus a stub function to deal with recovered CUI items.
These parts will be connected to the refcountbt in a later patch.
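
The intent/done pairing rule can be modelled outside the kernel.  The
sketch below is purely illustrative (plain arrays instead of the XFS
AIL, and every name in it is invented for the example): an intent whose
id never shows up in a "done" record is one that recovery must replay,
which is the contract xlog_recover_cui_pass2/xlog_recover_cud_pass2
implement against the AIL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Invented stand-in for a logged intent item. */
struct intent {
	unsigned long long	id;	/* plays the role of cui_id/cud_cui_id */
	bool			done;	/* a matching "done" record was seen */
};

/*
 * A "done" record cancels the intent with the matching id, much as
 * finding a CUD during recovery releases the corresponding CUI.
 */
static void recover_done(struct intent *items, size_t n,
			 unsigned long long done_id)
{
	for (size_t i = 0; i < n; i++)
		if (items[i].id == done_id) {
			items[i].done = true;
			break;
		}
}

/* Intents left uncancelled are the ones recovery must replay. */
static size_t intents_to_replay(const struct intent *items, size_t n)
{
	size_t replay = 0;

	for (size_t i = 0; i < n; i++)
		if (!items[i].done)
			replay++;
	return replay;
}
```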

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/Makefile              |    1 
 fs/xfs/libxfs/xfs_refcount.h |   15 ++++
 fs/xfs/xfs_log_recover.c     |  175 ++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_refcount_item.c   |   60 ++++++++++++++
 fs/xfs/xfs_refcount_item.h   |    1 
 fs/xfs/xfs_trace.h           |   30 +++++++
 fs/xfs/xfs_trans.h           |   11 +++
 fs/xfs/xfs_trans_refcount.c  |   80 +++++++++++++++++++
 8 files changed, 373 insertions(+)
 create mode 100644 fs/xfs/xfs_trans_refcount.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index d6429fd..6a9ea9e 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -113,6 +113,7 @@ xfs-y				+= xfs_log.o \
 				   xfs_trans_buf.o \
 				   xfs_trans_extfree.o \
 				   xfs_trans_inode.o \
+				   xfs_trans_refcount.o \
 				   xfs_trans_rmap.o \
 
 # optional features
diff --git a/fs/xfs/libxfs/xfs_refcount.h b/fs/xfs/libxfs/xfs_refcount.h
index 4dc335a..d362f0b 100644
--- a/fs/xfs/libxfs/xfs_refcount.h
+++ b/fs/xfs/libxfs/xfs_refcount.h
@@ -27,4 +27,19 @@ extern int xfs_refcount_lookup_ge(struct xfs_btree_cur *cur,
 extern int xfs_refcount_get_rec(struct xfs_btree_cur *cur,
 		struct xfs_refcount_irec *irec, int *stat);
 
+/* The XFS_REFCOUNT_EXTENT_* in xfs_log_format.h must match these. */
+enum xfs_refcount_intent_type {
+	XFS_REFCOUNT_INCREASE = 1,
+	XFS_REFCOUNT_DECREASE,
+	XFS_REFCOUNT_ALLOC_COW,
+	XFS_REFCOUNT_FREE_COW,
+};
+
+struct xfs_refcount_intent {
+	struct list_head			ri_list;
+	enum xfs_refcount_intent_type		ri_type;
+	xfs_fsblock_t				ri_startblock;
+	xfs_extlen_t				ri_blockcount;
+};
+
 #endif	/* __XFS_REFCOUNT_H__ */
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 846483d..7def672 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -45,6 +45,7 @@
 #include "xfs_dir2.h"
 #include "xfs_rmap_item.h"
 #include "xfs_buf_item.h"
+#include "xfs_refcount_item.h"
 
 #define BLK_AVG(blk1, blk2)	((blk1+blk2) >> 1)
 
@@ -1924,6 +1925,8 @@ xlog_recover_reorder_trans(
 		case XFS_LI_EFI:
 		case XFS_LI_RUI:
 		case XFS_LI_RUD:
+		case XFS_LI_CUI:
+		case XFS_LI_CUD:
 			trace_xfs_log_recover_item_reorder_tail(log,
 							trans, item, pass);
 			list_move_tail(&item->ri_list, &inode_list);
@@ -3547,6 +3550,123 @@ xlog_recover_rud_pass2(
 }
 
 /*
+ * Copy a CUI format buffer from the given buf into the destination
+ * CUI format structure.  The CUI/CUD items were designed not to need any
+ * special alignment handling.
+ */
+static int
+xfs_cui_copy_format(
+	struct xfs_log_iovec		*buf,
+	struct xfs_cui_log_format	*dst_cui_fmt)
+{
+	struct xfs_cui_log_format	*src_cui_fmt;
+	uint				len;
+
+	src_cui_fmt = buf->i_addr;
+	len = xfs_cui_log_format_sizeof(src_cui_fmt->cui_nextents);
+
+	if (buf->i_len == len) {
+		memcpy(dst_cui_fmt, src_cui_fmt, len);
+		return 0;
+	}
+	return -EFSCORRUPTED;
+}
+
+/*
+ * This routine is called to create an in-core extent refcount update
+ * item from the cui format structure which was logged on disk.
+ * It allocates an in-core cui, copies the extents from the format
+ * structure into it, and adds the cui to the AIL with the given
+ * LSN.
+ */
+STATIC int
+xlog_recover_cui_pass2(
+	struct xlog			*log,
+	struct xlog_recover_item	*item,
+	xfs_lsn_t			lsn)
+{
+	int				error;
+	struct xfs_mount		*mp = log->l_mp;
+	struct xfs_cui_log_item		*cuip;
+	struct xfs_cui_log_format	*cui_formatp;
+
+	cui_formatp = item->ri_buf[0].i_addr;
+
+	cuip = xfs_cui_init(mp, cui_formatp->cui_nextents);
+	error = xfs_cui_copy_format(&item->ri_buf[0], &cuip->cui_format);
+	if (error) {
+		xfs_cui_item_free(cuip);
+		return error;
+	}
+	atomic_set(&cuip->cui_next_extent, cui_formatp->cui_nextents);
+
+	spin_lock(&log->l_ailp->xa_lock);
+	/*
+	 * The CUI has two references. One for the CUD and one for the CUI
+	 * to ensure it makes it into the AIL. Insert the CUI into the AIL
+	 * directly and drop the CUI reference. Note that
+	 * xfs_trans_ail_update() drops the AIL lock.
+	 */
+	xfs_trans_ail_update(log->l_ailp, &cuip->cui_item, lsn);
+	xfs_cui_release(cuip);
+	return 0;
+}
+
+
+/*
+ * This routine is called when a CUD format structure is found in a committed
+ * transaction in the log. Its purpose is to cancel the corresponding CUI if it
+ * was still in the log. To do this it searches the AIL for the CUI with an id
+ * equal to that in the CUD format structure. If we find it we drop the CUD
+ * reference, which removes the CUI from the AIL and frees it.
+ */
+STATIC int
+xlog_recover_cud_pass2(
+	struct xlog			*log,
+	struct xlog_recover_item	*item)
+{
+	struct xfs_cud_log_format	*cud_formatp;
+	struct xfs_cui_log_item		*cuip = NULL;
+	struct xfs_log_item		*lip;
+	__uint64_t			cui_id;
+	struct xfs_ail_cursor		cur;
+	struct xfs_ail			*ailp = log->l_ailp;
+
+	cud_formatp = item->ri_buf[0].i_addr;
+	if (item->ri_buf[0].i_len != sizeof(struct xfs_cud_log_format))
+		return -EFSCORRUPTED;
+	cui_id = cud_formatp->cud_cui_id;
+
+	/*
+	 * Search for the CUI with the id in the CUD format structure in the
+	 * AIL.
+	 */
+	spin_lock(&ailp->xa_lock);
+	lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
+	while (lip != NULL) {
+		if (lip->li_type == XFS_LI_CUI) {
+			cuip = (struct xfs_cui_log_item *)lip;
+			if (cuip->cui_format.cui_id == cui_id) {
+				/*
+				 * Drop the CUD reference to the CUI. This
+				 * removes the CUI from the AIL and frees it.
+				 */
+				spin_unlock(&ailp->xa_lock);
+				xfs_cui_release(cuip);
+				spin_lock(&ailp->xa_lock);
+				break;
+			}
+		}
+		lip = xfs_trans_ail_cursor_next(ailp, &cur);
+	}
+
+	xfs_trans_ail_cursor_done(&cur);
+	spin_unlock(&ailp->xa_lock);
+
+	return 0;
+}
+
+/*
  * This routine is called when an inode create format structure is found in a
 * committed transaction in the log.  Its purpose is to initialise the inodes
  * being allocated on disk. This requires us to get inode cluster buffers that
@@ -3773,6 +3893,8 @@ xlog_recover_ra_pass2(
 	case XFS_LI_QUOTAOFF:
 	case XFS_LI_RUI:
 	case XFS_LI_RUD:
+	case XFS_LI_CUI:
+	case XFS_LI_CUD:
 	default:
 		break;
 	}
@@ -3798,6 +3920,8 @@ xlog_recover_commit_pass1(
 	case XFS_LI_ICREATE:
 	case XFS_LI_RUI:
 	case XFS_LI_RUD:
+	case XFS_LI_CUI:
+	case XFS_LI_CUD:
 		/* nothing to do in pass 1 */
 		return 0;
 	default:
@@ -3832,6 +3956,10 @@ xlog_recover_commit_pass2(
 		return xlog_recover_rui_pass2(log, item, trans->r_lsn);
 	case XFS_LI_RUD:
 		return xlog_recover_rud_pass2(log, item);
+	case XFS_LI_CUI:
+		return xlog_recover_cui_pass2(log, item, trans->r_lsn);
+	case XFS_LI_CUD:
+		return xlog_recover_cud_pass2(log, item);
 	case XFS_LI_DQUOT:
 		return xlog_recover_dquot_pass2(log, buffer_list, item,
 						trans->r_lsn);
@@ -4419,12 +4547,53 @@ xlog_recover_cancel_rui(
 	spin_lock(&ailp->xa_lock);
 }
 
+/* Recover the CUI if necessary. */
+STATIC int
+xlog_recover_process_cui(
+	struct xfs_mount		*mp,
+	struct xfs_ail			*ailp,
+	struct xfs_log_item		*lip)
+{
+	struct xfs_cui_log_item		*cuip;
+	int				error;
+
+	/*
+	 * Skip CUIs that we've already processed.
+	 */
+	cuip = container_of(lip, struct xfs_cui_log_item, cui_item);
+	if (test_bit(XFS_CUI_RECOVERED, &cuip->cui_flags))
+		return 0;
+
+	spin_unlock(&ailp->xa_lock);
+	error = xfs_cui_recover(mp, cuip);
+	spin_lock(&ailp->xa_lock);
+
+	return error;
+}
+
+/* Release the CUI since we're cancelling everything. */
+STATIC void
+xlog_recover_cancel_cui(
+	struct xfs_mount		*mp,
+	struct xfs_ail			*ailp,
+	struct xfs_log_item		*lip)
+{
+	struct xfs_cui_log_item		*cuip;
+
+	cuip = container_of(lip, struct xfs_cui_log_item, cui_item);
+
+	spin_unlock(&ailp->xa_lock);
+	xfs_cui_release(cuip);
+	spin_lock(&ailp->xa_lock);
+}
+
 /* Is this log item a deferred action intent? */
 static inline bool xlog_item_is_intent(struct xfs_log_item *lip)
 {
 	switch (lip->li_type) {
 	case XFS_LI_EFI:
 	case XFS_LI_RUI:
+	case XFS_LI_CUI:
 		return true;
 	default:
 		return false;
@@ -4488,6 +4657,9 @@ xlog_recover_process_intents(
 		case XFS_LI_RUI:
 			error = xlog_recover_process_rui(log->l_mp, ailp, lip);
 			break;
+		case XFS_LI_CUI:
+			error = xlog_recover_process_cui(log->l_mp, ailp, lip);
+			break;
 		}
 		if (error)
 			goto out;
@@ -4535,6 +4707,9 @@ xlog_recover_cancel_intents(
 		case XFS_LI_RUI:
 			xlog_recover_cancel_rui(log->l_mp, ailp, lip);
 			break;
+		case XFS_LI_CUI:
+			xlog_recover_cancel_cui(log->l_mp, ailp, lip);
+			break;
 		}
 
 		lip = xfs_trans_ail_cursor_next(ailp, &cur);
diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
index f9ad055..599a8d2 100644
--- a/fs/xfs/xfs_refcount_item.c
+++ b/fs/xfs/xfs_refcount_item.c
@@ -22,12 +22,15 @@
 #include "xfs_format.h"
 #include "xfs_log_format.h"
 #include "xfs_trans_resv.h"
+#include "xfs_bit.h"
 #include "xfs_mount.h"
+#include "xfs_defer.h"
 #include "xfs_trans.h"
 #include "xfs_trans_priv.h"
 #include "xfs_buf_item.h"
 #include "xfs_refcount_item.h"
 #include "xfs_log.h"
+#include "xfs_refcount.h"
 
 
 kmem_zone_t	*xfs_cui_zone;
@@ -381,3 +384,60 @@ xfs_cud_init(
 
 	return cudp;
 }
+
+/*
+ * Process a refcount update intent item that was recovered from the log.
+ * We need to update the refcountbt.
+ */
+int
+xfs_cui_recover(
+	struct xfs_mount		*mp,
+	struct xfs_cui_log_item		*cuip)
+{
+	int				i;
+	int				error = 0;
+	struct xfs_phys_extent		*refc;
+	xfs_fsblock_t			startblock_fsb;
+	bool				op_ok;
+
+	ASSERT(!test_bit(XFS_CUI_RECOVERED, &cuip->cui_flags));
+
+	/*
+	 * First check the validity of the extents described by the
+	 * CUI.  If any are bad, then assume that all are bad and
+	 * just toss the CUI.
+	 */
+	for (i = 0; i < cuip->cui_format.cui_nextents; i++) {
+		refc = &cuip->cui_format.cui_extents[i];
+		startblock_fsb = XFS_BB_TO_FSB(mp,
+				   XFS_FSB_TO_DADDR(mp, refc->pe_startblock));
+		switch (refc->pe_flags & XFS_REFCOUNT_EXTENT_TYPE_MASK) {
+		case XFS_REFCOUNT_INCREASE:
+		case XFS_REFCOUNT_DECREASE:
+		case XFS_REFCOUNT_ALLOC_COW:
+		case XFS_REFCOUNT_FREE_COW:
+			op_ok = true;
+			break;
+		default:
+			op_ok = false;
+			break;
+		}
+		if (!op_ok || startblock_fsb == 0 ||
+		    refc->pe_len == 0 ||
+		    startblock_fsb >= mp->m_sb.sb_dblocks ||
+		    refc->pe_len >= mp->m_sb.sb_agblocks ||
+		    (refc->pe_flags & ~XFS_REFCOUNT_EXTENT_FLAGS)) {
+			/*
+			 * This will pull the CUI from the AIL and
+			 * free the memory associated with it.
+			 */
+			set_bit(XFS_CUI_RECOVERED, &cuip->cui_flags);
+			xfs_cui_release(cuip);
+			return -EIO;
+		}
+	}
+
+	set_bit(XFS_CUI_RECOVERED, &cuip->cui_flags);
+	xfs_cui_release(cuip);
+	return error;
+}
diff --git a/fs/xfs/xfs_refcount_item.h b/fs/xfs/xfs_refcount_item.h
index 34b6f7a..5b74ddd 100644
--- a/fs/xfs/xfs_refcount_item.h
+++ b/fs/xfs/xfs_refcount_item.h
@@ -96,5 +96,6 @@ struct xfs_cud_log_item *xfs_cud_init(struct xfs_mount *,
 		struct xfs_cui_log_item *);
 void xfs_cui_item_free(struct xfs_cui_log_item *);
 void xfs_cui_release(struct xfs_cui_log_item *);
+int xfs_cui_recover(struct xfs_mount *mp, struct xfs_cui_log_item *cuip);
 
 #endif	/* __XFS_REFCOUNT_ITEM_H__ */
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index c7b9853..fed1906 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -2932,6 +2932,36 @@ DEFINE_AG_EXTENT_EVENT(xfs_refcount_find_shared);
 DEFINE_AG_EXTENT_EVENT(xfs_refcount_find_shared_result);
 DEFINE_AG_ERROR_EVENT(xfs_refcount_find_shared_error);
 
+TRACE_EVENT(xfs_refcount_finish_one_leftover,
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
+		 int type, xfs_agblock_t agbno,
+		 xfs_extlen_t len, xfs_extlen_t adjusted),
+	TP_ARGS(mp, agno, type, agbno, len, adjusted),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(int, type)
+		__field(xfs_agblock_t, agbno)
+		__field(xfs_extlen_t, len)
+		__field(xfs_extlen_t, adjusted)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->agno = agno;
+		__entry->type = type;
+		__entry->agbno = agbno;
+		__entry->len = len;
+		__entry->adjusted = adjusted;
+	),
+	TP_printk("dev %d:%d type %d agno %u agbno %u len %u adjusted %u",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->type,
+		  __entry->agno,
+		  __entry->agbno,
+		  __entry->len,
+		  __entry->adjusted)
+);
+
 #endif /* _TRACE_XFS_H */
 
 #undef TRACE_INCLUDE_PATH
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index e2bf86a..fe69e20 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -36,6 +36,7 @@ struct xfs_busy_extent;
 struct xfs_rud_log_item;
 struct xfs_rui_log_item;
 struct xfs_btree_cur;
+struct xfs_cui_log_item;
 
 typedef struct xfs_log_item {
 	struct list_head		li_ail;		/* AIL pointers */
@@ -248,4 +249,14 @@ int xfs_trans_log_finish_rmap_update(struct xfs_trans *tp,
 		xfs_fsblock_t startblock, xfs_filblks_t blockcount,
 		xfs_exntst_t state, struct xfs_btree_cur **pcur);
 
+/* refcount updates */
+enum xfs_refcount_intent_type;
+
+struct xfs_cud_log_item *xfs_trans_get_cud(struct xfs_trans *tp,
+		struct xfs_cui_log_item *cuip);
+int xfs_trans_log_finish_refcount_update(struct xfs_trans *tp,
+		struct xfs_cud_log_item *cudp,
+		enum xfs_refcount_intent_type type, xfs_fsblock_t startblock,
+		xfs_extlen_t blockcount, struct xfs_btree_cur **pcur);
+
 #endif	/* __XFS_TRANS_H__ */
diff --git a/fs/xfs/xfs_trans_refcount.c b/fs/xfs/xfs_trans_refcount.c
new file mode 100644
index 0000000..b18d548
--- /dev/null
+++ b/fs/xfs/xfs_trans_refcount.c
@@ -0,0 +1,80 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_trans.h"
+#include "xfs_trans_priv.h"
+#include "xfs_refcount_item.h"
+#include "xfs_alloc.h"
+#include "xfs_refcount.h"
+
+/*
+ * This routine is called to allocate a "refcount update done"
+ * log item.
+ */
+struct xfs_cud_log_item *
+xfs_trans_get_cud(
+	struct xfs_trans		*tp,
+	struct xfs_cui_log_item		*cuip)
+{
+	struct xfs_cud_log_item		*cudp;
+
+	cudp = xfs_cud_init(tp->t_mountp, cuip);
+	xfs_trans_add_item(tp, &cudp->cud_item);
+	return cudp;
+}
+
+/*
+ * Finish a refcount update and log it to the CUD.  Note that the
+ * transaction is marked dirty regardless of whether the refcount
+ * update succeeds or fails, to support the CUI/CUD lifecycle rules.
+ */
+int
+xfs_trans_log_finish_refcount_update(
+	struct xfs_trans		*tp,
+	struct xfs_cud_log_item		*cudp,
+	enum xfs_refcount_intent_type	type,
+	xfs_fsblock_t			startblock,
+	xfs_extlen_t			blockcount,
+	struct xfs_btree_cur		**pcur)
+{
+	int				error;
+
+	/* XXX: leave this empty for now */
+	error = -EFSCORRUPTED;
+
+	/*
+	 * Mark the transaction dirty, even on error. This ensures the
+	 * transaction is aborted, which:
+	 *
+	 * 1.) releases the CUI and frees the CUD
+	 * 2.) shuts down the filesystem
+	 */
+	tp->t_flags |= XFS_TRANS_DIRTY;
+	cudp->cud_item.li_desc->lid_flags |= XFS_LID_DIRTY;
+
+	return error;
+}


^ permalink raw reply related	[flat|nested] 187+ messages in thread

* [PATCH 13/63] xfs: adjust refcount of an extent of blocks in refcount btree
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (11 preceding siblings ...)
  2016-09-30  3:06 ` [PATCH 12/63] xfs: log refcount intent items Darrick J. Wong
@ 2016-09-30  3:06 ` Darrick J. Wong
  2016-09-30  7:11   ` Christoph Hellwig
  2016-09-30  3:07 ` [PATCH 14/63] xfs: connect refcount adjust functions to upper layers Darrick J. Wong
                   ` (50 subsequent siblings)
  63 siblings, 1 reply; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:06 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel

Provide functions to adjust the reference counts for an extent of
physical blocks stored in the refcount btree.
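
As a rough model of the invariant these functions maintain (a sketch
only; the names below are invented and none of this is XFS code): the
btree needs one record per maximal run of blocks whose refcount
exceeds 1, since refcount == 1 is implied by a gap and refcount == 0
is free space.  A per-block toy in plain C:

```c
#include <assert.h>

/* Raise or lower the refcount of every block in [agbno, agbno + aglen). */
static void adjust(unsigned *refc, unsigned agbno, unsigned aglen, int adj)
{
	for (unsigned b = agbno; b < agbno + aglen; b++)
		refc[b] += adj;
}

/*
 * Records the btree would need: one per maximal run of equal counts
 * greater than 1.  Runs with count == 1 are gaps; 0 means free space.
 */
static unsigned records_needed(const unsigned *refc, unsigned n)
{
	unsigned recs = 0;

	for (unsigned b = 0; b < n; b++)
		if (refc[b] > 1 && (b == 0 || refc[b] != refc[b - 1]))
			recs++;
	return recs;
}
```

The real code gets the same effect with extent records: splitting any
record that crosses an adjustment-range boundary, adjusting or deleting
the records inside the range, and merging equal neighbours afterwards.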

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
v2: Refactor the left/right split code into a single function.  Track
the number of btree shape changes and record updates during a refcount
update so that we can decide if we need to get a fresh transaction to
continue.
---
 fs/xfs/libxfs/xfs_refcount.c |  797 ++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_error.h           |    4 
 2 files changed, 800 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
index de13406..36946fc 100644
--- a/fs/xfs/libxfs/xfs_refcount.c
+++ b/fs/xfs/libxfs/xfs_refcount.c
@@ -37,6 +37,12 @@
 #include "xfs_bit.h"
 #include "xfs_refcount.h"
 
+/* Allowable refcount adjustment amounts. */
+enum xfs_refc_adjust_op {
+	XFS_REFCOUNT_ADJUST_INCREASE	= 1,
+	XFS_REFCOUNT_ADJUST_DECREASE	= -1,
+};
+
 /*
  * Look up the first record less than or equal to [bno, len] in the btree
  * given by cur.
@@ -175,3 +181,794 @@ xfs_refcount_delete(
 				cur->bc_private.a.agno, error, _RET_IP_);
 	return error;
 }
+
+/*
+ * Adjusting the Reference Count
+ *
+ * As stated elsewhere, the reference count btree (refcbt) stores
+ * >1 reference counts for extents of physical blocks.  In this
+ * operation, we're either raising or lowering the reference count of
+ * some subrange stored in the tree:
+ *
+ *      <------ adjustment range ------>
+ * ----+   +---+-----+ +--+--------+---------
+ *  2  |   | 3 |  4  | |17|   55   |   10
+ * ----+   +---+-----+ +--+--------+---------
+ * X axis is the physical block number;
+ * reference counts are the numbers inside the rectangles
+ *
+ * The first thing we need to do is to ensure that there are no
+ * refcount extents crossing either boundary of the range to be
+ * adjusted.  For any extent that does cross a boundary, split it into
+ * two extents so that we can increment the refcount of one of the
+ * pieces later:
+ *
+ *      <------ adjustment range ------>
+ * ----+   +---+-----+ +--+--------+----+----
+ *  2  |   | 3 |  2  | |17|   55   | 10 | 10
+ * ----+   +---+-----+ +--+--------+----+----
+ *
+ * For this next step, let's assume that all the physical blocks in
+ * the adjustment range are mapped to a file and are therefore in use
+ * at least once.  Therefore, we can infer that any gap in the
+ * refcount tree within the adjustment range represents a physical
+ * extent with refcount == 1:
+ *
+ *      <------ adjustment range ------>
+ * ----+---+---+-----+-+--+--------+----+----
+ *  2  |"1"| 3 |  2  |1|17|   55   | 10 | 10
+ * ----+---+---+-----+-+--+--------+----+----
+ *      ^
+ *
+ * For each extent that falls within the interval range, figure out
+ * which extent is to the left or the right of that extent.  Now we
+ * have a left, current, and right extent.  If the new reference count
+ * of the center extent enables us to merge left, center, and right
+ * into one record covering all three, do so.  If the center extent is
+ * at the left end of the range, abuts the left extent, and its new
+ * reference count matches the left extent's record, then merge them.
+ * If the center extent is at the right end of the range, abuts the
+ * right extent, and the reference counts match, merge those.  In the
+ * example, we can left merge (assuming an increment operation):
+ *
+ *      <------ adjustment range ------>
+ * --------+---+-----+-+--+--------+----+----
+ *    2    | 3 |  2  |1|17|   55   | 10 | 10
+ * --------+---+-----+-+--+--------+----+----
+ *          ^
+ *
+ * For all other extents within the range, adjust the reference count
+ * or delete it if the refcount falls below 2.  If we were
+ * incrementing, the end result looks like this:
+ *
+ *      <------ adjustment range ------>
+ * --------+---+-----+-+--+--------+----+----
+ *    2    | 4 |  3  |2|18|   56   | 11 | 10
+ * --------+---+-----+-+--+--------+----+----
+ *
+ * The result of a decrement operation looks as such:
+ *
+ *      <------ adjustment range ------>
+ * ----+   +---+       +--+--------+----+----
+ *  2  |   | 2 |       |16|   54   |  9 | 10
+ * ----+   +---+       +--+--------+----+----
+ *      DDDD    111111DD
+ *
+ * The blocks marked "D" are freed; the blocks marked "1" are only
+ * referenced once and therefore the record is removed from the
+ * refcount btree.
+ */
+
+#define RCNEXT(rc)	((rc).rc_startblock + (rc).rc_blockcount)
+/*
+ * Split a refcount extent that crosses agbno.
+ */
+STATIC int
+xfs_refcount_split_extent(
+	struct xfs_btree_cur		*cur,
+	xfs_agblock_t			agbno,
+	bool				*shape_changed)
+{
+	struct xfs_refcount_irec	rcext, tmp;
+	int				found_rec;
+	int				error;
+
+	*shape_changed = false;
+	error = xfs_refcount_lookup_le(cur, agbno, &found_rec);
+	if (error)
+		goto out_error;
+	if (!found_rec)
+		return 0;
+
+	error = xfs_refcount_get_rec(cur, &rcext, &found_rec);
+	if (error)
+		goto out_error;
+	XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1, out_error);
+	if (rcext.rc_startblock == agbno || RCNEXT(rcext) <= agbno)
+		return 0;
+
+	*shape_changed = true;
+	trace_xfs_refcount_split_extent(cur->bc_mp, cur->bc_private.a.agno,
+			&rcext, agbno);
+
+	/* Establish the right extent. */
+	tmp = rcext;
+	tmp.rc_startblock = agbno;
+	tmp.rc_blockcount -= (agbno - rcext.rc_startblock);
+	error = xfs_refcount_update(cur, &tmp);
+	if (error)
+		goto out_error;
+
+	/* Insert the left extent. */
+	tmp = rcext;
+	tmp.rc_blockcount = agbno - rcext.rc_startblock;
+	error = xfs_refcount_insert(cur, &tmp, &found_rec);
+	if (error)
+		goto out_error;
+	XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1, out_error);
+	return error;
+
+out_error:
+	trace_xfs_refcount_split_extent_error(cur->bc_mp,
+			cur->bc_private.a.agno, error, _RET_IP_);
+	return error;
+}
+
+/*
+ * Merge the left, center, and right extents.
+ */
+STATIC int
+xfs_refcount_merge_center_extents(
+	struct xfs_btree_cur		*cur,
+	struct xfs_refcount_irec	*left,
+	struct xfs_refcount_irec	*center,
+	struct xfs_refcount_irec	*right,
+	unsigned long long		extlen,
+	xfs_agblock_t			*agbno,
+	xfs_extlen_t			*aglen)
+{
+	int				error;
+	int				found_rec;
+
+	trace_xfs_refcount_merge_center_extents(cur->bc_mp,
+			cur->bc_private.a.agno, left, center, right);
+
+	error = xfs_refcount_lookup_ge(cur, center->rc_startblock,
+			&found_rec);
+	if (error)
+		goto out_error;
+	XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1, out_error);
+
+	error = xfs_refcount_delete(cur, &found_rec);
+	if (error)
+		goto out_error;
+	XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1, out_error);
+
+	/*
+	 * If the center extent wasn't synthesized, remove it.
+	 * See the comments in xfs_refcount_find_left_extents explaining
+	 * when this is possible.
+	 */
+	if (center->rc_refcount > 1) {
+		error = xfs_refcount_delete(cur, &found_rec);
+		if (error)
+			goto out_error;
+		XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1,
+				out_error);
+	}
+
+	error = xfs_refcount_lookup_le(cur, left->rc_startblock,
+			&found_rec);
+	if (error)
+		goto out_error;
+	XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1, out_error);
+
+	left->rc_blockcount = extlen;
+	error = xfs_refcount_update(cur, left);
+	if (error)
+		goto out_error;
+
+	*aglen = 0;
+	return error;
+
+out_error:
+	trace_xfs_refcount_merge_center_extents_error(cur->bc_mp,
+			cur->bc_private.a.agno, error, _RET_IP_);
+	return error;
+}
+
+/*
+ * Merge with the left extent.
+ */
+STATIC int
+xfs_refcount_merge_left_extent(
+	struct xfs_btree_cur		*cur,
+	struct xfs_refcount_irec	*left,
+	struct xfs_refcount_irec	*cleft,
+	xfs_agblock_t			*agbno,
+	xfs_extlen_t			*aglen)
+{
+	int				error;
+	int				found_rec;
+
+	trace_xfs_refcount_merge_left_extent(cur->bc_mp,
+			cur->bc_private.a.agno, left, cleft);
+
+	/* If the left extent wasn't synthesized, remove it. */
+	if (cleft->rc_refcount > 1) {
+		error = xfs_refcount_lookup_le(cur, cleft->rc_startblock,
+				&found_rec);
+		if (error)
+			goto out_error;
+		XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1,
+				out_error);
+
+		error = xfs_refcount_delete(cur, &found_rec);
+		if (error)
+			goto out_error;
+		XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1,
+				out_error);
+	}
+
+	error = xfs_refcount_lookup_le(cur, left->rc_startblock,
+			&found_rec);
+	if (error)
+		goto out_error;
+	XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1, out_error);
+
+	left->rc_blockcount += cleft->rc_blockcount;
+	error = xfs_refcount_update(cur, left);
+	if (error)
+		goto out_error;
+
+	*agbno += cleft->rc_blockcount;
+	*aglen -= cleft->rc_blockcount;
+	return error;
+
+out_error:
+	trace_xfs_refcount_merge_left_extent_error(cur->bc_mp,
+			cur->bc_private.a.agno, error, _RET_IP_);
+	return error;
+}
+
+/*
+ * Merge with the right extent.
+ */
+STATIC int
+xfs_refcount_merge_right_extent(
+	struct xfs_btree_cur		*cur,
+	struct xfs_refcount_irec	*right,
+	struct xfs_refcount_irec	*cright,
+	xfs_agblock_t			*agbno,
+	xfs_extlen_t			*aglen)
+{
+	int				error;
+	int				found_rec;
+
+	trace_xfs_refcount_merge_right_extent(cur->bc_mp,
+			cur->bc_private.a.agno, cright, right);
+
+	/* If the right extent wasn't synthesized, remove it. */
+	if (cright->rc_refcount > 1) {
+		error = xfs_refcount_lookup_le(cur, cright->rc_startblock,
+			&found_rec);
+		if (error)
+			goto out_error;
+		XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1,
+				out_error);
+
+		error = xfs_refcount_delete(cur, &found_rec);
+		if (error)
+			goto out_error;
+		XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1,
+				out_error);
+	}
+
+	error = xfs_refcount_lookup_le(cur, right->rc_startblock,
+			&found_rec);
+	if (error)
+		goto out_error;
+	XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1, out_error);
+
+	right->rc_startblock -= cright->rc_blockcount;
+	right->rc_blockcount += cright->rc_blockcount;
+	error = xfs_refcount_update(cur, right);
+	if (error)
+		goto out_error;
+
+	*aglen -= cright->rc_blockcount;
+	return error;
+
+out_error:
+	trace_xfs_refcount_merge_right_extent_error(cur->bc_mp,
+			cur->bc_private.a.agno, error, _RET_IP_);
+	return error;
+}
+
+/*
+ * Find the left extent and the one after it (cleft).  This function assumes
+ * that we've already split any extent crossing agbno.
+ */
+STATIC int
+xfs_refcount_find_left_extents(
+	struct xfs_btree_cur		*cur,
+	struct xfs_refcount_irec	*left,
+	struct xfs_refcount_irec	*cleft,
+	xfs_agblock_t			agbno,
+	xfs_extlen_t			aglen)
+{
+	struct xfs_refcount_irec	tmp;
+	int				error;
+	int				found_rec;
+
+	left->rc_startblock = cleft->rc_startblock = NULLAGBLOCK;
+	error = xfs_refcount_lookup_le(cur, agbno - 1, &found_rec);
+	if (error)
+		goto out_error;
+	if (!found_rec)
+		return 0;
+
+	error = xfs_refcount_get_rec(cur, &tmp, &found_rec);
+	if (error)
+		goto out_error;
+	XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1, out_error);
+
+	if (RCNEXT(tmp) != agbno)
+		return 0;
+	/* We have a left extent; retrieve (or invent) the next right one */
+	*left = tmp;
+
+	error = xfs_btree_increment(cur, 0, &found_rec);
+	if (error)
+		goto out_error;
+	if (found_rec) {
+		error = xfs_refcount_get_rec(cur, &tmp, &found_rec);
+		if (error)
+			goto out_error;
+		XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1,
+				out_error);
+
+		/* if tmp starts at the end of our range, just use that */
+		if (tmp.rc_startblock == agbno)
+			*cleft = tmp;
+		else {
+			/*
+			 * There's a gap in the refcntbt at the start of the
+			 * range we're interested in (refcount == 1) so
+			 * synthesize the implied extent and pass it back.
+			 * We assume here that the agbno/aglen range was
+			 * passed in from a data fork extent mapping and
+			 * therefore is allocated to exactly one owner.
+			 */
+			cleft->rc_startblock = agbno;
+			cleft->rc_blockcount = min(aglen,
+					tmp.rc_startblock - agbno);
+			cleft->rc_refcount = 1;
+		}
+	} else {
+		/*
+		 * No extents, so pretend that there's one covering the whole
+		 * range.
+		 */
+		cleft->rc_startblock = agbno;
+		cleft->rc_blockcount = aglen;
+		cleft->rc_refcount = 1;
+	}
+	trace_xfs_refcount_find_left_extent(cur->bc_mp, cur->bc_private.a.agno,
+			left, cleft, agbno);
+	return error;
+
+out_error:
+	trace_xfs_refcount_find_left_extent_error(cur->bc_mp,
+			cur->bc_private.a.agno, error, _RET_IP_);
+	return error;
+}
+
+/*
+ * Find the right extent and the one before it (cright).  This function
+ * assumes that we've already split any extents crossing agbno + aglen.
+ */
+STATIC int
+xfs_refcount_find_right_extents(
+	struct xfs_btree_cur		*cur,
+	struct xfs_refcount_irec	*right,
+	struct xfs_refcount_irec	*cright,
+	xfs_agblock_t			agbno,
+	xfs_extlen_t			aglen)
+{
+	struct xfs_refcount_irec	tmp;
+	int				error;
+	int				found_rec;
+
+	right->rc_startblock = cright->rc_startblock = NULLAGBLOCK;
+	error = xfs_refcount_lookup_ge(cur, agbno + aglen, &found_rec);
+	if (error)
+		goto out_error;
+	if (!found_rec)
+		return 0;
+
+	error = xfs_refcount_get_rec(cur, &tmp, &found_rec);
+	if (error)
+		goto out_error;
+	XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1, out_error);
+
+	if (tmp.rc_startblock != agbno + aglen)
+		return 0;
+	/* We have a right extent; retrieve (or invent) the next left one */
+	*right = tmp;
+
+	error = xfs_btree_decrement(cur, 0, &found_rec);
+	if (error)
+		goto out_error;
+	if (found_rec) {
+		error = xfs_refcount_get_rec(cur, &tmp, &found_rec);
+		if (error)
+			goto out_error;
+		XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, found_rec == 1,
+				out_error);
+
+		/* if tmp ends at the end of our range, just use that */
+		if (RCNEXT(tmp) == agbno + aglen)
+			*cright = tmp;
+		else {
+			/*
+			 * There's a gap in the refcntbt at the end of the
+			 * range we're interested in (refcount == 1) so
+			 * create the implied extent and pass it back.
+			 * We assume here that the agbno/aglen range was
+			 * passed in from a data fork extent mapping and
+			 * therefore is allocated to exactly one owner.
+			 */
+			cright->rc_startblock = max(agbno, RCNEXT(tmp));
+			cright->rc_blockcount = right->rc_startblock -
+					cright->rc_startblock;
+			cright->rc_refcount = 1;
+		}
+	} else {
+		/*
+		 * No extents, so pretend that there's one covering the whole
+		 * range.
+		 */
+		cright->rc_startblock = agbno;
+		cright->rc_blockcount = aglen;
+		cright->rc_refcount = 1;
+	}
+	trace_xfs_refcount_find_right_extent(cur->bc_mp, cur->bc_private.a.agno,
+			cright, right, agbno + aglen);
+	return error;
+
+out_error:
+	trace_xfs_refcount_find_right_extent_error(cur->bc_mp,
+			cur->bc_private.a.agno, error, _RET_IP_);
+	return error;
+}
+#undef RCNEXT
+
+#define IS_VALID_RCEXT(ext)	((ext).rc_startblock != NULLAGBLOCK)
+/*
+ * Try to merge with any extents on the boundaries of the adjustment range.
+ */
+STATIC int
+xfs_refcount_merge_extents(
+	struct xfs_btree_cur	*cur,
+	xfs_agblock_t		*agbno,
+	xfs_extlen_t		*aglen,
+	enum xfs_refc_adjust_op adjust,
+	bool			*shape_changed)
+{
+	struct xfs_refcount_irec	left = {0}, cleft = {0};
+	struct xfs_refcount_irec	cright = {0}, right = {0};
+	int				error;
+	unsigned long long		ulen;
+	bool				cequal;
+
+	*shape_changed = false;
+	/*
+	 * Find the extent just below agbno [left], just above agbno [cleft],
+	 * just below (agbno + aglen) [cright], and just above (agbno + aglen)
+	 * [right].
+	 */
+	error = xfs_refcount_find_left_extents(cur, &left, &cleft, *agbno,
+			*aglen);
+	if (error)
+		return error;
+	error = xfs_refcount_find_right_extents(cur, &right, &cright, *agbno,
+			*aglen);
+	if (error)
+		return error;
+
+	/* No left or right extent to merge; exit. */
+	if (!IS_VALID_RCEXT(left) && !IS_VALID_RCEXT(right))
+		return 0;
+
+	cequal = (cleft.rc_startblock == cright.rc_startblock) &&
+		 (cleft.rc_blockcount == cright.rc_blockcount);
+
+	/* Try to merge left, cleft, and right.  cleft must == cright. */
+	ulen = (unsigned long long)left.rc_blockcount + cleft.rc_blockcount +
+			right.rc_blockcount;
+	if (IS_VALID_RCEXT(left) && IS_VALID_RCEXT(right) &&
+	    IS_VALID_RCEXT(cleft) && IS_VALID_RCEXT(cright) && cequal &&
+	    left.rc_refcount == cleft.rc_refcount + adjust &&
+	    right.rc_refcount == cleft.rc_refcount + adjust &&
+	    ulen < MAXREFCEXTLEN) {
+		*shape_changed = true;
+		return xfs_refcount_merge_center_extents(cur, &left, &cleft,
+				&right, ulen, agbno, aglen);
+	}
+
+	/* Try to merge left and cleft. */
+	ulen = (unsigned long long)left.rc_blockcount + cleft.rc_blockcount;
+	if (IS_VALID_RCEXT(left) && IS_VALID_RCEXT(cleft) &&
+	    left.rc_refcount == cleft.rc_refcount + adjust &&
+	    ulen < MAXREFCEXTLEN) {
+		*shape_changed = true;
+		error = xfs_refcount_merge_left_extent(cur, &left, &cleft,
+				agbno, aglen);
+		if (error)
+			return error;
+
+		/*
+		 * If we just merged left + cleft and cleft == cright,
+		 * we no longer have a cright to merge with right.  We're done.
+		 */
+		if (cequal)
+			return 0;
+	}
+
+	/* Try to merge cright and right. */
+	ulen = (unsigned long long)right.rc_blockcount + cright.rc_blockcount;
+	if (IS_VALID_RCEXT(right) && IS_VALID_RCEXT(cright) &&
+	    right.rc_refcount == cright.rc_refcount + adjust &&
+	    ulen < MAXREFCEXTLEN) {
+		*shape_changed = true;
+		return xfs_refcount_merge_right_extent(cur, &right, &cright,
+				agbno, aglen);
+	}
+
+	return error;
+}
+#undef IS_VALID_RCEXT
+
+/*
+ * While we're adjusting the refcounts records of an extent, we have
+ * to keep an eye on the number of extents we're dirtying -- run too
+ * many in a single transaction and we'll exceed the transaction's
+ * reservation and crash the fs.  Each record adds 12 bytes to the
+ * log (plus any key updates) so we'll conservatively assume 32 bytes
+ * per record.  We must also leave space for btree splits on both ends
+ * of the range and space for the CUD and a new CUI.
+ *
+ * XXX: This is a pretty hand-wavy estimate.  The penalty for guessing
+ * true incorrectly is a shutdown FS; the penalty for guessing false
+ * incorrectly is more transaction rolls than might be necessary.
+ * Be conservative here.
+ */
+static bool
+xfs_refcount_still_have_space(
+	struct xfs_btree_cur		*cur)
+{
+	unsigned long			overhead;
+
+	overhead = cur->bc_private.a.priv.refc.shape_changes *
+			xfs_allocfree_log_count(cur->bc_mp, 1);
+	overhead *= cur->bc_mp->m_sb.sb_blocksize;
+
+	/*
+	 * Only allow 2 refcount extent updates per transaction if the
+	 * refcount continue update "error" has been injected.
+	 */
+	if (cur->bc_private.a.priv.refc.nr_ops > 2 &&
+	    XFS_TEST_ERROR(false, cur->bc_mp,
+			XFS_ERRTAG_REFCOUNT_CONTINUE_UPDATE,
+			XFS_RANDOM_REFCOUNT_CONTINUE_UPDATE))
+		return false;
+
+	if (cur->bc_private.a.priv.refc.nr_ops == 0)
+		return true;
+	else if (overhead > cur->bc_tp->t_log_res)
+		return false;
+	return  cur->bc_tp->t_log_res - overhead >
+		cur->bc_private.a.priv.refc.nr_ops * 32;
+}
+
+/*
+ * Adjust the refcounts of middle extents.  At this point we should have
+ * split extents that crossed the adjustment range; merged with adjacent
+ * extents; and updated agbno/aglen to reflect the merges.  Therefore,
+ * all we have to do is update the extents inside [agbno, agbno + aglen].
+ */
+STATIC int
+xfs_refcount_adjust_extents(
+	struct xfs_btree_cur	*cur,
+	xfs_agblock_t		agbno,
+	xfs_extlen_t		aglen,
+	xfs_extlen_t		*adjusted,
+	enum xfs_refc_adjust_op	adj,
+	struct xfs_defer_ops	*dfops,
+	struct xfs_owner_info	*oinfo)
+{
+	struct xfs_refcount_irec	ext, tmp;
+	int				error;
+	int				found_rec, found_tmp;
+	xfs_fsblock_t			fsbno;
+
+	/* Merging did all the work already. */
+	if (aglen == 0)
+		return 0;
+
+	error = xfs_refcount_lookup_ge(cur, agbno, &found_rec);
+	if (error)
+		goto out_error;
+
+	while (aglen > 0 && xfs_refcount_still_have_space(cur)) {
+		error = xfs_refcount_get_rec(cur, &ext, &found_rec);
+		if (error)
+			goto out_error;
+		if (!found_rec) {
+			ext.rc_startblock = cur->bc_mp->m_sb.sb_agblocks;
+			ext.rc_blockcount = 0;
+			ext.rc_refcount = 0;
+		}
+
+		/*
+		 * Deal with a hole in the refcount tree; if a file maps to
+		 * these blocks and there's no refcountbt record, pretend that
+		 * there is one with refcount == 1.
+		 */
+		if (ext.rc_startblock != agbno) {
+			tmp.rc_startblock = agbno;
+			tmp.rc_blockcount = min(aglen,
+					ext.rc_startblock - agbno);
+			tmp.rc_refcount = 1 + adj;
+			trace_xfs_refcount_modify_extent(cur->bc_mp,
+					cur->bc_private.a.agno, &tmp);
+
+			/*
+			 * Either cover the hole (increment) or
+			 * delete the range (decrement).
+			 */
+			if (tmp.rc_refcount) {
+				error = xfs_refcount_insert(cur, &tmp,
+						&found_tmp);
+				if (error)
+					goto out_error;
+				XFS_WANT_CORRUPTED_GOTO(cur->bc_mp,
+						found_tmp == 1, out_error);
+				cur->bc_private.a.priv.refc.nr_ops++;
+			} else {
+				fsbno = XFS_AGB_TO_FSB(cur->bc_mp,
+						cur->bc_private.a.agno,
+						tmp.rc_startblock);
+				xfs_bmap_add_free(cur->bc_mp, dfops, fsbno,
+						tmp.rc_blockcount, oinfo);
+			}
+
+			(*adjusted) += tmp.rc_blockcount;
+			agbno += tmp.rc_blockcount;
+			aglen -= tmp.rc_blockcount;
+
+			error = xfs_refcount_lookup_ge(cur, agbno,
+					&found_rec);
+			if (error)
+				goto out_error;
+		}
+
+		/* Stop if there's nothing left to modify */
+		if (aglen == 0 || !xfs_refcount_still_have_space(cur))
+			break;
+
+		/*
+		 * Adjust the reference count and either update the tree
+		 * (incr) or free the blocks (decr).
+		 */
+		if (ext.rc_refcount == MAXREFCOUNT)
+			goto skip;
+		ext.rc_refcount += adj;
+		trace_xfs_refcount_modify_extent(cur->bc_mp,
+				cur->bc_private.a.agno, &ext);
+		if (ext.rc_refcount > 1) {
+			error = xfs_refcount_update(cur, &ext);
+			if (error)
+				goto out_error;
+			cur->bc_private.a.priv.refc.nr_ops++;
+		} else if (ext.rc_refcount == 1) {
+			error = xfs_refcount_delete(cur, &found_rec);
+			if (error)
+				goto out_error;
+			XFS_WANT_CORRUPTED_GOTO(cur->bc_mp,
+					found_rec == 1, out_error);
+			cur->bc_private.a.priv.refc.nr_ops++;
+			goto advloop;
+		} else {
+			fsbno = XFS_AGB_TO_FSB(cur->bc_mp,
+					cur->bc_private.a.agno,
+					ext.rc_startblock);
+			xfs_bmap_add_free(cur->bc_mp, dfops, fsbno,
+					ext.rc_blockcount, oinfo);
+		}
+
+skip:
+		error = xfs_btree_increment(cur, 0, &found_rec);
+		if (error)
+			goto out_error;
+
+advloop:
+		(*adjusted) += ext.rc_blockcount;
+		agbno += ext.rc_blockcount;
+		aglen -= ext.rc_blockcount;
+	}
+
+	return error;
+out_error:
+	trace_xfs_refcount_modify_extent_error(cur->bc_mp,
+			cur->bc_private.a.agno, error, _RET_IP_);
+	return error;
+}
+
+/* Adjust the reference count of a range of AG blocks. */
+STATIC int
+xfs_refcount_adjust(
+	struct xfs_btree_cur	*cur,
+	xfs_agblock_t		agbno,
+	xfs_extlen_t		aglen,
+	xfs_extlen_t		*adjusted,
+	enum xfs_refc_adjust_op	adj,
+	struct xfs_defer_ops	*dfops,
+	struct xfs_owner_info	*oinfo)
+{
+	xfs_extlen_t		orig_aglen;
+	bool			shape_changed;
+	int			shape_changes = 0;
+	int			error;
+
+	*adjusted = 0;
+	if (adj == XFS_REFCOUNT_ADJUST_INCREASE)
+		trace_xfs_refcount_increase(cur->bc_mp, cur->bc_private.a.agno,
+				agbno, aglen);
+	else
+		trace_xfs_refcount_decrease(cur->bc_mp, cur->bc_private.a.agno,
+				agbno, aglen);
+
+	/*
+	 * Ensure that no rcextents cross the boundary of the adjustment range.
+	 */
+	error = xfs_refcount_split_extent(cur, agbno, &shape_changed);
+	if (error)
+		goto out_error;
+	if (shape_changed)
+		shape_changes++;
+
+	error = xfs_refcount_split_extent(cur, agbno + aglen, &shape_changed);
+	if (error)
+		goto out_error;
+	if (shape_changed)
+		shape_changes++;
+
+	/*
+	 * Try to merge with the left or right extents of the range.
+	 */
+	orig_aglen = aglen;
+	error = xfs_refcount_merge_extents(cur, &agbno, &aglen, adj,
+			&shape_changed);
+	if (error)
+		goto out_error;
+	if (shape_changed)
+		shape_changes++;
+	(*adjusted) += orig_aglen - aglen;
+	if (shape_changes)
+		cur->bc_private.a.priv.refc.shape_changes++;
+
+	/* Now that we've taken care of the ends, adjust the middle extents */
+	error = xfs_refcount_adjust_extents(cur, agbno, aglen, adjusted, adj,
+			dfops, oinfo);
+	if (error)
+		goto out_error;
+
+	return 0;
+
+out_error:
+	trace_xfs_refcount_adjust_error(cur->bc_mp, cur->bc_private.a.agno,
+			error, _RET_IP_);
+	return error;
+}
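
The hole-filling walk in xfs_refcount_adjust_extents() above relies on a key invariant: a gap in the refcount btree over mapped blocks implies refcount == 1, so the walk synthesizes those records on the fly. A minimal standalone sketch of that walk (struct rec, cover_range(), and the "records pre-split at the range boundaries" assumption are inventions for illustration, not kernel API):

```c
struct rec { unsigned long start, len, refcount; };

/*
 * Fill 'out' with records covering [agbno, agbno + aglen), inventing
 * refcount == 1 records for gaps in 'tree'.  'tree' is assumed sorted,
 * non-overlapping, and pre-split at the range boundaries (as
 * xfs_refcount_split_extent() guarantees in the kernel code).
 * Returns the number of records emitted.
 */
static int cover_range(const struct rec *tree, int ntree,
		       unsigned long agbno, unsigned long aglen,
		       struct rec *out)
{
	int i = 0, n = 0;

	while (aglen > 0) {
		/* skip records entirely below the range */
		while (i < ntree && tree[i].start + tree[i].len <= agbno)
			i++;
		if (i < ntree && tree[i].start == agbno) {
			/* a real record starts here; consume it */
			out[n++] = tree[i];
			agbno += tree[i].len;
			aglen -= tree[i].len;
			i++;
		} else {
			/* hole: pretend there's a refcount == 1 record */
			unsigned long end =
				(i < ntree && tree[i].start < agbno + aglen) ?
				tree[i].start : agbno + aglen;
			out[n].start = agbno;
			out[n].len = end - agbno;
			out[n].refcount = 1;
			agbno += out[n].len;
			aglen -= out[n].len;
			n++;
		}
	}
	return n;
}
```
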
diff --git a/fs/xfs/xfs_error.h b/fs/xfs/xfs_error.h
index 3d22470..d9675c64 100644
--- a/fs/xfs/xfs_error.h
+++ b/fs/xfs/xfs_error.h
@@ -92,7 +92,8 @@ extern void xfs_verifier_error(struct xfs_buf *bp);
 #define XFS_ERRTAG_BMAPIFORMAT				21
 #define XFS_ERRTAG_FREE_EXTENT				22
 #define XFS_ERRTAG_RMAP_FINISH_ONE			23
-#define XFS_ERRTAG_MAX					24
+#define XFS_ERRTAG_REFCOUNT_CONTINUE_UPDATE		24
+#define XFS_ERRTAG_MAX					25
 
 /*
  * Random factors for above tags, 1 means always, 2 means 1/2 time, etc.
@@ -121,6 +122,7 @@ extern void xfs_verifier_error(struct xfs_buf *bp);
 #define	XFS_RANDOM_BMAPIFORMAT				XFS_RANDOM_DEFAULT
 #define XFS_RANDOM_FREE_EXTENT				1
 #define XFS_RANDOM_RMAP_FINISH_ONE			1
+#define XFS_RANDOM_REFCOUNT_CONTINUE_UPDATE		1
 
 #ifdef DEBUG
 extern int xfs_error_test_active;
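
The xfs_refcount_still_have_space() heuristic in this patch boils down to arithmetic on the remaining log reservation. A standalone sketch (struct refc_budget and its fields are hypothetical stand-ins for the btree cursor and transaction state; the 32-bytes-per-record figure mirrors the kernel code):

```c
#include <stdbool.h>

/* Hypothetical stand-ins for cursor/transaction fields. */
struct refc_budget {
	unsigned long	nr_ops;		 /* refcount records dirtied so far */
	unsigned long	shape_changes;	 /* btree splits/merges so far */
	unsigned long	blocksize;	 /* fs block size in bytes */
	unsigned long	log_res;	 /* transaction log reservation */
	unsigned long	allocfree_blocks; /* worst-case blocks per shape change */
};

static bool still_have_space(const struct refc_budget *b)
{
	unsigned long overhead;

	/* each shape change may log a worst-case alloc/free of blocks */
	overhead = b->shape_changes * b->allocfree_blocks * b->blocksize;

	/* always allow at least one record update */
	if (b->nr_ops == 0)
		return true;
	if (overhead > b->log_res)
		return false;
	/* budget ~32 bytes of log space per record update */
	return (b->log_res - overhead) > b->nr_ops * 32;
}
```
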


^ permalink raw reply related	[flat|nested] 187+ messages in thread

* [PATCH 14/63] xfs: connect refcount adjust functions to upper layers
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (12 preceding siblings ...)
  2016-09-30  3:06 ` [PATCH 13/63] xfs: adjust refcount of an extent of blocks in refcount btree Darrick J. Wong
@ 2016-09-30  3:07 ` Darrick J. Wong
  2016-09-30  7:13   ` Christoph Hellwig
  2016-09-30 16:21   ` Brian Foster
  2016-09-30  3:07 ` [PATCH 15/63] xfs: adjust refcount when unmapping file blocks Darrick J. Wong
                   ` (49 subsequent siblings)
  63 siblings, 2 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:07 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel

Plumb in the upper level interface to schedule and finish deferred
refcount operations via the deferred ops mechanism.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_defer.h    |    1 
 fs/xfs/libxfs/xfs_refcount.c |  170 ++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_refcount.h |   12 +++
 fs/xfs/xfs_error.h           |    4 +
 fs/xfs/xfs_refcount_item.c   |   73 ++++++++++++++++
 fs/xfs/xfs_super.c           |    1 
 fs/xfs/xfs_trace.h           |    3 +
 fs/xfs/xfs_trans.h           |    8 +-
 fs/xfs/xfs_trans_refcount.c  |  186 ++++++++++++++++++++++++++++++++++++++++++
 9 files changed, 452 insertions(+), 6 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
index e96533d..4d94a86 100644
--- a/fs/xfs/libxfs/xfs_defer.h
+++ b/fs/xfs/libxfs/xfs_defer.h
@@ -51,6 +51,7 @@ struct xfs_defer_pending {
  * find all the space it needs.
  */
 enum xfs_defer_ops_type {
+	XFS_DEFER_OPS_TYPE_REFCOUNT,
 	XFS_DEFER_OPS_TYPE_RMAP,
 	XFS_DEFER_OPS_TYPE_FREE,
 	XFS_DEFER_OPS_TYPE_MAX,
diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
index 36946fc..49d8c6f 100644
--- a/fs/xfs/libxfs/xfs_refcount.c
+++ b/fs/xfs/libxfs/xfs_refcount.c
@@ -972,3 +972,173 @@ xfs_refcount_adjust(
 			error, _RET_IP_);
 	return error;
 }
+
+/* Clean up after calling xfs_refcount_finish_one. */
+void
+xfs_refcount_finish_one_cleanup(
+	struct xfs_trans	*tp,
+	struct xfs_btree_cur	*rcur,
+	int			error)
+{
+	struct xfs_buf		*agbp;
+
+	if (rcur == NULL)
+		return;
+	agbp = rcur->bc_private.a.agbp;
+	xfs_btree_del_cursor(rcur, error ? XFS_BTREE_ERROR : XFS_BTREE_NOERROR);
+	if (error)
+		xfs_trans_brelse(tp, agbp);
+}
+
+/*
+ * Process one of the deferred refcount operations.  We pass back the
+ * btree cursor to maintain our lock on the btree between calls.
+ * This saves time and eliminates a buffer deadlock between the
+ * superblock and the AGF because we'll always grab them in the same
+ * order.
+ */
+int
+xfs_refcount_finish_one(
+	struct xfs_trans		*tp,
+	struct xfs_defer_ops		*dfops,
+	enum xfs_refcount_intent_type	type,
+	xfs_fsblock_t			startblock,
+	xfs_extlen_t			blockcount,
+	xfs_extlen_t			*adjusted,
+	struct xfs_btree_cur		**pcur)
+{
+	struct xfs_mount		*mp = tp->t_mountp;
+	struct xfs_btree_cur		*rcur;
+	struct xfs_buf			*agbp = NULL;
+	int				error = 0;
+	xfs_agnumber_t			agno;
+	xfs_agblock_t			bno;
+	unsigned long			nr_ops = 0;
+	int				shape_changes = 0;
+
+	agno = XFS_FSB_TO_AGNO(mp, startblock);
+	ASSERT(agno != NULLAGNUMBER);
+	bno = XFS_FSB_TO_AGBNO(mp, startblock);
+
+	trace_xfs_refcount_deferred(mp, XFS_FSB_TO_AGNO(mp, startblock),
+			type, XFS_FSB_TO_AGBNO(mp, startblock),
+			blockcount);
+
+	if (XFS_TEST_ERROR(false, mp,
+			XFS_ERRTAG_REFCOUNT_FINISH_ONE,
+			XFS_RANDOM_REFCOUNT_FINISH_ONE))
+		return -EIO;
+
+	/*
+	 * If we haven't gotten a cursor or the cursor AG doesn't match
+	 * the startblock, get one now.
+	 */
+	rcur = *pcur;
+	if (rcur != NULL && rcur->bc_private.a.agno != agno) {
+		nr_ops = rcur->bc_private.a.priv.refc.nr_ops;
+		shape_changes = rcur->bc_private.a.priv.refc.shape_changes;
+		xfs_refcount_finish_one_cleanup(tp, rcur, 0);
+		rcur = NULL;
+		*pcur = NULL;
+	}
+	if (rcur == NULL) {
+		error = xfs_alloc_read_agf(tp->t_mountp, tp, agno,
+				XFS_ALLOC_FLAG_FREEING, &agbp);
+		if (error)
+			return error;
+		if (!agbp)
+			return -EFSCORRUPTED;
+
+		rcur = xfs_refcountbt_init_cursor(mp, tp, agbp, agno, dfops);
+		if (!rcur) {
+			error = -ENOMEM;
+			goto out_cur;
+		}
+		rcur->bc_private.a.priv.refc.nr_ops = nr_ops;
+		rcur->bc_private.a.priv.refc.shape_changes = shape_changes;
+	}
+	*pcur = rcur;
+
+	switch (type) {
+	case XFS_REFCOUNT_INCREASE:
+		error = xfs_refcount_adjust(rcur, bno, blockcount, adjusted,
+			XFS_REFCOUNT_ADJUST_INCREASE, dfops, NULL);
+		break;
+	case XFS_REFCOUNT_DECREASE:
+		error = xfs_refcount_adjust(rcur, bno, blockcount, adjusted,
+			XFS_REFCOUNT_ADJUST_DECREASE, dfops, NULL);
+		break;
+	default:
+		ASSERT(0);
+		error = -EFSCORRUPTED;
+	}
+	if (!error && *adjusted != blockcount)
+		trace_xfs_refcount_finish_one_leftover(mp, agno, type,
+				bno, blockcount, *adjusted);
+	return error;
+
+out_cur:
+	xfs_trans_brelse(tp, agbp);
+
+	return error;
+}
+
+/*
+ * Record a refcount intent for later processing.
+ */
+static int
+__xfs_refcount_add(
+	struct xfs_mount		*mp,
+	struct xfs_defer_ops		*dfops,
+	enum xfs_refcount_intent_type	type,
+	xfs_fsblock_t			startblock,
+	xfs_extlen_t			blockcount)
+{
+	struct xfs_refcount_intent	*ri;
+
+	trace_xfs_refcount_defer(mp, XFS_FSB_TO_AGNO(mp, startblock),
+			type, XFS_FSB_TO_AGBNO(mp, startblock),
+			blockcount);
+
+	ri = kmem_alloc(sizeof(struct xfs_refcount_intent),
+			KM_SLEEP | KM_NOFS);
+	INIT_LIST_HEAD(&ri->ri_list);
+	ri->ri_type = type;
+	ri->ri_startblock = startblock;
+	ri->ri_blockcount = blockcount;
+
+	xfs_defer_add(dfops, XFS_DEFER_OPS_TYPE_REFCOUNT, &ri->ri_list);
+	return 0;
+}
+
+/*
+ * Increase the reference count of the blocks backing a file's extent.
+ */
+int
+xfs_refcount_increase_extent(
+	struct xfs_mount		*mp,
+	struct xfs_defer_ops		*dfops,
+	struct xfs_bmbt_irec		*PREV)
+{
+	if (!xfs_sb_version_hasreflink(&mp->m_sb))
+		return 0;
+
+	return __xfs_refcount_add(mp, dfops, XFS_REFCOUNT_INCREASE,
+			PREV->br_startblock, PREV->br_blockcount);
+}
+
+/*
+ * Decrease the reference count of the blocks backing a file's extent.
+ */
+int
+xfs_refcount_decrease_extent(
+	struct xfs_mount		*mp,
+	struct xfs_defer_ops		*dfops,
+	struct xfs_bmbt_irec		*PREV)
+{
+	if (!xfs_sb_version_hasreflink(&mp->m_sb))
+		return 0;
+
+	return __xfs_refcount_add(mp, dfops, XFS_REFCOUNT_DECREASE,
+			PREV->br_startblock, PREV->br_blockcount);
+}
diff --git a/fs/xfs/libxfs/xfs_refcount.h b/fs/xfs/libxfs/xfs_refcount.h
index d362f0b..7e750a5 100644
--- a/fs/xfs/libxfs/xfs_refcount.h
+++ b/fs/xfs/libxfs/xfs_refcount.h
@@ -42,4 +42,16 @@ struct xfs_refcount_intent {
 	xfs_extlen_t				ri_blockcount;
 };
 
+extern int xfs_refcount_increase_extent(struct xfs_mount *mp,
+		struct xfs_defer_ops *dfops, struct xfs_bmbt_irec *irec);
+extern int xfs_refcount_decrease_extent(struct xfs_mount *mp,
+		struct xfs_defer_ops *dfops, struct xfs_bmbt_irec *irec);
+
+extern void xfs_refcount_finish_one_cleanup(struct xfs_trans *tp,
+		struct xfs_btree_cur *rcur, int error);
+extern int xfs_refcount_finish_one(struct xfs_trans *tp,
+		struct xfs_defer_ops *dfops, enum xfs_refcount_intent_type type,
+		xfs_fsblock_t startblock, xfs_extlen_t blockcount,
+		xfs_extlen_t *adjusted, struct xfs_btree_cur **pcur);
+
 #endif	/* __XFS_REFCOUNT_H__ */
diff --git a/fs/xfs/xfs_error.h b/fs/xfs/xfs_error.h
index d9675c64..641e090 100644
--- a/fs/xfs/xfs_error.h
+++ b/fs/xfs/xfs_error.h
@@ -93,7 +93,8 @@ extern void xfs_verifier_error(struct xfs_buf *bp);
 #define XFS_ERRTAG_FREE_EXTENT				22
 #define XFS_ERRTAG_RMAP_FINISH_ONE			23
 #define XFS_ERRTAG_REFCOUNT_CONTINUE_UPDATE		24
-#define XFS_ERRTAG_MAX					25
+#define XFS_ERRTAG_REFCOUNT_FINISH_ONE			25
+#define XFS_ERRTAG_MAX					26
 
 /*
  * Random factors for above tags, 1 means always, 2 means 1/2 time, etc.
@@ -123,6 +124,7 @@ extern void xfs_verifier_error(struct xfs_buf *bp);
 #define XFS_RANDOM_FREE_EXTENT				1
 #define XFS_RANDOM_RMAP_FINISH_ONE			1
 #define XFS_RANDOM_REFCOUNT_CONTINUE_UPDATE		1
+#define XFS_RANDOM_REFCOUNT_FINISH_ONE			1
 
 #ifdef DEBUG
 extern int xfs_error_test_active;
diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
index 599a8d2..e44007a 100644
--- a/fs/xfs/xfs_refcount_item.c
+++ b/fs/xfs/xfs_refcount_item.c
@@ -396,9 +396,19 @@ xfs_cui_recover(
 {
 	int				i;
 	int				error = 0;
+	unsigned int			refc_type;
 	struct xfs_phys_extent		*refc;
 	xfs_fsblock_t			startblock_fsb;
 	bool				op_ok;
+	struct xfs_cud_log_item		*cudp;
+	struct xfs_trans		*tp;
+	struct xfs_btree_cur		*rcur = NULL;
+	enum xfs_refcount_intent_type	type;
+	xfs_fsblock_t			firstfsb;
+	xfs_extlen_t			adjusted;
+	struct xfs_bmbt_irec		irec;
+	struct xfs_defer_ops		dfops;
+	bool				requeue_only = false;
 
 	ASSERT(!test_bit(XFS_CUI_RECOVERED, &cuip->cui_flags));
 
@@ -437,7 +447,68 @@ xfs_cui_recover(
 		}
 	}
 
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
+	if (error)
+		return error;
+	cudp = xfs_trans_get_cud(tp, cuip);
+
+	xfs_defer_init(&dfops, &firstfsb);
+	for (i = 0; i < cuip->cui_format.cui_nextents; i++) {
+		refc = &cuip->cui_format.cui_extents[i];
+		refc_type = refc->pe_flags & XFS_REFCOUNT_EXTENT_TYPE_MASK;
+		switch (refc_type) {
+		case XFS_REFCOUNT_INCREASE:
+		case XFS_REFCOUNT_DECREASE:
+		case XFS_REFCOUNT_ALLOC_COW:
+		case XFS_REFCOUNT_FREE_COW:
+			type = refc_type;
+			break;
+		default:
+			error = -EFSCORRUPTED;
+			goto abort_error;
+		}
+		if (requeue_only)
+			adjusted = 0;
+		else
+			error = xfs_trans_log_finish_refcount_update(tp, cudp,
+				&dfops, type, refc->pe_startblock, refc->pe_len,
+				&adjusted, &rcur);
+		if (error)
+			goto abort_error;
+
+		/* Requeue what we didn't finish. */
+		if (adjusted < refc->pe_len) {
+			irec.br_startblock = refc->pe_startblock + adjusted;
+			irec.br_blockcount = refc->pe_len - adjusted;
+			switch (type) {
+			case XFS_REFCOUNT_INCREASE:
+				error = xfs_refcount_increase_extent(
+						tp->t_mountp, &dfops, &irec);
+				break;
+			case XFS_REFCOUNT_DECREASE:
+				error = xfs_refcount_decrease_extent(
+						tp->t_mountp, &dfops, &irec);
+				break;
+			default:
+				ASSERT(0);
+			}
+			if (error)
+				goto abort_error;
+			requeue_only = true;
+		}
+	}
+
+	xfs_refcount_finish_one_cleanup(tp, rcur, error);
+	error = xfs_defer_finish(&tp, &dfops, NULL);
+	if (error)
+		goto abort_error;
 	set_bit(XFS_CUI_RECOVERED, &cuip->cui_flags);
-	xfs_cui_release(cuip);
+	error = xfs_trans_commit(tp);
+	return error;
+
+abort_error:
+	xfs_refcount_finish_one_cleanup(tp, rcur, error);
+	xfs_defer_cancel(&dfops);
+	xfs_trans_cancel(tp);
 	return error;
 }
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index abe69c6..6234622 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1903,6 +1903,7 @@ init_xfs_fs(void)
 
 	xfs_extent_free_init_defer_op();
 	xfs_rmap_update_init_defer_op();
+	xfs_refcount_update_init_defer_op();
 
 	xfs_dir_startup();
 
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index fed1906..195a168 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -2931,6 +2931,9 @@ DEFINE_AG_ERROR_EVENT(xfs_refcount_find_right_extent_error);
 DEFINE_AG_EXTENT_EVENT(xfs_refcount_find_shared);
 DEFINE_AG_EXTENT_EVENT(xfs_refcount_find_shared_result);
 DEFINE_AG_ERROR_EVENT(xfs_refcount_find_shared_error);
+#define DEFINE_REFCOUNT_DEFERRED_EVENT DEFINE_PHYS_EXTENT_DEFERRED_EVENT
+DEFINE_REFCOUNT_DEFERRED_EVENT(xfs_refcount_defer);
+DEFINE_REFCOUNT_DEFERRED_EVENT(xfs_refcount_deferred);
 
 TRACE_EVENT(xfs_refcount_finish_one_leftover,
 	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index fe69e20..a7a87d2 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -37,6 +37,8 @@ struct xfs_rud_log_item;
 struct xfs_rui_log_item;
 struct xfs_btree_cur;
 struct xfs_cui_log_item;
+struct xfs_cud_log_item;
+struct xfs_defer_ops;
 
 typedef struct xfs_log_item {
 	struct list_head		li_ail;		/* AIL pointers */
@@ -252,11 +254,13 @@ int xfs_trans_log_finish_rmap_update(struct xfs_trans *tp,
 /* refcount updates */
 enum xfs_refcount_intent_type;
 
+void xfs_refcount_update_init_defer_op(void);
 struct xfs_cud_log_item *xfs_trans_get_cud(struct xfs_trans *tp,
 		struct xfs_cui_log_item *cuip);
 int xfs_trans_log_finish_refcount_update(struct xfs_trans *tp,
-		struct xfs_cud_log_item *cudp,
+		struct xfs_cud_log_item *cudp, struct xfs_defer_ops *dfops,
 		enum xfs_refcount_intent_type type, xfs_fsblock_t startblock,
-		xfs_extlen_t blockcount, struct xfs_btree_cur **pcur);
+		xfs_extlen_t blockcount, xfs_extlen_t *adjusted,
+		struct xfs_btree_cur **pcur);
 
 #endif	/* __XFS_TRANS_H__ */
diff --git a/fs/xfs/xfs_trans_refcount.c b/fs/xfs/xfs_trans_refcount.c
index b18d548..e3ac994 100644
--- a/fs/xfs/xfs_trans_refcount.c
+++ b/fs/xfs/xfs_trans_refcount.c
@@ -56,15 +56,17 @@ int
 xfs_trans_log_finish_refcount_update(
 	struct xfs_trans		*tp,
 	struct xfs_cud_log_item		*cudp,
+	struct xfs_defer_ops		*dop,
 	enum xfs_refcount_intent_type	type,
 	xfs_fsblock_t			startblock,
 	xfs_extlen_t			blockcount,
+	xfs_extlen_t			*adjusted,
 	struct xfs_btree_cur		**pcur)
 {
 	int				error;
 
-	/* XXX: leave this empty for now */
-	error = -EFSCORRUPTED;
+	error = xfs_refcount_finish_one(tp, dop, type, startblock,
+			blockcount, adjusted, pcur);
 
 	/*
 	 * Mark the transaction dirty, even on error. This ensures the
@@ -78,3 +80,183 @@ xfs_trans_log_finish_refcount_update(
 
 	return error;
 }
+
+/* Sort refcount intents by AG. */
+static int
+xfs_refcount_update_diff_items(
+	void				*priv,
+	struct list_head		*a,
+	struct list_head		*b)
+{
+	struct xfs_mount		*mp = priv;
+	struct xfs_refcount_intent	*ra;
+	struct xfs_refcount_intent	*rb;
+
+	ra = container_of(a, struct xfs_refcount_intent, ri_list);
+	rb = container_of(b, struct xfs_refcount_intent, ri_list);
+	return  XFS_FSB_TO_AGNO(mp, ra->ri_startblock) -
+		XFS_FSB_TO_AGNO(mp, rb->ri_startblock);
+}
+
+/* Get a CUI. */
+STATIC void *
+xfs_refcount_update_create_intent(
+	struct xfs_trans		*tp,
+	unsigned int			count)
+{
+	struct xfs_cui_log_item		*cuip;
+
+	ASSERT(tp != NULL);
+	ASSERT(count > 0);
+
+	cuip = xfs_cui_init(tp->t_mountp, count);
+	ASSERT(cuip != NULL);
+
+	/*
+	 * Get a log_item_desc to point at the new item.
+	 */
+	xfs_trans_add_item(tp, &cuip->cui_item);
+	return cuip;
+}
+
+/* Set the phys extent flags for this refcount update. */
+static void
+xfs_trans_set_refcount_flags(
+	struct xfs_phys_extent		*refc,
+	enum xfs_refcount_intent_type	type)
+{
+	refc->pe_flags = 0;
+	switch (type) {
+	case XFS_REFCOUNT_INCREASE:
+	case XFS_REFCOUNT_DECREASE:
+	case XFS_REFCOUNT_ALLOC_COW:
+	case XFS_REFCOUNT_FREE_COW:
+		refc->pe_flags |= type;
+		break;
+	default:
+		ASSERT(0);
+	}
+}
+
+/* Log refcount updates in the intent item. */
+STATIC void
+xfs_refcount_update_log_item(
+	struct xfs_trans		*tp,
+	void				*intent,
+	struct list_head		*item)
+{
+	struct xfs_cui_log_item		*cuip = intent;
+	struct xfs_refcount_intent	*refc;
+	uint				next_extent;
+	struct xfs_phys_extent		*ext;
+
+	refc = container_of(item, struct xfs_refcount_intent, ri_list);
+
+	tp->t_flags |= XFS_TRANS_DIRTY;
+	cuip->cui_item.li_desc->lid_flags |= XFS_LID_DIRTY;
+
+	/*
+	 * atomic_inc_return gives us the value after the increment;
+	 * we want to use it as an array index so we need to subtract 1 from
+	 * it.
+	 */
+	next_extent = atomic_inc_return(&cuip->cui_next_extent) - 1;
+	ASSERT(next_extent < cuip->cui_format.cui_nextents);
+	ext = &cuip->cui_format.cui_extents[next_extent];
+	ext->pe_startblock = refc->ri_startblock;
+	ext->pe_len = refc->ri_blockcount;
+	xfs_trans_set_refcount_flags(ext, refc->ri_type);
+}
+
+/* Get a CUD so we can process all the deferred refcount updates. */
+STATIC void *
+xfs_refcount_update_create_done(
+	struct xfs_trans		*tp,
+	void				*intent,
+	unsigned int			count)
+{
+	return xfs_trans_get_cud(tp, intent);
+}
+
+/* Process a deferred refcount update. */
+STATIC int
+xfs_refcount_update_finish_item(
+	struct xfs_trans		*tp,
+	struct xfs_defer_ops		*dop,
+	struct list_head		*item,
+	void				*done_item,
+	void				**state)
+{
+	struct xfs_refcount_intent	*refc;
+	xfs_extlen_t			adjusted;
+	int				error;
+
+	refc = container_of(item, struct xfs_refcount_intent, ri_list);
+	error = xfs_trans_log_finish_refcount_update(tp, done_item, dop,
+			refc->ri_type,
+			refc->ri_startblock,
+			refc->ri_blockcount,
+			&adjusted,
+			(struct xfs_btree_cur **)state);
+	/* Did we run out of reservation?  Requeue what we didn't finish. */
+	if (!error && adjusted < refc->ri_blockcount) {
+		ASSERT(refc->ri_type == XFS_REFCOUNT_INCREASE ||
+		       refc->ri_type == XFS_REFCOUNT_DECREASE);
+		refc->ri_startblock += adjusted;
+		refc->ri_blockcount -= adjusted;
+		return -EAGAIN;
+	}
+	kmem_free(refc);
+	return error;
+}
+
+/* Clean up after processing deferred refcounts. */
+STATIC void
+xfs_refcount_update_finish_cleanup(
+	struct xfs_trans	*tp,
+	void			*state,
+	int			error)
+{
+	struct xfs_btree_cur	*rcur = state;
+
+	xfs_refcount_finish_one_cleanup(tp, rcur, error);
+}
+
+/* Abort all pending CUIs. */
+STATIC void
+xfs_refcount_update_abort_intent(
+	void				*intent)
+{
+	xfs_cui_release(intent);
+}
+
+/* Cancel a deferred refcount update. */
+STATIC void
+xfs_refcount_update_cancel_item(
+	struct list_head		*item)
+{
+	struct xfs_refcount_intent	*refc;
+
+	refc = container_of(item, struct xfs_refcount_intent, ri_list);
+	kmem_free(refc);
+}
+
+static const struct xfs_defer_op_type xfs_refcount_update_defer_type = {
+	.type		= XFS_DEFER_OPS_TYPE_REFCOUNT,
+	.max_items	= XFS_CUI_MAX_FAST_EXTENTS,
+	.diff_items	= xfs_refcount_update_diff_items,
+	.create_intent	= xfs_refcount_update_create_intent,
+	.abort_intent	= xfs_refcount_update_abort_intent,
+	.log_item	= xfs_refcount_update_log_item,
+	.create_done	= xfs_refcount_update_create_done,
+	.finish_item	= xfs_refcount_update_finish_item,
+	.finish_cleanup = xfs_refcount_update_finish_cleanup,
+	.cancel_item	= xfs_refcount_update_cancel_item,
+};
+
+/* Register the deferred op type. */
+void
+xfs_refcount_update_init_defer_op(void)
+{
+	xfs_defer_init_op_type(&xfs_refcount_update_defer_type);
+}
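
The -EAGAIN requeue path in xfs_refcount_update_finish_item() can be modeled outside the kernel: when one transaction's reservation can't cover the whole intent, the intent is trimmed by the amount already adjusted and requeued for the next transaction roll. A toy model (finish_one()'s per-transaction cap stands in for xfs_refcount_still_have_space(); everything here is illustrative, not kernel API):

```c
struct intent { unsigned long startblock, blockcount; };

/* Pretend each "transaction" can only adjust up to max_per_tx blocks. */
static unsigned long finish_one(const struct intent *ri,
				unsigned long max_per_tx)
{
	return ri->blockcount < max_per_tx ? ri->blockcount : max_per_tx;
}

/*
 * Finish an intent across as many "transactions" as needed; 'rolls'
 * counts how many times we had to requeue (the -EAGAIN path).
 */
static int finish_intent(struct intent *ri, unsigned long max_per_tx,
			 int *rolls)
{
	for (;;) {
		unsigned long adjusted = finish_one(ri, max_per_tx);

		if (adjusted < ri->blockcount) {
			/* mirror the requeue: trim and try again */
			ri->startblock += adjusted;
			ri->blockcount -= adjusted;
			(*rolls)++;	/* kernel: return -EAGAIN, roll tx */
			continue;
		}
		ri->blockcount = 0;
		return 0;
	}
}
```
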



* [PATCH 15/63] xfs: adjust refcount when unmapping file blocks
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (13 preceding siblings ...)
  2016-09-30  3:07 ` [PATCH 14/63] xfs: connect refcount adjust functions to upper layers Darrick J. Wong
@ 2016-09-30  3:07 ` Darrick J. Wong
  2016-09-30  7:14   ` Christoph Hellwig
  2016-09-30  3:07 ` [PATCH 16/63] xfs: add refcount btree block detection to log recovery Darrick J. Wong
                   ` (48 subsequent siblings)
  63 siblings, 1 reply; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:07 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel

When we're unmapping blocks from a reflinked file, decrease the
refcount of the affected blocks and free the extents that are no
longer in use.
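
The rule this patch implements can be sketched in miniature (a toy model with hypothetical names, not the real xfs_refcount code, which defers the decrement through the refcount btree): unmapping a block from a reflinked file drops its reference count, and only a count that reaches zero returns the block to the free space.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of the unmap path: each physical block carries a
 * reference count; unmapping drops the count, and the block is
 * freed only when the count hits zero.  Illustrative only.
 */
struct toy_block {
	unsigned int	refcount;
};

/* Returns true if this unmap made the block free. */
static bool
toy_unmap_block(struct toy_block *b)
{
	assert(b->refcount > 0);
	return --b->refcount == 0;
}
```

With refcount == 2, the first unmap leaves the block shared and the second one frees it.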

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
v2: Use deferred ops system to avoid deadlocks and running out of
transaction reservation.
---
 fs/xfs/libxfs/xfs_bmap.c |   14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 9d7f61d..907d7b8d 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -48,6 +48,7 @@
 #include "xfs_filestream.h"
 #include "xfs_rmap.h"
 #include "xfs_ag_resv.h"
+#include "xfs_refcount.h"
 
 
 kmem_zone_t		*xfs_bmap_free_item_zone;
@@ -4988,9 +4989,16 @@ xfs_bmap_del_extent(
 	/*
 	 * If we need to, add to list of extents to delete.
 	 */
-	if (do_fx)
-		xfs_bmap_add_free(mp, dfops, del->br_startblock,
-				del->br_blockcount, NULL);
+	if (do_fx) {
+		if (xfs_is_reflink_inode(ip) && whichfork == XFS_DATA_FORK) {
+			error = xfs_refcount_decrease_extent(mp, dfops, del);
+			if (error)
+				goto done;
+		} else
+			xfs_bmap_add_free(mp, dfops, del->br_startblock,
+					del->br_blockcount, NULL);
+	}
+
 	/*
 	 * Adjust inode # blocks in the file.
 	 */



* [PATCH 16/63] xfs: add refcount btree block detection to log recovery
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (14 preceding siblings ...)
  2016-09-30  3:07 ` [PATCH 15/63] xfs: adjust refcount when unmapping file blocks Darrick J. Wong
@ 2016-09-30  3:07 ` Darrick J. Wong
  2016-09-30  7:15   ` Christoph Hellwig
  2016-09-30  3:07 ` [PATCH 17/63] xfs: refcount btree requires more reserved space Darrick J. Wong
                   ` (47 subsequent siblings)
  63 siblings, 1 reply; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:07 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel

Identify refcountbt blocks in the log correctly so that we can
validate them during log recovery.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_log_recover.c |    4 ++++
 1 file changed, 4 insertions(+)


diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 7def672..622881a 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -2245,6 +2245,7 @@ xlog_recover_get_buf_lsn(
 	case XFS_ABTB_MAGIC:
 	case XFS_ABTC_MAGIC:
 	case XFS_RMAP_CRC_MAGIC:
+	case XFS_REFC_CRC_MAGIC:
 	case XFS_IBT_CRC_MAGIC:
 	case XFS_IBT_MAGIC: {
 		struct xfs_btree_block *btb = blk;
@@ -2418,6 +2419,9 @@ xlog_recover_validate_buf_type(
 		case XFS_RMAP_CRC_MAGIC:
 			bp->b_ops = &xfs_rmapbt_buf_ops;
 			break;
+		case XFS_REFC_CRC_MAGIC:
+			bp->b_ops = &xfs_refcountbt_buf_ops;
+			break;
 		default:
 			warnmsg = "Bad btree block magic!";
 			break;



* [PATCH 17/63] xfs: refcount btree requires more reserved space
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (15 preceding siblings ...)
  2016-09-30  3:07 ` [PATCH 16/63] xfs: add refcount btree block detection to log recovery Darrick J. Wong
@ 2016-09-30  3:07 ` Darrick J. Wong
  2016-09-30  7:15   ` Christoph Hellwig
  2016-09-30 16:46   ` Brian Foster
  2016-09-30  3:07   ` Darrick J. Wong
                   ` (46 subsequent siblings)
  63 siblings, 2 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:07 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel

The reference count btree is allocated from the free space, which
means that we have to ensure that an AG can't run out of free space
while performing a refcount operation.  In the pathological case each
AG block has its own refcntbt record, so we have to keep that much
space available.
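
That worst case can be sketched numerically: with minimally-full btree blocks, a tree over len records needs a geometric series of blocks, level by level.  A rough model of the xfs_btree_calc_size() idea follows; the function name and the records-per-block constants are made up for illustration, the real values come from the filesystem geometry.

```c
#include <assert.h>

/*
 * Worst-case btree block count for 'nrecs' records when leaf blocks
 * hold at most 'leaf_mnr' records and node blocks at most 'node_mnr'
 * keys (minimally-full blocks).  Sketch only, not the actual
 * implementation.
 */
static unsigned long long
btree_worst_case_blocks(unsigned long long nrecs,
			unsigned int leaf_mnr,
			unsigned int node_mnr)
{
	/* Leaf level: one block per leaf_mnr records. */
	unsigned long long level = (nrecs + leaf_mnr - 1) / leaf_mnr;
	unsigned long long total = level;

	/* Interior levels shrink by node_mnr until a single root. */
	while (level > 1) {
		level = (level + node_mnr - 1) / node_mnr;
		total += level;
	}
	return total;
}
```

For example, 1000 records at 10 entries per block need 100 + 10 + 1 = 111 blocks in this model, which is the sort of figure the per-AG reservation has to cover.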

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
v2: Calculate the maximum possible size of the rmap and refcount
btrees based on minimally-full btree blocks.  This increases the
per-AG block reservations to handle the worst case btree size.
---
 fs/xfs/libxfs/xfs_alloc.c          |    3 +++
 fs/xfs/libxfs/xfs_refcount_btree.c |   23 +++++++++++++++++++++++
 fs/xfs/libxfs/xfs_refcount_btree.h |    4 ++++
 3 files changed, 30 insertions(+)


diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index be7e3fc..9d9a46e 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -38,6 +38,7 @@
 #include "xfs_buf_item.h"
 #include "xfs_log.h"
 #include "xfs_ag_resv.h"
+#include "xfs_refcount_btree.h"
 
 struct workqueue_struct *xfs_alloc_wq;
 
@@ -128,6 +129,8 @@ xfs_alloc_ag_max_usable(
 		blocks++;		/* finobt root block */
 	if (xfs_sb_version_hasrmapbt(&mp->m_sb))
 		blocks++; 		/* rmap root block */
+	if (xfs_sb_version_hasreflink(&mp->m_sb))
+		blocks++;		/* refcount root block */
 
 	return mp->m_sb.sb_agblocks - blocks;
 }
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
index 81d58b0..6b5e82b9 100644
--- a/fs/xfs/libxfs/xfs_refcount_btree.c
+++ b/fs/xfs/libxfs/xfs_refcount_btree.c
@@ -387,3 +387,26 @@ xfs_refcountbt_compute_maxlevels(
 	mp->m_refc_maxlevels = xfs_btree_compute_maxlevels(mp,
 			mp->m_refc_mnr, mp->m_sb.sb_agblocks);
 }
+
+/* Calculate the refcount btree size for some records. */
+xfs_extlen_t
+xfs_refcountbt_calc_size(
+	struct xfs_mount	*mp,
+	unsigned long long	len)
+{
+	return xfs_btree_calc_size(mp, mp->m_refc_mnr, len);
+}
+
+/*
+ * Calculate the maximum refcount btree size.
+ */
+xfs_extlen_t
+xfs_refcountbt_max_size(
+	struct xfs_mount	*mp)
+{
+	/* Bail out if we're uninitialized, which can happen in mkfs. */
+	if (mp->m_refc_mxr[0] == 0)
+		return 0;
+
+	return xfs_refcountbt_calc_size(mp, mp->m_sb.sb_agblocks);
+}
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.h b/fs/xfs/libxfs/xfs_refcount_btree.h
index 9e9ad7c..780b02f 100644
--- a/fs/xfs/libxfs/xfs_refcount_btree.h
+++ b/fs/xfs/libxfs/xfs_refcount_btree.h
@@ -64,4 +64,8 @@ extern int xfs_refcountbt_maxrecs(struct xfs_mount *mp, int blocklen,
 		bool leaf);
 extern void xfs_refcountbt_compute_maxlevels(struct xfs_mount *mp);
 
+extern xfs_extlen_t xfs_refcountbt_calc_size(struct xfs_mount *mp,
+		unsigned long long len);
+extern xfs_extlen_t xfs_refcountbt_max_size(struct xfs_mount *mp);
+
 #endif	/* __XFS_REFCOUNT_BTREE_H__ */



* [PATCH 18/63] xfs: introduce reflink utility functions
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
@ 2016-09-30  3:07   ` Darrick J. Wong
  2016-09-30  3:05 ` [PATCH 02/63] vfs: add a FALLOC_FL_UNSHARE mode to fallocate to unshare a range of blocks Darrick J. Wong
                     ` (62 subsequent siblings)
  63 siblings, 0 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:07 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel, Darrick J. Wong

These functions will be used by the other reflink functions to find
the maximum length of a range of shared blocks.
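
The lookup these helpers perform can be modelled with an ordinary sorted array standing in for the refcount btree (the toy_refc/toy_find_shared names below are hypothetical, not the on-disk records): scan for the first record with refcount > 1 that overlaps the query range, and clamp it to that range.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a refcount btree record. */
struct toy_refc {
	unsigned int	start;		/* first block */
	unsigned int	len;		/* block count */
	unsigned int	refcount;	/* references */
};

/*
 * Find the first run of shared blocks (refcount > 1) inside
 * [agbno, agbno + aglen).  Returns the clamped start/length via
 * fbno/flen; *flen == 0 means nothing in the range is shared.
 * Records must be sorted and non-overlapping.
 */
static void
toy_find_shared(const struct toy_refc *recs, size_t nrecs,
		unsigned int agbno, unsigned int aglen,
		unsigned int *fbno, unsigned int *flen)
{
	*fbno = agbno + aglen;
	*flen = 0;

	for (size_t i = 0; i < nrecs; i++) {
		const struct toy_refc *r = &recs[i];
		unsigned int end = r->start + r->len;

		if (r->start >= agbno + aglen)
			break;			/* past the query range */
		if (end <= agbno || r->refcount < 2)
			continue;		/* before range, or unshared */

		*fbno = r->start > agbno ? r->start : agbno;
		*flen = (end < agbno + aglen ? end : agbno + aglen) - *fbno;
		return;
	}
}
```

The real xfs_refcount_find_shared() does the same walk with btree cursors, and can additionally extend the run across adjacent shared records.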

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_refcount.c |  100 ++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_refcount.h |    4 ++
 2 files changed, 104 insertions(+)


diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
index 49d8c6f..0748c9c 100644
--- a/fs/xfs/libxfs/xfs_refcount.c
+++ b/fs/xfs/libxfs/xfs_refcount.c
@@ -1142,3 +1142,103 @@ xfs_refcount_decrease_extent(
 	return __xfs_refcount_add(mp, dfops, XFS_REFCOUNT_DECREASE,
 			PREV->br_startblock, PREV->br_blockcount);
 }
+
+/*
+ * Given an AG extent, find the lowest-numbered run of shared blocks within
+ * that range and return the range in fbno/flen.  If find_maximal is set,
+ * return the longest extent of shared blocks; if not, just return the first
+ * extent we find.  If no shared blocks are found, flen will be set to zero.
+ */
+int
+xfs_refcount_find_shared(
+	struct xfs_btree_cur		*cur,
+	xfs_agblock_t			agbno,
+	xfs_extlen_t			aglen,
+	xfs_agblock_t			*fbno,
+	xfs_extlen_t			*flen,
+	bool				find_maximal)
+{
+	struct xfs_refcount_irec	tmp;
+	int				i;
+	int				have;
+	int				error;
+
+	trace_xfs_refcount_find_shared(cur->bc_mp, cur->bc_private.a.agno,
+			agbno, aglen);
+
+	/* By default, skip the whole range */
+	*fbno = agbno + aglen;
+	*flen = 0;
+
+	/* Try to find a refcount extent that crosses the start */
+	error = xfs_refcount_lookup_le(cur, agbno, &have);
+	if (error)
+		goto out_error;
+	if (!have) {
+		/* No left extent, look at the next one */
+		error = xfs_btree_increment(cur, 0, &have);
+		if (error)
+			goto out_error;
+		if (!have)
+			goto done;
+	}
+	error = xfs_refcount_get_rec(cur, &tmp, &i);
+	if (error)
+		goto out_error;
+	XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, i == 1, out_error);
+
+	/* If the extent ends before the start, look at the next one */
+	if (tmp.rc_startblock + tmp.rc_blockcount <= agbno) {
+		error = xfs_btree_increment(cur, 0, &have);
+		if (error)
+			goto out_error;
+		if (!have)
+			goto done;
+		error = xfs_refcount_get_rec(cur, &tmp, &i);
+		if (error)
+			goto out_error;
+		XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, i == 1, out_error);
+	}
+
+	/* If the extent starts after the range we want, bail out */
+	if (tmp.rc_startblock >= agbno + aglen)
+		goto done;
+
+	/* We found the start of a shared extent! */
+	if (tmp.rc_startblock < agbno) {
+		tmp.rc_blockcount -= (agbno - tmp.rc_startblock);
+		tmp.rc_startblock = agbno;
+	}
+
+	*fbno = tmp.rc_startblock;
+	*flen = min(tmp.rc_blockcount, agbno + aglen - *fbno);
+	if (!find_maximal)
+		goto done;
+
+	/* Otherwise, find the end of this shared extent */
+	while (*fbno + *flen < agbno + aglen) {
+		error = xfs_btree_increment(cur, 0, &have);
+		if (error)
+			goto out_error;
+		if (!have)
+			break;
+		error = xfs_refcount_get_rec(cur, &tmp, &i);
+		if (error)
+			goto out_error;
+		XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, i == 1, out_error);
+		if (tmp.rc_startblock >= agbno + aglen ||
+		    tmp.rc_startblock != *fbno + *flen)
+			break;
+		*flen = min(*flen + tmp.rc_blockcount, agbno + aglen - *fbno);
+	}
+
+done:
+	trace_xfs_refcount_find_shared_result(cur->bc_mp,
+			cur->bc_private.a.agno, *fbno, *flen);
+
+out_error:
+	if (error)
+		trace_xfs_refcount_find_shared_error(cur->bc_mp,
+				cur->bc_private.a.agno, error, _RET_IP_);
+	return error;
+}
diff --git a/fs/xfs/libxfs/xfs_refcount.h b/fs/xfs/libxfs/xfs_refcount.h
index 7e750a5..48c576c 100644
--- a/fs/xfs/libxfs/xfs_refcount.h
+++ b/fs/xfs/libxfs/xfs_refcount.h
@@ -54,4 +54,8 @@ extern int xfs_refcount_finish_one(struct xfs_trans *tp,
 		xfs_fsblock_t startblock, xfs_extlen_t blockcount,
 		xfs_extlen_t *adjusted, struct xfs_btree_cur **pcur);
 
+extern int xfs_refcount_find_shared(struct xfs_btree_cur *cur,
+		xfs_agblock_t agbno, xfs_extlen_t aglen, xfs_agblock_t *fbno,
+		xfs_extlen_t *flen, bool find_maximal);
+
 #endif	/* __XFS_REFCOUNT_H__ */



* [PATCH 19/63] xfs: create bmbt update intent log items
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (17 preceding siblings ...)
  2016-09-30  3:07   ` Darrick J. Wong
@ 2016-09-30  3:07 ` Darrick J. Wong
  2016-09-30  7:24   ` Christoph Hellwig
  2016-09-30  3:07 ` [PATCH 20/63] xfs: log bmap intent items Darrick J. Wong
                   ` (44 subsequent siblings)
  63 siblings, 1 reply; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:07 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel

Create bmbt update intent/done log items to record redo information in
the log.  Because we roll transactions multiple times for reflink
operations, we also have to track the status of the metadata updates
that will be recorded in the post-roll transactions in case we crash
before committing the final transaction.  This mechanism enables log
recovery to finish what was already started.
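
The recovery rule is easy to state as a sketch (the structures and names here are hypothetical, not the real log item format): an intent whose id never appears in a done record was interrupted mid-series, so recovery must replay it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for intent/done log records. */
struct toy_intent { unsigned long long id; };
struct toy_done   { unsigned long long intent_id; };

/*
 * During recovery, an intent is replayed unless some done record
 * references it -- a done means the final transaction of the rolled
 * series committed before the crash.
 */
static bool
toy_intent_needs_replay(const struct toy_intent *intent,
			const struct toy_done *dones, size_t ndones)
{
	for (size_t i = 0; i < ndones; i++)
		if (dones[i].intent_id == intent->id)
			return false;
	return true;
}
```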

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
v2: Only allow one item per BUI to simplify inode locking during
recovery and to avoid exceeding the transaction reservation.
---
 fs/xfs/Makefile                |    1 
 fs/xfs/libxfs/xfs_log_format.h |   58 ++++++
 fs/xfs/xfs_bmap_item.c         |  374 ++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_bmap_item.h         |   97 ++++++++++
 fs/xfs/xfs_super.c             |   18 ++
 5 files changed, 546 insertions(+), 2 deletions(-)
 create mode 100644 fs/xfs/xfs_bmap_item.c
 create mode 100644 fs/xfs/xfs_bmap_item.h


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 6a9ea9e..b850961 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -102,6 +102,7 @@ xfs-y				+= xfs_aops.o \
 # low-level transaction/log code
 xfs-y				+= xfs_log.o \
 				   xfs_log_cil.o \
+				   xfs_bmap_item.o \
 				   xfs_buf_item.o \
 				   xfs_extfree_item.o \
 				   xfs_icreate_item.o \
diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
index 3659f04..5dd7c2e 100644
--- a/fs/xfs/libxfs/xfs_log_format.h
+++ b/fs/xfs/libxfs/xfs_log_format.h
@@ -114,7 +114,9 @@ static inline uint xlog_get_cycle(char *ptr)
 #define XLOG_REG_TYPE_RUD_FORMAT	22
 #define XLOG_REG_TYPE_CUI_FORMAT	23
 #define XLOG_REG_TYPE_CUD_FORMAT	24
-#define XLOG_REG_TYPE_MAX		24
+#define XLOG_REG_TYPE_BUI_FORMAT	25
+#define XLOG_REG_TYPE_BUD_FORMAT	26
+#define XLOG_REG_TYPE_MAX		26
 
 /*
  * Flags to log operation header
@@ -235,6 +237,8 @@ typedef struct xfs_trans_header {
 #define	XFS_LI_RUD		0x1241
 #define	XFS_LI_CUI		0x1242	/* refcount update intent */
 #define	XFS_LI_CUD		0x1243
+#define	XFS_LI_BUI		0x1244	/* bmbt update intent */
+#define	XFS_LI_BUD		0x1245
 
 #define XFS_LI_TYPE_DESC \
 	{ XFS_LI_EFI,		"XFS_LI_EFI" }, \
@@ -248,7 +252,9 @@ typedef struct xfs_trans_header {
 	{ XFS_LI_RUI,		"XFS_LI_RUI" }, \
 	{ XFS_LI_RUD,		"XFS_LI_RUD" }, \
 	{ XFS_LI_CUI,		"XFS_LI_CUI" }, \
-	{ XFS_LI_CUD,		"XFS_LI_CUD" }
+	{ XFS_LI_CUD,		"XFS_LI_CUD" }, \
+	{ XFS_LI_BUI,		"XFS_LI_BUI" }, \
+	{ XFS_LI_BUD,		"XFS_LI_BUD" }
 
 /*
  * Inode Log Item Format definitions.
@@ -725,6 +731,54 @@ struct xfs_cud_log_format {
 };
 
 /*
+ * BUI/BUD (inode block mapping) log format definitions
+ */
+
+/* bmbt me_flags: upper bits are flags, lower byte is type code */
+/* Type codes are taken directly from enum xfs_bmap_intent_type. */
+#define XFS_BMAP_EXTENT_TYPE_MASK	0xFF
+
+#define XFS_BMAP_EXTENT_ATTR_FORK	(1U << 31)
+#define XFS_BMAP_EXTENT_UNWRITTEN	(1U << 30)
+
+#define XFS_BMAP_EXTENT_FLAGS		(XFS_BMAP_EXTENT_TYPE_MASK | \
+					 XFS_BMAP_EXTENT_ATTR_FORK | \
+					 XFS_BMAP_EXTENT_UNWRITTEN)
+
+/*
+ * This is the structure used to lay out a bui log item in the
+ * log.  The bui_extents field is a variable size array whose
+ * size is given by bui_nextents.
+ */
+struct xfs_bui_log_format {
+	__uint16_t		bui_type;	/* bui log item type */
+	__uint16_t		bui_size;	/* size of this item */
+	__uint32_t		bui_nextents;	/* # extents to bmap */
+	__uint64_t		bui_id;		/* bui identifier */
+	struct xfs_map_extent	bui_extents[];	/* array of extents to bmap */
+};
+
+static inline size_t
+xfs_bui_log_format_sizeof(
+	unsigned int		nr)
+{
+	return sizeof(struct xfs_bui_log_format) +
+			nr * sizeof(struct xfs_map_extent);
+}
+
+/*
+ * This is the structure used to lay out a bud log item in the
+ * log.  It carries only its type and size and the id of the bui it
+ * completes; unlike the bui, it needs no extent array.
+ */
+struct xfs_bud_log_format {
+	__uint16_t		bud_type;	/* bud log item type */
+	__uint16_t		bud_size;	/* size of this item */
+	__uint32_t		__pad;
+	__uint64_t		bud_bui_id;	/* id of corresponding bui */
+};
+
+/*
  * Dquot Log format definitions.
  *
  * The first two fields must be the type and size fitting into
diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
new file mode 100644
index 0000000..ea736af
--- /dev/null
+++ b/fs/xfs/xfs_bmap_item.c
@@ -0,0 +1,374 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_trans.h"
+#include "xfs_trans_priv.h"
+#include "xfs_buf_item.h"
+#include "xfs_bmap_item.h"
+#include "xfs_log.h"
+
+
+kmem_zone_t	*xfs_bui_zone;
+kmem_zone_t	*xfs_bud_zone;
+
+static inline struct xfs_bui_log_item *BUI_ITEM(struct xfs_log_item *lip)
+{
+	return container_of(lip, struct xfs_bui_log_item, bui_item);
+}
+
+void
+xfs_bui_item_free(
+	struct xfs_bui_log_item	*buip)
+{
+	kmem_zone_free(xfs_bui_zone, buip);
+}
+
+STATIC void
+xfs_bui_item_size(
+	struct xfs_log_item	*lip,
+	int			*nvecs,
+	int			*nbytes)
+{
+	struct xfs_bui_log_item	*buip = BUI_ITEM(lip);
+
+	*nvecs += 1;
+	*nbytes += xfs_bui_log_format_sizeof(buip->bui_format.bui_nextents);
+}
+
+/*
+ * This is called to fill in the vector of log iovecs for the
+ * given bui log item. We use only 1 iovec, and we point that
+ * at the bui_log_format structure embedded in the bui item.
+ * It is at this point that we assert that all of the extent
+ * slots in the bui item have been filled.
+ */
+STATIC void
+xfs_bui_item_format(
+	struct xfs_log_item	*lip,
+	struct xfs_log_vec	*lv)
+{
+	struct xfs_bui_log_item	*buip = BUI_ITEM(lip);
+	struct xfs_log_iovec	*vecp = NULL;
+
+	ASSERT(atomic_read(&buip->bui_next_extent) ==
+			buip->bui_format.bui_nextents);
+
+	buip->bui_format.bui_type = XFS_LI_BUI;
+	buip->bui_format.bui_size = 1;
+
+	xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_BUI_FORMAT, &buip->bui_format,
+			xfs_bui_log_format_sizeof(buip->bui_format.bui_nextents));
+}
+
+/*
+ * Pinning has no meaning for a bui item, so just return.
+ */
+STATIC void
+xfs_bui_item_pin(
+	struct xfs_log_item	*lip)
+{
+}
+
+/*
+ * The unpin operation is the last place a BUI is manipulated in the log. It is
+ * either inserted in the AIL or aborted in the event of a log I/O error. In
+ * either case, the BUI transaction has been successfully committed to make it
+ * this far. Therefore, we expect whoever committed the BUI to either construct
+ * and commit the BUD or drop the BUD's reference in the event of error. Simply
+ * drop the log's BUI reference now that the log is done with it.
+ */
+STATIC void
+xfs_bui_item_unpin(
+	struct xfs_log_item	*lip,
+	int			remove)
+{
+	struct xfs_bui_log_item	*buip = BUI_ITEM(lip);
+
+	xfs_bui_release(buip);
+}
+
+/*
+ * BUI items have no locking or pushing.  However, since BUIs are pulled from
+ * the AIL when their corresponding BUDs are committed to disk, their situation
+ * is very similar to being pinned.  Return XFS_ITEM_PINNED so that the caller
+ * will eventually flush the log.  This should help in getting the BUI out of
+ * the AIL.
+ */
+STATIC uint
+xfs_bui_item_push(
+	struct xfs_log_item	*lip,
+	struct list_head	*buffer_list)
+{
+	return XFS_ITEM_PINNED;
+}
+
+/*
+ * The BUI has been either committed or aborted if the transaction has been
+ * cancelled. If the transaction was cancelled, a BUD isn't going to be
+ * constructed and thus we free the BUI here directly.
+ */
+STATIC void
+xfs_bui_item_unlock(
+	struct xfs_log_item	*lip)
+{
+	if (lip->li_flags & XFS_LI_ABORTED)
+		xfs_bui_item_free(BUI_ITEM(lip));
+}
+
+/*
+ * The BUI is logged only once and cannot be moved in the log, so simply return
+ * the lsn at which it's been logged.
+ */
+STATIC xfs_lsn_t
+xfs_bui_item_committed(
+	struct xfs_log_item	*lip,
+	xfs_lsn_t		lsn)
+{
+	return lsn;
+}
+
+/*
+ * The BUI dependency tracking op doesn't do squat.  It can't because
+ * it doesn't know where the free extent is coming from.  The dependency
+ * tracking has to be handled by the "enclosing" metadata object.  For
+ * example, for inodes, the inode is locked throughout the extent freeing
+ * so the dependency should be recorded there.
+ */
+STATIC void
+xfs_bui_item_committing(
+	struct xfs_log_item	*lip,
+	xfs_lsn_t		lsn)
+{
+}
+
+/*
+ * This is the ops vector shared by all bui log items.
+ */
+static const struct xfs_item_ops xfs_bui_item_ops = {
+	.iop_size	= xfs_bui_item_size,
+	.iop_format	= xfs_bui_item_format,
+	.iop_pin	= xfs_bui_item_pin,
+	.iop_unpin	= xfs_bui_item_unpin,
+	.iop_unlock	= xfs_bui_item_unlock,
+	.iop_committed	= xfs_bui_item_committed,
+	.iop_push	= xfs_bui_item_push,
+	.iop_committing = xfs_bui_item_committing,
+};
+
+/*
+ * Allocate and initialize a bui item with the given number of extents.
+ */
+struct xfs_bui_log_item *
+xfs_bui_init(
+	struct xfs_mount		*mp)
+
+{
+	struct xfs_bui_log_item		*buip;
+
+	buip = kmem_zone_zalloc(xfs_bui_zone, KM_SLEEP);
+
+	xfs_log_item_init(mp, &buip->bui_item, XFS_LI_BUI, &xfs_bui_item_ops);
+	buip->bui_format.bui_nextents = XFS_BUI_MAX_FAST_EXTENTS;
+	buip->bui_format.bui_id = (uintptr_t)(void *)buip;
+	atomic_set(&buip->bui_next_extent, 0);
+	atomic_set(&buip->bui_refcount, 2);
+
+	return buip;
+}
+
+/*
+ * Freeing the BUI requires that we remove it from the AIL if it has already
+ * been placed there. However, the BUI may not yet have been placed in the AIL
+ * when called by xfs_bui_release() from BUD processing due to the ordering of
+ * committed vs unpin operations in bulk insert operations. Hence the reference
+ * count to ensure only the last caller frees the BUI.
+ */
+void
+xfs_bui_release(
+	struct xfs_bui_log_item	*buip)
+{
+	if (atomic_dec_and_test(&buip->bui_refcount)) {
+		xfs_trans_ail_remove(&buip->bui_item, SHUTDOWN_LOG_IO_ERROR);
+		xfs_bui_item_free(buip);
+	}
+}
+
+static inline struct xfs_bud_log_item *BUD_ITEM(struct xfs_log_item *lip)
+{
+	return container_of(lip, struct xfs_bud_log_item, bud_item);
+}
+
+STATIC void
+xfs_bud_item_size(
+	struct xfs_log_item	*lip,
+	int			*nvecs,
+	int			*nbytes)
+{
+	*nvecs += 1;
+	*nbytes += sizeof(struct xfs_bud_log_format);
+}
+
+/*
+ * This is called to fill in the vector of log iovecs for the
+ * given bud log item. We use only 1 iovec, and we point that
+ * at the bud_log_format structure embedded in the bud item.
+ * Unlike the bui, the bud carries no extent array, so there are
+ * no extent slots to fill in here.
+ */
+STATIC void
+xfs_bud_item_format(
+	struct xfs_log_item	*lip,
+	struct xfs_log_vec	*lv)
+{
+	struct xfs_bud_log_item	*budp = BUD_ITEM(lip);
+	struct xfs_log_iovec	*vecp = NULL;
+
+	budp->bud_format.bud_type = XFS_LI_BUD;
+	budp->bud_format.bud_size = 1;
+
+	xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_BUD_FORMAT, &budp->bud_format,
+			sizeof(struct xfs_bud_log_format));
+}
+
+/*
+ * Pinning has no meaning for a bud item, so just return.
+ */
+STATIC void
+xfs_bud_item_pin(
+	struct xfs_log_item	*lip)
+{
+}
+
+/*
+ * Since pinning has no meaning for a bud item, unpinning does
+ * not either.
+ */
+STATIC void
+xfs_bud_item_unpin(
+	struct xfs_log_item	*lip,
+	int			remove)
+{
+}
+
+/*
+ * There isn't much you can do to push on a bud item.  It is simply stuck
+ * waiting for the log to be flushed to disk.
+ */
+STATIC uint
+xfs_bud_item_push(
+	struct xfs_log_item	*lip,
+	struct list_head	*buffer_list)
+{
+	return XFS_ITEM_PINNED;
+}
+
+/*
+ * The BUD is either committed or aborted if the transaction is cancelled. If
+ * the transaction is cancelled, drop our reference to the BUI and free the
+ * BUD.
+ */
+STATIC void
+xfs_bud_item_unlock(
+	struct xfs_log_item	*lip)
+{
+	struct xfs_bud_log_item	*budp = BUD_ITEM(lip);
+
+	if (lip->li_flags & XFS_LI_ABORTED) {
+		xfs_bui_release(budp->bud_buip);
+		kmem_zone_free(xfs_bud_zone, budp);
+	}
+}
+
+/*
+ * When the bud item is committed to disk, all we need to do is delete our
+ * reference to our partner bui item and then free ourselves. Since we're
+ * freeing ourselves we must return -1 to keep the transaction code from
+ * further referencing this item.
+ */
+STATIC xfs_lsn_t
+xfs_bud_item_committed(
+	struct xfs_log_item	*lip,
+	xfs_lsn_t		lsn)
+{
+	struct xfs_bud_log_item	*budp = BUD_ITEM(lip);
+
+	/*
+	 * Drop the BUI reference regardless of whether the BUD has been
+	 * aborted. Once the BUD transaction is constructed, it is the sole
+	 * responsibility of the BUD to release the BUI (even if the BUI is
+	 * aborted due to log I/O error).
+	 */
+	xfs_bui_release(budp->bud_buip);
+	kmem_zone_free(xfs_bud_zone, budp);
+
+	return (xfs_lsn_t)-1;
+}
+
+/*
+ * The BUD dependency tracking op doesn't do squat.  It can't because
+ * it doesn't know where the free extent is coming from.  The dependency
+ * tracking has to be handled by the "enclosing" metadata object.  For
+ * example, for inodes, the inode is locked throughout the extent freeing
+ * so the dependency should be recorded there.
+ */
+STATIC void
+xfs_bud_item_committing(
+	struct xfs_log_item	*lip,
+	xfs_lsn_t		lsn)
+{
+}
+
+/*
+ * This is the ops vector shared by all bud log items.
+ */
+static const struct xfs_item_ops xfs_bud_item_ops = {
+	.iop_size	= xfs_bud_item_size,
+	.iop_format	= xfs_bud_item_format,
+	.iop_pin	= xfs_bud_item_pin,
+	.iop_unpin	= xfs_bud_item_unpin,
+	.iop_unlock	= xfs_bud_item_unlock,
+	.iop_committed	= xfs_bud_item_committed,
+	.iop_push	= xfs_bud_item_push,
+	.iop_committing = xfs_bud_item_committing,
+};
+
+/*
+ * Allocate and initialize a bud item for the given bui item.
+ */
+struct xfs_bud_log_item *
+xfs_bud_init(
+	struct xfs_mount		*mp,
+	struct xfs_bui_log_item		*buip)
+
+{
+	struct xfs_bud_log_item	*budp;
+
+	budp = kmem_zone_zalloc(xfs_bud_zone, KM_SLEEP);
+	xfs_log_item_init(mp, &budp->bud_item, XFS_LI_BUD, &xfs_bud_item_ops);
+	budp->bud_buip = buip;
+	budp->bud_format.bud_bui_id = buip->bui_format.bui_id;
+
+	return budp;
+}
diff --git a/fs/xfs/xfs_bmap_item.h b/fs/xfs/xfs_bmap_item.h
new file mode 100644
index 0000000..57c13d3
--- /dev/null
+++ b/fs/xfs/xfs_bmap_item.h
@@ -0,0 +1,97 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#ifndef	__XFS_BMAP_ITEM_H__
+#define	__XFS_BMAP_ITEM_H__
+
+/*
+ * There are (currently) two pairs of bmap btree redo item types: map & unmap.
+ * The common abbreviations for these are BUI (bmap update intent) and BUD
+ * (bmap update done).  The redo item type is encoded in the flags field of
+ * each xfs_map_extent.
+ *
+ * *I items should be recorded in the *first* of a series of rolled
+ * transactions, and the *D items should be recorded in the same transaction
+ * that records the associated bmbt updates.
+ *
+ * Should the system crash after the commit of the first transaction but
+ * before the commit of the final transaction in a series, log recovery will
+ * use the redo information recorded by the intent items to replay the
+ * bmbt metadata updates in the non-first transaction.
+ */
+
+/* kernel only BUI/BUD definitions */
+
+struct xfs_mount;
+struct kmem_zone;
+
+/*
+ * Max number of extents in fast allocation path.
+ */
+#define	XFS_BUI_MAX_FAST_EXTENTS	1
+
+/*
+ * Define BUI flag bits. Manipulated by set/clear/test_bit operators.
+ */
+#define	XFS_BUI_RECOVERED		1
+
+/*
+ * This is the "bmap update intent" log item.  It is used to log the fact that
+ * some file block mappings need to change.  It is used in conjunction with the
+ * "bmap update done" log item described below.
+ *
+ * These log items follow the same rules as struct xfs_efi_log_item; see the
+ * comments about that structure (in xfs_extfree_item.h) for more details.
+ */
+struct xfs_bui_log_item {
+	struct xfs_log_item		bui_item;
+	atomic_t			bui_refcount;
+	atomic_t			bui_next_extent;
+	unsigned long			bui_flags;	/* misc flags */
+	struct xfs_bui_log_format	bui_format;
+};
+
+static inline size_t
+xfs_bui_log_item_sizeof(
+	unsigned int		nr)
+{
+	return offsetof(struct xfs_bui_log_item, bui_format) +
+			xfs_bui_log_format_sizeof(nr);
+}
+
+/*
+ * This is the "bmap update done" log item.  It is used to log the fact that
+ * some bmbt updates mentioned in an earlier bui item have been performed.
+ */
+struct xfs_bud_log_item {
+	struct xfs_log_item		bud_item;
+	struct xfs_bui_log_item		*bud_buip;
+	struct xfs_bud_log_format	bud_format;
+};
+
+extern struct kmem_zone	*xfs_bui_zone;
+extern struct kmem_zone	*xfs_bud_zone;
+
+struct xfs_bui_log_item *xfs_bui_init(struct xfs_mount *);
+struct xfs_bud_log_item *xfs_bud_init(struct xfs_mount *,
+		struct xfs_bui_log_item *);
+void xfs_bui_item_free(struct xfs_bui_log_item *);
+void xfs_bui_release(struct xfs_bui_log_item *);
+
+#endif	/* __XFS_BMAP_ITEM_H__ */
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 6234622..071bae0 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -48,6 +48,7 @@
 #include "xfs_ondisk.h"
 #include "xfs_rmap_item.h"
 #include "xfs_refcount_item.h"
+#include "xfs_bmap_item.h"
 
 #include <linux/namei.h>
 #include <linux/init.h>
@@ -1800,8 +1801,23 @@ xfs_init_zones(void)
 	if (!xfs_cui_zone)
 		goto out_destroy_cud_zone;
 
+	xfs_bud_zone = kmem_zone_init(sizeof(struct xfs_bud_log_item),
+			"xfs_bud_item");
+	if (!xfs_bud_zone)
+		goto out_destroy_cui_zone;
+
+	xfs_bui_zone = kmem_zone_init(
+			xfs_bui_log_item_sizeof(XFS_BUI_MAX_FAST_EXTENTS),
+			"xfs_bui_item");
+	if (!xfs_bui_zone)
+		goto out_destroy_bud_zone;
+
 	return 0;
 
+ out_destroy_bud_zone:
+	kmem_zone_destroy(xfs_bud_zone);
+ out_destroy_cui_zone:
+	kmem_zone_destroy(xfs_cui_zone);
  out_destroy_cud_zone:
 	kmem_zone_destroy(xfs_cud_zone);
  out_destroy_rui_zone:
@@ -1848,6 +1864,8 @@ xfs_destroy_zones(void)
 	 * destroy caches.
 	 */
 	rcu_barrier();
+	kmem_zone_destroy(xfs_bui_zone);
+	kmem_zone_destroy(xfs_bud_zone);
 	kmem_zone_destroy(xfs_cui_zone);
 	kmem_zone_destroy(xfs_cud_zone);
 	kmem_zone_destroy(xfs_rui_zone);



* [PATCH 20/63] xfs: log bmap intent items
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (18 preceding siblings ...)
  2016-09-30  3:07 ` [PATCH 19/63] xfs: create bmbt update intent log items Darrick J. Wong
@ 2016-09-30  3:07 ` Darrick J. Wong
  2016-09-30  7:26   ` Christoph Hellwig
  2016-09-30 19:22   ` Brian Foster
  2016-09-30  3:07 ` [PATCH 21/63] xfs: map an inode's offset to an exact physical block Darrick J. Wong
                   ` (43 subsequent siblings)
  63 siblings, 2 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:07 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel

Provide a mechanism for higher levels to create BUI/BUD items and submit
them to the log, plus a stub function to deal with recovered BUI items.
These parts will be connected to the rmapbt in a later patch.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
v2: Only support one item per BUI.
---
 fs/xfs/Makefile          |    1 
 fs/xfs/libxfs/xfs_bmap.h |   14 ++++
 fs/xfs/xfs_bmap_item.c   |   69 ++++++++++++++++++
 fs/xfs/xfs_bmap_item.h   |    1 
 fs/xfs/xfs_log_recover.c |  177 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_trans.h       |   13 +++
 fs/xfs/xfs_trans_bmap.c  |   84 ++++++++++++++++++++++
 7 files changed, 359 insertions(+)
 create mode 100644 fs/xfs/xfs_trans_bmap.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index b850961..6afb228 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -111,6 +111,7 @@ xfs-y				+= xfs_log.o \
 				   xfs_rmap_item.o \
 				   xfs_log_recover.o \
 				   xfs_trans_ail.o \
+				   xfs_trans_bmap.o \
 				   xfs_trans_buf.o \
 				   xfs_trans_extfree.o \
 				   xfs_trans_inode.o \
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 8395f6e..fcdb094 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -207,4 +207,18 @@ int	xfs_bmapi_reserve_delalloc(struct xfs_inode *ip, xfs_fileoff_t aoff,
 		xfs_filblks_t len, struct xfs_bmbt_irec *got,
 		struct xfs_bmbt_irec *prev, xfs_extnum_t *lastx, int eof);
 
+/* The XFS_BMAP_EXTENT_* in xfs_log_format.h must match these. */
+enum xfs_bmap_intent_type {
+	XFS_BMAP_MAP = 1,
+	XFS_BMAP_UNMAP,
+};
+
+struct xfs_bmap_intent {
+	struct list_head			bi_list;
+	enum xfs_bmap_intent_type		bi_type;
+	struct xfs_inode			*bi_owner;
+	int					bi_whichfork;
+	struct xfs_bmbt_irec			bi_bmap;
+};
+
 #endif	/* __XFS_BMAP_H__ */
diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
index ea736af..4e46b63 100644
--- a/fs/xfs/xfs_bmap_item.c
+++ b/fs/xfs/xfs_bmap_item.c
@@ -22,12 +22,18 @@
 #include "xfs_format.h"
 #include "xfs_log_format.h"
 #include "xfs_trans_resv.h"
+#include "xfs_bit.h"
 #include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_inode.h"
 #include "xfs_trans.h"
 #include "xfs_trans_priv.h"
 #include "xfs_buf_item.h"
 #include "xfs_bmap_item.h"
 #include "xfs_log.h"
+#include "xfs_bmap.h"
+#include "xfs_icache.h"
+#include "xfs_trace.h"
 
 
 kmem_zone_t	*xfs_bui_zone;
@@ -372,3 +378,66 @@ xfs_bud_init(
 
 	return budp;
 }
+
+/*
+ * Process a bmap update intent item that was recovered from the log.
+ * We need to update some inode's bmbt.
+ */
+int
+xfs_bui_recover(
+	struct xfs_mount		*mp,
+	struct xfs_bui_log_item		*buip)
+{
+	int				error = 0;
+	struct xfs_map_extent		*bmap;
+	xfs_fsblock_t			startblock_fsb;
+	xfs_fsblock_t			inode_fsb;
+	bool				op_ok;
+
+	ASSERT(!test_bit(XFS_BUI_RECOVERED, &buip->bui_flags));
+
+	/* Only one mapping operation per BUI... */
+	if (buip->bui_format.bui_nextents != XFS_BUI_MAX_FAST_EXTENTS) {
+		set_bit(XFS_BUI_RECOVERED, &buip->bui_flags);
+		xfs_bui_release(buip);
+		return -EIO;
+	}
+
+	/*
+	 * First check the validity of the extent described by the
+	 * BUI.  If anything is bad, then toss the BUI.
+	 */
+	bmap = &buip->bui_format.bui_extents[0];
+	startblock_fsb = XFS_BB_TO_FSB(mp,
+			   XFS_FSB_TO_DADDR(mp, bmap->me_startblock));
+	inode_fsb = XFS_BB_TO_FSB(mp, XFS_FSB_TO_DADDR(mp,
+			XFS_INO_TO_FSB(mp, bmap->me_owner)));
+	switch (bmap->me_flags & XFS_BMAP_EXTENT_TYPE_MASK) {
+	case XFS_BMAP_MAP:
+	case XFS_BMAP_UNMAP:
+		op_ok = true;
+		break;
+	default:
+		op_ok = false;
+		break;
+	}
+	if (!op_ok || startblock_fsb == 0 ||
+	    bmap->me_len == 0 ||
+	    inode_fsb == 0 ||
+	    startblock_fsb >= mp->m_sb.sb_dblocks ||
+	    bmap->me_len >= mp->m_sb.sb_agblocks ||
+	    inode_fsb >= mp->m_sb.sb_agblocks ||
+	    (bmap->me_flags & ~XFS_BMAP_EXTENT_FLAGS)) {
+		/*
+		 * This will pull the BUI from the AIL and
+		 * free the memory associated with it.
+		 */
+		set_bit(XFS_BUI_RECOVERED, &buip->bui_flags);
+		xfs_bui_release(buip);
+		return -EIO;
+	}
+
+	set_bit(XFS_BUI_RECOVERED, &buip->bui_flags);
+	xfs_bui_release(buip);
+	return error;
+}
diff --git a/fs/xfs/xfs_bmap_item.h b/fs/xfs/xfs_bmap_item.h
index 57c13d3..c867daa 100644
--- a/fs/xfs/xfs_bmap_item.h
+++ b/fs/xfs/xfs_bmap_item.h
@@ -93,5 +93,6 @@ struct xfs_bud_log_item *xfs_bud_init(struct xfs_mount *,
 		struct xfs_bui_log_item *);
 void xfs_bui_item_free(struct xfs_bui_log_item *);
 void xfs_bui_release(struct xfs_bui_log_item *);
+int xfs_bui_recover(struct xfs_mount *mp, struct xfs_bui_log_item *buip);
 
 #endif	/* __XFS_BMAP_ITEM_H__ */
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 622881a..9697e94 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -46,6 +46,7 @@
 #include "xfs_rmap_item.h"
 #include "xfs_buf_item.h"
 #include "xfs_refcount_item.h"
+#include "xfs_bmap_item.h"
 
 #define BLK_AVG(blk1, blk2)	((blk1+blk2) >> 1)
 
@@ -1927,6 +1928,8 @@ xlog_recover_reorder_trans(
 		case XFS_LI_RUD:
 		case XFS_LI_CUI:
 		case XFS_LI_CUD:
+		case XFS_LI_BUI:
+		case XFS_LI_BUD:
 			trace_xfs_log_recover_item_reorder_tail(log,
 							trans, item, pass);
 			list_move_tail(&item->ri_list, &inode_list);
@@ -3671,6 +3674,125 @@ xlog_recover_cud_pass2(
 }
 
 /*
+ * Copy a BUI format buffer from the given buf into the destination
+ * BUI format structure.  The BUI/BUD items were designed not to need any
+ * special alignment handling.
+ */
+static int
+xfs_bui_copy_format(
+	struct xfs_log_iovec		*buf,
+	struct xfs_bui_log_format	*dst_bui_fmt)
+{
+	struct xfs_bui_log_format	*src_bui_fmt;
+	uint				len;
+
+	src_bui_fmt = buf->i_addr;
+	len = xfs_bui_log_format_sizeof(src_bui_fmt->bui_nextents);
+
+	if (buf->i_len == len) {
+		memcpy(dst_bui_fmt, src_bui_fmt, len);
+		return 0;
+	}
+	return -EFSCORRUPTED;
+}
+
+/*
+ * This routine is called to create an in-core extent bmap update
+ * item from the bui format structure which was logged on disk.
+ * It allocates an in-core bui, copies the extents from the format
+ * structure into it, and adds the bui to the AIL with the given
+ * LSN.
+ */
+STATIC int
+xlog_recover_bui_pass2(
+	struct xlog			*log,
+	struct xlog_recover_item	*item,
+	xfs_lsn_t			lsn)
+{
+	int				error;
+	struct xfs_mount		*mp = log->l_mp;
+	struct xfs_bui_log_item		*buip;
+	struct xfs_bui_log_format	*bui_formatp;
+
+	bui_formatp = item->ri_buf[0].i_addr;
+
+	if (bui_formatp->bui_nextents != XFS_BUI_MAX_FAST_EXTENTS)
+		return -EFSCORRUPTED;
+	buip = xfs_bui_init(mp);
+	error = xfs_bui_copy_format(&item->ri_buf[0], &buip->bui_format);
+	if (error) {
+		xfs_bui_item_free(buip);
+		return error;
+	}
+	atomic_set(&buip->bui_next_extent, bui_formatp->bui_nextents);
+
+	spin_lock(&log->l_ailp->xa_lock);
+	/*
+	 * The BUI has two references. One for the BUD and one for the BUI to
+	 * ensure it makes it into the AIL. Insert the BUI into the AIL
+	 * directly and drop the BUI reference. Note that
+	 * xfs_trans_ail_update() drops the AIL lock.
+	 */
+	xfs_trans_ail_update(log->l_ailp, &buip->bui_item, lsn);
+	xfs_bui_release(buip);
+	return 0;
+}
+
+/*
+ * This routine is called when a BUD format structure is found in a committed
+ * transaction in the log. Its purpose is to cancel the corresponding BUI if it
+ * was still in the log. To do this it searches the AIL for the BUI with an id
+ * equal to that in the BUD format structure. If we find it we drop the BUD
+ * reference, which removes the BUI from the AIL and frees it.
+ */
+STATIC int
+xlog_recover_bud_pass2(
+	struct xlog			*log,
+	struct xlog_recover_item	*item)
+{
+	struct xfs_bud_log_format	*bud_formatp;
+	struct xfs_bui_log_item		*buip = NULL;
+	struct xfs_log_item		*lip;
+	__uint64_t			bui_id;
+	struct xfs_ail_cursor		cur;
+	struct xfs_ail			*ailp = log->l_ailp;
+
+	bud_formatp = item->ri_buf[0].i_addr;
+	if (item->ri_buf[0].i_len != sizeof(struct xfs_bud_log_format))
+		return -EFSCORRUPTED;
+	bui_id = bud_formatp->bud_bui_id;
+
+	/*
+	 * Search for the BUI with the id in the BUD format structure in the
+	 * AIL.
+	 */
+	spin_lock(&ailp->xa_lock);
+	lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
+	while (lip != NULL) {
+		if (lip->li_type == XFS_LI_BUI) {
+			buip = (struct xfs_bui_log_item *)lip;
+			if (buip->bui_format.bui_id == bui_id) {
+				/*
+				 * Drop the BUD reference to the BUI. This
+				 * removes the BUI from the AIL and frees it.
+				 */
+				spin_unlock(&ailp->xa_lock);
+				xfs_bui_release(buip);
+				spin_lock(&ailp->xa_lock);
+				break;
+			}
+		}
+		lip = xfs_trans_ail_cursor_next(ailp, &cur);
+	}
+
+	xfs_trans_ail_cursor_done(&cur);
+	spin_unlock(&ailp->xa_lock);
+
+	return 0;
+}
+
+/*
  * This routine is called when an inode create format structure is found in a
  * committed transaction in the log.  It's purpose is to initialise the inodes
  * being allocated on disk. This requires us to get inode cluster buffers that
@@ -3899,6 +4021,8 @@ xlog_recover_ra_pass2(
 	case XFS_LI_RUD:
 	case XFS_LI_CUI:
 	case XFS_LI_CUD:
+	case XFS_LI_BUI:
+	case XFS_LI_BUD:
 	default:
 		break;
 	}
@@ -3926,6 +4050,8 @@ xlog_recover_commit_pass1(
 	case XFS_LI_RUD:
 	case XFS_LI_CUI:
 	case XFS_LI_CUD:
+	case XFS_LI_BUI:
+	case XFS_LI_BUD:
 		/* nothing to do in pass 1 */
 		return 0;
 	default:
@@ -3964,6 +4090,10 @@ xlog_recover_commit_pass2(
 		return xlog_recover_cui_pass2(log, item, trans->r_lsn);
 	case XFS_LI_CUD:
 		return xlog_recover_cud_pass2(log, item);
+	case XFS_LI_BUI:
+		return xlog_recover_bui_pass2(log, item, trans->r_lsn);
+	case XFS_LI_BUD:
+		return xlog_recover_bud_pass2(log, item);
 	case XFS_LI_DQUOT:
 		return xlog_recover_dquot_pass2(log, buffer_list, item,
 						trans->r_lsn);
@@ -4591,6 +4721,46 @@ xlog_recover_cancel_cui(
 	spin_lock(&ailp->xa_lock);
 }
 
+/* Recover the BUI if necessary. */
+STATIC int
+xlog_recover_process_bui(
+	struct xfs_mount		*mp,
+	struct xfs_ail			*ailp,
+	struct xfs_log_item		*lip)
+{
+	struct xfs_bui_log_item		*buip;
+	int				error;
+
+	/*
+	 * Skip BUIs that we've already processed.
+	 */
+	buip = container_of(lip, struct xfs_bui_log_item, bui_item);
+	if (test_bit(XFS_BUI_RECOVERED, &buip->bui_flags))
+		return 0;
+
+	spin_unlock(&ailp->xa_lock);
+	error = xfs_bui_recover(mp, buip);
+	spin_lock(&ailp->xa_lock);
+
+	return error;
+}
+
+/* Release the BUI since we're cancelling everything. */
+STATIC void
+xlog_recover_cancel_bui(
+	struct xfs_mount		*mp,
+	struct xfs_ail			*ailp,
+	struct xfs_log_item		*lip)
+{
+	struct xfs_bui_log_item		*buip;
+
+	buip = container_of(lip, struct xfs_bui_log_item, bui_item);
+
+	spin_unlock(&ailp->xa_lock);
+	xfs_bui_release(buip);
+	spin_lock(&ailp->xa_lock);
+}
+
 /* Is this log item a deferred action intent? */
 static inline bool xlog_item_is_intent(struct xfs_log_item *lip)
 {
@@ -4598,6 +4768,7 @@ static inline bool xlog_item_is_intent(struct xfs_log_item *lip)
 	case XFS_LI_EFI:
 	case XFS_LI_RUI:
 	case XFS_LI_CUI:
+	case XFS_LI_BUI:
 		return true;
 	default:
 		return false;
@@ -4664,6 +4835,9 @@ xlog_recover_process_intents(
 		case XFS_LI_CUI:
 			error = xlog_recover_process_cui(log->l_mp, ailp, lip);
 			break;
+		case XFS_LI_BUI:
+			error = xlog_recover_process_bui(log->l_mp, ailp, lip);
+			break;
 		}
 		if (error)
 			goto out;
@@ -4714,6 +4888,9 @@ xlog_recover_cancel_intents(
 		case XFS_LI_CUI:
 			xlog_recover_cancel_cui(log->l_mp, ailp, lip);
 			break;
+		case XFS_LI_BUI:
+			xlog_recover_cancel_bui(log->l_mp, ailp, lip);
+			break;
 		}
 
 		lip = xfs_trans_ail_cursor_next(ailp, &cur);
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index a7a87d2..7cf02d3 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -39,6 +39,7 @@ struct xfs_btree_cur;
 struct xfs_cui_log_item;
 struct xfs_cud_log_item;
 struct xfs_defer_ops;
+struct xfs_bui_log_item;
 
 typedef struct xfs_log_item {
 	struct list_head		li_ail;		/* AIL pointers */
@@ -263,4 +264,16 @@ int xfs_trans_log_finish_refcount_update(struct xfs_trans *tp,
 		xfs_extlen_t blockcount, xfs_extlen_t *adjusted,
 		struct xfs_btree_cur **pcur);
 
+/* mapping updates */
+enum xfs_bmap_intent_type;
+
+void xfs_bmap_update_init_defer_op(void);
+struct xfs_bud_log_item *xfs_trans_get_bud(struct xfs_trans *tp,
+		struct xfs_bui_log_item *buip);
+int xfs_trans_log_finish_bmap_update(struct xfs_trans *tp,
+		struct xfs_bud_log_item *budp, struct xfs_defer_ops *dfops,
+		enum xfs_bmap_intent_type type, struct xfs_inode *ip,
+		int whichfork, xfs_fileoff_t startoff, xfs_fsblock_t startblock,
+		xfs_filblks_t blockcount, xfs_exntst_t state);
+
 #endif	/* __XFS_TRANS_H__ */
diff --git a/fs/xfs/xfs_trans_bmap.c b/fs/xfs/xfs_trans_bmap.c
new file mode 100644
index 0000000..656d669
--- /dev/null
+++ b/fs/xfs/xfs_trans_bmap.c
@@ -0,0 +1,84 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_trans.h"
+#include "xfs_trans_priv.h"
+#include "xfs_bmap_item.h"
+#include "xfs_alloc.h"
+#include "xfs_bmap.h"
+#include "xfs_inode.h"
+
+/*
+ * This routine is called to allocate a "bmap update done"
+ * log item.
+ */
+struct xfs_bud_log_item *
+xfs_trans_get_bud(
+	struct xfs_trans		*tp,
+	struct xfs_bui_log_item		*buip)
+{
+	struct xfs_bud_log_item		*budp;
+
+	budp = xfs_bud_init(tp->t_mountp, buip);
+	xfs_trans_add_item(tp, &budp->bud_item);
+	return budp;
+}
+
+/*
+ * Finish a bmap update and log it to the BUD. Note that the
+ * transaction is marked dirty regardless of whether the bmap update
+ * succeeds or fails to support the BUI/BUD lifecycle rules.
+ */
+int
+xfs_trans_log_finish_bmap_update(
+	struct xfs_trans		*tp,
+	struct xfs_bud_log_item		*budp,
+	struct xfs_defer_ops		*dop,
+	enum xfs_bmap_intent_type	type,
+	struct xfs_inode		*ip,
+	int				whichfork,
+	xfs_fileoff_t			startoff,
+	xfs_fsblock_t			startblock,
+	xfs_filblks_t			blockcount,
+	xfs_exntst_t			state)
+{
+	int				error;
+
+	error = -EFSCORRUPTED;
+
+	/*
+	 * Mark the transaction dirty, even on error. This ensures the
+	 * transaction is aborted, which:
+	 *
+	 * 1.) releases the BUI and frees the BUD
+	 * 2.) shuts down the filesystem
+	 */
+	tp->t_flags |= XFS_TRANS_DIRTY;
+	budp->bud_item.li_desc->lid_flags |= XFS_LID_DIRTY;
+
+	return error;
+}



* [PATCH 21/63] xfs: map an inode's offset to an exact physical block
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (19 preceding siblings ...)
  2016-09-30  3:07 ` [PATCH 20/63] xfs: log bmap intent items Darrick J. Wong
@ 2016-09-30  3:07 ` Darrick J. Wong
  2016-09-30  7:31   ` Christoph Hellwig
  2016-10-03 19:03   ` Brian Foster
  2016-09-30  3:08 ` [PATCH 22/63] xfs: pass bmapi flags through to bmap_del_extent Darrick J. Wong
                   ` (42 subsequent siblings)
  63 siblings, 2 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:07 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel

Teach the bmap routine how to map a range of file blocks to a
specific range of physical blocks, instead of simply allocating fresh
blocks.  This enables reflink to map a file to blocks that are already
in use.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c |   63 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_bmap.h |   10 +++++++
 fs/xfs/xfs_trace.h       |   54 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 126 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 907d7b8d..9f145ed 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3877,6 +3877,55 @@ xfs_bmap_btalloc(
 }
 
 /*
+ * For a remap operation, just "allocate" an extent at the address that the
+ * caller passed in, and ensure that the AGFL is the right size.  The caller
+ * will then map the "allocated" extent into the file somewhere.
+ */
+STATIC int
+xfs_bmap_remap_alloc(
+	struct xfs_bmalloca	*ap)
+{
+	struct xfs_trans	*tp = ap->tp;
+	struct xfs_mount	*mp = tp->t_mountp;
+	xfs_agblock_t		bno;
+	struct xfs_alloc_arg	args;
+	int			error;
+
+	/*
+	 * Validate that the block number is legal; this enables us to detect
+	 * and handle silent filesystem corruption rather than crashing.
+	 */
+	memset(&args, 0, sizeof(struct xfs_alloc_arg));
+	args.tp = ap->tp;
+	args.mp = ap->tp->t_mountp;
+	bno = *ap->firstblock;
+	args.agno = XFS_FSB_TO_AGNO(mp, bno);
+	ASSERT(args.agno < mp->m_sb.sb_agcount);
+	args.agbno = XFS_FSB_TO_AGBNO(mp, bno);
+	ASSERT(args.agbno < mp->m_sb.sb_agblocks);
+
+	/* "Allocate" the extent from the range we passed in. */
+	trace_xfs_bmap_remap_alloc(ap->ip, *ap->firstblock, ap->length);
+	ap->blkno = bno;
+	ap->ip->i_d.di_nblocks += ap->length;
+	xfs_trans_log_inode(ap->tp, ap->ip, XFS_ILOG_CORE);
+
+	/* Fix the freelist, like a real allocator does. */
+	args.datatype = ap->datatype;
+	args.pag = xfs_perag_get(args.mp, args.agno);
+	ASSERT(args.pag);
+
+	error = xfs_alloc_fix_freelist(&args, XFS_ALLOC_FLAG_FREEING);
+	if (error)
+		goto error0;
+error0:
+	xfs_perag_put(args.pag);
+	if (error)
+		trace_xfs_bmap_remap_alloc_error(ap->ip, error, _RET_IP_);
+	return error;
+}
+
+/*
  * xfs_bmap_alloc is called by xfs_bmapi to allocate an extent for a file.
  * It figures out where to ask the underlying allocator to put the new extent.
  */
@@ -3884,6 +3933,8 @@ STATIC int
 xfs_bmap_alloc(
 	struct xfs_bmalloca	*ap)	/* bmap alloc argument struct */
 {
+	if (ap->flags & XFS_BMAPI_REMAP)
+		return xfs_bmap_remap_alloc(ap);
 	if (XFS_IS_REALTIME_INODE(ap->ip) &&
 	    xfs_alloc_is_userdata(ap->datatype))
 		return xfs_bmap_rtalloc(ap);
@@ -4442,6 +4493,12 @@ xfs_bmapi_write(
 	ASSERT(len > 0);
 	ASSERT(XFS_IFORK_FORMAT(ip, whichfork) != XFS_DINODE_FMT_LOCAL);
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
+	if (whichfork == XFS_ATTR_FORK)
+		ASSERT(!(flags & XFS_BMAPI_REMAP));
+	if (flags & XFS_BMAPI_REMAP) {
+		ASSERT(!(flags & XFS_BMAPI_PREALLOC));
+		ASSERT(!(flags & XFS_BMAPI_CONVERT));
+	}
 
 	/* zeroing is for currently only for data extents, not metadata */
 	ASSERT((flags & (XFS_BMAPI_METADATA | XFS_BMAPI_ZERO)) !=
@@ -4503,6 +4560,12 @@ xfs_bmapi_write(
 		wasdelay = !inhole && isnullstartblock(bma.got.br_startblock);
 
 		/*
+		 * Make sure we only reflink into a hole.
+		 */
+		if (flags & XFS_BMAPI_REMAP)
+			ASSERT(inhole);
+
+		/*
 		 * First, deal with the hole before the allocated space
 		 * that we found, if any.
 		 */
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index fcdb094..877b6f9 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -97,6 +97,13 @@ struct xfs_extent_free_item
  */
 #define XFS_BMAPI_ZERO		0x080
 
+/*
+ * Map the inode offset to the block given in ap->firstblock.  Primarily
+ * used for reflink.  The range must be in a hole, and this flag cannot be
+ * turned on with PREALLOC or CONVERT, and cannot be used on the attr fork.
+ */
+#define XFS_BMAPI_REMAP		0x100
+
 #define XFS_BMAPI_FLAGS \
 	{ XFS_BMAPI_ENTIRE,	"ENTIRE" }, \
 	{ XFS_BMAPI_METADATA,	"METADATA" }, \
@@ -105,7 +112,8 @@ struct xfs_extent_free_item
 	{ XFS_BMAPI_IGSTATE,	"IGSTATE" }, \
 	{ XFS_BMAPI_CONTIG,	"CONTIG" }, \
 	{ XFS_BMAPI_CONVERT,	"CONVERT" }, \
-	{ XFS_BMAPI_ZERO,	"ZERO" }
+	{ XFS_BMAPI_ZERO,	"ZERO" }, \
+	{ XFS_BMAPI_REMAP,	"REMAP" }
 
 
 static inline int xfs_bmapi_aflag(int w)
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 195a168..8485984 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -2965,6 +2965,60 @@ TRACE_EVENT(xfs_refcount_finish_one_leftover,
 		  __entry->adjusted)
 );
 
+/* simple inode-based error/%ip tracepoint class */
+DECLARE_EVENT_CLASS(xfs_inode_error_class,
+	TP_PROTO(struct xfs_inode *ip, int error, unsigned long caller_ip),
+	TP_ARGS(ip, error, caller_ip),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino)
+		__field(int, error)
+		__field(unsigned long, caller_ip)
+	),
+	TP_fast_assign(
+		__entry->dev = VFS_I(ip)->i_sb->s_dev;
+		__entry->ino = ip->i_ino;
+		__entry->error = error;
+		__entry->caller_ip = caller_ip;
+	),
+	TP_printk("dev %d:%d ino %llx error %d caller %ps",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino,
+		  __entry->error,
+		  (char *)__entry->caller_ip)
+);
+
+#define DEFINE_INODE_ERROR_EVENT(name) \
+DEFINE_EVENT(xfs_inode_error_class, name, \
+	TP_PROTO(struct xfs_inode *ip, int error, \
+		 unsigned long caller_ip), \
+	TP_ARGS(ip, error, caller_ip))
+
+/* reflink allocator */
+TRACE_EVENT(xfs_bmap_remap_alloc,
+	TP_PROTO(struct xfs_inode *ip, xfs_fsblock_t fsbno,
+		 xfs_extlen_t len),
+	TP_ARGS(ip, fsbno, len),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino)
+		__field(xfs_fsblock_t, fsbno)
+		__field(xfs_extlen_t, len)
+	),
+	TP_fast_assign(
+		__entry->dev = VFS_I(ip)->i_sb->s_dev;
+		__entry->ino = ip->i_ino;
+		__entry->fsbno = fsbno;
+		__entry->len = len;
+	),
+	TP_printk("dev %d:%d ino 0x%llx fsbno 0x%llx len %x",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino,
+		  __entry->fsbno,
+		  __entry->len)
+);
+DEFINE_INODE_ERROR_EVENT(xfs_bmap_remap_alloc_error);
+
 #endif /* _TRACE_XFS_H */
 
 #undef TRACE_INCLUDE_PATH



* [PATCH 22/63] xfs: pass bmapi flags through to bmap_del_extent
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (20 preceding siblings ...)
  2016-09-30  3:07 ` [PATCH 21/63] xfs: map an inode's offset to an exact physical block Darrick J. Wong
@ 2016-09-30  3:08 ` Darrick J. Wong
  2016-09-30  7:16   ` Christoph Hellwig
  2016-09-30  3:08 ` [PATCH 23/63] xfs: implement deferred bmbt map/unmap operations Darrick J. Wong
                   ` (41 subsequent siblings)
  63 siblings, 1 reply; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:08 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel

Pass BMAPI_ flags from bunmapi into bmap_del_extent and extend
BMAPI_REMAP (which means "don't touch the allocator or the quota
accounting") to apply to bunmapi as well.  This will be used to
implement the unmap operation, which will be used by swapext.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c |    9 +++++----
 fs/xfs/libxfs/xfs_bmap.h |    3 +++
 2 files changed, 8 insertions(+), 4 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 9f145ed..51124e6 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -4760,7 +4760,8 @@ xfs_bmap_del_extent(
 	xfs_btree_cur_t		*cur,	/* if null, not a btree */
 	xfs_bmbt_irec_t		*del,	/* data to remove from extents */
 	int			*logflagsp, /* inode logging flags */
-	int			whichfork) /* data or attr fork */
+	int			whichfork, /* data or attr fork */
+	int			bflags)	/* bmapi flags */
 {
 	xfs_filblks_t		da_new;	/* new delay-alloc indirect blocks */
 	xfs_filblks_t		da_old;	/* old delay-alloc indirect blocks */
@@ -5052,7 +5053,7 @@ xfs_bmap_del_extent(
 	/*
 	 * If we need to, add to list of extents to delete.
 	 */
-	if (do_fx) {
+	if (do_fx && !(bflags & XFS_BMAPI_REMAP)) {
 		if (xfs_is_reflink_inode(ip) && whichfork == XFS_DATA_FORK) {
 			error = xfs_refcount_decrease_extent(mp, dfops, del);
 			if (error)
@@ -5070,7 +5071,7 @@ xfs_bmap_del_extent(
 	/*
 	 * Adjust quota data.
 	 */
-	if (qfield)
+	if (qfield && !(bflags & XFS_BMAPI_REMAP))
 		xfs_trans_mod_dquot_byino(tp, ip, qfield, (long)-nblks);
 
 	/*
@@ -5395,7 +5396,7 @@ xfs_bunmapi(
 			cur->bc_private.b.flags &= ~XFS_BTCUR_BPRV_WASDEL;
 
 		error = xfs_bmap_del_extent(ip, tp, &lastx, dfops, cur, &del,
-				&tmp_logflags, whichfork);
+				&tmp_logflags, whichfork, flags);
 		logflags |= tmp_logflags;
 		if (error)
 			goto error0;
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 877b6f9..89c9c8e 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -101,6 +101,9 @@ struct xfs_extent_free_item
  * Map the inode offset to the block given in ap->firstblock.  Primarily
  * used for reflink.  The range must be in a hole, and this flag cannot be
  * turned on with PREALLOC or CONVERT, and cannot be used on the attr fork.
+ *
+ * For bunmapi, this flag unmaps the range without adjusting quota, reducing
+ * refcount, or freeing the blocks.
  */
 #define XFS_BMAPI_REMAP		0x100
 



* [PATCH 23/63] xfs: implement deferred bmbt map/unmap operations
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (21 preceding siblings ...)
  2016-09-30  3:08 ` [PATCH 22/63] xfs: pass bmapi flags through to bmap_del_extent Darrick J. Wong
@ 2016-09-30  3:08 ` Darrick J. Wong
  2016-09-30  7:34   ` Christoph Hellwig
  2016-09-30  3:08 ` [PATCH 24/63] xfs: when replaying bmap operations, don't let unlinked inodes get reaped Darrick J. Wong
                   ` (40 subsequent siblings)
  63 siblings, 1 reply; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:08 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel

Implement deferred versions of the inode block map/unmap functions.
These will be used in subsequent patches to make reflink operations
atomic.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
v2: Only allow one item per BUI, and implement unmap too.
---
 fs/xfs/libxfs/xfs_bmap.c  |  141 ++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_bmap.h  |   11 +++
 fs/xfs/libxfs/xfs_defer.h |    1 
 fs/xfs/xfs_bmap_item.c    |   65 +++++++++++++++++-
 fs/xfs/xfs_error.h        |    4 +
 fs/xfs/xfs_super.c        |    1 
 fs/xfs/xfs_trace.h        |    5 +
 fs/xfs/xfs_trans.h        |    1 
 fs/xfs/xfs_trans_bmap.c   |  167 +++++++++++++++++++++++++++++++++++++++++++++
 9 files changed, 393 insertions(+), 3 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 51124e6..a5e429e 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -6057,3 +6057,144 @@ xfs_bmap_split_extent(
 	xfs_trans_cancel(tp);
 	return error;
 }
+
+/* Deferred mapping is only for real extents in the data fork. */
+static bool
+xfs_bmap_is_update_needed(
+	int			whichfork,
+	struct xfs_bmbt_irec	*bmap)
+{
+	ASSERT(whichfork == XFS_DATA_FORK);
+
+	return  bmap->br_startblock != HOLESTARTBLOCK &&
+		bmap->br_startblock != DELAYSTARTBLOCK;
+}
+
+/* Record a bmap intent. */
+static int
+__xfs_bmap_add(
+	struct xfs_mount		*mp,
+	struct xfs_defer_ops		*dfops,
+	enum xfs_bmap_intent_type	type,
+	struct xfs_inode		*ip,
+	int				whichfork,
+	struct xfs_bmbt_irec		*bmap)
+{
+	int				error;
+	struct xfs_bmap_intent		*bi;
+
+	trace_xfs_bmap_defer(mp,
+			XFS_FSB_TO_AGNO(mp, bmap->br_startblock),
+			type,
+			XFS_FSB_TO_AGBNO(mp, bmap->br_startblock),
+			ip->i_ino, whichfork,
+			bmap->br_startoff,
+			bmap->br_blockcount,
+			bmap->br_state);
+
+	bi = kmem_alloc(sizeof(struct xfs_bmap_intent), KM_SLEEP | KM_NOFS);
+	INIT_LIST_HEAD(&bi->bi_list);
+	bi->bi_type = type;
+	bi->bi_owner = ip;
+	bi->bi_whichfork = whichfork;
+	bi->bi_bmap = *bmap;
+
+	error = xfs_defer_join(dfops, bi->bi_owner);
+	if (error) {
+		kmem_free(bi);
+		return error;
+	}
+
+	xfs_defer_add(dfops, XFS_DEFER_OPS_TYPE_BMAP, &bi->bi_list);
+	return 0;
+}
+
+/* Map an extent into a file. */
+int
+xfs_bmap_map_extent(
+	struct xfs_mount	*mp,
+	struct xfs_defer_ops	*dfops,
+	struct xfs_inode	*ip,
+	int			whichfork,
+	struct xfs_bmbt_irec	*PREV)
+{
+	if (!xfs_bmap_is_update_needed(whichfork, PREV))
+		return 0;
+
+	return __xfs_bmap_add(mp, dfops, XFS_BMAP_MAP, ip, whichfork, PREV);
+}
+
+/* Unmap an extent out of a file. */
+int
+xfs_bmap_unmap_extent(
+	struct xfs_mount	*mp,
+	struct xfs_defer_ops	*dfops,
+	struct xfs_inode	*ip,
+	int			whichfork,
+	struct xfs_bmbt_irec	*PREV)
+{
+	if (!xfs_bmap_is_update_needed(whichfork, PREV))
+		return 0;
+
+	return __xfs_bmap_add(mp, dfops, XFS_BMAP_UNMAP, ip, whichfork, PREV);
+}
+
+/*
+ * Process one of the deferred bmap operations; the transaction and
+ * deferred ops list come along so follow-on updates can be queued.
+ */
+int
+xfs_bmap_finish_one(
+	struct xfs_trans		*tp,
+	struct xfs_defer_ops		*dfops,
+	struct xfs_inode		*ip,
+	enum xfs_bmap_intent_type	type,
+	int				whichfork,
+	xfs_fileoff_t			startoff,
+	xfs_fsblock_t			startblock,
+	xfs_filblks_t			blockcount,
+	xfs_exntst_t			state)
+{
+	struct xfs_bmbt_irec		bmap;
+	int				nimaps = 1;
+	xfs_fsblock_t			firstfsb;
+	int				done;
+	int				error = 0;
+
+	bmap.br_startblock = startblock;
+	bmap.br_startoff = startoff;
+	bmap.br_blockcount = blockcount;
+	bmap.br_state = state;
+
+	trace_xfs_bmap_deferred(tp->t_mountp,
+			XFS_FSB_TO_AGNO(tp->t_mountp, startblock), type,
+			XFS_FSB_TO_AGBNO(tp->t_mountp, startblock),
+			ip->i_ino, whichfork, startoff, blockcount, state);
+
+	if (XFS_TEST_ERROR(false, tp->t_mountp,
+			XFS_ERRTAG_BMAP_FINISH_ONE,
+			XFS_RANDOM_BMAP_FINISH_ONE))
+		return -EIO;
+
+	switch (type) {
+	case XFS_BMAP_MAP:
+		firstfsb = bmap.br_startblock;
+		error = xfs_bmapi_write(tp, ip, bmap.br_startoff,
+					bmap.br_blockcount,
+					XFS_BMAPI_REMAP, &firstfsb,
+					bmap.br_blockcount, &bmap, &nimaps,
+					dfops);
+		break;
+	case XFS_BMAP_UNMAP:
+		error = xfs_bunmapi(tp, ip, bmap.br_startoff,
+				bmap.br_blockcount, XFS_BMAPI_REMAP, 1, &firstfsb,
+				dfops, &done);
+		ASSERT(done);
+		break;
+	default:
+		ASSERT(0);
+		error = -EFSCORRUPTED;
+	}
+
+	return error;
+}
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 89c9c8e..53970b1 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -232,4 +232,15 @@ struct xfs_bmap_intent {
 	struct xfs_bmbt_irec			bi_bmap;
 };
 
+int	xfs_bmap_finish_one(struct xfs_trans *tp, struct xfs_defer_ops *dfops,
+		struct xfs_inode *ip, enum xfs_bmap_intent_type type,
+		int whichfork, xfs_fileoff_t startoff, xfs_fsblock_t startblock,
+		xfs_filblks_t blockcount, xfs_exntst_t state);
+int	xfs_bmap_map_extent(struct xfs_mount *mp, struct xfs_defer_ops *dfops,
+		struct xfs_inode *ip, int whichfork,
+		struct xfs_bmbt_irec *imap);
+int	xfs_bmap_unmap_extent(struct xfs_mount *mp, struct xfs_defer_ops *dfops,
+		struct xfs_inode *ip, int whichfork,
+		struct xfs_bmbt_irec *imap);
+
 #endif	/* __XFS_BMAP_H__ */
diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
index 4d94a86..f6e93ef 100644
--- a/fs/xfs/libxfs/xfs_defer.h
+++ b/fs/xfs/libxfs/xfs_defer.h
@@ -51,6 +51,7 @@ struct xfs_defer_pending {
  * find all the space it needs.
  */
 enum xfs_defer_ops_type {
+	XFS_DEFER_OPS_TYPE_BMAP,
 	XFS_DEFER_OPS_TYPE_REFCOUNT,
 	XFS_DEFER_OPS_TYPE_RMAP,
 	XFS_DEFER_OPS_TYPE_FREE,
diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
index 4e46b63..ddda7c3 100644
--- a/fs/xfs/xfs_bmap_item.c
+++ b/fs/xfs/xfs_bmap_item.c
@@ -389,10 +389,19 @@ xfs_bui_recover(
 	struct xfs_bui_log_item		*buip)
 {
 	int				error = 0;
+	unsigned int			bui_type;
 	struct xfs_map_extent		*bmap;
 	xfs_fsblock_t			startblock_fsb;
 	xfs_fsblock_t			inode_fsb;
 	bool				op_ok;
+	struct xfs_bud_log_item		*budp;
+	enum xfs_bmap_intent_type	type;
+	int				whichfork;
+	xfs_exntst_t			state;
+	struct xfs_trans		*tp;
+	struct xfs_inode		*ip = NULL;
+	struct xfs_defer_ops		dfops;
+	xfs_fsblock_t			firstfsb;
 
 	ASSERT(!test_bit(XFS_BUI_RECOVERED, &buip->bui_flags));
 
@@ -437,7 +446,61 @@ xfs_bui_recover(
 		return -EIO;
 	}
 
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
+	if (error)
+		return error;
+	budp = xfs_trans_get_bud(tp, buip);
+
+	/* Grab the inode. */
+	error = xfs_iget(mp, tp, bmap->me_owner, 0, XFS_ILOCK_EXCL, &ip);
+	if (error)
+		goto err_inode;
+
+	xfs_defer_init(&dfops, &firstfsb);
+
+	/* Process deferred bmap item. */
+	state = (bmap->me_flags & XFS_BMAP_EXTENT_UNWRITTEN) ?
+			XFS_EXT_UNWRITTEN : XFS_EXT_NORM;
+	whichfork = (bmap->me_flags & XFS_BMAP_EXTENT_ATTR_FORK) ?
+			XFS_ATTR_FORK : XFS_DATA_FORK;
+	bui_type = bmap->me_flags & XFS_BMAP_EXTENT_TYPE_MASK;
+	switch (bui_type) {
+	case XFS_BMAP_MAP:
+	case XFS_BMAP_UNMAP:
+		type = bui_type;
+		break;
+	default:
+		error = -EFSCORRUPTED;
+		goto err_dfops;
+	}
+	xfs_trans_ijoin(tp, ip, 0);
+
+	error = xfs_trans_log_finish_bmap_update(tp, budp, &dfops, type,
+			ip, whichfork, bmap->me_startoff,
+			bmap->me_startblock, bmap->me_len,
+			state);
+	if (error)
+		goto err_dfops;
+
+	/* Finish transaction, free inodes. */
+	error = xfs_defer_finish(&tp, &dfops, NULL);
+	if (error)
+		goto err_dfops;
+
 	set_bit(XFS_BUI_RECOVERED, &buip->bui_flags);
-	xfs_bui_release(buip);
+	error = xfs_trans_commit(tp);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	IRELE(ip);
+
+	return error;
+
+err_dfops:
+	xfs_defer_cancel(&dfops);
+err_inode:
+	xfs_trans_cancel(tp);
+	if (ip) {
+		xfs_iunlock(ip, XFS_ILOCK_EXCL);
+		IRELE(ip);
+	}
 	return error;
 }
diff --git a/fs/xfs/xfs_error.h b/fs/xfs/xfs_error.h
index 641e090..8d8e1b07 100644
--- a/fs/xfs/xfs_error.h
+++ b/fs/xfs/xfs_error.h
@@ -94,7 +94,8 @@ extern void xfs_verifier_error(struct xfs_buf *bp);
 #define XFS_ERRTAG_RMAP_FINISH_ONE			23
 #define XFS_ERRTAG_REFCOUNT_CONTINUE_UPDATE		24
 #define XFS_ERRTAG_REFCOUNT_FINISH_ONE			25
-#define XFS_ERRTAG_MAX					26
+#define XFS_ERRTAG_BMAP_FINISH_ONE			26
+#define XFS_ERRTAG_MAX					27
 
 /*
  * Random factors for above tags, 1 means always, 2 means 1/2 time, etc.
@@ -125,6 +126,7 @@ extern void xfs_verifier_error(struct xfs_buf *bp);
 #define XFS_RANDOM_RMAP_FINISH_ONE			1
 #define XFS_RANDOM_REFCOUNT_CONTINUE_UPDATE		1
 #define XFS_RANDOM_REFCOUNT_FINISH_ONE			1
+#define XFS_RANDOM_BMAP_FINISH_ONE			1
 
 #ifdef DEBUG
 extern int xfs_error_test_active;
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 071bae0..204b794 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1922,6 +1922,7 @@ init_xfs_fs(void)
 	xfs_extent_free_init_defer_op();
 	xfs_rmap_update_init_defer_op();
 	xfs_refcount_update_init_defer_op();
+	xfs_bmap_update_init_defer_op();
 
 	xfs_dir_startup();
 
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 8485984..e17a4cf 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -2587,6 +2587,11 @@ DEFINE_RMAPBT_EVENT(xfs_rmap_lookup_le_range_result);
 DEFINE_RMAPBT_EVENT(xfs_rmap_find_right_neighbor_result);
 DEFINE_RMAPBT_EVENT(xfs_rmap_find_left_neighbor_result);
 
+/* deferred bmbt updates */
+#define DEFINE_BMAP_DEFERRED_EVENT	DEFINE_RMAP_DEFERRED_EVENT
+DEFINE_BMAP_DEFERRED_EVENT(xfs_bmap_defer);
+DEFINE_BMAP_DEFERRED_EVENT(xfs_bmap_deferred);
+
 /* per-AG reservation */
 DECLARE_EVENT_CLASS(xfs_ag_resv_class,
 	TP_PROTO(struct xfs_perag *pag, enum xfs_ag_resv_type resv,
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 7cf02d3..7a4ea0c 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -40,6 +40,7 @@ struct xfs_cui_log_item;
 struct xfs_cud_log_item;
 struct xfs_defer_ops;
 struct xfs_bui_log_item;
+struct xfs_bud_log_item;
 
 typedef struct xfs_log_item {
 	struct list_head		li_ail;		/* AIL pointers */
diff --git a/fs/xfs/xfs_trans_bmap.c b/fs/xfs/xfs_trans_bmap.c
index 656d669..6408e7d 100644
--- a/fs/xfs/xfs_trans_bmap.c
+++ b/fs/xfs/xfs_trans_bmap.c
@@ -68,7 +68,8 @@ xfs_trans_log_finish_bmap_update(
 {
 	int				error;
 
-	error = -EFSCORRUPTED;
+	error = xfs_bmap_finish_one(tp, dop, ip, type, whichfork, startoff,
+			startblock, blockcount, state);
 
 	/*
 	 * Mark the transaction dirty, even on error. This ensures the
@@ -82,3 +83,167 @@ xfs_trans_log_finish_bmap_update(
 
 	return error;
 }
+
+/* Sort bmap intents by inode. */
+static int
+xfs_bmap_update_diff_items(
+	void				*priv,
+	struct list_head		*a,
+	struct list_head		*b)
+{
+	struct xfs_bmap_intent		*ba;
+	struct xfs_bmap_intent		*bb;
+
+	ba = container_of(a, struct xfs_bmap_intent, bi_list);
+	bb = container_of(b, struct xfs_bmap_intent, bi_list);
+	return ba->bi_owner->i_ino - bb->bi_owner->i_ino;
+}
+
+/* Get a BUI. */
+STATIC void *
+xfs_bmap_update_create_intent(
+	struct xfs_trans		*tp,
+	unsigned int			count)
+{
+	struct xfs_bui_log_item		*buip;
+
+	ASSERT(count == XFS_BUI_MAX_FAST_EXTENTS);
+	ASSERT(tp != NULL);
+
+	buip = xfs_bui_init(tp->t_mountp);
+	ASSERT(buip != NULL);
+
+	/*
+	 * Get a log_item_desc to point at the new item.
+	 */
+	xfs_trans_add_item(tp, &buip->bui_item);
+	return buip;
+}
+
+/* Set the map extent flags for this mapping. */
+static void
+xfs_trans_set_bmap_flags(
+	struct xfs_map_extent		*bmap,
+	enum xfs_bmap_intent_type	type,
+	int				whichfork,
+	xfs_exntst_t			state)
+{
+	bmap->me_flags = 0;
+	switch (type) {
+	case XFS_BMAP_MAP:
+	case XFS_BMAP_UNMAP:
+		bmap->me_flags = type;
+		break;
+	default:
+		ASSERT(0);
+	}
+	if (state == XFS_EXT_UNWRITTEN)
+		bmap->me_flags |= XFS_BMAP_EXTENT_UNWRITTEN;
+	if (whichfork == XFS_ATTR_FORK)
+		bmap->me_flags |= XFS_BMAP_EXTENT_ATTR_FORK;
+}
+
+/* Log bmap updates in the intent item. */
+STATIC void
+xfs_bmap_update_log_item(
+	struct xfs_trans		*tp,
+	void				*intent,
+	struct list_head		*item)
+{
+	struct xfs_bui_log_item		*buip = intent;
+	struct xfs_bmap_intent		*bmap;
+	uint				next_extent;
+	struct xfs_map_extent		*map;
+
+	bmap = container_of(item, struct xfs_bmap_intent, bi_list);
+
+	tp->t_flags |= XFS_TRANS_DIRTY;
+	buip->bui_item.li_desc->lid_flags |= XFS_LID_DIRTY;
+
+	/*
+	 * atomic_inc_return gives us the value after the increment;
+	 * we want to use it as an array index so we need to subtract 1 from
+	 * it.
+	 */
+	next_extent = atomic_inc_return(&buip->bui_next_extent) - 1;
+	ASSERT(next_extent < buip->bui_format.bui_nextents);
+	map = &buip->bui_format.bui_extents[next_extent];
+	map->me_owner = bmap->bi_owner->i_ino;
+	map->me_startblock = bmap->bi_bmap.br_startblock;
+	map->me_startoff = bmap->bi_bmap.br_startoff;
+	map->me_len = bmap->bi_bmap.br_blockcount;
+	xfs_trans_set_bmap_flags(map, bmap->bi_type, bmap->bi_whichfork,
+			bmap->bi_bmap.br_state);
+}
+
+/* Get a BUD so we can process all the deferred bmap updates. */
+STATIC void *
+xfs_bmap_update_create_done(
+	struct xfs_trans		*tp,
+	void				*intent,
+	unsigned int			count)
+{
+	return xfs_trans_get_bud(tp, intent);
+}
+
+/* Process a deferred bmap update. */
+STATIC int
+xfs_bmap_update_finish_item(
+	struct xfs_trans		*tp,
+	struct xfs_defer_ops		*dop,
+	struct list_head		*item,
+	void				*done_item,
+	void				**state)
+{
+	struct xfs_bmap_intent		*bmap;
+	int				error;
+
+	bmap = container_of(item, struct xfs_bmap_intent, bi_list);
+	error = xfs_trans_log_finish_bmap_update(tp, done_item, dop,
+			bmap->bi_type,
+			bmap->bi_owner, bmap->bi_whichfork,
+			bmap->bi_bmap.br_startoff,
+			bmap->bi_bmap.br_startblock,
+			bmap->bi_bmap.br_blockcount,
+			bmap->bi_bmap.br_state);
+	kmem_free(bmap);
+	return error;
+}
+
+/* Abort all pending BUIs. */
+STATIC void
+xfs_bmap_update_abort_intent(
+	void				*intent)
+{
+	xfs_bui_release(intent);
+}
+
+/* Cancel a deferred bmap update. */
+STATIC void
+xfs_bmap_update_cancel_item(
+	struct list_head		*item)
+{
+	struct xfs_bmap_intent		*bmap;
+
+	bmap = container_of(item, struct xfs_bmap_intent, bi_list);
+	kmem_free(bmap);
+}
+
+static const struct xfs_defer_op_type xfs_bmap_update_defer_type = {
+	.type		= XFS_DEFER_OPS_TYPE_BMAP,
+	.max_items	= XFS_BUI_MAX_FAST_EXTENTS,
+	.diff_items	= xfs_bmap_update_diff_items,
+	.create_intent	= xfs_bmap_update_create_intent,
+	.abort_intent	= xfs_bmap_update_abort_intent,
+	.log_item	= xfs_bmap_update_log_item,
+	.create_done	= xfs_bmap_update_create_done,
+	.finish_item	= xfs_bmap_update_finish_item,
+	.cancel_item	= xfs_bmap_update_cancel_item,
+};
+
+/* Register the deferred op type. */
+void
+xfs_bmap_update_init_defer_op(void)
+{
+	xfs_defer_init_op_type(&xfs_bmap_update_defer_type);
+}


^ permalink raw reply related	[flat|nested] 187+ messages in thread

* [PATCH 24/63] xfs: when replaying bmap operations, don't let unlinked inodes get reaped
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (22 preceding siblings ...)
  2016-09-30  3:08 ` [PATCH 23/63] xfs: implement deferred bmbt map/unmap operations Darrick J. Wong
@ 2016-09-30  3:08 ` Darrick J. Wong
  2016-09-30  7:35   ` Christoph Hellwig
  2016-10-03 19:04   ` Brian Foster
  2016-09-30  3:08 ` [PATCH 25/63] xfs: return work remaining at the end of a bunmapi operation Darrick J. Wong
                   ` (39 subsequent siblings)
  63 siblings, 2 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:08 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel

Log recovery will iget an inode to replay BUI items and iput the inode
when it's done.  Unfortunately, the iput will see that i_nlink == 0
and decide to truncate & free the inode, which prevents us from
replaying subsequent BUIs.  We can't skip the BUIs because we have to
replay all the redo items to ensure that atomic operations complete.

Since unlinked inode recovery will reap the inode anyway, we can
safely introduce a new inode flag to indicate that an inode is in this
'unlinked recovery' state and should not be auto-reaped in the
drop_inode path.
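The gating itself is a simple early-out: the inactivation path tests the flag and bails before the normal unlinked-inode teardown runs. A minimal userspace sketch of that control flow (the toy_* names are invented; only the flag-test-before-reap shape mirrors the patch):

```c
#include <assert.h>

/* Invented stand-in for the new inode flag bit. */
#define TOY_IRECOVER_UNLINKED	(1u << 11)

struct toy_inode {
	unsigned int	iflags;
	unsigned int	nlink;
	int		freed;
};

/*
 * Analogous to xfs_inactive(): while log recovery owns the inode,
 * skip reaping; unlinked-inode recovery will do it later.
 */
static void toy_inactive(struct toy_inode *ip)
{
	if (ip->iflags & TOY_IRECOVER_UNLINKED)
		return;			/* recovery will reap it */
	if (ip->nlink == 0)
		ip->freed = 1;		/* normal unlinked teardown */
}
```

Once xlog_recover_process_one_iunlink clears the flag, the next eviction proceeds as usual.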

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_bmap_item.c   |    1 +
 fs/xfs/xfs_inode.c       |    8 ++++++++
 fs/xfs/xfs_inode.h       |    6 ++++++
 fs/xfs/xfs_log_recover.c |    1 +
 4 files changed, 16 insertions(+)


diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
index ddda7c3..b1a220f 100644
--- a/fs/xfs/xfs_bmap_item.c
+++ b/fs/xfs/xfs_bmap_item.c
@@ -456,6 +456,7 @@ xfs_bui_recover(
 	if (error)
 		goto err_inode;
 
+	xfs_iflags_set(ip, XFS_IRECOVER_UNLINKED);
 	xfs_defer_init(&dfops, &firstfsb);
 
 	/* Process deferred bmap item. */
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index e08eaea..0c25a76 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1855,6 +1855,14 @@ xfs_inactive(
 	if (mp->m_flags & XFS_MOUNT_RDONLY)
 		return;
 
+	/*
+	 * If this unlinked inode is in the middle of recovery, don't
+	 * truncate and free the inode just yet; log recovery will take
+	 * care of that.  See the comment for this inode flag.
+	 */
+	if (xfs_iflags_test(ip, XFS_IRECOVER_UNLINKED))
+		return;
+
 	if (VFS_I(ip)->i_nlink != 0) {
 		/*
 		 * force is true because we are evicting an inode from the
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index a8658e6..46632f1 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -222,6 +222,12 @@ static inline bool xfs_is_reflink_inode(struct xfs_inode *ip)
 #define XFS_IPINNED		(1 << __XFS_IPINNED_BIT)
 #define XFS_IDONTCACHE		(1 << 9) /* don't cache the inode long term */
 #define XFS_IEOFBLOCKS		(1 << 10)/* has the preallocblocks tag set */
+/*
+ * If this unlinked inode is in the middle of recovery, don't let drop_inode
+ * truncate and free the inode.  This can happen if we iget the inode during
+ * log recovery to replay a bmap operation on the inode.
+ */
+#define XFS_IRECOVER_UNLINKED	(1 << 11)
 
 /*
  * Per-lifetime flags need to be reset when re-using a reclaimable inode during
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 9697e94..b121f02 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -4969,6 +4969,7 @@ xlog_recover_process_one_iunlink(
 	if (error)
 		goto fail_iput;
 
+	xfs_iflags_clear(ip, XFS_IRECOVER_UNLINKED);
 	ASSERT(VFS_I(ip)->i_nlink == 0);
 	ASSERT(VFS_I(ip)->i_mode != 0);
 



* [PATCH 25/63] xfs: return work remaining at the end of a bunmapi operation
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (23 preceding siblings ...)
  2016-09-30  3:08 ` [PATCH 24/63] xfs: when replaying bmap operations, don't let unlinked inodes get reaped Darrick J. Wong
@ 2016-09-30  3:08 ` Darrick J. Wong
  2016-09-30  7:19   ` Christoph Hellwig
  2016-10-03 19:04   ` Brian Foster
  2016-09-30  3:08 ` [PATCH 26/63] xfs: define tracepoints for reflink activities Darrick J. Wong
                   ` (38 subsequent siblings)
  63 siblings, 2 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:08 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel

Return the range of file blocks that bunmapi didn't free.  This hint
is used by CoW and reflink to figure out what part of an extent
actually got freed so that it can set up the appropriate atomic
remapping of just the freed range.
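The refactor turns the boolean "done" out-parameter into a remaining-length in/out parameter, keeping the old interface as a thin wrapper. A standalone sketch of just that interface change (the toy_* functions and the pretend 3-blocks-per-call limit are invented for the demo):

```c
#include <assert.h>

/*
 * Core function: takes the length by reference and writes back how
 * much work remains, mirroring the __xfs_bunmapi signature change.
 */
static int __toy_bunmapi(unsigned long bno, unsigned long *rlen)
{
	/* Pretend we can only free 3 blocks per call. */
	unsigned long freed = *rlen < 3 ? *rlen : 3;

	(void)bno;
	*rlen -= freed;
	return 0;
}

/* Old interface as a wrapper: "done" is derived from the remainder. */
static int toy_bunmapi(unsigned long bno, unsigned long len, int *done)
{
	int error = __toy_bunmapi(bno, &len);

	*done = (len == 0);
	return error;
}
```

Callers that only care about completion keep using the wrapper; the CoW/reflink paths call the core function to learn exactly how much of the range still needs remapping.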

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c |   36 ++++++++++++++++++++++++++++++------
 fs/xfs/libxfs/xfs_bmap.h |    4 ++++
 2 files changed, 34 insertions(+), 6 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index a5e429e..1e4f1a1 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -5093,17 +5093,16 @@ xfs_bmap_del_extent(
  * *done is set.
  */
 int						/* error */
-xfs_bunmapi(
+__xfs_bunmapi(
 	xfs_trans_t		*tp,		/* transaction pointer */
 	struct xfs_inode	*ip,		/* incore inode */
 	xfs_fileoff_t		bno,		/* starting offset to unmap */
-	xfs_filblks_t		len,		/* length to unmap in file */
+	xfs_filblks_t		*rlen,		/* i/o: amount remaining */
 	int			flags,		/* misc flags */
 	xfs_extnum_t		nexts,		/* number of extents max */
 	xfs_fsblock_t		*firstblock,	/* first allocated block
 						   controls a.g. for allocs */
-	struct xfs_defer_ops	*dfops,		/* i/o: list extents to free */
-	int			*done)		/* set if not done yet */
+	struct xfs_defer_ops	*dfops)		/* i/o: deferred updates */
 {
 	xfs_btree_cur_t		*cur;		/* bmap btree cursor */
 	xfs_bmbt_irec_t		del;		/* extent being deleted */
@@ -5125,6 +5124,7 @@ xfs_bunmapi(
 	int			wasdel;		/* was a delayed alloc extent */
 	int			whichfork;	/* data or attribute fork */
 	xfs_fsblock_t		sum;
+	xfs_filblks_t		len = *rlen;	/* length to unmap in file */
 
 	trace_xfs_bunmap(ip, bno, len, flags, _RET_IP_);
 
@@ -5151,7 +5151,7 @@ xfs_bunmapi(
 		return error;
 	nextents = ifp->if_bytes / (uint)sizeof(xfs_bmbt_rec_t);
 	if (nextents == 0) {
-		*done = 1;
+		*rlen = 0;
 		return 0;
 	}
 	XFS_STATS_INC(mp, xs_blk_unmap);
@@ -5422,7 +5422,10 @@ xfs_bunmapi(
 			extno++;
 		}
 	}
-	*done = bno == (xfs_fileoff_t)-1 || bno < start || lastx < 0;
+	if (bno == (xfs_fileoff_t)-1 || bno < start || lastx < 0)
+		*rlen = 0;
+	else
+		*rlen = bno - start + 1;
 
 	/*
 	 * Convert to a btree if necessary.
@@ -5478,6 +5481,27 @@ xfs_bunmapi(
 	return error;
 }
 
+/* Unmap a range of a file. */
+int
+xfs_bunmapi(
+	xfs_trans_t		*tp,
+	struct xfs_inode	*ip,
+	xfs_fileoff_t		bno,
+	xfs_filblks_t		len,
+	int			flags,
+	xfs_extnum_t		nexts,
+	xfs_fsblock_t		*firstblock,
+	struct xfs_defer_ops	*dfops,
+	int			*done)
+{
+	int			error;
+
+	error = __xfs_bunmapi(tp, ip, bno, &len, flags, nexts, firstblock,
+			dfops);
+	*done = (len == 0);
+	return error;
+}
+
 /*
  * Determine whether an extent shift can be accomplished by a merge with the
  * extent that precedes the target hole of the shift.
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 53970b1..48ba3ed 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -197,6 +197,10 @@ int	xfs_bmapi_write(struct xfs_trans *tp, struct xfs_inode *ip,
 		xfs_fsblock_t *firstblock, xfs_extlen_t total,
 		struct xfs_bmbt_irec *mval, int *nmap,
 		struct xfs_defer_ops *dfops);
+int	__xfs_bunmapi(struct xfs_trans *tp, struct xfs_inode *ip,
+		xfs_fileoff_t bno, xfs_filblks_t *rlen, int flags,
+		xfs_extnum_t nexts, xfs_fsblock_t *firstblock,
+		struct xfs_defer_ops *dfops);
 int	xfs_bunmapi(struct xfs_trans *tp, struct xfs_inode *ip,
 		xfs_fileoff_t bno, xfs_filblks_t len, int flags,
 		xfs_extnum_t nexts, xfs_fsblock_t *firstblock,



* [PATCH 26/63] xfs: define tracepoints for reflink activities
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (24 preceding siblings ...)
  2016-09-30  3:08 ` [PATCH 25/63] xfs: return work remaining at the end of a bunmapi operation Darrick J. Wong
@ 2016-09-30  3:08 ` Darrick J. Wong
  2016-09-30  7:20   ` Christoph Hellwig
  2016-09-30  3:08 ` [PATCH 27/63] xfs: add reflink feature flag to geometry Darrick J. Wong
                   ` (37 subsequent siblings)
  63 siblings, 1 reply; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:08 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel

Define all the tracepoints we need to inspect the runtime operation
of reflink/dedupe/copy-on-write.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_trace.h |  333 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 333 insertions(+)


diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index e17a4cf..5403199 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -3024,6 +3024,339 @@ TRACE_EVENT(xfs_bmap_remap_alloc,
 );
 DEFINE_INODE_ERROR_EVENT(xfs_bmap_remap_alloc_error);
 
+/* reflink tracepoint classes */
+
+/* two-file io tracepoint class */
+DECLARE_EVENT_CLASS(xfs_double_io_class,
+	TP_PROTO(struct xfs_inode *src, xfs_off_t soffset, xfs_off_t len,
+		 struct xfs_inode *dest, xfs_off_t doffset),
+	TP_ARGS(src, soffset, len, dest, doffset),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, src_ino)
+		__field(loff_t, src_isize)
+		__field(loff_t, src_disize)
+		__field(loff_t, src_offset)
+		__field(size_t, len)
+		__field(xfs_ino_t, dest_ino)
+		__field(loff_t, dest_isize)
+		__field(loff_t, dest_disize)
+		__field(loff_t, dest_offset)
+	),
+	TP_fast_assign(
+		__entry->dev = VFS_I(src)->i_sb->s_dev;
+		__entry->src_ino = src->i_ino;
+		__entry->src_isize = VFS_I(src)->i_size;
+		__entry->src_disize = src->i_d.di_size;
+		__entry->src_offset = soffset;
+		__entry->len = len;
+		__entry->dest_ino = dest->i_ino;
+		__entry->dest_isize = VFS_I(dest)->i_size;
+		__entry->dest_disize = dest->i_d.di_size;
+		__entry->dest_offset = doffset;
+	),
+	TP_printk("dev %d:%d count %zd "
+		  "ino 0x%llx isize 0x%llx disize 0x%llx offset 0x%llx -> "
+		  "ino 0x%llx isize 0x%llx disize 0x%llx offset 0x%llx",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->len,
+		  __entry->src_ino,
+		  __entry->src_isize,
+		  __entry->src_disize,
+		  __entry->src_offset,
+		  __entry->dest_ino,
+		  __entry->dest_isize,
+		  __entry->dest_disize,
+		  __entry->dest_offset)
+)
+
+#define DEFINE_DOUBLE_IO_EVENT(name)	\
+DEFINE_EVENT(xfs_double_io_class, name,	\
+	TP_PROTO(struct xfs_inode *src, xfs_off_t soffset, xfs_off_t len, \
+		 struct xfs_inode *dest, xfs_off_t doffset), \
+	TP_ARGS(src, soffset, len, dest, doffset))
+
+/* two-file vfs io tracepoint class */
+DECLARE_EVENT_CLASS(xfs_double_vfs_io_class,
+	TP_PROTO(struct inode *src, u64 soffset, u64 len,
+		 struct inode *dest, u64 doffset),
+	TP_ARGS(src, soffset, len, dest, doffset),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(unsigned long, src_ino)
+		__field(loff_t, src_isize)
+		__field(loff_t, src_offset)
+		__field(size_t, len)
+		__field(unsigned long, dest_ino)
+		__field(loff_t, dest_isize)
+		__field(loff_t, dest_offset)
+	),
+	TP_fast_assign(
+		__entry->dev = src->i_sb->s_dev;
+		__entry->src_ino = src->i_ino;
+		__entry->src_isize = i_size_read(src);
+		__entry->src_offset = soffset;
+		__entry->len = len;
+		__entry->dest_ino = dest->i_ino;
+		__entry->dest_isize = i_size_read(dest);
+		__entry->dest_offset = doffset;
+	),
+	TP_printk("dev %d:%d count %zd "
+		  "ino 0x%lx isize 0x%llx offset 0x%llx -> "
+		  "ino 0x%lx isize 0x%llx offset 0x%llx",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->len,
+		  __entry->src_ino,
+		  __entry->src_isize,
+		  __entry->src_offset,
+		  __entry->dest_ino,
+		  __entry->dest_isize,
+		  __entry->dest_offset)
+)
+
+#define DEFINE_DOUBLE_VFS_IO_EVENT(name)	\
+DEFINE_EVENT(xfs_double_vfs_io_class, name,	\
+	TP_PROTO(struct inode *src, u64 soffset, u64 len, \
+		 struct inode *dest, u64 doffset), \
+	TP_ARGS(src, soffset, len, dest, doffset))
+
+/* CoW write tracepoint */
+DECLARE_EVENT_CLASS(xfs_copy_on_write_class,
+	TP_PROTO(struct xfs_inode *ip, xfs_fileoff_t lblk, xfs_fsblock_t pblk,
+		 xfs_extlen_t len, xfs_fsblock_t new_pblk),
+	TP_ARGS(ip, lblk, pblk, len, new_pblk),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino)
+		__field(xfs_fileoff_t, lblk)
+		__field(xfs_fsblock_t, pblk)
+		__field(xfs_extlen_t, len)
+		__field(xfs_fsblock_t, new_pblk)
+	),
+	TP_fast_assign(
+		__entry->dev = VFS_I(ip)->i_sb->s_dev;
+		__entry->ino = ip->i_ino;
+		__entry->lblk = lblk;
+		__entry->pblk = pblk;
+		__entry->len = len;
+		__entry->new_pblk = new_pblk;
+	),
+	TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx pblk 0x%llx "
+		  "len 0x%x new_pblk %llu",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino,
+		  __entry->lblk,
+		  __entry->pblk,
+		  __entry->len,
+		  __entry->new_pblk)
+)
+
+#define DEFINE_COW_EVENT(name)	\
+DEFINE_EVENT(xfs_copy_on_write_class, name,	\
+	TP_PROTO(struct xfs_inode *ip, xfs_fileoff_t lblk, xfs_fsblock_t pblk, \
+		 xfs_extlen_t len, xfs_fsblock_t new_pblk), \
+	TP_ARGS(ip, lblk, pblk, len, new_pblk))
+
+/* inode/irec events */
+DECLARE_EVENT_CLASS(xfs_inode_irec_class,
+	TP_PROTO(struct xfs_inode *ip, struct xfs_bmbt_irec *irec),
+	TP_ARGS(ip, irec),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino)
+		__field(xfs_fileoff_t, lblk)
+		__field(xfs_extlen_t, len)
+		__field(xfs_fsblock_t, pblk)
+	),
+	TP_fast_assign(
+		__entry->dev = VFS_I(ip)->i_sb->s_dev;
+		__entry->ino = ip->i_ino;
+		__entry->lblk = irec->br_startoff;
+		__entry->len = irec->br_blockcount;
+		__entry->pblk = irec->br_startblock;
+	),
+	TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx len 0x%x pblk %llu",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino,
+		  __entry->lblk,
+		  __entry->len,
+		  __entry->pblk)
+);
+#define DEFINE_INODE_IREC_EVENT(name) \
+DEFINE_EVENT(xfs_inode_irec_class, name, \
+	TP_PROTO(struct xfs_inode *ip, struct xfs_bmbt_irec *irec), \
+	TP_ARGS(ip, irec))
+
+/* refcount/reflink tracepoint definitions */
+
+/* reflink tracepoints */
+DEFINE_INODE_EVENT(xfs_reflink_set_inode_flag);
+DEFINE_INODE_EVENT(xfs_reflink_unset_inode_flag);
+DEFINE_ITRUNC_EVENT(xfs_reflink_update_inode_size);
+DEFINE_IOMAP_EVENT(xfs_reflink_remap_imap);
+TRACE_EVENT(xfs_reflink_remap_blocks_loop,
+	TP_PROTO(struct xfs_inode *src, xfs_fileoff_t soffset,
+		 xfs_filblks_t len, struct xfs_inode *dest,
+		 xfs_fileoff_t doffset),
+	TP_ARGS(src, soffset, len, dest, doffset),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, src_ino)
+		__field(xfs_fileoff_t, src_lblk)
+		__field(xfs_filblks_t, len)
+		__field(xfs_ino_t, dest_ino)
+		__field(xfs_fileoff_t, dest_lblk)
+	),
+	TP_fast_assign(
+		__entry->dev = VFS_I(src)->i_sb->s_dev;
+		__entry->src_ino = src->i_ino;
+		__entry->src_lblk = soffset;
+		__entry->len = len;
+		__entry->dest_ino = dest->i_ino;
+		__entry->dest_lblk = doffset;
+	),
+	TP_printk("dev %d:%d len 0x%llx "
+		  "ino 0x%llx offset 0x%llx blocks -> "
+		  "ino 0x%llx offset 0x%llx blocks",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->len,
+		  __entry->src_ino,
+		  __entry->src_lblk,
+		  __entry->dest_ino,
+		  __entry->dest_lblk)
+);
+TRACE_EVENT(xfs_reflink_punch_range,
+	TP_PROTO(struct xfs_inode *ip, xfs_fileoff_t lblk,
+		 xfs_extlen_t len),
+	TP_ARGS(ip, lblk, len),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino)
+		__field(xfs_fileoff_t, lblk)
+		__field(xfs_extlen_t, len)
+	),
+	TP_fast_assign(
+		__entry->dev = VFS_I(ip)->i_sb->s_dev;
+		__entry->ino = ip->i_ino;
+		__entry->lblk = lblk;
+		__entry->len = len;
+	),
+	TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx len 0x%x",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino,
+		  __entry->lblk,
+		  __entry->len)
+);
+TRACE_EVENT(xfs_reflink_remap,
+	TP_PROTO(struct xfs_inode *ip, xfs_fileoff_t lblk,
+		 xfs_extlen_t len, xfs_fsblock_t new_pblk),
+	TP_ARGS(ip, lblk, len, new_pblk),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino)
+		__field(xfs_fileoff_t, lblk)
+		__field(xfs_extlen_t, len)
+		__field(xfs_fsblock_t, new_pblk)
+	),
+	TP_fast_assign(
+		__entry->dev = VFS_I(ip)->i_sb->s_dev;
+		__entry->ino = ip->i_ino;
+		__entry->lblk = lblk;
+		__entry->len = len;
+		__entry->new_pblk = new_pblk;
+	),
+	TP_printk("dev %d:%d ino 0x%llx lblk 0x%llx len 0x%x new_pblk %llu",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino,
+		  __entry->lblk,
+		  __entry->len,
+		  __entry->new_pblk)
+);
+DEFINE_DOUBLE_IO_EVENT(xfs_reflink_remap_range);
+DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_range_error);
+DEFINE_INODE_ERROR_EVENT(xfs_reflink_set_inode_flag_error);
+DEFINE_INODE_ERROR_EVENT(xfs_reflink_update_inode_size_error);
+DEFINE_INODE_ERROR_EVENT(xfs_reflink_reflink_main_loop_error);
+DEFINE_INODE_ERROR_EVENT(xfs_reflink_read_iomap_error);
+DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_blocks_error);
+DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_extent_error);
+
+/* dedupe tracepoints */
+DEFINE_DOUBLE_IO_EVENT(xfs_reflink_compare_extents);
+DEFINE_INODE_ERROR_EVENT(xfs_reflink_compare_extents_error);
+
+/* ioctl tracepoints */
+DEFINE_DOUBLE_VFS_IO_EVENT(xfs_ioctl_reflink);
+DEFINE_DOUBLE_VFS_IO_EVENT(xfs_ioctl_clone_range);
+DEFINE_DOUBLE_VFS_IO_EVENT(xfs_ioctl_file_extent_same);
+TRACE_EVENT(xfs_ioctl_clone,
+	TP_PROTO(struct inode *src, struct inode *dest),
+	TP_ARGS(src, dest),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(unsigned long, src_ino)
+		__field(loff_t, src_isize)
+		__field(unsigned long, dest_ino)
+		__field(loff_t, dest_isize)
+	),
+	TP_fast_assign(
+		__entry->dev = src->i_sb->s_dev;
+		__entry->src_ino = src->i_ino;
+		__entry->src_isize = i_size_read(src);
+		__entry->dest_ino = dest->i_ino;
+		__entry->dest_isize = i_size_read(dest);
+	),
+	TP_printk("dev %d:%d "
+		  "ino 0x%lx isize 0x%llx -> "
+		  "ino 0x%lx isize 0x%llx",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->src_ino,
+		  __entry->src_isize,
+		  __entry->dest_ino,
+		  __entry->dest_isize)
+);
+
+/* unshare tracepoints */
+DEFINE_SIMPLE_IO_EVENT(xfs_reflink_unshare);
+DEFINE_SIMPLE_IO_EVENT(xfs_reflink_cow_eof_block);
+DEFINE_PAGE_EVENT(xfs_reflink_unshare_page);
+DEFINE_INODE_ERROR_EVENT(xfs_reflink_unshare_error);
+DEFINE_INODE_ERROR_EVENT(xfs_reflink_cow_eof_block_error);
+DEFINE_INODE_ERROR_EVENT(xfs_reflink_dirty_page_error);
+
+/* copy on write */
+DEFINE_INODE_IREC_EVENT(xfs_reflink_trim_around_shared);
+
+DEFINE_RW_EVENT(xfs_reflink_reserve_cow_range);
+DEFINE_INODE_IREC_EVENT(xfs_reflink_reserve_cow_extent);
+DEFINE_RW_EVENT(xfs_reflink_allocate_cow_range);
+DEFINE_INODE_IREC_EVENT(xfs_reflink_allocate_cow_extent);
+
+DEFINE_INODE_IREC_EVENT(xfs_reflink_bounce_dio_write);
+DEFINE_IOMAP_EVENT(xfs_reflink_find_cow_mapping);
+DEFINE_INODE_IREC_EVENT(xfs_reflink_trim_irec);
+DEFINE_SIMPLE_IO_EVENT(xfs_iomap_cow_delay);
+
+DEFINE_SIMPLE_IO_EVENT(xfs_reflink_cancel_cow_range);
+DEFINE_SIMPLE_IO_EVENT(xfs_reflink_end_cow);
+DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_remap);
+DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_remap_piece);
+
+DEFINE_INODE_ERROR_EVENT(xfs_reflink_reserve_cow_range_error);
+DEFINE_INODE_ERROR_EVENT(xfs_reflink_reserve_cow_extent_error);
+DEFINE_INODE_ERROR_EVENT(xfs_reflink_allocate_cow_range_error);
+DEFINE_INODE_ERROR_EVENT(xfs_reflink_cancel_cow_range_error);
+DEFINE_INODE_ERROR_EVENT(xfs_reflink_end_cow_error);
+
+DEFINE_COW_EVENT(xfs_reflink_fork_buf);
+DEFINE_COW_EVENT(xfs_reflink_finish_fork_buf);
+DEFINE_INODE_ERROR_EVENT(xfs_reflink_fork_buf_error);
+DEFINE_INODE_ERROR_EVENT(xfs_reflink_finish_fork_buf_error);
+
+DEFINE_INODE_EVENT(xfs_reflink_cancel_pending_cow);
+DEFINE_INODE_IREC_EVENT(xfs_reflink_cancel_cow);
+DEFINE_INODE_ERROR_EVENT(xfs_reflink_cancel_pending_cow_error);
+
 #endif /* _TRACE_XFS_H */
 
 #undef TRACE_INCLUDE_PATH



* [PATCH 27/63] xfs: add reflink feature flag to geometry
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (25 preceding siblings ...)
  2016-09-30  3:08 ` [PATCH 26/63] xfs: define tracepoints for reflink activities Darrick J. Wong
@ 2016-09-30  3:08 ` Darrick J. Wong
  2016-09-30  7:20   ` Christoph Hellwig
  2016-09-30  3:08 ` [PATCH 28/63] xfs: don't allow reflinked dir/dev/fifo/socket/pipe files Darrick J. Wong
                   ` (36 subsequent siblings)
  63 siblings, 1 reply; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:08 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel

Report the reflink feature in the XFS geometry so that xfs_info and
friends know the filesystem has this feature.
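
For illustration, userspace can probe the new flag from the geometry
ioctl; the sketch below models just the bit test, with the flag value
copied from this patch.  The geom_has_reflink() helper is hypothetical;
real code would fill the flags word via ioctl(fd, XFS_IOC_FSGEOMETRY, &geo):

```c
#include <assert.h>

/* Flag value as defined in this patch's xfs_fs.h hunk. */
#define XFS_FSOP_GEOM_FLAGS_REFLINK	0x100000	/* files can share blocks */

/*
 * Test the flags word that XFS_IOC_FSGEOMETRY fills into
 * struct xfs_fsop_geom::flags.
 */
static int geom_has_reflink(unsigned int geom_flags)
{
	return (geom_flags & XFS_FSOP_GEOM_FLAGS_REFLINK) != 0;
}
```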

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_fs.h |    3 ++-
 fs/xfs/xfs_fsops.c     |    4 +++-
 2 files changed, 5 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 7945505..6f4f2c3 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -206,7 +206,8 @@ typedef struct xfs_fsop_resblks {
 #define XFS_FSOP_GEOM_FLAGS_FTYPE	0x10000	/* inode directory types */
 #define XFS_FSOP_GEOM_FLAGS_FINOBT	0x20000	/* free inode btree */
 #define XFS_FSOP_GEOM_FLAGS_SPINODES	0x40000	/* sparse inode chunks	*/
-#define XFS_FSOP_GEOM_FLAGS_RMAPBT	0x80000	/* Reverse mapping btree */
+#define XFS_FSOP_GEOM_FLAGS_RMAPBT	0x80000	/* reverse mapping btree */
+#define XFS_FSOP_GEOM_FLAGS_REFLINK	0x100000 /* files can share blocks */
 
 /*
  * Minimum and maximum sizes need for growth checks.
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 4b4059b..3acbf4e0 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -108,7 +108,9 @@ xfs_fs_geometry(
 			(xfs_sb_version_hassparseinodes(&mp->m_sb) ?
 				XFS_FSOP_GEOM_FLAGS_SPINODES : 0) |
 			(xfs_sb_version_hasrmapbt(&mp->m_sb) ?
-				XFS_FSOP_GEOM_FLAGS_RMAPBT : 0);
+				XFS_FSOP_GEOM_FLAGS_RMAPBT : 0) |
+			(xfs_sb_version_hasreflink(&mp->m_sb) ?
+				XFS_FSOP_GEOM_FLAGS_REFLINK : 0);
 		geo->logsectsize = xfs_sb_version_hassector(&mp->m_sb) ?
 				mp->m_sb.sb_logsectsize : BBSIZE;
 		geo->rtsectsize = mp->m_sb.sb_blocksize;



* [PATCH 28/63] xfs: don't allow reflinked dir/dev/fifo/socket/pipe files
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (26 preceding siblings ...)
  2016-09-30  3:08 ` [PATCH 27/63] xfs: add reflink feature flag to geometry Darrick J. Wong
@ 2016-09-30  3:08 ` Darrick J. Wong
  2016-09-30  7:20   ` Christoph Hellwig
  2016-09-30  3:08 ` [PATCH 29/63] xfs: introduce the CoW fork Darrick J. Wong
                   ` (35 subsequent siblings)
  63 siblings, 1 reply; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:08 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel

Only regular, non-realtime files can be reflinked, so check both
conditions when we load an inode.  Also, don't leak the attr fork if
there's a failure.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_inode_fork.c |   23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
index bbcc8c7..7699a03 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.c
+++ b/fs/xfs/libxfs/xfs_inode_fork.c
@@ -121,6 +121,26 @@ xfs_iformat_fork(
 		return -EFSCORRUPTED;
 	}
 
+	if (unlikely(xfs_is_reflink_inode(ip) &&
+	    (VFS_I(ip)->i_mode & S_IFMT) != S_IFREG)) {
+		xfs_warn(ip->i_mount,
+			"corrupt dinode %llu, wrong file type for reflink.",
+			ip->i_ino);
+		XFS_CORRUPTION_ERROR("xfs_iformat(reflink)",
+				     XFS_ERRLEVEL_LOW, ip->i_mount, dip);
+		return -EFSCORRUPTED;
+	}
+
+	if (unlikely(xfs_is_reflink_inode(ip) &&
+	    (ip->i_d.di_flags & XFS_DIFLAG_REALTIME))) {
+		xfs_warn(ip->i_mount,
+			"corrupt dinode %llu, has reflink+realtime flag set.",
+			ip->i_ino);
+		XFS_CORRUPTION_ERROR("xfs_iformat(reflink)",
+				     XFS_ERRLEVEL_LOW, ip->i_mount, dip);
+		return -EFSCORRUPTED;
+	}
+
 	switch (VFS_I(ip)->i_mode & S_IFMT) {
 	case S_IFIFO:
 	case S_IFCHR:
@@ -208,7 +228,8 @@ xfs_iformat_fork(
 			XFS_CORRUPTION_ERROR("xfs_iformat(8)",
 					     XFS_ERRLEVEL_LOW,
 					     ip->i_mount, dip);
-			return -EFSCORRUPTED;
+			error = -EFSCORRUPTED;
+			break;
 		}
 
 		error = xfs_iformat_local(ip, dip, XFS_ATTR_FORK, size);



* [PATCH 29/63] xfs: introduce the CoW fork
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (27 preceding siblings ...)
  2016-09-30  3:08 ` [PATCH 28/63] xfs: don't allow reflinked dir/dev/fifo/socket/pipe files Darrick J. Wong
@ 2016-09-30  3:08 ` Darrick J. Wong
  2016-09-30  7:39   ` Christoph Hellwig
  2016-09-30  3:08 ` [PATCH 30/63] xfs: support bmapping delalloc extents in " Darrick J. Wong
                   ` (34 subsequent siblings)
  63 siblings, 1 reply; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:08 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel

Introduce a new in-core fork for storing copy-on-write delalloc
reservations and allocated extents that are in the process of being
written out.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
v2: fix up bmapi_read so that we can query the CoW fork, and have it
return a "hole" extent if there's no CoW fork.
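
The fork selection that makes this work can be modeled in isolation.
The fork numbers and XFS_BMAPI_COWFORK come from this series;
XFS_BMAPI_ATTRFORK's value is from xfs_bmap.h and not shown in this
hunk.  model_whichfork() mirrors the new xfs_bmapi_whichfork() helper,
where the CoW fork takes precedence and the data fork is the default:

```c
#include <assert.h>

/* Fork indices from xfs_types.h (XFS_COW_FORK is new in this series). */
#define XFS_DATA_FORK		0
#define XFS_ATTR_FORK		1
#define XFS_COW_FORK		2

#define XFS_BMAPI_ATTRFORK	0x004	/* from xfs_bmap.h, not this hunk */
#define XFS_BMAPI_COWFORK	0x200	/* from this patch */

/* Mirror of xfs_bmapi_whichfork(): COWFORK wins, then ATTRFORK. */
static int model_whichfork(int bmapi_flags)
{
	if (bmapi_flags & XFS_BMAPI_COWFORK)
		return XFS_COW_FORK;
	if (bmapi_flags & XFS_BMAPI_ATTRFORK)
		return XFS_ATTR_FORK;
	return XFS_DATA_FORK;
}
```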
---
 fs/xfs/Makefile                |    1 
 fs/xfs/libxfs/xfs_bmap.c       |   27 +++++++--
 fs/xfs/libxfs/xfs_bmap.h       |   22 +++++++-
 fs/xfs/libxfs/xfs_bmap_btree.c |    1 
 fs/xfs/libxfs/xfs_inode_fork.c |   47 +++++++++++++++-
 fs/xfs/libxfs/xfs_inode_fork.h |   28 ++++++++--
 fs/xfs/libxfs/xfs_rmap.c       |   15 +++--
 fs/xfs/libxfs/xfs_types.h      |    1 
 fs/xfs/xfs_icache.c            |    5 ++
 fs/xfs/xfs_inode.h             |    4 +
 fs/xfs/xfs_reflink.c           |  114 ++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_reflink.h           |   23 ++++++++
 fs/xfs/xfs_trace.h             |    4 +
 13 files changed, 264 insertions(+), 28 deletions(-)
 create mode 100644 fs/xfs/xfs_reflink.c
 create mode 100644 fs/xfs/xfs_reflink.h


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 6afb228..26ef195 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -90,6 +90,7 @@ xfs-y				+= xfs_aops.o \
 				   xfs_message.o \
 				   xfs_mount.o \
 				   xfs_mru_cache.o \
+				   xfs_reflink.o \
 				   xfs_stats.o \
 				   xfs_super.o \
 				   xfs_symlink.o \
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 1e4f1a1..3388058 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -2924,6 +2924,7 @@ xfs_bmap_add_extent_hole_real(
 	ASSERT(!isnullstartblock(new->br_startblock));
 	ASSERT(!bma->cur ||
 	       !(bma->cur->bc_private.b.flags & XFS_BTCUR_BPRV_WASDEL));
+	ASSERT(whichfork != XFS_COW_FORK);
 
 	XFS_STATS_INC(mp, xs_add_exlist);
 
@@ -4064,12 +4065,11 @@ xfs_bmapi_read(
 	int			error;
 	int			eof;
 	int			n = 0;
-	int			whichfork = (flags & XFS_BMAPI_ATTRFORK) ?
-						XFS_ATTR_FORK : XFS_DATA_FORK;
+	int			whichfork = xfs_bmapi_whichfork(flags);
 
 	ASSERT(*nmap >= 1);
 	ASSERT(!(flags & ~(XFS_BMAPI_ATTRFORK|XFS_BMAPI_ENTIRE|
-			   XFS_BMAPI_IGSTATE)));
+			   XFS_BMAPI_IGSTATE|XFS_BMAPI_COWFORK)));
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_SHARED|XFS_ILOCK_EXCL));
 
 	if (unlikely(XFS_TEST_ERROR(
@@ -4087,6 +4087,16 @@ xfs_bmapi_read(
 
 	ifp = XFS_IFORK_PTR(ip, whichfork);
 
+	/* No CoW fork?  Return a hole. */
+	if (whichfork == XFS_COW_FORK && !ifp) {
+		mval->br_startoff = bno;
+		mval->br_startblock = HOLESTARTBLOCK;
+		mval->br_blockcount = len;
+		mval->br_state = XFS_EXT_NORM;
+		*nmap = 1;
+		return 0;
+	}
+
 	if (!(ifp->if_flags & XFS_IFEXTENTS)) {
 		error = xfs_iread_extents(NULL, ip, whichfork);
 		if (error)
@@ -4360,8 +4370,7 @@ xfs_bmapi_convert_unwritten(
 	xfs_filblks_t		len,
 	int			flags)
 {
-	int			whichfork = (flags & XFS_BMAPI_ATTRFORK) ?
-						XFS_ATTR_FORK : XFS_DATA_FORK;
+	int			whichfork = xfs_bmapi_whichfork(flags);
 	struct xfs_ifork	*ifp = XFS_IFORK_PTR(bma->ip, whichfork);
 	int			tmp_logflags = 0;
 	int			error;
@@ -4377,6 +4386,8 @@ xfs_bmapi_convert_unwritten(
 			(XFS_BMAPI_PREALLOC | XFS_BMAPI_CONVERT))
 		return 0;
 
+	ASSERT(whichfork != XFS_COW_FORK);
+
 	/*
 	 * Modify (by adding) the state flag, if writing.
 	 */
@@ -4790,6 +4801,8 @@ xfs_bmap_del_extent(
 
 	if (whichfork == XFS_ATTR_FORK)
 		state |= BMAP_ATTRFORK;
+	else if (whichfork == XFS_COW_FORK)
+		state |= BMAP_COWFORK;
 
 	ifp = XFS_IFORK_PTR(ip, whichfork);
 	ASSERT((*idx >= 0) && (*idx < ifp->if_bytes /
@@ -5128,8 +5141,8 @@ __xfs_bunmapi(
 
 	trace_xfs_bunmap(ip, bno, len, flags, _RET_IP_);
 
-	whichfork = (flags & XFS_BMAPI_ATTRFORK) ?
-		XFS_ATTR_FORK : XFS_DATA_FORK;
+	whichfork = xfs_bmapi_whichfork(flags);
+	ASSERT(whichfork != XFS_COW_FORK);
 	ifp = XFS_IFORK_PTR(ip, whichfork);
 	if (unlikely(
 	    XFS_IFORK_FORMAT(ip, whichfork) != XFS_DINODE_FMT_EXTENTS &&
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 48ba3ed..adb64fb 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -107,6 +107,9 @@ struct xfs_extent_free_item
  */
 #define XFS_BMAPI_REMAP		0x100
 
+/* Map something in the CoW fork. */
+#define XFS_BMAPI_COWFORK	0x200
+
 #define XFS_BMAPI_FLAGS \
 	{ XFS_BMAPI_ENTIRE,	"ENTIRE" }, \
 	{ XFS_BMAPI_METADATA,	"METADATA" }, \
@@ -116,12 +119,23 @@ struct xfs_extent_free_item
 	{ XFS_BMAPI_CONTIG,	"CONTIG" }, \
 	{ XFS_BMAPI_CONVERT,	"CONVERT" }, \
 	{ XFS_BMAPI_ZERO,	"ZERO" }, \
-	{ XFS_BMAPI_REMAP,	"REMAP" }
+	{ XFS_BMAPI_REMAP,	"REMAP" }, \
+	{ XFS_BMAPI_COWFORK,	"COWFORK" }
 
 
 static inline int xfs_bmapi_aflag(int w)
 {
-	return (w == XFS_ATTR_FORK ? XFS_BMAPI_ATTRFORK : 0);
+	return (w == XFS_ATTR_FORK ? XFS_BMAPI_ATTRFORK :
+	       (w == XFS_COW_FORK ? XFS_BMAPI_COWFORK : 0));
+}
+
+static inline int xfs_bmapi_whichfork(int bmapi_flags)
+{
+	if (bmapi_flags & XFS_BMAPI_COWFORK)
+		return XFS_COW_FORK;
+	else if (bmapi_flags & XFS_BMAPI_ATTRFORK)
+		return XFS_ATTR_FORK;
+	return XFS_DATA_FORK;
 }
 
 /*
@@ -142,13 +156,15 @@ static inline int xfs_bmapi_aflag(int w)
 #define BMAP_LEFT_VALID		(1 << 6)
 #define BMAP_RIGHT_VALID	(1 << 7)
 #define BMAP_ATTRFORK		(1 << 8)
+#define BMAP_COWFORK		(1 << 9)
 
 #define XFS_BMAP_EXT_FLAGS \
 	{ BMAP_LEFT_CONTIG,	"LC" }, \
 	{ BMAP_RIGHT_CONTIG,	"RC" }, \
 	{ BMAP_LEFT_FILLING,	"LF" }, \
 	{ BMAP_RIGHT_FILLING,	"RF" }, \
-	{ BMAP_ATTRFORK,	"ATTR" }
+	{ BMAP_ATTRFORK,	"ATTR" }, \
+	{ BMAP_COWFORK,		"COW" }
 
 
 /*
diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
index cd85274..37f0d9d 100644
--- a/fs/xfs/libxfs/xfs_bmap_btree.c
+++ b/fs/xfs/libxfs/xfs_bmap_btree.c
@@ -777,6 +777,7 @@ xfs_bmbt_init_cursor(
 {
 	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
 	struct xfs_btree_cur	*cur;
+	ASSERT(whichfork != XFS_COW_FORK);
 
 	cur = kmem_zone_zalloc(xfs_btree_cur_zone, KM_SLEEP);
 
diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
index 7699a03..d29954a 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.c
+++ b/fs/xfs/libxfs/xfs_inode_fork.c
@@ -206,9 +206,14 @@ xfs_iformat_fork(
 		XFS_ERROR_REPORT("xfs_iformat(7)", XFS_ERRLEVEL_LOW, ip->i_mount);
 		return -EFSCORRUPTED;
 	}
-	if (error) {
+	if (error)
 		return error;
+
+	if (xfs_is_reflink_inode(ip)) {
+		ASSERT(ip->i_cowfp == NULL);
+		xfs_ifork_init_cow(ip);
 	}
+
 	if (!XFS_DFORK_Q(dip))
 		return 0;
 
@@ -247,6 +252,9 @@ xfs_iformat_fork(
 	if (error) {
 		kmem_zone_free(xfs_ifork_zone, ip->i_afp);
 		ip->i_afp = NULL;
+		if (ip->i_cowfp)
+			kmem_zone_free(xfs_ifork_zone, ip->i_cowfp);
+		ip->i_cowfp = NULL;
 		xfs_idestroy_fork(ip, XFS_DATA_FORK);
 	}
 	return error;
@@ -761,6 +769,9 @@ xfs_idestroy_fork(
 	if (whichfork == XFS_ATTR_FORK) {
 		kmem_zone_free(xfs_ifork_zone, ip->i_afp);
 		ip->i_afp = NULL;
+	} else if (whichfork == XFS_COW_FORK) {
+		kmem_zone_free(xfs_ifork_zone, ip->i_cowfp);
+		ip->i_cowfp = NULL;
 	}
 }
 
@@ -948,6 +959,19 @@ xfs_iext_get_ext(
 	}
 }
 
+/* XFS_IEXT_STATE_TO_FORK() -- Convert BMAP state flags to an inode fork. */
+xfs_ifork_t *
+XFS_IEXT_STATE_TO_FORK(
+	struct xfs_inode	*ip,
+	int			state)
+{
+	if (state & BMAP_COWFORK)
+		return ip->i_cowfp;
+	else if (state & BMAP_ATTRFORK)
+		return ip->i_afp;
+	return &ip->i_df;
+}
+
 /*
  * Insert new item(s) into the extent records for incore inode
  * fork 'ifp'.  'count' new items are inserted at index 'idx'.
@@ -960,7 +984,7 @@ xfs_iext_insert(
 	xfs_bmbt_irec_t	*new,		/* items to insert */
 	int		state)		/* type of extent conversion */
 {
-	xfs_ifork_t	*ifp = (state & BMAP_ATTRFORK) ? ip->i_afp : &ip->i_df;
+	xfs_ifork_t	*ifp = XFS_IEXT_STATE_TO_FORK(ip, state);
 	xfs_extnum_t	i;		/* extent record index */
 
 	trace_xfs_iext_insert(ip, idx, new, state, _RET_IP_);
@@ -1210,7 +1234,7 @@ xfs_iext_remove(
 	int		ext_diff,	/* number of extents to remove */
 	int		state)		/* type of extent conversion */
 {
-	xfs_ifork_t	*ifp = (state & BMAP_ATTRFORK) ? ip->i_afp : &ip->i_df;
+	xfs_ifork_t	*ifp = XFS_IEXT_STATE_TO_FORK(ip, state);
 	xfs_extnum_t	nextents;	/* number of extents in file */
 	int		new_size;	/* size of extents after removal */
 
@@ -1955,3 +1979,20 @@ xfs_iext_irec_update_extoffs(
 		ifp->if_u1.if_ext_irec[i].er_extoff += ext_diff;
 	}
 }
+
+/*
+ * Initialize an inode's copy-on-write fork.
+ */
+void
+xfs_ifork_init_cow(
+	struct xfs_inode	*ip)
+{
+	if (ip->i_cowfp)
+		return;
+
+	ip->i_cowfp = kmem_zone_zalloc(xfs_ifork_zone,
+				       KM_SLEEP | KM_NOFS);
+	ip->i_cowfp->if_flags = XFS_IFEXTENTS;
+	ip->i_cformat = XFS_DINODE_FMT_EXTENTS;
+	ip->i_cnextents = 0;
+}
diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h
index f95e072..44d38eb 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.h
+++ b/fs/xfs/libxfs/xfs_inode_fork.h
@@ -92,7 +92,9 @@ typedef struct xfs_ifork {
 #define XFS_IFORK_PTR(ip,w)		\
 	((w) == XFS_DATA_FORK ? \
 		&(ip)->i_df : \
-		(ip)->i_afp)
+		((w) == XFS_ATTR_FORK ? \
+			(ip)->i_afp : \
+			(ip)->i_cowfp))
 #define XFS_IFORK_DSIZE(ip) \
 	(XFS_IFORK_Q(ip) ? \
 		XFS_IFORK_BOFF(ip) : \
@@ -105,26 +107,38 @@ typedef struct xfs_ifork {
 #define XFS_IFORK_SIZE(ip,w) \
 	((w) == XFS_DATA_FORK ? \
 		XFS_IFORK_DSIZE(ip) : \
-		XFS_IFORK_ASIZE(ip))
+		((w) == XFS_ATTR_FORK ? \
+			XFS_IFORK_ASIZE(ip) : \
+			0))
 #define XFS_IFORK_FORMAT(ip,w) \
 	((w) == XFS_DATA_FORK ? \
 		(ip)->i_d.di_format : \
-		(ip)->i_d.di_aformat)
+		((w) == XFS_ATTR_FORK ? \
+			(ip)->i_d.di_aformat : \
+			(ip)->i_cformat))
 #define XFS_IFORK_FMT_SET(ip,w,n) \
 	((w) == XFS_DATA_FORK ? \
 		((ip)->i_d.di_format = (n)) : \
-		((ip)->i_d.di_aformat = (n)))
+		((w) == XFS_ATTR_FORK ? \
+			((ip)->i_d.di_aformat = (n)) : \
+			((ip)->i_cformat = (n))))
 #define XFS_IFORK_NEXTENTS(ip,w) \
 	((w) == XFS_DATA_FORK ? \
 		(ip)->i_d.di_nextents : \
-		(ip)->i_d.di_anextents)
+		((w) == XFS_ATTR_FORK ? \
+			(ip)->i_d.di_anextents : \
+			(ip)->i_cnextents))
 #define XFS_IFORK_NEXT_SET(ip,w,n) \
 	((w) == XFS_DATA_FORK ? \
 		((ip)->i_d.di_nextents = (n)) : \
-		((ip)->i_d.di_anextents = (n)))
+		((w) == XFS_ATTR_FORK ? \
+			((ip)->i_d.di_anextents = (n)) : \
+			((ip)->i_cnextents = (n))))
 #define XFS_IFORK_MAXEXT(ip, w) \
 	(XFS_IFORK_SIZE(ip, w) / sizeof(xfs_bmbt_rec_t))
 
+xfs_ifork_t	*XFS_IEXT_STATE_TO_FORK(struct xfs_inode *ip, int state);
+
 int		xfs_iformat_fork(struct xfs_inode *, struct xfs_dinode *);
 void		xfs_iflush_fork(struct xfs_inode *, struct xfs_dinode *,
 				struct xfs_inode_log_item *, int);
@@ -169,4 +183,6 @@ void		xfs_iext_irec_update_extoffs(struct xfs_ifork *, int, int);
 
 extern struct kmem_zone	*xfs_ifork_zone;
 
+extern void xfs_ifork_init_cow(struct xfs_inode *ip);
+
 #endif	/* __XFS_INODE_FORK_H__ */
diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
index 73d0540..1c40b85 100644
--- a/fs/xfs/libxfs/xfs_rmap.c
+++ b/fs/xfs/libxfs/xfs_rmap.c
@@ -1263,9 +1263,10 @@ xfs_rmap_finish_one(
  */
 static bool
 xfs_rmap_update_is_needed(
-	struct xfs_mount	*mp)
+	struct xfs_mount	*mp,
+	int			whichfork)
 {
-	return xfs_sb_version_hasrmapbt(&mp->m_sb);
+	return xfs_sb_version_hasrmapbt(&mp->m_sb) && whichfork != XFS_COW_FORK;
 }
 
 /*
@@ -1311,7 +1312,7 @@ xfs_rmap_map_extent(
 	int			whichfork,
 	struct xfs_bmbt_irec	*PREV)
 {
-	if (!xfs_rmap_update_is_needed(mp))
+	if (!xfs_rmap_update_is_needed(mp, whichfork))
 		return 0;
 
 	return __xfs_rmap_add(mp, dfops, XFS_RMAP_MAP, ip->i_ino,
@@ -1327,7 +1328,7 @@ xfs_rmap_unmap_extent(
 	int			whichfork,
 	struct xfs_bmbt_irec	*PREV)
 {
-	if (!xfs_rmap_update_is_needed(mp))
+	if (!xfs_rmap_update_is_needed(mp, whichfork))
 		return 0;
 
 	return __xfs_rmap_add(mp, dfops, XFS_RMAP_UNMAP, ip->i_ino,
@@ -1343,7 +1344,7 @@ xfs_rmap_convert_extent(
 	int			whichfork,
 	struct xfs_bmbt_irec	*PREV)
 {
-	if (!xfs_rmap_update_is_needed(mp))
+	if (!xfs_rmap_update_is_needed(mp, whichfork))
 		return 0;
 
 	return __xfs_rmap_add(mp, dfops, XFS_RMAP_CONVERT, ip->i_ino,
@@ -1362,7 +1363,7 @@ xfs_rmap_alloc_extent(
 {
 	struct xfs_bmbt_irec	bmap;
 
-	if (!xfs_rmap_update_is_needed(mp))
+	if (!xfs_rmap_update_is_needed(mp, XFS_DATA_FORK))
 		return 0;
 
 	bmap.br_startblock = XFS_AGB_TO_FSB(mp, agno, bno);
@@ -1386,7 +1387,7 @@ xfs_rmap_free_extent(
 {
 	struct xfs_bmbt_irec	bmap;
 
-	if (!xfs_rmap_update_is_needed(mp))
+	if (!xfs_rmap_update_is_needed(mp, XFS_DATA_FORK))
 		return 0;
 
 	bmap.br_startblock = XFS_AGB_TO_FSB(mp, agno, bno);
diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
index be7b6de..8d74870 100644
--- a/fs/xfs/libxfs/xfs_types.h
+++ b/fs/xfs/libxfs/xfs_types.h
@@ -90,6 +90,7 @@ typedef __int64_t	xfs_sfiloff_t;	/* signed block number in a file */
  */
 #define	XFS_DATA_FORK	0
 #define	XFS_ATTR_FORK	1
+#define	XFS_COW_FORK	2
 
 /*
  * Min numbers of data/attr fork btree root pointers.
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 65b2e3f..2d3de02 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -76,6 +76,9 @@ xfs_inode_alloc(
 	ip->i_mount = mp;
 	memset(&ip->i_imap, 0, sizeof(struct xfs_imap));
 	ip->i_afp = NULL;
+	ip->i_cowfp = NULL;
+	ip->i_cnextents = 0;
+	ip->i_cformat = XFS_DINODE_FMT_EXTENTS;
 	memset(&ip->i_df, 0, sizeof(xfs_ifork_t));
 	ip->i_flags = 0;
 	ip->i_delayed_blks = 0;
@@ -101,6 +104,8 @@ xfs_inode_free_callback(
 
 	if (ip->i_afp)
 		xfs_idestroy_fork(ip, XFS_ATTR_FORK);
+	if (ip->i_cowfp)
+		xfs_idestroy_fork(ip, XFS_COW_FORK);
 
 	if (ip->i_itemp) {
 		ASSERT(!(ip->i_itemp->ili_item.li_flags & XFS_LI_IN_AIL));
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 46632f1..1af1d8d 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -47,6 +47,7 @@ typedef struct xfs_inode {
 
 	/* Extent information. */
 	xfs_ifork_t		*i_afp;		/* attribute fork pointer */
+	xfs_ifork_t		*i_cowfp;	/* copy on write extents */
 	xfs_ifork_t		i_df;		/* data fork */
 
 	/* operations vectors */
@@ -65,6 +66,9 @@ typedef struct xfs_inode {
 
 	struct xfs_icdinode	i_d;		/* most of ondisk inode */
 
+	xfs_extnum_t		i_cnextents;	/* # of extents in cow fork */
+	unsigned int		i_cformat;	/* format of cow fork */
+
 	/* VFS inode */
 	struct inode		i_vnode;	/* embedded VFS inode */
 } xfs_inode_t;
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
new file mode 100644
index 0000000..7adbb83
--- /dev/null
+++ b/fs/xfs/xfs_reflink.c
@@ -0,0 +1,114 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_da_format.h"
+#include "xfs_da_btree.h"
+#include "xfs_inode.h"
+#include "xfs_trans.h"
+#include "xfs_inode_item.h"
+#include "xfs_bmap.h"
+#include "xfs_bmap_util.h"
+#include "xfs_error.h"
+#include "xfs_dir2.h"
+#include "xfs_dir2_priv.h"
+#include "xfs_ioctl.h"
+#include "xfs_trace.h"
+#include "xfs_log.h"
+#include "xfs_icache.h"
+#include "xfs_pnfs.h"
+#include "xfs_refcount_btree.h"
+#include "xfs_refcount.h"
+#include "xfs_bmap_btree.h"
+#include "xfs_trans_space.h"
+#include "xfs_bit.h"
+#include "xfs_alloc.h"
+#include "xfs_quota_defs.h"
+#include "xfs_quota.h"
+#include "xfs_btree.h"
+#include "xfs_bmap_btree.h"
+#include "xfs_reflink.h"
+
+/*
+ * Copy on Write of Shared Blocks
+ *
+ * XFS must preserve "the usual" file semantics even when two files share
+ * the same physical blocks.  This means that a write to one file must not
+ * alter the blocks in a different file; the way that we'll do that is
+ * through the use of a copy-on-write mechanism.  At a high level, that
+ * means that when we want to write to a shared block, we allocate a new
+ * block, write the data to the new block, and if that succeeds we map the
+ * new block into the file.
+ *
+ * XFS provides a "delayed allocation" mechanism that defers the allocation
+ * of disk blocks to dirty-but-not-yet-mapped file blocks as long as
+ * possible.  This reduces fragmentation by enabling the filesystem to ask
+ * for bigger chunks less often, which is exactly what we want for CoW.
+ *
+ * The delalloc mechanism begins when the kernel wants to make a block
+ * writable (write_begin or page_mkwrite).  If the offset is not mapped, we
+ * create a delalloc mapping, which is a regular in-core extent, but without
+ * a real startblock.  (For delalloc mappings, the startblock encodes both
+ * a flag that this is a delalloc mapping, and a worst-case estimate of how
+ * many blocks might be required to put the mapping into the BMBT.)  Delalloc
+ * mappings are a reservation against the free space in the filesystem;
+ * adjacent mappings can also be combined into fewer larger mappings.
+ *
+ * When dirty pages are being written out (typically in writepage), the
+ * delalloc reservations are converted into real mappings by allocating
+ * blocks and replacing the delalloc mapping with real ones.  A delalloc
+ * mapping can be replaced by several real ones if the free space is
+ * fragmented.
+ *
+ * We want to adapt the delalloc mechanism for copy-on-write, since the
+ * write paths are similar.  The first two steps (creating the reservation
+ * and allocating the blocks) are exactly the same as delalloc except that
+ * the mappings must be stored in a separate CoW fork because we do not want
+ * to disturb the mapping in the data fork until we're sure that the write
+ * succeeded.  IO completion in this case is the process of removing the old
+ * mapping from the data fork and moving the new mapping from the CoW fork to
+ * the data fork.  This will be discussed shortly.
+ *
+ * For now, unaligned directio writes will be bounced back to the page cache.
+ * Block-aligned directio writes will use the same mechanism as buffered
+ * writes.
+ *
+ * CoW remapping must be done after the data block write completes,
+ * because we don't want to destroy the old data fork map until we're sure
+ * the new block has been written.  Since the new mappings are kept in a
+ * separate fork, we can simply iterate these mappings to find the ones
+ * that cover the file blocks that we just CoW'd.  For each extent, simply
+ * unmap the corresponding range in the data fork, map the new range into
+ * the data fork, and remove the extent from the CoW fork.
+ *
+ * Since the remapping operation can be applied to an arbitrary file
+ * range, we record the need for the remap step as a flag in the ioend
+ * instead of declaring a new IO type.  This is required for direct io
+ * because we only have ioend for the whole dio, and we have to be able to
+ * remember the presence of unwritten blocks and CoW blocks with a single
+ * ioend structure.  Better yet, the more ground we can cover with one
+ * ioend, the better.
+ */
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
new file mode 100644
index 0000000..820b151
--- /dev/null
+++ b/fs/xfs/xfs_reflink.h
@@ -0,0 +1,23 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#ifndef __XFS_REFLINK_H
+#define __XFS_REFLINK_H 1
+
+#endif /* __XFS_REFLINK_H */
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 5403199..883c375 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -269,10 +269,10 @@ DECLARE_EVENT_CLASS(xfs_bmap_class,
 		__field(unsigned long, caller_ip)
 	),
 	TP_fast_assign(
-		struct xfs_ifork	*ifp = (state & BMAP_ATTRFORK) ?
-						ip->i_afp : &ip->i_df;
+		struct xfs_ifork	*ifp;
 		struct xfs_bmbt_irec	r;
 
+		ifp = XFS_IEXT_STATE_TO_FORK(ip, state);
 		xfs_bmbt_get_all(xfs_iext_get_ext(ifp, idx), &r);
 		__entry->dev = VFS_I(ip)->i_sb->s_dev;
 		__entry->ino = ip->i_ino;



* [PATCH 30/63] xfs: support bmapping delalloc extents in the CoW fork
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (28 preceding siblings ...)
  2016-09-30  3:08 ` [PATCH 29/63] xfs: introduce the CoW fork Darrick J. Wong
@ 2016-09-30  3:08 ` Darrick J. Wong
  2016-09-30  7:42   ` Christoph Hellwig
  2016-09-30  3:09 ` [PATCH 31/63] xfs: create delalloc extents in " Darrick J. Wong
                   ` (33 subsequent siblings)
  63 siblings, 1 reply; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:08 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel

Allow the creation of delayed allocation extents in the CoW fork.  In
a subsequent patch we'll wire up iomap_begin to actually do this via
reflink helper functions.
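
For background, a delalloc mapping's startblock is not a real block
address: it encodes a delalloc marker plus a worst-case estimate of the
indirect (BMBT) blocks the reservation may need, which is why both the
data fork and the CoW fork can store these extents the same way.  A
minimal model of that encoding (constants mirrored from xfs_format.h,
where STARTBLOCKVALBITS is 17; simplified, with no range checks):

```c
#include <assert.h>

#define STARTBLOCKVALBITS	17
#define STARTBLOCKMASK		(((unsigned long long)-1) << STARTBLOCKVALBITS)

/* Encode an indirect-block reservation as a delalloc startblock. */
static unsigned long long nullstartblock(unsigned int indlen)
{
	return STARTBLOCKMASK | indlen;
}

/* Does this startblock mark a delalloc (not yet allocated) mapping? */
static int isnullstartblock(unsigned long long sb)
{
	return (sb & STARTBLOCKMASK) == STARTBLOCKMASK;
}

/* Recover the indirect-block reservation from a delalloc startblock. */
static unsigned int startblockval(unsigned long long sb)
{
	return (unsigned int)(sb & ~STARTBLOCKMASK);
}
```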

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c |   12 ++++++++----
 fs/xfs/libxfs/xfs_bmap.h |    7 ++++---
 fs/xfs/xfs_iomap.c       |    2 +-
 fs/xfs/xfs_trace.h       |    6 +++---
 4 files changed, 16 insertions(+), 11 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 3388058..5749618 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -2760,6 +2760,7 @@ xfs_bmap_add_extent_unwritten_real(
 STATIC void
 xfs_bmap_add_extent_hole_delay(
 	xfs_inode_t		*ip,	/* incore inode pointer */
+	int			whichfork,
 	xfs_extnum_t		*idx,	/* extent number to update/insert */
 	xfs_bmbt_irec_t		*new)	/* new data to add to file extents */
 {
@@ -2771,8 +2772,10 @@ xfs_bmap_add_extent_hole_delay(
 	int			state;  /* state bits, accessed thru macros */
 	xfs_filblks_t		temp=0;	/* temp for indirect calculations */
 
-	ifp = XFS_IFORK_PTR(ip, XFS_DATA_FORK);
+	ifp = XFS_IFORK_PTR(ip, whichfork);
 	state = 0;
+	if (whichfork == XFS_COW_FORK)
+		state |= BMAP_COWFORK;
 	ASSERT(isnullstartblock(new->br_startblock));
 
 	/*
@@ -2790,7 +2793,7 @@ xfs_bmap_add_extent_hole_delay(
 	 * Check and set flags if the current (right) segment exists.
 	 * If it doesn't exist, we're converting the hole at end-of-file.
 	 */
-	if (*idx < ip->i_df.if_bytes / (uint)sizeof(xfs_bmbt_rec_t)) {
+	if (*idx < ifp->if_bytes / (uint)sizeof(xfs_bmbt_rec_t)) {
 		state |= BMAP_RIGHT_VALID;
 		xfs_bmbt_get_all(xfs_iext_get_ext(ifp, *idx), &right);
 
@@ -4146,6 +4149,7 @@ xfs_bmapi_read(
 int
 xfs_bmapi_reserve_delalloc(
 	struct xfs_inode	*ip,
+	int			whichfork,
 	xfs_fileoff_t		aoff,
 	xfs_filblks_t		len,
 	struct xfs_bmbt_irec	*got,
@@ -4154,7 +4158,7 @@ xfs_bmapi_reserve_delalloc(
 	int			eof)
 {
 	struct xfs_mount	*mp = ip->i_mount;
-	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, XFS_DATA_FORK);
+	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
 	xfs_extlen_t		alen;
 	xfs_extlen_t		indlen;
 	char			rt = XFS_IS_REALTIME_INODE(ip);
@@ -4213,7 +4217,7 @@ xfs_bmapi_reserve_delalloc(
 	got->br_startblock = nullstartblock(indlen);
 	got->br_blockcount = alen;
 	got->br_state = XFS_EXT_NORM;
-	xfs_bmap_add_extent_hole_delay(ip, lastx, got);
+	xfs_bmap_add_extent_hole_delay(ip, whichfork, lastx, got);
 
 	/*
 	 * Update our extent pointer, given that xfs_bmap_add_extent_hole_delay
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index adb64fb..75b1a1f 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -234,9 +234,10 @@ struct xfs_bmbt_rec_host *
 	xfs_bmap_search_extents(struct xfs_inode *ip, xfs_fileoff_t bno,
 		int fork, int *eofp, xfs_extnum_t *lastxp,
 		struct xfs_bmbt_irec *gotp, struct xfs_bmbt_irec *prevp);
-int	xfs_bmapi_reserve_delalloc(struct xfs_inode *ip, xfs_fileoff_t aoff,
-		xfs_filblks_t len, struct xfs_bmbt_irec *got,
-		struct xfs_bmbt_irec *prev, xfs_extnum_t *lastx, int eof);
+int	xfs_bmapi_reserve_delalloc(struct xfs_inode *ip, int whichfork,
+		xfs_fileoff_t aoff, xfs_filblks_t len,
+		struct xfs_bmbt_irec *got, struct xfs_bmbt_irec *prev,
+		xfs_extnum_t *lastx, int eof);
 
 /* The XFS_BMAP_EXTENT_* in xfs_log_format.h must match these. */
 enum xfs_bmap_intent_type {
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index c08253e..59c7beb 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -609,7 +609,7 @@ xfs_file_iomap_begin_delay(
 	}
 
 retry:
-	error = xfs_bmapi_reserve_delalloc(ip, offset_fsb,
+	error = xfs_bmapi_reserve_delalloc(ip, XFS_DATA_FORK, offset_fsb,
 			end_fsb - offset_fsb, &got,
 			&prev, &idx, eof);
 	switch (error) {
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 883c375..7612096 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -3326,16 +3326,17 @@ DEFINE_INODE_ERROR_EVENT(xfs_reflink_dirty_page_error);
 
 /* copy on write */
 DEFINE_INODE_IREC_EVENT(xfs_reflink_trim_around_shared);
+DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_alloc);
+DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_found);
+DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_enospc);
 
 DEFINE_RW_EVENT(xfs_reflink_reserve_cow_range);
-DEFINE_INODE_IREC_EVENT(xfs_reflink_reserve_cow_extent);
 DEFINE_RW_EVENT(xfs_reflink_allocate_cow_range);
 DEFINE_INODE_IREC_EVENT(xfs_reflink_allocate_cow_extent);
 
 DEFINE_INODE_IREC_EVENT(xfs_reflink_bounce_dio_write);
 DEFINE_IOMAP_EVENT(xfs_reflink_find_cow_mapping);
 DEFINE_INODE_IREC_EVENT(xfs_reflink_trim_irec);
-DEFINE_SIMPLE_IO_EVENT(xfs_iomap_cow_delay);
 
 DEFINE_SIMPLE_IO_EVENT(xfs_reflink_cancel_cow_range);
 DEFINE_SIMPLE_IO_EVENT(xfs_reflink_end_cow);
@@ -3343,7 +3344,6 @@ DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_remap);
 DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_remap_piece);
 
 DEFINE_INODE_ERROR_EVENT(xfs_reflink_reserve_cow_range_error);
-DEFINE_INODE_ERROR_EVENT(xfs_reflink_reserve_cow_extent_error);
 DEFINE_INODE_ERROR_EVENT(xfs_reflink_allocate_cow_range_error);
 DEFINE_INODE_ERROR_EVENT(xfs_reflink_cancel_cow_range_error);
 DEFINE_INODE_ERROR_EVENT(xfs_reflink_end_cow_error);


^ permalink raw reply related	[flat|nested] 187+ messages in thread

* [PATCH 31/63] xfs: create delalloc extents in CoW fork
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (29 preceding siblings ...)
  2016-09-30  3:08 ` [PATCH 30/63] xfs: support bmapping delalloc extents in " Darrick J. Wong
@ 2016-09-30  3:09 ` Darrick J. Wong
  2016-10-04 16:38   ` Brian Foster
  2016-09-30  3:09 ` [PATCH 32/63] xfs: support allocating delayed " Darrick J. Wong
                   ` (32 subsequent siblings)
  63 siblings, 1 reply; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:09 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel, Christoph Hellwig

Wire up iomap_begin to detect shared extents and create delayed allocation
extents in the CoW fork:

 1) Check if we already have an extent in the COW fork for the area.
    If so, there is nothing to do and we can move along.
 2) Look up the block number for the current extent; if there is none,
    the extent is not shared, so move along.
 3) Unshare the current extent as far as we are going to write into it.
    For this we avoid an additional COW fork lookup and use the
    information we set aside in step 1) above.
 4) Go to 1) unless we've covered the whole range.

Last but not least, this updates the xfs_reflink_reserve_cow_range calling
convention to pass a byte offset and length, as that is what both callers
expect anyway.  This patch has been refactored considerably as part of the
iomap transition.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_iomap.c   |   12 ++-
 fs/xfs/xfs_reflink.c |  202 ++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_reflink.h |    9 ++
 3 files changed, 221 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 59c7beb..e8312b0 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -39,6 +39,7 @@
 #include "xfs_quota.h"
 #include "xfs_dquot_item.h"
 #include "xfs_dquot.h"
+#include "xfs_reflink.h"
 
 
 #define XFS_WRITEIO_ALIGN(mp,off)	(((off) >> mp->m_writeio_log) \
@@ -961,8 +962,15 @@ xfs_file_iomap_begin(
 	if (XFS_FORCED_SHUTDOWN(mp))
 		return -EIO;
 
-	if ((flags & IOMAP_WRITE) &&
-	    !IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
+	if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && xfs_is_reflink_inode(ip)) {
+		error = xfs_reflink_reserve_cow_range(ip, offset, length);
+		if (error < 0)
+			return error;
+	}
+
+	if ((flags & IOMAP_WRITE) && !IS_DAX(inode) &&
+		   !xfs_get_extsz_hint(ip)) {
+		/* Reserve delalloc blocks for regular writeback. */
 		return xfs_file_iomap_begin_delay(inode, offset, length, flags,
 				iomap);
 	}
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 7adbb83..05a7fe6 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -51,6 +51,7 @@
 #include "xfs_btree.h"
 #include "xfs_bmap_btree.h"
 #include "xfs_reflink.h"
+#include "xfs_iomap.h"
 
 /*
  * Copy on Write of Shared Blocks
@@ -112,3 +113,204 @@
  * ioend structure.  Better yet, the more ground we can cover with one
  * ioend, the better.
  */
+
+/*
+ * Given an AG extent, find the lowest-numbered run of shared blocks within
+ * that range and return the range in fbno/flen.
+ */
+int
+xfs_reflink_find_shared(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno,
+	xfs_agblock_t		agbno,
+	xfs_extlen_t		aglen,
+	xfs_agblock_t		*fbno,
+	xfs_extlen_t		*flen,
+	bool			find_maximal)
+{
+	struct xfs_buf		*agbp;
+	struct xfs_btree_cur	*cur;
+	int			error;
+
+	error = xfs_alloc_read_agf(mp, NULL, agno, 0, &agbp);
+	if (error)
+		return error;
+
+	cur = xfs_refcountbt_init_cursor(mp, NULL, agbp, agno, NULL);
+
+	error = xfs_refcount_find_shared(cur, agbno, aglen, fbno, flen,
+			find_maximal);
+
+	xfs_btree_del_cursor(cur, error ? XFS_BTREE_ERROR : XFS_BTREE_NOERROR);
+
+	xfs_buf_relse(agbp);
+	return error;
+}
+
+/*
+ * Trim the mapping to the next block where there's a change in the
+ * shared/unshared status.  More specifically, this means that we
+ * find the lowest-numbered extent of shared blocks that coincides with
+ * the given block mapping.  If the shared extent overlaps the start of
+ * the mapping, trim the mapping to the end of the shared extent.  If
+ * the shared region intersects the middle of the mapping, trim the
+ * mapping to the start of the shared extent.  If there are no shared
+ * regions that overlap, just return the original extent.
+ */
+int
+xfs_reflink_trim_around_shared(
+	struct xfs_inode	*ip,
+	struct xfs_bmbt_irec	*irec,
+	bool			*shared,
+	bool			*trimmed)
+{
+	xfs_agnumber_t		agno;
+	xfs_agblock_t		agbno;
+	xfs_extlen_t		aglen;
+	xfs_agblock_t		fbno;
+	xfs_extlen_t		flen;
+	int			error = 0;
+
+	/* Holes, unwritten, and delalloc extents cannot be shared */
+	if (!xfs_is_reflink_inode(ip) ||
+	    ISUNWRITTEN(irec) ||
+	    irec->br_startblock == HOLESTARTBLOCK ||
+	    irec->br_startblock == DELAYSTARTBLOCK) {
+		*shared = false;
+		return 0;
+	}
+
+	trace_xfs_reflink_trim_around_shared(ip, irec);
+
+	agno = XFS_FSB_TO_AGNO(ip->i_mount, irec->br_startblock);
+	agbno = XFS_FSB_TO_AGBNO(ip->i_mount, irec->br_startblock);
+	aglen = irec->br_blockcount;
+
+	error = xfs_reflink_find_shared(ip->i_mount, agno, agbno,
+			aglen, &fbno, &flen, true);
+	if (error)
+		return error;
+
+	*shared = *trimmed = false;
+	if (flen == 0) {
+		/* No shared blocks at all. */
+		return 0;
+	} else if (fbno == agbno) {
+		/* The start of this extent is shared. */
+		irec->br_blockcount = flen;
+		*shared = true;
+		*trimmed = true;
+		return 0;
+	} else {
+		/* There's a shared extent midway through this extent. */
+		irec->br_blockcount = fbno - agbno;
+		*trimmed = true;
+		return 0;
+	}
+}
+
+/* Create a CoW reservation for a range of blocks within a file. */
+static int
+__xfs_reflink_reserve_cow(
+	struct xfs_inode	*ip,
+	xfs_fileoff_t		*offset_fsb,
+	xfs_fileoff_t		end_fsb)
+{
+	struct xfs_bmbt_irec	got, prev, imap;
+	xfs_fileoff_t		orig_end_fsb;
+	int			nimaps, eof = 0, error = 0;
+	bool			shared = false, trimmed = false;
+	xfs_extnum_t		idx;
+
+	/* Already reserved?  Skip the refcount btree access. */
+	xfs_bmap_search_extents(ip, *offset_fsb, XFS_COW_FORK, &eof, &idx,
+			&got, &prev);
+	if (!eof && got.br_startoff <= *offset_fsb) {
+		end_fsb = orig_end_fsb = got.br_startoff + got.br_blockcount;
+		trace_xfs_reflink_cow_found(ip, &got);
+		goto done;
+	}
+
+	/* Read extent from the source file. */
+	nimaps = 1;
+	error = xfs_bmapi_read(ip, *offset_fsb, end_fsb - *offset_fsb,
+			&imap, &nimaps, 0);
+	if (error)
+		goto out_unlock;
+	ASSERT(nimaps == 1);
+
+	/* Trim the mapping to the nearest shared extent boundary. */
+	error = xfs_reflink_trim_around_shared(ip, &imap, &shared, &trimmed);
+	if (error)
+		goto out_unlock;
+
+	end_fsb = orig_end_fsb = imap.br_startoff + imap.br_blockcount;
+
+	/* Not shared?  Just report the (potentially capped) extent. */
+	if (!shared)
+		goto done;
+
+	/*
+	 * Fork all the shared blocks from our write offset until the end of
+	 * the extent.
+	 */
+	error = xfs_qm_dqattach_locked(ip, 0);
+	if (error)
+		goto out_unlock;
+
+retry:
+	error = xfs_bmapi_reserve_delalloc(ip, XFS_COW_FORK, *offset_fsb,
+			end_fsb - *offset_fsb, &got,
+			&prev, &idx, eof);
+	switch (error) {
+	case 0:
+		break;
+	case -ENOSPC:
+	case -EDQUOT:
+		/* retry without any preallocation */
+		trace_xfs_reflink_cow_enospc(ip, &imap);
+		if (end_fsb != orig_end_fsb) {
+			end_fsb = orig_end_fsb;
+			goto retry;
+		}
+		/*FALLTHRU*/
+	default:
+		goto out_unlock;
+	}
+
+	trace_xfs_reflink_cow_alloc(ip, &got);
+done:
+	*offset_fsb = end_fsb;
+out_unlock:
+	return error;
+}
+
+/* Create a CoW reservation for part of a file. */
+int
+xfs_reflink_reserve_cow_range(
+	struct xfs_inode	*ip,
+	xfs_off_t		offset,
+	xfs_off_t		count)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	xfs_fileoff_t		offset_fsb, end_fsb;
+	int			error;
+
+	trace_xfs_reflink_reserve_cow_range(ip, offset, count);
+
+	offset_fsb = XFS_B_TO_FSBT(mp, offset);
+	end_fsb = XFS_B_TO_FSB(mp, offset + count);
+
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	while (offset_fsb < end_fsb) {
+		error = __xfs_reflink_reserve_cow(ip, &offset_fsb, end_fsb);
+		if (error) {
+			trace_xfs_reflink_reserve_cow_range_error(ip, error,
+				_RET_IP_);
+			break;
+		}
+	}
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+
+	return error;
+}
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index 820b151..f824f87 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -20,4 +20,13 @@
 #ifndef __XFS_REFLINK_H
 #define __XFS_REFLINK_H 1
 
+extern int xfs_reflink_find_shared(struct xfs_mount *mp, xfs_agnumber_t agno,
+		xfs_agblock_t agbno, xfs_extlen_t aglen, xfs_agblock_t *fbno,
+		xfs_extlen_t *flen, bool find_maximal);
+extern int xfs_reflink_trim_around_shared(struct xfs_inode *ip,
+		struct xfs_bmbt_irec *irec, bool *shared, bool *trimmed);
+
+extern int xfs_reflink_reserve_cow_range(struct xfs_inode *ip,
+		xfs_off_t offset, xfs_off_t count);
+
 #endif /* __XFS_REFLINK_H */



* [PATCH 32/63] xfs: support allocating delayed extents in CoW fork
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (30 preceding siblings ...)
  2016-09-30  3:09 ` [PATCH 31/63] xfs: create delalloc extents in " Darrick J. Wong
@ 2016-09-30  3:09 ` Darrick J. Wong
  2016-09-30  7:42   ` Christoph Hellwig
  2016-10-04 16:38   ` Brian Foster
  2016-09-30  3:09 ` [PATCH 33/63] xfs: allocate " Darrick J. Wong
                   ` (31 subsequent siblings)
  63 siblings, 2 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:09 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel

Modify xfs_bmap_add_extent_delay_real() so that we can convert delayed
allocation extents in the CoW fork to real allocations, and wire this
up all the way back to xfs_iomap_write_allocate().  In a subsequent
patch, we'll modify the writepage handler to call this.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c |   51 ++++++++++++++++++++++++++++++++--------------
 fs/xfs/xfs_aops.c        |    6 ++++-
 fs/xfs/xfs_iomap.c       |    7 +++++-
 fs/xfs/xfs_iomap.h       |    2 +-
 4 files changed, 46 insertions(+), 20 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 5749618..85a0c86 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -141,7 +141,8 @@ xfs_bmbt_lookup_ge(
  */
 static inline bool xfs_bmap_needs_btree(struct xfs_inode *ip, int whichfork)
 {
-	return XFS_IFORK_FORMAT(ip, whichfork) == XFS_DINODE_FMT_EXTENTS &&
+	return whichfork != XFS_COW_FORK &&
+		XFS_IFORK_FORMAT(ip, whichfork) == XFS_DINODE_FMT_EXTENTS &&
 		XFS_IFORK_NEXTENTS(ip, whichfork) >
 			XFS_IFORK_MAXEXT(ip, whichfork);
 }
@@ -151,7 +152,8 @@ static inline bool xfs_bmap_needs_btree(struct xfs_inode *ip, int whichfork)
  */
 static inline bool xfs_bmap_wants_extents(struct xfs_inode *ip, int whichfork)
 {
-	return XFS_IFORK_FORMAT(ip, whichfork) == XFS_DINODE_FMT_BTREE &&
+	return whichfork != XFS_COW_FORK &&
+		XFS_IFORK_FORMAT(ip, whichfork) == XFS_DINODE_FMT_BTREE &&
 		XFS_IFORK_NEXTENTS(ip, whichfork) <=
 			XFS_IFORK_MAXEXT(ip, whichfork);
 }
@@ -641,6 +643,7 @@ xfs_bmap_btree_to_extents(
 
 	mp = ip->i_mount;
 	ifp = XFS_IFORK_PTR(ip, whichfork);
+	ASSERT(whichfork != XFS_COW_FORK);
 	ASSERT(ifp->if_flags & XFS_IFEXTENTS);
 	ASSERT(XFS_IFORK_FORMAT(ip, whichfork) == XFS_DINODE_FMT_BTREE);
 	rblock = ifp->if_broot;
@@ -707,6 +710,7 @@ xfs_bmap_extents_to_btree(
 	xfs_bmbt_ptr_t		*pp;		/* root block address pointer */
 
 	mp = ip->i_mount;
+	ASSERT(whichfork != XFS_COW_FORK);
 	ifp = XFS_IFORK_PTR(ip, whichfork);
 	ASSERT(XFS_IFORK_FORMAT(ip, whichfork) == XFS_DINODE_FMT_EXTENTS);
 
@@ -838,6 +842,7 @@ xfs_bmap_local_to_extents_empty(
 {
 	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
 
+	ASSERT(whichfork != XFS_COW_FORK);
 	ASSERT(XFS_IFORK_FORMAT(ip, whichfork) == XFS_DINODE_FMT_LOCAL);
 	ASSERT(ifp->if_bytes == 0);
 	ASSERT(XFS_IFORK_NEXTENTS(ip, whichfork) == 0);
@@ -1671,7 +1676,8 @@ xfs_bmap_one_block(
  */
 STATIC int				/* error */
 xfs_bmap_add_extent_delay_real(
-	struct xfs_bmalloca	*bma)
+	struct xfs_bmalloca	*bma,
+	int			whichfork)
 {
 	struct xfs_bmbt_irec	*new = &bma->got;
 	int			diff;	/* temp value */
@@ -1689,11 +1695,14 @@ xfs_bmap_add_extent_delay_real(
 	xfs_filblks_t		temp=0;	/* value for da_new calculations */
 	xfs_filblks_t		temp2=0;/* value for da_new calculations */
 	int			tmp_rval;	/* partial logging flags */
-	int			whichfork = XFS_DATA_FORK;
 	struct xfs_mount	*mp;
+	xfs_extnum_t		*nextents;
 
 	mp = bma->ip->i_mount;
 	ifp = XFS_IFORK_PTR(bma->ip, whichfork);
+	ASSERT(whichfork != XFS_ATTR_FORK);
+	nextents = (whichfork == XFS_COW_FORK ? &bma->ip->i_cnextents :
+						&bma->ip->i_d.di_nextents);
 
 	ASSERT(bma->idx >= 0);
 	ASSERT(bma->idx <= ifp->if_bytes / sizeof(struct xfs_bmbt_rec));
@@ -1707,6 +1716,9 @@ xfs_bmap_add_extent_delay_real(
 #define	RIGHT		r[1]
 #define	PREV		r[2]
 
+	if (whichfork == XFS_COW_FORK)
+		state |= BMAP_COWFORK;
+
 	/*
 	 * Set up a bunch of variables to make the tests simpler.
 	 */
@@ -1793,7 +1805,7 @@ xfs_bmap_add_extent_delay_real(
 		trace_xfs_bmap_post_update(bma->ip, bma->idx, state, _THIS_IP_);
 
 		xfs_iext_remove(bma->ip, bma->idx + 1, 2, state);
-		bma->ip->i_d.di_nextents--;
+		(*nextents)--;
 		if (bma->cur == NULL)
 			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
 		else {
@@ -1895,7 +1907,7 @@ xfs_bmap_add_extent_delay_real(
 		xfs_bmbt_set_startblock(ep, new->br_startblock);
 		trace_xfs_bmap_post_update(bma->ip, bma->idx, state, _THIS_IP_);
 
-		bma->ip->i_d.di_nextents++;
+		(*nextents)++;
 		if (bma->cur == NULL)
 			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
 		else {
@@ -1965,7 +1977,7 @@ xfs_bmap_add_extent_delay_real(
 		temp = PREV.br_blockcount - new->br_blockcount;
 		xfs_bmbt_set_blockcount(ep, temp);
 		xfs_iext_insert(bma->ip, bma->idx, 1, new, state);
-		bma->ip->i_d.di_nextents++;
+		(*nextents)++;
 		if (bma->cur == NULL)
 			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
 		else {
@@ -2049,7 +2061,7 @@ xfs_bmap_add_extent_delay_real(
 		trace_xfs_bmap_pre_update(bma->ip, bma->idx, state, _THIS_IP_);
 		xfs_bmbt_set_blockcount(ep, temp);
 		xfs_iext_insert(bma->ip, bma->idx + 1, 1, new, state);
-		bma->ip->i_d.di_nextents++;
+		(*nextents)++;
 		if (bma->cur == NULL)
 			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
 		else {
@@ -2118,7 +2130,7 @@ xfs_bmap_add_extent_delay_real(
 		RIGHT.br_blockcount = temp2;
 		/* insert LEFT (r[0]) and RIGHT (r[1]) at the same time */
 		xfs_iext_insert(bma->ip, bma->idx + 1, 2, &LEFT, state);
-		bma->ip->i_d.di_nextents++;
+		(*nextents)++;
 		if (bma->cur == NULL)
 			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
 		else {
@@ -2216,7 +2228,8 @@ xfs_bmap_add_extent_delay_real(
 
 	xfs_bmap_check_leaf_extents(bma->cur, bma->ip, whichfork);
 done:
-	bma->logflags |= rval;
+	if (whichfork != XFS_COW_FORK)
+		bma->logflags |= rval;
 	return error;
 #undef	LEFT
 #undef	RIGHT
@@ -3861,7 +3874,8 @@ xfs_bmap_btalloc(
 		ASSERT(nullfb || fb_agno == args.agno ||
 		       (ap->dfops->dop_low && fb_agno < args.agno));
 		ap->length = args.len;
-		ap->ip->i_d.di_nblocks += args.len;
+		if (!(ap->flags & XFS_BMAPI_COWFORK))
+			ap->ip->i_d.di_nblocks += args.len;
 		xfs_trans_log_inode(ap->tp, ap->ip, XFS_ILOG_CORE);
 		if (ap->wasdel)
 			ap->ip->i_delayed_blks -= args.len;
@@ -4248,8 +4262,7 @@ xfs_bmapi_allocate(
 	struct xfs_bmalloca	*bma)
 {
 	struct xfs_mount	*mp = bma->ip->i_mount;
-	int			whichfork = (bma->flags & XFS_BMAPI_ATTRFORK) ?
-						XFS_ATTR_FORK : XFS_DATA_FORK;
+	int			whichfork = xfs_bmapi_whichfork(bma->flags);
 	struct xfs_ifork	*ifp = XFS_IFORK_PTR(bma->ip, whichfork);
 	int			tmp_logflags = 0;
 	int			error;
@@ -4344,7 +4357,7 @@ xfs_bmapi_allocate(
 		bma->got.br_state = XFS_EXT_UNWRITTEN;
 
 	if (bma->wasdel)
-		error = xfs_bmap_add_extent_delay_real(bma);
+		error = xfs_bmap_add_extent_delay_real(bma, whichfork);
 	else
 		error = xfs_bmap_add_extent_hole_real(bma, whichfork);
 
@@ -4498,8 +4511,7 @@ xfs_bmapi_write(
 	orig_mval = mval;
 	orig_nmap = *nmap;
 #endif
-	whichfork = (flags & XFS_BMAPI_ATTRFORK) ?
-		XFS_ATTR_FORK : XFS_DATA_FORK;
+	whichfork = xfs_bmapi_whichfork(flags);
 
 	ASSERT(*nmap >= 1);
 	ASSERT(*nmap <= XFS_BMAP_MAX_NMAP);
@@ -4510,6 +4522,11 @@ xfs_bmapi_write(
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
 	if (whichfork == XFS_ATTR_FORK)
 		ASSERT(!(flags & XFS_BMAPI_REMAP));
+	if (whichfork == XFS_COW_FORK) {
+		ASSERT(!(flags & XFS_BMAPI_REMAP));
+		ASSERT(!(flags & XFS_BMAPI_PREALLOC));
+		ASSERT(!(flags & XFS_BMAPI_CONVERT));
+	}
 	if (flags & XFS_BMAPI_REMAP) {
 		ASSERT(!(flags & XFS_BMAPI_PREALLOC));
 		ASSERT(!(flags & XFS_BMAPI_CONVERT));
@@ -4579,6 +4596,8 @@ xfs_bmapi_write(
 		 */
 		if (flags & XFS_BMAPI_REMAP)
 			ASSERT(inhole);
+		if (flags & XFS_BMAPI_COWFORK)
+			ASSERT(!inhole);
 
 		/*
 		 * First, deal with the hole before the allocated space
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 4a28fa9..007a520 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -362,9 +362,11 @@ xfs_map_blocks(
 
 	if (type == XFS_IO_DELALLOC &&
 	    (!nimaps || isnullstartblock(imap->br_startblock))) {
-		error = xfs_iomap_write_allocate(ip, offset, imap);
+		error = xfs_iomap_write_allocate(ip, XFS_DATA_FORK, offset,
+				imap);
 		if (!error)
-			trace_xfs_map_blocks_alloc(ip, offset, count, type, imap);
+			trace_xfs_map_blocks_alloc(ip, offset, count, type,
+					imap);
 		return error;
 	}
 
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index e8312b0..ad6939d 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -667,6 +667,7 @@ xfs_file_iomap_begin_delay(
 int
 xfs_iomap_write_allocate(
 	xfs_inode_t	*ip,
+	int		whichfork,
 	xfs_off_t	offset,
 	xfs_bmbt_irec_t *imap)
 {
@@ -679,8 +680,12 @@ xfs_iomap_write_allocate(
 	xfs_trans_t	*tp;
 	int		nimaps;
 	int		error = 0;
+	int		flags = 0;
 	int		nres;
 
+	if (whichfork == XFS_COW_FORK)
+		flags |= XFS_BMAPI_COWFORK;
+
 	/*
 	 * Make sure that the dquots are there.
 	 */
@@ -774,7 +779,7 @@ xfs_iomap_write_allocate(
 			 * pointer that the caller gave to us.
 			 */
 			error = xfs_bmapi_write(tp, ip, map_start_fsb,
-						count_fsb, 0, &first_block,
+						count_fsb, flags, &first_block,
 						nres, imap, &nimaps,
 						&dfops);
 			if (error)
diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
index 6498be4..a16b956 100644
--- a/fs/xfs/xfs_iomap.h
+++ b/fs/xfs/xfs_iomap.h
@@ -25,7 +25,7 @@ struct xfs_bmbt_irec;
 
 int xfs_iomap_write_direct(struct xfs_inode *, xfs_off_t, size_t,
 			struct xfs_bmbt_irec *, int);
-int xfs_iomap_write_allocate(struct xfs_inode *, xfs_off_t,
+int xfs_iomap_write_allocate(struct xfs_inode *, int, xfs_off_t,
 			struct xfs_bmbt_irec *);
 int xfs_iomap_write_unwritten(struct xfs_inode *, xfs_off_t, xfs_off_t);
 



* [PATCH 33/63] xfs: allocate delayed extents in CoW fork
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (31 preceding siblings ...)
  2016-09-30  3:09 ` [PATCH 32/63] xfs: support allocating delayed " Darrick J. Wong
@ 2016-09-30  3:09 ` Darrick J. Wong
  2016-10-04 16:38   ` Brian Foster
  2016-09-30  3:09 ` [PATCH 34/63] xfs: support removing extents from " Darrick J. Wong
                   ` (30 subsequent siblings)
  63 siblings, 1 reply; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:09 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel, Christoph Hellwig

Modify the writepage handler to find and convert pending delalloc
extents to real allocations.  Furthermore, when we're doing non-CoW
writes to a part of a file that already has a CoW reservation (the
cowextsz hint that we set up in a subsequent patch facilitates this),
promote the write to copy-on-write so that the entire extent can be
written out as a single extent on disk, thereby reducing post-CoW
fragmentation.

Christoph moved the CoW support code in _map_blocks to a separate helper
function, refactored other functions, and reduced the number of CoW fork
lookups, so I merged those changes here to reduce churn.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_aops.c    |  106 ++++++++++++++++++++++++++++++++++++++++----------
 fs/xfs/xfs_aops.h    |    4 +-
 fs/xfs/xfs_reflink.c |   86 +++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_reflink.h |    4 ++
 4 files changed, 178 insertions(+), 22 deletions(-)


diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 007a520..7b1e9de 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -31,6 +31,7 @@
 #include "xfs_bmap.h"
 #include "xfs_bmap_util.h"
 #include "xfs_bmap_btree.h"
+#include "xfs_reflink.h"
 #include <linux/gfp.h>
 #include <linux/mpage.h>
 #include <linux/pagevec.h>
@@ -341,6 +342,7 @@ xfs_map_blocks(
 	if (XFS_FORCED_SHUTDOWN(mp))
 		return -EIO;
 
+	ASSERT(type != XFS_IO_COW);
 	if (type == XFS_IO_UNWRITTEN)
 		bmapi_flags |= XFS_BMAPI_IGSTATE;
 
@@ -355,6 +357,13 @@ xfs_map_blocks(
 	offset_fsb = XFS_B_TO_FSBT(mp, offset);
 	error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb,
 				imap, &nimaps, bmapi_flags);
+	/*
+	 * Truncate an overwrite extent if there's a pending CoW
+	 * reservation before the end of this extent.  This forces us
+	 * to come back to writepage to take care of the CoW.
+	 */
+	if (nimaps && type == XFS_IO_OVERWRITE)
+		xfs_reflink_trim_irec_to_next_cow(ip, offset_fsb, imap);
 	xfs_iunlock(ip, XFS_ILOCK_SHARED);
 
 	if (error)
@@ -365,8 +374,7 @@ xfs_map_blocks(
 		error = xfs_iomap_write_allocate(ip, XFS_DATA_FORK, offset,
 				imap);
 		if (!error)
-			trace_xfs_map_blocks_alloc(ip, offset, count, type,
-					imap);
+			trace_xfs_map_blocks_alloc(ip, offset, count, type, imap);
 		return error;
 	}
 
@@ -645,13 +653,16 @@ xfs_check_page_type(
 	bh = head = page_buffers(page);
 	do {
 		if (buffer_unwritten(bh)) {
-			if (type == XFS_IO_UNWRITTEN)
+			if (type == XFS_IO_UNWRITTEN ||
+			    type == XFS_IO_COW)
 				return true;
 		} else if (buffer_delay(bh)) {
-			if (type == XFS_IO_DELALLOC)
+			if (type == XFS_IO_DELALLOC ||
+			    type == XFS_IO_COW)
 				return true;
 		} else if (buffer_dirty(bh) && buffer_mapped(bh)) {
-			if (type == XFS_IO_OVERWRITE)
+			if (type == XFS_IO_OVERWRITE ||
+			    type == XFS_IO_COW)
 				return true;
 		}
 
@@ -739,6 +750,56 @@ xfs_aops_discard_page(
 	return;
 }
 
+static int
+xfs_map_cow(
+	struct xfs_writepage_ctx *wpc,
+	struct inode		*inode,
+	loff_t			offset,
+	unsigned int		*new_type)
+{
+	struct xfs_inode	*ip = XFS_I(inode);
+	struct xfs_bmbt_irec	imap;
+	bool			is_cow = false, need_alloc = false;
+	int			error;
+
+	/*
+	 * If we already have a valid COW mapping keep using it.
+	 */
+	if (wpc->io_type == XFS_IO_COW) {
+		wpc->imap_valid = xfs_imap_valid(inode, &wpc->imap, offset);
+		if (wpc->imap_valid) {
+			*new_type = XFS_IO_COW;
+			return 0;
+		}
+	}
+
+	/*
+	 * Else we need to check if there is a COW mapping at this offset.
+	 */
+	xfs_ilock(ip, XFS_ILOCK_SHARED);
+	is_cow = xfs_reflink_find_cow_mapping(ip, offset, &imap, &need_alloc);
+	xfs_iunlock(ip, XFS_ILOCK_SHARED);
+
+	if (!is_cow)
+		return 0;
+
+	/*
+	 * And if the COW mapping has a delayed extent here we need to
+	 * allocate real space for it now.
+	 */
+	if (need_alloc) {
+		error = xfs_iomap_write_allocate(ip, XFS_COW_FORK, offset,
+				&imap);
+		if (error)
+			return error;
+	}
+
+	wpc->io_type = *new_type = XFS_IO_COW;
+	wpc->imap_valid = true;
+	wpc->imap = imap;
+	return 0;
+}
+
 /*
  * We implement an immediate ioend submission policy here to avoid needing to
  * chain multiple ioends and hence nest mempool allocations which can violate
@@ -771,6 +832,7 @@ xfs_writepage_map(
 	int			error = 0;
 	int			count = 0;
 	int			uptodate = 1;
+	unsigned int		new_type;
 
 	bh = head = page_buffers(page);
 	offset = page_offset(page);
@@ -791,22 +853,13 @@ xfs_writepage_map(
 			continue;
 		}
 
-		if (buffer_unwritten(bh)) {
-			if (wpc->io_type != XFS_IO_UNWRITTEN) {
-				wpc->io_type = XFS_IO_UNWRITTEN;
-				wpc->imap_valid = false;
-			}
-		} else if (buffer_delay(bh)) {
-			if (wpc->io_type != XFS_IO_DELALLOC) {
-				wpc->io_type = XFS_IO_DELALLOC;
-				wpc->imap_valid = false;
-			}
-		} else if (buffer_uptodate(bh)) {
-			if (wpc->io_type != XFS_IO_OVERWRITE) {
-				wpc->io_type = XFS_IO_OVERWRITE;
-				wpc->imap_valid = false;
-			}
-		} else {
+		if (buffer_unwritten(bh))
+			new_type = XFS_IO_UNWRITTEN;
+		else if (buffer_delay(bh))
+			new_type = XFS_IO_DELALLOC;
+		else if (buffer_uptodate(bh))
+			new_type = XFS_IO_OVERWRITE;
+		else {
 			if (PageUptodate(page))
 				ASSERT(buffer_mapped(bh));
 			/*
@@ -819,6 +872,17 @@ xfs_writepage_map(
 			continue;
 		}
 
+		if (xfs_is_reflink_inode(XFS_I(inode))) {
+			error = xfs_map_cow(wpc, inode, offset, &new_type);
+			if (error)
+				goto out;
+		}
+
+		if (wpc->io_type != new_type) {
+			wpc->io_type = new_type;
+			wpc->imap_valid = false;
+		}
+
 		if (wpc->imap_valid)
 			wpc->imap_valid = xfs_imap_valid(inode, &wpc->imap,
 							 offset);
diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
index 1950e3b..b3c6634 100644
--- a/fs/xfs/xfs_aops.h
+++ b/fs/xfs/xfs_aops.h
@@ -28,13 +28,15 @@ enum {
 	XFS_IO_DELALLOC,	/* covers delalloc region */
 	XFS_IO_UNWRITTEN,	/* covers allocated but uninitialized data */
 	XFS_IO_OVERWRITE,	/* covers already allocated extent */
+	XFS_IO_COW,		/* covers copy-on-write extent */
 };
 
 #define XFS_IO_TYPES \
 	{ XFS_IO_INVALID,		"invalid" }, \
 	{ XFS_IO_DELALLOC,		"delalloc" }, \
 	{ XFS_IO_UNWRITTEN,		"unwritten" }, \
-	{ XFS_IO_OVERWRITE,		"overwrite" }
+	{ XFS_IO_OVERWRITE,		"overwrite" }, \
+	{ XFS_IO_COW,			"CoW" }
 
 /*
  * Structure for buffered I/O completions.
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 05a7fe6..e8c7c85 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -314,3 +314,89 @@ xfs_reflink_reserve_cow_range(
 
 	return error;
 }
+
+/*
+ * Find the CoW reservation (and whether or not it needs block allocation)
+ * for a given byte offset of a file.
+ */
+bool
+xfs_reflink_find_cow_mapping(
+	struct xfs_inode		*ip,
+	xfs_off_t			offset,
+	struct xfs_bmbt_irec		*imap,
+	bool				*need_alloc)
+{
+	struct xfs_bmbt_irec		irec;
+	struct xfs_ifork		*ifp;
+	struct xfs_bmbt_rec_host	*gotp;
+	xfs_fileoff_t			bno;
+	xfs_extnum_t			idx;
+
+	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL | XFS_ILOCK_SHARED));
+
+	if (!xfs_is_reflink_inode(ip))
+		return false;
+
+	/* Find the extent in the CoW fork. */
+	ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
+	bno = XFS_B_TO_FSBT(ip->i_mount, offset);
+	gotp = xfs_iext_bno_to_ext(ifp, bno, &idx);
+	if (!gotp)
+		return false;
+
+	xfs_bmbt_get_all(gotp, &irec);
+	if (bno >= irec.br_startoff + irec.br_blockcount ||
+	    bno < irec.br_startoff)
+		return false;
+
+	trace_xfs_reflink_find_cow_mapping(ip, offset, 1, XFS_IO_OVERWRITE,
+			&irec);
+
+	/* If it's still delalloc, we must allocate later. */
+	*imap = irec;
+	*need_alloc = !!(isnullstartblock(irec.br_startblock));
+
+	return true;
+}
+
+/*
+ * Trim an extent to end at the next CoW reservation past offset_fsb.
+ */
+int
+xfs_reflink_trim_irec_to_next_cow(
+	struct xfs_inode		*ip,
+	xfs_fileoff_t			offset_fsb,
+	struct xfs_bmbt_irec		*imap)
+{
+	struct xfs_bmbt_irec		irec;
+	struct xfs_ifork		*ifp;
+	struct xfs_bmbt_rec_host	*gotp;
+	xfs_extnum_t			idx;
+
+	if (!xfs_is_reflink_inode(ip))
+		return 0;
+
+	/* Find the extent in the CoW fork. */
+	ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
+	gotp = xfs_iext_bno_to_ext(ifp, offset_fsb, &idx);
+	if (!gotp)
+		return 0;
+	xfs_bmbt_get_all(gotp, &irec);
+
+	/* This is the extent before; try sliding up one. */
+	if (irec.br_startoff < offset_fsb) {
+		idx++;
+		if (idx >= ifp->if_bytes / sizeof(xfs_bmbt_rec_t))
+			return 0;
+		gotp = xfs_iext_get_ext(ifp, idx);
+		xfs_bmbt_get_all(gotp, &irec);
+	}
+
+	if (irec.br_startoff >= imap->br_startoff + imap->br_blockcount)
+		return 0;
+
+	imap->br_blockcount = irec.br_startoff - imap->br_startoff;
+	trace_xfs_reflink_trim_irec(ip, imap);
+
+	return 0;
+}
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index f824f87..11408c0 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -28,5 +28,9 @@ extern int xfs_reflink_trim_around_shared(struct xfs_inode *ip,
 
 extern int xfs_reflink_reserve_cow_range(struct xfs_inode *ip,
 		xfs_off_t offset, xfs_off_t count);
+extern bool xfs_reflink_find_cow_mapping(struct xfs_inode *ip, xfs_off_t offset,
+		struct xfs_bmbt_irec *imap, bool *need_alloc);
+extern int xfs_reflink_trim_irec_to_next_cow(struct xfs_inode *ip,
+		xfs_fileoff_t offset_fsb, struct xfs_bmbt_irec *imap);
 
 #endif /* __XFS_REFLINK_H */



* [PATCH 34/63] xfs: support removing extents from CoW fork
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (32 preceding siblings ...)
  2016-09-30  3:09 ` [PATCH 33/63] xfs: allocate " Darrick J. Wong
@ 2016-09-30  3:09 ` Darrick J. Wong
  2016-09-30  7:46   ` Christoph Hellwig
  2016-10-05 18:26   ` Brian Foster
  2016-09-30  3:09 ` [PATCH 35/63] xfs: move mappings from cow fork to data fork after copy-write Darrick J. Wong
                   ` (29 subsequent siblings)
  63 siblings, 2 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:09 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel

Create a helper function to remove extents from the CoW fork without
any of the side effects (rmapbt/bmbt updates) of the regular extent
deletion routine.  We'll eventually use this to clear out the CoW fork
during ioend processing.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
v2: Use bmapi_read to iterate and trim the CoW extents instead of
reading them raw via the iext code.
---
 fs/xfs/libxfs/xfs_bmap.c |  176 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_bmap.h |    1 
 2 files changed, 177 insertions(+)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 85a0c86..451f3e4 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -4906,6 +4906,7 @@ xfs_bmap_del_extent(
 		/*
 		 * Matches the whole extent.  Delete the entry.
 		 */
+		trace_xfs_bmap_pre_update(ip, *idx, state, _THIS_IP_);
 		xfs_iext_remove(ip, *idx, 1,
 				whichfork == XFS_ATTR_FORK ? BMAP_ATTRFORK : 0);
 		--*idx;
@@ -5123,6 +5124,181 @@ xfs_bmap_del_extent(
 }
 
 /*
+ * xfs_bunmapi_cow() -- Remove the relevant parts of the CoW fork.
+ *			See xfs_bmap_del_extent.
+ * @ip: XFS inode.
+ * @del: Extent to remove from the CoW fork.
+ * Returns 0 on success or a negative errno.
+ */
+int
+xfs_bunmapi_cow(
+	xfs_inode_t		*ip,
+	xfs_bmbt_irec_t		*del)
+{
+	xfs_filblks_t		da_new;	/* new delay-alloc indirect blocks */
+	xfs_filblks_t		da_old;	/* old delay-alloc indirect blocks */
+	xfs_fsblock_t		del_endblock = 0;/* first block past del */
+	xfs_fileoff_t		del_endoff;	/* first offset past del */
+	int			delay;	/* current block is delayed allocated */
+	xfs_bmbt_rec_host_t	*ep;	/* current extent entry pointer */
+	int			error;	/* error return value */
+	xfs_bmbt_irec_t		got;	/* current extent entry */
+	xfs_fileoff_t		got_endoff;	/* first offset past got */
+	xfs_ifork_t		*ifp;	/* inode fork pointer */
+	xfs_mount_t		*mp;	/* mount structure */
+	xfs_filblks_t		nblks;	/* quota/sb block count */
+	xfs_bmbt_irec_t		new;	/* new record to be inserted */
+	/* REFERENCED */
+	uint			qfield;	/* quota field to update */
+	xfs_filblks_t		temp;	/* for indirect length calculations */
+	xfs_filblks_t		temp2;	/* for indirect length calculations */
+	int			state = BMAP_COWFORK;
+	int			eof;
+	xfs_extnum_t		eidx;
+
+	mp = ip->i_mount;
+	XFS_STATS_INC(mp, xs_del_exlist);
+
+	ep = xfs_bmap_search_extents(ip, del->br_startoff, XFS_COW_FORK, &eof,
+			&eidx, &got, &new);
+
+	ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK); ifp = ifp; /* for ASSERTs */
+	ASSERT((eidx >= 0) && (eidx < ifp->if_bytes /
+		(uint)sizeof(xfs_bmbt_rec_t)));
+	ASSERT(del->br_blockcount > 0);
+	ASSERT(got.br_startoff <= del->br_startoff);
+	del_endoff = del->br_startoff + del->br_blockcount;
+	got_endoff = got.br_startoff + got.br_blockcount;
+	ASSERT(got_endoff >= del_endoff);
+	delay = isnullstartblock(got.br_startblock);
+	ASSERT(isnullstartblock(del->br_startblock) == delay);
+	qfield = 0;
+	error = 0;
+	/*
+	 * If deleting a real allocation, must free up the disk space.
+	 */
+	if (!delay) {
+		nblks = del->br_blockcount;
+		qfield = XFS_TRANS_DQ_BCOUNT;
+		/*
+		 * Set up del_endblock and cur for later.
+		 */
+		del_endblock = del->br_startblock + del->br_blockcount;
+		da_old = da_new = 0;
+	} else {
+		da_old = startblockval(got.br_startblock);
+		da_new = 0;
+		nblks = 0;
+	}
+	qfield = qfield;	/* placate set-but-unused warnings */
+	nblks = nblks;
+
+	/*
+	 * Set flag value to use in switch statement.
+	 * Left-contig is 2, right-contig is 1.
+	 */
+	switch (((got.br_startoff == del->br_startoff) << 1) |
+		(got_endoff == del_endoff)) {
+	case 3:
+		/*
+		 * Matches the whole extent.  Delete the entry.
+		 */
+		xfs_iext_remove(ip, eidx, 1, BMAP_COWFORK);
+		--eidx;
+		break;
+
+	case 2:
+		/*
+		 * Deleting the first part of the extent.
+		 */
+		trace_xfs_bmap_pre_update(ip, eidx, state, _THIS_IP_);
+		xfs_bmbt_set_startoff(ep, del_endoff);
+		temp = got.br_blockcount - del->br_blockcount;
+		xfs_bmbt_set_blockcount(ep, temp);
+		if (delay) {
+			temp = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(ip, temp),
+				da_old);
+			xfs_bmbt_set_startblock(ep, nullstartblock((int)temp));
+			trace_xfs_bmap_post_update(ip, eidx, state, _THIS_IP_);
+			da_new = temp;
+			break;
+		}
+		xfs_bmbt_set_startblock(ep, del_endblock);
+		trace_xfs_bmap_post_update(ip, eidx, state, _THIS_IP_);
+		break;
+
+	case 1:
+		/*
+		 * Deleting the last part of the extent.
+		 */
+		temp = got.br_blockcount - del->br_blockcount;
+		trace_xfs_bmap_pre_update(ip, eidx, state, _THIS_IP_);
+		xfs_bmbt_set_blockcount(ep, temp);
+		if (delay) {
+			temp = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(ip, temp),
+				da_old);
+			xfs_bmbt_set_startblock(ep, nullstartblock((int)temp));
+			trace_xfs_bmap_post_update(ip, eidx, state, _THIS_IP_);
+			da_new = temp;
+			break;
+		}
+		trace_xfs_bmap_post_update(ip, eidx, state, _THIS_IP_);
+		break;
+
+	case 0:
+		/*
+		 * Deleting the middle of the extent.
+		 */
+		temp = del->br_startoff - got.br_startoff;
+		trace_xfs_bmap_pre_update(ip, eidx, state, _THIS_IP_);
+		xfs_bmbt_set_blockcount(ep, temp);
+		new.br_startoff = del_endoff;
+		temp2 = got_endoff - del_endoff;
+		new.br_blockcount = temp2;
+		new.br_state = got.br_state;
+		if (!delay) {
+			new.br_startblock = del_endblock;
+		} else {
+			temp = xfs_bmap_worst_indlen(ip, temp);
+			xfs_bmbt_set_startblock(ep, nullstartblock((int)temp));
+			temp2 = xfs_bmap_worst_indlen(ip, temp2);
+			new.br_startblock = nullstartblock((int)temp2);
+			da_new = temp + temp2;
+			while (da_new > da_old) {
+				if (temp) {
+					temp--;
+					da_new--;
+					xfs_bmbt_set_startblock(ep,
+						nullstartblock((int)temp));
+				}
+				if (da_new == da_old)
+					break;
+				if (temp2) {
+					temp2--;
+					da_new--;
+					new.br_startblock =
+						nullstartblock((int)temp2);
+				}
+			}
+		}
+		trace_xfs_bmap_post_update(ip, eidx, state, _THIS_IP_);
+		xfs_iext_insert(ip, eidx + 1, 1, &new, state);
+		++eidx;
+		break;
+	}
+
+	/*
+	 * Account for change in delayed indirect blocks.
+	 * Nothing to do for disk quota accounting here.
+	 */
+	ASSERT(da_old >= da_new);
+	if (da_old > da_new)
+		xfs_mod_fdblocks(mp, (int64_t)(da_old - da_new), false);
+
+	return error;
+}
+
+/*
  * Unmap (remove) blocks from a file.
  * If nexts is nonzero then the number of extents to remove is limited to
  * that value.  If not all extents in the block range can be removed then
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 75b1a1f..7c4ad01 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -221,6 +221,7 @@ int	xfs_bunmapi(struct xfs_trans *tp, struct xfs_inode *ip,
 		xfs_fileoff_t bno, xfs_filblks_t len, int flags,
 		xfs_extnum_t nexts, xfs_fsblock_t *firstblock,
 		struct xfs_defer_ops *dfops, int *done);
+int	xfs_bunmapi_cow(struct xfs_inode *ip, struct xfs_bmbt_irec *del);
 int	xfs_check_nostate_extents(struct xfs_ifork *ifp, xfs_extnum_t idx,
 		xfs_extnum_t num);
 uint	xfs_default_attroffset(struct xfs_inode *ip);



* [PATCH 35/63] xfs: move mappings from cow fork to data fork after copy-write
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (33 preceding siblings ...)
  2016-09-30  3:09 ` [PATCH 34/63] xfs: support removing extents from " Darrick J. Wong
@ 2016-09-30  3:09 ` Darrick J. Wong
  2016-10-05 18:26   ` Brian Foster
  2016-09-30  3:09 ` [PATCH 36/63] xfs: report shared extent mappings to userspace correctly Darrick J. Wong
                   ` (28 subsequent siblings)
  63 siblings, 1 reply; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:09 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel, Christoph Hellwig

After the write component of a copy-write operation finishes, clean up
the bookkeeping left behind.  On error, we simply free the new blocks
and pass the error up.  If we succeed, however, then we must remove
the old data fork mapping and move the cow fork mapping to the data
fork.
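
The remap loop below repeatedly unmaps a piece of the data fork and then trims the CoW extent to whatever actually got unmapped. The trimming itself is plain interval arithmetic; here is a minimal standalone version of the `xfs_trim_extent()` helper this patch introduces, with a toy record type and without the DELAYSTARTBLOCK/HOLESTARTBLOCK special cases:

```c
#include <assert.h>
#include <stdint.h>

/* Toy mapping record: logical start, physical start, length in blocks. */
struct irec {
	uint64_t br_startoff;
	uint64_t br_startblock;
	uint64_t br_blockcount;
};

/*
 * Trim *irec so it lies entirely within [bno, bno + len).  When the
 * front is cut off, the physical start must advance by the same
 * distance as the logical start; a range with no overlap collapses to
 * a zero-length extent.
 */
static void trim_extent(struct irec *irec, uint64_t bno, uint64_t len)
{
	uint64_t end = bno + len;
	uint64_t distance;

	if (irec->br_startoff + irec->br_blockcount <= bno ||
	    irec->br_startoff >= end) {
		irec->br_blockcount = 0;	/* no overlap at all */
		return;
	}

	if (irec->br_startoff < bno) {
		distance = bno - irec->br_startoff;
		irec->br_startblock += distance;
		irec->br_startoff += distance;
		irec->br_blockcount -= distance;
	}

	if (end < irec->br_startoff + irec->br_blockcount) {
		distance = irec->br_startoff + irec->br_blockcount - end;
		irec->br_blockcount -= distance;
	}
}
```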

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
[hch: Call the CoW failure function during xfs_cancel_ioend]
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
v2: If CoW fails, we need to remove the CoW fork mapping and free the
blocks.  Furthermore, if xfs_cancel_ioend happens, we also need to
clean out all the CoW record keeping.

v3: When we're removing CoW extents, only free one extent per
transaction to avoid running out of reservation.  Also,
xfs_cancel_ioend mustn't clean out the CoW fork because it is called
when async writeback can't get an inode lock and will try again.

v4: Use bmapi_read to iterate the CoW fork instead of calling the
iext functions directly, and make the CoW remapping atomic by
using the deferred ops mechanism which takes care of logging redo
items for us.

v5: Unlock the inode if cancelling the CoW reservation fails.
---
 fs/xfs/xfs_aops.c    |   22 ++++
 fs/xfs/xfs_reflink.c |  271 ++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_reflink.h |    8 +
 3 files changed, 299 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 7b1e9de..aa23993 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -288,6 +288,23 @@ xfs_end_io(
 		error = -EIO;
 
 	/*
+	 * For a CoW extent, we need to move the mapping from the CoW fork
+	 * to the data fork.  If instead an error happened, just dump the
+	 * new blocks.
+	 */
+	if (ioend->io_type == XFS_IO_COW) {
+		if (ioend->io_bio->bi_error) {
+			error = xfs_reflink_cancel_cow_range(ip,
+					ioend->io_offset, ioend->io_size);
+			goto done;
+		}
+		error = xfs_reflink_end_cow(ip, ioend->io_offset,
+				ioend->io_size);
+		if (error)
+			goto done;
+	}
+
+	/*
 	 * For unwritten extents we need to issue transactions to convert a
 	 * range to normal written extens after the data I/O has finished.
 	 * Detecting and handling completion IO errors is done individually
@@ -302,7 +319,8 @@ xfs_end_io(
 	} else if (ioend->io_append_trans) {
 		error = xfs_setfilesize_ioend(ioend, error);
 	} else {
-		ASSERT(!xfs_ioend_is_append(ioend));
+		ASSERT(!xfs_ioend_is_append(ioend) ||
+		       ioend->io_type == XFS_IO_COW);
 	}
 
 done:
@@ -316,7 +334,7 @@ xfs_end_bio(
 	struct xfs_ioend	*ioend = bio->bi_private;
 	struct xfs_mount	*mp = XFS_I(ioend->io_inode)->i_mount;
 
-	if (ioend->io_type == XFS_IO_UNWRITTEN)
+	if (ioend->io_type == XFS_IO_UNWRITTEN || ioend->io_type == XFS_IO_COW)
 		queue_work(mp->m_unwritten_workqueue, &ioend->io_work);
 	else if (ioend->io_append_trans)
 		queue_work(mp->m_data_workqueue, &ioend->io_work);
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index e8c7c85..d913ad1 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -52,6 +52,7 @@
 #include "xfs_bmap_btree.h"
 #include "xfs_reflink.h"
 #include "xfs_iomap.h"
+#include "xfs_rmap_btree.h"
 
 /*
  * Copy on Write of Shared Blocks
@@ -114,6 +115,37 @@
  * ioend, the better.
  */
 
+/* Trim extent to fit a logical block range. */
+static void
+xfs_trim_extent(
+	struct xfs_bmbt_irec	*irec,
+	xfs_fileoff_t		bno,
+	xfs_filblks_t		len)
+{
+	xfs_fileoff_t		distance;
+	xfs_fileoff_t		end = bno + len;
+
+	if (irec->br_startoff + irec->br_blockcount <= bno ||
+	    irec->br_startoff >= end) {
+		irec->br_blockcount = 0;
+		return;
+	}
+
+	if (irec->br_startoff < bno) {
+		distance = bno - irec->br_startoff;
+		if (irec->br_startblock != DELAYSTARTBLOCK &&
+		    irec->br_startblock != HOLESTARTBLOCK)
+			irec->br_startblock += distance;
+		irec->br_startoff += distance;
+		irec->br_blockcount -= distance;
+	}
+
+	if (end < irec->br_startoff + irec->br_blockcount) {
+		distance = irec->br_startoff + irec->br_blockcount - end;
+		irec->br_blockcount -= distance;
+	}
+}
+
 /*
  * Given an AG extent, find the lowest-numbered run of shared blocks within
  * that range and return the range in fbno/flen.
@@ -400,3 +432,242 @@ xfs_reflink_trim_irec_to_next_cow(
 
 	return 0;
 }
+
+/*
+ * Cancel all pending CoW reservations for some block range of an inode.
+ */
+int
+xfs_reflink_cancel_cow_blocks(
+	struct xfs_inode		*ip,
+	struct xfs_trans		**tpp,
+	xfs_fileoff_t			offset_fsb,
+	xfs_fileoff_t			end_fsb)
+{
+	struct xfs_bmbt_irec		irec;
+	xfs_filblks_t			count_fsb;
+	xfs_fsblock_t			firstfsb;
+	struct xfs_defer_ops		dfops;
+	int				error = 0;
+	int				nimaps;
+
+	if (!xfs_is_reflink_inode(ip))
+		return 0;
+
+	/* Go find the old extent in the CoW fork. */
+	count_fsb = (xfs_filblks_t)(end_fsb - offset_fsb);
+	while (count_fsb) {
+		nimaps = 1;
+		error = xfs_bmapi_read(ip, offset_fsb, count_fsb, &irec,
+				&nimaps, XFS_BMAPI_COWFORK);
+		if (error)
+			break;
+		ASSERT(nimaps == 1);
+
+		trace_xfs_reflink_cancel_cow(ip, &irec);
+
+		if (irec.br_startblock == DELAYSTARTBLOCK) {
+			/* Free a delayed allocation. */
+			xfs_mod_fdblocks(ip->i_mount, irec.br_blockcount,
+					false);
+			ip->i_delayed_blks -= irec.br_blockcount;
+
+			/* Remove the mapping from the CoW fork. */
+			error = xfs_bunmapi_cow(ip, &irec);
+			if (error)
+				break;
+		} else if (irec.br_startblock == HOLESTARTBLOCK) {
+			/* empty */
+		} else {
+			xfs_trans_ijoin(*tpp, ip, 0);
+			xfs_defer_init(&dfops, &firstfsb);
+
+			xfs_bmap_add_free(ip->i_mount, &dfops,
+					irec.br_startblock, irec.br_blockcount,
+					NULL);
+
+			/* Update quota accounting */
+			xfs_trans_mod_dquot_byino(*tpp, ip, XFS_TRANS_DQ_BCOUNT,
+					-(long)irec.br_blockcount);
+
+			/* Roll the transaction */
+			error = xfs_defer_finish(tpp, &dfops, ip);
+			if (error) {
+				xfs_defer_cancel(&dfops);
+				break;
+			}
+
+			/* Remove the mapping from the CoW fork. */
+			error = xfs_bunmapi_cow(ip, &irec);
+			if (error)
+				break;
+		}
+
+		/* Roll on... */
+		count_fsb -= irec.br_startoff + irec.br_blockcount - offset_fsb;
+		offset_fsb = irec.br_startoff + irec.br_blockcount;
+	}
+
+	return error;
+}
+
+/*
+ * Cancel all pending CoW reservations for some byte range of an inode.
+ */
+int
+xfs_reflink_cancel_cow_range(
+	struct xfs_inode	*ip,
+	xfs_off_t		offset,
+	xfs_off_t		count)
+{
+	struct xfs_trans	*tp;
+	xfs_fileoff_t		offset_fsb;
+	xfs_fileoff_t		end_fsb;
+	int			error;
+
+	trace_xfs_reflink_cancel_cow_range(ip, offset, count);
+
+	offset_fsb = XFS_B_TO_FSBT(ip->i_mount, offset);
+	if (count == NULLFILEOFF)
+		end_fsb = NULLFILEOFF;
+	else
+		end_fsb = XFS_B_TO_FSB(ip->i_mount, offset + count);
+
+	/* Start a rolling transaction to remove the mappings */
+	error = xfs_trans_alloc(ip->i_mount, &M_RES(ip->i_mount)->tr_write,
+			0, 0, 0, &tp);
+	if (error)
+		goto out;
+
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	xfs_trans_ijoin(tp, ip, 0);
+
+	/* Scrape out the old CoW reservations */
+	error = xfs_reflink_cancel_cow_blocks(ip, &tp, offset_fsb, end_fsb);
+	if (error)
+		goto out_defer;
+
+	error = xfs_trans_commit(tp);
+
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	return error;
+
+out_defer:
+	xfs_trans_cancel(tp);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+out:
+	trace_xfs_reflink_cancel_cow_range_error(ip, error, _RET_IP_);
+	return error;
+}
+
+/*
+ * Remap parts of a file's data fork after a successful CoW.
+ */
+int
+xfs_reflink_end_cow(
+	struct xfs_inode		*ip,
+	xfs_off_t			offset,
+	xfs_off_t			count)
+{
+	struct xfs_bmbt_irec		irec;
+	struct xfs_bmbt_irec		uirec;
+	struct xfs_trans		*tp;
+	xfs_fileoff_t			offset_fsb;
+	xfs_fileoff_t			end_fsb;
+	xfs_filblks_t			count_fsb;
+	xfs_fsblock_t			firstfsb;
+	struct xfs_defer_ops		dfops;
+	int				error;
+	unsigned int			resblks;
+	xfs_filblks_t			ilen;
+	xfs_filblks_t			rlen;
+	int				nimaps;
+
+	trace_xfs_reflink_end_cow(ip, offset, count);
+
+	offset_fsb = XFS_B_TO_FSBT(ip->i_mount, offset);
+	end_fsb = XFS_B_TO_FSB(ip->i_mount, offset + count);
+	count_fsb = (xfs_filblks_t)(end_fsb - offset_fsb);
+
+	/* Start a rolling transaction to switch the mappings */
+	resblks = XFS_EXTENTADD_SPACE_RES(ip->i_mount, XFS_DATA_FORK);
+	error = xfs_trans_alloc(ip->i_mount, &M_RES(ip->i_mount)->tr_write,
+			resblks, 0, 0, &tp);
+	if (error)
+		goto out;
+
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	xfs_trans_ijoin(tp, ip, 0);
+
+	/* Go find the old extent in the CoW fork. */
+	while (count_fsb) {
+		/* Read extent from the source file */
+		nimaps = 1;
+		error = xfs_bmapi_read(ip, offset_fsb, count_fsb, &irec,
+				&nimaps, XFS_BMAPI_COWFORK);
+		if (error)
+			goto out_cancel;
+		ASSERT(nimaps == 1);
+
+		ASSERT(irec.br_startblock != DELAYSTARTBLOCK);
+		trace_xfs_reflink_cow_remap(ip, &irec);
+
+		/*
+		 * We can have a hole in the CoW fork if part of a directio
+		 * write is CoW but part of it isn't.
+		 */
+		rlen = ilen = irec.br_blockcount;
+		if (irec.br_startblock == HOLESTARTBLOCK)
+			goto next_extent;
+
+		/* Unmap the old blocks in the data fork. */
+		while (rlen) {
+			xfs_defer_init(&dfops, &firstfsb);
+			error = __xfs_bunmapi(tp, ip, irec.br_startoff,
+					&rlen, 0, 1, &firstfsb, &dfops);
+			if (error)
+				goto out_defer;
+
+			/* Trim the extent to whatever got unmapped. */
+			uirec = irec;
+			xfs_trim_extent(&uirec, irec.br_startoff + rlen,
+					irec.br_blockcount - rlen);
+			irec.br_blockcount = rlen;
+			trace_xfs_reflink_cow_remap_piece(ip, &uirec);
+
+			/* Map the new blocks into the data fork. */
+			error = xfs_bmap_map_extent(tp->t_mountp, &dfops,
+					ip, XFS_DATA_FORK, &uirec);
+			if (error)
+				goto out_defer;
+
+			/* Remove the mapping from the CoW fork. */
+			error = xfs_bunmapi_cow(ip, &uirec);
+			if (error)
+				goto out_defer;
+
+			error = xfs_defer_finish(&tp, &dfops, ip);
+			if (error)
+				goto out_defer;
+		}
+
+next_extent:
+		/* Roll on... */
+		count_fsb -= irec.br_startoff + ilen - offset_fsb;
+		offset_fsb = irec.br_startoff + ilen;
+	}
+
+	error = xfs_trans_commit(tp);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	if (error)
+		goto out;
+	return 0;
+
+out_defer:
+	xfs_defer_cancel(&dfops);
+out_cancel:
+	xfs_trans_cancel(tp);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+out:
+	trace_xfs_reflink_end_cow_error(ip, error, _RET_IP_);
+	return error;
+}
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index 11408c0..bffa4be 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -33,4 +33,12 @@ extern bool xfs_reflink_find_cow_mapping(struct xfs_inode *ip, xfs_off_t offset,
 extern int xfs_reflink_trim_irec_to_next_cow(struct xfs_inode *ip,
 		xfs_fileoff_t offset_fsb, struct xfs_bmbt_irec *imap);
 
+extern int xfs_reflink_cancel_cow_blocks(struct xfs_inode *ip,
+		struct xfs_trans **tpp, xfs_fileoff_t offset_fsb,
+		xfs_fileoff_t end_fsb);
+extern int xfs_reflink_cancel_cow_range(struct xfs_inode *ip, xfs_off_t offset,
+		xfs_off_t count);
+extern int xfs_reflink_end_cow(struct xfs_inode *ip, xfs_off_t offset,
+		xfs_off_t count);
+
 #endif /* __XFS_REFLINK_H */



* [PATCH 36/63] xfs: report shared extent mappings to userspace correctly
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (34 preceding siblings ...)
  2016-09-30  3:09 ` [PATCH 35/63] xfs: move mappings from cow fork to data fork after copy-write Darrick J. Wong
@ 2016-09-30  3:09 ` Darrick J. Wong
  2016-09-30  3:09 ` [PATCH 37/63] xfs: implement CoW for directio writes Darrick J. Wong
                   ` (27 subsequent siblings)
  63 siblings, 0 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:09 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel, Christoph Hellwig

Report shared extents through the iomap interface so that FIEMAP flags
shared blocks accurately.  Have xfs_vm_bmap return zero for reflinked
files because the bmap-based swap code requires static block mappings,
which is incompatible with copy on write.

NOTE: Existing userspace bmap users such as lilo will have the same
problem with reflink files.
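
The guard in `xfs_vm_bmap()` leans on the long-standing convention that `->bmap` returns 0 to mean "no usable static mapping". A toy model of that convention (hypothetical names, not the kernel API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy inode carrying only what the check consults. */
struct toy_inode {
	bool is_reflink;
	uint64_t phys_start;	/* pretend static mapping; block 0 never valid */
};

/*
 * Model of the ->bmap convention: return the physical block backing
 * logical block 'lblock', or 0 to signal "error, no static mapping".
 * Reflink inodes must refuse, since CoW can move their blocks after
 * the caller has cached the answer.
 */
static uint64_t toy_bmap(const struct toy_inode *ip, uint64_t lblock)
{
	if (ip->is_reflink)
		return 0;	/* the magic "error" value for ->bmap */
	return ip->phys_start + lblock;
}
```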

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
v2: combine the fiemap and bmap reporting patches because they're both
related and trivially short.
---
 fs/xfs/xfs_aops.c  |   11 +++++++++++
 fs/xfs/xfs_iomap.c |   12 +++++++++++-
 2 files changed, 22 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index aa23993..1d0435a 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1509,6 +1509,17 @@ xfs_vm_bmap(
 
 	trace_xfs_vm_bmap(XFS_I(inode));
 	xfs_ilock(ip, XFS_IOLOCK_SHARED);
+
+	/*
+	 * The swap code (ab-)uses ->bmap to get a block mapping and then
+	 * bypasses the file system for actual I/O.  We really can't allow
+	 * that on reflink inodes, so we have to skip out here.  And yes,
+	 * 0 is the magic code for a bmap error.
+	 */
+	if (xfs_is_reflink_inode(ip)) {
+		xfs_iunlock(ip, XFS_IOLOCK_SHARED);
+		return 0;
+	}
 	filemap_write_and_wait(mapping);
 	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
 	return generic_block_bmap(mapping, block, xfs_get_blocks);
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index ad6939d..765849e 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -961,6 +961,7 @@ xfs_file_iomap_begin(
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_bmbt_irec	imap;
 	xfs_fileoff_t		offset_fsb, end_fsb;
+	bool			shared, trimmed;
 	int			nimaps = 1, error = 0;
 	unsigned		lockmode;
 
@@ -989,7 +990,14 @@ xfs_file_iomap_begin(
 	end_fsb = XFS_B_TO_FSB(mp, offset + length);
 
 	error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb, &imap,
-			       &nimaps, XFS_BMAPI_ENTIRE);
+			       &nimaps, 0);
+	if (error) {
+		xfs_iunlock(ip, lockmode);
+		return error;
+	}
+
+	/* Trim the mapping to the nearest shared extent boundary. */
+	error = xfs_reflink_trim_around_shared(ip, &imap, &shared, &trimmed);
 	if (error) {
 		xfs_iunlock(ip, lockmode);
 		return error;
@@ -1028,6 +1036,8 @@ xfs_file_iomap_begin(
 	}
 
 	xfs_bmbt_to_iomap(ip, iomap, &imap);
+	if (shared)
+		iomap->flags |= IOMAP_F_SHARED;
 	return 0;
 }
 



* [PATCH 37/63] xfs: implement CoW for directio writes
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (35 preceding siblings ...)
  2016-09-30  3:09 ` [PATCH 36/63] xfs: report shared extent mappings to userspace correctly Darrick J. Wong
@ 2016-09-30  3:09 ` Darrick J. Wong
  2016-10-05 18:27   ` Brian Foster
  2016-09-30  3:09 ` [PATCH 38/63] xfs: cancel CoW reservations and clear inode reflink flag when freeing blocks Darrick J. Wong
                   ` (26 subsequent siblings)
  63 siblings, 1 reply; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:09 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel, Christoph Hellwig

For O_DIRECT writes to shared blocks, we have to CoW them just like
we would with buffered writes.  For writes that are not block-aligned,
just bounce them to the page cache.

For block-aligned writes, however, we can do better than that.  Use
the same mechanisms that we employ for buffered CoW to set up a
delalloc reservation, allocate all the blocks at once, issue the
writes against the new blocks and use the same ioend functions to
remap the blocks after the write.  This should be fairly performant.
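
Whether a write can take this path comes down to block alignment of both the start offset and the length; anything else gets bounced to the page cache. A sketch of that check (hypothetical helper name; assumes the filesystem block size is a power of two, as XFS block sizes are):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * A directio CoW write can use the allocate-then-remap path only when
 * both its start offset and its byte count are filesystem-block
 * aligned.  With a power-of-two block size both tests reduce to one
 * mask operation on the OR of the two values.
 */
static bool dio_write_is_block_aligned(uint64_t offset, uint64_t count,
				       uint32_t blocksize)
{
	uint64_t mask = blocksize - 1;

	return ((offset | count) & mask) == 0;
}
```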

Christoph discovered that xfs_reflink_allocate_cow_range may stumble
over invalid entries in the extent array given that it drops the ilock
but still expects the index to be stable.  Simply switching to a new
lookup for every iteration still isn't correct, given that
xfs_bmapi_allocate will trigger a BUG_ON() when hitting a hole, and
nothing prevents an xfs_bunmapi_cow call from removing extents once
we have dropped the ilock either.

This patch duplicates the inner loop of xfs_bmapi_allocate into a
helper for xfs_reflink_allocate_cow_range so that it can be done under
the same ilock critical section as our CoW fork delayed allocation.
The directio CoW warts will be revisited in a later patch.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
v2: Turns out that there's no way for xfs_end_io_direct_write to know
if the write completed successfully.  Therefore, do /not/ use the
ioend for dio cow post-processing; instead, move it to xfs_vm_do_dio
where we *can* tell if the write succeeded or not.

v3: Update the file size if we do a directio CoW across EOF.  This
can happen if the last block is shared, the cowextsize hint is set,
and we do a dio write past the end of the file.

v4: Christoph rewrote the allocate code to fix some concurrency
problems as part of migrating the code to support iomap.
---
 fs/xfs/xfs_aops.c    |   91 +++++++++++++++++++++++++++++++++++++++----
 fs/xfs/xfs_file.c    |   20 ++++++++-
 fs/xfs/xfs_reflink.c |  107 +++++++++++++++++++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_reflink.h |    2 +
 fs/xfs/xfs_trace.h   |    1 
 5 files changed, 208 insertions(+), 13 deletions(-)


diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 1d0435a..62a95e4 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -40,6 +40,7 @@
 /* flags for direct write completions */
 #define XFS_DIO_FLAG_UNWRITTEN	(1 << 0)
 #define XFS_DIO_FLAG_APPEND	(1 << 1)
+#define XFS_DIO_FLAG_COW	(1 << 2)
 
 /*
  * structure owned by writepages passed to individual writepage calls
@@ -1191,18 +1192,24 @@ xfs_map_direct(
 	struct inode		*inode,
 	struct buffer_head	*bh_result,
 	struct xfs_bmbt_irec	*imap,
-	xfs_off_t		offset)
+	xfs_off_t		offset,
+	bool			is_cow)
 {
 	uintptr_t		*flags = (uintptr_t *)&bh_result->b_private;
 	xfs_off_t		size = bh_result->b_size;
 
 	trace_xfs_get_blocks_map_direct(XFS_I(inode), offset, size,
-		ISUNWRITTEN(imap) ? XFS_IO_UNWRITTEN : XFS_IO_OVERWRITE, imap);
+		ISUNWRITTEN(imap) ? XFS_IO_UNWRITTEN : is_cow ? XFS_IO_COW :
+		XFS_IO_OVERWRITE, imap);
 
 	if (ISUNWRITTEN(imap)) {
 		*flags |= XFS_DIO_FLAG_UNWRITTEN;
 		set_buffer_defer_completion(bh_result);
-	} else if (offset + size > i_size_read(inode) || offset + size < 0) {
+	} else if (is_cow) {
+		*flags |= XFS_DIO_FLAG_COW;
+		set_buffer_defer_completion(bh_result);
+	}
+	if (offset + size > i_size_read(inode) || offset + size < 0) {
 		*flags |= XFS_DIO_FLAG_APPEND;
 		set_buffer_defer_completion(bh_result);
 	}
@@ -1248,6 +1255,44 @@ xfs_map_trim_size(
 	bh_result->b_size = mapping_size;
 }
 
+/* Bounce unaligned directio writes to the page cache. */
+static int
+xfs_bounce_unaligned_dio_write(
+	struct xfs_inode	*ip,
+	xfs_fileoff_t		offset_fsb,
+	struct xfs_bmbt_irec	*imap)
+{
+	struct xfs_bmbt_irec	irec;
+	xfs_fileoff_t		delta;
+	bool			shared;
+	bool			x;
+	int			error;
+
+	irec = *imap;
+	if (offset_fsb > irec.br_startoff) {
+		delta = offset_fsb - irec.br_startoff;
+		irec.br_blockcount -= delta;
+		irec.br_startblock += delta;
+		irec.br_startoff = offset_fsb;
+	}
+	error = xfs_reflink_trim_around_shared(ip, &irec, &x, &shared);
+	if (error)
+		return error;
+	/*
+	 * Are we doing a DIO write to a shared block?  In
+	 * the ideal world we at least would fork full blocks,
+	 * but for now just fall back to buffered mode.  Yuck.
+	 * Use -EREMCHG ("remote address changed") to signal
+	 * this, since in general XFS doesn't do this sort of
+	 * fallback.
+	 */
+	if (shared) {
+		trace_xfs_reflink_bounce_dio_write(ip, imap);
+		return -EREMCHG;
+	}
+	return 0;
+}
+
 STATIC int
 __xfs_get_blocks(
 	struct inode		*inode,
@@ -1267,6 +1312,8 @@ __xfs_get_blocks(
 	xfs_off_t		offset;
 	ssize_t			size;
 	int			new = 0;
+	bool			is_cow = false;
+	bool			need_alloc = false;
 
 	BUG_ON(create && !direct);
 
@@ -1292,8 +1339,27 @@ __xfs_get_blocks(
 	end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + size);
 	offset_fsb = XFS_B_TO_FSBT(mp, offset);
 
-	error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb,
-				&imap, &nimaps, XFS_BMAPI_ENTIRE);
+	if (create && direct) {
+		is_cow = xfs_reflink_find_cow_mapping(ip, offset, &imap,
+					&need_alloc);
+	}
+
+	if (!is_cow) {
+		error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb,
+					&imap, &nimaps, XFS_BMAPI_ENTIRE);
+		/*
+		 * Truncate an overwrite extent if there's a pending CoW
+		 * reservation before the end of this extent.  This forces us
+		 * to come back to writepage to take care of the CoW.
+		 */
+		if (create && direct && nimaps &&
+		    imap.br_startblock != HOLESTARTBLOCK &&
+		    imap.br_startblock != DELAYSTARTBLOCK &&
+		    !ISUNWRITTEN(&imap))
+			xfs_reflink_trim_irec_to_next_cow(ip, offset_fsb,
+					&imap);
+	}
+	ASSERT(!need_alloc);
 	if (error)
 		goto out_unlock;
 
@@ -1345,6 +1411,13 @@ __xfs_get_blocks(
 	if (imap.br_startblock != HOLESTARTBLOCK &&
 	    imap.br_startblock != DELAYSTARTBLOCK &&
 	    (create || !ISUNWRITTEN(&imap))) {
+		if (create && direct && !is_cow) {
+			error = xfs_bounce_unaligned_dio_write(ip, offset_fsb,
+					&imap);
+			if (error)
+				return error;
+		}
+
 		xfs_map_buffer(inode, bh_result, &imap, offset);
 		if (ISUNWRITTEN(&imap))
 			set_buffer_unwritten(bh_result);
@@ -1353,7 +1426,8 @@ __xfs_get_blocks(
 			if (dax_fault)
 				ASSERT(!ISUNWRITTEN(&imap));
 			else
-				xfs_map_direct(inode, bh_result, &imap, offset);
+				xfs_map_direct(inode, bh_result, &imap, offset,
+						is_cow);
 		}
 	}
 
@@ -1479,7 +1553,10 @@ xfs_end_io_direct_write(
 		trace_xfs_end_io_direct_write_unwritten(ip, offset, size);
 
 		error = xfs_iomap_write_unwritten(ip, offset, size);
-	} else if (flags & XFS_DIO_FLAG_APPEND) {
+	}
+	if (flags & XFS_DIO_FLAG_COW)
+		error = xfs_reflink_end_cow(ip, offset, size);
+	if (flags & XFS_DIO_FLAG_APPEND) {
 		trace_xfs_end_io_direct_write_append(ip, offset, size);
 
 		error = xfs_setfilesize(ip, offset, size);
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index f99d7fa..025d52f 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -38,6 +38,7 @@
 #include "xfs_icache.h"
 #include "xfs_pnfs.h"
 #include "xfs_iomap.h"
+#include "xfs_reflink.h"
 
 #include <linux/dcache.h>
 #include <linux/falloc.h>
@@ -672,6 +673,13 @@ xfs_file_dio_aio_write(
 
 	trace_xfs_file_direct_write(ip, count, iocb->ki_pos);
 
+	/* If this is a block-aligned directio CoW, remap immediately. */
+	if (xfs_is_reflink_inode(ip) && !unaligned_io) {
+		ret = xfs_reflink_allocate_cow_range(ip, iocb->ki_pos, count);
+		if (ret)
+			goto out;
+	}
+
 	data = *from;
 	ret = __blockdev_direct_IO(iocb, inode, target->bt_bdev, &data,
 			xfs_get_blocks_direct, xfs_end_io_direct_write,
@@ -812,10 +820,18 @@ xfs_file_write_iter(
 
 	if (IS_DAX(inode))
 		ret = xfs_file_dax_write(iocb, from);
-	else if (iocb->ki_flags & IOCB_DIRECT)
+	else if (iocb->ki_flags & IOCB_DIRECT) {
+		/*
+		 * Allow DIO to fall back to buffered *only* in the case
+		 * that we're doing a reflink CoW.
+		 */
 		ret = xfs_file_dio_aio_write(iocb, from);
-	else
+		if (ret == -EREMCHG)
+			goto buffered;
+	} else {
+buffered:
 		ret = xfs_file_buffered_aio_write(iocb, from);
+	}
 
 	if (ret > 0) {
 		XFS_STATS_ADD(ip->i_mount, xs_write_bytes, ret);
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index d913ad1..c95cdc3 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -246,7 +246,8 @@ static int
 __xfs_reflink_reserve_cow(
 	struct xfs_inode	*ip,
 	xfs_fileoff_t		*offset_fsb,
-	xfs_fileoff_t		end_fsb)
+	xfs_fileoff_t		end_fsb,
+	bool			*skipped)
 {
 	struct xfs_bmbt_irec	got, prev, imap;
 	xfs_fileoff_t		orig_end_fsb;
@@ -279,8 +280,10 @@ __xfs_reflink_reserve_cow(
 	end_fsb = orig_end_fsb = imap.br_startoff + imap.br_blockcount;
 
 	/* Not shared?  Just report the (potentially capped) extent. */
-	if (!shared)
+	if (!shared) {
+		*skipped = true;
 		goto done;
+	}
 
 	/*
 	 * Fork all the shared blocks from our write offset until the end of
@@ -326,6 +329,7 @@ xfs_reflink_reserve_cow_range(
 {
 	struct xfs_mount	*mp = ip->i_mount;
 	xfs_fileoff_t		offset_fsb, end_fsb;
+	bool			skipped = false;
 	int			error;
 
 	trace_xfs_reflink_reserve_cow_range(ip, offset, count);
@@ -335,7 +339,8 @@ xfs_reflink_reserve_cow_range(
 
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
 	while (offset_fsb < end_fsb) {
-		error = __xfs_reflink_reserve_cow(ip, &offset_fsb, end_fsb);
+		error = __xfs_reflink_reserve_cow(ip, &offset_fsb, end_fsb,
+				&skipped);
 		if (error) {
 			trace_xfs_reflink_reserve_cow_range_error(ip, error,
 				_RET_IP_);
@@ -347,6 +352,102 @@ xfs_reflink_reserve_cow_range(
 	return error;
 }
 
+/* Allocate all CoW reservations covering a range of blocks in a file. */
+static int
+__xfs_reflink_allocate_cow(
+	struct xfs_inode	*ip,
+	xfs_fileoff_t		*offset_fsb,
+	xfs_fileoff_t		end_fsb)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_bmbt_irec	imap;
+	struct xfs_defer_ops	dfops;
+	struct xfs_trans	*tp;
+	xfs_fsblock_t		first_block;
+	xfs_fileoff_t		next_fsb;
+	int			nimaps = 1, error;
+	bool			skipped = false;
+
+	xfs_defer_init(&dfops, &first_block);
+
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0,
+			XFS_TRANS_RESERVE, &tp);
+	if (error)
+		return error;
+
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+
+	next_fsb = *offset_fsb;
+	error = __xfs_reflink_reserve_cow(ip, &next_fsb, end_fsb, &skipped);
+	if (error)
+		goto out_trans_cancel;
+
+	if (skipped) {
+		*offset_fsb = next_fsb;
+		goto out_trans_cancel;
+	}
+
+	xfs_trans_ijoin(tp, ip, 0);
+	error = xfs_bmapi_write(tp, ip, *offset_fsb, next_fsb - *offset_fsb,
+			XFS_BMAPI_COWFORK, &first_block,
+			XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK),
+			&imap, &nimaps, &dfops);
+	if (error)
+		goto out_trans_cancel;
+
+	/* We might not have been able to map the whole delalloc extent */
+	*offset_fsb = min(*offset_fsb + imap.br_blockcount, next_fsb);
+
+	error = xfs_defer_finish(&tp, &dfops, NULL);
+	if (error)
+		goto out_trans_cancel;
+
+	error = xfs_trans_commit(tp);
+
+out_unlock:
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	return error;
+out_trans_cancel:
+	xfs_defer_cancel(&dfops);
+	xfs_trans_cancel(tp);
+	goto out_unlock;
+}
+
+/* Allocate all CoW reservations covering a part of a file. */
+int
+xfs_reflink_allocate_cow_range(
+	struct xfs_inode	*ip,
+	xfs_off_t		offset,
+	xfs_off_t		count)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
+	xfs_fileoff_t		end_fsb = XFS_B_TO_FSB(mp, offset + count);
+	int			error;
+
+	ASSERT(xfs_is_reflink_inode(ip));
+
+	trace_xfs_reflink_allocate_cow_range(ip, offset, count);
+
+	/*
+	 * Make sure that the dquots are there.
+	 */
+	error = xfs_qm_dqattach(ip, 0);
+	if (error)
+		return error;
+
+	while (offset_fsb < end_fsb) {
+		error = __xfs_reflink_allocate_cow(ip, &offset_fsb, end_fsb);
+		if (error) {
+			trace_xfs_reflink_allocate_cow_range_error(ip, error,
+					_RET_IP_);
+			break;
+		}
+	}
+
+	return error;
+}
+
 /*
  * Find the CoW reservation (and whether or not it needs block allocation)
  * for a given byte offset of a file.
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index bffa4be..c0c989a 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -28,6 +28,8 @@ extern int xfs_reflink_trim_around_shared(struct xfs_inode *ip,
 
 extern int xfs_reflink_reserve_cow_range(struct xfs_inode *ip,
 		xfs_off_t offset, xfs_off_t count);
+extern int xfs_reflink_allocate_cow_range(struct xfs_inode *ip,
+		xfs_off_t offset, xfs_off_t count);
 extern bool xfs_reflink_find_cow_mapping(struct xfs_inode *ip, xfs_off_t offset,
 		struct xfs_bmbt_irec *imap, bool *need_alloc);
 extern int xfs_reflink_trim_irec_to_next_cow(struct xfs_inode *ip,
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 7612096..8e89223 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -3332,7 +3332,6 @@ DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_enospc);
 
 DEFINE_RW_EVENT(xfs_reflink_reserve_cow_range);
 DEFINE_RW_EVENT(xfs_reflink_allocate_cow_range);
-DEFINE_INODE_IREC_EVENT(xfs_reflink_allocate_cow_extent);
 
 DEFINE_INODE_IREC_EVENT(xfs_reflink_bounce_dio_write);
 DEFINE_IOMAP_EVENT(xfs_reflink_find_cow_mapping);


^ permalink raw reply related	[flat|nested] 187+ messages in thread

* [PATCH 38/63] xfs: cancel CoW reservations and clear inode reflink flag when freeing blocks
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (36 preceding siblings ...)
  2016-09-30  3:09 ` [PATCH 37/63] xfs: implement CoW for directio writes Darrick J. Wong
@ 2016-09-30  3:09 ` Darrick J. Wong
  2016-09-30  7:47   ` Christoph Hellwig
  2016-10-06 16:44   ` Brian Foster
  2016-09-30  3:09 ` [PATCH 39/63] xfs: cancel pending CoW reservations when destroying inodes Darrick J. Wong
                   ` (25 subsequent siblings)
  63 siblings, 2 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:09 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel

When we're freeing blocks (truncate, punch, etc.), clear all CoW
reservations in the range being freed.  If the file block count
drops to zero, also clear the inode reflink flag.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_inode.c |   13 +++++++++++++
 1 file changed, 13 insertions(+)


diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 0c25a76..8c971fd 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -49,6 +49,7 @@
 #include "xfs_trans_priv.h"
 #include "xfs_log.h"
 #include "xfs_bmap_btree.h"
+#include "xfs_reflink.h"
 
 kmem_zone_t *xfs_inode_zone;
 
@@ -1586,6 +1587,18 @@ xfs_itruncate_extents(
 			goto out;
 	}
 
+	/* Remove all pending CoW reservations. */
+	error = xfs_reflink_cancel_cow_blocks(ip, &tp, first_unmap_block,
+			last_block);
+	if (error)
+		goto out;
+
+	/*
+	 * Clear the reflink flag if we truncated everything.
+	 */
+	if (ip->i_d.di_nblocks == 0 && xfs_is_reflink_inode(ip))
+		ip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
+
 	/*
 	 * Always re-log the inode so that our permanent transaction can keep
 	 * on rolling it forward in the log.




* [PATCH 39/63] xfs: cancel pending CoW reservations when destroying inodes
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (37 preceding siblings ...)
  2016-09-30  3:09 ` [PATCH 38/63] xfs: cancel CoW reservations and clear inode reflink flag when freeing blocks Darrick J. Wong
@ 2016-09-30  3:09 ` Darrick J. Wong
  2016-09-30  7:47   ` Christoph Hellwig
  2016-10-06 16:44   ` Brian Foster
  2016-09-30  3:09 ` [PATCH 40/63] xfs: store in-progress CoW allocations in the refcount btree Darrick J. Wong
                   ` (24 subsequent siblings)
  63 siblings, 2 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:09 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel

When destroying the inode, cancel all pending reservations in the CoW
fork so that all the reserved blocks go back to the free pile.  In
theory this cleanup is only needed after write errors.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_super.c |    8 ++++++++
 1 file changed, 8 insertions(+)


diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 204b794..26b45b3 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -49,6 +49,7 @@
 #include "xfs_rmap_item.h"
 #include "xfs_refcount_item.h"
 #include "xfs_bmap_item.h"
+#include "xfs_reflink.h"
 
 #include <linux/namei.h>
 #include <linux/init.h>
@@ -938,6 +939,7 @@ xfs_fs_destroy_inode(
 	struct inode		*inode)
 {
 	struct xfs_inode	*ip = XFS_I(inode);
+	int			error;
 
 	trace_xfs_destroy_inode(ip);
 
@@ -945,6 +947,12 @@ xfs_fs_destroy_inode(
 	XFS_STATS_INC(ip->i_mount, vn_rele);
 	XFS_STATS_INC(ip->i_mount, vn_remove);
 
+	error = xfs_reflink_cancel_cow_range(ip, 0, NULLFILEOFF);
+	if (error && !XFS_FORCED_SHUTDOWN(ip->i_mount))
+		xfs_warn(ip->i_mount, "Error %d while evicting CoW blocks "
+				"for inode %llu.",
+				error, ip->i_ino);
+
 	xfs_inactive(ip);
 
 	ASSERT(XFS_FORCED_SHUTDOWN(ip->i_mount) || ip->i_delayed_blks == 0);



* [PATCH 40/63] xfs: store in-progress CoW allocations in the refcount btree
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (38 preceding siblings ...)
  2016-09-30  3:09 ` [PATCH 39/63] xfs: cancel pending CoW reservations when destroying inodes Darrick J. Wong
@ 2016-09-30  3:09 ` Darrick J. Wong
  2016-09-30  7:49   ` Christoph Hellwig
  2016-10-07 18:04   ` Brian Foster
  2016-09-30  3:10 ` [PATCH 41/63] xfs: reflink extents from one file to another Darrick J. Wong
                   ` (23 subsequent siblings)
  63 siblings, 2 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:09 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel

Due to the way the CoW algorithm in XFS works, there's an interval
during which blocks allocated to handle a CoW can be lost -- if the FS
goes down after the blocks are allocated but before the block
remapping takes place.  This is exacerbated by the cowextsz hint --
allocated reservations can sit around for a while, waiting to get
used.

Since the refcount btree doesn't normally store records with refcount
of 1, we can use it to record these in-progress extents.  In-progress
blocks cannot be shared because they're not user-visible, so there
shouldn't be any conflicts with other programs.  This is a better
solution than holding EFIs during writeback because (a) EFIs can't be
relogged currently, (b) even if they could, EFIs are bound by
available log space, which puts an unnecessary upper bound on how much
CoW we can have in flight, and (c) we already have a mechanism to
track blocks.

At mount time, read the refcount records and free anything we find
with a refcount of 1 because those were in-progress when the FS went
down.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
v2: Use the deferred operations system to avoid deadlocks and blowing
out the transaction reservation.  This allows us to unmap a CoW
extent from the refcountbt and into a file atomically.
---
 fs/xfs/libxfs/xfs_bmap.c     |   11 +
 fs/xfs/libxfs/xfs_format.h   |    3 
 fs/xfs/libxfs/xfs_refcount.c |  336 +++++++++++++++++++++++++++++++++++++++++-
 fs/xfs/libxfs/xfs_refcount.h |   10 +
 fs/xfs/xfs_mount.c           |   12 ++
 fs/xfs/xfs_refcount_item.c   |   12 ++
 fs/xfs/xfs_reflink.c         |  150 +++++++++++++++++++
 fs/xfs/xfs_reflink.h         |    1 
 fs/xfs/xfs_super.c           |    9 +
 fs/xfs/xfs_trace.h           |    4 +
 10 files changed, 537 insertions(+), 11 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 451f3e4..0ef7fb4 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -4629,6 +4629,17 @@ xfs_bmapi_write(
 				goto error0;
 			if (bma.blkno == NULLFSBLOCK)
 				break;
+
+			/*
+			 * If this is a CoW allocation, record the data in
+			 * the refcount btree for orphan recovery.
+			 */
+			if (whichfork == XFS_COW_FORK) {
+				error = xfs_refcount_alloc_cow_extent(mp, dfops,
+						bma.blkno, bma.length);
+				if (error)
+					goto error0;
+			}
 		}
 
 		/* Deal with the allocated space we found.  */
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 8b826102df..a7ae738 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -1375,7 +1375,8 @@ struct xfs_owner_info {
 #define XFS_RMAP_OWN_INOBT	(-6ULL)	/* Inode btree blocks */
 #define XFS_RMAP_OWN_INODES	(-7ULL)	/* Inode chunk */
 #define XFS_RMAP_OWN_REFC	(-8ULL) /* refcount tree */
-#define XFS_RMAP_OWN_MIN	(-9ULL) /* guard */
+#define XFS_RMAP_OWN_COW	(-9ULL) /* cow allocations */
+#define XFS_RMAP_OWN_MIN	(-10ULL) /* guard */
 
 #define XFS_RMAP_NON_INODE_OWNER(owner)	(!!((owner) & (1ULL << 63)))
 
diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
index 0748c9c..29abc5c 100644
--- a/fs/xfs/libxfs/xfs_refcount.c
+++ b/fs/xfs/libxfs/xfs_refcount.c
@@ -36,13 +36,23 @@
 #include "xfs_trans.h"
 #include "xfs_bit.h"
 #include "xfs_refcount.h"
+#include "xfs_rmap.h"
 
 /* Allowable refcount adjustment amounts. */
 enum xfs_refc_adjust_op {
 	XFS_REFCOUNT_ADJUST_INCREASE	= 1,
 	XFS_REFCOUNT_ADJUST_DECREASE	= -1,
+	XFS_REFCOUNT_ADJUST_COW_ALLOC	= 0,
+	XFS_REFCOUNT_ADJUST_COW_FREE	= -1,
 };
 
+STATIC int __xfs_refcount_cow_alloc(struct xfs_btree_cur *rcur,
+		xfs_agblock_t agbno, xfs_extlen_t aglen,
+		struct xfs_defer_ops *dfops);
+STATIC int __xfs_refcount_cow_free(struct xfs_btree_cur *rcur,
+		xfs_agblock_t agbno, xfs_extlen_t aglen,
+		struct xfs_defer_ops *dfops);
+
 /*
  * Look up the first record less than or equal to [bno, len] in the btree
  * given by cur.
@@ -77,6 +87,17 @@ xfs_refcount_lookup_ge(
 	return xfs_btree_lookup(cur, XFS_LOOKUP_GE, stat);
 }
 
+/* Convert on-disk record to in-core format. */
+void
+xfs_refcount_btrec_to_irec(
+	union xfs_btree_rec		*rec,
+	struct xfs_refcount_irec	*irec)
+{
+	irec->rc_startblock = be32_to_cpu(rec->refc.rc_startblock);
+	irec->rc_blockcount = be32_to_cpu(rec->refc.rc_blockcount);
+	irec->rc_refcount = be32_to_cpu(rec->refc.rc_refcount);
+}
+
 /*
  * Get the data from the pointed-to record.
  */
@@ -86,14 +107,12 @@ xfs_refcount_get_rec(
 	struct xfs_refcount_irec	*irec,
 	int				*stat)
 {
-	union xfs_btree_rec	*rec;
-	int			error;
+	union xfs_btree_rec		*rec;
+	int				error;
 
 	error = xfs_btree_get_rec(cur, &rec, stat);
 	if (!error && *stat == 1) {
-		irec->rc_startblock = be32_to_cpu(rec->refc.rc_startblock);
-		irec->rc_blockcount = be32_to_cpu(rec->refc.rc_blockcount);
-		irec->rc_refcount = be32_to_cpu(rec->refc.rc_refcount);
+		xfs_refcount_btrec_to_irec(rec, irec);
 		trace_xfs_refcount_get(cur->bc_mp, cur->bc_private.a.agno,
 				irec);
 	}
@@ -485,6 +504,8 @@ xfs_refcount_merge_right_extent(
 	return error;
 }
 
+#define XFS_FIND_RCEXT_SHARED	1
+#define XFS_FIND_RCEXT_COW	2
 /*
  * Find the left extent and the one after it (cleft).  This function assumes
  * that we've already split any extent crossing agbno.
@@ -495,7 +516,8 @@ xfs_refcount_find_left_extents(
 	struct xfs_refcount_irec	*left,
 	struct xfs_refcount_irec	*cleft,
 	xfs_agblock_t			agbno,
-	xfs_extlen_t			aglen)
+	xfs_extlen_t			aglen,
+	int				flags)
 {
 	struct xfs_refcount_irec	tmp;
 	int				error;
@@ -515,6 +537,10 @@ xfs_refcount_find_left_extents(
 
 	if (RCNEXT(tmp) != agbno)
 		return 0;
+	if ((flags & XFS_FIND_RCEXT_SHARED) && tmp.rc_refcount < 2)
+		return 0;
+	if ((flags & XFS_FIND_RCEXT_COW) && tmp.rc_refcount > 1)
+		return 0;
 	/* We have a left extent; retrieve (or invent) the next right one */
 	*left = tmp;
 
@@ -574,7 +600,8 @@ xfs_refcount_find_right_extents(
 	struct xfs_refcount_irec	*right,
 	struct xfs_refcount_irec	*cright,
 	xfs_agblock_t			agbno,
-	xfs_extlen_t			aglen)
+	xfs_extlen_t			aglen,
+	int				flags)
 {
 	struct xfs_refcount_irec	tmp;
 	int				error;
@@ -594,6 +621,10 @@ xfs_refcount_find_right_extents(
 
 	if (tmp.rc_startblock != agbno + aglen)
 		return 0;
+	if ((flags & XFS_FIND_RCEXT_SHARED) && tmp.rc_refcount < 2)
+		return 0;
+	if ((flags & XFS_FIND_RCEXT_COW) && tmp.rc_refcount > 1)
+		return 0;
 	/* We have a right extent; retrieve (or invent) the next left one */
 	*right = tmp;
 
@@ -654,6 +685,7 @@ xfs_refcount_merge_extents(
 	xfs_agblock_t		*agbno,
 	xfs_extlen_t		*aglen,
 	enum xfs_refc_adjust_op adjust,
+	int			flags,
 	bool			*shape_changed)
 {
 	struct xfs_refcount_irec	left = {0}, cleft = {0};
@@ -669,11 +701,11 @@ xfs_refcount_merge_extents(
 	 * [right].
 	 */
 	error = xfs_refcount_find_left_extents(cur, &left, &cleft, *agbno,
-			*aglen);
+			*aglen, flags);
 	if (error)
 		return error;
 	error = xfs_refcount_find_right_extents(cur, &right, &cright, *agbno,
-			*aglen);
+			*aglen, flags);
 	if (error)
 		return error;
 
@@ -950,7 +982,7 @@ xfs_refcount_adjust(
 	 */
 	orig_aglen = aglen;
 	error = xfs_refcount_merge_extents(cur, &agbno, &aglen, adj,
-			&shape_changed);
+			XFS_FIND_RCEXT_SHARED, &shape_changed);
 	if (error)
 		goto out_error;
 	if (shape_changed)
@@ -1068,6 +1100,18 @@ xfs_refcount_finish_one(
 		error = xfs_refcount_adjust(rcur, bno, blockcount, adjusted,
 			XFS_REFCOUNT_ADJUST_DECREASE, dfops, NULL);
 		break;
+	case XFS_REFCOUNT_ALLOC_COW:
+		*adjusted = 0;
+		error = __xfs_refcount_cow_alloc(rcur, bno, blockcount, dfops);
+		if (!error)
+			*adjusted = blockcount;
+		break;
+	case XFS_REFCOUNT_FREE_COW:
+		*adjusted = 0;
+		error = __xfs_refcount_cow_free(rcur, bno, blockcount, dfops);
+		if (!error)
+			*adjusted = blockcount;
+		break;
 	default:
 		ASSERT(0);
 		error = -EFSCORRUPTED;
@@ -1242,3 +1286,275 @@ xfs_refcount_find_shared(
 				cur->bc_private.a.agno, error, _RET_IP_);
 	return error;
 }
+
+/*
+ * Recovering CoW Blocks After a Crash
+ *
+ * Due to the way that the copy on write mechanism works, there's a window of
+ * opportunity in which we can lose track of allocated blocks during a crash.
+ * Because CoW uses delayed allocation in the in-core CoW fork, writeback
+ * causes blocks to be allocated and stored in the CoW fork.  The blocks are
+ * no longer in the free space btree but are not otherwise recorded anywhere
+ * until the write completes and the blocks are mapped into the file.  A crash
+ * in between allocation and remapping results in the replacement blocks being
+ * lost.  This situation is exacerbated by the CoW extent size hint because
+ * allocations can hang around for a long time.
+ *
+ * However, there is a place where we can record these allocations before they
+ * become mappings -- the reference count btree.  The btree does not record
+ * extents with refcount == 1, so we can record allocations with a refcount of
+ * 1.  Blocks being used for CoW writeout cannot be shared, so there should be
+ * no conflict with shared block records.  These mappings should be created
+ * when we allocate blocks to the CoW fork and deleted when they're removed
+ * from the CoW fork.
+ *
+ * Minor nit: records for in-progress CoW allocations and records for shared
+ * extents must never be merged, to preserve the property that (except for CoW
+ * allocations) there are no refcount btree entries with refcount == 1.  The
+ * only time this could potentially happen is when unsharing a block that's
+ * adjacent to CoW allocations, so we must be careful to avoid this.
+ *
+ * At mount time we recover lost CoW allocations by searching the refcount
+ * btree for these refcount == 1 mappings.  These represent CoW allocations
+ * that were in progress at the time the filesystem went down, so we can free
+ * them to get the space back.
+ *
+ * This mechanism is superior to creating EFIs for unmapped CoW extents for
+ * several reasons -- first, EFIs pin the tail of the log and would have to be
+ * periodically relogged to avoid filling up the log.  Second, CoW completions
+ * will have to file an EFD and create new EFIs for whatever remains in the
+ * CoW fork; this partially takes care of (1) but extent-size reservations
+ * will have to periodically relog even if there's no writeout in progress.
+ * This can happen if the CoW extent size hint is set, which you really want.
+ * Third, EFIs cannot currently be automatically relogged into newer
+ * transactions to advance the log tail.  Fourth, stuffing the log full of
+ * EFIs places an upper bound on the number of CoW allocations that can be
+ * held filesystem-wide at any given time.  Recording them in the refcount
+ * btree doesn't require us to maintain any state in memory and doesn't pin
+ * the log.
+ */
+/*
+ * Adjust the refcounts of CoW allocations.  These allocations are "magic"
+ * in that they're not referenced anywhere else in the filesystem, so we
+ * stash them in the refcount btree with a refcount of 1 until either file
+ * remapping (or CoW cancellation) happens.
+ */
+STATIC int
+xfs_refcount_adjust_cow_extents(
+	struct xfs_btree_cur	*cur,
+	xfs_agblock_t		agbno,
+	xfs_extlen_t		aglen,
+	enum xfs_refc_adjust_op	adj,
+	struct xfs_defer_ops	*dfops,
+	struct xfs_owner_info	*oinfo)
+{
+	struct xfs_refcount_irec	ext, tmp;
+	int				error;
+	int				found_rec, found_tmp;
+
+	if (aglen == 0)
+		return 0;
+
+	/* Find any overlapping refcount records */
+	error = xfs_refcount_lookup_ge(cur, agbno, &found_rec);
+	if (error)
+		goto out_error;
+	error = xfs_refcount_get_rec(cur, &ext, &found_rec);
+	if (error)
+		goto out_error;
+	if (!found_rec) {
+		ext.rc_startblock = cur->bc_mp->m_sb.sb_agblocks;
+		ext.rc_blockcount = 0;
+		ext.rc_refcount = 0;
+	}
+
+	switch (adj) {
+	case XFS_REFCOUNT_ADJUST_COW_ALLOC:
+		/* Adding a CoW reservation, there should be nothing here. */
+		XFS_WANT_CORRUPTED_GOTO(cur->bc_mp,
+				ext.rc_startblock >= agbno + aglen, out_error);
+
+		tmp.rc_startblock = agbno;
+		tmp.rc_blockcount = aglen;
+		tmp.rc_refcount = 1;
+		trace_xfs_refcount_modify_extent(cur->bc_mp,
+				cur->bc_private.a.agno, &tmp);
+
+		error = xfs_refcount_insert(cur, &tmp,
+				&found_tmp);
+		if (error)
+			goto out_error;
+		XFS_WANT_CORRUPTED_GOTO(cur->bc_mp,
+				found_tmp == 1, out_error);
+		break;
+	case XFS_REFCOUNT_ADJUST_COW_FREE:
+		/* Removing a CoW reservation, there should be one extent. */
+		XFS_WANT_CORRUPTED_GOTO(cur->bc_mp,
+			ext.rc_startblock == agbno, out_error);
+		XFS_WANT_CORRUPTED_GOTO(cur->bc_mp,
+			ext.rc_blockcount == aglen, out_error);
+		XFS_WANT_CORRUPTED_GOTO(cur->bc_mp,
+			ext.rc_refcount == 1, out_error);
+
+		ext.rc_refcount = 0;
+		trace_xfs_refcount_modify_extent(cur->bc_mp,
+				cur->bc_private.a.agno, &ext);
+		error = xfs_refcount_delete(cur, &found_rec);
+		if (error)
+			goto out_error;
+		XFS_WANT_CORRUPTED_GOTO(cur->bc_mp,
+				found_rec == 1, out_error);
+		break;
+	default:
+		ASSERT(0);
+	}
+
+	return error;
+out_error:
+	trace_xfs_refcount_modify_extent_error(cur->bc_mp,
+			cur->bc_private.a.agno, error, _RET_IP_);
+	return error;
+}
+
+/*
+ * Add or remove refcount btree entries for CoW reservations.
+ */
+STATIC int
+xfs_refcount_adjust_cow(
+	struct xfs_btree_cur	*cur,
+	xfs_agblock_t		agbno,
+	xfs_extlen_t		aglen,
+	enum xfs_refc_adjust_op	adj,
+	struct xfs_defer_ops	*dfops)
+{
+	bool			shape_changed;
+	int			error;
+
+	/*
+	 * Ensure that no rcextents cross the boundary of the adjustment range.
+	 */
+	error = xfs_refcount_split_extent(cur, agbno, &shape_changed);
+	if (error)
+		goto out_error;
+
+	error = xfs_refcount_split_extent(cur, agbno + aglen, &shape_changed);
+	if (error)
+		goto out_error;
+
+	/*
+	 * Try to merge with the left or right extents of the range.
+	 */
+	error = xfs_refcount_merge_extents(cur, &agbno, &aglen, adj,
+			XFS_FIND_RCEXT_COW, &shape_changed);
+	if (error)
+		goto out_error;
+
+	/* Now that we've taken care of the ends, adjust the middle extents */
+	error = xfs_refcount_adjust_cow_extents(cur, agbno, aglen, adj,
+			dfops, NULL);
+	if (error)
+		goto out_error;
+
+	return 0;
+
+out_error:
+	trace_xfs_refcount_adjust_cow_error(cur->bc_mp, cur->bc_private.a.agno,
+			error, _RET_IP_);
+	return error;
+}
+
+/*
+ * Record a CoW allocation in the refcount btree.
+ */
+STATIC int
+__xfs_refcount_cow_alloc(
+	struct xfs_btree_cur	*rcur,
+	xfs_agblock_t		agbno,
+	xfs_extlen_t		aglen,
+	struct xfs_defer_ops	*dfops)
+{
+	int			error;
+
+	trace_xfs_refcount_cow_increase(rcur->bc_mp, rcur->bc_private.a.agno,
+			agbno, aglen);
+
+	/* Add refcount btree reservation */
+	error = xfs_refcount_adjust_cow(rcur, agbno, aglen,
+			XFS_REFCOUNT_ADJUST_COW_ALLOC, dfops);
+	if (error)
+		return error;
+
+	/* Add rmap entry */
+	if (xfs_sb_version_hasrmapbt(&rcur->bc_mp->m_sb)) {
+		error = xfs_rmap_alloc_extent(rcur->bc_mp, dfops,
+				rcur->bc_private.a.agno,
+				agbno, aglen, XFS_RMAP_OWN_COW);
+		if (error)
+			return error;
+	}
+
+	return error;
+}
+
+/*
+ * Remove a CoW allocation from the refcount btree.
+ */
+STATIC int
+__xfs_refcount_cow_free(
+	struct xfs_btree_cur	*rcur,
+	xfs_agblock_t		agbno,
+	xfs_extlen_t		aglen,
+	struct xfs_defer_ops	*dfops)
+{
+	int			error;
+
+	trace_xfs_refcount_cow_decrease(rcur->bc_mp, rcur->bc_private.a.agno,
+			agbno, aglen);
+
+	/* Remove refcount btree reservation */
+	error = xfs_refcount_adjust_cow(rcur, agbno, aglen,
+			XFS_REFCOUNT_ADJUST_COW_FREE, dfops);
+	if (error)
+		return error;
+
+	/* Remove rmap entry */
+	if (xfs_sb_version_hasrmapbt(&rcur->bc_mp->m_sb)) {
+		error = xfs_rmap_free_extent(rcur->bc_mp, dfops,
+				rcur->bc_private.a.agno,
+				agbno, aglen, XFS_RMAP_OWN_COW);
+		if (error)
+			return error;
+	}
+
+	return error;
+}
+
+/* Record a CoW staging extent in the refcount btree. */
+int
+xfs_refcount_alloc_cow_extent(
+	struct xfs_mount		*mp,
+	struct xfs_defer_ops		*dfops,
+	xfs_fsblock_t			fsb,
+	xfs_extlen_t			len)
+{
+	if (!xfs_sb_version_hasreflink(&mp->m_sb))
+		return 0;
+
+	return __xfs_refcount_add(mp, dfops, XFS_REFCOUNT_ALLOC_COW,
+			fsb, len);
+}
+
+/* Forget a CoW staging event in the refcount btree. */
+int
+xfs_refcount_free_cow_extent(
+	struct xfs_mount		*mp,
+	struct xfs_defer_ops		*dfops,
+	xfs_fsblock_t			fsb,
+	xfs_extlen_t			len)
+{
+	if (!xfs_sb_version_hasreflink(&mp->m_sb))
+		return 0;
+
+	return __xfs_refcount_add(mp, dfops, XFS_REFCOUNT_FREE_COW,
+			fsb, len);
+}
diff --git a/fs/xfs/libxfs/xfs_refcount.h b/fs/xfs/libxfs/xfs_refcount.h
index 48c576c..ddfcf65 100644
--- a/fs/xfs/libxfs/xfs_refcount.h
+++ b/fs/xfs/libxfs/xfs_refcount.h
@@ -24,6 +24,9 @@ extern int xfs_refcount_lookup_le(struct xfs_btree_cur *cur,
 		xfs_agblock_t bno, int *stat);
 extern int xfs_refcount_lookup_ge(struct xfs_btree_cur *cur,
 		xfs_agblock_t bno, int *stat);
+union xfs_btree_rec;
+extern void xfs_refcount_btrec_to_irec(union xfs_btree_rec *rec,
+		struct xfs_refcount_irec *irec);
 extern int xfs_refcount_get_rec(struct xfs_btree_cur *cur,
 		struct xfs_refcount_irec *irec, int *stat);
 
@@ -58,4 +61,11 @@ extern int xfs_refcount_find_shared(struct xfs_btree_cur *cur,
 		xfs_agblock_t agbno, xfs_extlen_t aglen, xfs_agblock_t *fbno,
 		xfs_extlen_t *flen, bool find_maximal);
 
+extern int xfs_refcount_alloc_cow_extent(struct xfs_mount *mp,
+		struct xfs_defer_ops *dfops, xfs_fsblock_t fsb,
+		xfs_extlen_t len);
+extern int xfs_refcount_free_cow_extent(struct xfs_mount *mp,
+		struct xfs_defer_ops *dfops, xfs_fsblock_t fsb,
+		xfs_extlen_t len);
+
 #endif	/* __XFS_REFCOUNT_H__ */
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 3f64615..caecbd2 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -44,6 +44,7 @@
 #include "xfs_sysfs.h"
 #include "xfs_rmap_btree.h"
 #include "xfs_refcount_btree.h"
+#include "xfs_reflink.h"
 
 
 static DEFINE_MUTEX(xfs_uuid_table_mutex);
@@ -976,10 +977,21 @@ xfs_mountfs(
 		if (error)
 			xfs_warn(mp,
 	"Unable to allocate reserve blocks. Continuing without reserve pool.");
+
+		/* Recover any CoW blocks that never got remapped. */
+		error = xfs_reflink_recover_cow(mp);
+		if (error) {
+			xfs_err(mp,
+	"Error %d recovering leftover CoW allocations.", error);
+			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
+			goto out_quota;
+		}
 	}
 
 	return 0;
 
+ out_quota:
+	xfs_qm_unmount_quotas(mp);
  out_rtunmount:
 	xfs_rtunmount_inodes(mp);
  out_rele_rip:
diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
index e44007a..34be3e8 100644
--- a/fs/xfs/xfs_refcount_item.c
+++ b/fs/xfs/xfs_refcount_item.c
@@ -489,6 +489,18 @@ xfs_cui_recover(
 				error = xfs_refcount_decrease_extent(
 						tp->t_mountp, &dfops, &irec);
 				break;
+			case XFS_REFCOUNT_ALLOC_COW:
+				error = xfs_refcount_alloc_cow_extent(
+						tp->t_mountp, &dfops,
+						irec.br_startblock,
+						irec.br_blockcount);
+				break;
+			case XFS_REFCOUNT_FREE_COW:
+				error = xfs_refcount_free_cow_extent(
+						tp->t_mountp, &dfops,
+						irec.br_startblock,
+						irec.br_blockcount);
+				break;
 			default:
 				ASSERT(0);
 			}
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index c95cdc3..673ecc1 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -40,6 +40,7 @@
 #include "xfs_log.h"
 #include "xfs_icache.h"
 #include "xfs_pnfs.h"
+#include "xfs_btree.h"
 #include "xfs_refcount_btree.h"
 #include "xfs_refcount.h"
 #include "xfs_bmap_btree.h"
@@ -582,6 +583,13 @@ xfs_reflink_cancel_cow_blocks(
 			xfs_trans_ijoin(*tpp, ip, 0);
 			xfs_defer_init(&dfops, &firstfsb);
 
+			/* Free the CoW orphan record. */
+			error = xfs_refcount_free_cow_extent(ip->i_mount,
+					&dfops, irec.br_startblock,
+					irec.br_blockcount);
+			if (error)
+				break;
+
 			xfs_bmap_add_free(ip->i_mount, &dfops,
 					irec.br_startblock, irec.br_blockcount,
 					NULL);
@@ -735,6 +743,13 @@ xfs_reflink_end_cow(
 			irec.br_blockcount = rlen;
 			trace_xfs_reflink_cow_remap_piece(ip, &uirec);
 
+			/* Free the CoW orphan record. */
+			error = xfs_refcount_free_cow_extent(tp->t_mountp,
+					&dfops, uirec.br_startblock,
+					uirec.br_blockcount);
+			if (error)
+				goto out_defer;
+
 			/* Map the new blocks into the data fork. */
 			error = xfs_bmap_map_extent(tp->t_mountp, &dfops,
 					ip, XFS_DATA_FORK, &uirec);
@@ -772,3 +787,138 @@ xfs_reflink_end_cow(
 	trace_xfs_reflink_end_cow_error(ip, error, _RET_IP_);
 	return error;
 }
+
+struct xfs_reflink_recovery {
+	struct list_head		rr_list;
+	struct xfs_refcount_irec	rr_rrec;
+};
+
+/* Stuff an extent on the recovery list. */
+STATIC int
+xfs_reflink_recover_extent(
+	struct xfs_btree_cur		*cur,
+	union xfs_btree_rec		*rec,
+	void				*priv)
+{
+	struct list_head		*debris = priv;
+	struct xfs_reflink_recovery	*rr;
+
+	if (be32_to_cpu(rec->refc.rc_refcount) != 1)
+		return 0;
+
+	rr = kmem_alloc(sizeof(struct xfs_reflink_recovery), KM_SLEEP);
+	xfs_refcount_btrec_to_irec(rec, &rr->rr_rrec);
+	list_add_tail(&rr->rr_list, debris);
+
+	return 0;
+}
+
+/*
+ * Find and remove leftover CoW reservations.
+ */
+STATIC int
+xfs_reflink_recover_cow_ag(
+	struct xfs_mount		*mp,
+	xfs_agnumber_t			agno)
+{
+	struct xfs_trans		*tp;
+	struct xfs_btree_cur		*cur;
+	struct xfs_buf			*agbp;
+	struct xfs_reflink_recovery	*rr, *n;
+	struct list_head		debris;
+	union xfs_btree_irec		low;
+	union xfs_btree_irec		high;
+	struct xfs_defer_ops		dfops;
+	xfs_fsblock_t			fsb;
+	int				error;
+
+	error = xfs_alloc_read_agf(mp, NULL, agno, 0, &agbp);
+	if (error)
+		return error;
+	cur = xfs_refcountbt_init_cursor(mp, NULL, agbp, agno, NULL);
+
+	/* Find all the leftover CoW staging extents. */
+	INIT_LIST_HEAD(&debris);
+	memset(&low, 0, sizeof(low));
+	memset(&high, 0, sizeof(high));
+	low.rc.rc_startblock = 0;
+	high.rc.rc_startblock = -1U;
+	error = xfs_btree_query_range(cur, &low, &high,
+			xfs_reflink_recover_extent, &debris);
+	if (error)
+		goto out_error;
+	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+	xfs_buf_relse(agbp);
+
+	/* Now iterate the list to free the leftovers */
+	list_for_each_entry(rr, &debris, rr_list) {
+		/* Set up transaction. */
+		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, 0, &tp);
+		if (error)
+			goto out_free;
+
+		trace_xfs_reflink_recover_extent(mp, agno, &rr->rr_rrec);
+
+		/* Free the orphan record */
+		xfs_defer_init(&dfops, &fsb);
+		fsb = XFS_AGB_TO_FSB(mp, agno, rr->rr_rrec.rc_startblock);
+		error = xfs_refcount_free_cow_extent(mp, &dfops, fsb,
+				rr->rr_rrec.rc_blockcount);
+		if (error)
+			goto out_defer;
+
+		/* Free the block. */
+		xfs_bmap_add_free(mp, &dfops, fsb,
+				rr->rr_rrec.rc_blockcount, NULL);
+
+		error = xfs_defer_finish(&tp, &dfops, NULL);
+		if (error)
+			goto out_defer;
+
+		error = xfs_trans_commit(tp);
+		if (error)
+			goto out_cancel;
+	}
+	goto out_free;
+
+out_defer:
+	xfs_defer_cancel(&dfops);
+out_cancel:
+	xfs_trans_cancel(tp);
+
+out_free:
+	/* Free the leftover list */
+	list_for_each_entry_safe(rr, n, &debris, rr_list) {
+		list_del(&rr->rr_list);
+		kmem_free(rr);
+	}
+
+	return error;
+
+out_error:
+	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+	xfs_buf_relse(agbp);
+	return error;
+}
+
+/*
+ * Free leftover CoW reservations that didn't get cleaned out.
+ */
+int
+xfs_reflink_recover_cow(
+	struct xfs_mount	*mp)
+{
+	xfs_agnumber_t		agno;
+	int			error = 0;
+
+	if (!xfs_sb_version_hasreflink(&mp->m_sb))
+		return 0;
+
+	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
+		error = xfs_reflink_recover_cow_ag(mp, agno);
+		if (error)
+			break;
+	}
+
+	return error;
+}
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index c0c989a..1d2f180 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -42,5 +42,6 @@ extern int xfs_reflink_cancel_cow_range(struct xfs_inode *ip, xfs_off_t offset,
 		xfs_off_t count);
 extern int xfs_reflink_end_cow(struct xfs_inode *ip, xfs_off_t offset,
 		xfs_off_t count);
+extern int xfs_reflink_recover_cow(struct xfs_mount *mp);
 
 #endif /* __XFS_REFLINK_H */
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 26b45b3..e6aaa91 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1306,6 +1306,15 @@ xfs_fs_remount(
 		xfs_restore_resvblks(mp);
 		xfs_log_work_queue(mp);
 		xfs_queue_eofblocks(mp);
+
+		/* Recover any CoW blocks that never got remapped. */
+		error = xfs_reflink_recover_cow(mp);
+		if (error) {
+			xfs_err(mp,
+	"Error %d recovering leftover CoW allocations.", error);
+			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
+			return error;
+		}
 	}
 
 	/* rw -> ro */
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 8e89223..ca0930b 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -2916,14 +2916,18 @@ DEFINE_AG_ERROR_EVENT(xfs_refcount_update_error);
 /* refcount adjustment tracepoints */
 DEFINE_AG_EXTENT_EVENT(xfs_refcount_increase);
 DEFINE_AG_EXTENT_EVENT(xfs_refcount_decrease);
+DEFINE_AG_EXTENT_EVENT(xfs_refcount_cow_increase);
+DEFINE_AG_EXTENT_EVENT(xfs_refcount_cow_decrease);
 DEFINE_REFCOUNT_TRIPLE_EXTENT_EVENT(xfs_refcount_merge_center_extents);
 DEFINE_REFCOUNT_EXTENT_EVENT(xfs_refcount_modify_extent);
+DEFINE_REFCOUNT_EXTENT_EVENT(xfs_reflink_recover_extent);
 DEFINE_REFCOUNT_EXTENT_AT_EVENT(xfs_refcount_split_extent);
 DEFINE_REFCOUNT_DOUBLE_EXTENT_EVENT(xfs_refcount_merge_left_extent);
 DEFINE_REFCOUNT_DOUBLE_EXTENT_EVENT(xfs_refcount_merge_right_extent);
 DEFINE_REFCOUNT_DOUBLE_EXTENT_AT_EVENT(xfs_refcount_find_left_extent);
 DEFINE_REFCOUNT_DOUBLE_EXTENT_AT_EVENT(xfs_refcount_find_right_extent);
 DEFINE_AG_ERROR_EVENT(xfs_refcount_adjust_error);
+DEFINE_AG_ERROR_EVENT(xfs_refcount_adjust_cow_error);
 DEFINE_AG_ERROR_EVENT(xfs_refcount_merge_center_extents_error);
 DEFINE_AG_ERROR_EVENT(xfs_refcount_modify_extent_error);
 DEFINE_AG_ERROR_EVENT(xfs_refcount_split_extent_error);


^ permalink raw reply related	[flat|nested] 187+ messages in thread

* [PATCH 41/63] xfs: reflink extents from one file to another
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (39 preceding siblings ...)
  2016-09-30  3:09 ` [PATCH 40/63] xfs: store in-progress CoW allocations in the refcount btree Darrick J. Wong
@ 2016-09-30  3:10 ` Darrick J. Wong
  2016-09-30  7:50   ` Christoph Hellwig
  2016-10-07 18:04   ` Brian Foster
  2016-09-30  3:10 ` [PATCH 42/63] xfs: add clone file and clone range vfs functions Darrick J. Wong
                   ` (22 subsequent siblings)
  63 siblings, 2 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:10 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel

Reflink extents from one file to another; that is to say, iteratively
remove the mappings from the destination file, copy the mappings from
the source file to the destination file, and increment the reference
count of all the blocks that got remapped.
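To make the iteration concrete, here is a small standalone C model of the per-step advance (the struct and function names are mine, not kernel code); it mirrors the drange/srange arithmetic described in the algorithm comment this patch adds to xfs_reflink.c:

```c
#include <assert.h>

/* Toy logical extent: file offset and length, in filesystem blocks. */
struct ext { long long off, len; };

/*
 * One iteration of the remap loop: clamp the source mapping ("imap") to
 * the remaining source range (srcoff, srcoff + len) and return how many
 * blocks both file ranges advance by, i.e.
 * (imap start - srange start + imap len) after clamping.  If no mapping
 * covers the range, the caller passes a zero-length extent placed at
 * srcoff + len, and the step consumes the whole remainder.
 */
static long long remap_step(struct ext imap, long long srcoff, long long len)
{
	/* If imap starts before srange, advance imap to start at srange. */
	if (imap.off < srcoff) {
		imap.len -= srcoff - imap.off;
		imap.off = srcoff;
	}
	/* If imap goes beyond srange, truncate imap to end at srange. */
	if (imap.off + imap.len > srcoff + len)
		imap.len = srcoff + len - imap.off;
	return imap.off + imap.len - srcoff;
}
```

Note that the step length covers both the leading hole (punched from the destination) and the mapping itself, which is why unmapping can cover more destination blocks than remapping does.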

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
v2: Call xfs_defer_cancel before cancelling the transaction if the
remap operation fails.  Use the deferred operations system to avoid
deadlocks or blowing out the transaction reservation, and make the
entire reflink operation atomic for each extent being remapped.  The
destination file's i_size will be updated if necessary to avoid
violating the assumption that there are no shared blocks past the EOF
block.
---
 fs/xfs/xfs_reflink.c |  425 ++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_reflink.h |    2 
 2 files changed, 427 insertions(+)


diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 673ecc1..94c19fff 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -922,3 +922,428 @@ xfs_reflink_recover_cow(
 
 	return error;
 }
+
+/*
+ * Reflinking (Block) Ranges of Two Files Together
+ *
+ * First, ensure that the reflink flag is set on both inodes.  The flag is an
+ * optimization to avoid unnecessary refcount btree lookups in the write path.
+ *
+ * Now we can iteratively remap the range of extents (and holes) in src to the
+ * corresponding ranges in dest.  Let drange and srange denote the ranges of
+ * logical blocks in dest and src touched by the reflink operation.
+ *
+ * While the length of drange is greater than zero,
+ *    - Read src's bmbt at the start of srange ("imap")
+ *    - If imap doesn't exist, make imap appear to start at the end of srange
+ *      with zero length.
+ *    - If imap starts before srange, advance imap to start at srange.
+ *    - If imap goes beyond srange, truncate imap to end at the end of srange.
+ *    - Punch (imap start - srange start + imap len) blocks from dest at
+ *      offset (drange start).
+ *    - If imap points to a real range of pblks,
+ *         > Increase the refcount of the imap's pblks
+ *         > Map imap's pblks into dest at the offset
+ *           (drange start + imap start - srange start)
+ *    - Advance drange and srange by (imap start - srange start + imap len)
+ *
+ * Finally, if the reflink made dest longer, update both the in-core and
+ * on-disk file sizes.
+ *
+ * ASCII Art Demonstration:
+ *
+ * Let's say we want to reflink this source file:
+ *
+ * ----SSSSSSS-SSSSS----SSSSSS (src file)
+ *   <-------------------->
+ *
+ * into this destination file:
+ *
+ * --DDDDDDDDDDDDDDDDDDD--DDD (dest file)
+ *        <-------------------->
+ * '-' means a hole, and 'S' and 'D' are written blocks in the src and dest.
+ * Observe that the range has different logical offsets in either file.
+ *
+ * Consider that the first extent in the source file doesn't line up with our
+ * reflink range.  Unmapping and remapping are separate operations, so we can
+ * unmap more blocks from the destination file than we remap.
+ *
+ * ----SSSSSSS-SSSSS----SSSSSS
+ *   <------->
+ * --DDDDD---------DDDDD--DDD
+ *        <------->
+ *
+ * Now remap the source extent into the destination file:
+ *
+ * ----SSSSSSS-SSSSS----SSSSSS
+ *   <------->
+ * --DDDDD--SSSSSSSDDDDD--DDD
+ *        <------->
+ *
+ * Do likewise with the second hole and extent in our range.  Holes in the
+ * unmap range don't affect our operation.
+ *
+ * ----SSSSSSS-SSSSS----SSSSSS
+ *            <---->
+ * --DDDDD--SSSSSSS-SSSSS-DDD
+ *                 <---->
+ *
+ * Finally, unmap and remap part of the third extent.  This will increase the
+ * size of the destination file.
+ *
+ * ----SSSSSSS-SSSSS----SSSSSS
+ *                  <----->
+ * --DDDDD--SSSSSSS-SSSSS----SSS
+ *                       <----->
+ *
+ * Once we update the destination file's i_size, we're done.
+ */
+
+/*
+ * Ensure the reflink bit is set in both inodes.
+ */
+STATIC int
+xfs_reflink_set_inode_flag(
+	struct xfs_inode	*src,
+	struct xfs_inode	*dest)
+{
+	struct xfs_mount	*mp = src->i_mount;
+	int			error;
+	struct xfs_trans	*tp;
+
+	if (xfs_is_reflink_inode(src) && xfs_is_reflink_inode(dest))
+		return 0;
+
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_ichange, 0, 0, 0, &tp);
+	if (error)
+		goto out_error;
+
+	/* Lock both files against IO */
+	if (src->i_ino == dest->i_ino)
+		xfs_ilock(src, XFS_ILOCK_EXCL);
+	else
+		xfs_lock_two_inodes(src, dest, XFS_ILOCK_EXCL);
+
+	if (!xfs_is_reflink_inode(src)) {
+		trace_xfs_reflink_set_inode_flag(src);
+		xfs_trans_ijoin(tp, src, XFS_ILOCK_EXCL);
+		src->i_d.di_flags2 |= XFS_DIFLAG2_REFLINK;
+		xfs_trans_log_inode(tp, src, XFS_ILOG_CORE);
+		xfs_ifork_init_cow(src);
+	} else
+		xfs_iunlock(src, XFS_ILOCK_EXCL);
+
+	if (src->i_ino == dest->i_ino)
+		goto commit_flags;
+
+	if (!xfs_is_reflink_inode(dest)) {
+		trace_xfs_reflink_set_inode_flag(dest);
+		xfs_trans_ijoin(tp, dest, XFS_ILOCK_EXCL);
+		dest->i_d.di_flags2 |= XFS_DIFLAG2_REFLINK;
+		xfs_trans_log_inode(tp, dest, XFS_ILOG_CORE);
+		xfs_ifork_init_cow(dest);
+	} else
+		xfs_iunlock(dest, XFS_ILOCK_EXCL);
+
+commit_flags:
+	error = xfs_trans_commit(tp);
+	if (error)
+		goto out_error;
+	return error;
+
+out_error:
+	trace_xfs_reflink_set_inode_flag_error(dest, error, _RET_IP_);
+	return error;
+}
+
+/*
+ * Update destination inode size, if necessary.
+ */
+STATIC int
+xfs_reflink_update_dest(
+	struct xfs_inode	*dest,
+	xfs_off_t		newlen)
+{
+	struct xfs_mount	*mp = dest->i_mount;
+	struct xfs_trans	*tp;
+	int			error;
+
+	if (newlen <= i_size_read(VFS_I(dest)))
+		return 0;
+
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_ichange, 0, 0, 0, &tp);
+	if (error)
+		goto out_error;
+
+	xfs_ilock(dest, XFS_ILOCK_EXCL);
+	xfs_trans_ijoin(tp, dest, XFS_ILOCK_EXCL);
+
+	trace_xfs_reflink_update_inode_size(dest, newlen);
+	i_size_write(VFS_I(dest), newlen);
+	dest->i_d.di_size = newlen;
+	xfs_trans_log_inode(tp, dest, XFS_ILOG_CORE);
+
+	error = xfs_trans_commit(tp);
+	if (error)
+		goto out_error;
+	return error;
+
+out_error:
+	trace_xfs_reflink_update_inode_size_error(dest, error, _RET_IP_);
+	return error;
+}
+
+/*
+ * Unmap a range of blocks from a file, then map other blocks into the hole.
+ * The range to unmap is (destoff : destoff + srcioff + irec->br_blockcount).
+ * The extent irec is mapped into dest at irec->br_startoff.
+ */
+STATIC int
+xfs_reflink_remap_extent(
+	struct xfs_inode	*ip,
+	struct xfs_bmbt_irec	*irec,
+	xfs_fileoff_t		destoff,
+	xfs_off_t		new_isize)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_trans	*tp;
+	xfs_fsblock_t		firstfsb;
+	unsigned int		resblks;
+	struct xfs_defer_ops	dfops;
+	struct xfs_bmbt_irec	uirec;
+	bool			real_extent;
+	xfs_filblks_t		rlen;
+	xfs_filblks_t		unmap_len;
+	xfs_off_t		newlen;
+	int			error;
+
+	unmap_len = irec->br_startoff + irec->br_blockcount - destoff;
+	trace_xfs_reflink_punch_range(ip, destoff, unmap_len);
+
+	/* Only remap normal extents. */
+	real_extent =  (irec->br_startblock != HOLESTARTBLOCK &&
+			irec->br_startblock != DELAYSTARTBLOCK &&
+			!ISUNWRITTEN(irec));
+
+	/* Start a rolling transaction to switch the mappings */
+	resblks = XFS_EXTENTADD_SPACE_RES(ip->i_mount, XFS_DATA_FORK);
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0, &tp);
+	if (error)
+		goto out;
+
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	xfs_trans_ijoin(tp, ip, 0);
+
+	/* If we're not just clearing space, then do we have enough quota? */
+	if (real_extent) {
+		error = xfs_trans_reserve_quota_nblks(tp, ip,
+				irec->br_blockcount, 0, XFS_QMOPT_RES_REGBLKS);
+		if (error)
+			goto out_cancel;
+	}
+
+	trace_xfs_reflink_remap(ip, irec->br_startoff,
+				irec->br_blockcount, irec->br_startblock);
+
+	/* Unmap the old blocks in the data fork. */
+	rlen = unmap_len;
+	while (rlen) {
+		xfs_defer_init(&dfops, &firstfsb);
+		error = __xfs_bunmapi(tp, ip, destoff, &rlen, 0, 1,
+				&firstfsb, &dfops);
+		if (error)
+			goto out_defer;
+
+		/* Trim the extent to whatever got unmapped. */
+		uirec = *irec;
+		xfs_trim_extent(&uirec, destoff + rlen, unmap_len - rlen);
+		unmap_len = rlen;
+
+		/* If this isn't a real mapping, we're done. */
+		if (!real_extent || uirec.br_blockcount == 0)
+			goto next_extent;
+
+		trace_xfs_reflink_remap(ip, uirec.br_startoff,
+				uirec.br_blockcount, uirec.br_startblock);
+
+		/* Update the refcount tree */
+		error = xfs_refcount_increase_extent(mp, &dfops, &uirec);
+		if (error)
+			goto out_defer;
+
+		/* Map the new blocks into the data fork. */
+		error = xfs_bmap_map_extent(mp, &dfops, ip, XFS_DATA_FORK,
+				&uirec);
+		if (error)
+			goto out_defer;
+
+		/* Update quota accounting. */
+		xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT,
+				uirec.br_blockcount);
+
+		/* Update dest isize if needed. */
+		newlen = XFS_FSB_TO_B(mp,
+				uirec.br_startoff + uirec.br_blockcount);
+		newlen = min_t(xfs_off_t, newlen, new_isize);
+		if (newlen > i_size_read(VFS_I(ip))) {
+			trace_xfs_reflink_update_inode_size(ip, newlen);
+			i_size_write(VFS_I(ip), newlen);
+			ip->i_d.di_size = newlen;
+			xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+		}
+
+next_extent:
+		/* Process all the deferred stuff. */
+		error = xfs_defer_finish(&tp, &dfops, ip);
+		if (error)
+			goto out_defer;
+	}
+
+	error = xfs_trans_commit(tp);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	if (error)
+		goto out;
+	return 0;
+
+out_defer:
+	xfs_defer_cancel(&dfops);
+out_cancel:
+	xfs_trans_cancel(tp);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+out:
+	trace_xfs_reflink_remap_extent_error(ip, error, _RET_IP_);
+	return error;
+}
+
+/*
+ * Iteratively remap one file's extents (and holes) to another's.
+ */
+STATIC int
+xfs_reflink_remap_blocks(
+	struct xfs_inode	*src,
+	xfs_fileoff_t		srcoff,
+	struct xfs_inode	*dest,
+	xfs_fileoff_t		destoff,
+	xfs_filblks_t		len,
+	xfs_off_t		new_isize)
+{
+	struct xfs_bmbt_irec	imap;
+	int			nimaps;
+	int			error = 0;
+	xfs_filblks_t		range_len;
+
+	/* drange = (destoff, destoff + len); srange = (srcoff, srcoff + len) */
+	while (len) {
+		trace_xfs_reflink_remap_blocks_loop(src, srcoff, len,
+				dest, destoff);
+		/* Read extent from the source file */
+		nimaps = 1;
+		xfs_ilock(src, XFS_ILOCK_EXCL);
+		error = xfs_bmapi_read(src, srcoff, len, &imap, &nimaps, 0);
+		xfs_iunlock(src, XFS_ILOCK_EXCL);
+		if (error)
+			goto err;
+		ASSERT(nimaps == 1);
+
+		trace_xfs_reflink_remap_imap(src, srcoff, len, XFS_IO_OVERWRITE,
+				&imap);
+
+		/* Translate imap into the destination file. */
+		range_len = imap.br_startoff + imap.br_blockcount - srcoff;
+		imap.br_startoff += destoff - srcoff;
+
+		/* Clear dest from destoff to the end of imap and map it in. */
+		error = xfs_reflink_remap_extent(dest, &imap, destoff,
+				new_isize);
+		if (error)
+			goto err;
+
+		if (fatal_signal_pending(current)) {
+			error = -EINTR;
+			goto err;
+		}
+
+		/* Advance drange/srange */
+		srcoff += range_len;
+		destoff += range_len;
+		len -= range_len;
+	}
+
+	return 0;
+
+err:
+	trace_xfs_reflink_remap_blocks_error(dest, error, _RET_IP_);
+	return error;
+}
+
+/*
+ * Link a range of blocks from one file to another.
+ */
+int
+xfs_reflink_remap_range(
+	struct xfs_inode	*src,
+	xfs_off_t		srcoff,
+	struct xfs_inode	*dest,
+	xfs_off_t		destoff,
+	xfs_off_t		len)
+{
+	struct xfs_mount	*mp = src->i_mount;
+	xfs_fileoff_t		sfsbno, dfsbno;
+	xfs_filblks_t		fsblen;
+	int			error;
+
+	if (!xfs_sb_version_hasreflink(&mp->m_sb))
+		return -EOPNOTSUPP;
+
+	if (XFS_FORCED_SHUTDOWN(mp))
+		return -EIO;
+
+	/* Don't reflink realtime inodes */
+	if (XFS_IS_REALTIME_INODE(src) || XFS_IS_REALTIME_INODE(dest))
+		return -EINVAL;
+
+	trace_xfs_reflink_remap_range(src, srcoff, len, dest, destoff);
+
+	/* Lock both files against IO */
+	if (src->i_ino == dest->i_ino) {
+		xfs_ilock(src, XFS_IOLOCK_EXCL);
+		xfs_ilock(src, XFS_MMAPLOCK_EXCL);
+	} else {
+		xfs_lock_two_inodes(src, dest, XFS_IOLOCK_EXCL);
+		xfs_lock_two_inodes(src, dest, XFS_MMAPLOCK_EXCL);
+	}
+
+	error = xfs_reflink_set_inode_flag(src, dest);
+	if (error)
+		goto out_error;
+
+	/*
+	 * Invalidate the page cache so that we can clear any CoW mappings
+	 * in the destination file.
+	 */
+	truncate_inode_pages_range(&VFS_I(dest)->i_data, destoff,
+				   PAGE_ALIGN(destoff + len) - 1);
+
+	dfsbno = XFS_B_TO_FSBT(mp, destoff);
+	sfsbno = XFS_B_TO_FSBT(mp, srcoff);
+	fsblen = XFS_B_TO_FSB(mp, len);
+	error = xfs_reflink_remap_blocks(src, sfsbno, dest, dfsbno, fsblen,
+			destoff + len);
+	if (error)
+		goto out_error;
+
+	error = xfs_reflink_update_dest(dest, destoff + len);
+	if (error)
+		goto out_error;
+
+out_error:
+	xfs_iunlock(src, XFS_MMAPLOCK_EXCL);
+	xfs_iunlock(src, XFS_IOLOCK_EXCL);
+	if (src->i_ino != dest->i_ino) {
+		xfs_iunlock(dest, XFS_MMAPLOCK_EXCL);
+		xfs_iunlock(dest, XFS_IOLOCK_EXCL);
+	}
+	if (error)
+		trace_xfs_reflink_remap_range_error(dest, error, _RET_IP_);
+	return error;
+}
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index 1d2f180..c35ce29 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -43,5 +43,7 @@ extern int xfs_reflink_cancel_cow_range(struct xfs_inode *ip, xfs_off_t offset,
 extern int xfs_reflink_end_cow(struct xfs_inode *ip, xfs_off_t offset,
 		xfs_off_t count);
 extern int xfs_reflink_recover_cow(struct xfs_mount *mp);
+extern int xfs_reflink_remap_range(struct xfs_inode *src, xfs_off_t srcoff,
+		struct xfs_inode *dest, xfs_off_t destoff, xfs_off_t len);
 
 #endif /* __XFS_REFLINK_H */



* [PATCH 42/63] xfs: add clone file and clone range vfs functions
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (40 preceding siblings ...)
  2016-09-30  3:10 ` [PATCH 41/63] xfs: reflink extents from one file to another Darrick J. Wong
@ 2016-09-30  3:10 ` Darrick J. Wong
  2016-09-30  7:51   ` Christoph Hellwig
  2016-10-07 18:04   ` Brian Foster
  2016-09-30  3:10 ` [PATCH 43/63] xfs: add dedupe range vfs function Darrick J. Wong
                   ` (21 subsequent siblings)
  63 siblings, 2 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:10 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel

Define two VFS functions which allow userspace to reflink a range of
blocks between two files or to reflink one file's contents to another.
These functions fit the new VFS ioctls that standardize the checking
for the btrfs CLONE and CLONE RANGE ioctls.
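The validation in xfs_file_share_range() only permits block-aligned ranges, with one relaxation: a range ending exactly at EOF is rounded up to the block boundary so the unaligned tail of a file can still be cloned. A hedged, self-contained C sketch of just that rule (helper names and the simplified signature are mine):

```c
#include <errno.h>
#include <assert.h>

static int is_aligned(unsigned long long x, unsigned long long bs)
{
	return (x & (bs - 1)) == 0;	/* bs must be a power of two */
}

static unsigned long long align_up(unsigned long long x, unsigned long long bs)
{
	return (x + bs - 1) & ~(bs - 1);
}

/*
 * Sketch of the clone-range alignment check: reject ranges past EOF,
 * extend a range that ends at EOF to the containing block boundary,
 * then require both offsets and the (possibly extended) length to be
 * block aligned.  Returns 0 or a negative errno, kernel style.
 */
static int clone_range_check(unsigned long long pos_in,
			     unsigned long long pos_out,
			     unsigned long long len,
			     unsigned long long isize,
			     unsigned long long bs)
{
	unsigned long long blen;

	if (pos_in + len > isize)
		return -EINVAL;
	/* If we're linking to EOF, continue to the block boundary. */
	if (pos_in + len == isize)
		blen = align_up(isize, bs) - pos_in;
	else
		blen = len;
	if (!is_aligned(pos_in, bs) || !is_aligned(pos_in + blen, bs) ||
	    !is_aligned(pos_out, bs) || !is_aligned(pos_out + blen, bs))
		return -EINVAL;
	return 0;
}
```

The EOF case is what lets a 5000-byte file be cloned in full on a 4096-byte-block filesystem even though 5000 is not block aligned.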

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
v2: Plug into the VFS function pointers instead of handling ioctls
directly.
---
 fs/xfs/xfs_file.c |  142 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 142 insertions(+)


diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 025d52f..3db3f34 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -974,6 +974,146 @@ xfs_file_fallocate(
 	return error;
 }
 
+/*
+ * Flush all file writes out to disk.
+ */
+static int
+xfs_file_wait_for_io(
+	struct inode	*inode,
+	loff_t		offset,
+	size_t		len)
+{
+	loff_t		rounding;
+	loff_t		ioffset;
+	loff_t		iendoffset;
+	loff_t		bs;
+	int		ret;
+
+	bs = inode->i_sb->s_blocksize;
+	inode_dio_wait(inode);
+
+	rounding = max_t(xfs_off_t, bs, PAGE_SIZE);
+	ioffset = round_down(offset, rounding);
+	iendoffset = round_up(offset + len, rounding) - 1;
+	ret = filemap_write_and_wait_range(inode->i_mapping, ioffset,
+					   iendoffset);
+	return ret;
+}
+
+/* Hook up to the VFS reflink function */
+STATIC int
+xfs_file_share_range(
+	struct file	*file_in,
+	loff_t		pos_in,
+	struct file	*file_out,
+	loff_t		pos_out,
+	u64		len)
+{
+	struct inode	*inode_in;
+	struct inode	*inode_out;
+	ssize_t		ret;
+	loff_t		bs;
+	loff_t		isize;
+	int		same_inode;
+	loff_t		blen;
+
+	inode_in = file_inode(file_in);
+	inode_out = file_inode(file_out);
+	bs = inode_out->i_sb->s_blocksize;
+
+	/* Don't touch certain kinds of inodes */
+	if (IS_IMMUTABLE(inode_out))
+		return -EPERM;
+	if (IS_SWAPFILE(inode_in) ||
+	    IS_SWAPFILE(inode_out))
+		return -ETXTBSY;
+
+	/* Reflink only works within this filesystem. */
+	if (inode_in->i_sb != inode_out->i_sb)
+		return -EXDEV;
+	same_inode = (inode_in->i_ino == inode_out->i_ino);
+
+	/* Don't reflink dirs, pipes, sockets... */
+	if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
+		return -EISDIR;
+	if (S_ISFIFO(inode_in->i_mode) || S_ISFIFO(inode_out->i_mode))
+		return -EINVAL;
+	if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
+		return -EINVAL;
+
+	/* Are we going all the way to the end? */
+	isize = i_size_read(inode_in);
+	if (isize == 0)
+		return 0;
+	if (len == 0)
+		len = isize - pos_in;
+
+	/* Ensure offsets don't wrap and the input is inside i_size */
+	if (pos_in + len < pos_in || pos_out + len < pos_out ||
+	    pos_in + len > isize)
+		return -EINVAL;
+
+	/* If we're linking to EOF, continue to the block boundary. */
+	if (pos_in + len == isize)
+		blen = ALIGN(isize, bs) - pos_in;
+	else
+		blen = len;
+
+	/* Only reflink if we're aligned to block boundaries */
+	if (!IS_ALIGNED(pos_in, bs) || !IS_ALIGNED(pos_in + blen, bs) ||
+	    !IS_ALIGNED(pos_out, bs) || !IS_ALIGNED(pos_out + blen, bs))
+		return -EINVAL;
+
+	/* Don't allow overlapped reflink within the same file */
+	if (same_inode && pos_out + blen > pos_in && pos_out < pos_in + blen)
+		return -EINVAL;
+
+	/* Wait for the completion of any pending IOs on srcfile */
+	ret = xfs_file_wait_for_io(inode_in, pos_in, len);
+	if (ret)
+		goto out_unlock;
+	ret = xfs_file_wait_for_io(inode_out, pos_out, len);
+	if (ret)
+		goto out_unlock;
+
+	ret = xfs_reflink_remap_range(XFS_I(inode_in), pos_in, XFS_I(inode_out),
+			pos_out, len);
+	if (ret < 0)
+		goto out_unlock;
+
+out_unlock:
+	return ret;
+}
+
+STATIC ssize_t
+xfs_file_copy_range(
+	struct file	*file_in,
+	loff_t		pos_in,
+	struct file	*file_out,
+	loff_t		pos_out,
+	size_t		len,
+	unsigned int	flags)
+{
+	int		error;
+
+	error = xfs_file_share_range(file_in, pos_in, file_out, pos_out,
+				     len);
+	if (error)
+		return error;
+	return len;
+}
+
+STATIC int
+xfs_file_clone_range(
+	struct file	*file_in,
+	loff_t		pos_in,
+	struct file	*file_out,
+	loff_t		pos_out,
+	u64		len)
+{
+	return xfs_file_share_range(file_in, pos_in, file_out, pos_out,
+				     len);
+}
 
 STATIC int
 xfs_file_open(
@@ -1634,6 +1774,8 @@ const struct file_operations xfs_file_operations = {
 	.release	= xfs_file_release,
 	.fsync		= xfs_file_fsync,
 	.fallocate	= xfs_file_fallocate,
+	.copy_file_range = xfs_file_copy_range,
+	.clone_file_range = xfs_file_clone_range,
 };
 
 const struct file_operations xfs_dir_file_operations = {



* [PATCH 43/63] xfs: add dedupe range vfs function
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (41 preceding siblings ...)
  2016-09-30  3:10 ` [PATCH 42/63] xfs: add clone file and clone range vfs functions Darrick J. Wong
@ 2016-09-30  3:10 ` Darrick J. Wong
  2016-09-30  7:53   ` Christoph Hellwig
  2016-09-30  3:10 ` [PATCH 44/63] xfs: teach get_bmapx about shared extents and the CoW fork Darrick J. Wong
                   ` (20 subsequent siblings)
  63 siblings, 1 reply; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:10 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel

Define a VFS function which allows userspace to request that the
kernel reflink a range of blocks between two files if the ranges'
contents match.  The function fits the new VFS ioctl that standardizes
the checking for the btrfs EXTENT SAME ioctl.
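The comparison in xfs_compare_extents() walks the two ranges a page at a time, mapping one source page and one destination page per step; because the offsets need not share the same page alignment, each step compares only the overlap of the two in-page remainders. A hedged userspace model of that chunking loop (plain buffers stand in for the page cache; all names are mine):

```c
#include <assert.h>
#include <string.h>

#define PAGE_SIZE 4096

static long long minll(long long a, long long b) { return a < b ? a : b; }

/*
 * Compare len bytes of src at srcoff against dest at destoff, never
 * crossing a page boundary in either range within one memcmp, exactly
 * as the kernel code never holds more than one page of each file.
 * Returns 1 if the ranges match, 0 otherwise.
 */
static int ranges_same(const unsigned char *src, long long srcoff,
		       const unsigned char *dest, long long destoff,
		       long long len)
{
	while (len) {
		long long src_poff = srcoff & (PAGE_SIZE - 1);
		long long dest_poff = destoff & (PAGE_SIZE - 1);
		long long cmp_len = minll(PAGE_SIZE - src_poff,
					  PAGE_SIZE - dest_poff);

		cmp_len = minll(cmp_len, len);
		if (memcmp(src + srcoff, dest + destoff, cmp_len))
			return 0;
		srcoff += cmp_len;
		destoff += cmp_len;
		len -= cmp_len;
	}
	return 1;
}

/* Two demo "files" and a deterministic fill for testing. */
static unsigned char A[2 * PAGE_SIZE], B[2 * PAGE_SIZE];

static void fill_demo(void)
{
	for (int i = 0; i < (int)sizeof(A); i++)
		A[i] = B[i] = (unsigned char)(i * 7);
}
```

A mismatch anywhere in the range makes the whole dedupe request fail with -EBADE, which is why the ioctl clamps the length (XFS_MAX_DEDUPE_LEN) rather than comparing arbitrarily large ranges.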

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
v2: Plug into the VFS function pointers instead of handling ioctls
directly, and lock the pages so they don't disappear while we're
trying to compare them.
---
 fs/xfs/xfs_file.c    |   48 +++++++++++++++++--
 fs/xfs/xfs_reflink.c |  127 ++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_reflink.h |    5 ++
 3 files changed, 174 insertions(+), 6 deletions(-)


diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 3db3f34..450bf2b 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1007,7 +1007,8 @@ xfs_file_share_range(
 	loff_t		pos_in,
 	struct file	*file_out,
 	loff_t		pos_out,
-	u64		len)
+	u64		len,
+	bool		is_dedupe)
 {
 	struct inode	*inode_in;
 	struct inode	*inode_out;
@@ -1016,6 +1017,7 @@ xfs_file_share_range(
 	loff_t		isize;
 	int		same_inode;
 	loff_t		blen;
+	unsigned int	flags = 0;
 
 	inode_in = file_inode(file_in);
 	inode_out = file_inode(file_out);
@@ -1053,6 +1055,15 @@ xfs_file_share_range(
 	    pos_in + len > isize)
 		return -EINVAL;
 
+	/* Don't allow dedupe past EOF in the dest file */
+	if (is_dedupe) {
+		loff_t	disize;
+
+		disize = i_size_read(inode_out);
+		if (pos_out >= disize || pos_out + len > disize)
+			return -EINVAL;
+	}
+
 	/* If we're linking to EOF, continue to the block boundary. */
 	if (pos_in + len == isize)
 		blen = ALIGN(isize, bs) - pos_in;
@@ -1076,8 +1087,10 @@ xfs_file_share_range(
 	if (ret)
 		goto out_unlock;
 
+	if (is_dedupe)
+		flags |= XFS_REFLINK_DEDUPE;
 	ret = xfs_reflink_remap_range(XFS_I(inode_in), pos_in, XFS_I(inode_out),
-			pos_out, len);
+			pos_out, len, flags);
 	if (ret < 0)
 		goto out_unlock;
 
@@ -1097,7 +1110,7 @@ xfs_file_copy_range(
 	int		error;
 
 	error = xfs_file_share_range(file_in, pos_in, file_out, pos_out,
-				     len);
+				     len, false);
 	if (error)
 		return error;
 	return len;
@@ -1112,7 +1125,33 @@ xfs_file_clone_range(
 	u64		len)
 {
 	return xfs_file_share_range(file_in, pos_in, file_out, pos_out,
-				     len);
+				     len, false);
+}
+
+#define XFS_MAX_DEDUPE_LEN	(16 * 1024 * 1024)
+STATIC ssize_t
+xfs_file_dedupe_range(
+	struct file	*src_file,
+	u64		loff,
+	u64		len,
+	struct file	*dst_file,
+	u64		dst_loff)
+{
+	int		error;
+
+	/*
+	 * Limit the total length we will dedupe for each operation.
+	 * This is intended to bound the total time spent in this
+	 * ioctl to something sane.
+	 */
+	if (len > XFS_MAX_DEDUPE_LEN)
+		len = XFS_MAX_DEDUPE_LEN;
+
+	error = xfs_file_share_range(src_file, loff, dst_file, dst_loff,
+				     len, true);
+	if (error)
+		return error;
+	return len;
 }
 
 STATIC int
@@ -1776,6 +1815,7 @@ const struct file_operations xfs_file_operations = {
 	.fallocate	= xfs_file_fallocate,
 	.copy_file_range = xfs_file_copy_range,
 	.clone_file_range = xfs_file_clone_range,
+	.dedupe_file_range = xfs_file_dedupe_range,
 };
 
 const struct file_operations xfs_dir_file_operations = {
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 94c19fff..77ac810 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1277,6 +1277,111 @@ xfs_reflink_remap_blocks(
 }
 
 /*
+ * Read a page's worth of file data into the page cache.  Return the page
+ * locked.
+ */
+static struct page *
+xfs_get_page(
+	struct inode	*inode,
+	xfs_off_t	offset)
+{
+	struct address_space	*mapping;
+	struct page		*page;
+	pgoff_t			n;
+
+	n = offset >> PAGE_SHIFT;
+	mapping = inode->i_mapping;
+	page = read_mapping_page(mapping, n, NULL);
+	if (IS_ERR(page))
+		return page;
+	if (!PageUptodate(page)) {
+		put_page(page);
+		return ERR_PTR(-EIO);
+	}
+	lock_page(page);
+	return page;
+}
+
+/*
+ * Compare extents of two files to see if they are the same.
+ */
+static int
+xfs_compare_extents(
+	struct inode	*src,
+	xfs_off_t	srcoff,
+	struct inode	*dest,
+	xfs_off_t	destoff,
+	xfs_off_t	len,
+	bool		*is_same)
+{
+	xfs_off_t	src_poff;
+	xfs_off_t	dest_poff;
+	void		*src_addr;
+	void		*dest_addr;
+	struct page	*src_page;
+	struct page	*dest_page;
+	xfs_off_t	cmp_len;
+	bool		same;
+	int		error;
+
+	error = -EINVAL;
+	same = true;
+	while (len) {
+		src_poff = srcoff & (PAGE_SIZE - 1);
+		dest_poff = destoff & (PAGE_SIZE - 1);
+		cmp_len = min(PAGE_SIZE - src_poff,
+			      PAGE_SIZE - dest_poff);
+		cmp_len = min(cmp_len, len);
+		ASSERT(cmp_len > 0);
+
+		trace_xfs_reflink_compare_extents(XFS_I(src), srcoff, cmp_len,
+				XFS_I(dest), destoff);
+
+		src_page = xfs_get_page(src, srcoff);
+		if (IS_ERR(src_page)) {
+			error = PTR_ERR(src_page);
+			goto out_error;
+		}
+		dest_page = xfs_get_page(dest, destoff);
+		if (IS_ERR(dest_page)) {
+			error = PTR_ERR(dest_page);
+			unlock_page(src_page);
+			put_page(src_page);
+			goto out_error;
+		}
+		src_addr = kmap_atomic(src_page);
+		dest_addr = kmap_atomic(dest_page);
+
+		flush_dcache_page(src_page);
+		flush_dcache_page(dest_page);
+
+		if (memcmp(src_addr + src_poff, dest_addr + dest_poff, cmp_len))
+			same = false;
+
+		kunmap_atomic(dest_addr);
+		kunmap_atomic(src_addr);
+		unlock_page(dest_page);
+		unlock_page(src_page);
+		put_page(dest_page);
+		put_page(src_page);
+
+		if (!same)
+			break;
+
+		srcoff += cmp_len;
+		destoff += cmp_len;
+		len -= cmp_len;
+	}
+
+	*is_same = same;
+	return 0;
+
+out_error:
+	trace_xfs_reflink_compare_extents_error(XFS_I(dest), error, _RET_IP_);
+	return error;
+}
+
+/*
  * Link a range of blocks from one file to another.
  */
 int
@@ -1285,12 +1390,14 @@ xfs_reflink_remap_range(
 	xfs_off_t		srcoff,
 	struct xfs_inode	*dest,
 	xfs_off_t		destoff,
-	xfs_off_t		len)
+	xfs_off_t		len,
+	unsigned int		flags)
 {
 	struct xfs_mount	*mp = src->i_mount;
 	xfs_fileoff_t		sfsbno, dfsbno;
 	xfs_filblks_t		fsblen;
 	int			error;
+	bool			is_same;
 
 	if (!xfs_sb_version_hasreflink(&mp->m_sb))
 		return -EOPNOTSUPP;
@@ -1302,6 +1409,9 @@ xfs_reflink_remap_range(
 	if (XFS_IS_REALTIME_INODE(src) || XFS_IS_REALTIME_INODE(dest))
 		return -EINVAL;
 
+	if (flags & ~XFS_REFLINK_ALL)
+		return -EINVAL;
+
 	trace_xfs_reflink_remap_range(src, srcoff, len, dest, destoff);
 
 	/* Lock both files against IO */
@@ -1313,6 +1423,21 @@ xfs_reflink_remap_range(
 		xfs_lock_two_inodes(src, dest, XFS_MMAPLOCK_EXCL);
 	}
 
+	/*
+	 * Check that the extents are the same.
+	 */
+	if (flags & XFS_REFLINK_DEDUPE) {
+		is_same = false;
+		error = xfs_compare_extents(VFS_I(src), srcoff, VFS_I(dest),
+				destoff, len, &is_same);
+		if (error)
+			goto out_error;
+		if (!is_same) {
+			error = -EBADE;
+			goto out_error;
+		}
+	}
+
 	error = xfs_reflink_set_inode_flag(src, dest);
 	if (error)
 		goto out_error;
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index c35ce29..df82b20 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -43,7 +43,10 @@ extern int xfs_reflink_cancel_cow_range(struct xfs_inode *ip, xfs_off_t offset,
 extern int xfs_reflink_end_cow(struct xfs_inode *ip, xfs_off_t offset,
 		xfs_off_t count);
 extern int xfs_reflink_recover_cow(struct xfs_mount *mp);
+#define XFS_REFLINK_DEDUPE	1	/* only reflink if contents match */
+#define XFS_REFLINK_ALL		(XFS_REFLINK_DEDUPE)
 extern int xfs_reflink_remap_range(struct xfs_inode *src, xfs_off_t srcoff,
-		struct xfs_inode *dest, xfs_off_t destoff, xfs_off_t len);
+		struct xfs_inode *dest, xfs_off_t destoff, xfs_off_t len,
+		unsigned int flags);
 
 #endif /* __XFS_REFLINK_H */



* [PATCH 44/63] xfs: teach get_bmapx about shared extents and the CoW fork
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (42 preceding siblings ...)
  2016-09-30  3:10 ` [PATCH 43/63] xfs: add dedupe range vfs function Darrick J. Wong
@ 2016-09-30  3:10 ` Darrick J. Wong
  2016-09-30  7:53   ` Christoph Hellwig
  2016-09-30  3:10 ` [PATCH 45/63] xfs: swap inode reflink flags when swapping inode extents Darrick J. Wong
                   ` (19 subsequent siblings)
  63 siblings, 1 reply; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:10 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel

Teach xfs_getbmapx how to report shared extents and CoW fork contents
accurately in the bmap output by querying the refcount btree
appropriately.
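The adjustment splits each reported mapping wherever it transitions between shared and unshared, so every getbmapx record carries a uniform BMV_OF_SHARED status. A toy model of that trimming decision (the interval list stands in for refcount-btree lookups; all names here are mine, not kernel API):

```c
#include <assert.h>

#define BMV_OF_SHARED 0x8	/* segment shared with another file */

/* A run of blocks known (from the refcount btree) to be shared. */
struct shared_ival { unsigned long long start, len; };

/*
 * Given an extent [start, start + len) and a sorted, non-overlapping
 * list of shared runs, return the length of the leading portion with
 * uniform shared/unshared status and set *oflags accordingly.  The
 * caller would emit that portion as one bmap record and loop on the
 * remainder, mirroring how the patch trims around transitions.
 */
static unsigned long long uniform_prefix(unsigned long long start,
					 unsigned long long len,
					 const struct shared_ival *iv, int n,
					 int *oflags)
{
	*oflags = 0;
	for (int i = 0; i < n; i++) {
		unsigned long long s = iv[i].start;
		unsigned long long e = iv[i].start + iv[i].len;

		if (e <= start)
			continue;	/* shared run entirely before extent */
		if (s <= start) {	/* extent starts inside a shared run */
			*oflags = BMV_OF_SHARED;
			return (e - start < len) ? e - start : len;
		}
		/* extent starts unshared; stop at the next shared run */
		return (s - start < len) ? s - start : len;
	}
	return len;	/* nothing overlaps: the whole extent is unshared */
}
```

Only written data-fork blocks can be shared, which is why the kernel code bails out early for holes, delalloc, and unwritten extents before doing any refcount lookup.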

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_fs.h |    4 +
 fs/xfs/xfs_bmap_util.c |  146 +++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 135 insertions(+), 15 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 6f4f2c3..3d1efe5 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -81,14 +81,16 @@ struct getbmapx {
 #define BMV_IF_PREALLOC		0x4	/* rtn status BMV_OF_PREALLOC if req */
 #define BMV_IF_DELALLOC		0x8	/* rtn status BMV_OF_DELALLOC if req */
 #define BMV_IF_NO_HOLES		0x10	/* Do not return holes */
+#define BMV_IF_COWFORK		0x20	/* return CoW fork rather than data */
 #define BMV_IF_VALID	\
 	(BMV_IF_ATTRFORK|BMV_IF_NO_DMAPI_READ|BMV_IF_PREALLOC|	\
-	 BMV_IF_DELALLOC|BMV_IF_NO_HOLES)
+	 BMV_IF_DELALLOC|BMV_IF_NO_HOLES|BMV_IF_COWFORK)
 
 /*	bmv_oflags values - returned for each non-header segment */
 #define BMV_OF_PREALLOC		0x1	/* segment = unwritten pre-allocation */
 #define BMV_OF_DELALLOC		0x2	/* segment = delayed allocation */
 #define BMV_OF_LAST		0x4	/* segment is the last in the file */
+#define BMV_OF_SHARED		0x8	/* segment shared with another file */
 
 /*
  * Structure for XFS_IOC_FSSETDM.
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index e827d65..6a95a3a 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -42,6 +42,9 @@
 #include "xfs_icache.h"
 #include "xfs_log.h"
 #include "xfs_rmap_btree.h"
+#include "xfs_iomap.h"
+#include "xfs_reflink.h"
+#include "xfs_refcount.h"
 
 /* Kernel only BMAP related definitions and functions */
 
@@ -389,11 +392,13 @@ xfs_bmap_count_blocks(
 STATIC int
 xfs_getbmapx_fix_eof_hole(
 	xfs_inode_t		*ip,		/* xfs incore inode pointer */
+	int			whichfork,
 	struct getbmapx		*out,		/* output structure */
 	int			prealloced,	/* this is a file with
 						 * preallocated data space */
 	__int64_t		end,		/* last block requested */
-	xfs_fsblock_t		startblock)
+	xfs_fsblock_t		startblock,
+	bool			moretocome)
 {
 	__int64_t		fixlen;
 	xfs_mount_t		*mp;		/* file system mount point */
@@ -418,8 +423,9 @@ xfs_getbmapx_fix_eof_hole(
 		else
 			out->bmv_block = xfs_fsb_to_db(ip, startblock);
 		fileblock = XFS_BB_TO_FSB(ip->i_mount, out->bmv_offset);
-		ifp = XFS_IFORK_PTR(ip, XFS_DATA_FORK);
-		if (xfs_iext_bno_to_ext(ifp, fileblock, &lastx) &&
+		ifp = XFS_IFORK_PTR(ip, whichfork);
+		if (!moretocome &&
+		    xfs_iext_bno_to_ext(ifp, fileblock, &lastx) &&
 		   (lastx == (ifp->if_bytes / (uint)sizeof(xfs_bmbt_rec_t))-1))
 			out->bmv_oflags |= BMV_OF_LAST;
 	}
@@ -427,6 +433,73 @@ xfs_getbmapx_fix_eof_hole(
 	return 1;
 }
 
+/* Adjust the reported bmap around shared/unshared extent transitions. */
+STATIC int
+xfs_getbmap_adjust_shared(
+	struct xfs_inode		*ip,
+	int				whichfork,
+	struct xfs_bmbt_irec		*map,
+	struct getbmapx			*out,
+	struct xfs_bmbt_irec		*next_map)
+{
+	struct xfs_mount		*mp = ip->i_mount;
+	xfs_agnumber_t			agno;
+	xfs_agblock_t			agbno;
+	xfs_agblock_t			ebno;
+	xfs_extlen_t			elen;
+	xfs_extlen_t			nlen;
+	int				error;
+
+	next_map->br_blockcount = 0;
+
+	/* Only written data blocks can be shared. */
+	if (!xfs_is_reflink_inode(ip) || whichfork != XFS_DATA_FORK ||
+	    map->br_startblock == DELAYSTARTBLOCK ||
+	    map->br_startblock == HOLESTARTBLOCK ||
+	    ISUNWRITTEN(map))
+		return 0;
+
+	agno = XFS_FSB_TO_AGNO(mp, map->br_startblock);
+	agbno = XFS_FSB_TO_AGBNO(mp, map->br_startblock);
+	error = xfs_reflink_find_shared(mp, agno, agbno, map->br_blockcount,
+			&ebno, &elen, true);
+	if (error)
+		return error;
+
+	*next_map = *map;
+	if (agbno == ebno) {
+		/*
+		 * Shared extent at (agbno, elen).  Shrink the reported
+		 * extent length and prepare to move the start of map[i]
+		 * to agbno+elen, with the aim of (re)formatting the new
+		 * map[i] the next time through the inner loop.
+		 */
+		out->bmv_length = XFS_FSB_TO_BB(mp, elen);
+		out->bmv_oflags |= BMV_OF_SHARED;
+		next_map->br_startblock += elen;
+		next_map->br_startoff += elen;
+		next_map->br_blockcount -= elen;
+		map->br_blockcount -= elen;
+	} else {
+		/*
+		 * There's an unshared extent (agbno, ebno - agbno)
+		 * followed by shared extent at (ebno, elen).  Shrink
+		 * the reported extent length to cover only the unshared
+		 * extent and prepare to move up the start of map[i] to
+		 * ebno, with the aim of (re)formatting the new map[i]
+		 * the next time through the inner loop.
+		 */
+		nlen = ebno - agbno;
+		out->bmv_length = XFS_FSB_TO_BB(mp, nlen);
+		next_map->br_startblock += nlen;
+		next_map->br_startoff += nlen;
+		next_map->br_blockcount -= nlen;
+		map->br_blockcount -= nlen;
+	}
+
+	return 0;
+}
+
 /*
  * Get inode's extents as described in bmv, and format for output.
  * Calls formatter to fill the user's buffer until all extents
@@ -459,12 +532,28 @@ xfs_getbmap(
 	int			iflags;		/* interface flags */
 	int			bmapi_flags;	/* flags for xfs_bmapi */
 	int			cur_ext = 0;
+	struct xfs_bmbt_irec	inject_map;
 
 	mp = ip->i_mount;
 	iflags = bmv->bmv_iflags;
-	whichfork = iflags & BMV_IF_ATTRFORK ? XFS_ATTR_FORK : XFS_DATA_FORK;
 
-	if (whichfork == XFS_ATTR_FORK) {
+#ifndef DEBUG
+	/* Only allow CoW fork queries if we're debugging. */
+	if (iflags & BMV_IF_COWFORK)
+		return -EINVAL;
+#endif
+	if ((iflags & BMV_IF_ATTRFORK) && (iflags & BMV_IF_COWFORK))
+		return -EINVAL;
+
+	if (iflags & BMV_IF_ATTRFORK)
+		whichfork = XFS_ATTR_FORK;
+	else if (iflags & BMV_IF_COWFORK)
+		whichfork = XFS_COW_FORK;
+	else
+		whichfork = XFS_DATA_FORK;
+
+	switch (whichfork) {
+	case XFS_ATTR_FORK:
 		if (XFS_IFORK_Q(ip)) {
 			if (ip->i_d.di_aformat != XFS_DINODE_FMT_EXTENTS &&
 			    ip->i_d.di_aformat != XFS_DINODE_FMT_BTREE &&
@@ -480,7 +569,15 @@ xfs_getbmap(
 
 		prealloced = 0;
 		fixlen = 1LL << 32;
-	} else {
+		break;
+	case XFS_COW_FORK:
+		if (ip->i_cformat != XFS_DINODE_FMT_EXTENTS)
+			return -EINVAL;
+
+		prealloced = 0;
+		fixlen = XFS_ISIZE(ip);
+		break;
+	default:
 		if (ip->i_d.di_format != XFS_DINODE_FMT_EXTENTS &&
 		    ip->i_d.di_format != XFS_DINODE_FMT_BTREE &&
 		    ip->i_d.di_format != XFS_DINODE_FMT_LOCAL)
@@ -494,6 +591,7 @@ xfs_getbmap(
 			prealloced = 0;
 			fixlen = XFS_ISIZE(ip);
 		}
+		break;
 	}
 
 	if (bmv->bmv_length == -1) {
@@ -520,7 +618,8 @@ xfs_getbmap(
 		return -ENOMEM;
 
 	xfs_ilock(ip, XFS_IOLOCK_SHARED);
-	if (whichfork == XFS_DATA_FORK) {
+	switch (whichfork) {
+	case XFS_DATA_FORK:
 		if (!(iflags & BMV_IF_DELALLOC) &&
 		    (ip->i_delayed_blks || XFS_ISIZE(ip) > ip->i_d.di_size)) {
 			error = filemap_write_and_wait(VFS_I(ip)->i_mapping);
@@ -538,8 +637,14 @@ xfs_getbmap(
 		}
 
 		lock = xfs_ilock_data_map_shared(ip);
-	} else {
+		break;
+	case XFS_COW_FORK:
+		lock = XFS_ILOCK_SHARED;
+		xfs_ilock(ip, lock);
+		break;
+	case XFS_ATTR_FORK:
 		lock = xfs_ilock_attr_map_shared(ip);
+		break;
 	}
 
 	/*
@@ -581,7 +686,8 @@ xfs_getbmap(
 			goto out_free_map;
 		ASSERT(nmap <= subnex);
 
-		for (i = 0; i < nmap && nexleft && bmv->bmv_length; i++) {
+		for (i = 0; i < nmap && nexleft && bmv->bmv_length &&
+				cur_ext < bmv->bmv_count; i++) {
 			out[cur_ext].bmv_oflags = 0;
 			if (map[i].br_state == XFS_EXT_UNWRITTEN)
 				out[cur_ext].bmv_oflags |= BMV_OF_PREALLOC;
@@ -614,9 +720,16 @@ xfs_getbmap(
 				goto out_free_map;
 			}
 
-			if (!xfs_getbmapx_fix_eof_hole(ip, &out[cur_ext],
-					prealloced, bmvend,
-					map[i].br_startblock))
+			/* Is this a shared block? */
+			error = xfs_getbmap_adjust_shared(ip, whichfork,
+					&map[i], &out[cur_ext], &inject_map);
+			if (error)
+				goto out_free_map;
+
+			if (!xfs_getbmapx_fix_eof_hole(ip, whichfork,
+					&out[cur_ext], prealloced, bmvend,
+					map[i].br_startblock,
+					inject_map.br_blockcount != 0))
 				goto out_free_map;
 
 			bmv->bmv_offset =
@@ -636,11 +749,16 @@ xfs_getbmap(
 				continue;
 			}
 
-			nexleft--;
+			if (inject_map.br_blockcount) {
+				map[i] = inject_map;
+				i--;
+			} else
+				nexleft--;
 			bmv->bmv_entries++;
 			cur_ext++;
 		}
-	} while (nmap && nexleft && bmv->bmv_length);
+	} while (nmap && nexleft && bmv->bmv_length &&
+		 cur_ext < bmv->bmv_count);
 
  out_free_map:
 	kmem_free(map);


^ permalink raw reply related	[flat|nested] 187+ messages in thread

* [PATCH 45/63] xfs: swap inode reflink flags when swapping inode extents
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (43 preceding siblings ...)
  2016-09-30  3:10 ` [PATCH 44/63] xfs: teach get_bmapx about shared extents and the CoW fork Darrick J. Wong
@ 2016-09-30  3:10 ` Darrick J. Wong
  2016-09-30  7:54   ` Christoph Hellwig
  2016-09-30  3:10 ` [PATCH 46/63] xfs: unshare a range of blocks via fallocate Darrick J. Wong
                   ` (18 subsequent siblings)
  63 siblings, 1 reply; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:10 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel

When we're swapping the extents of two inodes, be sure to swap the
reflink inode flags and the CoW fork contents too.
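The patch exchanges just the XFS_DIFLAG2_REFLINK bit between the two flag words while leaving all other flag bits in place. A standalone sketch of that mask-and-exchange, using a stand-in bit value of our own rather than the real XFS_DIFLAG2_REFLINK definition:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for XFS_DIFLAG2_REFLINK; the real bit value may differ. */
#define DEMO_FLAG_REFLINK	(1ULL << 3)

/*
 * Swap only the reflink bit between two flag words, preserving every
 * other bit, exactly as the hunk in xfs_swap_extents does.
 */
static void swap_reflink_flag(uint64_t *a, uint64_t *b)
{
	uint64_t f = *a & DEMO_FLAG_REFLINK;	/* save a's reflink bit */

	*a &= ~DEMO_FLAG_REFLINK;
	*a |= *b & DEMO_FLAG_REFLINK;		/* a takes b's bit */
	*b &= ~DEMO_FLAG_REFLINK;
	*b |= f;				/* b takes a's saved bit */
}
```

Swapping twice restores both words, and unrelated bits in either word are never disturbed.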

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_bmap_util.c |   15 +++++++++++++++
 1 file changed, 15 insertions(+)


diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 6a95a3a..a835e12 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1652,6 +1652,8 @@ xfs_swap_extents(
 	int		taforkblks = 0;
 	__uint64_t	tmp;
 	int		lock_flags;
+	struct xfs_ifork	*cowfp;
+	__uint64_t	f;
 
 	/* XXX: we can't do this with rmap, will fix later */
 	if (xfs_sb_version_hasrmapbt(&mp->m_sb))
@@ -1865,6 +1867,19 @@ xfs_swap_extents(
 		break;
 	}
 
+	/* Do we have to swap reflink flags? */
+	if ((ip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK) ^
+	    (tip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK)) {
+		f = ip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK;
+		ip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
+		ip->i_d.di_flags2 |= tip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK;
+		tip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
+		tip->i_d.di_flags2 |= f & XFS_DIFLAG2_REFLINK;
+		cowfp = ip->i_cowfp;
+		ip->i_cowfp = tip->i_cowfp;
+		tip->i_cowfp = cowfp;
+	}
+
 	xfs_trans_log_inode(tp, ip,  src_log_flags);
 	xfs_trans_log_inode(tp, tip, target_log_flags);
 


* [PATCH 46/63] xfs: unshare a range of blocks via fallocate
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (44 preceding siblings ...)
  2016-09-30  3:10 ` [PATCH 45/63] xfs: swap inode reflink flags when swapping inode extents Darrick J. Wong
@ 2016-09-30  3:10 ` Darrick J. Wong
  2016-09-30  7:54   ` Christoph Hellwig
  2016-10-07 18:05   ` Brian Foster
  2016-09-30  3:10 ` [PATCH 47/63] xfs: create a separate cow extent size hint for the allocator Darrick J. Wong
                   ` (17 subsequent siblings)
  63 siblings, 2 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:10 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel, Christoph Hellwig

Unshare all shared extents if the user calls fallocate with the new
unshare mode flag set, so that we can guarantee that a subsequent
write will not fail with ENOSPC.
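Once every range has been copied up, xfs_reflink_try_clear_inode_flag rescans the file and drops the reflink flag only if no extent is still shared; a single remaining shared run cancels the transaction and keeps the flag. A tiny userspace sketch of that decision, over a hypothetical pre-summarized extent list (names are ours; the kernel instead walks the bmap and queries the refcount btree per extent):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-extent summary: does any block in it remain shared? */
struct extsum {
	bool shared;
};

/*
 * Mirror the scan in xfs_reflink_try_clear_inode_flag: the reflink
 * flag may be cleared only when no extent in the file is still shared.
 */
static bool can_clear_reflink_flag(const struct extsum *exts, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (exts[i].shared)
			return false;	/* bail out, keep the flag */
	return true;
}
```

Note the kernel version also re-checks i_size under the lock and cancels any leftover CoW mappings before clearing the flag; this sketch covers only the scan itself.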

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
[hch: pass inode instead of file to xfs_reflink_dirty_range,
      use iomap infrastructure for copy up]
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_file.c    |   10 ++
 fs/xfs/xfs_reflink.c |  237 ++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_reflink.h |    2 
 3 files changed, 247 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 450bf2b..f3e5cb1 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -845,7 +845,7 @@ xfs_file_write_iter(
 #define	XFS_FALLOC_FL_SUPPORTED						\
 		(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |		\
 		 FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE |	\
-		 FALLOC_FL_INSERT_RANGE)
+		 FALLOC_FL_INSERT_RANGE | FALLOC_FL_UNSHARE_RANGE)
 
 STATIC long
 xfs_file_fallocate(
@@ -935,9 +935,15 @@ xfs_file_fallocate(
 
 		if (mode & FALLOC_FL_ZERO_RANGE)
 			error = xfs_zero_file_space(ip, offset, len);
-		else
+		else {
+			if (mode & FALLOC_FL_UNSHARE_RANGE) {
+				error = xfs_reflink_unshare(ip, offset, len);
+				if (error)
+					goto out_unlock;
+			}
 			error = xfs_alloc_file_space(ip, offset, len,
 						     XFS_BMAPI_PREALLOC);
+		}
 		if (error)
 			goto out_unlock;
 	}
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 77ac810..065e836 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1472,3 +1472,240 @@ xfs_reflink_remap_range(
 		trace_xfs_reflink_remap_range_error(dest, error, _RET_IP_);
 	return error;
 }
+
+/*
+ * The user wants to preemptively CoW all shared blocks in this file,
+ * which enables us to turn off the reflink flag.  Iterate all
+ * extents which are not prealloc/delalloc to see which ranges are
+ * mentioned in the refcount tree, then read those blocks into the
+ * pagecache, dirty them, fsync them back out, and then we can update
+ * the inode flag.  What happens if we run out of memory? :)
+ */
+STATIC int
+xfs_reflink_dirty_extents(
+	struct xfs_inode	*ip,
+	xfs_fileoff_t		fbno,
+	xfs_filblks_t		end,
+	xfs_off_t		isize)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	xfs_agnumber_t		agno;
+	xfs_agblock_t		agbno;
+	xfs_extlen_t		aglen;
+	xfs_agblock_t		rbno;
+	xfs_extlen_t		rlen;
+	xfs_off_t		fpos;
+	xfs_off_t		flen;
+	struct xfs_bmbt_irec	map[2];
+	int			nmaps;
+	int			error;
+
+	while (end - fbno > 0) {
+		nmaps = 1;
+		/*
+		 * Look for extents in the file.  Skip holes, delalloc, or
+		 * unwritten extents; they can't be reflinked.
+		 */
+		error = xfs_bmapi_read(ip, fbno, end - fbno, map, &nmaps, 0);
+		if (error)
+			goto out;
+		if (nmaps == 0)
+			break;
+		if (map[0].br_startblock == HOLESTARTBLOCK ||
+		    map[0].br_startblock == DELAYSTARTBLOCK ||
+		    ISUNWRITTEN(&map[0]))
+			goto next;
+
+		map[1] = map[0];
+		while (map[1].br_blockcount) {
+			agno = XFS_FSB_TO_AGNO(mp, map[1].br_startblock);
+			agbno = XFS_FSB_TO_AGBNO(mp, map[1].br_startblock);
+			aglen = map[1].br_blockcount;
+
+			error = xfs_reflink_find_shared(mp, agno, agbno, aglen,
+					&rbno, &rlen, true);
+			if (error)
+				goto out;
+			if (rlen == 0)
+				goto skip_copy;
+
+			/* Dirty the pages */
+			xfs_iunlock(ip, XFS_ILOCK_EXCL);
+			fpos = XFS_FSB_TO_B(mp, map[1].br_startoff +
+					(rbno - agbno));
+			flen = XFS_FSB_TO_B(mp, rlen);
+			if (fpos + flen > isize)
+				flen = isize - fpos;
+			error = iomap_file_dirty(VFS_I(ip), fpos, flen,
+					&xfs_iomap_ops);
+			xfs_ilock(ip, XFS_ILOCK_EXCL);
+			if (error)
+				goto out;
+skip_copy:
+			map[1].br_blockcount -= (rbno - agbno + rlen);
+			map[1].br_startoff += (rbno - agbno + rlen);
+			map[1].br_startblock += (rbno - agbno + rlen);
+		}
+
+next:
+		fbno = map[0].br_startoff + map[0].br_blockcount;
+	}
+out:
+	return error;
+}
+
+/* Iterate the extents; if there are no reflinked blocks, clear the flag. */
+STATIC int
+xfs_reflink_try_clear_inode_flag(
+	struct xfs_inode	*ip,
+	xfs_off_t		old_isize)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_trans	*tp;
+	xfs_fileoff_t		fbno;
+	xfs_filblks_t		end;
+	xfs_agnumber_t		agno;
+	xfs_agblock_t		agbno;
+	xfs_extlen_t		aglen;
+	xfs_agblock_t		rbno;
+	xfs_extlen_t		rlen;
+	struct xfs_bmbt_irec	map[2];
+	int			nmaps;
+	int			error = 0;
+
+	/* Start a rolling transaction to remove the mappings */
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, 0, &tp);
+	if (error)
+		return error;
+
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	xfs_trans_ijoin(tp, ip, 0);
+
+	if (old_isize != i_size_read(VFS_I(ip)))
+		goto cancel;
+	if (!(ip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK))
+		goto cancel;
+
+	fbno = 0;
+	end = XFS_B_TO_FSB(mp, old_isize);
+	while (end - fbno > 0) {
+		nmaps = 1;
+		/*
+		 * Look for extents in the file.  Skip holes, delalloc, or
+		 * unwritten extents; they can't be reflinked.
+		 */
+		error = xfs_bmapi_read(ip, fbno, end - fbno, map, &nmaps, 0);
+		if (error)
+			goto cancel;
+		if (nmaps == 0)
+			break;
+		if (map[0].br_startblock == HOLESTARTBLOCK ||
+		    map[0].br_startblock == DELAYSTARTBLOCK ||
+		    ISUNWRITTEN(&map[0]))
+			goto next;
+
+		map[1] = map[0];
+		while (map[1].br_blockcount) {
+			agno = XFS_FSB_TO_AGNO(mp, map[1].br_startblock);
+			agbno = XFS_FSB_TO_AGBNO(mp, map[1].br_startblock);
+			aglen = map[1].br_blockcount;
+
+			error = xfs_reflink_find_shared(mp, agno, agbno, aglen,
+					&rbno, &rlen, false);
+			if (error)
+				goto cancel;
+			/* Is there still a shared block here? */
+			if (rlen > 0) {
+				error = 0;
+				goto cancel;
+			}
+
+			map[1].br_blockcount -= aglen;
+			map[1].br_startoff += aglen;
+			map[1].br_startblock += aglen;
+		}
+
+next:
+		fbno = map[0].br_startoff + map[0].br_blockcount;
+	}
+
+	/*
+	 * We didn't find any shared blocks so turn off the reflink flag.
+	 * First, get rid of any leftover CoW mappings.
+	 */
+	error = xfs_reflink_cancel_cow_blocks(ip, &tp, 0, NULLFILEOFF);
+	if (error)
+		goto cancel;
+
+	/* Clear the inode flag. */
+	trace_xfs_reflink_unset_inode_flag(ip);
+	ip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
+	xfs_trans_ijoin(tp, ip, 0);
+	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+
+	error = xfs_trans_commit(tp);
+	if (error)
+		goto out;
+
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	return 0;
+cancel:
+	xfs_trans_cancel(tp);
+out:
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	return error;
+}
+
+/*
+ * Pre-COW all shared blocks within a given byte range of a file and turn off
+ * the reflink flag if we unshare all of the file's blocks.
+ */
+int
+xfs_reflink_unshare(
+	struct xfs_inode	*ip,
+	xfs_off_t		offset,
+	xfs_off_t		len)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	xfs_fileoff_t		fbno;
+	xfs_filblks_t		end;
+	xfs_off_t		old_isize, isize;
+	int			error;
+
+	if (!xfs_is_reflink_inode(ip))
+		return 0;
+
+	trace_xfs_reflink_unshare(ip, offset, len);
+
+	inode_dio_wait(VFS_I(ip));
+
+	/* Try to CoW the selected ranges */
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	fbno = XFS_B_TO_FSB(mp, offset);
+	old_isize = isize = i_size_read(VFS_I(ip));
+	end = XFS_B_TO_FSB(mp, offset + len);
+	error = xfs_reflink_dirty_extents(ip, fbno, end, isize);
+	if (error)
+		goto out_unlock;
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+
+	/* Wait for the IO to finish */
+	error = filemap_write_and_wait(VFS_I(ip)->i_mapping);
+	if (error)
+		goto out;
+
+	/* Turn off the reflink flag if we unshared the whole file */
+	if (offset == 0 && len == isize) {
+		error = xfs_reflink_try_clear_inode_flag(ip, old_isize);
+		if (error)
+			goto out;
+	}
+
+	return 0;
+
+out_unlock:
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+out:
+	trace_xfs_reflink_unshare_error(ip, error, _RET_IP_);
+	return error;
+}
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index df82b20..ad4fc61 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -48,5 +48,7 @@ extern int xfs_reflink_recover_cow(struct xfs_mount *mp);
 extern int xfs_reflink_remap_range(struct xfs_inode *src, xfs_off_t srcoff,
 		struct xfs_inode *dest, xfs_off_t destoff, xfs_off_t len,
 		unsigned int flags);
+extern int xfs_reflink_unshare(struct xfs_inode *ip, xfs_off_t offset,
+		xfs_off_t len);
 
 #endif /* __XFS_REFLINK_H */


* [PATCH 47/63] xfs: create a separate cow extent size hint for the allocator
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (45 preceding siblings ...)
  2016-09-30  3:10 ` [PATCH 46/63] xfs: unshare a range of blocks via fallocate Darrick J. Wong
@ 2016-09-30  3:10 ` Darrick J. Wong
  2016-09-30  7:55   ` Christoph Hellwig
  2016-09-30  3:10 ` [PATCH 48/63] xfs: preallocate blocks for worst-case btree expansion Darrick J. Wong
                   ` (16 subsequent siblings)
  63 siblings, 1 reply; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:10 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel

Create a per-inode extent size allocator hint for copy-on-write.  This
hint is separate from the existing extent size hint so that CoW can
take advantage of the fragmentation-reducing properties of extent size
hints without disabling delalloc for regular writes.

The extent size hint that's fed to the allocator during a copy-on-write
operation is the greater of the cowextsize and regular extsize hints.

During reflink, if we're sharing the entire source file to the entire
destination file and the destination file doesn't already have a
cowextsize hint, propagate the source file's cowextsize hint to the
destination file.

Furthermore, zero the bulkstat buffer prior to setting the fields
so that we don't copy kernel memory contents into userspace.
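The hint selection itself is simple: as xfs_get_cowextsz_hint in this patch shows, the allocator gets the larger of the CoW extent size hint (when the COWEXTSIZE flag is set) and the ordinary extent size hint. A minimal standalone restatement of that rule, with the hints already resolved to block counts (parameter names are ours):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pick the hint the allocator sees for a CoW allocation: the greater
 * of the CoW extent size hint and the regular extent size hint, as in
 * xfs_get_cowextsz_hint.  A hint of 0 means "no hint".
 */
static uint32_t cow_alloc_hint(uint32_t cowextsize_hint,
			       uint32_t extsize_hint)
{
	return cowextsize_hint > extsize_hint ? cowextsize_hint
					      : extsize_hint;
}
```

So an inode with cowextsize of 32 blocks and extsize of 16 blocks gets 32-block CoW allocations, while an inode with no cowextsize hint falls back to its regular extsize hint.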

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c       |    9 ++++-
 fs/xfs/libxfs/xfs_format.h     |    3 +-
 fs/xfs/libxfs/xfs_fs.h         |    3 +-
 fs/xfs/libxfs/xfs_inode_buf.c  |    4 ++
 fs/xfs/libxfs/xfs_inode_buf.h  |    1 +
 fs/xfs/libxfs/xfs_log_format.h |    3 +-
 fs/xfs/xfs_bmap_util.c         |    9 ++++-
 fs/xfs/xfs_inode.c             |   33 ++++++++++++++++++++
 fs/xfs/xfs_inode.h             |    1 +
 fs/xfs/xfs_inode_item.c        |    2 +
 fs/xfs/xfs_ioctl.c             |   67 +++++++++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_iomap.c             |    2 +
 fs/xfs/xfs_iomap.h             |    1 +
 fs/xfs/xfs_itable.c            |    8 ++++-
 fs/xfs/xfs_reflink.c           |   41 ++++++++++++++++++++----
 15 files changed, 167 insertions(+), 20 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 0ef7fb4..69a6ae6 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3666,7 +3666,9 @@ xfs_bmap_btalloc(
 	else if (mp->m_dalign)
 		stripe_align = mp->m_dalign;
 
-	if (xfs_alloc_is_userdata(ap->datatype))
+	if (ap->flags & XFS_BMAPI_COWFORK)
+		align = xfs_get_cowextsz_hint(ap->ip);
+	else if (xfs_alloc_is_userdata(ap->datatype))
 		align = xfs_get_extsz_hint(ap->ip);
 	if (unlikely(align)) {
 		error = xfs_bmap_extsize_align(mp, &ap->got, &ap->prev,
@@ -4184,7 +4186,10 @@ xfs_bmapi_reserve_delalloc(
 		alen = XFS_FILBLKS_MIN(alen, got->br_startoff - aoff);
 
 	/* Figure out the extent size, adjust alen */
-	extsz = xfs_get_extsz_hint(ip);
+	if (whichfork == XFS_COW_FORK)
+		extsz = xfs_get_cowextsz_hint(ip);
+	else
+		extsz = xfs_get_extsz_hint(ip);
 	if (extsz) {
 		error = xfs_bmap_extsize_align(mp, got, prev, extsz, rt, eof,
 					       1, 0, &aoff, &alen);
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index a7ae738..94f610a 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -901,7 +901,8 @@ typedef struct xfs_dinode {
 	__be64		di_changecount;	/* number of attribute changes */
 	__be64		di_lsn;		/* flush sequence */
 	__be64		di_flags2;	/* more random flags */
-	__u8		di_pad2[16];	/* more padding for future expansion */
+	__be32		di_cowextsize;	/* basic cow extent size for file */
+	__u8		di_pad2[12];	/* more padding for future expansion */
 
 	/* fields only written to during inode creation */
 	xfs_timestamp_t	di_crtime;	/* time created */
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 3d1efe5..b72dc82 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -278,7 +278,8 @@ typedef struct xfs_bstat {
 #define	bs_projid	bs_projid_lo	/* (previously just bs_projid)	*/
 	__u16		bs_forkoff;	/* inode fork offset in bytes	*/
 	__u16		bs_projid_hi;	/* higher part of project id	*/
-	unsigned char	bs_pad[10];	/* pad space, unused		*/
+	unsigned char	bs_pad[6];	/* pad space, unused		*/
+	__u32		bs_cowextsize;	/* cow extent size		*/
 	__u32		bs_dmevmask;	/* DMIG event mask		*/
 	__u16		bs_dmstate;	/* DMIG state info		*/
 	__u16		bs_aextents;	/* attribute number of extents	*/
diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index 4b9769e..a3e8038 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -256,6 +256,7 @@ xfs_inode_from_disk(
 		to->di_crtime.t_sec = be32_to_cpu(from->di_crtime.t_sec);
 		to->di_crtime.t_nsec = be32_to_cpu(from->di_crtime.t_nsec);
 		to->di_flags2 = be64_to_cpu(from->di_flags2);
+		to->di_cowextsize = be32_to_cpu(from->di_cowextsize);
 	}
 }
 
@@ -305,7 +306,7 @@ xfs_inode_to_disk(
 		to->di_crtime.t_sec = cpu_to_be32(from->di_crtime.t_sec);
 		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
 		to->di_flags2 = cpu_to_be64(from->di_flags2);
-
+		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
 		to->di_ino = cpu_to_be64(ip->i_ino);
 		to->di_lsn = cpu_to_be64(lsn);
 		memset(to->di_pad2, 0, sizeof(to->di_pad2));
@@ -357,6 +358,7 @@ xfs_log_dinode_to_disk(
 		to->di_crtime.t_sec = cpu_to_be32(from->di_crtime.t_sec);
 		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
 		to->di_flags2 = cpu_to_be64(from->di_flags2);
+		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
 		to->di_ino = cpu_to_be64(from->di_ino);
 		to->di_lsn = cpu_to_be64(from->di_lsn);
 		memcpy(to->di_pad2, from->di_pad2, sizeof(to->di_pad2));
diff --git a/fs/xfs/libxfs/xfs_inode_buf.h b/fs/xfs/libxfs/xfs_inode_buf.h
index 7c4dd32..62d9d46 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.h
+++ b/fs/xfs/libxfs/xfs_inode_buf.h
@@ -47,6 +47,7 @@ struct xfs_icdinode {
 	__uint16_t	di_flags;	/* random flags, XFS_DIFLAG_... */
 
 	__uint64_t	di_flags2;	/* more random flags */
+	__uint32_t	di_cowextsize;	/* basic cow extent size for file */
 
 	xfs_ictimestamp_t di_crtime;	/* time created */
 };
diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
index 5dd7c2e..364ce6f 100644
--- a/fs/xfs/libxfs/xfs_log_format.h
+++ b/fs/xfs/libxfs/xfs_log_format.h
@@ -423,7 +423,8 @@ struct xfs_log_dinode {
 	__uint64_t	di_changecount;	/* number of attribute changes */
 	xfs_lsn_t	di_lsn;		/* flush sequence */
 	__uint64_t	di_flags2;	/* more random flags */
-	__uint8_t	di_pad2[16];	/* more padding for future expansion */
+	__uint32_t	di_cowextsize;	/* basic cow extent size for file */
+	__uint8_t	di_pad2[12];	/* more padding for future expansion */
 
 	/* fields only written to during inode creation */
 	xfs_ictimestamp_t di_crtime;	/* time created */
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index a835e12..d91f406 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -574,8 +574,13 @@ xfs_getbmap(
 		if (ip->i_cformat != XFS_DINODE_FMT_EXTENTS)
 			return -EINVAL;
 
-		prealloced = 0;
-		fixlen = XFS_ISIZE(ip);
+		if (xfs_get_cowextsz_hint(ip)) {
+			prealloced = 1;
+			fixlen = mp->m_super->s_maxbytes;
+		} else {
+			prealloced = 0;
+			fixlen = XFS_ISIZE(ip);
+		}
 		break;
 	default:
 		if (ip->i_d.di_format != XFS_DINODE_FMT_EXTENTS &&
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 8c971fd..a4b061e 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -78,6 +78,27 @@ xfs_get_extsz_hint(
 }
 
 /*
+ * Helper function to extract CoW extent size hint from inode.
+ * Between the extent size hint and the CoW extent size hint, we
+ * return the greater of the two.
+ */
+xfs_extlen_t
+xfs_get_cowextsz_hint(
+	struct xfs_inode	*ip)
+{
+	xfs_extlen_t		a, b;
+
+	a = 0;
+	if (ip->i_d.di_flags2 & XFS_DIFLAG2_COWEXTSIZE)
+		a = ip->i_d.di_cowextsize;
+	b = xfs_get_extsz_hint(ip);
+
+	if (a > b)
+		return a;
+	return b;
+}
+
+/*
  * These two are wrapper routines around the xfs_ilock() routine used to
  * centralize some grungy code.  They are used in places that wish to lock the
  * inode solely for reading the extents.  The reason these places can't just
@@ -652,6 +673,8 @@ _xfs_dic2xflags(
 	if (di_flags2 & XFS_DIFLAG2_ANY) {
 		if (di_flags2 & XFS_DIFLAG2_DAX)
 			flags |= FS_XFLAG_DAX;
+		if (di_flags2 & XFS_DIFLAG2_COWEXTSIZE)
+			flags |= FS_XFLAG_COWEXTSIZE;
 	}
 
 	if (has_attr)
@@ -835,6 +858,7 @@ xfs_ialloc(
 	if (ip->i_d.di_version == 3) {
 		inode->i_version = 1;
 		ip->i_d.di_flags2 = 0;
+		ip->i_d.di_cowextsize = 0;
 		ip->i_d.di_crtime.t_sec = (__int32_t)tv.tv_sec;
 		ip->i_d.di_crtime.t_nsec = (__int32_t)tv.tv_nsec;
 	}
@@ -897,6 +921,15 @@ xfs_ialloc(
 			ip->i_d.di_flags |= di_flags;
 			ip->i_d.di_flags2 |= di_flags2;
 		}
+		if (pip &&
+		    (pip->i_d.di_flags2 & XFS_DIFLAG2_ANY) &&
+		    pip->i_d.di_version == 3 &&
+		    ip->i_d.di_version == 3) {
+			if (pip->i_d.di_flags2 & XFS_DIFLAG2_COWEXTSIZE) {
+				ip->i_d.di_flags2 |= XFS_DIFLAG2_COWEXTSIZE;
+				ip->i_d.di_cowextsize = pip->i_d.di_cowextsize;
+			}
+		}
 		/* FALLTHROUGH */
 	case S_IFLNK:
 		ip->i_d.di_format = XFS_DINODE_FMT_EXTENTS;
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 1af1d8d..b1c32d4 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -426,6 +426,7 @@ int		xfs_iflush(struct xfs_inode *, struct xfs_buf **);
 void		xfs_lock_two_inodes(xfs_inode_t *, xfs_inode_t *, uint);
 
 xfs_extlen_t	xfs_get_extsz_hint(struct xfs_inode *ip);
+xfs_extlen_t	xfs_get_cowextsz_hint(struct xfs_inode *ip);
 
 int		xfs_dir_ialloc(struct xfs_trans **, struct xfs_inode *, umode_t,
 			       xfs_nlink_t, xfs_dev_t, prid_t, int,
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 892c2ac..9610e9c 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -368,7 +368,7 @@ xfs_inode_to_log_dinode(
 		to->di_crtime.t_sec = from->di_crtime.t_sec;
 		to->di_crtime.t_nsec = from->di_crtime.t_nsec;
 		to->di_flags2 = from->di_flags2;
-
+		to->di_cowextsize = from->di_cowextsize;
 		to->di_ino = ip->i_ino;
 		to->di_lsn = lsn;
 		memset(to->di_pad2, 0, sizeof(to->di_pad2));
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 96a70fd..1388a127 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -903,6 +903,8 @@ xfs_ioc_fsgetxattr(
 	xfs_ilock(ip, XFS_ILOCK_SHARED);
 	fa.fsx_xflags = xfs_ip2xflags(ip);
 	fa.fsx_extsize = ip->i_d.di_extsize << ip->i_mount->m_sb.sb_blocklog;
+	fa.fsx_cowextsize = ip->i_d.di_cowextsize <<
+			ip->i_mount->m_sb.sb_blocklog;
 	fa.fsx_projid = xfs_get_projid(ip);
 
 	if (attr) {
@@ -973,12 +975,13 @@ xfs_set_diflags(
 	if (ip->i_d.di_version < 3)
 		return;
 
-	di_flags2 = 0;
+	di_flags2 = (ip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK);
 	if (xflags & FS_XFLAG_DAX)
 		di_flags2 |= XFS_DIFLAG2_DAX;
+	if (xflags & FS_XFLAG_COWEXTSIZE)
+		di_flags2 |= XFS_DIFLAG2_COWEXTSIZE;
 
 	ip->i_d.di_flags2 = di_flags2;
-
 }
 
 STATIC void
@@ -1219,6 +1222,56 @@ xfs_ioctl_setattr_check_extsize(
 	return 0;
 }
 
+/*
+ * CoW extent size hint validation rules are:
+ *
+ * 1. CoW extent size hint can only be set if reflink is enabled on the fs.
+ *    The inode does not have to have any shared blocks, but it must be a v3.
+ * 2. FS_XFLAG_COWEXTSIZE is only valid for directories and regular files;
+ *    for a directory, the hint is propagated to new files.
+ * 3. Can be changed on files & directories at any time.
+ * 4. CoW extsize hint of 0 turns off hints, clears inode flags.
+ * 5. Extent size must be a multiple of the appropriate block size.
+ * 6. The extent size hint must be limited to half the AG size to avoid
+ *    alignment extending the extent beyond the limits of the AG.
+ */
+static int
+xfs_ioctl_setattr_check_cowextsize(
+	struct xfs_inode	*ip,
+	struct fsxattr		*fa)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+
+	if (!(fa->fsx_xflags & FS_XFLAG_COWEXTSIZE))
+		return 0;
+
+	if (!xfs_sb_version_hasreflink(&ip->i_mount->m_sb) ||
+	    ip->i_d.di_version != 3)
+		return -EINVAL;
+
+	if (!S_ISREG(VFS_I(ip)->i_mode) && !S_ISDIR(VFS_I(ip)->i_mode))
+		return -EINVAL;
+
+	if (fa->fsx_cowextsize != 0) {
+		xfs_extlen_t    size;
+		xfs_fsblock_t   cowextsize_fsb;
+
+		cowextsize_fsb = XFS_B_TO_FSB(mp, fa->fsx_cowextsize);
+		if (cowextsize_fsb > MAXEXTLEN)
+			return -EINVAL;
+
+		size = mp->m_sb.sb_blocksize;
+		if (cowextsize_fsb > mp->m_sb.sb_agblocks / 2)
+			return -EINVAL;
+
+		if (fa->fsx_cowextsize % size)
+			return -EINVAL;
+	} else
+		fa->fsx_xflags &= ~FS_XFLAG_COWEXTSIZE;
+
+	return 0;
+}
+
 static int
 xfs_ioctl_setattr_check_projid(
 	struct xfs_inode	*ip,
@@ -1311,6 +1364,10 @@ xfs_ioctl_setattr(
 	if (code)
 		goto error_trans_cancel;
 
+	code = xfs_ioctl_setattr_check_cowextsize(ip, fa);
+	if (code)
+		goto error_trans_cancel;
+
 	code = xfs_ioctl_setattr_xflags(tp, ip, fa);
 	if (code)
 		goto error_trans_cancel;
@@ -1346,6 +1403,12 @@ xfs_ioctl_setattr(
 		ip->i_d.di_extsize = fa->fsx_extsize >> mp->m_sb.sb_blocklog;
 	else
 		ip->i_d.di_extsize = 0;
+	if (ip->i_d.di_version == 3 &&
+	    (ip->i_d.di_flags2 & XFS_DIFLAG2_COWEXTSIZE))
+		ip->i_d.di_cowextsize = fa->fsx_cowextsize >>
+				mp->m_sb.sb_blocklog;
+	else
+		ip->i_d.di_cowextsize = 0;
 
 	code = xfs_trans_commit(tp);
 
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 765849e..d907eb9 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -71,7 +71,7 @@ xfs_bmbt_to_iomap(
 	iomap->bdev = xfs_find_bdev_for_inode(VFS_I(ip));
 }
 
-static xfs_extlen_t
+xfs_extlen_t
 xfs_eof_alignment(
 	struct xfs_inode	*ip,
 	xfs_extlen_t		extsize)
diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
index a16b956..6d45cf0 100644
--- a/fs/xfs/xfs_iomap.h
+++ b/fs/xfs/xfs_iomap.h
@@ -31,6 +31,7 @@ int xfs_iomap_write_unwritten(struct xfs_inode *, xfs_off_t, xfs_off_t);
 
 void xfs_bmbt_to_iomap(struct xfs_inode *, struct iomap *,
 		struct xfs_bmbt_irec *);
+xfs_extlen_t xfs_eof_alignment(struct xfs_inode *ip, xfs_extlen_t extsize);
 
 extern struct iomap_ops xfs_iomap_ops;
 extern struct iomap_ops xfs_xattr_iomap_ops;
diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
index ce73eb3..66e8817 100644
--- a/fs/xfs/xfs_itable.c
+++ b/fs/xfs/xfs_itable.c
@@ -66,7 +66,7 @@ xfs_bulkstat_one_int(
 	if (!buffer || xfs_internal_inum(mp, ino))
 		return -EINVAL;
 
-	buf = kmem_alloc(sizeof(*buf), KM_SLEEP | KM_MAYFAIL);
+	buf = kmem_zalloc(sizeof(*buf), KM_SLEEP | KM_MAYFAIL);
 	if (!buf)
 		return -ENOMEM;
 
@@ -111,6 +111,12 @@ xfs_bulkstat_one_int(
 	buf->bs_aextents = dic->di_anextents;
 	buf->bs_forkoff = XFS_IFORK_BOFF(ip);
 
+	if (dic->di_version == 3) {
+		if (dic->di_flags2 & XFS_DIFLAG2_COWEXTSIZE)
+			buf->bs_cowextsize = dic->di_cowextsize <<
+					mp->m_sb.sb_blocklog;
+	}
+
 	switch (dic->di_format) {
 	case XFS_DINODE_FMT_DEV:
 		buf->bs_rdev = ip->i_df.if_u2.if_rdev;
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 065e836..75d49df 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -255,6 +255,7 @@ __xfs_reflink_reserve_cow(
 	int			nimaps, eof = 0, error = 0;
 	bool			shared = false, trimmed = false;
 	xfs_extnum_t		idx;
+	xfs_extlen_t		align;
 
 	/* Already reserved?  Skip the refcount btree access. */
 	xfs_bmap_search_extents(ip, *offset_fsb, XFS_COW_FORK, &eof, &idx,
@@ -294,6 +295,10 @@ __xfs_reflink_reserve_cow(
 	if (error)
 		goto out_unlock;
 
+	align = xfs_eof_alignment(ip, xfs_get_cowextsz_hint(ip));
+	if (align)
+		end_fsb = roundup_64(end_fsb, align);
+
 retry:
 	error = xfs_bmapi_reserve_delalloc(ip, XFS_COW_FORK, *offset_fsb,
 			end_fsb - *offset_fsb, &got,
@@ -1057,18 +1062,19 @@ xfs_reflink_set_inode_flag(
 }
 
 /*
- * Update destination inode size, if necessary.
+ * Update destination inode size & cowextsize hint, if necessary.
  */
 STATIC int
 xfs_reflink_update_dest(
 	struct xfs_inode	*dest,
-	xfs_off_t		newlen)
+	xfs_off_t		newlen,
+	xfs_extlen_t		cowextsize)
 {
 	struct xfs_mount	*mp = dest->i_mount;
 	struct xfs_trans	*tp;
 	int			error;
 
-	if (newlen <= i_size_read(VFS_I(dest)))
+	if (newlen <= i_size_read(VFS_I(dest)) && cowextsize == 0)
 		return 0;
 
 	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_ichange, 0, 0, 0, &tp);
@@ -1078,9 +1084,17 @@ xfs_reflink_update_dest(
 	xfs_ilock(dest, XFS_ILOCK_EXCL);
 	xfs_trans_ijoin(tp, dest, XFS_ILOCK_EXCL);
 
-	trace_xfs_reflink_update_inode_size(dest, newlen);
-	i_size_write(VFS_I(dest), newlen);
-	dest->i_d.di_size = newlen;
+	if (newlen > i_size_read(VFS_I(dest))) {
+		trace_xfs_reflink_update_inode_size(dest, newlen);
+		i_size_write(VFS_I(dest), newlen);
+		dest->i_d.di_size = newlen;
+	}
+
+	if (cowextsize) {
+		dest->i_d.di_cowextsize = cowextsize;
+		dest->i_d.di_flags2 |= XFS_DIFLAG2_COWEXTSIZE;
+	}
+
 	xfs_trans_log_inode(tp, dest, XFS_ILOG_CORE);
 
 	error = xfs_trans_commit(tp);
@@ -1397,6 +1411,7 @@ xfs_reflink_remap_range(
 	xfs_fileoff_t		sfsbno, dfsbno;
 	xfs_filblks_t		fsblen;
 	int			error;
+	xfs_extlen_t		cowextsize;
 	bool			is_same;
 
 	if (!xfs_sb_version_hasreflink(&mp->m_sb))
@@ -1457,7 +1472,19 @@ xfs_reflink_remap_range(
 	if (error)
 		goto out_error;
 
-	error = xfs_reflink_update_dest(dest, destoff + len);
+	/*
+	 * Carry the cowextsize hint from src to dest if we're sharing the
+	 * entire source file to the entire destination file, the source file
+	 * has a cowextsize hint, and the destination file does not.
+	 */
+	cowextsize = 0;
+	if (srcoff == 0 && len == i_size_read(VFS_I(src)) &&
+	    (src->i_d.di_flags2 & XFS_DIFLAG2_COWEXTSIZE) &&
+	    destoff == 0 && len >= i_size_read(VFS_I(dest)) &&
+	    !(dest->i_d.di_flags2 & XFS_DIFLAG2_COWEXTSIZE))
+		cowextsize = src->i_d.di_cowextsize;
+
+	error = xfs_reflink_update_dest(dest, destoff + len, cowextsize);
 	if (error)
 		goto out_error;
 


^ permalink raw reply related	[flat|nested] 187+ messages in thread

* [PATCH 48/63] xfs: preallocate blocks for worst-case btree expansion
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (46 preceding siblings ...)
  2016-09-30  3:10 ` [PATCH 47/63] xfs: create a separate cow extent size hint for the allocator Darrick J. Wong
@ 2016-09-30  3:10 ` Darrick J. Wong
  2016-09-30  8:19   ` Christoph Hellwig
  2016-10-12 18:44   ` Brian Foster
  2016-09-30  3:10 ` [PATCH 49/63] xfs: don't allow reflink when the AG is low on space Darrick J. Wong
                   ` (15 subsequent siblings)
  63 siblings, 2 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:10 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel, Christoph Hellwig

To gracefully handle the situation where a CoW operation turns a
single refcount extent into a lot of tiny ones and then runs out of
space when a tree split has to happen, use the per-AG reserved block
pool to pre-allocate all the space we'll ever need for a maximal
btree.  For a 4K block size, this costs an overhead of only 0.3% of
available disk space.

When reflink is enabled, we have an unfortunate problem with rmap --
since we can share a block billions of times, this means that the
reverse mapping btree can expand basically infinitely.  When an AG is
so full that there are no free blocks with which to expand the rmapbt,
the filesystem will shut down hard.

This is rather annoying to the user, so use the AG reservation code to
reserve a "reasonable" amount of space for rmap.  Rather than shutting
down, we'll prevent reflinks and CoW operations if we think we're
getting close to exhausting an AG's free space, and this permanent
reservation should be enough for "most" users.  Hopefully.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
[hch@lst.de: ensure that we invalidate the freed btree buffer]
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
v2: Simplify the return value from xfs_perag_pool_free_block to a bool
so that we can easily call xfs_trans_binval for both the per-AG pool
and the real freeing case.  Without this we fail to invalidate the
btree buffer and will trip over the write verifier on a shrinking
refcount btree.

v3: Convert to the new per-AG reservation code.

v4: Combine this patch with the one that adds the rmapbt reservation,
since the rmapbt reservation is only needed for reflink filesystems.

v5: If we detect errors while counting the refcount or rmap btrees,
shut down the filesystem to avoid the scenario where the fs shuts down
mid-transaction due to btree corruption, repair refuses to run until
the log is clean, and the log cannot be cleaned because replay hits
btree corruption and shuts down.
---
 fs/xfs/libxfs/xfs_ag_resv.c        |   11 ++++++
 fs/xfs/libxfs/xfs_refcount_btree.c |   45 ++++++++++++++++++++++++-
 fs/xfs/libxfs/xfs_refcount_btree.h |    3 ++
 fs/xfs/libxfs/xfs_rmap_btree.c     |   60 ++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_rmap_btree.h     |    7 ++++
 fs/xfs/xfs_fsops.c                 |   64 ++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_fsops.h                 |    3 ++
 fs/xfs/xfs_mount.c                 |    8 +++++
 fs/xfs/xfs_super.c                 |   12 +++++++
 9 files changed, 210 insertions(+), 3 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c
index e3ae0f2..adf770f 100644
--- a/fs/xfs/libxfs/xfs_ag_resv.c
+++ b/fs/xfs/libxfs/xfs_ag_resv.c
@@ -38,6 +38,7 @@
 #include "xfs_trans_space.h"
 #include "xfs_rmap_btree.h"
 #include "xfs_btree.h"
+#include "xfs_refcount_btree.h"
 
 /*
  * Per-AG Block Reservations
@@ -228,6 +229,11 @@ xfs_ag_resv_init(
 	if (pag->pag_meta_resv.ar_asked == 0) {
 		ask = used = 0;
 
+		error = xfs_refcountbt_calc_reserves(pag->pag_mount,
+				pag->pag_agno, &ask, &used);
+		if (error)
+			goto out;
+
 		error = __xfs_ag_resv_init(pag, XFS_AG_RESV_METADATA,
 				ask, used);
 		if (error)
@@ -238,6 +244,11 @@ xfs_ag_resv_init(
 	if (pag->pag_agfl_resv.ar_asked == 0) {
 		ask = used = 0;
 
+		error = xfs_rmapbt_calc_reserves(pag->pag_mount, pag->pag_agno,
+				&ask, &used);
+		if (error)
+			goto out;
+
 		error = __xfs_ag_resv_init(pag, XFS_AG_RESV_AGFL, ask, used);
 		if (error)
 			goto out;
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
index 6b5e82b9..453bb27 100644
--- a/fs/xfs/libxfs/xfs_refcount_btree.c
+++ b/fs/xfs/libxfs/xfs_refcount_btree.c
@@ -79,6 +79,8 @@ xfs_refcountbt_alloc_block(
 	struct xfs_alloc_arg	args;		/* block allocation args */
 	int			error;		/* error return value */
 
+	XFS_BTREE_TRACE_CURSOR(cur, XBT_ENTRY);
+
 	memset(&args, 0, sizeof(args));
 	args.tp = cur->bc_tp;
 	args.mp = cur->bc_mp;
@@ -88,6 +90,7 @@ xfs_refcountbt_alloc_block(
 	args.firstblock = args.fsbno;
 	xfs_rmap_ag_owner(&args.oinfo, XFS_RMAP_OWN_REFC);
 	args.minlen = args.maxlen = args.prod = 1;
+	args.resv = XFS_AG_RESV_METADATA;
 
 	error = xfs_alloc_vextent(&args);
 	if (error)
@@ -125,16 +128,19 @@ xfs_refcountbt_free_block(
 	struct xfs_agf		*agf = XFS_BUF_TO_AGF(agbp);
 	xfs_fsblock_t		fsbno = XFS_DADDR_TO_FSB(mp, XFS_BUF_ADDR(bp));
 	struct xfs_owner_info	oinfo;
+	int			error;
 
 	trace_xfs_refcountbt_free_block(cur->bc_mp, cur->bc_private.a.agno,
 			XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno), 1);
 	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_REFC);
 	be32_add_cpu(&agf->agf_refcount_blocks, -1);
 	xfs_alloc_log_agf(cur->bc_tp, agbp, XFS_AGF_REFCOUNT_BLOCKS);
-	xfs_bmap_add_free(mp, cur->bc_private.a.dfops, fsbno, 1,
-			&oinfo);
+	error = xfs_free_extent(cur->bc_tp, fsbno, 1, &oinfo,
+			XFS_AG_RESV_METADATA);
 
-	return 0;
+	return error;
 }
 
 STATIC int
@@ -410,3 +416,36 @@ xfs_refcountbt_max_size(
 
 	return xfs_refcountbt_calc_size(mp, mp->m_sb.sb_agblocks);
 }
+
+/*
+ * Figure out how many blocks to reserve and how many are used by this btree.
+ */
+int
+xfs_refcountbt_calc_reserves(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno,
+	xfs_extlen_t		*ask,
+	xfs_extlen_t		*used)
+{
+	struct xfs_buf		*agbp;
+	struct xfs_agf		*agf;
+	xfs_extlen_t		tree_len;
+	int			error;
+
+	if (!xfs_sb_version_hasreflink(&mp->m_sb))
+		return 0;
+
+	*ask += xfs_refcountbt_max_size(mp);
+
+	error = xfs_alloc_read_agf(mp, NULL, agno, 0, &agbp);
+	if (error)
+		return error;
+
+	agf = XFS_BUF_TO_AGF(agbp);
+	tree_len = be32_to_cpu(agf->agf_refcount_blocks);
+	xfs_buf_relse(agbp);
+
+	*used += tree_len;
+
+	return error;
+}
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.h b/fs/xfs/libxfs/xfs_refcount_btree.h
index 780b02f..3be7768 100644
--- a/fs/xfs/libxfs/xfs_refcount_btree.h
+++ b/fs/xfs/libxfs/xfs_refcount_btree.h
@@ -68,4 +68,7 @@ extern xfs_extlen_t xfs_refcountbt_calc_size(struct xfs_mount *mp,
 		unsigned long long len);
 extern xfs_extlen_t xfs_refcountbt_max_size(struct xfs_mount *mp);
 
+extern int xfs_refcountbt_calc_reserves(struct xfs_mount *mp,
+		xfs_agnumber_t agno, xfs_extlen_t *ask, xfs_extlen_t *used);
+
 #endif	/* __XFS_REFCOUNT_BTREE_H__ */
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
index 9c0585e..83e672f 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.c
+++ b/fs/xfs/libxfs/xfs_rmap_btree.c
@@ -35,6 +35,7 @@
 #include "xfs_cksum.h"
 #include "xfs_error.h"
 #include "xfs_extent_busy.h"
+#include "xfs_ag_resv.h"
 
 /*
  * Reverse map btree.
@@ -533,3 +534,62 @@ xfs_rmapbt_compute_maxlevels(
 		mp->m_rmap_maxlevels = xfs_btree_compute_maxlevels(mp,
 				mp->m_rmap_mnr, mp->m_sb.sb_agblocks);
 }
+
+/* Calculate the rmap btree size for some records. */
+xfs_extlen_t
+xfs_rmapbt_calc_size(
+	struct xfs_mount	*mp,
+	unsigned long long	len)
+{
+	return xfs_btree_calc_size(mp, mp->m_rmap_mnr, len);
+}
+
+/*
+ * Calculate the maximum rmap btree size.
+ */
+xfs_extlen_t
+xfs_rmapbt_max_size(
+	struct xfs_mount	*mp)
+{
+	/* Bail out if we're uninitialized, which can happen in mkfs. */
+	if (mp->m_rmap_mxr[0] == 0)
+		return 0;
+
+	return xfs_rmapbt_calc_size(mp, mp->m_sb.sb_agblocks);
+}
+
+/*
+ * Figure out how many blocks to reserve and how many are used by this btree.
+ */
+int
+xfs_rmapbt_calc_reserves(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno,
+	xfs_extlen_t		*ask,
+	xfs_extlen_t		*used)
+{
+	struct xfs_buf		*agbp;
+	struct xfs_agf		*agf;
+	xfs_extlen_t		pool_len;
+	xfs_extlen_t		tree_len;
+	int			error;
+
+	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
+		return 0;
+
+	/* Reserve 1% of the AG or enough for 1 block per record. */
+	pool_len = max(mp->m_sb.sb_agblocks / 100, xfs_rmapbt_max_size(mp));
+	*ask += pool_len;
+
+	error = xfs_alloc_read_agf(mp, NULL, agno, 0, &agbp);
+	if (error)
+		return error;
+
+	agf = XFS_BUF_TO_AGF(agbp);
+	tree_len = be32_to_cpu(agf->agf_rmap_blocks);
+	xfs_buf_relse(agbp);
+
+	*used += tree_len;
+
+	return error;
+}
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
index e73a553..2a9ac47 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.h
+++ b/fs/xfs/libxfs/xfs_rmap_btree.h
@@ -58,4 +58,11 @@ struct xfs_btree_cur *xfs_rmapbt_init_cursor(struct xfs_mount *mp,
 int xfs_rmapbt_maxrecs(struct xfs_mount *mp, int blocklen, int leaf);
 extern void xfs_rmapbt_compute_maxlevels(struct xfs_mount *mp);
 
+extern xfs_extlen_t xfs_rmapbt_calc_size(struct xfs_mount *mp,
+		unsigned long long len);
+extern xfs_extlen_t xfs_rmapbt_max_size(struct xfs_mount *mp);
+
+extern int xfs_rmapbt_calc_reserves(struct xfs_mount *mp,
+		xfs_agnumber_t agno, xfs_extlen_t *ask, xfs_extlen_t *used);
+
 #endif	/* __XFS_RMAP_BTREE_H__ */
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 3acbf4e0..93d12fa 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -43,6 +43,7 @@
 #include "xfs_log.h"
 #include "xfs_filestream.h"
 #include "xfs_rmap.h"
+#include "xfs_ag_resv.h"
 
 /*
  * File system operations
@@ -630,6 +631,11 @@ xfs_growfs_data_private(
 	xfs_set_low_space_thresholds(mp);
 	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
 
+	/* Reserve AG metadata blocks. */
+	error = xfs_fs_reserve_ag_blocks(mp);
+	if (error && error != -ENOSPC)
+		goto out;
+
 	/* update secondary superblocks. */
 	for (agno = 1; agno < nagcount; agno++) {
 		error = 0;
@@ -680,6 +686,8 @@ xfs_growfs_data_private(
 			continue;
 		}
 	}
+
+ out:
 	return saved_error ? saved_error : error;
 
  error0:
@@ -989,3 +997,59 @@ xfs_do_force_shutdown(
 	"Please umount the filesystem and rectify the problem(s)");
 	}
 }
+
+/*
+ * Reserve free space for per-AG metadata.
+ */
+int
+xfs_fs_reserve_ag_blocks(
+	struct xfs_mount	*mp)
+{
+	xfs_agnumber_t		agno;
+	struct xfs_perag	*pag;
+	int			error = 0;
+	int			err2;
+
+	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
+		pag = xfs_perag_get(mp, agno);
+		err2 = xfs_ag_resv_init(pag);
+		xfs_perag_put(pag);
+		if (err2 && !error)
+			error = err2;
+	}
+
+	if (error && error != -ENOSPC) {
+		xfs_warn(mp,
+	"Error %d reserving per-AG metadata reserve pool.", error);
+		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
+	}
+
+	return error;
+}
+
+/*
+ * Free space reserved for per-AG metadata.
+ */
+int
+xfs_fs_unreserve_ag_blocks(
+	struct xfs_mount	*mp)
+{
+	xfs_agnumber_t		agno;
+	struct xfs_perag	*pag;
+	int			error = 0;
+	int			err2;
+
+	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
+		pag = xfs_perag_get(mp, agno);
+		err2 = xfs_ag_resv_free(pag);
+		xfs_perag_put(pag);
+		if (err2 && !error)
+			error = err2;
+	}
+
+	if (error)
+		xfs_warn(mp,
+	"Error %d freeing per-AG metadata reserve pool.", error);
+
+	return error;
+}
diff --git a/fs/xfs/xfs_fsops.h b/fs/xfs/xfs_fsops.h
index f32713f..f349158 100644
--- a/fs/xfs/xfs_fsops.h
+++ b/fs/xfs/xfs_fsops.h
@@ -26,4 +26,7 @@ extern int xfs_reserve_blocks(xfs_mount_t *mp, __uint64_t *inval,
 				xfs_fsop_resblks_t *outval);
 extern int xfs_fs_goingdown(xfs_mount_t *mp, __uint32_t inflags);
 
+extern int xfs_fs_reserve_ag_blocks(struct xfs_mount *mp);
+extern int xfs_fs_unreserve_ag_blocks(struct xfs_mount *mp);
+
 #endif	/* __XFS_FSOPS_H__ */
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index caecbd2..b5da81d 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -986,10 +986,17 @@ xfs_mountfs(
 			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
 			goto out_quota;
 		}
+
+		/* Reserve AG blocks for future btree expansion. */
+		error = xfs_fs_reserve_ag_blocks(mp);
+		if (error && error != -ENOSPC)
+			goto out_agresv;
 	}
 
 	return 0;
 
+ out_agresv:
+	xfs_fs_unreserve_ag_blocks(mp);
  out_quota:
 	xfs_qm_unmount_quotas(mp);
  out_rtunmount:
@@ -1034,6 +1041,7 @@ xfs_unmountfs(
 
 	cancel_delayed_work_sync(&mp->m_eofblocks_work);
 
+	xfs_fs_unreserve_ag_blocks(mp);
 	xfs_qm_unmount_quotas(mp);
 	xfs_rtunmount_inodes(mp);
 	IRELE(mp->m_rootip);
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index e6aaa91..875ab9f 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1315,10 +1315,22 @@ xfs_fs_remount(
 			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
 			return error;
 		}
+
+		/* Create the per-AG metadata reservation pool. */
+		error = xfs_fs_reserve_ag_blocks(mp);
+		if (error && error != -ENOSPC)
+			return error;
 	}
 
 	/* rw -> ro */
 	if (!(mp->m_flags & XFS_MOUNT_RDONLY) && (*flags & MS_RDONLY)) {
+		/* Free the per-AG metadata reservation pool. */
+		error = xfs_fs_unreserve_ag_blocks(mp);
+		if (error) {
+			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
+			return error;
+		}
+
 		/*
 		 * Before we sync the metadata, we need to free up the reserve
 		 * block pool so that the used block count in the superblock on



* [PATCH 49/63] xfs: don't allow reflink when the AG is low on space
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (47 preceding siblings ...)
  2016-09-30  3:10 ` [PATCH 48/63] xfs: preallocate blocks for worst-case btree expansion Darrick J. Wong
@ 2016-09-30  3:10 ` Darrick J. Wong
  2016-09-30  8:19   ` Christoph Hellwig
  2016-09-30  3:11 ` [PATCH 50/63] xfs: try other AGs to allocate a BMBT block Darrick J. Wong
                   ` (14 subsequent siblings)
  63 siblings, 1 reply; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:10 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel

If the AG free space is down to the reserves, refuse to reflink our
way out of space.  Hopefully userspace will make a real copy and/or go
elsewhere.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_reflink.c |   35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)


diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 75d49df..a0c5ada 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -54,6 +54,8 @@
 #include "xfs_reflink.h"
 #include "xfs_iomap.h"
 #include "xfs_rmap_btree.h"
+#include "xfs_sb.h"
+#include "xfs_ag_resv.h"
 
 /*
  * Copy on Write of Shared Blocks
@@ -1108,6 +1110,31 @@ xfs_reflink_update_dest(
 }
 
 /*
+ * Do we have enough reserve in this AG to handle a reflink?  The refcount
+ * btree already reserved all the space it needs, but the rmap btree can grow
+ * infinitely, so we won't allow more reflinks when the AG is down to the
+ * btree reserves.
+ */
+static int
+xfs_reflink_ag_has_free_space(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno)
+{
+	struct xfs_perag	*pag;
+	int			error = 0;
+
+	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
+		return 0;
+
+	pag = xfs_perag_get(mp, agno);
+	if (xfs_ag_resv_critical(pag, XFS_AG_RESV_AGFL) ||
+	    xfs_ag_resv_critical(pag, XFS_AG_RESV_METADATA))
+		error = -ENOSPC;
+	xfs_perag_put(pag);
+	return error;
+}
+
+/*
  * Unmap a range of blocks from a file, then map other blocks into the hole.
  * The range to unmap is (destoff : destoff + srcioff + irec->br_blockcount).
  * The extent irec is mapped into dest at irec->br_startoff.
@@ -1139,6 +1166,14 @@ xfs_reflink_remap_extent(
 			irec->br_startblock != DELAYSTARTBLOCK &&
 			!ISUNWRITTEN(irec));
 
+	/* No reflinking if we're low on space */
+	if (real_extent) {
+		error = xfs_reflink_ag_has_free_space(mp,
+				XFS_FSB_TO_AGNO(mp, irec->br_startblock));
+		if (error)
+			goto out;
+	}
+
 	/* Start a rolling transaction to switch the mappings */
 	resblks = XFS_EXTENTADD_SPACE_RES(ip->i_mount, XFS_DATA_FORK);
 	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0, &tp);



* [PATCH 50/63] xfs: try other AGs to allocate a BMBT block
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (48 preceding siblings ...)
  2016-09-30  3:10 ` [PATCH 49/63] xfs: don't allow reflink when the AG is low on space Darrick J. Wong
@ 2016-09-30  3:11 ` Darrick J. Wong
  2016-09-30  8:20   ` Christoph Hellwig
  2016-09-30  3:11 ` [PATCH 51/63] xfs: garbage collect old cowextsz reservations Darrick J. Wong
                   ` (13 subsequent siblings)
  63 siblings, 1 reply; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:11 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel

Prior to the introduction of reflink, allocating a block and mapping
it into a file was performed in a single transaction with a single
block reservation, and the allocator was supposed to find enough
blocks to allocate the extent and any BMBT blocks that might be
necessary (unless we're low on space).

However, due to the way copy on write works, allocation and mapping
have been split into two transactions, which means that we must be
able to handle the case where we allocate an extent for CoW but that
AG runs out of free space before the blocks can be mapped into a file,
and the mapping requires a new BMBT block.  When this happens, look in
one of the other AGs for a BMBT block instead of taking the FS down.

The same applies to the functions that convert a data fork to extent
format and later to btree format.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c       |   30 ++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_bmap_btree.c |   17 +++++++++++++++++
 2 files changed, 47 insertions(+)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 69a6ae6..d87abc2 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -753,6 +753,7 @@ xfs_bmap_extents_to_btree(
 		args.type = XFS_ALLOCTYPE_START_BNO;
 		args.fsbno = XFS_INO_TO_FSB(mp, ip->i_ino);
 	} else if (dfops->dop_low) {
+try_another_ag:
 		args.type = XFS_ALLOCTYPE_START_BNO;
 		args.fsbno = *firstblock;
 	} else {
@@ -767,6 +768,21 @@ xfs_bmap_extents_to_btree(
 		xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
 		return error;
 	}
+
+	/*
+	 * During a CoW operation, the allocation and bmbt updates occur in
+	 * different transactions.  The mapping code tries to put new bmbt
+	 * blocks near extents being mapped, but the only way to guarantee this
+	 * is if the alloc and the mapping happen in a single transaction that
+	 * has a block reservation.  That isn't the case here, so if we run out
+	 * of space we'll try again with another AG.
+	 */
+	if (xfs_sb_version_hasreflink(&cur->bc_mp->m_sb) &&
+	    args.fsbno == NULLFSBLOCK &&
+	    args.type == XFS_ALLOCTYPE_NEAR_BNO) {
+		dfops->dop_low = true;
+		goto try_another_ag;
+	}
 	/*
 	 * Allocation can't fail, the space was reserved.
 	 */
@@ -902,6 +918,7 @@ xfs_bmap_local_to_extents(
 	 * file currently fits in an inode.
 	 */
 	if (*firstblock == NULLFSBLOCK) {
+try_another_ag:
 		args.fsbno = XFS_INO_TO_FSB(args.mp, ip->i_ino);
 		args.type = XFS_ALLOCTYPE_START_BNO;
 	} else {
@@ -914,6 +931,19 @@ xfs_bmap_local_to_extents(
 	if (error)
 		goto done;
 
+	/*
+	 * During a CoW operation, the allocation and bmbt updates occur in
+	 * different transactions.  The mapping code tries to put new bmbt
+	 * blocks near extents being mapped, but the only way to guarantee this
+	 * is if the alloc and the mapping happen in a single transaction that
+	 * has a block reservation.  That isn't the case here, so if we run out
+	 * of space we'll try again with another AG.
+	 */
+	if (xfs_sb_version_hasreflink(&ip->i_mount->m_sb) &&
+	    args.fsbno == NULLFSBLOCK &&
+	    args.type == XFS_ALLOCTYPE_NEAR_BNO) {
+		goto try_another_ag;
+	}
 	/* Can't fail, the space was reserved. */
 	ASSERT(args.fsbno != NULLFSBLOCK);
 	ASSERT(args.len == 1);
diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
index 37f0d9d..8007d2b 100644
--- a/fs/xfs/libxfs/xfs_bmap_btree.c
+++ b/fs/xfs/libxfs/xfs_bmap_btree.c
@@ -453,6 +453,7 @@ xfs_bmbt_alloc_block(
 
 	if (args.fsbno == NULLFSBLOCK) {
 		args.fsbno = be64_to_cpu(start->l);
+try_another_ag:
 		args.type = XFS_ALLOCTYPE_START_BNO;
 		/*
 		 * Make sure there is sufficient room left in the AG to
@@ -482,6 +483,22 @@ xfs_bmbt_alloc_block(
 	if (error)
 		goto error0;
 
+	/*
+	 * During a CoW operation, the allocation and bmbt updates occur in
+	 * different transactions.  The mapping code tries to put new bmbt
+	 * blocks near extents being mapped, but the only way to guarantee this
+	 * is if the alloc and the mapping happen in a single transaction that
+	 * has a block reservation.  That isn't the case here, so if we run out
+	 * of space we'll try again with another AG.
+	 */
+	if (xfs_sb_version_hasreflink(&cur->bc_mp->m_sb) &&
+	    args.fsbno == NULLFSBLOCK &&
+	    args.type == XFS_ALLOCTYPE_NEAR_BNO) {
+		cur->bc_private.b.dfops->dop_low = true;
+		args.fsbno = cur->bc_private.b.firstblock;
+		goto try_another_ag;
+	}
+
 	if (args.fsbno == NULLFSBLOCK && args.minleft) {
 		/*
 		 * Could not find an AG with enough free space to satisfy



* [PATCH 51/63] xfs: garbage collect old cowextsz reservations
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (49 preceding siblings ...)
  2016-09-30  3:11 ` [PATCH 50/63] xfs: try other AGs to allocate a BMBT block Darrick J. Wong
@ 2016-09-30  3:11 ` Darrick J. Wong
  2016-09-30  8:23   ` Christoph Hellwig
  2016-09-30  3:11 ` [PATCH 52/63] xfs: increase log reservations for reflink Darrick J. Wong
                   ` (12 subsequent siblings)
  63 siblings, 1 reply; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:11 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel

Trim CoW reservations made on behalf of a cowextsz hint if they get too
old or we run low on quota, so long as we don't have dirty data awaiting
writeback or directio operations in progress.

Garbage collection of the cowextsize extents is kept separate from
prealloc extent reaping because setting the CoW prealloc lifetime to a
(much) higher value than the regular prealloc extent lifetime has been
useful for combating CoW fragmentation on VM hosts where the VMs
experience bursty write behaviors and we can keep the utilization ratios
low enough that we don't start to run out of space.  IOWs, it benefits
us to keep the CoW fork reservations around for as long as we can unless
we run out of blocks or hit inode reclaim.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_bmap_util.c |    2 
 fs/xfs/xfs_file.c      |    3 +
 fs/xfs/xfs_globals.c   |    5 +
 fs/xfs/xfs_icache.c    |  238 ++++++++++++++++++++++++++++++++++++++++++------
 fs/xfs/xfs_icache.h    |    7 +
 fs/xfs/xfs_inode.c     |    4 +
 fs/xfs/xfs_linux.h     |    1 
 fs/xfs/xfs_mount.c     |    1 
 fs/xfs/xfs_mount.h     |    2 
 fs/xfs/xfs_reflink.c   |   38 ++++++++
 fs/xfs/xfs_reflink.h   |    2 
 fs/xfs/xfs_super.c     |    1 
 fs/xfs/xfs_sysctl.c    |    9 ++
 fs/xfs/xfs_sysctl.h    |    1 
 fs/xfs/xfs_trace.h     |    5 +
 15 files changed, 287 insertions(+), 32 deletions(-)


diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index d91f406..fd4b6bb 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1883,6 +1883,8 @@ xfs_swap_extents(
 		cowfp = ip->i_cowfp;
 		ip->i_cowfp = tip->i_cowfp;
 		tip->i_cowfp = cowfp;
+		xfs_inode_set_cowblocks_tag(ip);
+		xfs_inode_set_cowblocks_tag(tip);
 	}
 
 	xfs_trans_log_inode(tp, ip,  src_log_flags);
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index f3e5cb1..b8d3a8c 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -781,6 +781,9 @@ xfs_file_buffered_aio_write(
 		enospc = xfs_inode_free_quota_eofblocks(ip);
 		if (enospc)
 			goto write_retry;
+		enospc = xfs_inode_free_quota_cowblocks(ip);
+		if (enospc)
+			goto write_retry;
 	} else if (ret == -ENOSPC && !enospc) {
 		struct xfs_eofblocks eofb = {0};
 
diff --git a/fs/xfs/xfs_globals.c b/fs/xfs/xfs_globals.c
index 4d41b24..687a4b0 100644
--- a/fs/xfs/xfs_globals.c
+++ b/fs/xfs/xfs_globals.c
@@ -21,8 +21,8 @@
 /*
  * Tunable XFS parameters.  xfs_params is required even when CONFIG_SYSCTL=n,
  * other XFS code uses these values.  Times are measured in centisecs (i.e.
- * 100ths of a second) with the exception of eofb_timer, which is measured in
- * seconds.
+ * 100ths of a second) with the exception of eofb_timer and cowb_timer, which
+ * are measured in seconds.
  */
 xfs_param_t xfs_params = {
 			  /*	MIN		DFLT		MAX	*/
@@ -42,6 +42,7 @@ xfs_param_t xfs_params = {
 	.inherit_nodfrg	= {	0,		1,		1	},
 	.fstrm_timer	= {	1,		30*100,		3600*100},
 	.eofb_timer	= {	1,		300,		3600*24},
+	.cowb_timer	= {	1,		1800,		3600*24},
 };
 
 struct xfs_globals xfs_globals = {
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 2d3de02..14796b7 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -33,6 +33,7 @@
 #include "xfs_bmap_util.h"
 #include "xfs_dquot_item.h"
 #include "xfs_dquot.h"
+#include "xfs_reflink.h"
 
 #include <linux/kthread.h>
 #include <linux/freezer.h>
@@ -792,6 +793,33 @@ xfs_eofblocks_worker(
 	xfs_queue_eofblocks(mp);
 }
 
+/*
+ * Background scanning to trim preallocated CoW space. This is queued
+ * based on the 'speculative_cow_prealloc_lifetime' tunable (30m by default).
+ * (We'll just piggyback on the post-EOF prealloc space workqueue.)
+ */
+STATIC void
+xfs_queue_cowblocks(
+	struct xfs_mount *mp)
+{
+	rcu_read_lock();
+	if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_COWBLOCKS_TAG))
+		queue_delayed_work(mp->m_eofblocks_workqueue,
+				   &mp->m_cowblocks_work,
+				   msecs_to_jiffies(xfs_cowb_secs * 1000));
+	rcu_read_unlock();
+}
+
+void
+xfs_cowblocks_worker(
+	struct work_struct *work)
+{
+	struct xfs_mount *mp = container_of(to_delayed_work(work),
+				struct xfs_mount, m_cowblocks_work);
+	xfs_icache_free_cowblocks(mp, NULL);
+	xfs_queue_cowblocks(mp);
+}
+
 int
 xfs_inode_ag_iterator(
 	struct xfs_mount	*mp,
@@ -1348,18 +1376,30 @@ xfs_inode_free_eofblocks(
 	return ret;
 }
 
-int
-xfs_icache_free_eofblocks(
+static int
+__xfs_icache_free_eofblocks(
 	struct xfs_mount	*mp,
-	struct xfs_eofblocks	*eofb)
+	struct xfs_eofblocks	*eofb,
+	int			(*execute)(struct xfs_inode *ip, int flags,
+					   void *args),
+	int			tag)
 {
 	int flags = SYNC_TRYLOCK;
 
 	if (eofb && (eofb->eof_flags & XFS_EOF_FLAGS_SYNC))
 		flags = SYNC_WAIT;
 
-	return xfs_inode_ag_iterator_tag(mp, xfs_inode_free_eofblocks, flags,
-					 eofb, XFS_ICI_EOFBLOCKS_TAG);
+	return xfs_inode_ag_iterator_tag(mp, execute, flags,
+					 eofb, tag);
+}
+
+int
+xfs_icache_free_eofblocks(
+	struct xfs_mount	*mp,
+	struct xfs_eofblocks	*eofb)
+{
+	return __xfs_icache_free_eofblocks(mp, eofb, xfs_inode_free_eofblocks,
+			XFS_ICI_EOFBLOCKS_TAG);
 }
 
 /*
@@ -1368,9 +1408,11 @@ xfs_icache_free_eofblocks(
  * failure. We make a best effort by including each quota under low free space
  * conditions (less than 1% free space) in the scan.
  */
-int
-xfs_inode_free_quota_eofblocks(
-	struct xfs_inode *ip)
+static int
+__xfs_inode_free_quota_eofblocks(
+	struct xfs_inode	*ip,
+	int			(*execute)(struct xfs_mount *mp,
+					   struct xfs_eofblocks	*eofb))
 {
 	int scan = 0;
 	struct xfs_eofblocks eofb = {0};
@@ -1406,14 +1448,25 @@ xfs_inode_free_quota_eofblocks(
 	}
 
 	if (scan)
-		xfs_icache_free_eofblocks(ip->i_mount, &eofb);
+		execute(ip->i_mount, &eofb);
 
 	return scan;
 }
 
-void
-xfs_inode_set_eofblocks_tag(
-	xfs_inode_t	*ip)
+int
+xfs_inode_free_quota_eofblocks(
+	struct xfs_inode *ip)
+{
+	return __xfs_inode_free_quota_eofblocks(ip, xfs_icache_free_eofblocks);
+}
+
+static void
+__xfs_inode_set_eofblocks_tag(
+	xfs_inode_t	*ip,
+	void		(*execute)(struct xfs_mount *mp),
+	void		(*set_tp)(struct xfs_mount *mp, xfs_agnumber_t agno,
+				  int error, unsigned long caller_ip),
+	int		tag)
 {
 	struct xfs_mount *mp = ip->i_mount;
 	struct xfs_perag *pag;
@@ -1431,26 +1484,22 @@ xfs_inode_set_eofblocks_tag(
 
 	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
 	spin_lock(&pag->pag_ici_lock);
-	trace_xfs_inode_set_eofblocks_tag(ip);
 
-	tagged = radix_tree_tagged(&pag->pag_ici_root,
-				   XFS_ICI_EOFBLOCKS_TAG);
+	tagged = radix_tree_tagged(&pag->pag_ici_root, tag);
 	radix_tree_tag_set(&pag->pag_ici_root,
-			   XFS_INO_TO_AGINO(ip->i_mount, ip->i_ino),
-			   XFS_ICI_EOFBLOCKS_TAG);
+			   XFS_INO_TO_AGINO(ip->i_mount, ip->i_ino), tag);
 	if (!tagged) {
 		/* propagate the eofblocks tag up into the perag radix tree */
 		spin_lock(&ip->i_mount->m_perag_lock);
 		radix_tree_tag_set(&ip->i_mount->m_perag_tree,
 				   XFS_INO_TO_AGNO(ip->i_mount, ip->i_ino),
-				   XFS_ICI_EOFBLOCKS_TAG);
+				   tag);
 		spin_unlock(&ip->i_mount->m_perag_lock);
 
 		/* kick off background trimming */
-		xfs_queue_eofblocks(ip->i_mount);
+		execute(ip->i_mount);
 
-		trace_xfs_perag_set_eofblocks(ip->i_mount, pag->pag_agno,
-					      -1, _RET_IP_);
+		set_tp(ip->i_mount, pag->pag_agno, -1, _RET_IP_);
 	}
 
 	spin_unlock(&pag->pag_ici_lock);
@@ -1458,9 +1507,22 @@ xfs_inode_set_eofblocks_tag(
 }
 
 void
-xfs_inode_clear_eofblocks_tag(
+xfs_inode_set_eofblocks_tag(
 	xfs_inode_t	*ip)
 {
+	trace_xfs_inode_set_eofblocks_tag(ip);
+	return __xfs_inode_set_eofblocks_tag(ip, xfs_queue_eofblocks,
+			trace_xfs_perag_set_eofblocks,
+			XFS_ICI_EOFBLOCKS_TAG);
+}
+
+static void
+__xfs_inode_clear_eofblocks_tag(
+	xfs_inode_t	*ip,
+	void		(*clear_tp)(struct xfs_mount *mp, xfs_agnumber_t agno,
+				    int error, unsigned long caller_ip),
+	int		tag)
+{
 	struct xfs_mount *mp = ip->i_mount;
 	struct xfs_perag *pag;
 
@@ -1470,23 +1532,141 @@ xfs_inode_clear_eofblocks_tag(
 
 	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
 	spin_lock(&pag->pag_ici_lock);
-	trace_xfs_inode_clear_eofblocks_tag(ip);
 
 	radix_tree_tag_clear(&pag->pag_ici_root,
-			     XFS_INO_TO_AGINO(ip->i_mount, ip->i_ino),
-			     XFS_ICI_EOFBLOCKS_TAG);
-	if (!radix_tree_tagged(&pag->pag_ici_root, XFS_ICI_EOFBLOCKS_TAG)) {
+			     XFS_INO_TO_AGINO(ip->i_mount, ip->i_ino), tag);
+	if (!radix_tree_tagged(&pag->pag_ici_root, tag)) {
 		/* clear the eofblocks tag from the perag radix tree */
 		spin_lock(&ip->i_mount->m_perag_lock);
 		radix_tree_tag_clear(&ip->i_mount->m_perag_tree,
 				     XFS_INO_TO_AGNO(ip->i_mount, ip->i_ino),
-				     XFS_ICI_EOFBLOCKS_TAG);
+				     tag);
 		spin_unlock(&ip->i_mount->m_perag_lock);
-		trace_xfs_perag_clear_eofblocks(ip->i_mount, pag->pag_agno,
-					       -1, _RET_IP_);
+		clear_tp(ip->i_mount, pag->pag_agno, -1, _RET_IP_);
 	}
 
 	spin_unlock(&pag->pag_ici_lock);
 	xfs_perag_put(pag);
 }
 
+void
+xfs_inode_clear_eofblocks_tag(
+	xfs_inode_t	*ip)
+{
+	trace_xfs_inode_clear_eofblocks_tag(ip);
+	return __xfs_inode_clear_eofblocks_tag(ip,
+			trace_xfs_perag_clear_eofblocks, XFS_ICI_EOFBLOCKS_TAG);
+}
+
+/*
+ * Automatic CoW Reservation Freeing
+ *
+ * These functions automatically garbage collect leftover CoW reservations
+ * that were made on behalf of a cowextsize hint when we start to run out
+ * of quota or when the reservations sit around for too long.  If the file
+ * has dirty pages or is undergoing writeback, its CoW reservations will
+ * be retained.
+ *
+ * The actual garbage collection piggybacks off the same code that runs
+ * the speculative EOF preallocation garbage collector.
+ */
+STATIC int
+xfs_inode_free_cowblocks(
+	struct xfs_inode	*ip,
+	int			flags,
+	void			*args)
+{
+	int ret;
+	struct xfs_eofblocks *eofb = args;
+	bool need_iolock = true;
+	int match;
+
+	ASSERT(!eofb || eofb->eof_scan_owner != 0);
+
+	if (!xfs_reflink_has_real_cow_blocks(ip)) {
+		trace_xfs_inode_free_cowblocks_invalid(ip);
+		xfs_inode_clear_cowblocks_tag(ip);
+		return 0;
+	}
+
+	/*
+	 * If the mapping is dirty or under writeback we cannot touch the
+	 * CoW fork.  Leave it alone if we're in the midst of a directio.
+	 */
+	if (mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_DIRTY) ||
+	    mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_WRITEBACK) ||
+	    atomic_read(&VFS_I(ip)->i_dio_count))
+		return 0;
+
+	if (eofb) {
+		if (eofb->eof_flags & XFS_EOF_FLAGS_UNION)
+			match = xfs_inode_match_id_union(ip, eofb);
+		else
+			match = xfs_inode_match_id(ip, eofb);
+		if (!match)
+			return 0;
+
+		/* skip the inode if the file size is too small */
+		if (eofb->eof_flags & XFS_EOF_FLAGS_MINFILESIZE &&
+		    XFS_ISIZE(ip) < eofb->eof_min_file_size)
+			return 0;
+
+		/*
+		 * A scan owner implies we already hold the iolock. Skip it in
+		 * xfs_free_eofblocks() to avoid deadlock. This also eliminates
+		 * the possibility of EAGAIN being returned.
+		 */
+		if (eofb->eof_scan_owner == ip->i_ino)
+			need_iolock = false;
+	}
+
+	/* Free the CoW blocks */
+	if (need_iolock) {
+		xfs_ilock(ip, XFS_IOLOCK_EXCL);
+		xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
+	}
+
+	ret = xfs_reflink_cancel_cow_range(ip, 0, NULLFILEOFF);
+
+	if (need_iolock) {
+		xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
+		xfs_iunlock(ip, XFS_IOLOCK_EXCL);
+	}
+
+	return ret;
+}
+
+int
+xfs_icache_free_cowblocks(
+	struct xfs_mount	*mp,
+	struct xfs_eofblocks	*eofb)
+{
+	return __xfs_icache_free_eofblocks(mp, eofb, xfs_inode_free_cowblocks,
+			XFS_ICI_COWBLOCKS_TAG);
+}
+
+int
+xfs_inode_free_quota_cowblocks(
+	struct xfs_inode *ip)
+{
+	return __xfs_inode_free_quota_eofblocks(ip, xfs_icache_free_cowblocks);
+}
+
+void
+xfs_inode_set_cowblocks_tag(
+	xfs_inode_t	*ip)
+{
+	trace_xfs_inode_set_cowblocks_tag(ip);
+	return __xfs_inode_set_eofblocks_tag(ip, xfs_queue_cowblocks,
+			trace_xfs_perag_set_cowblocks,
+			XFS_ICI_COWBLOCKS_TAG);
+}
+
+void
+xfs_inode_clear_cowblocks_tag(
+	xfs_inode_t	*ip)
+{
+	trace_xfs_inode_clear_cowblocks_tag(ip);
+	return __xfs_inode_clear_eofblocks_tag(ip,
+			trace_xfs_perag_clear_cowblocks, XFS_ICI_COWBLOCKS_TAG);
+}
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index 05bac99..a1e02f4 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -40,6 +40,7 @@ struct xfs_eofblocks {
 					   in xfs_inode_ag_iterator */
 #define XFS_ICI_RECLAIM_TAG	0	/* inode is to be reclaimed */
 #define XFS_ICI_EOFBLOCKS_TAG	1	/* inode has blocks beyond EOF */
+#define XFS_ICI_COWBLOCKS_TAG	2	/* inode can have cow blocks to gc */
 
 /*
  * Flags for xfs_iget()
@@ -70,6 +71,12 @@ int xfs_inode_free_quota_eofblocks(struct xfs_inode *ip);
 void xfs_eofblocks_worker(struct work_struct *);
 void xfs_queue_eofblocks(struct xfs_mount *);
 
+void xfs_inode_set_cowblocks_tag(struct xfs_inode *ip);
+void xfs_inode_clear_cowblocks_tag(struct xfs_inode *ip);
+int xfs_icache_free_cowblocks(struct xfs_mount *, struct xfs_eofblocks *);
+int xfs_inode_free_quota_cowblocks(struct xfs_inode *ip);
+void xfs_cowblocks_worker(struct work_struct *);
+
 int xfs_inode_ag_iterator(struct xfs_mount *mp,
 	int (*execute)(struct xfs_inode *ip, int flags, void *args),
 	int flags, void *args);
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index a4b061e..07c300b 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1629,8 +1629,10 @@ xfs_itruncate_extents(
 	/*
 	 * Clear the reflink flag if we truncated everything.
 	 */
-	if (ip->i_d.di_nblocks == 0 && xfs_is_reflink_inode(ip))
+	if (ip->i_d.di_nblocks == 0 && xfs_is_reflink_inode(ip)) {
 		ip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
+		xfs_inode_clear_cowblocks_tag(ip);
+	}
 
 	/*
 	 * Always re-log the inode so that our permanent transaction can keep
diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
index b8d64d5..68640fb 100644
--- a/fs/xfs/xfs_linux.h
+++ b/fs/xfs/xfs_linux.h
@@ -116,6 +116,7 @@ typedef __u32			xfs_nlink_t;
 #define xfs_inherit_nodefrag	xfs_params.inherit_nodfrg.val
 #define xfs_fstrm_centisecs	xfs_params.fstrm_timer.val
 #define xfs_eofb_secs		xfs_params.eofb_timer.val
+#define xfs_cowb_secs		xfs_params.cowb_timer.val
 
 #define current_cpu()		(raw_smp_processor_id())
 #define current_pid()		(current->pid)
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index b5da81d..afbfae6 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -1040,6 +1040,7 @@ xfs_unmountfs(
 	int			error;
 
 	cancel_delayed_work_sync(&mp->m_eofblocks_work);
+	cancel_delayed_work_sync(&mp->m_cowblocks_work);
 
 	xfs_fs_unreserve_ag_blocks(mp);
 	xfs_qm_unmount_quotas(mp);
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 0be14a7..819b80b 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -164,6 +164,8 @@ typedef struct xfs_mount {
 	struct delayed_work	m_reclaim_work;	/* background inode reclaim */
 	struct delayed_work	m_eofblocks_work; /* background eof blocks
 						     trimming */
+	struct delayed_work	m_cowblocks_work; /* background cow blocks
+						     trimming */
 	bool			m_update_sb;	/* sb needs update in mount */
 	int64_t			m_low_space[XFS_LOWSP_MAX];
 						/* low free space thresholds */
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index a0c5ada..c7a3895 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -321,6 +321,9 @@ __xfs_reflink_reserve_cow(
 		goto out_unlock;
 	}
 
+	if (end_fsb != orig_end_fsb)
+		xfs_inode_set_cowblocks_tag(ip);
+
 	trace_xfs_reflink_cow_alloc(ip, &got);
 done:
 	*offset_fsb = end_fsb;
@@ -1702,6 +1705,7 @@ xfs_reflink_try_clear_inode_flag(
 	/* Clear the inode flag. */
 	trace_xfs_reflink_unset_inode_flag(ip);
 	ip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
+	xfs_inode_clear_cowblocks_tag(ip);
 	xfs_trans_ijoin(tp, ip, 0);
 	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
 
@@ -1771,3 +1775,37 @@ xfs_reflink_unshare(
 	trace_xfs_reflink_unshare_error(ip, error, _RET_IP_);
 	return error;
 }
+
+/*
+ * Does this inode have any real CoW reservations?
+ */
+bool
+xfs_reflink_has_real_cow_blocks(
+	struct xfs_inode		*ip)
+{
+	struct xfs_bmbt_irec		irec;
+	struct xfs_ifork		*ifp;
+	struct xfs_bmbt_rec_host	*gotp;
+	xfs_extnum_t			idx;
+
+	if (!xfs_is_reflink_inode(ip))
+		return false;
+
+	/* Go find the old extent in the CoW fork. */
+	ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
+	gotp = xfs_iext_bno_to_ext(ifp, 0, &idx);
+	while (gotp) {
+		xfs_bmbt_get_all(gotp, &irec);
+
+		if (!isnullstartblock(irec.br_startblock))
+			return true;
+
+		/* Roll on... */
+		idx++;
+		if (idx >= ifp->if_bytes / sizeof(xfs_bmbt_rec_t))
+			break;
+		gotp = xfs_iext_get_ext(ifp, idx);
+	}
+
+	return false;
+}
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index ad4fc61..78760d6 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -51,4 +51,6 @@ extern int xfs_reflink_remap_range(struct xfs_inode *src, xfs_off_t srcoff,
 extern int xfs_reflink_unshare(struct xfs_inode *ip, xfs_off_t offset,
 		xfs_off_t len);
 
+extern bool xfs_reflink_has_real_cow_blocks(struct xfs_inode *ip);
+
 #endif /* __XFS_REFLINK_H */
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 875ab9f..2152ab6 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1521,6 +1521,7 @@ xfs_fs_fill_super(
 	atomic_set(&mp->m_active_trans, 0);
 	INIT_DELAYED_WORK(&mp->m_reclaim_work, xfs_reclaim_worker);
 	INIT_DELAYED_WORK(&mp->m_eofblocks_work, xfs_eofblocks_worker);
+	INIT_DELAYED_WORK(&mp->m_cowblocks_work, xfs_cowblocks_worker);
 	mp->m_kobj.kobject.kset = xfs_kset;
 
 	mp->m_super = sb;
diff --git a/fs/xfs/xfs_sysctl.c b/fs/xfs/xfs_sysctl.c
index aed74d3..afe1f66 100644
--- a/fs/xfs/xfs_sysctl.c
+++ b/fs/xfs/xfs_sysctl.c
@@ -184,6 +184,15 @@ static struct ctl_table xfs_table[] = {
 		.extra1		= &xfs_params.eofb_timer.min,
 		.extra2		= &xfs_params.eofb_timer.max,
 	},
+	{
+		.procname	= "speculative_cow_prealloc_lifetime",
+		.data		= &xfs_params.cowb_timer.val,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &xfs_params.cowb_timer.min,
+		.extra2		= &xfs_params.cowb_timer.max,
+	},
 	/* please keep this the last entry */
 #ifdef CONFIG_PROC_FS
 	{
diff --git a/fs/xfs/xfs_sysctl.h b/fs/xfs/xfs_sysctl.h
index ffef453..984a349 100644
--- a/fs/xfs/xfs_sysctl.h
+++ b/fs/xfs/xfs_sysctl.h
@@ -48,6 +48,7 @@ typedef struct xfs_param {
 	xfs_sysctl_val_t inherit_nodfrg;/* Inherit the "nodefrag" inode flag. */
 	xfs_sysctl_val_t fstrm_timer;	/* Filestream dir-AG assoc'n timeout. */
 	xfs_sysctl_val_t eofb_timer;	/* Interval between eofb scan wakeups */
+	xfs_sysctl_val_t cowb_timer;	/* Interval between cowb scan wakeups */
 } xfs_param_t;
 
 /*
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index ca0930b..d19d128 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -136,6 +136,8 @@ DEFINE_PERAG_REF_EVENT(xfs_perag_set_reclaim);
 DEFINE_PERAG_REF_EVENT(xfs_perag_clear_reclaim);
 DEFINE_PERAG_REF_EVENT(xfs_perag_set_eofblocks);
 DEFINE_PERAG_REF_EVENT(xfs_perag_clear_eofblocks);
+DEFINE_PERAG_REF_EVENT(xfs_perag_set_cowblocks);
+DEFINE_PERAG_REF_EVENT(xfs_perag_clear_cowblocks);
 
 DECLARE_EVENT_CLASS(xfs_ag_class,
 	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno),
@@ -687,6 +689,9 @@ DEFINE_INODE_EVENT(xfs_dquot_dqdetach);
 DEFINE_INODE_EVENT(xfs_inode_set_eofblocks_tag);
 DEFINE_INODE_EVENT(xfs_inode_clear_eofblocks_tag);
 DEFINE_INODE_EVENT(xfs_inode_free_eofblocks_invalid);
+DEFINE_INODE_EVENT(xfs_inode_set_cowblocks_tag);
+DEFINE_INODE_EVENT(xfs_inode_clear_cowblocks_tag);
+DEFINE_INODE_EVENT(xfs_inode_free_cowblocks_invalid);
 
 DEFINE_INODE_EVENT(xfs_filemap_fault);
 DEFINE_INODE_EVENT(xfs_filemap_pmd_fault);

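The patch above avoids duplicating the eofblocks machinery by turning the scan, tag-set, and tag-clear paths into generic helpers parameterized by an execute callback and a radix-tree tag, with thin public wrappers for each tag. A minimal sketch of that callback-plus-tag pattern in plain C (hypothetical names and a stub "scan", not the kernel API):

```c
#include <assert.h>

/* Stand-ins for the per-AG inode cache radix tree tags. */
#define ICI_EOFBLOCKS_TAG	1
#define ICI_COWBLOCKS_TAG	2

struct mount {
	int	last_tag_scanned;	/* records which tag was walked */
};

/* Pretend to walk all inodes carrying 'tag' in the per-AG trees. */
static int scan_tagged(struct mount *mp, int tag)
{
	mp->last_tag_scanned = tag;
	return 0;
}

/* Generic helper shared by both garbage collectors. */
static int free_blocks_generic(struct mount *mp,
			       int (*execute)(struct mount *mp, int tag),
			       int tag)
{
	return execute(mp, tag);
}

/* Thin wrappers, mirroring xfs_icache_free_{eof,cow}blocks. */
int free_eofblocks(struct mount *mp)
{
	return free_blocks_generic(mp, scan_tagged, ICI_EOFBLOCKS_TAG);
}

int free_cowblocks(struct mount *mp)
{
	return free_blocks_generic(mp, scan_tagged, ICI_COWBLOCKS_TAG);
}
```

The wrappers keep the external API unchanged while the shared helper does the tag-specific work once.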

^ permalink raw reply related	[flat|nested] 187+ messages in thread

* [PATCH 52/63] xfs: increase log reservations for reflink
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (50 preceding siblings ...)
  2016-09-30  3:11 ` [PATCH 51/63] xfs: garbage collect old cowextsz reservations Darrick J. Wong
@ 2016-09-30  3:11 ` Darrick J. Wong
  2016-09-30  8:23   ` Christoph Hellwig
  2016-09-30  3:11 ` [PATCH 53/63] xfs: add shared rmap map/unmap/convert log item types Darrick J. Wong
                   ` (11 subsequent siblings)
  63 siblings, 1 reply; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:11 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel

Increase the transaction log reservation roll counts to cover the
extra transaction rolling that happens at the end of copy-on-write
operations.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_trans_resv.c |   16 +++++++++++++---
 fs/xfs/libxfs/xfs_trans_resv.h |    2 ++
 2 files changed, 15 insertions(+), 3 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
index a59838f..b456cca 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.c
+++ b/fs/xfs/libxfs/xfs_trans_resv.c
@@ -812,11 +812,18 @@ xfs_trans_resv_calc(
 	 * require a permanent reservation on space.
 	 */
 	resp->tr_write.tr_logres = xfs_calc_write_reservation(mp);
-	resp->tr_write.tr_logcount = XFS_WRITE_LOG_COUNT;
+	if (xfs_sb_version_hasreflink(&mp->m_sb))
+		resp->tr_write.tr_logcount = XFS_WRITE_LOG_COUNT_REFLINK;
+	else
+		resp->tr_write.tr_logcount = XFS_WRITE_LOG_COUNT;
 	resp->tr_write.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
 
 	resp->tr_itruncate.tr_logres = xfs_calc_itruncate_reservation(mp);
-	resp->tr_itruncate.tr_logcount = XFS_ITRUNCATE_LOG_COUNT;
+	if (xfs_sb_version_hasreflink(&mp->m_sb))
+		resp->tr_itruncate.tr_logcount =
+				XFS_ITRUNCATE_LOG_COUNT_REFLINK;
+	else
+		resp->tr_itruncate.tr_logcount = XFS_ITRUNCATE_LOG_COUNT;
 	resp->tr_itruncate.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
 
 	resp->tr_rename.tr_logres = xfs_calc_rename_reservation(mp);
@@ -873,7 +880,10 @@ xfs_trans_resv_calc(
 	resp->tr_growrtalloc.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
 
 	resp->tr_qm_dqalloc.tr_logres = xfs_calc_qm_dqalloc_reservation(mp);
-	resp->tr_qm_dqalloc.tr_logcount = XFS_WRITE_LOG_COUNT;
+	if (xfs_sb_version_hasreflink(&mp->m_sb))
+		resp->tr_qm_dqalloc.tr_logcount = XFS_WRITE_LOG_COUNT_REFLINK;
+	else
+		resp->tr_qm_dqalloc.tr_logcount = XFS_WRITE_LOG_COUNT;
 	resp->tr_qm_dqalloc.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
 
 	/*
diff --git a/fs/xfs/libxfs/xfs_trans_resv.h b/fs/xfs/libxfs/xfs_trans_resv.h
index 36a1511..b7e5357 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.h
+++ b/fs/xfs/libxfs/xfs_trans_resv.h
@@ -87,6 +87,7 @@ struct xfs_trans_resv {
 #define	XFS_DEFAULT_LOG_COUNT		1
 #define	XFS_DEFAULT_PERM_LOG_COUNT	2
 #define	XFS_ITRUNCATE_LOG_COUNT		2
+#define	XFS_ITRUNCATE_LOG_COUNT_REFLINK	8
 #define XFS_INACTIVE_LOG_COUNT		2
 #define	XFS_CREATE_LOG_COUNT		2
 #define	XFS_CREATE_TMPFILE_LOG_COUNT	2
@@ -96,6 +97,7 @@ struct xfs_trans_resv {
 #define	XFS_LINK_LOG_COUNT		2
 #define	XFS_RENAME_LOG_COUNT		2
 #define	XFS_WRITE_LOG_COUNT		2
+#define	XFS_WRITE_LOG_COUNT_REFLINK	8
 #define	XFS_ADDAFORK_LOG_COUNT		2
 #define	XFS_ATTRINVAL_LOG_COUNT		1
 #define	XFS_ATTRSET_LOG_COUNT		3

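Note that the patch leaves the per-roll reservation size (tr_logres) alone and only raises the roll count (tr_logcount) from 2 to 8 when the reflink feature bit is set, which grows the total permanent reservation a transaction chain may hold. A rough illustration of that selection (simplified; not the kernel's full reservation math):

```c
#include <assert.h>
#include <stdbool.h>

#define XFS_WRITE_LOG_COUNT		2
#define XFS_WRITE_LOG_COUNT_REFLINK	8

/* Pick the roll count for a write transaction, as the patch does. */
static int write_log_count(bool has_reflink)
{
	return has_reflink ? XFS_WRITE_LOG_COUNT_REFLINK
			   : XFS_WRITE_LOG_COUNT;
}

/*
 * Upper bound on log space a permanent reservation may consume across
 * all of its rolls: one reservation unit per roll.
 */
static int total_reservation(int logres, int logcount)
{
	return logres * logcount;
}
```

So for the same tr_logres, a reflink filesystem reserves four times as much rolling headroom for write transactions.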


* [PATCH 53/63] xfs: add shared rmap map/unmap/convert log item types
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (51 preceding siblings ...)
  2016-09-30  3:11 ` [PATCH 52/63] xfs: increase log reservations for reflink Darrick J. Wong
@ 2016-09-30  3:11 ` Darrick J. Wong
  2016-09-30  8:24   ` Christoph Hellwig
  2016-09-30  3:11 ` [PATCH 54/63] xfs: use interval query for rmap alloc operations on shared files Darrick J. Wong
                   ` (10 subsequent siblings)
  63 siblings, 1 reply; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:11 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel

Wire up some rmap log redo item type codes to map, unmap, or convert
shared data block extents.  The actual log item recovery comes in a
later patch.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_log_format.h |    3 +++
 fs/xfs/xfs_rmap_item.c         |    3 +++
 fs/xfs/xfs_trans_rmap.c        |    9 +++++++++
 3 files changed, 15 insertions(+)


diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
index 364ce6f..083cdd6 100644
--- a/fs/xfs/libxfs/xfs_log_format.h
+++ b/fs/xfs/libxfs/xfs_log_format.h
@@ -635,8 +635,11 @@ struct xfs_map_extent {
 
 /* rmap me_flags: upper bits are flags, lower byte is type code */
 #define XFS_RMAP_EXTENT_MAP		1
+#define XFS_RMAP_EXTENT_MAP_SHARED	2
 #define XFS_RMAP_EXTENT_UNMAP		3
+#define XFS_RMAP_EXTENT_UNMAP_SHARED	4
 #define XFS_RMAP_EXTENT_CONVERT		5
+#define XFS_RMAP_EXTENT_CONVERT_SHARED	6
 #define XFS_RMAP_EXTENT_ALLOC		7
 #define XFS_RMAP_EXTENT_FREE		8
 #define XFS_RMAP_EXTENT_TYPE_MASK	0xFF
diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c
index 0432a45..19d817e 100644
--- a/fs/xfs/xfs_rmap_item.c
+++ b/fs/xfs/xfs_rmap_item.c
@@ -441,8 +441,11 @@ xfs_rui_recover(
 				   XFS_FSB_TO_DADDR(mp, rmap->me_startblock));
 		switch (rmap->me_flags & XFS_RMAP_EXTENT_TYPE_MASK) {
 		case XFS_RMAP_EXTENT_MAP:
+		case XFS_RMAP_EXTENT_MAP_SHARED:
 		case XFS_RMAP_EXTENT_UNMAP:
+		case XFS_RMAP_EXTENT_UNMAP_SHARED:
 		case XFS_RMAP_EXTENT_CONVERT:
+		case XFS_RMAP_EXTENT_CONVERT_SHARED:
 		case XFS_RMAP_EXTENT_ALLOC:
 		case XFS_RMAP_EXTENT_FREE:
 			op_ok = true;
diff --git a/fs/xfs/xfs_trans_rmap.c b/fs/xfs/xfs_trans_rmap.c
index 5a50ef8..9ead064 100644
--- a/fs/xfs/xfs_trans_rmap.c
+++ b/fs/xfs/xfs_trans_rmap.c
@@ -48,12 +48,21 @@ xfs_trans_set_rmap_flags(
 	case XFS_RMAP_MAP:
 		rmap->me_flags |= XFS_RMAP_EXTENT_MAP;
 		break;
+	case XFS_RMAP_MAP_SHARED:
+		rmap->me_flags |= XFS_RMAP_EXTENT_MAP_SHARED;
+		break;
 	case XFS_RMAP_UNMAP:
 		rmap->me_flags |= XFS_RMAP_EXTENT_UNMAP;
 		break;
+	case XFS_RMAP_UNMAP_SHARED:
+		rmap->me_flags |= XFS_RMAP_EXTENT_UNMAP_SHARED;
+		break;
 	case XFS_RMAP_CONVERT:
 		rmap->me_flags |= XFS_RMAP_EXTENT_CONVERT;
 		break;
+	case XFS_RMAP_CONVERT_SHARED:
+		rmap->me_flags |= XFS_RMAP_EXTENT_CONVERT_SHARED;
+		break;
 	case XFS_RMAP_ALLOC:
 		rmap->me_flags |= XFS_RMAP_EXTENT_ALLOC;
 		break;

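As the comment in xfs_log_format.h says, me_flags keeps the extent type code in the low byte and flag bits above it, so recovery can mask with XFS_RMAP_EXTENT_TYPE_MASK before dispatching on the type, as xfs_rui_recover() does. A small sketch of that decode (type codes copied from the patch; the extra flag bit is hypothetical):

```c
#include <assert.h>
#include <stdint.h>

#define XFS_RMAP_EXTENT_MAP		1
#define XFS_RMAP_EXTENT_MAP_SHARED	2
#define XFS_RMAP_EXTENT_UNMAP		3
#define XFS_RMAP_EXTENT_UNMAP_SHARED	4
#define XFS_RMAP_EXTENT_TYPE_MASK	0xFF

/* Hypothetical flag bit living above the type byte. */
#define RMAP_EXTENT_EXAMPLE_FLAG	(1U << 31)

/* Extract the type code regardless of which flag bits are set. */
static uint32_t rmap_extent_type(uint32_t me_flags)
{
	return me_flags & XFS_RMAP_EXTENT_TYPE_MASK;
}
```

Because the mask isolates the low byte, adding the new *_SHARED codes needs no change to the decoding logic, only new cases in the dispatch switch.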


* [PATCH 54/63] xfs: use interval query for rmap alloc operations on shared files
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (52 preceding siblings ...)
  2016-09-30  3:11 ` [PATCH 53/63] xfs: add shared rmap map/unmap/convert log item types Darrick J. Wong
@ 2016-09-30  3:11 ` Darrick J. Wong
  2016-09-30  8:24   ` Christoph Hellwig
  2016-09-30  3:11 ` [PATCH 55/63] xfs: convert unwritten status of reverse mappings for " Darrick J. Wong
                   ` (9 subsequent siblings)
  63 siblings, 1 reply; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:11 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel

When it's possible for reverse mappings to overlap (data fork extents
of files on reflink filesystems), use the interval query function to
find the left neighbor of an extent we're trying to add, and be
careful to use the lookup functions when updating neighbors or adding
new extents.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
v2: xfs_rmap_find_left_neighbor() needs to calculate the high key of a
query range correctly.  We can also add a few shortcuts -- there are
no left neighbors of a query at offset zero.
---
 fs/xfs/libxfs/xfs_rmap.c |  514 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_rmap.h |    7 +
 fs/xfs/xfs_rmap_item.c   |    6 +
 fs/xfs/xfs_trace.h       |    5 
 4 files changed, 530 insertions(+), 2 deletions(-)

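A left neighbor must belong to the same owner and abut the new extent both physically and logically: it ends at the block before bno, and its last mapped offset is offset - 1. The adjacency test that xfs_rmap_find_left_neighbor_helper performs in the diff below can be sketched like this (simplified record type, hypothetical names, and without the non-inode-owner and BMBT special cases):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for struct xfs_rmap_irec. */
struct rmap_rec {
	uint64_t	startblock;	/* first physical block */
	uint64_t	blockcount;	/* extent length */
	uint64_t	owner;		/* owning inode */
	uint64_t	offset;		/* first logical block */
};

/*
 * Is 'rec' the left neighbor of an extent starting at (bno, offset)
 * for 'owner'?  It must end at bno physically and at offset - 1
 * logically, i.e. rec->offset + rec->blockcount == offset.
 */
static bool is_left_neighbor(const struct rmap_rec *rec,
			     uint64_t bno, uint64_t owner, uint64_t offset)
{
	if (rec->owner != owner)
		return false;
	if (rec->startblock + rec->blockcount != bno)
		return false;
	return rec->offset + rec->blockcount == offset;
}
```

This is why the real helper builds its query high key as (bno - 1, owner, offset - 1) and aborts the range query as soon as one candidate passes the check.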

diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
index 1c40b85..bb5e2f8 100644
--- a/fs/xfs/libxfs/xfs_rmap.c
+++ b/fs/xfs/libxfs/xfs_rmap.c
@@ -148,6 +148,37 @@ xfs_rmap_insert(
 	return error;
 }
 
+STATIC int
+xfs_rmap_delete(
+	struct xfs_btree_cur	*rcur,
+	xfs_agblock_t		agbno,
+	xfs_extlen_t		len,
+	uint64_t		owner,
+	uint64_t		offset,
+	unsigned int		flags)
+{
+	int			i;
+	int			error;
+
+	trace_xfs_rmap_delete(rcur->bc_mp, rcur->bc_private.a.agno, agbno,
+			len, owner, offset, flags);
+
+	error = xfs_rmap_lookup_eq(rcur, agbno, len, owner, offset, flags, &i);
+	if (error)
+		goto done;
+	XFS_WANT_CORRUPTED_GOTO(rcur->bc_mp, i == 1, done);
+
+	error = xfs_btree_delete(rcur, &i);
+	if (error)
+		goto done;
+	XFS_WANT_CORRUPTED_GOTO(rcur->bc_mp, i == 1, done);
+done:
+	if (error)
+		trace_xfs_rmap_delete_error(rcur->bc_mp,
+				rcur->bc_private.a.agno, error, _RET_IP_);
+	return error;
+}
+
 static int
 xfs_rmap_btrec_to_irec(
 	union xfs_btree_rec	*rec,
@@ -180,6 +211,160 @@ xfs_rmap_get_rec(
 	return xfs_rmap_btrec_to_irec(rec, irec);
 }
 
+struct xfs_find_left_neighbor_info {
+	struct xfs_rmap_irec	high;
+	struct xfs_rmap_irec	*irec;
+	int			*stat;
+};
+
+/* For each rmap given, figure out if it matches the key we want. */
+STATIC int
+xfs_rmap_find_left_neighbor_helper(
+	struct xfs_btree_cur	*cur,
+	struct xfs_rmap_irec	*rec,
+	void			*priv)
+{
+	struct xfs_find_left_neighbor_info	*info = priv;
+
+	trace_xfs_rmap_find_left_neighbor_candidate(cur->bc_mp,
+			cur->bc_private.a.agno, rec->rm_startblock,
+			rec->rm_blockcount, rec->rm_owner, rec->rm_offset,
+			rec->rm_flags);
+
+	if (rec->rm_owner != info->high.rm_owner)
+		return XFS_BTREE_QUERY_RANGE_CONTINUE;
+	if (!XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) &&
+	    !(rec->rm_flags & XFS_RMAP_BMBT_BLOCK) &&
+	    rec->rm_offset + rec->rm_blockcount - 1 != info->high.rm_offset)
+		return XFS_BTREE_QUERY_RANGE_CONTINUE;
+
+	*info->irec = *rec;
+	*info->stat = 1;
+	return XFS_BTREE_QUERY_RANGE_ABORT;
+}
+
+/*
+ * Find the record to the left of the given extent, being careful only to
+ * return a match with the same owner and adjacent physical and logical
+ * block ranges.
+ */
+int
+xfs_rmap_find_left_neighbor(
+	struct xfs_btree_cur	*cur,
+	xfs_agblock_t		bno,
+	uint64_t		owner,
+	uint64_t		offset,
+	unsigned int		flags,
+	struct xfs_rmap_irec	*irec,
+	int			*stat)
+{
+	struct xfs_find_left_neighbor_info	info;
+	int			error;
+
+	*stat = 0;
+	if (bno == 0)
+		return 0;
+	info.high.rm_startblock = bno - 1;
+	info.high.rm_owner = owner;
+	if (!XFS_RMAP_NON_INODE_OWNER(owner) &&
+	    !(flags & XFS_RMAP_BMBT_BLOCK)) {
+		if (offset == 0)
+			return 0;
+		info.high.rm_offset = offset - 1;
+	} else
+		info.high.rm_offset = 0;
+	info.high.rm_flags = flags;
+	info.high.rm_blockcount = 0;
+	info.irec = irec;
+	info.stat = stat;
+
+	trace_xfs_rmap_find_left_neighbor_query(cur->bc_mp,
+			cur->bc_private.a.agno, bno, 0, owner, offset, flags);
+
+	error = xfs_rmap_query_range(cur, &info.high, &info.high,
+			xfs_rmap_find_left_neighbor_helper, &info);
+	if (error == XFS_BTREE_QUERY_RANGE_ABORT)
+		error = 0;
+	if (*stat)
+		trace_xfs_rmap_find_left_neighbor_result(cur->bc_mp,
+				cur->bc_private.a.agno, irec->rm_startblock,
+				irec->rm_blockcount, irec->rm_owner,
+				irec->rm_offset, irec->rm_flags);
+	return error;
+}
+
+/* For each rmap given, figure out if it matches the key we want. */
+STATIC int
+xfs_rmap_lookup_le_range_helper(
+	struct xfs_btree_cur	*cur,
+	struct xfs_rmap_irec	*rec,
+	void			*priv)
+{
+	struct xfs_find_left_neighbor_info	*info = priv;
+
+	trace_xfs_rmap_lookup_le_range_candidate(cur->bc_mp,
+			cur->bc_private.a.agno, rec->rm_startblock,
+			rec->rm_blockcount, rec->rm_owner, rec->rm_offset,
+			rec->rm_flags);
+
+	if (rec->rm_owner != info->high.rm_owner)
+		return XFS_BTREE_QUERY_RANGE_CONTINUE;
+	if (!XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) &&
+	    !(rec->rm_flags & XFS_RMAP_BMBT_BLOCK) &&
+	    (rec->rm_offset > info->high.rm_offset ||
+	     rec->rm_offset + rec->rm_blockcount <= info->high.rm_offset))
+		return XFS_BTREE_QUERY_RANGE_CONTINUE;
+
+	*info->irec = *rec;
+	*info->stat = 1;
+	return XFS_BTREE_QUERY_RANGE_ABORT;
+}
+
+/*
+ * Find the record to the left of the given extent, being careful only to
+ * return a match with the same owner and overlapping physical and logical
+ * block ranges.  This is the overlapping-interval version of
+ * xfs_rmap_lookup_le.
+ */
+int
+xfs_rmap_lookup_le_range(
+	struct xfs_btree_cur	*cur,
+	xfs_agblock_t		bno,
+	uint64_t		owner,
+	uint64_t		offset,
+	unsigned int		flags,
+	struct xfs_rmap_irec	*irec,
+	int			*stat)
+{
+	struct xfs_find_left_neighbor_info	info;
+	int			error;
+
+	info.high.rm_startblock = bno;
+	info.high.rm_owner = owner;
+	if (!XFS_RMAP_NON_INODE_OWNER(owner) && !(flags & XFS_RMAP_BMBT_BLOCK))
+		info.high.rm_offset = offset;
+	else
+		info.high.rm_offset = 0;
+	info.high.rm_flags = flags;
+	info.high.rm_blockcount = 0;
+	*stat = 0;
+	info.irec = irec;
+	info.stat = stat;
+
+	trace_xfs_rmap_lookup_le_range(cur->bc_mp,
+			cur->bc_private.a.agno, bno, 0, owner, offset, flags);
+	error = xfs_rmap_query_range(cur, &info.high, &info.high,
+			xfs_rmap_lookup_le_range_helper, &info);
+	if (error == XFS_BTREE_QUERY_RANGE_ABORT)
+		error = 0;
+	if (*stat)
+		trace_xfs_rmap_lookup_le_range_result(cur->bc_mp,
+				cur->bc_private.a.agno, irec->rm_startblock,
+				irec->rm_blockcount, irec->rm_owner,
+				irec->rm_offset, irec->rm_flags);
+	return error;
+}
+
 /*
  * Find the extent in the rmap btree and remove it.
  *
@@ -1098,6 +1283,321 @@ xfs_rmap_convert(
 #undef	RIGHT
 #undef	PREV
 
+/*
+ * Find an extent in the rmap btree and unmap it.  For rmap extent types that
+ * can overlap (data fork rmaps on reflink filesystems) we must be careful
+ * that the prev/next records in the btree might belong to another owner.
+ * Therefore we must use delete+insert to alter any of the key fields.
+ *
+ * For every other situation there can only be one owner for a given extent,
+ * so we can call the regular _free function.
+ */
+STATIC int
+xfs_rmap_unmap_shared(
+	struct xfs_btree_cur	*cur,
+	xfs_agblock_t		bno,
+	xfs_extlen_t		len,
+	bool			unwritten,
+	struct xfs_owner_info	*oinfo)
+{
+	struct xfs_mount	*mp = cur->bc_mp;
+	struct xfs_rmap_irec	ltrec;
+	uint64_t		ltoff;
+	int			error = 0;
+	int			i;
+	uint64_t		owner;
+	uint64_t		offset;
+	unsigned int		flags;
+
+	xfs_owner_info_unpack(oinfo, &owner, &offset, &flags);
+	if (unwritten)
+		flags |= XFS_RMAP_UNWRITTEN;
+	trace_xfs_rmap_unmap(mp, cur->bc_private.a.agno, bno, len,
+			unwritten, oinfo);
+
+	/*
+	 * We should always have a left record because there's a static record
+	 * for the AG headers at rm_startblock == 0 created by mkfs/growfs that
+	 * will not ever be removed from the tree.
+	 */
+	error = xfs_rmap_lookup_le_range(cur, bno, owner, offset, flags,
+			&ltrec, &i);
+	if (error)
+		goto out_error;
+	XFS_WANT_CORRUPTED_GOTO(mp, i == 1, out_error);
+	ltoff = ltrec.rm_offset;
+
+	/* Make sure the extent we found covers the entire freeing range. */
+	XFS_WANT_CORRUPTED_GOTO(mp, ltrec.rm_startblock <= bno &&
+		ltrec.rm_startblock + ltrec.rm_blockcount >=
+		bno + len, out_error);
+
+	/* Make sure the owner matches what we expect to find in the tree. */
+	XFS_WANT_CORRUPTED_GOTO(mp, owner == ltrec.rm_owner, out_error);
+
+	/* Make sure the unwritten flag matches. */
+	XFS_WANT_CORRUPTED_GOTO(mp, (flags & XFS_RMAP_UNWRITTEN) ==
+			(ltrec.rm_flags & XFS_RMAP_UNWRITTEN), out_error);
+
+	/* Check the offset. */
+	XFS_WANT_CORRUPTED_GOTO(mp, ltrec.rm_offset <= offset, out_error);
+	XFS_WANT_CORRUPTED_GOTO(mp, offset <= ltoff + ltrec.rm_blockcount,
+			out_error);
+
+	if (ltrec.rm_startblock == bno && ltrec.rm_blockcount == len) {
+		/* Exact match, simply remove the record from rmap tree. */
+		error = xfs_rmap_delete(cur, ltrec.rm_startblock,
+				ltrec.rm_blockcount, ltrec.rm_owner,
+				ltrec.rm_offset, ltrec.rm_flags);
+		if (error)
+			goto out_error;
+	} else if (ltrec.rm_startblock == bno) {
+		/*
+		 * Overlap left hand side of extent: move the start, trim the
+		 * length and update the current record.
+		 *
+		 *       ltbno                ltlen
+		 * Orig:    |oooooooooooooooooooo|
+		 * Freeing: |fffffffff|
+		 * Result:            |rrrrrrrrrr|
+		 *         bno       len
+		 */
+
+		/* Delete prev rmap. */
+		error = xfs_rmap_delete(cur, ltrec.rm_startblock,
+				ltrec.rm_blockcount, ltrec.rm_owner,
+				ltrec.rm_offset, ltrec.rm_flags);
+		if (error)
+			goto out_error;
+
+		/* Add an rmap at the new offset. */
+		ltrec.rm_startblock += len;
+		ltrec.rm_blockcount -= len;
+		ltrec.rm_offset += len;
+		error = xfs_rmap_insert(cur, ltrec.rm_startblock,
+				ltrec.rm_blockcount, ltrec.rm_owner,
+				ltrec.rm_offset, ltrec.rm_flags);
+		if (error)
+			goto out_error;
+	} else if (ltrec.rm_startblock + ltrec.rm_blockcount == bno + len) {
+		/*
+		 * Overlap right hand side of extent: trim the length and
+		 * update the current record.
+		 *
+		 *       ltbno                ltlen
+		 * Orig:    |oooooooooooooooooooo|
+		 * Freeing:            |fffffffff|
+		 * Result:  |rrrrrrrrrr|
+		 *                    bno       len
+		 */
+		error = xfs_rmap_lookup_eq(cur, ltrec.rm_startblock,
+				ltrec.rm_blockcount, ltrec.rm_owner,
+				ltrec.rm_offset, ltrec.rm_flags, &i);
+		if (error)
+			goto out_error;
+		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, out_error);
+		ltrec.rm_blockcount -= len;
+		error = xfs_rmap_update(cur, &ltrec);
+		if (error)
+			goto out_error;
+	} else {
+		/*
+		 * Overlap middle of extent: trim the length of the existing
+		 * record to the length of the new left-extent size, increment
+		 * the insertion position so we can insert a new record
+		 * containing the remaining right-extent space.
+		 *
+		 *       ltbno                ltlen
+		 * Orig:    |oooooooooooooooooooo|
+		 * Freeing:       |fffffffff|
+		 * Result:  |rrrrr|         |rrrr|
+		 *               bno       len
+		 */
+		xfs_extlen_t	orig_len = ltrec.rm_blockcount;
+
+		/* Shrink the left side of the rmap */
+		error = xfs_rmap_lookup_eq(cur, ltrec.rm_startblock,
+				ltrec.rm_blockcount, ltrec.rm_owner,
+				ltrec.rm_offset, ltrec.rm_flags, &i);
+		if (error)
+			goto out_error;
+		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, out_error);
+		ltrec.rm_blockcount = bno - ltrec.rm_startblock;
+		error = xfs_rmap_update(cur, &ltrec);
+		if (error)
+			goto out_error;
+
+		/* Add an rmap at the new offset */
+		error = xfs_rmap_insert(cur, bno + len,
+				orig_len - len - ltrec.rm_blockcount,
+				ltrec.rm_owner, offset + len,
+				ltrec.rm_flags);
+		if (error)
+			goto out_error;
+	}
+
+	trace_xfs_rmap_unmap_done(mp, cur->bc_private.a.agno, bno, len,
+			unwritten, oinfo);
+out_error:
+	if (error)
+		trace_xfs_rmap_unmap_error(cur->bc_mp,
+				cur->bc_private.a.agno, error, _RET_IP_);
+	return error;
+}
+
+/*
+ * Find an extent in the rmap btree and map it.  For rmap extent types that
+ * can overlap (data fork rmaps on reflink filesystems) we must be careful
+ * that the prev/next records in the btree might belong to another owner.
+ * Therefore we must use delete+insert to alter any of the key fields.
+ *
+ * For every other situation there can only be one owner for a given extent,
+ * so we can call the regular _alloc function.
+ */
+STATIC int
+xfs_rmap_map_shared(
+	struct xfs_btree_cur	*cur,
+	xfs_agblock_t		bno,
+	xfs_extlen_t		len,
+	bool			unwritten,
+	struct xfs_owner_info	*oinfo)
+{
+	struct xfs_mount	*mp = cur->bc_mp;
+	struct xfs_rmap_irec	ltrec;
+	struct xfs_rmap_irec	gtrec;
+	int			have_gt;
+	int			have_lt;
+	int			error = 0;
+	int			i;
+	uint64_t		owner;
+	uint64_t		offset;
+	unsigned int		flags = 0;
+
+	xfs_owner_info_unpack(oinfo, &owner, &offset, &flags);
+	if (unwritten)
+		flags |= XFS_RMAP_UNWRITTEN;
+	trace_xfs_rmap_map(mp, cur->bc_private.a.agno, bno, len,
+			unwritten, oinfo);
+
+	/* Is there a left record that abuts our range? */
+	error = xfs_rmap_find_left_neighbor(cur, bno, owner, offset, flags,
+			&ltrec, &have_lt);
+	if (error)
+		goto out_error;
+	if (have_lt &&
+	    !xfs_rmap_is_mergeable(&ltrec, owner, flags))
+		have_lt = 0;
+
+	/* Is there a right record that abuts our range? */
+	error = xfs_rmap_lookup_eq(cur, bno + len, len, owner, offset + len,
+			flags, &have_gt);
+	if (error)
+		goto out_error;
+	if (have_gt) {
+		error = xfs_rmap_get_rec(cur, &gtrec, &have_gt);
+		if (error)
+			goto out_error;
+		XFS_WANT_CORRUPTED_GOTO(mp, have_gt == 1, out_error);
+		trace_xfs_rmap_find_right_neighbor_result(cur->bc_mp,
+			cur->bc_private.a.agno, gtrec.rm_startblock,
+			gtrec.rm_blockcount, gtrec.rm_owner,
+			gtrec.rm_offset, gtrec.rm_flags);
+
+		if (!xfs_rmap_is_mergeable(&gtrec, owner, flags))
+			have_gt = 0;
+	}
+
+	if (have_lt &&
+	    ltrec.rm_startblock + ltrec.rm_blockcount == bno &&
+	    ltrec.rm_offset + ltrec.rm_blockcount == offset) {
+		/*
+		 * Left edge contiguous, merge into left record.
+		 *
+		 *       ltbno     ltlen
+		 * orig:   |ooooooooo|
+		 * adding:           |aaaaaaaaa|
+		 * result: |rrrrrrrrrrrrrrrrrrr|
+		 *                  bno       len
+		 */
+		ltrec.rm_blockcount += len;
+		if (have_gt &&
+		    bno + len == gtrec.rm_startblock &&
+		    offset + len == gtrec.rm_offset) {
+			/*
+			 * Right edge also contiguous, delete right record
+			 * and merge into left record.
+			 *
+			 *       ltbno     ltlen    gtbno     gtlen
+			 * orig:   |ooooooooo|         |ooooooooo|
+			 * adding:           |aaaaaaaaa|
+			 * result: |rrrrrrrrrrrrrrrrrrrrrrrrrrrrr|
+			 */
+			ltrec.rm_blockcount += gtrec.rm_blockcount;
+			error = xfs_rmap_delete(cur, gtrec.rm_startblock,
+					gtrec.rm_blockcount, gtrec.rm_owner,
+					gtrec.rm_offset, gtrec.rm_flags);
+			if (error)
+				goto out_error;
+		}
+
+		/* Point the cursor back to the left record and update. */
+		error = xfs_rmap_lookup_eq(cur, ltrec.rm_startblock,
+				ltrec.rm_blockcount, ltrec.rm_owner,
+				ltrec.rm_offset, ltrec.rm_flags, &i);
+		if (error)
+			goto out_error;
+		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, out_error);
+
+		error = xfs_rmap_update(cur, &ltrec);
+		if (error)
+			goto out_error;
+	} else if (have_gt &&
+		   bno + len == gtrec.rm_startblock &&
+		   offset + len == gtrec.rm_offset) {
+		/*
+		 * Right edge contiguous, merge into right record.
+		 *
+		 *                 gtbno     gtlen
+		 * Orig:             |ooooooooo|
+		 * adding: |aaaaaaaaa|
+		 * Result: |rrrrrrrrrrrrrrrrrrr|
+		 *        bno       len
+		 */
+		/* Delete the old record. */
+		error = xfs_rmap_delete(cur, gtrec.rm_startblock,
+				gtrec.rm_blockcount, gtrec.rm_owner,
+				gtrec.rm_offset, gtrec.rm_flags);
+		if (error)
+			goto out_error;
+
+		/* Move the start and re-add it. */
+		gtrec.rm_startblock = bno;
+		gtrec.rm_blockcount += len;
+		gtrec.rm_offset = offset;
+		error = xfs_rmap_insert(cur, gtrec.rm_startblock,
+				gtrec.rm_blockcount, gtrec.rm_owner,
+				gtrec.rm_offset, gtrec.rm_flags);
+		if (error)
+			goto out_error;
+	} else {
+		/*
+		 * No contiguous edge with identical owner, insert
+		 * new record at current cursor position.
+		 */
+		error = xfs_rmap_insert(cur, bno, len, owner, offset, flags);
+		if (error)
+			goto out_error;
+	}
+
+	trace_xfs_rmap_map_done(mp, cur->bc_private.a.agno, bno, len,
+			unwritten, oinfo);
+out_error:
+	if (error)
+		trace_xfs_rmap_map_error(cur->bc_mp,
+				cur->bc_private.a.agno, error, _RET_IP_);
+	return error;
+}
+
 struct xfs_rmap_query_range_info {
 	xfs_rmap_query_range_fn	fn;
 	void				*priv;
@@ -1237,11 +1737,19 @@ xfs_rmap_finish_one(
 	case XFS_RMAP_MAP:
 		error = xfs_rmap_map(rcur, bno, blockcount, unwritten, &oinfo);
 		break;
+	case XFS_RMAP_MAP_SHARED:
+		error = xfs_rmap_map_shared(rcur, bno, blockcount, unwritten,
+				&oinfo);
+		break;
 	case XFS_RMAP_FREE:
 	case XFS_RMAP_UNMAP:
 		error = xfs_rmap_unmap(rcur, bno, blockcount, unwritten,
 				&oinfo);
 		break;
+	case XFS_RMAP_UNMAP_SHARED:
+		error = xfs_rmap_unmap_shared(rcur, bno, blockcount, unwritten,
+				&oinfo);
+		break;
 	case XFS_RMAP_CONVERT:
 		error = xfs_rmap_convert(rcur, bno, blockcount, !unwritten,
 				&oinfo);
@@ -1315,7 +1823,8 @@ xfs_rmap_map_extent(
 	if (!xfs_rmap_update_is_needed(mp, whichfork))
 		return 0;
 
-	return __xfs_rmap_add(mp, dfops, XFS_RMAP_MAP, ip->i_ino,
+	return __xfs_rmap_add(mp, dfops, xfs_is_reflink_inode(ip) ?
+			XFS_RMAP_MAP_SHARED : XFS_RMAP_MAP, ip->i_ino,
 			whichfork, PREV);
 }
 
@@ -1331,7 +1840,8 @@ xfs_rmap_unmap_extent(
 	if (!xfs_rmap_update_is_needed(mp, whichfork))
 		return 0;
 
-	return __xfs_rmap_add(mp, dfops, XFS_RMAP_UNMAP, ip->i_ino,
+	return __xfs_rmap_add(mp, dfops, xfs_is_reflink_inode(ip) ?
+			XFS_RMAP_UNMAP_SHARED : XFS_RMAP_UNMAP, ip->i_ino,
 			whichfork, PREV);
 }
 
diff --git a/fs/xfs/libxfs/xfs_rmap.h b/fs/xfs/libxfs/xfs_rmap.h
index 71cf99a..7899305 100644
--- a/fs/xfs/libxfs/xfs_rmap.h
+++ b/fs/xfs/libxfs/xfs_rmap.h
@@ -206,4 +206,11 @@ int xfs_rmap_finish_one(struct xfs_trans *tp, enum xfs_rmap_intent_type type,
 		xfs_fsblock_t startblock, xfs_filblks_t blockcount,
 		xfs_exntst_t state, struct xfs_btree_cur **pcur);
 
+int xfs_rmap_find_left_neighbor(struct xfs_btree_cur *cur, xfs_agblock_t bno,
+		uint64_t owner, uint64_t offset, unsigned int flags,
+		struct xfs_rmap_irec *irec, int	*stat);
+int xfs_rmap_lookup_le_range(struct xfs_btree_cur *cur, xfs_agblock_t bno,
+		uint64_t owner, uint64_t offset, unsigned int flags,
+		struct xfs_rmap_irec *irec, int	*stat);
+
 #endif	/* __XFS_RMAP_H__ */
diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c
index 19d817e..3b8742e 100644
--- a/fs/xfs/xfs_rmap_item.c
+++ b/fs/xfs/xfs_rmap_item.c
@@ -484,9 +484,15 @@ xfs_rui_recover(
 		case XFS_RMAP_EXTENT_MAP:
 			type = XFS_RMAP_MAP;
 			break;
+		case XFS_RMAP_EXTENT_MAP_SHARED:
+			type = XFS_RMAP_MAP_SHARED;
+			break;
 		case XFS_RMAP_EXTENT_UNMAP:
 			type = XFS_RMAP_UNMAP;
 			break;
+		case XFS_RMAP_EXTENT_UNMAP_SHARED:
+			type = XFS_RMAP_UNMAP_SHARED;
+			break;
 		case XFS_RMAP_EXTENT_CONVERT:
 			type = XFS_RMAP_CONVERT;
 			break;
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index d19d128..30778ad 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -2588,6 +2588,11 @@ DEFINE_RMAPBT_EVENT(xfs_rmap_delete);
 DEFINE_AG_ERROR_EVENT(xfs_rmap_insert_error);
 DEFINE_AG_ERROR_EVENT(xfs_rmap_delete_error);
 DEFINE_AG_ERROR_EVENT(xfs_rmap_update_error);
+
+DEFINE_RMAPBT_EVENT(xfs_rmap_find_left_neighbor_candidate);
+DEFINE_RMAPBT_EVENT(xfs_rmap_find_left_neighbor_query);
+DEFINE_RMAPBT_EVENT(xfs_rmap_lookup_le_range_candidate);
+DEFINE_RMAPBT_EVENT(xfs_rmap_lookup_le_range);
 DEFINE_RMAPBT_EVENT(xfs_rmap_lookup_le_range_result);
 DEFINE_RMAPBT_EVENT(xfs_rmap_find_right_neighbor_result);
 DEFINE_RMAPBT_EVENT(xfs_rmap_find_left_neighbor_result);


^ permalink raw reply related	[flat|nested] 187+ messages in thread

* [PATCH 55/63] xfs: convert unwritten status of reverse mappings for shared files
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (53 preceding siblings ...)
  2016-09-30  3:11 ` [PATCH 54/63] xfs: use interval query for rmap alloc operations on shared files Darrick J. Wong
@ 2016-09-30  3:11 ` Darrick J. Wong
  2016-09-30  8:25   ` Christoph Hellwig
  2016-09-30  3:11 ` [PATCH 56/63] xfs: set a default CoW extent size of 32 blocks Darrick J. Wong
                   ` (8 subsequent siblings)
  63 siblings, 1 reply; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:11 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel

Provide a function to convert an unwritten extent to a real one and
vice versa when shared extents are possible.
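The conversion logic below keys off which edges of the existing record the new range lines up with.  As a rough standalone sketch (plain C with made-up flag values, not the kernel's RMAP_* definitions), the filling-state computation looks like:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-ins for the kernel's RMAP_LEFT_FILLING and
 * RMAP_RIGHT_FILLING state bits; the values here are made up. */
#define SKETCH_LEFT_FILLING	(1 << 0)
#define SKETCH_RIGHT_FILLING	(1 << 1)

/*
 * Which edges of the previous record [prev_off, prev_off + prev_len)
 * does the converted range [offset, offset + len) line up with?
 */
static int convert_state(uint64_t prev_off, uint64_t prev_len,
			 uint64_t offset, uint64_t len)
{
	int state = 0;

	if (prev_off == offset)
		state |= SKETCH_LEFT_FILLING;
	if (prev_off + prev_len == offset + len)
		state |= SKETCH_RIGHT_FILLING;
	return state;
}
```

Converting the middle of a record sets neither bit, which is the one-extent-becomes-three case handled by `case 0:` in the patch.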

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
v2: Move rmap unwritten bit to rm_offset.
---
 fs/xfs/libxfs/xfs_rmap.c |  385 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_rmap_item.c   |    3 
 2 files changed, 387 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
index bb5e2f8..3a8cc71 100644
--- a/fs/xfs/libxfs/xfs_rmap.c
+++ b/fs/xfs/libxfs/xfs_rmap.c
@@ -1278,6 +1278,384 @@ xfs_rmap_convert(
 	return error;
 }
 
+/*
+ * Convert an unwritten extent to a real extent or vice versa.  If there is no
+ * possibility of overlapping extents, delegate to the simpler convert
+ * function.
+ */
+STATIC int
+xfs_rmap_convert_shared(
+	struct xfs_btree_cur	*cur,
+	xfs_agblock_t		bno,
+	xfs_extlen_t		len,
+	bool			unwritten,
+	struct xfs_owner_info	*oinfo)
+{
+	struct xfs_mount	*mp = cur->bc_mp;
+	struct xfs_rmap_irec	r[4];	/* neighbor extent entries */
+					/* left is 0, right is 1, prev is 2 */
+					/* new is 3 */
+	uint64_t		owner;
+	uint64_t		offset;
+	uint64_t		new_endoff;
+	unsigned int		oldext;
+	unsigned int		newext;
+	unsigned int		flags = 0;
+	int			i;
+	int			state = 0;
+	int			error;
+
+	xfs_owner_info_unpack(oinfo, &owner, &offset, &flags);
+	ASSERT(!(XFS_RMAP_NON_INODE_OWNER(owner) ||
+			(flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))));
+	oldext = unwritten ? XFS_RMAP_UNWRITTEN : 0;
+	new_endoff = offset + len;
+	trace_xfs_rmap_convert(mp, cur->bc_private.a.agno, bno, len,
+			unwritten, oinfo);
+
+	/*
+	 * For the initial lookup, look for an exact match or the left-adjacent
+	 * record for our insertion point. This will also give us the record for
+	 * start block contiguity tests.
+	 */
+	error = xfs_rmap_lookup_le_range(cur, bno, owner, offset, flags,
+			&PREV, &i);
+	if (error)
+		goto done;
+	XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
+
+	ASSERT(PREV.rm_offset <= offset);
+	ASSERT(PREV.rm_offset + PREV.rm_blockcount >= new_endoff);
+	ASSERT((PREV.rm_flags & XFS_RMAP_UNWRITTEN) == oldext);
+	newext = ~oldext & XFS_RMAP_UNWRITTEN;
+
+	/*
+	 * Set flags determining what part of the previous oldext allocation
+	 * extent is being replaced by a newext allocation.
+	 */
+	if (PREV.rm_offset == offset)
+		state |= RMAP_LEFT_FILLING;
+	if (PREV.rm_offset + PREV.rm_blockcount == new_endoff)
+		state |= RMAP_RIGHT_FILLING;
+
+	/* Is there a left record that abuts our range? */
+	error = xfs_rmap_find_left_neighbor(cur, bno, owner, offset, newext,
+			&LEFT, &i);
+	if (error)
+		goto done;
+	if (i) {
+		state |= RMAP_LEFT_VALID;
+		XFS_WANT_CORRUPTED_GOTO(mp,
+				LEFT.rm_startblock + LEFT.rm_blockcount <= bno,
+				done);
+		if (xfs_rmap_is_mergeable(&LEFT, owner, newext))
+			state |= RMAP_LEFT_CONTIG;
+	}
+
+	/* Is there a right record that abuts our range? */
+	error = xfs_rmap_lookup_eq(cur, bno + len, len, owner, offset + len,
+			newext, &i);
+	if (error)
+		goto done;
+	if (i) {
+		state |= RMAP_RIGHT_VALID;
+		error = xfs_rmap_get_rec(cur, &RIGHT, &i);
+		if (error)
+			goto done;
+		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
+		XFS_WANT_CORRUPTED_GOTO(mp, bno + len <= RIGHT.rm_startblock,
+				done);
+		trace_xfs_rmap_find_right_neighbor_result(cur->bc_mp,
+				cur->bc_private.a.agno, RIGHT.rm_startblock,
+				RIGHT.rm_blockcount, RIGHT.rm_owner,
+				RIGHT.rm_offset, RIGHT.rm_flags);
+		if (xfs_rmap_is_mergeable(&RIGHT, owner, newext))
+			state |= RMAP_RIGHT_CONTIG;
+	}
+
+	/* check that left + prev + right is not too long */
+	if ((state & (RMAP_LEFT_FILLING | RMAP_LEFT_CONTIG |
+			 RMAP_RIGHT_FILLING | RMAP_RIGHT_CONTIG)) ==
+	    (RMAP_LEFT_FILLING | RMAP_LEFT_CONTIG |
+	     RMAP_RIGHT_FILLING | RMAP_RIGHT_CONTIG) &&
+	    (unsigned long)LEFT.rm_blockcount + len +
+	     RIGHT.rm_blockcount > XFS_RMAP_LEN_MAX)
+		state &= ~RMAP_RIGHT_CONTIG;
+
+	trace_xfs_rmap_convert_state(mp, cur->bc_private.a.agno, state,
+			_RET_IP_);
+	/*
+	 * Switch out based on the FILLING and CONTIG state bits.
+	 */
+	switch (state & (RMAP_LEFT_FILLING | RMAP_LEFT_CONTIG |
+			 RMAP_RIGHT_FILLING | RMAP_RIGHT_CONTIG)) {
+	case RMAP_LEFT_FILLING | RMAP_LEFT_CONTIG |
+	     RMAP_RIGHT_FILLING | RMAP_RIGHT_CONTIG:
+		/*
+		 * Setting all of a previous oldext extent to newext.
+		 * The left and right neighbors are both contiguous with new.
+		 */
+		error = xfs_rmap_delete(cur, RIGHT.rm_startblock,
+				RIGHT.rm_blockcount, RIGHT.rm_owner,
+				RIGHT.rm_offset, RIGHT.rm_flags);
+		if (error)
+			goto done;
+		error = xfs_rmap_delete(cur, PREV.rm_startblock,
+				PREV.rm_blockcount, PREV.rm_owner,
+				PREV.rm_offset, PREV.rm_flags);
+		if (error)
+			goto done;
+		NEW = LEFT;
+		error = xfs_rmap_lookup_eq(cur, NEW.rm_startblock,
+				NEW.rm_blockcount, NEW.rm_owner,
+				NEW.rm_offset, NEW.rm_flags, &i);
+		if (error)
+			goto done;
+		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
+		NEW.rm_blockcount += PREV.rm_blockcount + RIGHT.rm_blockcount;
+		error = xfs_rmap_update(cur, &NEW);
+		if (error)
+			goto done;
+		break;
+
+	case RMAP_LEFT_FILLING | RMAP_RIGHT_FILLING | RMAP_LEFT_CONTIG:
+		/*
+		 * Setting all of a previous oldext extent to newext.
+		 * The left neighbor is contiguous, the right is not.
+		 */
+		error = xfs_rmap_delete(cur, PREV.rm_startblock,
+				PREV.rm_blockcount, PREV.rm_owner,
+				PREV.rm_offset, PREV.rm_flags);
+		if (error)
+			goto done;
+		NEW = LEFT;
+		error = xfs_rmap_lookup_eq(cur, NEW.rm_startblock,
+				NEW.rm_blockcount, NEW.rm_owner,
+				NEW.rm_offset, NEW.rm_flags, &i);
+		if (error)
+			goto done;
+		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
+		NEW.rm_blockcount += PREV.rm_blockcount;
+		error = xfs_rmap_update(cur, &NEW);
+		if (error)
+			goto done;
+		break;
+
+	case RMAP_LEFT_FILLING | RMAP_RIGHT_FILLING | RMAP_RIGHT_CONTIG:
+		/*
+		 * Setting all of a previous oldext extent to newext.
+		 * The right neighbor is contiguous, the left is not.
+		 */
+		error = xfs_rmap_delete(cur, RIGHT.rm_startblock,
+				RIGHT.rm_blockcount, RIGHT.rm_owner,
+				RIGHT.rm_offset, RIGHT.rm_flags);
+		if (error)
+			goto done;
+		NEW = PREV;
+		error = xfs_rmap_lookup_eq(cur, NEW.rm_startblock,
+				NEW.rm_blockcount, NEW.rm_owner,
+				NEW.rm_offset, NEW.rm_flags, &i);
+		if (error)
+			goto done;
+		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
+		NEW.rm_blockcount += RIGHT.rm_blockcount;
+		NEW.rm_flags = RIGHT.rm_flags;
+		error = xfs_rmap_update(cur, &NEW);
+		if (error)
+			goto done;
+		break;
+
+	case RMAP_LEFT_FILLING | RMAP_RIGHT_FILLING:
+		/*
+		 * Setting all of a previous oldext extent to newext.
+		 * Neither the left nor right neighbors are contiguous with
+		 * the new one.
+		 */
+		NEW = PREV;
+		error = xfs_rmap_lookup_eq(cur, NEW.rm_startblock,
+				NEW.rm_blockcount, NEW.rm_owner,
+				NEW.rm_offset, NEW.rm_flags, &i);
+		if (error)
+			goto done;
+		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
+		NEW.rm_flags = newext;
+		error = xfs_rmap_update(cur, &NEW);
+		if (error)
+			goto done;
+		break;
+
+	case RMAP_LEFT_FILLING | RMAP_LEFT_CONTIG:
+		/*
+		 * Setting the first part of a previous oldext extent to newext.
+		 * The left neighbor is contiguous.
+		 */
+		NEW = PREV;
+		error = xfs_rmap_delete(cur, NEW.rm_startblock,
+				NEW.rm_blockcount, NEW.rm_owner,
+				NEW.rm_offset, NEW.rm_flags);
+		if (error)
+			goto done;
+		NEW.rm_offset += len;
+		NEW.rm_startblock += len;
+		NEW.rm_blockcount -= len;
+		error = xfs_rmap_insert(cur, NEW.rm_startblock,
+				NEW.rm_blockcount, NEW.rm_owner,
+				NEW.rm_offset, NEW.rm_flags);
+		if (error)
+			goto done;
+		NEW = LEFT;
+		error = xfs_rmap_lookup_eq(cur, NEW.rm_startblock,
+				NEW.rm_blockcount, NEW.rm_owner,
+				NEW.rm_offset, NEW.rm_flags, &i);
+		if (error)
+			goto done;
+		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
+		NEW.rm_blockcount += len;
+		error = xfs_rmap_update(cur, &NEW);
+		if (error)
+			goto done;
+		break;
+
+	case RMAP_LEFT_FILLING:
+		/*
+		 * Setting the first part of a previous oldext extent to newext.
+		 * The left neighbor is not contiguous.
+		 */
+		NEW = PREV;
+		error = xfs_rmap_delete(cur, NEW.rm_startblock,
+				NEW.rm_blockcount, NEW.rm_owner,
+				NEW.rm_offset, NEW.rm_flags);
+		if (error)
+			goto done;
+		NEW.rm_offset += len;
+		NEW.rm_startblock += len;
+		NEW.rm_blockcount -= len;
+		error = xfs_rmap_insert(cur, NEW.rm_startblock,
+				NEW.rm_blockcount, NEW.rm_owner,
+				NEW.rm_offset, NEW.rm_flags);
+		if (error)
+			goto done;
+		error = xfs_rmap_insert(cur, bno, len, owner, offset, newext);
+		if (error)
+			goto done;
+		break;
+
+	case RMAP_RIGHT_FILLING | RMAP_RIGHT_CONTIG:
+		/*
+		 * Setting the last part of a previous oldext extent to newext.
+		 * The right neighbor is contiguous with the new allocation.
+		 */
+		NEW = PREV;
+		error = xfs_rmap_lookup_eq(cur, NEW.rm_startblock,
+				NEW.rm_blockcount, NEW.rm_owner,
+				NEW.rm_offset, NEW.rm_flags, &i);
+		if (error)
+			goto done;
+		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
+		NEW.rm_blockcount = offset - NEW.rm_offset;
+		error = xfs_rmap_update(cur, &NEW);
+		if (error)
+			goto done;
+		NEW = RIGHT;
+		error = xfs_rmap_delete(cur, NEW.rm_startblock,
+				NEW.rm_blockcount, NEW.rm_owner,
+				NEW.rm_offset, NEW.rm_flags);
+		if (error)
+			goto done;
+		NEW.rm_offset = offset;
+		NEW.rm_startblock = bno;
+		NEW.rm_blockcount += len;
+		error = xfs_rmap_insert(cur, NEW.rm_startblock,
+				NEW.rm_blockcount, NEW.rm_owner,
+				NEW.rm_offset, NEW.rm_flags);
+		if (error)
+			goto done;
+		break;
+
+	case RMAP_RIGHT_FILLING:
+		/*
+		 * Setting the last part of a previous oldext extent to newext.
+		 * The right neighbor is not contiguous.
+		 */
+		NEW = PREV;
+		error = xfs_rmap_lookup_eq(cur, NEW.rm_startblock,
+				NEW.rm_blockcount, NEW.rm_owner,
+				NEW.rm_offset, NEW.rm_flags, &i);
+		if (error)
+			goto done;
+		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
+		NEW.rm_blockcount -= len;
+		error = xfs_rmap_update(cur, &NEW);
+		if (error)
+			goto done;
+		error = xfs_rmap_insert(cur, bno, len, owner, offset, newext);
+		if (error)
+			goto done;
+		break;
+
+	case 0:
+		/*
+		 * Setting the middle part of a previous oldext extent to
+		 * newext.  Contiguity is impossible here.
+		 * One extent becomes three extents.
+		 */
+		/* new right extent - oldext */
+		NEW.rm_startblock = bno + len;
+		NEW.rm_owner = owner;
+		NEW.rm_offset = new_endoff;
+		NEW.rm_blockcount = PREV.rm_offset + PREV.rm_blockcount -
+				new_endoff;
+		NEW.rm_flags = PREV.rm_flags;
+		error = xfs_rmap_insert(cur, NEW.rm_startblock,
+				NEW.rm_blockcount, NEW.rm_owner, NEW.rm_offset,
+				NEW.rm_flags);
+		if (error)
+			goto done;
+		/* new left extent - oldext */
+		NEW = PREV;
+		error = xfs_rmap_lookup_eq(cur, NEW.rm_startblock,
+				NEW.rm_blockcount, NEW.rm_owner,
+				NEW.rm_offset, NEW.rm_flags, &i);
+		if (error)
+			goto done;
+		XFS_WANT_CORRUPTED_GOTO(mp, i == 1, done);
+		NEW.rm_blockcount = offset - NEW.rm_offset;
+		error = xfs_rmap_update(cur, &NEW);
+		if (error)
+			goto done;
+		/* new middle extent - newext */
+		NEW.rm_startblock = bno;
+		NEW.rm_blockcount = len;
+		NEW.rm_owner = owner;
+		NEW.rm_offset = offset;
+		NEW.rm_flags = newext;
+		error = xfs_rmap_insert(cur, NEW.rm_startblock,
+				NEW.rm_blockcount, NEW.rm_owner, NEW.rm_offset,
+				NEW.rm_flags);
+		if (error)
+			goto done;
+		break;
+
+	case RMAP_LEFT_FILLING | RMAP_LEFT_CONTIG | RMAP_RIGHT_CONTIG:
+	case RMAP_RIGHT_FILLING | RMAP_LEFT_CONTIG | RMAP_RIGHT_CONTIG:
+	case RMAP_LEFT_FILLING | RMAP_RIGHT_CONTIG:
+	case RMAP_RIGHT_FILLING | RMAP_LEFT_CONTIG:
+	case RMAP_LEFT_CONTIG | RMAP_RIGHT_CONTIG:
+	case RMAP_LEFT_CONTIG:
+	case RMAP_RIGHT_CONTIG:
+		/*
+		 * These cases are all impossible.
+		 */
+		ASSERT(0);
+	}
+
+	trace_xfs_rmap_convert_done(mp, cur->bc_private.a.agno, bno, len,
+			unwritten, oinfo);
+done:
+	if (error)
+		trace_xfs_rmap_convert_error(cur->bc_mp,
+				cur->bc_private.a.agno, error, _RET_IP_);
+	return error;
+}
+
 #undef	NEW
 #undef	LEFT
 #undef	RIGHT
@@ -1754,6 +2132,10 @@ xfs_rmap_finish_one(
 		error = xfs_rmap_convert(rcur, bno, blockcount, !unwritten,
 				&oinfo);
 		break;
+	case XFS_RMAP_CONVERT_SHARED:
+		error = xfs_rmap_convert_shared(rcur, bno, blockcount,
+				!unwritten, &oinfo);
+		break;
 	default:
 		ASSERT(0);
 		error = -EFSCORRUPTED;
@@ -1857,7 +2239,8 @@ xfs_rmap_convert_extent(
 	if (!xfs_rmap_update_is_needed(mp, whichfork))
 		return 0;
 
-	return __xfs_rmap_add(mp, dfops, XFS_RMAP_CONVERT, ip->i_ino,
+	return __xfs_rmap_add(mp, dfops, xfs_is_reflink_inode(ip) ?
+			XFS_RMAP_CONVERT_SHARED : XFS_RMAP_CONVERT, ip->i_ino,
 			whichfork, PREV);
 }
 
diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c
index 3b8742e..73c8278 100644
--- a/fs/xfs/xfs_rmap_item.c
+++ b/fs/xfs/xfs_rmap_item.c
@@ -496,6 +496,9 @@ xfs_rui_recover(
 		case XFS_RMAP_EXTENT_CONVERT:
 			type = XFS_RMAP_CONVERT;
 			break;
+		case XFS_RMAP_EXTENT_CONVERT_SHARED:
+			type = XFS_RMAP_CONVERT_SHARED;
+			break;
 		case XFS_RMAP_EXTENT_ALLOC:
 			type = XFS_RMAP_ALLOC;
 			break;



* [PATCH 56/63] xfs: set a default CoW extent size of 32 blocks
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (54 preceding siblings ...)
  2016-09-30  3:11 ` [PATCH 55/63] xfs: convert unwritten status of reverse mappings for " Darrick J. Wong
@ 2016-09-30  3:11 ` Darrick J. Wong
  2016-09-30  8:25   ` Christoph Hellwig
  2016-09-30  3:11 ` [PATCH 57/63] xfs: check for invalid inode reflink flags Darrick J. Wong
                   ` (7 subsequent siblings)
  63 siblings, 1 reply; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:11 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel

If the admin doesn't set a CoW extent size or a regular extent size
hint, default to creating CoW reservations 32 blocks long to reduce
fragmentation.
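The hint selection reduces to "take the larger of the two hints, and fall back to 32 blocks if both are zero".  A hypothetical standalone mirror of the logic in xfs_get_cowextsz_hint() (not the kernel helper itself):

```c
#include <assert.h>

/* Sketch of the CoW extent size hint selection; the 32-block
 * fallback matches the default this patch introduces. */
static unsigned int cowextsz_hint(unsigned int cowextsize_hint,
				  unsigned int extsize_hint)
{
	unsigned int a = cowextsize_hint > extsize_hint ?
			 cowextsize_hint : extsize_hint;

	return a != 0 ? a : 32;
}
```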

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_inode.c |   10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)


diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 07c300b..a9fb223 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -80,7 +80,8 @@ xfs_get_extsz_hint(
 /*
  * Helper function to extract CoW extent size hint from inode.
  * Between the extent size hint and the CoW extent size hint, we
- * return the greater of the two.
+ * return the greater of the two.  If the value is zero (automatic),
+ * default to 32 blocks.
  */
 xfs_extlen_t
 xfs_get_cowextsz_hint(
@@ -93,9 +94,10 @@ xfs_get_cowextsz_hint(
 		a = ip->i_d.di_cowextsize;
 	b = xfs_get_extsz_hint(ip);
 
-	if (a > b)
-		return a;
-	return b;
+	a = max(a, b);
+	if (a == 0)
+		return 32;
+	return a;
 }
 
 /*



* [PATCH 57/63] xfs: check for invalid inode reflink flags
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (55 preceding siblings ...)
  2016-09-30  3:11 ` [PATCH 56/63] xfs: set a default CoW extent size of 32 blocks Darrick J. Wong
@ 2016-09-30  3:11 ` Darrick J. Wong
  2016-09-30  8:26   ` Christoph Hellwig
  2016-09-30  3:11 ` [PATCH 58/63] xfs: don't mix reflink and DAX mode for now Darrick J. Wong
                   ` (6 subsequent siblings)
  63 siblings, 1 reply; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:11 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel

We don't support sharing blocks on the realtime device.  Flag as
corrupt any inode that has the reflink or cowextsize flags set when
the reflink feature is disabled, and prevent realtime mode from
being set on a reflinked file.
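The verifier additions boil down to two predicates on the inode flag words.  A freestanding sketch with made-up flag values (the real XFS_DIFLAG* constants live in xfs_format.h):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag bits, not the on-disk values. */
#define SKETCH_DIFLAG_REALTIME		(1u << 0)
#define SKETCH_DIFLAG2_REFLINK		(1ull << 1)
#define SKETCH_DIFLAG2_COWEXTSIZE	(1ull << 2)

static bool dinode_flags_valid(bool fs_has_reflink,
			       uint16_t flags, uint64_t flags2)
{
	/* reflink/cowextsize require the reflink feature bit */
	if ((flags2 & (SKETCH_DIFLAG2_REFLINK | SKETCH_DIFLAG2_COWEXTSIZE)) &&
	    !fs_has_reflink)
		return false;
	/* reflink and realtime don't mix */
	if ((flags2 & SKETCH_DIFLAG2_REFLINK) &&
	    (flags & SKETCH_DIFLAG_REALTIME))
		return false;
	return true;
}
```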

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_inode_buf.c |   16 ++++++++++++++++
 fs/xfs/xfs_ioctl.c            |    4 ++++
 2 files changed, 20 insertions(+)


diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index a3e8038..f1b9d97 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -375,6 +375,9 @@ xfs_dinode_verify(
 	struct xfs_inode	*ip,
 	struct xfs_dinode	*dip)
 {
+	uint16_t		flags;
+	uint64_t		flags2;
+
 	if (dip->di_magic != cpu_to_be16(XFS_DINODE_MAGIC))
 		return false;
 
@@ -391,6 +394,19 @@ xfs_dinode_verify(
 		return false;
 	if (!uuid_equal(&dip->di_uuid, &mp->m_sb.sb_meta_uuid))
 		return false;
+
+	flags = be16_to_cpu(dip->di_flags);
+	flags2 = be64_to_cpu(dip->di_flags2);
+
+	/* don't allow reflink/cowextsize if we don't have reflink */
+	if ((flags2 & (XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE)) &&
+	    !xfs_sb_version_hasreflink(&mp->m_sb))
+		return false;
+
+	/* don't let reflink and realtime mix */
+	if ((flags2 & XFS_DIFLAG2_REFLINK) && (flags & XFS_DIFLAG_REALTIME))
+		return false;
+
 	return true;
 }
 
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 1388a127..c65d9ea 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1034,6 +1034,10 @@ xfs_ioctl_setattr_xflags(
 			return -EINVAL;
 	}
 
+	/* Don't allow us to set realtime mode for a reflinked file. */
+	if ((fa->fsx_xflags & FS_XFLAG_REALTIME) && xfs_is_reflink_inode(ip))
+		return -EINVAL;
+
 	/*
 	 * Can't modify an immutable/append-only file unless
 	 * we have appropriate permission.



* [PATCH 58/63] xfs: don't mix reflink and DAX mode for now
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (56 preceding siblings ...)
  2016-09-30  3:11 ` [PATCH 57/63] xfs: check for invalid inode reflink flags Darrick J. Wong
@ 2016-09-30  3:11 ` Darrick J. Wong
  2016-09-30  8:26   ` Christoph Hellwig
  2016-09-30  3:12 ` [PATCH 59/63] xfs: simulate per-AG reservations being critically low Darrick J. Wong
                   ` (5 subsequent siblings)
  63 siblings, 1 reply; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:11 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel

Since we don't have a strategy for handling both DAX and reflink,
for now we'll just prohibit both being set at the same time.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_inode_buf.c |    4 ++++
 fs/xfs/xfs_file.c             |    4 ++++
 fs/xfs/xfs_ioctl.c            |    4 ++++
 fs/xfs/xfs_iops.c             |    1 +
 4 files changed, 13 insertions(+)


diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index f1b9d97..8de9a3a 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -407,6 +407,10 @@ xfs_dinode_verify(
 	if ((flags2 & XFS_DIFLAG2_REFLINK) && (flags & XFS_DIFLAG_REALTIME))
 		return false;
 
+	/* don't let reflink and dax mix */
+	if ((flags2 & XFS_DIFLAG2_REFLINK) && (flags2 & XFS_DIFLAG2_DAX))
+		return false;
+
 	return true;
 }
 
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index b8d3a8c..ef01bd3 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1052,6 +1052,10 @@ xfs_file_share_range(
 	if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
 		return -EINVAL;
 
+	/* Don't share DAX file data for now. */
+	if (IS_DAX(inode_in) || IS_DAX(inode_out))
+		return -EINVAL;
+
 	/* Are we going all the way to the end? */
 	isize = i_size_read(inode_in);
 	if (isize == 0)
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index c65d9ea..8b9f31c5 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1038,6 +1038,10 @@ xfs_ioctl_setattr_xflags(
 	if ((fa->fsx_xflags & FS_XFLAG_REALTIME) && xfs_is_reflink_inode(ip))
 		return -EINVAL;
 
+	/* Don't allow us to set DAX mode for a reflinked file for now. */
+	if ((fa->fsx_xflags & FS_XFLAG_DAX) && xfs_is_reflink_inode(ip))
+		return -EINVAL;
+
 	/*
 	 * Can't modify an immutable/append-only file unless
 	 * we have appropriate permission.
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index b24c310..5945b64 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1144,6 +1144,7 @@ xfs_diflags_to_iflags(
 		inode->i_flags |= S_NOATIME;
 	if (S_ISREG(inode->i_mode) &&
 	    ip->i_mount->m_sb.sb_blocksize == PAGE_SIZE &&
+	    !xfs_is_reflink_inode(ip) &&
 	    (ip->i_mount->m_flags & XFS_MOUNT_DAX ||
 	     ip->i_d.di_flags2 & XFS_DIFLAG2_DAX))
 		inode->i_flags |= S_DAX;



* [PATCH 59/63] xfs: simulate per-AG reservations being critically low
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (57 preceding siblings ...)
  2016-09-30  3:11 ` [PATCH 58/63] xfs: don't mix reflink and DAX mode for now Darrick J. Wong
@ 2016-09-30  3:12 ` Darrick J. Wong
  2016-09-30  8:27   ` Christoph Hellwig
  2016-09-30  3:12 ` [PATCH 60/63] xfs: recognize the reflink feature bit Darrick J. Wong
                   ` (4 subsequent siblings)
  63 siblings, 1 reply; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:12 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel

Create an error injection point that enables us to simulate being
critically low on per-AG block reservations.  This should enable us to
simulate this specific ENOSPC condition so that we can test falling back
to a regular file copy.
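XFS_TEST_ERROR() preserves the genuine condition and, when the error tag is armed, additionally fires roughly once per random_factor evaluations.  A rough model of that behavior (not the kernel macro):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Hypothetical model of XFS_TEST_ERROR(expr, mp, tag, random_factor):
 * the real condition always trips; an armed error tag additionally
 * injects a failure about 1 time in random_factor.
 */
static bool test_error(bool real_condition, bool tag_armed,
		       unsigned int random_factor)
{
	if (real_condition)
		return true;
	return tag_armed && (rand() % random_factor == 0);
}
```

With XFS_RANDOM_AG_RESV_CRITICAL set to 4, an armed tag reports the reservation as critical roughly a quarter of the time, which is enough to exercise the ENOSPC fallback path.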

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_ag_resv.c |    4 +++-
 fs/xfs/xfs_error.h          |    4 +++-
 2 files changed, 6 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c
index adf770f..e5ebc37 100644
--- a/fs/xfs/libxfs/xfs_ag_resv.c
+++ b/fs/xfs/libxfs/xfs_ag_resv.c
@@ -109,7 +109,9 @@ xfs_ag_resv_critical(
 	trace_xfs_ag_resv_critical(pag, type, avail);
 
 	/* Critically low if less than 10% or max btree height remains. */
-	return avail < orig / 10 || avail < XFS_BTREE_MAXLEVELS;
+	return XFS_TEST_ERROR(avail < orig / 10 || avail < XFS_BTREE_MAXLEVELS,
+			pag->pag_mount, XFS_ERRTAG_AG_RESV_CRITICAL,
+			XFS_RANDOM_AG_RESV_CRITICAL);
 }
 
 /*
diff --git a/fs/xfs/xfs_error.h b/fs/xfs/xfs_error.h
index 8d8e1b07..05f8666 100644
--- a/fs/xfs/xfs_error.h
+++ b/fs/xfs/xfs_error.h
@@ -95,7 +95,8 @@ extern void xfs_verifier_error(struct xfs_buf *bp);
 #define XFS_ERRTAG_REFCOUNT_CONTINUE_UPDATE		24
 #define XFS_ERRTAG_REFCOUNT_FINISH_ONE			25
 #define XFS_ERRTAG_BMAP_FINISH_ONE			26
-#define XFS_ERRTAG_MAX					27
+#define XFS_ERRTAG_AG_RESV_CRITICAL			27
+#define XFS_ERRTAG_MAX					28
 
 /*
  * Random factors for above tags, 1 means always, 2 means 1/2 time, etc.
@@ -127,6 +128,7 @@ extern void xfs_verifier_error(struct xfs_buf *bp);
 #define XFS_RANDOM_REFCOUNT_CONTINUE_UPDATE		1
 #define XFS_RANDOM_REFCOUNT_FINISH_ONE			1
 #define XFS_RANDOM_BMAP_FINISH_ONE			1
+#define XFS_RANDOM_AG_RESV_CRITICAL			4
 
 #ifdef DEBUG
 extern int xfs_error_test_active;



* [PATCH 60/63] xfs: recognize the reflink feature bit
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (58 preceding siblings ...)
  2016-09-30  3:12 ` [PATCH 59/63] xfs: simulate per-AG reservations being critically low Darrick J. Wong
@ 2016-09-30  3:12 ` Darrick J. Wong
  2016-09-30  8:27   ` Christoph Hellwig
  2016-09-30  3:12 ` [PATCH 61/63] xfs: various swapext cleanups Darrick J. Wong
                   ` (3 subsequent siblings)
  63 siblings, 1 reply; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:12 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel

Add the reflink feature flag to the set of recognized feature flags.
This enables users to write to reflink filesystems.
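
To sketch why recognizing the bit matters (a toy model in plain C; the
constants mirror the patch hunk but the helper name is invented): any
ro-compat feature bit the kernel does not recognize forces the filesystem to
be mounted read-only, so adding REFLINK to the known set is what permits
read-write mounts:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FEAT_RO_COMPAT_FINOBT	(1u << 0)
#define FEAT_RO_COMPAT_RMAPBT	(1u << 1)
#define FEAT_RO_COMPAT_REFLINK	(1u << 2)	/* reflinked files */
#define FEAT_RO_COMPAT_ALL \
		(FEAT_RO_COMPAT_FINOBT | \
		 FEAT_RO_COMPAT_RMAPBT | \
		 FEAT_RO_COMPAT_REFLINK)
#define FEAT_RO_COMPAT_UNKNOWN	(~FEAT_RO_COMPAT_ALL)

/* An unknown ro-compat bit means the kernel must refuse rw mounts. */
static bool ro_mount_required(uint32_t ro_features)
{
	return (ro_features & FEAT_RO_COMPAT_UNKNOWN) != 0;
}
```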

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_format.h |    3 ++-
 fs/xfs/xfs_super.c         |    7 +++++++
 2 files changed, 9 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 94f610a..7a40855 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -459,7 +459,8 @@ xfs_sb_has_compat_feature(
 #define XFS_SB_FEAT_RO_COMPAT_REFLINK  (1 << 2)		/* reflinked files */
 #define XFS_SB_FEAT_RO_COMPAT_ALL \
 		(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
-		 XFS_SB_FEAT_RO_COMPAT_RMAPBT)
+		 XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
+		 XFS_SB_FEAT_RO_COMPAT_REFLINK)
 #define XFS_SB_FEAT_RO_COMPAT_UNKNOWN	~XFS_SB_FEAT_RO_COMPAT_ALL
 static inline bool
 xfs_sb_has_ro_compat_feature(
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 2152ab6..0e95485 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1604,6 +1604,9 @@ xfs_fs_fill_super(
 			"DAX unsupported by block device. Turning off DAX.");
 			mp->m_flags &= ~XFS_MOUNT_DAX;
 		}
+		if (xfs_sb_version_hasreflink(&mp->m_sb))
+			xfs_alert(mp,
+		"DAX and reflink have not been tested together!");
 	}
 
 	if (xfs_sb_version_hasrmapbt(&mp->m_sb)) {
@@ -1617,6 +1620,10 @@ xfs_fs_fill_super(
 	"EXPERIMENTAL reverse mapping btree feature enabled. Use at your own risk!");
 	}
 
+	if (xfs_sb_version_hasreflink(&mp->m_sb))
+		xfs_alert(mp,
+	"EXPERIMENTAL reflink feature enabled. Use at your own risk!");
+
 	error = xfs_mountfs(mp);
 	if (error)
 		goto out_filestream_unmount;



* [PATCH 61/63] xfs: various swapext cleanups
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (59 preceding siblings ...)
  2016-09-30  3:12 ` [PATCH 60/63] xfs: recognize the reflink feature bit Darrick J. Wong
@ 2016-09-30  3:12 ` Darrick J. Wong
  2016-09-30  8:28   ` Christoph Hellwig
  2016-09-30  3:12 ` [PATCH 62/63] xfs: refactor swapext code Darrick J. Wong
                   ` (2 subsequent siblings)
  63 siblings, 1 reply; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:12 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel

Replace structure typedefs with struct expressions and fix some
whitespace issues that result.
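
For readers unfamiliar with the convention being applied, here is a toy
illustration (names invented) of the two declaration styles; the cleanup
replaces uses of the typedef alias with the explicit struct form:

```c
#include <assert.h>

/* Old style: a typedef alias, as with xfs_inode_t, xfs_mount_t, etc. */
typedef struct toy_inode {
	int	blocks;
} toy_inode_t;

/* New style: spell out the struct type explicitly in declarations. */
static int count_blocks(struct toy_inode *ip)
{
	return ip->blocks;
}
```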

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_bmap_util.c |   32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)


diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index fd4b6bb..afe0319 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1556,8 +1556,8 @@ xfs_insert_file_space(
  */
 static int
 xfs_swap_extents_check_format(
-	xfs_inode_t	*ip,	/* target inode */
-	xfs_inode_t	*tip)	/* tmp inode */
+	struct xfs_inode	*ip,	/* target inode */
+	struct xfs_inode	*tip)	/* tmp inode */
 {
 
 	/* Should never get a local format */
@@ -1643,22 +1643,22 @@ xfs_swap_extent_flush(
 
 int
 xfs_swap_extents(
-	xfs_inode_t	*ip,	/* target inode */
-	xfs_inode_t	*tip,	/* tmp inode */
-	xfs_swapext_t	*sxp)
+	struct xfs_inode	*ip,	/* target inode */
+	struct xfs_inode	*tip,	/* tmp inode */
+	struct xfs_swapext	*sxp)
 {
-	xfs_mount_t	*mp = ip->i_mount;
-	xfs_trans_t	*tp;
-	xfs_bstat_t	*sbp = &sxp->sx_stat;
-	xfs_ifork_t	*tempifp, *ifp, *tifp;
-	int		src_log_flags, target_log_flags;
-	int		error = 0;
-	int		aforkblks = 0;
-	int		taforkblks = 0;
-	__uint64_t	tmp;
-	int		lock_flags;
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_trans	*tp;
+	struct xfs_bstat	*sbp = &sxp->sx_stat;
+	struct xfs_ifork	*tempifp, *ifp, *tifp;
+	int			src_log_flags, target_log_flags;
+	int			error = 0;
+	int			aforkblks = 0;
+	int			taforkblks = 0;
+	__uint64_t		tmp;
+	int			lock_flags;
 	struct xfs_ifork	*cowfp;
-	__uint64_t	f;
+	__uint64_t		f;
 
 	/* XXX: we can't do this with rmap, will fix later */
 	if (xfs_sb_version_hasrmapbt(&mp->m_sb))



* [PATCH 62/63] xfs: refactor swapext code
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (60 preceding siblings ...)
  2016-09-30  3:12 ` [PATCH 61/63] xfs: various swapext cleanups Darrick J. Wong
@ 2016-09-30  3:12 ` Darrick J. Wong
  2016-09-30  8:28   ` Christoph Hellwig
  2016-09-30  3:12 ` [PATCH 63/63] xfs: implement swapext for rmap filesystems Darrick J. Wong
  2016-09-30  9:00 ` [PATCH v10 00/63] xfs: add reflink and dedupe support Christoph Hellwig
  63 siblings, 1 reply; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:12 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel

Refactor the swapext function to pull out the fork swapping piece
into a separate function.  In the next patch we'll add in the bit
we need to make it work with rmap filesystems.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_bmap_util.c |  275 +++++++++++++++++++++++++-----------------------
 1 file changed, 144 insertions(+), 131 deletions(-)


diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index afe0319..b5564f2 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1641,127 +1641,37 @@ xfs_swap_extent_flush(
 	return 0;
 }
 
-int
-xfs_swap_extents(
-	struct xfs_inode	*ip,	/* target inode */
-	struct xfs_inode	*tip,	/* tmp inode */
-	struct xfs_swapext	*sxp)
+/* Swap the extents of two files by swapping data forks. */
+STATIC int
+xfs_swap_extent_forks(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip,
+	struct xfs_inode	*tip,
+	int			*src_log_flags,
+	int			*target_log_flags)
 {
-	struct xfs_mount	*mp = ip->i_mount;
-	struct xfs_trans	*tp;
-	struct xfs_bstat	*sbp = &sxp->sx_stat;
-	struct xfs_ifork	*tempifp, *ifp, *tifp;
-	int			src_log_flags, target_log_flags;
-	int			error = 0;
+	struct xfs_ifork	tempifp, *ifp, *tifp;
 	int			aforkblks = 0;
 	int			taforkblks = 0;
 	__uint64_t		tmp;
-	int			lock_flags;
-	struct xfs_ifork	*cowfp;
-	__uint64_t		f;
-
-	/* XXX: we can't do this with rmap, will fix later */
-	if (xfs_sb_version_hasrmapbt(&mp->m_sb))
-		return -EOPNOTSUPP;
-
-	tempifp = kmem_alloc(sizeof(xfs_ifork_t), KM_MAYFAIL);
-	if (!tempifp) {
-		error = -ENOMEM;
-		goto out;
-	}
-
-	/*
-	 * Lock the inodes against other IO, page faults and truncate to
-	 * begin with.  Then we can ensure the inodes are flushed and have no
-	 * page cache safely. Once we have done this we can take the ilocks and
-	 * do the rest of the checks.
-	 */
-	lock_flags = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
-	xfs_lock_two_inodes(ip, tip, XFS_IOLOCK_EXCL);
-	xfs_lock_two_inodes(ip, tip, XFS_MMAPLOCK_EXCL);
-
-	/* Verify that both files have the same format */
-	if ((VFS_I(ip)->i_mode & S_IFMT) != (VFS_I(tip)->i_mode & S_IFMT)) {
-		error = -EINVAL;
-		goto out_unlock;
-	}
-
-	/* Verify both files are either real-time or non-realtime */
-	if (XFS_IS_REALTIME_INODE(ip) != XFS_IS_REALTIME_INODE(tip)) {
-		error = -EINVAL;
-		goto out_unlock;
-	}
-
-	error = xfs_swap_extent_flush(ip);
-	if (error)
-		goto out_unlock;
-	error = xfs_swap_extent_flush(tip);
-	if (error)
-		goto out_unlock;
-
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_ichange, 0, 0, 0, &tp);
-	if (error)
-		goto out_unlock;
-
-	/*
-	 * Lock and join the inodes to the tansaction so that transaction commit
-	 * or cancel will unlock the inodes from this point onwards.
-	 */
-	xfs_lock_two_inodes(ip, tip, XFS_ILOCK_EXCL);
-	lock_flags |= XFS_ILOCK_EXCL;
-	xfs_trans_ijoin(tp, ip, lock_flags);
-	xfs_trans_ijoin(tp, tip, lock_flags);
-
-
-	/* Verify all data are being swapped */
-	if (sxp->sx_offset != 0 ||
-	    sxp->sx_length != ip->i_d.di_size ||
-	    sxp->sx_length != tip->i_d.di_size) {
-		error = -EFAULT;
-		goto out_trans_cancel;
-	}
-
-	trace_xfs_swap_extent_before(ip, 0);
-	trace_xfs_swap_extent_before(tip, 1);
-
-	/* check inode formats now that data is flushed */
-	error = xfs_swap_extents_check_format(ip, tip);
-	if (error) {
-		xfs_notice(mp,
-		    "%s: inode 0x%llx format is incompatible for exchanging.",
-				__func__, ip->i_ino);
-		goto out_trans_cancel;
-	}
+	int			error;
 
 	/*
-	 * Compare the current change & modify times with that
-	 * passed in.  If they differ, we abort this swap.
-	 * This is the mechanism used to ensure the calling
-	 * process that the file was not changed out from
-	 * under it.
-	 */
-	if ((sbp->bs_ctime.tv_sec != VFS_I(ip)->i_ctime.tv_sec) ||
-	    (sbp->bs_ctime.tv_nsec != VFS_I(ip)->i_ctime.tv_nsec) ||
-	    (sbp->bs_mtime.tv_sec != VFS_I(ip)->i_mtime.tv_sec) ||
-	    (sbp->bs_mtime.tv_nsec != VFS_I(ip)->i_mtime.tv_nsec)) {
-		error = -EBUSY;
-		goto out_trans_cancel;
-	}
-	/*
 	 * Count the number of extended attribute blocks
 	 */
 	if ( ((XFS_IFORK_Q(ip) != 0) && (ip->i_d.di_anextents > 0)) &&
 	     (ip->i_d.di_aformat != XFS_DINODE_FMT_LOCAL)) {
-		error = xfs_bmap_count_blocks(tp, ip, XFS_ATTR_FORK, &aforkblks);
+		error = xfs_bmap_count_blocks(tp, ip, XFS_ATTR_FORK,
+				&aforkblks);
 		if (error)
-			goto out_trans_cancel;
+			return error;
 	}
 	if ( ((XFS_IFORK_Q(tip) != 0) && (tip->i_d.di_anextents > 0)) &&
 	     (tip->i_d.di_aformat != XFS_DINODE_FMT_LOCAL)) {
 		error = xfs_bmap_count_blocks(tp, tip, XFS_ATTR_FORK,
-			&taforkblks);
+				&taforkblks);
 		if (error)
-			goto out_trans_cancel;
+			return error;
 	}
 
 	/*
@@ -1770,31 +1680,23 @@ xfs_swap_extents(
 	 * buffers, and so the validation done on read will expect the owner
 	 * field to be correctly set. Once we change the owners, we can swap the
 	 * inode forks.
-	 *
-	 * Note the trickiness in setting the log flags - we set the owner log
-	 * flag on the opposite inode (i.e. the inode we are setting the new
-	 * owner to be) because once we swap the forks and log that, log
-	 * recovery is going to see the fork as owned by the swapped inode,
-	 * not the pre-swapped inodes.
 	 */
-	src_log_flags = XFS_ILOG_CORE;
-	target_log_flags = XFS_ILOG_CORE;
 	if (ip->i_d.di_version == 3 &&
 	    ip->i_d.di_format == XFS_DINODE_FMT_BTREE) {
-		target_log_flags |= XFS_ILOG_DOWNER;
+		(*target_log_flags) |= XFS_ILOG_DOWNER;
 		error = xfs_bmbt_change_owner(tp, ip, XFS_DATA_FORK,
 					      tip->i_ino, NULL);
 		if (error)
-			goto out_trans_cancel;
+			return error;
 	}
 
 	if (tip->i_d.di_version == 3 &&
 	    tip->i_d.di_format == XFS_DINODE_FMT_BTREE) {
-		src_log_flags |= XFS_ILOG_DOWNER;
+		(*src_log_flags) |= XFS_ILOG_DOWNER;
 		error = xfs_bmbt_change_owner(tp, tip, XFS_DATA_FORK,
 					      ip->i_ino, NULL);
 		if (error)
-			goto out_trans_cancel;
+			return error;
 	}
 
 	/*
@@ -1802,9 +1704,9 @@ xfs_swap_extents(
 	 */
 	ifp = &ip->i_df;
 	tifp = &tip->i_df;
-	*tempifp = *ifp;	/* struct copy */
+	tempifp = *ifp;		/* struct copy */
 	*ifp = *tifp;		/* struct copy */
-	*tifp = *tempifp;	/* struct copy */
+	*tifp = tempifp;	/* struct copy */
 
 	/*
 	 * Fix the on-disk inode values
@@ -1844,12 +1746,12 @@ xfs_swap_extents(
 			ifp->if_u1.if_extents =
 				ifp->if_u2.if_inline_ext;
 		}
-		src_log_flags |= XFS_ILOG_DEXT;
+		(*src_log_flags) |= XFS_ILOG_DEXT;
 		break;
 	case XFS_DINODE_FMT_BTREE:
 		ASSERT(ip->i_d.di_version < 3 ||
-		       (src_log_flags & XFS_ILOG_DOWNER));
-		src_log_flags |= XFS_ILOG_DBROOT;
+		       (*src_log_flags & XFS_ILOG_DOWNER));
+		(*src_log_flags) |= XFS_ILOG_DBROOT;
 		break;
 	}
 
@@ -1863,15 +1765,126 @@ xfs_swap_extents(
 			tifp->if_u1.if_extents =
 				tifp->if_u2.if_inline_ext;
 		}
-		target_log_flags |= XFS_ILOG_DEXT;
+		(*target_log_flags) |= XFS_ILOG_DEXT;
 		break;
 	case XFS_DINODE_FMT_BTREE:
-		target_log_flags |= XFS_ILOG_DBROOT;
+		(*target_log_flags) |= XFS_ILOG_DBROOT;
 		ASSERT(tip->i_d.di_version < 3 ||
-		       (target_log_flags & XFS_ILOG_DOWNER));
+		       (*target_log_flags & XFS_ILOG_DOWNER));
 		break;
 	}
 
+	return 0;
+}
+
+int
+xfs_swap_extents(
+	struct xfs_inode	*ip,	/* target inode */
+	struct xfs_inode	*tip,	/* tmp inode */
+	struct xfs_swapext	*sxp)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_trans	*tp;
+	struct xfs_bstat	*sbp = &sxp->sx_stat;
+	int			src_log_flags, target_log_flags;
+	int			error = 0;
+	int			lock_flags;
+	struct xfs_ifork	*cowfp;
+	__uint64_t		f;
+
+	/*
+	 * Lock the inodes against other IO, page faults and truncate to
+	 * begin with.  Then we can ensure the inodes are flushed and have no
+	 * page cache safely. Once we have done this we can take the ilocks and
+	 * do the rest of the checks.
+	 */
+	lock_flags = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
+	xfs_lock_two_inodes(ip, tip, XFS_IOLOCK_EXCL);
+	xfs_lock_two_inodes(ip, tip, XFS_MMAPLOCK_EXCL);
+
+	/* Verify that both files have the same format */
+	if ((VFS_I(ip)->i_mode & S_IFMT) != (VFS_I(tip)->i_mode & S_IFMT)) {
+		error = -EINVAL;
+		goto out_unlock;
+	}
+
+	/* Verify both files are either real-time or non-realtime */
+	if (XFS_IS_REALTIME_INODE(ip) != XFS_IS_REALTIME_INODE(tip)) {
+		error = -EINVAL;
+		goto out_unlock;
+	}
+
+	error = xfs_swap_extent_flush(ip);
+	if (error)
+		goto out_unlock;
+	error = xfs_swap_extent_flush(tip);
+	if (error)
+		goto out_unlock;
+
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_ichange, 0, 0, 0, &tp);
+	if (error)
+		goto out_unlock;
+
+	/*
+	 * Lock and join the inodes to the transaction so that transaction commit
+	 * or cancel will unlock the inodes from this point onwards.
+	 */
+	xfs_lock_two_inodes(ip, tip, XFS_ILOCK_EXCL);
+	lock_flags |= XFS_ILOCK_EXCL;
+	xfs_trans_ijoin(tp, ip, 0);
+	xfs_trans_ijoin(tp, tip, 0);
+
+
+	/* Verify all data are being swapped */
+	if (sxp->sx_offset != 0 ||
+	    sxp->sx_length != ip->i_d.di_size ||
+	    sxp->sx_length != tip->i_d.di_size) {
+		error = -EFAULT;
+		goto out_trans_cancel;
+	}
+
+	trace_xfs_swap_extent_before(ip, 0);
+	trace_xfs_swap_extent_before(tip, 1);
+
+	/* check inode formats now that data is flushed */
+	error = xfs_swap_extents_check_format(ip, tip);
+	if (error) {
+		xfs_notice(mp,
+		    "%s: inode 0x%llx format is incompatible for exchanging.",
+				__func__, ip->i_ino);
+		goto out_trans_cancel;
+	}
+
+	/*
+	 * Compare the current change & modify times with that
+	 * passed in.  If they differ, we abort this swap.
+	 * This is the mechanism used to ensure the calling
+	 * process that the file was not changed out from
+	 * under it.
+	 */
+	if ((sbp->bs_ctime.tv_sec != VFS_I(ip)->i_ctime.tv_sec) ||
+	    (sbp->bs_ctime.tv_nsec != VFS_I(ip)->i_ctime.tv_nsec) ||
+	    (sbp->bs_mtime.tv_sec != VFS_I(ip)->i_mtime.tv_sec) ||
+	    (sbp->bs_mtime.tv_nsec != VFS_I(ip)->i_mtime.tv_nsec)) {
+		error = -EBUSY;
+		goto out_trans_cancel;
+	}
+
+	/*
+	 * Note the trickiness in setting the log flags - we set the owner log
+	 * flag on the opposite inode (i.e. the inode we are setting the new
+	 * owner to be) because once we swap the forks and log that, log
+	 * recovery is going to see the fork as owned by the swapped inode,
+	 * not the pre-swapped inodes.
+	 */
+	src_log_flags = XFS_ILOG_CORE;
+	target_log_flags = XFS_ILOG_CORE;
+
+	error = xfs_swap_extent_forks(tp, ip, tip, &src_log_flags,
+			&target_log_flags);
+	if (error)
+		goto out_trans_cancel;
+
 	/* Do we have to swap reflink flags? */
 	if ((ip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK) ^
 	    (tip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK)) {
@@ -1901,16 +1914,16 @@ xfs_swap_extents(
 
 	trace_xfs_swap_extent_after(ip, 0);
 	trace_xfs_swap_extent_after(tip, 1);
-out:
-	kmem_free(tempifp);
-	return error;
 
-out_unlock:
 	xfs_iunlock(ip, lock_flags);
 	xfs_iunlock(tip, lock_flags);
-	goto out;
+	return error;
 
 out_trans_cancel:
 	xfs_trans_cancel(tp);
-	goto out;
+
+out_unlock:
+	xfs_iunlock(ip, lock_flags);
+	xfs_iunlock(tip, lock_flags);
+	return error;
 }



* [PATCH 63/63] xfs: implement swapext for rmap filesystems
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (61 preceding siblings ...)
  2016-09-30  3:12 ` [PATCH 62/63] xfs: refactor swapext code Darrick J. Wong
@ 2016-09-30  3:12 ` Darrick J. Wong
  2016-09-30  9:00 ` [PATCH v10 00/63] xfs: add reflink and dedupe support Christoph Hellwig
  63 siblings, 0 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30  3:12 UTC (permalink / raw)
  To: david, darrick.wong; +Cc: linux-xfs, linux-fsdevel

Implement swapext for filesystems that have reverse mapping.  Back in
the reflink patches, we augmented the bmap code with a 'REMAP' flag
that updates only the bmbt without touching the allocator, and we
implemented log redo items for the map and unmap operations.  Now we
can rewrite extent swapping as a (long) series of remap operations.

This is far less efficient than the fork swapping method implemented
previously, so we only switch this on for rmap filesystems.
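
The remap loop can be sketched in userspace C (a toy model, not the kernel
implementation: each "fork" is flattened to a logical-to-physical block
array, so the unmap-both/cross-map sequence collapses to an exchange):

```c
#include <assert.h>

/*
 * Toy model of xfs_swap_extent_rmap(): walk the donor file's mappings,
 * trim each extent against what remains of the file, and exchange the
 * trimmed pieces between the two files, one extent at a time.
 */
static void swap_extents_by_remap(unsigned long *src, unsigned long *donor,
				  unsigned long nblks, unsigned long extlen)
{
	unsigned long off = 0;

	while (off < nblks) {
		/* Trim the extent. */
		unsigned long len = extlen < nblks - off ? extlen : nblks - off;

		for (unsigned long i = 0; i < len; i++) {
			/* unmap from both files, then cross-map */
			unsigned long tmp = src[off + i];
			src[off + i] = donor[off + i];
			donor[off + i] = tmp;
		}

		/* Roll on... */
		off += len;
	}
}
```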

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_trans_space.h |    9 ++
 fs/xfs/xfs_bmap_util.c          |  162 ++++++++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_trace.h              |    5 +
 3 files changed, 173 insertions(+), 3 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_trans_space.h b/fs/xfs/libxfs/xfs_trans_space.h
index 41e0428..7917f6e 100644
--- a/fs/xfs/libxfs/xfs_trans_space.h
+++ b/fs/xfs/libxfs/xfs_trans_space.h
@@ -21,6 +21,8 @@
 /*
  * Components of space reservations.
  */
+#define XFS_MAX_CONTIG_RMAPS_PER_BLOCK(mp)    \
+		(((mp)->m_rmap_mxr[0]) - ((mp)->m_rmap_mnr[0]))
 #define XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp)    \
 		(((mp)->m_alloc_mxr[0]) - ((mp)->m_alloc_mnr[0]))
 #define	XFS_EXTENTADD_SPACE_RES(mp,w)	(XFS_BM_MAXLEVELS(mp,w) - 1)
@@ -28,6 +30,13 @@
 	(((b + XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp) - 1) / \
 	  XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp)) * \
 	  XFS_EXTENTADD_SPACE_RES(mp,w))
+#define XFS_SWAP_RMAP_SPACE_RES(mp,b,w)\
+	(((b + XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp) - 1) / \
+	  XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp)) * \
+	  XFS_EXTENTADD_SPACE_RES(mp,w) + \
+	 ((b + XFS_MAX_CONTIG_RMAPS_PER_BLOCK(mp) - 1) / \
+	  XFS_MAX_CONTIG_RMAPS_PER_BLOCK(mp)) * \
+	  (mp)->m_rmap_maxlevels)
 #define	XFS_DAENTER_1B(mp,w)	\
 	((w) == XFS_DATA_FORK ? (mp)->m_dir_geo->fsbcount : 1)
 #define	XFS_DAENTER_DBS(mp,w)	\
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index b5564f2..0bd217f 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1573,6 +1573,13 @@ xfs_swap_extents_check_format(
 		return -EINVAL;
 
 	/*
+	 * If we have to use the (expensive) rmap swap method, we can
+	 * handle any number of extents and any format.
+	 */
+	if (xfs_sb_version_hasrmapbt(&ip->i_mount->m_sb))
+		return 0;
+
+	/*
 	 * if the target inode is in extent form and the temp inode is in btree
 	 * form then we will end up with the target inode in the wrong format
 	 * as we already know there are less extents in the temp inode.
@@ -1641,6 +1648,130 @@ xfs_swap_extent_flush(
 	return 0;
 }
 
+/*
+ * Move extents from one file to another, when rmap is enabled.
+ */
+STATIC int
+xfs_swap_extent_rmap(
+	struct xfs_trans		**tpp,
+	struct xfs_inode		*ip,
+	struct xfs_inode		*tip)
+{
+	struct xfs_bmbt_irec		irec;
+	struct xfs_bmbt_irec		uirec;
+	struct xfs_bmbt_irec		tirec;
+	xfs_fileoff_t			offset_fsb;
+	xfs_fileoff_t			end_fsb;
+	xfs_filblks_t			count_fsb;
+	xfs_fsblock_t			firstfsb;
+	struct xfs_defer_ops		dfops;
+	int				error;
+	xfs_filblks_t			ilen;
+	xfs_filblks_t			rlen;
+	int				nimaps;
+	__uint64_t			tip_flags2;
+
+	/*
+	 * If the source file has shared blocks, we must flag the donor
+	 * file as having shared blocks so that we get the shared-block
+	 * rmap functions when we go to fix up the rmaps.  The flags
+	 * will be switched for real later.
+	 */
+	tip_flags2 = tip->i_d.di_flags2;
+	if (ip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK)
+		tip->i_d.di_flags2 |= XFS_DIFLAG2_REFLINK;
+
+	offset_fsb = 0;
+	end_fsb = XFS_B_TO_FSB(ip->i_mount, i_size_read(VFS_I(ip)));
+	count_fsb = (xfs_filblks_t)(end_fsb - offset_fsb);
+
+	while (count_fsb) {
+		/* Read extent from the donor file */
+		nimaps = 1;
+		error = xfs_bmapi_read(tip, offset_fsb, count_fsb, &tirec,
+				&nimaps, 0);
+		if (error)
+			goto out;
+		ASSERT(nimaps == 1);
+		ASSERT(tirec.br_startblock != DELAYSTARTBLOCK);
+
+		trace_xfs_swap_extent_rmap_remap(tip, &tirec);
+		ilen = tirec.br_blockcount;
+
+		/* Unmap the old blocks in the source file. */
+		while (tirec.br_blockcount) {
+			xfs_defer_init(&dfops, &firstfsb);
+			trace_xfs_swap_extent_rmap_remap_piece(tip, &tirec);
+
+			/* Read extent from the source file */
+			nimaps = 1;
+			error = xfs_bmapi_read(ip, tirec.br_startoff,
+					tirec.br_blockcount, &irec,
+					&nimaps, 0);
+			if (error)
+				goto out_defer;
+			ASSERT(nimaps == 1);
+			ASSERT(tirec.br_startoff == irec.br_startoff);
+			trace_xfs_swap_extent_rmap_remap_piece(ip, &irec);
+
+			/* Trim the extent. */
+			uirec = tirec;
+			uirec.br_blockcount = rlen = min_t(xfs_filblks_t,
+					tirec.br_blockcount,
+					irec.br_blockcount);
+			trace_xfs_swap_extent_rmap_remap_piece(tip, &uirec);
+
+			/* Remove the mapping from the donor file. */
+			error = xfs_bmap_unmap_extent((*tpp)->t_mountp, &dfops,
+					tip, XFS_DATA_FORK, &uirec);
+			if (error)
+				goto out_defer;
+
+			/* Remove the mapping from the source file. */
+			error = xfs_bmap_unmap_extent((*tpp)->t_mountp, &dfops,
+					ip, XFS_DATA_FORK, &irec);
+			if (error)
+				goto out_defer;
+
+			/* Map the donor file's blocks into the source file. */
+			error = xfs_bmap_map_extent((*tpp)->t_mountp, &dfops,
+					ip, XFS_DATA_FORK, &uirec);
+			if (error)
+				goto out_defer;
+
+			/* Map the source file's blocks into the donor file. */
+			error = xfs_bmap_map_extent((*tpp)->t_mountp, &dfops,
+					tip, XFS_DATA_FORK, &irec);
+			if (error)
+				goto out_defer;
+
+			error = xfs_defer_finish(tpp, &dfops, ip);
+			if (error)
+				goto out_defer;
+
+			tirec.br_startoff += rlen;
+			if (tirec.br_startblock != HOLESTARTBLOCK &&
+			    tirec.br_startblock != DELAYSTARTBLOCK)
+				tirec.br_startblock += rlen;
+			tirec.br_blockcount -= rlen;
+		}
+
+		/* Roll on... */
+		count_fsb -= ilen;
+		offset_fsb += ilen;
+	}
+
+	tip->i_d.di_flags2 = tip_flags2;
+	return 0;
+
+out_defer:
+	xfs_defer_cancel(&dfops);
+out:
+	trace_xfs_swap_extent_rmap_error(ip, error, _RET_IP_);
+	tip->i_d.di_flags2 = tip_flags2;
+	return error;
+}
+
 /* Swap the extents of two files by swapping data forks. */
 STATIC int
 xfs_swap_extent_forks(
@@ -1791,6 +1922,7 @@ xfs_swap_extents(
 	int			lock_flags;
 	struct xfs_ifork	*cowfp;
 	__uint64_t		f;
+	int			resblks;
 
 	/*
 	 * Lock the inodes against other IO, page faults and truncate to
@@ -1821,7 +1953,28 @@ xfs_swap_extents(
 	if (error)
 		goto out_unlock;
 
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_ichange, 0, 0, 0, &tp);
+	/*
+	 * Extent "swapping" with rmap requires a permanent reservation and
+	 * a block reservation because it's really just a remap operation
+	 * performed with log redo items!
+	 */
+	if (xfs_sb_version_hasrmapbt(&mp->m_sb)) {
+		/*
+		 * Conceptually this shouldn't affect the shape of either
+		 * bmbt, but since we atomically move extents one by one,
+		 * we reserve enough space to rebuild both trees.
+		 */
+		resblks = XFS_SWAP_RMAP_SPACE_RES(mp,
+				XFS_IFORK_NEXTENTS(ip, XFS_DATA_FORK),
+				XFS_DATA_FORK) +
+			  XFS_SWAP_RMAP_SPACE_RES(mp,
+				XFS_IFORK_NEXTENTS(tip, XFS_DATA_FORK),
+				XFS_DATA_FORK);
+		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks,
+				0, 0, &tp);
+	} else
+		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_ichange, 0,
+				0, 0, &tp);
 	if (error)
 		goto out_unlock;
 
@@ -1880,8 +2033,11 @@ xfs_swap_extents(
 	src_log_flags = XFS_ILOG_CORE;
 	target_log_flags = XFS_ILOG_CORE;
 
-	error = xfs_swap_extent_forks(tp, ip, tip, &src_log_flags,
-			&target_log_flags);
+	if (xfs_sb_version_hasrmapbt(&mp->m_sb))
+		error = xfs_swap_extent_rmap(&tp, ip, tip);
+	else
+		error = xfs_swap_extent_forks(tp, ip, tip, &src_log_flags,
+				&target_log_flags);
 	if (error)
 		goto out_trans_cancel;
 
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 30778ad..f7f104c 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -3370,6 +3370,11 @@ DEFINE_INODE_EVENT(xfs_reflink_cancel_pending_cow);
 DEFINE_INODE_IREC_EVENT(xfs_reflink_cancel_cow);
 DEFINE_INODE_ERROR_EVENT(xfs_reflink_cancel_pending_cow_error);
 
+/* rmap swapext tracepoints */
+DEFINE_INODE_IREC_EVENT(xfs_swap_extent_rmap_remap);
+DEFINE_INODE_IREC_EVENT(xfs_swap_extent_rmap_remap_piece);
+DEFINE_INODE_ERROR_EVENT(xfs_swap_extent_rmap_error);
+
 #endif /* _TRACE_XFS_H */
 
 #undef TRACE_INCLUDE_PATH



* Re: [PATCH 02/63] vfs: add a FALLOC_FL_UNSHARE mode to fallocate to unshare a range of blocks
  2016-09-30  3:05 ` [PATCH 02/63] vfs: add a FALLOC_FL_UNSHARE mode to fallocate to unshare a range of blocks Darrick J. Wong
@ 2016-09-30  7:08   ` Christoph Hellwig
  0 siblings, 0 replies; 187+ messages in thread
From: Christoph Hellwig @ 2016-09-30  7:08 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, linux-fsdevel

On Thu, Sep 29, 2016 at 08:05:48PM -0700, Darrick J. Wong wrote:
> Add a new fallocate mode flag that explicitly unshares blocks on
> filesystems that support such features.  The new flag can only
> be used with an allocate-mode fallocate call.

We will also need a manpage addition for this.  And while we're at
it, an explanation of how fallocate behaves on CoW files for btrfs,
ocfs2 and XFS, as I don't think we are very coherent or obvious at
the moment.

Well, maybe that's just future work.  For now:

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 13/63] xfs: adjust refcount of an extent of blocks in refcount btree
  2016-09-30  3:06 ` [PATCH 13/63] xfs: adjust refcount of an extent of blocks in refcount btree Darrick J. Wong
@ 2016-09-30  7:11   ` Christoph Hellwig
  2016-09-30 17:53     ` Darrick J. Wong
  0 siblings, 1 reply; 187+ messages in thread
From: Christoph Hellwig @ 2016-09-30  7:11 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, linux-fsdevel

> +#define RCNEXT(rc)	((rc).rc_startblock + (rc).rc_blockcount)

> +#define IS_VALID_RCEXT(ext)	((ext).rc_startblock != NULLAGBLOCK)

I would turn these into inline helpers with nice lower case names.
Not critical for the merge, though, so:

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 14/63] xfs: connect refcount adjust functions to upper layers
  2016-09-30  3:07 ` [PATCH 14/63] xfs: connect refcount adjust functions to upper layers Darrick J. Wong
@ 2016-09-30  7:13   ` Christoph Hellwig
  2016-09-30 16:21   ` Brian Foster
  1 sibling, 0 replies; 187+ messages in thread
From: Christoph Hellwig @ 2016-09-30  7:13 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, linux-fsdevel

>  DEFINE_AG_EXTENT_EVENT(xfs_refcount_find_shared_result);
>  DEFINE_AG_ERROR_EVENT(xfs_refcount_find_shared_error);
> +#define DEFINE_REFCOUNT_DEFERRED_EVENT DEFINE_PHYS_EXTENT_DEFERRED_EVENT
> +DEFINE_REFCOUNT_DEFERRED_EVENT(xfs_refcount_defer);
> +DEFINE_REFCOUNT_DEFERRED_EVENT(xfs_refcount_deferred);

What is the value add of the #define above?

Otherwise this looks fine to me:

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 15/63] xfs: adjust refcount when unmapping file blocks
  2016-09-30  3:07 ` [PATCH 15/63] xfs: adjust refcount when unmapping file blocks Darrick J. Wong
@ 2016-09-30  7:14   ` Christoph Hellwig
  0 siblings, 0 replies; 187+ messages in thread
From: Christoph Hellwig @ 2016-09-30  7:14 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, linux-fsdevel

On Thu, Sep 29, 2016 at 08:07:13PM -0700, Darrick J. Wong wrote:
> When we're unmapping blocks from a reflinked file, decrease the
> refcount of the affected blocks and free the extents that are no
> longer in use.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 16/63] xfs: add refcount btree block detection to log recovery
  2016-09-30  3:07 ` [PATCH 16/63] xfs: add refcount btree block detection to log recovery Darrick J. Wong
@ 2016-09-30  7:15   ` Christoph Hellwig
  0 siblings, 0 replies; 187+ messages in thread
From: Christoph Hellwig @ 2016-09-30  7:15 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, linux-fsdevel

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 17/63] xfs: refcount btree requires more reserved space
  2016-09-30  3:07 ` [PATCH 17/63] xfs: refcount btree requires more reserved space Darrick J. Wong
@ 2016-09-30  7:15   ` Christoph Hellwig
  2016-09-30 16:46   ` Brian Foster
  1 sibling, 0 replies; 187+ messages in thread
From: Christoph Hellwig @ 2016-09-30  7:15 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, linux-fsdevel

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 18/63] xfs: introduce reflink utility functions
  2016-09-30  3:07   ` Darrick J. Wong
  (?)
@ 2016-09-30  7:16   ` Christoph Hellwig
  -1 siblings, 0 replies; 187+ messages in thread
From: Christoph Hellwig @ 2016-09-30  7:16 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, linux-fsdevel

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 22/63] xfs: pass bmapi flags through to bmap_del_extent
  2016-09-30  3:08 ` [PATCH 22/63] xfs: pass bmapi flags through to bmap_del_extent Darrick J. Wong
@ 2016-09-30  7:16   ` Christoph Hellwig
  0 siblings, 0 replies; 187+ messages in thread
From: Christoph Hellwig @ 2016-09-30  7:16 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, linux-fsdevel

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 25/63] xfs: return work remaining at the end of a bunmapi operation
  2016-09-30  3:08 ` [PATCH 25/63] xfs: return work remaining at the end of a bunmapi operation Darrick J. Wong
@ 2016-09-30  7:19   ` Christoph Hellwig
  2016-10-03 19:04   ` Brian Foster
  1 sibling, 0 replies; 187+ messages in thread
From: Christoph Hellwig @ 2016-09-30  7:19 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, linux-fsdevel

On Thu, Sep 29, 2016 at 08:08:23PM -0700, Darrick J. Wong wrote:
> Return the range of file blocks that bunmapi didn't free.  This hint
> is used by CoW and reflink to figure out what part of an extent
> actually got freed so that it can set up the appropriate atomic
> remapping of just the freed range.

FYI, I'd much prefer having xfs_bunmapi use the new __xfs_bunmapi
calling convention and use that everywhere.

Not really urgent enough to block the merge for now, especially as I
have some major xfs_bunmapi surgery on my plate anyway and can take
care of that in the next merge window.
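The calling-convention change under discussion can be modeled in userspace. This is a toy sketch only: the names, types, and the "busy block" stand-in below are illustrative and bear no relation to the real XFS signatures.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model of the __xfs_bunmapi-style convention: the core function
 * takes the length by reference and writes back how much of the range
 * it could NOT free; a thin wrapper preserves the old "done flag"
 * interface.
 */
static int
core_bunmapi(
	uint64_t	start,
	uint64_t	*len,
	uint64_t	busy_from)	/* first block we pretend is busy */
{
	uint64_t	end = start + *len;

	if (busy_from >= start && busy_from < end)
		*len = end - busy_from;	/* remaining, unfreed range */
	else
		*len = 0;		/* everything was freed */
	return 0;
}

/* Old-style wrapper: reports only whether the whole range went away. */
static int
compat_bunmapi(
	uint64_t	start,
	uint64_t	len,
	int		*done)
{
	int		error = core_bunmapi(start, &len, UINT64_MAX);

	*done = (len == 0);
	return error;
}
```

The remaining-length return is what lets the CoW/reflink callers set up a remap for exactly the freed portion.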

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 26/63] xfs: define tracepoints for reflink activities
  2016-09-30  3:08 ` [PATCH 26/63] xfs: define tracepoints for reflink activities Darrick J. Wong
@ 2016-09-30  7:20   ` Christoph Hellwig
  0 siblings, 0 replies; 187+ messages in thread
From: Christoph Hellwig @ 2016-09-30  7:20 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, linux-fsdevel

On Thu, Sep 29, 2016 at 08:08:29PM -0700, Darrick J. Wong wrote:
> Define all the tracepoints we need to inspect the runtime operation
> of reflink/dedupe/copy-on-write.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 27/63] xfs: add reflink feature flag to geometry
  2016-09-30  3:08 ` [PATCH 27/63] xfs: add reflink feature flag to geometry Darrick J. Wong
@ 2016-09-30  7:20   ` Christoph Hellwig
  0 siblings, 0 replies; 187+ messages in thread
From: Christoph Hellwig @ 2016-09-30  7:20 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, linux-fsdevel

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 28/63] xfs: don't allow reflinked dir/dev/fifo/socket/pipe files
  2016-09-30  3:08 ` [PATCH 28/63] xfs: don't allow reflinked dir/dev/fifo/socket/pipe files Darrick J. Wong
@ 2016-09-30  7:20   ` Christoph Hellwig
  0 siblings, 0 replies; 187+ messages in thread
From: Christoph Hellwig @ 2016-09-30  7:20 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, linux-fsdevel

On Thu, Sep 29, 2016 at 08:08:42PM -0700, Darrick J. Wong wrote:
> Only non-rt files can be reflinked, so check that when we load an
> inode.  Also, don't leak the attr fork if there's a failure.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 19/63] xfs: create bmbt update intent log items
  2016-09-30  3:07 ` [PATCH 19/63] xfs: create bmbt update intent log items Darrick J. Wong
@ 2016-09-30  7:24   ` Christoph Hellwig
  2016-09-30 17:24     ` Darrick J. Wong
  0 siblings, 1 reply; 187+ messages in thread
From: Christoph Hellwig @ 2016-09-30  7:24 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, linux-fsdevel

On Thu, Sep 29, 2016 at 08:07:39PM -0700, Darrick J. Wong wrote:
> Create bmbt update intent/done log items to record redo information in
> the log.  Because we roll transactions multiple times for reflink
> operations, we also have to track the status of the metadata updates
> that will be recorded in the post-roll transactions in case we crash
> before committing the final transaction.  This mechanism enables log
> recovery to finish what was already started.

Looks fine:

Reviewed-by: Christoph Hellwig <hch@lst.de>

but the amount of boilerplate code we add for each log item really
worries me.  We should think of a way to avoid all that code
duplication.
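As background for the intent/done pairing the patch adds: the idea is to log an intent before a multi-transaction operation starts and a done item when it completes, so recovery can replay anything left unfinished. A toy userspace model (purely illustrative, no relation to the on-disk log format):

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_ITEMS	16

/* Toy "log": which operations have an intent, and which completed. */
struct toy_log {
	int	intents[MAX_ITEMS];
	bool	done[MAX_ITEMS];
	int	n;
};

/* Runtime: log an intent before starting a multi-transaction op... */
static void
log_intent(struct toy_log *l, int id)
{
	l->intents[l->n] = id;
	l->done[l->n] = false;
	l->n++;
}

/* ...and a done item once the operation completes. */
static void
log_done(struct toy_log *l, int id)
{
	for (int i = 0; i < l->n; i++)
		if (l->intents[i] == id)
			l->done[i] = true;
}

/* Recovery: count (i.e. would replay) intents that never finished. */
static int
recover_unfinished(const struct toy_log *l)
{
	int	unfinished = 0;

	for (int i = 0; i < l->n; i++)
		if (!l->done[i])
			unfinished++;
	return unfinished;
}
```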


* Re: [PATCH 20/63] xfs: log bmap intent items
  2016-09-30  3:07 ` [PATCH 20/63] xfs: log bmap intent items Darrick J. Wong
@ 2016-09-30  7:26   ` Christoph Hellwig
  2016-09-30 17:26     ` Darrick J. Wong
  2016-09-30 19:22   ` Brian Foster
  1 sibling, 1 reply; 187+ messages in thread
From: Christoph Hellwig @ 2016-09-30  7:26 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, linux-fsdevel

> +/* The XFS_BMAP_EXTENT_* in xfs_log_format.h must match these. */
> +enum xfs_bmap_intent_type {
> +	XFS_BMAP_MAP = 1,
> +	XFS_BMAP_UNMAP,
> +};

Meh.  Please just use it directly then.

Otherwise this looks ok, so a conditional:

Reviewed-by: Christoph Hellwig <hch@lst.de>

based on fixing the above.

* Re: [PATCH 21/63] xfs: map an inode's offset to an exact physical block
  2016-09-30  3:07 ` [PATCH 21/63] xfs: map an inode's offset to an exact physical block Darrick J. Wong
@ 2016-09-30  7:31   ` Christoph Hellwig
  2016-09-30 17:30     ` Darrick J. Wong
  2016-10-03 19:03   ` Brian Foster
  1 sibling, 1 reply; 187+ messages in thread
From: Christoph Hellwig @ 2016-09-30  7:31 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, linux-fsdevel

> + * For a remap operation, just "allocate" an extent at the address that the
> + * caller passed in, and ensure that the AGFL is the right size.  The caller
> + * will then map the "allocated" extent into the file somewhere.
> + */
> +STATIC int
> +xfs_bmap_remap_alloc(
> +	struct xfs_bmalloca	*ap)
> +{
> +	struct xfs_trans	*tp = ap->tp;
> +	struct xfs_mount	*mp = tp->t_mountp;
> +	xfs_agblock_t		bno;
> +	struct xfs_alloc_arg	args;
> +	int			error;
> +
> +	/*
> +	 * validate that the block number is legal - this enables us to detect
> +	 * and handle a silent filesystem corruption rather than crashing.
> +	 */
> +	memset(&args, 0, sizeof(struct xfs_alloc_arg));
> +	args.tp = ap->tp;
> +	args.mp = ap->tp->t_mountp;
> +	bno = *ap->firstblock;
> +	args.agno = XFS_FSB_TO_AGNO(mp, bno);
> +	ASSERT(args.agno < mp->m_sb.sb_agcount);
> +	args.agbno = XFS_FSB_TO_AGBNO(mp, bno);
> +	ASSERT(args.agbno < mp->m_sb.sb_agblocks);

Shouldn't this return -EFSCORRUPTED instead?  Otherwise the comment
above isn't really true.
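A hedged sketch of what the suggested fix might look like, with made-up types and limits (the EFSCORRUPTED value is shown only for illustration):

```c
#include <assert.h>
#include <stdint.h>

#define EFSCORRUPTED	117	/* Linux value, shown for illustration */

/* Stand-in for the superblock geometry limits. */
struct fake_sb {
	uint32_t	agcount;	/* number of allocation groups */
	uint32_t	agblocks;	/* blocks per allocation group */
};

/*
 * Validate the remap target at runtime and return -EFSCORRUPTED rather
 * than only ASSERTing, so silent corruption is caught on non-DEBUG
 * kernels too.
 */
static int
validate_remap_target(
	const struct fake_sb	*sb,
	uint32_t		agno,
	uint32_t		agbno)
{
	if (agno >= sb->agcount || agbno >= sb->agblocks)
		return -EFSCORRUPTED;
	return 0;
}
```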

Otherwise this looks fine to me:

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 23/63] xfs: implement deferred bmbt map/unmap operations
  2016-09-30  3:08 ` [PATCH 23/63] xfs: implement deferred bmbt map/unmap operations Darrick J. Wong
@ 2016-09-30  7:34   ` Christoph Hellwig
  2016-09-30 17:38     ` Darrick J. Wong
  0 siblings, 1 reply; 187+ messages in thread
From: Christoph Hellwig @ 2016-09-30  7:34 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, linux-fsdevel

> +/* Deferred mapping is only for real extents in the data fork. */
> +static bool
> +xfs_bmap_is_update_needed(
> +	int			whichfork,
> +	struct xfs_bmbt_irec	*bmap)
> +{
> +	ASSERT(whichfork == XFS_DATA_FORK);
> +
> +	return  bmap->br_startblock != HOLESTARTBLOCK &&
> +		bmap->br_startblock != DELAYSTARTBLOCK;
> +}

Passing in an argument just to assert on it seems weird.
Apart from that, a better name might be xfs_bmbt_is_real or similar,
and I bet we'd have other users for it as well.
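A minimal userspace sketch of the suggested xfs_bmbt_is_real predicate; the sentinel values below are stand-ins for XFS's HOLESTARTBLOCK/DELAYSTARTBLOCK, not the real definitions:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for the real sentinel startblock values. */
#define HOLESTARTBLOCK	((uint64_t)-2)
#define DELAYSTARTBLOCK	((uint64_t)-1)

struct fake_bmbt_irec {
	uint64_t	br_startblock;
};

/* True only for a real, allocated mapping (not a hole or delalloc). */
static int
xfs_bmbt_is_real(
	const struct fake_bmbt_irec	*bmap)
{
	return bmap->br_startblock != HOLESTARTBLOCK &&
	       bmap->br_startblock != DELAYSTARTBLOCK;
}
```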

Otherwise this looks fine to me:

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 24/63] xfs: when replaying bmap operations, don't let unlinked inodes get reaped
  2016-09-30  3:08 ` [PATCH 24/63] xfs: when replaying bmap operations, don't let unlinked inodes get reaped Darrick J. Wong
@ 2016-09-30  7:35   ` Christoph Hellwig
  2016-10-03 19:04   ` Brian Foster
  1 sibling, 0 replies; 187+ messages in thread
From: Christoph Hellwig @ 2016-09-30  7:35 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, linux-fsdevel

On Thu, Sep 29, 2016 at 08:08:17PM -0700, Darrick J. Wong wrote:
> Log recovery will iget an inode to replay BUI items and iput the inode
> when it's done.  Unfortunately, the iput will see that i_nlink == 0
> and decide to truncate & free the inode, which prevents us from
> replaying subsequent BUIs.  We can't skip the BUIs because we have to
> replay all the redo items to ensure that atomic operations complete.
> 
> Since unlinked inode recovery will reap the inode anyway, we can
> safely introduce a new inode flag to indicate that an inode is in this
> 'unlinked recovery' state and should not be auto-reaped in the
> drop_inode path.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 29/63] xfs: introduce the CoW fork
  2016-09-30  3:08 ` [PATCH 29/63] xfs: introduce the CoW fork Darrick J. Wong
@ 2016-09-30  7:39   ` Christoph Hellwig
  2016-09-30 17:48     ` Darrick J. Wong
  0 siblings, 1 reply; 187+ messages in thread
From: Christoph Hellwig @ 2016-09-30  7:39 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, linux-fsdevel

> +/* XFS_IEXT_STATE_TO_FORK() -- Convert BMAP state flags to an inode fork. */
> +xfs_ifork_t *
> +XFS_IEXT_STATE_TO_FORK(
> +	struct xfs_inode	*ip,
> +	int			state)
> +{
> +	if (state & BMAP_COWFORK)
> +		return ip->i_cowfp;
> +	else if (state & BMAP_ATTRFORK)
> +		return ip->i_afp;
> +	return &ip->i_df;
> +}

Would be nice to have a lower-case name for this.  And while we're
at it, drop the duplicated function name at the top of the function
comment.
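The lower-case rename being asked for might look roughly like this; the inode and fork types are simplified stand-ins, and the flag values are illustrative:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative flag values, not the real BMAP_* state bits. */
#define BMAP_ATTRFORK	(1 << 0)
#define BMAP_COWFORK	(1 << 1)

struct fake_ifork { int dummy; };

struct fake_inode {
	struct fake_ifork	df;	/* data fork, always present */
	struct fake_ifork	*afp;	/* attr fork, may be NULL */
	struct fake_ifork	*cowfp;	/* CoW fork, may be NULL */
};

/* Map BMAP state flags to the corresponding inode fork. */
static struct fake_ifork *
xfs_iext_state_to_fork(
	struct fake_inode	*ip,
	int			state)
{
	if (state & BMAP_COWFORK)
		return ip->cowfp;
	if (state & BMAP_ATTRFORK)
		return ip->afp;
	return &ip->df;
}
```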

Otherwise looks fine:

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 30/63] xfs: support bmapping delalloc extents in the CoW fork
  2016-09-30  3:08 ` [PATCH 30/63] xfs: support bmapping delalloc extents in " Darrick J. Wong
@ 2016-09-30  7:42   ` Christoph Hellwig
  0 siblings, 0 replies; 187+ messages in thread
From: Christoph Hellwig @ 2016-09-30  7:42 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, linux-fsdevel

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 32/63] xfs: support allocating delayed extents in CoW fork
  2016-09-30  3:09 ` [PATCH 32/63] xfs: support allocating delayed " Darrick J. Wong
@ 2016-09-30  7:42   ` Christoph Hellwig
  2016-10-04 16:38   ` Brian Foster
  1 sibling, 0 replies; 187+ messages in thread
From: Christoph Hellwig @ 2016-09-30  7:42 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, linux-fsdevel

On Thu, Sep 29, 2016 at 08:09:08PM -0700, Darrick J. Wong wrote:
> Modify xfs_bmap_add_extent_delay_real() so that we can convert delayed
> allocation extents in the CoW fork to real allocations, and wire this
> up all the way back to xfs_iomap_write_allocate().  In a subsequent
> patch, we'll modify the writepage handler to call this.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 34/63] xfs: support removing extents from CoW fork
  2016-09-30  3:09 ` [PATCH 34/63] xfs: support removing extents from " Darrick J. Wong
@ 2016-09-30  7:46   ` Christoph Hellwig
  2016-09-30 18:00     ` Darrick J. Wong
  2016-10-05 18:26   ` Brian Foster
  1 sibling, 1 reply; 187+ messages in thread
From: Christoph Hellwig @ 2016-09-30  7:46 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, linux-fsdevel

>  /*
> + * xfs_bunmapi_cow() -- Remove the relevant parts of the CoW fork.
> + *			See xfs_bmap_del_extent.
> + * @ip: XFS inode.
> + * @idx: Extent number to delete.
> + * @del: Extent to remove.
> + */

Not quite a kerneldoc comment, not quite a normal one either.

That being said, the function seems mostly like a copy of the delay
part of xfs_bmap_del_extent, and the duplication seems unfortunate.

As I plan to do a major rework of that area for the next merge
window I don't mind it for now, though.  

> +	xfs_bmbt_irec_t		*del)

Same for these uses of the old typedefs in new code.  I'd rather
avoid those, but instead of introducing churn now I'll sort this
out later.

So:

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 38/63] xfs: cancel CoW reservations and clear inode reflink flag when freeing blocks
  2016-09-30  3:09 ` [PATCH 38/63] xfs: cancel CoW reservations and clear inode reflink flag when freeing blocks Darrick J. Wong
@ 2016-09-30  7:47   ` Christoph Hellwig
  2016-10-06 16:44   ` Brian Foster
  1 sibling, 0 replies; 187+ messages in thread
From: Christoph Hellwig @ 2016-09-30  7:47 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, linux-fsdevel

On Thu, Sep 29, 2016 at 08:09:46PM -0700, Darrick J. Wong wrote:
> When we're freeing blocks (truncate, punch, etc.), clear all CoW
> reservations in the range being freed.  If the file block count
> drops to zero, also clear the inode reflink flag.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

Looks fine:

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 39/63] xfs: cancel pending CoW reservations when destroying inodes
  2016-09-30  3:09 ` [PATCH 39/63] xfs: cancel pending CoW reservations when destroying inodes Darrick J. Wong
@ 2016-09-30  7:47   ` Christoph Hellwig
  2016-10-06 16:44   ` Brian Foster
  1 sibling, 0 replies; 187+ messages in thread
From: Christoph Hellwig @ 2016-09-30  7:47 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, linux-fsdevel

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 40/63] xfs: store in-progress CoW allocations in the refcount btree
  2016-09-30  3:09 ` [PATCH 40/63] xfs: store in-progress CoW allocations in the refcount btree Darrick J. Wong
@ 2016-09-30  7:49   ` Christoph Hellwig
  2016-10-07 18:04   ` Brian Foster
  1 sibling, 0 replies; 187+ messages in thread
From: Christoph Hellwig @ 2016-09-30  7:49 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, linux-fsdevel

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 41/63] xfs: reflink extents from one file to another
  2016-09-30  3:10 ` [PATCH 41/63] xfs: reflink extents from one file to another Darrick J. Wong
@ 2016-09-30  7:50   ` Christoph Hellwig
  2016-10-07 18:04   ` Brian Foster
  1 sibling, 0 replies; 187+ messages in thread
From: Christoph Hellwig @ 2016-09-30  7:50 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, linux-fsdevel

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 42/63] xfs: add clone file and clone range vfs functions
  2016-09-30  3:10 ` [PATCH 42/63] xfs: add clone file and clone range vfs functions Darrick J. Wong
@ 2016-09-30  7:51   ` Christoph Hellwig
  2016-09-30 18:04     ` Darrick J. Wong
  2016-10-07 18:04   ` Brian Foster
  1 sibling, 1 reply; 187+ messages in thread
From: Christoph Hellwig @ 2016-09-30  7:51 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, linux-fsdevel

On Thu, Sep 29, 2016 at 08:10:14PM -0700, Darrick J. Wong wrote:
> Define two VFS functions which allow userspace to reflink a range of
> blocks between two files or to reflink one file's contents to another.
> These functions fit the new VFS ioctls that standardize the checking
> for the btrfs CLONE and CLONE RANGE ioctls.

FYI, I really believe the way forward is to make sure vfs_copy_range
calls the clone handler first if present and handles all the differences
between the two.  The only thing that held me back from doing that
is the complete lack of test coverage for the copy functionality.
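The dispatch order hch describes can be sketched with hypothetical stand-in types (none of these are the real VFS interfaces, and the errno value is illustrative):

```c
#include <assert.h>
#include <stddef.h>

#define TOY_EOPNOTSUPP	95	/* illustrative errno value */

/* Hypothetical per-fs operations; not the real file_operations. */
struct toy_fops {
	int (*clone_range)(size_t len);	/* NULL if the fs cannot clone */
	int (*copy_range)(size_t len);
};

/*
 * Prefer the clone handler when one exists (it shares blocks and is
 * atomic); fall back to a plain data copy only when cloning is absent
 * or unsupported for this request.
 */
static int
toy_copy_file_range(
	const struct toy_fops	*ops,
	size_t			len)
{
	if (ops->clone_range) {
		int error = ops->clone_range(len);

		if (error != -TOY_EOPNOTSUPP)
			return error;	/* success or a hard failure */
	}
	return ops->copy_range(len);
}
```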

Otherwise this looks fine to me:

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 43/63] xfs: add dedupe range vfs function
  2016-09-30  3:10 ` [PATCH 43/63] xfs: add dedupe range vfs function Darrick J. Wong
@ 2016-09-30  7:53   ` Christoph Hellwig
  0 siblings, 0 replies; 187+ messages in thread
From: Christoph Hellwig @ 2016-09-30  7:53 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, linux-fsdevel

On Thu, Sep 29, 2016 at 08:10:20PM -0700, Darrick J. Wong wrote:
> Define a VFS function which allows userspace to request that the
> kernel reflink a range of blocks between two files if the ranges'
> contents match.  The function fits the new VFS ioctl that standardizes
> the checking for the btrfs EXTENT SAME ioctl.

Nothing in the compare functionality is really XFS-specific, so it
might be a nice idea to share this between btrfs and XFS in the future
and move it to common code.  Except for that this looks fine:

Reviewed-by: Christoph Hellwig <hch@lst.de>
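The generic compare step mentioned above boils down to a chunked byte comparison before the remap is allowed; a userspace sketch, with the chunk size chosen arbitrarily:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CMP_CHUNK	4096	/* arbitrary chunk size for illustration */

/*
 * Compare two candidate ranges chunk by chunk; dedupe must refuse the
 * remap if any byte differs.
 */
static int
dedupe_ranges_identical(
	const unsigned char	*a,
	const unsigned char	*b,
	size_t			len)
{
	size_t			off = 0;

	while (off < len) {
		size_t n = len - off > CMP_CHUNK ? CMP_CHUNK : len - off;

		if (memcmp(a + off, b + off, n) != 0)
			return 0;
		off += n;
	}
	return 1;
}
```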

* Re: [PATCH 44/63] xfs: teach get_bmapx about shared extents and the CoW fork
  2016-09-30  3:10 ` [PATCH 44/63] xfs: teach get_bmapx about shared extents and the CoW fork Darrick J. Wong
@ 2016-09-30  7:53   ` Christoph Hellwig
  0 siblings, 0 replies; 187+ messages in thread
From: Christoph Hellwig @ 2016-09-30  7:53 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, linux-fsdevel

On Thu, Sep 29, 2016 at 08:10:27PM -0700, Darrick J. Wong wrote:
> Teach xfs_getbmapx how to report shared extents and CoW fork contents
> accurately in the bmap output by querying the refcount btree
> appropriately.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 45/63] xfs: swap inode reflink flags when swapping inode extents
  2016-09-30  3:10 ` [PATCH 45/63] xfs: swap inode reflink flags when swapping inode extents Darrick J. Wong
@ 2016-09-30  7:54   ` Christoph Hellwig
  0 siblings, 0 replies; 187+ messages in thread
From: Christoph Hellwig @ 2016-09-30  7:54 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, linux-fsdevel

On Thu, Sep 29, 2016 at 08:10:33PM -0700, Darrick J. Wong wrote:
> When we're swapping the extents of two inodes, be sure to swap the
> reflink inode flag too.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 46/63] xfs: unshare a range of blocks via fallocate
  2016-09-30  3:10 ` [PATCH 46/63] xfs: unshare a range of blocks via fallocate Darrick J. Wong
@ 2016-09-30  7:54   ` Christoph Hellwig
  2016-10-07 18:05   ` Brian Foster
  1 sibling, 0 replies; 187+ messages in thread
From: Christoph Hellwig @ 2016-09-30  7:54 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, linux-fsdevel, Christoph Hellwig

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 47/63] xfs: create a separate cow extent size hint for the allocator
  2016-09-30  3:10 ` [PATCH 47/63] xfs: create a separate cow extent size hint for the allocator Darrick J. Wong
@ 2016-09-30  7:55   ` Christoph Hellwig
  0 siblings, 0 replies; 187+ messages in thread
From: Christoph Hellwig @ 2016-09-30  7:55 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, linux-fsdevel

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 48/63] xfs: preallocate blocks for worst-case btree expansion
  2016-09-30  3:10 ` [PATCH 48/63] xfs: preallocate blocks for worst-case btree expansion Darrick J. Wong
@ 2016-09-30  8:19   ` Christoph Hellwig
  2016-10-12 18:44   ` Brian Foster
  1 sibling, 0 replies; 187+ messages in thread
From: Christoph Hellwig @ 2016-09-30  8:19 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, linux-fsdevel

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 49/63] xfs: don't allow reflink when the AG is low on space
  2016-09-30  3:10 ` [PATCH 49/63] xfs: don't allow reflink when the AG is low on space Darrick J. Wong
@ 2016-09-30  8:19   ` Christoph Hellwig
  0 siblings, 0 replies; 187+ messages in thread
From: Christoph Hellwig @ 2016-09-30  8:19 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, linux-fsdevel

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 50/63] xfs: try other AGs to allocate a BMBT block
  2016-09-30  3:11 ` [PATCH 50/63] xfs: try other AGs to allocate a BMBT block Darrick J. Wong
@ 2016-09-30  8:20   ` Christoph Hellwig
  0 siblings, 0 replies; 187+ messages in thread
From: Christoph Hellwig @ 2016-09-30  8:20 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, linux-fsdevel

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 51/63] xfs: garbage collect old cowextsz reservations
  2016-09-30  3:11 ` [PATCH 51/63] xfs: garbage collect old cowextsz reservations Darrick J. Wong
@ 2016-09-30  8:23   ` Christoph Hellwig
  0 siblings, 0 replies; 187+ messages in thread
From: Christoph Hellwig @ 2016-09-30  8:23 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, linux-fsdevel

I don't really like the eof_scan_owner magic, but that's just copying
the existing eofblock scan code, so I guess I shouldn't complain here.

Besides that and what appears to generally be a lot of duplication of
the eofblock scanning code this looks fine:

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 52/63] xfs: increase log reservations for reflink
  2016-09-30  3:11 ` [PATCH 52/63] xfs: increase log reservations for reflink Darrick J. Wong
@ 2016-09-30  8:23   ` Christoph Hellwig
  0 siblings, 0 replies; 187+ messages in thread
From: Christoph Hellwig @ 2016-09-30  8:23 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, linux-fsdevel

On Thu, Sep 29, 2016 at 08:11:18PM -0700, Darrick J. Wong wrote:
> Increase the log reservations to handle the increased rolling that
> happens at the end of copy-on-write operations.

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 53/63] xfs: add shared rmap map/unmap/convert log item types
  2016-09-30  3:11 ` [PATCH 53/63] xfs: add shared rmap map/unmap/convert log item types Darrick J. Wong
@ 2016-09-30  8:24   ` Christoph Hellwig
  0 siblings, 0 replies; 187+ messages in thread
From: Christoph Hellwig @ 2016-09-30  8:24 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, linux-fsdevel

On Thu, Sep 29, 2016 at 08:11:24PM -0700, Darrick J. Wong wrote:
> Wire up some rmap log redo item type codes to map, unmap, or convert
> shared data block extents.  The actual log item recovery comes in a
> later patch.

Looks fine, although I don't really understand what the point of this
fine-grained patch split is.

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 54/63] xfs: use interval query for rmap alloc operations on shared files
  2016-09-30  3:11 ` [PATCH 54/63] xfs: use interval query for rmap alloc operations on shared files Darrick J. Wong
@ 2016-09-30  8:24   ` Christoph Hellwig
  0 siblings, 0 replies; 187+ messages in thread
From: Christoph Hellwig @ 2016-09-30  8:24 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, linux-fsdevel

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 55/63] xfs: convert unwritten status of reverse mappings for shared files
  2016-09-30  3:11 ` [PATCH 55/63] xfs: convert unwritten status of reverse mappings for " Darrick J. Wong
@ 2016-09-30  8:25   ` Christoph Hellwig
  0 siblings, 0 replies; 187+ messages in thread
From: Christoph Hellwig @ 2016-09-30  8:25 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, linux-fsdevel

Looks fine:

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 56/63] xfs: set a default CoW extent size of 32 blocks
  2016-09-30  3:11 ` [PATCH 56/63] xfs: set a default CoW extent size of 32 blocks Darrick J. Wong
@ 2016-09-30  8:25   ` Christoph Hellwig
  0 siblings, 0 replies; 187+ messages in thread
From: Christoph Hellwig @ 2016-09-30  8:25 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, linux-fsdevel

On Thu, Sep 29, 2016 at 08:11:44PM -0700, Darrick J. Wong wrote:
> If the admin doesn't set a CoW extent size or a regular extent size
> hint, default to creating CoW reservations 32 blocks long to reduce
> fragmentation.

Looks fine:

Reviewed-by: Christoph Hellwig <hch@lst.de>

Although a constant for the magic number of 32 would have been nice.
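The named constant being asked for might look like this; the macro name and fallback order are illustrative assumptions, not necessarily what XFS ended up with:

```c
#include <assert.h>

/* Hypothetical name for the magic number; units are blocks. */
#define XFS_DEFAULT_COWEXTSZ_HINT	32

/*
 * Pick the CoW allocation hint: an explicit CoW extent size wins, then
 * the regular extent size hint, then the default.
 */
static unsigned int
cow_extsize_hint(
	unsigned int	cowextsize,
	unsigned int	extsize)
{
	if (cowextsize)
		return cowextsize;
	if (extsize)
		return extsize;
	return XFS_DEFAULT_COWEXTSZ_HINT;
}
```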

* Re: [PATCH 57/63] xfs: check for invalid inode reflink flags
  2016-09-30  3:11 ` [PATCH 57/63] xfs: check for invalid inode reflink flags Darrick J. Wong
@ 2016-09-30  8:26   ` Christoph Hellwig
  0 siblings, 0 replies; 187+ messages in thread
From: Christoph Hellwig @ 2016-09-30  8:26 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, linux-fsdevel

On Thu, Sep 29, 2016 at 08:11:50PM -0700, Darrick J. Wong wrote:
> We don't support sharing blocks on the realtime device.  Flag inodes
> with the reflink or cowextsize flags set when the reflink feature is
> disabled.

Looks fine, but why didn't this go into the patches adding reflink
support earlier?

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 58/63] xfs: don't mix reflink and DAX mode for now
  2016-09-30  3:11 ` [PATCH 58/63] xfs: don't mix reflink and DAX mode for now Darrick J. Wong
@ 2016-09-30  8:26   ` Christoph Hellwig
  0 siblings, 0 replies; 187+ messages in thread
From: Christoph Hellwig @ 2016-09-30  8:26 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, linux-fsdevel

On Thu, Sep 29, 2016 at 08:11:57PM -0700, Darrick J. Wong wrote:
> Since we don't have a strategy for handling both DAX and reflink,
> for now we'll just prohibit both being set at the same time.

I think we're about ready to lift this limitation, but let's not delay
the series for that:

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 59/63] xfs: simulate per-AG reservations being critically low
  2016-09-30  3:12 ` [PATCH 59/63] xfs: simulate per-AG reservations being critically low Darrick J. Wong
@ 2016-09-30  8:27   ` Christoph Hellwig
  0 siblings, 0 replies; 187+ messages in thread
From: Christoph Hellwig @ 2016-09-30  8:27 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, linux-fsdevel

On Thu, Sep 29, 2016 at 08:12:03PM -0700, Darrick J. Wong wrote:
> Create an error injection point that enables us to simulate being
> critically low on per-AG block reservations.  This should enable us to
> simulate this specific ENOSPC condition so that we can test falling back
> to a regular file copy.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 60/63] xfs: recognize the reflink feature bit
  2016-09-30  3:12 ` [PATCH 60/63] xfs: recognize the reflink feature bit Darrick J. Wong
@ 2016-09-30  8:27   ` Christoph Hellwig
  0 siblings, 0 replies; 187+ messages in thread
From: Christoph Hellwig @ 2016-09-30  8:27 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, linux-fsdevel

On Thu, Sep 29, 2016 at 08:12:10PM -0700, Darrick J. Wong wrote:
> Add the reflink feature flag to the set of recognized feature flags.
> This enables users to write to reflink filesystems.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 61/63] xfs: various swapext cleanups
  2016-09-30  3:12 ` [PATCH 61/63] xfs: various swapext cleanups Darrick J. Wong
@ 2016-09-30  8:28   ` Christoph Hellwig
  0 siblings, 0 replies; 187+ messages in thread
From: Christoph Hellwig @ 2016-09-30  8:28 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, linux-fsdevel

On Thu, Sep 29, 2016 at 08:12:19PM -0700, Darrick J. Wong wrote:
> Replace structure typedefs with struct expressions and fix some
> whitespace issues that result.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 62/63] xfs: refactor swapext code
  2016-09-30  3:12 ` [PATCH 62/63] xfs: refactor swapext code Darrick J. Wong
@ 2016-09-30  8:28   ` Christoph Hellwig
  0 siblings, 0 replies; 187+ messages in thread
From: Christoph Hellwig @ 2016-09-30  8:28 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, linux-fsdevel

On Thu, Sep 29, 2016 at 08:12:26PM -0700, Darrick J. Wong wrote:
> Refactor the swapext function to pull out the fork swapping piece
> into a separate function.  In the next patch we'll add in the bit
> we need to make it work with rmap filesystems.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH v10 00/63] xfs: add reflink and dedupe support
  2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
                   ` (62 preceding siblings ...)
  2016-09-30  3:12 ` [PATCH 63/63] xfs: implement swapext for rmap filesystems Darrick J. Wong
@ 2016-09-30  9:00 ` Christoph Hellwig
  63 siblings, 0 replies; 187+ messages in thread
From: Christoph Hellwig @ 2016-09-30  9:00 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, linux-fsdevel

This now compiles without warnings and still passes local and NFS
testing.

* Re: [PATCH 14/63] xfs: connect refcount adjust functions to upper layers
  2016-09-30  3:07 ` [PATCH 14/63] xfs: connect refcount adjust functions to upper layers Darrick J. Wong
  2016-09-30  7:13   ` Christoph Hellwig
@ 2016-09-30 16:21   ` Brian Foster
  2016-09-30 19:40     ` Darrick J. Wong
  1 sibling, 1 reply; 187+ messages in thread
From: Brian Foster @ 2016-09-30 16:21 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, linux-fsdevel

On Thu, Sep 29, 2016 at 08:07:05PM -0700, Darrick J. Wong wrote:
> Plumb in the upper level interface to schedule and finish deferred
> refcount operations via the deferred ops mechanism.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_defer.h    |    1 
>  fs/xfs/libxfs/xfs_refcount.c |  170 ++++++++++++++++++++++++++++++++++++++
>  fs/xfs/libxfs/xfs_refcount.h |   12 +++
>  fs/xfs/xfs_error.h           |    4 +
>  fs/xfs/xfs_refcount_item.c   |   73 ++++++++++++++++
>  fs/xfs/xfs_super.c           |    1 
>  fs/xfs/xfs_trace.h           |    3 +
>  fs/xfs/xfs_trans.h           |    8 +-
>  fs/xfs/xfs_trans_refcount.c  |  186 ++++++++++++++++++++++++++++++++++++++++++
>  9 files changed, 452 insertions(+), 6 deletions(-)
> 
> 
...
> diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
> index 599a8d2..e44007a 100644
> --- a/fs/xfs/xfs_refcount_item.c
> +++ b/fs/xfs/xfs_refcount_item.c
> @@ -396,9 +396,19 @@ xfs_cui_recover(
>  {
>  	int				i;
>  	int				error = 0;
> +	unsigned int			refc_type;
>  	struct xfs_phys_extent		*refc;
>  	xfs_fsblock_t			startblock_fsb;
>  	bool				op_ok;
> +	struct xfs_cud_log_item		*cudp;
> +	struct xfs_trans		*tp;
> +	struct xfs_btree_cur		*rcur = NULL;
> +	enum xfs_refcount_intent_type	type;
> +	xfs_fsblock_t			firstfsb;
> +	xfs_extlen_t			adjusted;
> +	struct xfs_bmbt_irec		irec;
> +	struct xfs_defer_ops		dfops;
> +	bool				requeue_only = false;
>  
>  	ASSERT(!test_bit(XFS_CUI_RECOVERED, &cuip->cui_flags));
>  
> @@ -437,7 +447,68 @@ xfs_cui_recover(
>  		}
>  	}
>  
> +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
> +	if (error)
> +		return error;
> +	cudp = xfs_trans_get_cud(tp, cuip);
> +
> +	xfs_defer_init(&dfops, &firstfsb);

A comment would be nice here to point out the approach. E.g., that
refcount updates are initially deferred under normal runtime
circumstances, they handle reservation usage internally/dynamically, and
that since we're in recovery, we start the initial update directly and
defer the rest that won't fit in the transaction (worded better and
assuming I understand all that correctly ;P).

(Sorry for the comment requests and whatnot, BTW. I'm catching up from a
couple weeks of PTO, probably late to the game and not up to speed on
the latest status of the patchset. Feel free to defer, drop, or
conditionalize any of the aesthetic stuff to whenever is opportune if
this stuff is otherwise close to merge).

> +	for (i = 0; i < cuip->cui_format.cui_nextents; i++) {
> +		refc = &cuip->cui_format.cui_extents[i];
> +		refc_type = refc->pe_flags & XFS_REFCOUNT_EXTENT_TYPE_MASK;
> +		switch (refc_type) {
> +		case XFS_REFCOUNT_INCREASE:
> +		case XFS_REFCOUNT_DECREASE:
> +		case XFS_REFCOUNT_ALLOC_COW:
> +		case XFS_REFCOUNT_FREE_COW:
> +			type = refc_type;
> +			break;
> +		default:
> +			error = -EFSCORRUPTED;
> +			goto abort_error;
> +		}
> +		if (requeue_only)
> +			adjusted = 0;
> +		else
> +			error = xfs_trans_log_finish_refcount_update(tp, cudp,
> +				&dfops, type, refc->pe_startblock, refc->pe_len,
> +				&adjusted, &rcur);
> +		if (error)
> +			goto abort_error;
> +
> +		/* Requeue what we didn't finish. */
> +		if (adjusted < refc->pe_len) {
> +			irec.br_startblock = refc->pe_startblock + adjusted;
> +			irec.br_blockcount = refc->pe_len - adjusted;

Hmm, so it appears we walk the range of blocks from beginning to end,
but the refcount update code doesn't necessarily always work that way.
It merges the boundaries and walks the middle range from start to end.
So what happens if the call above ends up doing a right merge and then
skips out on any other changes due to the transaction reservation?

Brian

P.S., Even if I'm missing something and this is not an issue, do we have
any log recovery oriented reflink xfstests in the current test pile? If
not, I'd suggest that something as simple as a "do a bunch of reflinks +
xfs_io -c 'shutdown -f' + umount/mount" loop could go a long way towards
shaking out any issues. Log recovery can be a pita and otherwise
problems therein can go undetected for a surprising amount of time.

> +			switch (type) {
> +			case XFS_REFCOUNT_INCREASE:
> +				error = xfs_refcount_increase_extent(
> +						tp->t_mountp, &dfops, &irec);
> +				break;
> +			case XFS_REFCOUNT_DECREASE:
> +				error = xfs_refcount_decrease_extent(
> +						tp->t_mountp, &dfops, &irec);
> +				break;
> +			default:
> +				ASSERT(0);
> +			}
> +			if (error)
> +				goto abort_error;
> +			requeue_only = true;
> +		}
> +	}
> +
> +	xfs_refcount_finish_one_cleanup(tp, rcur, error);
> +	error = xfs_defer_finish(&tp, &dfops, NULL);
> +	if (error)
> +		goto abort_error;
>  	set_bit(XFS_CUI_RECOVERED, &cuip->cui_flags);
> -	xfs_cui_release(cuip);
> +	error = xfs_trans_commit(tp);
> +	return error;
> +
> +abort_error:
> +	xfs_refcount_finish_one_cleanup(tp, rcur, error);
> +	xfs_defer_cancel(&dfops);
> +	xfs_trans_cancel(tp);
>  	return error;
>  }
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index abe69c6..6234622 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1903,6 +1903,7 @@ init_xfs_fs(void)
>  
>  	xfs_extent_free_init_defer_op();
>  	xfs_rmap_update_init_defer_op();
> +	xfs_refcount_update_init_defer_op();
>  
>  	xfs_dir_startup();
>  
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index fed1906..195a168 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -2931,6 +2931,9 @@ DEFINE_AG_ERROR_EVENT(xfs_refcount_find_right_extent_error);
>  DEFINE_AG_EXTENT_EVENT(xfs_refcount_find_shared);
>  DEFINE_AG_EXTENT_EVENT(xfs_refcount_find_shared_result);
>  DEFINE_AG_ERROR_EVENT(xfs_refcount_find_shared_error);
> +#define DEFINE_REFCOUNT_DEFERRED_EVENT DEFINE_PHYS_EXTENT_DEFERRED_EVENT
> +DEFINE_REFCOUNT_DEFERRED_EVENT(xfs_refcount_defer);
> +DEFINE_REFCOUNT_DEFERRED_EVENT(xfs_refcount_deferred);
>  
>  TRACE_EVENT(xfs_refcount_finish_one_leftover,
>  	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index fe69e20..a7a87d2 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -37,6 +37,8 @@ struct xfs_rud_log_item;
>  struct xfs_rui_log_item;
>  struct xfs_btree_cur;
>  struct xfs_cui_log_item;
> +struct xfs_cud_log_item;
> +struct xfs_defer_ops;
>  
>  typedef struct xfs_log_item {
>  	struct list_head		li_ail;		/* AIL pointers */
> @@ -252,11 +254,13 @@ int xfs_trans_log_finish_rmap_update(struct xfs_trans *tp,
>  /* refcount updates */
>  enum xfs_refcount_intent_type;
>  
> +void xfs_refcount_update_init_defer_op(void);
>  struct xfs_cud_log_item *xfs_trans_get_cud(struct xfs_trans *tp,
>  		struct xfs_cui_log_item *cuip);
>  int xfs_trans_log_finish_refcount_update(struct xfs_trans *tp,
> -		struct xfs_cud_log_item *cudp,
> +		struct xfs_cud_log_item *cudp, struct xfs_defer_ops *dfops,
>  		enum xfs_refcount_intent_type type, xfs_fsblock_t startblock,
> -		xfs_extlen_t blockcount, struct xfs_btree_cur **pcur);
> +		xfs_extlen_t blockcount, xfs_extlen_t *adjusted,
> +		struct xfs_btree_cur **pcur);
>  
>  #endif	/* __XFS_TRANS_H__ */
> diff --git a/fs/xfs/xfs_trans_refcount.c b/fs/xfs/xfs_trans_refcount.c
> index b18d548..e3ac994 100644
> --- a/fs/xfs/xfs_trans_refcount.c
> +++ b/fs/xfs/xfs_trans_refcount.c
> @@ -56,15 +56,17 @@ int
>  xfs_trans_log_finish_refcount_update(
>  	struct xfs_trans		*tp,
>  	struct xfs_cud_log_item		*cudp,
> +	struct xfs_defer_ops		*dop,
>  	enum xfs_refcount_intent_type	type,
>  	xfs_fsblock_t			startblock,
>  	xfs_extlen_t			blockcount,
> +	xfs_extlen_t			*adjusted,
>  	struct xfs_btree_cur		**pcur)
>  {
>  	int				error;
>  
> -	/* XXX: leave this empty for now */
> -	error = -EFSCORRUPTED;
> +	error = xfs_refcount_finish_one(tp, dop, type, startblock,
> +			blockcount, adjusted, pcur);
>  
>  	/*
>  	 * Mark the transaction dirty, even on error. This ensures the
> @@ -78,3 +80,183 @@ xfs_trans_log_finish_refcount_update(
>  
>  	return error;
>  }
> +
> +/* Sort refcount intents by AG. */
> +static int
> +xfs_refcount_update_diff_items(
> +	void				*priv,
> +	struct list_head		*a,
> +	struct list_head		*b)
> +{
> +	struct xfs_mount		*mp = priv;
> +	struct xfs_refcount_intent	*ra;
> +	struct xfs_refcount_intent	*rb;
> +
> +	ra = container_of(a, struct xfs_refcount_intent, ri_list);
> +	rb = container_of(b, struct xfs_refcount_intent, ri_list);
> +	return  XFS_FSB_TO_AGNO(mp, ra->ri_startblock) -
> +		XFS_FSB_TO_AGNO(mp, rb->ri_startblock);
> +}
> +
> +/* Get a CUI. */
> +STATIC void *
> +xfs_refcount_update_create_intent(
> +	struct xfs_trans		*tp,
> +	unsigned int			count)
> +{
> +	struct xfs_cui_log_item		*cuip;
> +
> +	ASSERT(tp != NULL);
> +	ASSERT(count > 0);
> +
> +	cuip = xfs_cui_init(tp->t_mountp, count);
> +	ASSERT(cuip != NULL);
> +
> +	/*
> +	 * Get a log_item_desc to point at the new item.
> +	 */
> +	xfs_trans_add_item(tp, &cuip->cui_item);
> +	return cuip;
> +}
> +
> +/* Set the phys extent flags for this refcount update. */
> +static void
> +xfs_trans_set_refcount_flags(
> +	struct xfs_phys_extent		*refc,
> +	enum xfs_refcount_intent_type	type)
> +{
> +	refc->pe_flags = 0;
> +	switch (type) {
> +	case XFS_REFCOUNT_INCREASE:
> +	case XFS_REFCOUNT_DECREASE:
> +	case XFS_REFCOUNT_ALLOC_COW:
> +	case XFS_REFCOUNT_FREE_COW:
> +		refc->pe_flags |= type;
> +		break;
> +	default:
> +		ASSERT(0);
> +	}
> +}
> +
> +/* Log refcount updates in the intent item. */
> +STATIC void
> +xfs_refcount_update_log_item(
> +	struct xfs_trans		*tp,
> +	void				*intent,
> +	struct list_head		*item)
> +{
> +	struct xfs_cui_log_item		*cuip = intent;
> +	struct xfs_refcount_intent	*refc;
> +	uint				next_extent;
> +	struct xfs_phys_extent		*ext;
> +
> +	refc = container_of(item, struct xfs_refcount_intent, ri_list);
> +
> +	tp->t_flags |= XFS_TRANS_DIRTY;
> +	cuip->cui_item.li_desc->lid_flags |= XFS_LID_DIRTY;
> +
> +	/*
> +	 * atomic_inc_return gives us the value after the increment;
> +	 * we want to use it as an array index so we need to subtract 1 from
> +	 * it.
> +	 */
> +	next_extent = atomic_inc_return(&cuip->cui_next_extent) - 1;
> +	ASSERT(next_extent < cuip->cui_format.cui_nextents);
> +	ext = &cuip->cui_format.cui_extents[next_extent];
> +	ext->pe_startblock = refc->ri_startblock;
> +	ext->pe_len = refc->ri_blockcount;
> +	xfs_trans_set_refcount_flags(ext, refc->ri_type);
> +}
> +
> +/* Get a CUD so we can process all the deferred refcount updates. */
> +STATIC void *
> +xfs_refcount_update_create_done(
> +	struct xfs_trans		*tp,
> +	void				*intent,
> +	unsigned int			count)
> +{
> +	return xfs_trans_get_cud(tp, intent);
> +}
> +
> +/* Process a deferred refcount update. */
> +STATIC int
> +xfs_refcount_update_finish_item(
> +	struct xfs_trans		*tp,
> +	struct xfs_defer_ops		*dop,
> +	struct list_head		*item,
> +	void				*done_item,
> +	void				**state)
> +{
> +	struct xfs_refcount_intent	*refc;
> +	xfs_extlen_t			adjusted;
> +	int				error;
> +
> +	refc = container_of(item, struct xfs_refcount_intent, ri_list);
> +	error = xfs_trans_log_finish_refcount_update(tp, done_item, dop,
> +			refc->ri_type,
> +			refc->ri_startblock,
> +			refc->ri_blockcount,
> +			&adjusted,
> +			(struct xfs_btree_cur **)state);
> +	/* Did we run out of reservation?  Requeue what we didn't finish. */
> +	if (!error && adjusted < refc->ri_blockcount) {
> +		ASSERT(refc->ri_type == XFS_REFCOUNT_INCREASE ||
> +		       refc->ri_type == XFS_REFCOUNT_DECREASE);
> +		refc->ri_startblock += adjusted;
> +		refc->ri_blockcount -= adjusted;
> +		return -EAGAIN;
> +	}
> +	kmem_free(refc);
> +	return error;
> +}
> +
> +/* Clean up after processing deferred refcounts. */
> +STATIC void
> +xfs_refcount_update_finish_cleanup(
> +	struct xfs_trans	*tp,
> +	void			*state,
> +	int			error)
> +{
> +	struct xfs_btree_cur	*rcur = state;
> +
> +	xfs_refcount_finish_one_cleanup(tp, rcur, error);
> +}
> +
> +/* Abort all pending CUIs. */
> +STATIC void
> +xfs_refcount_update_abort_intent(
> +	void				*intent)
> +{
> +	xfs_cui_release(intent);
> +}
> +
> +/* Cancel a deferred refcount update. */
> +STATIC void
> +xfs_refcount_update_cancel_item(
> +	struct list_head		*item)
> +{
> +	struct xfs_refcount_intent	*refc;
> +
> +	refc = container_of(item, struct xfs_refcount_intent, ri_list);
> +	kmem_free(refc);
> +}
> +
> +static const struct xfs_defer_op_type xfs_refcount_update_defer_type = {
> +	.type		= XFS_DEFER_OPS_TYPE_REFCOUNT,
> +	.max_items	= XFS_CUI_MAX_FAST_EXTENTS,
> +	.diff_items	= xfs_refcount_update_diff_items,
> +	.create_intent	= xfs_refcount_update_create_intent,
> +	.abort_intent	= xfs_refcount_update_abort_intent,
> +	.log_item	= xfs_refcount_update_log_item,
> +	.create_done	= xfs_refcount_update_create_done,
> +	.finish_item	= xfs_refcount_update_finish_item,
> +	.finish_cleanup = xfs_refcount_update_finish_cleanup,
> +	.cancel_item	= xfs_refcount_update_cancel_item,
> +};
> +
> +/* Register the deferred op type. */
> +void
> +xfs_refcount_update_init_defer_op(void)
> +{
> +	xfs_defer_init_op_type(&xfs_refcount_update_defer_type);
> +}
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH 17/63] xfs: refcount btree requires more reserved space
  2016-09-30  3:07 ` [PATCH 17/63] xfs: refcount btree requires more reserved space Darrick J. Wong
  2016-09-30  7:15   ` Christoph Hellwig
@ 2016-09-30 16:46   ` Brian Foster
  2016-09-30 18:41     ` Darrick J. Wong
  1 sibling, 1 reply; 187+ messages in thread
From: Brian Foster @ 2016-09-30 16:46 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs

(Dropping fsdevel CC from future replies to reduce review spam.)

On Thu, Sep 29, 2016 at 08:07:26PM -0700, Darrick J. Wong wrote:
> The reference count btree is allocated from the free space, which
> means that we have to ensure that an AG can't run out of free space
> while performing a refcount operation.  In the pathological case each
> AG block has its own refcntbt record, so we have to keep that much
> space available.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
> v2: Calculate the maximum possible size of the rmap and refcount
> btrees based on minimally-full btree blocks.  This increases the
> per-AG block reservations to handle the worst case btree size.
> ---

The code seems fine but I'm thrown off a bit by the commit log
description because the patch doesn't actually change behavior (wrt to
keeping space available for the refcountbt), but rather adds support
functions for such a mechanism to come (I presume).

Brian

>  fs/xfs/libxfs/xfs_alloc.c          |    3 +++
>  fs/xfs/libxfs/xfs_refcount_btree.c |   23 +++++++++++++++++++++++
>  fs/xfs/libxfs/xfs_refcount_btree.h |    4 ++++
>  3 files changed, 30 insertions(+)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index be7e3fc..9d9a46e 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -38,6 +38,7 @@
>  #include "xfs_buf_item.h"
>  #include "xfs_log.h"
>  #include "xfs_ag_resv.h"
> +#include "xfs_refcount_btree.h"
>  
>  struct workqueue_struct *xfs_alloc_wq;
>  
> @@ -128,6 +129,8 @@ xfs_alloc_ag_max_usable(
>  		blocks++;		/* finobt root block */
>  	if (xfs_sb_version_hasrmapbt(&mp->m_sb))
>  		blocks++; 		/* rmap root block */
> +	if (xfs_sb_version_hasreflink(&mp->m_sb))
> +		blocks++;		/* refcount root block */
>  
>  	return mp->m_sb.sb_agblocks - blocks;
>  }
> diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
> index 81d58b0..6b5e82b9 100644
> --- a/fs/xfs/libxfs/xfs_refcount_btree.c
> +++ b/fs/xfs/libxfs/xfs_refcount_btree.c
> @@ -387,3 +387,26 @@ xfs_refcountbt_compute_maxlevels(
>  	mp->m_refc_maxlevels = xfs_btree_compute_maxlevels(mp,
>  			mp->m_refc_mnr, mp->m_sb.sb_agblocks);
>  }
> +
> +/* Calculate the refcount btree size for some records. */
> +xfs_extlen_t
> +xfs_refcountbt_calc_size(
> +	struct xfs_mount	*mp,
> +	unsigned long long	len)
> +{
> +	return xfs_btree_calc_size(mp, mp->m_refc_mnr, len);
> +}
> +
> +/*
> + * Calculate the maximum refcount btree size.
> + */
> +xfs_extlen_t
> +xfs_refcountbt_max_size(
> +	struct xfs_mount	*mp)
> +{
> +	/* Bail out if we're uninitialized, which can happen in mkfs. */
> +	if (mp->m_refc_mxr[0] == 0)
> +		return 0;
> +
> +	return xfs_refcountbt_calc_size(mp, mp->m_sb.sb_agblocks);
> +}
> diff --git a/fs/xfs/libxfs/xfs_refcount_btree.h b/fs/xfs/libxfs/xfs_refcount_btree.h
> index 9e9ad7c..780b02f 100644
> --- a/fs/xfs/libxfs/xfs_refcount_btree.h
> +++ b/fs/xfs/libxfs/xfs_refcount_btree.h
> @@ -64,4 +64,8 @@ extern int xfs_refcountbt_maxrecs(struct xfs_mount *mp, int blocklen,
>  		bool leaf);
>  extern void xfs_refcountbt_compute_maxlevels(struct xfs_mount *mp);
>  
> +extern xfs_extlen_t xfs_refcountbt_calc_size(struct xfs_mount *mp,
> +		unsigned long long len);
> +extern xfs_extlen_t xfs_refcountbt_max_size(struct xfs_mount *mp);
> +
>  #endif	/* __XFS_REFCOUNT_BTREE_H__ */
> 


* Re: [PATCH 19/63] xfs: create bmbt update intent log items
  2016-09-30  7:24   ` Christoph Hellwig
@ 2016-09-30 17:24     ` Darrick J. Wong
  0 siblings, 0 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30 17:24 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: david, linux-xfs, linux-fsdevel

On Fri, Sep 30, 2016 at 12:24:40AM -0700, Christoph Hellwig wrote:
> On Thu, Sep 29, 2016 at 08:07:39PM -0700, Darrick J. Wong wrote:
> > Create bmbt update intent/done log items to record redo information in
> > the log.  Because we roll transactions multiple times for reflink
> > operations, we also have to track the status of the metadata updates
> > that will be recorded in the post-roll transactions in case we crash
> > before committing the final transaction.  This mechanism enables log
> > recovery to finish what was already started.
> 
> Looks fine:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> 
> but the amount of boilerplate code we add for each log item really
> worries me.  We should think of a way to avoid all that code
> duplication.

Yeah, the boilerplate both here and especially for btree rebuilding in
xfs_repair phase 5 worry me.  I've heard chatter about creating another
redo item type to handle directory parent pointers; that's probably a
good time to refactor the log item things.

Not sure about repair, maybe that's tackleable as part of xfsprogs 4.9.

--D


* Re: [PATCH 20/63] xfs: log bmap intent items
  2016-09-30  7:26   ` Christoph Hellwig
@ 2016-09-30 17:26     ` Darrick J. Wong
  0 siblings, 0 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30 17:26 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: david, linux-xfs, linux-fsdevel

On Fri, Sep 30, 2016 at 12:26:50AM -0700, Christoph Hellwig wrote:
> > +/* The XFS_BMAP_EXTENT_* in xfs_log_format.h must match these. */
> > +enum xfs_bmap_intent_type {
> > +	XFS_BMAP_MAP = 1,
> > +	XFS_BMAP_UNMAP,
> > +};
> 
> Meh.  Please just use it directly then.
> 
> Otherwise this looks ok, so a conditional:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> 
> based on fixing the above.

The log format items actually /do/ store the enum values directly into a
u32 now.  Evidently I forgot to update the comment.

--D


* Re: [PATCH 21/63] xfs: map an inode's offset to an exact physical block
  2016-09-30  7:31   ` Christoph Hellwig
@ 2016-09-30 17:30     ` Darrick J. Wong
  0 siblings, 0 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30 17:30 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: david, linux-xfs, linux-fsdevel

On Fri, Sep 30, 2016 at 12:31:07AM -0700, Christoph Hellwig wrote:
> > + * For a remap operation, just "allocate" an extent at the address that the
> > + * caller passed in, and ensure that the AGFL is the right size.  The caller
> > + * will then map the "allocated" extent into the file somewhere.
> > + */
> > +STATIC int
> > +xfs_bmap_remap_alloc(
> > +	struct xfs_bmalloca	*ap)
> > +{
> > +	struct xfs_trans	*tp = ap->tp;
> > +	struct xfs_mount	*mp = tp->t_mountp;
> > +	xfs_agblock_t		bno;
> > +	struct xfs_alloc_arg	args;
> > +	int			error;
> > +
> > +	/*
> > +	 * validate that the block number is legal - this enables us to detect
> > +	 * and handle a silent filesystem corruption rather than crashing.
> > +	 */
> > +	memset(&args, 0, sizeof(struct xfs_alloc_arg));
> > +	args.tp = ap->tp;
> > +	args.mp = ap->tp->t_mountp;
> > +	bno = *ap->firstblock;
> > +	args.agno = XFS_FSB_TO_AGNO(mp, bno);
> > +	ASSERT(args.agno < mp->m_sb.sb_agcount);
> > +	args.agbno = XFS_FSB_TO_AGBNO(mp, bno);
> > +	ASSERT(args.agbno < mp->m_sb.sb_agblocks);
> 
> Shouldn't this return -EFSCORRUPED instead?  Otherwise the comment
> above isn't really true.

Hmm, yes.  I'll fix that up.

--D

> 
> Otherwise this looks fine to me:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 23/63] xfs: implement deferred bmbt map/unmap operations
  2016-09-30  7:34   ` Christoph Hellwig
@ 2016-09-30 17:38     ` Darrick J. Wong
  2016-09-30 20:34       ` Roger Willcocks
  0 siblings, 1 reply; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30 17:38 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: david, linux-xfs, linux-fsdevel

On Fri, Sep 30, 2016 at 12:34:04AM -0700, Christoph Hellwig wrote:
> > +/* Deferred mapping is only for real extents in the data fork. */
> > +static bool
> > +xfs_bmap_is_update_needed(
> > +	int			whichfork,
> > +	struct xfs_bmbt_irec	*bmap)
> > +{
> > +	ASSERT(whichfork == XFS_DATA_FORK);
> > +
> > +	return  bmap->br_startblock != HOLESTARTBLOCK &&
> > +		bmap->br_startblock != DELAYSTARTBLOCK;
> > +}
> 
> Passing in an argument just to assert on it seems weird.
> And except for that a better name might be xfs_bmbt_is_real or similar,
> and I bet we'd have other users for it as well.

xfs_bmap_*map_extent are the only callers, and the only whichfork
values are XFS_DATA_FORK.  I might as well just tear out all those
asserts since they're never going to trigger anyway.

--D

> 
> Otherwise this looks fine to me:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 29/63] xfs: introduce the CoW fork
  2016-09-30  7:39   ` Christoph Hellwig
@ 2016-09-30 17:48     ` Darrick J. Wong
  0 siblings, 0 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30 17:48 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: david, linux-xfs, linux-fsdevel

On Fri, Sep 30, 2016 at 12:39:34AM -0700, Christoph Hellwig wrote:
> > +/* XFS_IEXT_STATE_TO_FORK() -- Convert BMAP state flags to an inode fork. */
> > +xfs_ifork_t *
> > +XFS_IEXT_STATE_TO_FORK(
> > +	struct xfs_inode	*ip,
> > +	int			state)
> > +{
> > +	if (state & BMAP_COWFORK)
> > +		return ip->i_cowfp;
> > +	else if (state & BMAP_ATTRFORK)
> > +		return ip->i_afp;
> > +	return &ip->i_df;
> > +}
> 
> Would be nice to have a lower-case name for this.  And while we're at
> it drop duplicating the function name in the top of the function
> comment.

Ok, done.

--D

> 
> Othwerise looks fine:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 13/63] xfs: adjust refcount of an extent of blocks in refcount btree
  2016-09-30  7:11   ` Christoph Hellwig
@ 2016-09-30 17:53     ` Darrick J. Wong
  0 siblings, 0 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30 17:53 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: david, linux-xfs, linux-fsdevel

On Fri, Sep 30, 2016 at 12:11:15AM -0700, Christoph Hellwig wrote:
> > +#define RCNEXT(rc)	((rc).rc_startblock + (rc).rc_blockcount)
> 
> > +#define IS_VALID_RCEXT(ext)	((ext).rc_startblock != NULLAGBLOCK)
> 
> I would turn these into inline helpers with nice lower case names.
> Not critical for the merge, though, so:

It's a fairly easy change, so I'll make it anyway.

--D

> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 34/63] xfs: support removing extents from CoW fork
  2016-09-30  7:46   ` Christoph Hellwig
@ 2016-09-30 18:00     ` Darrick J. Wong
  0 siblings, 0 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30 18:00 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: david, linux-xfs, linux-fsdevel

On Fri, Sep 30, 2016 at 12:46:42AM -0700, Christoph Hellwig wrote:
> >  /*
> > + * xfs_bunmapi_cow() -- Remove the relevant parts of the CoW fork.
> > + *			See xfs_bmap_del_extent.
> > + * @ip: XFS inode.
> > + * @idx: Extent number to delete.
> > + * @del: Extent to remove.
> > + */
> 
> Not quite a kerneldoc comment, not quite a normal one either..
> 
> That being said, the function seems mostly like a copy of the delay
> part of xfs_bmap_del_extent and the duplication seems unfortunate.
> 
> As I plan to do a major rework of that area for the next merge
> window I don't mind it for now, though.  

Yes, it is cobbled together from bmap_del_extent.  I wasn't sure at the
time whether it was worse to have two annoyingly similar functions, or
to sprinkle if statements all throughout the data fork one to special
case the COW fork.  Anyhow, when you rework the whole mess, cc: me and
I'll review it.

> > +	xfs_bmbt_irec_t		*del)
> 
> Same for these uses of the old typedefs in new code.  I'd rather
> avoid those, but instead of introducing churn now I'll sort this
> out later.

I'll at least fix those while I'm running through the entire tree.

--D

> 
> So:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 42/63] xfs: add clone file and clone range vfs functions
  2016-09-30  7:51   ` Christoph Hellwig
@ 2016-09-30 18:04     ` Darrick J. Wong
  0 siblings, 0 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30 18:04 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: david, linux-xfs, linux-fsdevel

On Fri, Sep 30, 2016 at 12:51:41AM -0700, Christoph Hellwig wrote:
> On Thu, Sep 29, 2016 at 08:10:14PM -0700, Darrick J. Wong wrote:
> > Define two VFS functions which allow userspace to reflink a range of
> > blocks between two files or to reflink one file's contents to another.
> > These functions fit the new VFS ioctls that standardize the checking
> > for the btrfs CLONE and CLONE RANGE ioctls.
> 
> FYI, I really believe the way forward is to make sure vfs_copy_range
> calls the clone handler first if present and handles all the differences
> between the two.  The only thing that held me back from doing that
> is the complete lack of test coverage for the copy functionality.

I've been pondering if there's some way to leverage the existing pile of
clone/clonerange interface tests for copy_file_range without essentially
duplicating all the existing generic/* clone tests, but this time calling
c_f_r instead of _cp_reflink or _reflink_range.  OTOH maybe it's good to
keep them separate since they are separate interfaces.

--D

> 
> Otherwise this looks fine to me:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 17/63] xfs: refcount btree requires more reserved space
  2016-09-30 16:46   ` Brian Foster
@ 2016-09-30 18:41     ` Darrick J. Wong
  0 siblings, 0 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30 18:41 UTC (permalink / raw)
  To: Brian Foster; +Cc: david, linux-xfs

On Fri, Sep 30, 2016 at 12:46:35PM -0400, Brian Foster wrote:
> (Dropping fsdevel CC from future replies to reduce review spam.)
> 
> On Thu, Sep 29, 2016 at 08:07:26PM -0700, Darrick J. Wong wrote:
> > The reference count btree is allocated from the free space, which
> > means that we have to ensure that an AG can't run out of free space
> > while performing a refcount operation.  In the pathological case each
> > AG block has its own refcntbt record, so we have to keep that much
> > space available.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> > v2: Calculate the maximum possible size of the rmap and refcount
> > btrees based on minimally-full btree blocks.  This increases the
> > per-AG block reservations to handle the worst case btree size.
> > ---
> 
> The code seems fine but I'm thrown off a bit by the commit log
> description because the patch doesn't actually change behavior (wrt to
> keeping space available for the refcountbt), but rather adds support
> functions for such a mechanism to come (I presume).

Ah, right, this patch is a leftover from before the per-AG reservation
thing.  I think the xfs_refcount_btree.c hunks can go into the patch
that sets up the reservations for the refcountbt, and I'll change the
description to match the remaining piece.

--D

> 
> Brian
> 
> >  fs/xfs/libxfs/xfs_alloc.c          |    3 +++
> >  fs/xfs/libxfs/xfs_refcount_btree.c |   23 +++++++++++++++++++++++
> >  fs/xfs/libxfs/xfs_refcount_btree.h |    4 ++++
> >  3 files changed, 30 insertions(+)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> > index be7e3fc..9d9a46e 100644
> > --- a/fs/xfs/libxfs/xfs_alloc.c
> > +++ b/fs/xfs/libxfs/xfs_alloc.c
> > @@ -38,6 +38,7 @@
> >  #include "xfs_buf_item.h"
> >  #include "xfs_log.h"
> >  #include "xfs_ag_resv.h"
> > +#include "xfs_refcount_btree.h"
> >  
> >  struct workqueue_struct *xfs_alloc_wq;
> >  
> > @@ -128,6 +129,8 @@ xfs_alloc_ag_max_usable(
> >  		blocks++;		/* finobt root block */
> >  	if (xfs_sb_version_hasrmapbt(&mp->m_sb))
> >  		blocks++; 		/* rmap root block */
> > +	if (xfs_sb_version_hasreflink(&mp->m_sb))
> > +		blocks++;		/* refcount root block */
> >  
> >  	return mp->m_sb.sb_agblocks - blocks;
> >  }
> > diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
> > index 81d58b0..6b5e82b9 100644
> > --- a/fs/xfs/libxfs/xfs_refcount_btree.c
> > +++ b/fs/xfs/libxfs/xfs_refcount_btree.c
> > @@ -387,3 +387,26 @@ xfs_refcountbt_compute_maxlevels(
> >  	mp->m_refc_maxlevels = xfs_btree_compute_maxlevels(mp,
> >  			mp->m_refc_mnr, mp->m_sb.sb_agblocks);
> >  }
> > +
> > +/* Calculate the refcount btree size for some records. */
> > +xfs_extlen_t
> > +xfs_refcountbt_calc_size(
> > +	struct xfs_mount	*mp,
> > +	unsigned long long	len)
> > +{
> > +	return xfs_btree_calc_size(mp, mp->m_refc_mnr, len);
> > +}
> > +
> > +/*
> > + * Calculate the maximum refcount btree size.
> > + */
> > +xfs_extlen_t
> > +xfs_refcountbt_max_size(
> > +	struct xfs_mount	*mp)
> > +{
> > +	/* Bail out if we're uninitialized, which can happen in mkfs. */
> > +	if (mp->m_refc_mxr[0] == 0)
> > +		return 0;
> > +
> > +	return xfs_refcountbt_calc_size(mp, mp->m_sb.sb_agblocks);
> > +}
> > diff --git a/fs/xfs/libxfs/xfs_refcount_btree.h b/fs/xfs/libxfs/xfs_refcount_btree.h
> > index 9e9ad7c..780b02f 100644
> > --- a/fs/xfs/libxfs/xfs_refcount_btree.h
> > +++ b/fs/xfs/libxfs/xfs_refcount_btree.h
> > @@ -64,4 +64,8 @@ extern int xfs_refcountbt_maxrecs(struct xfs_mount *mp, int blocklen,
> >  		bool leaf);
> >  extern void xfs_refcountbt_compute_maxlevels(struct xfs_mount *mp);
> >  
> > +extern xfs_extlen_t xfs_refcountbt_calc_size(struct xfs_mount *mp,
> > +		unsigned long long len);
> > +extern xfs_extlen_t xfs_refcountbt_max_size(struct xfs_mount *mp);
> > +
> >  #endif	/* __XFS_REFCOUNT_BTREE_H__ */
> > 


* Re: [PATCH 18/63] xfs: introduce reflink utility functions
  2016-09-30  3:07   ` Darrick J. Wong
@ 2016-09-30 19:22   ` Brian Foster
  2016-09-30 19:50     ` Darrick J. Wong
  -1 siblings, 1 reply; 187+ messages in thread
From: Brian Foster @ 2016-09-30 19:22 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs

On Thu, Sep 29, 2016 at 08:07:32PM -0700, Darrick J. Wong wrote:
> These functions will be used by the other reflink functions to find
> the maximum length of a range of shared blocks.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_refcount.c |  100 ++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/libxfs/xfs_refcount.h |    4 ++
>  2 files changed, 104 insertions(+)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
> index 49d8c6f..0748c9c 100644
> --- a/fs/xfs/libxfs/xfs_refcount.c
> +++ b/fs/xfs/libxfs/xfs_refcount.c
> @@ -1142,3 +1142,103 @@ xfs_refcount_decrease_extent(
>  	return __xfs_refcount_add(mp, dfops, XFS_REFCOUNT_DECREASE,
>  			PREV->br_startblock, PREV->br_blockcount);
>  }
> +
> +/*
> + * Given an AG extent, find the lowest-numbered run of shared blocks within
> + * that range and return the range in fbno/flen.  If find_maximal is set,
> + * return the longest extent of shared blocks; if not, just return the first
> + * extent we find.  If no shared blocks are found, flen will be set to zero.
> + */
> +int
> +xfs_refcount_find_shared(
> +	struct xfs_btree_cur		*cur,
> +	xfs_agblock_t			agbno,
> +	xfs_extlen_t			aglen,
> +	xfs_agblock_t			*fbno,
> +	xfs_extlen_t			*flen,
> +	bool				find_maximal)
> +{
> +	struct xfs_refcount_irec	tmp;
> +	int				i;
> +	int				have;
> +	int				error;
> +
> +	trace_xfs_refcount_find_shared(cur->bc_mp, cur->bc_private.a.agno,
> +			agbno, aglen);
> +
> +	/* By default, skip the whole range */
> +	*fbno = agbno + aglen;
> +	*flen = 0;
> +
> +	/* Try to find a refcount extent that crosses the start */
> +	error = xfs_refcount_lookup_le(cur, agbno, &have);
> +	if (error)
> +		goto out_error;
> +	if (!have) {
> +		/* No left extent, look at the next one */
> +		error = xfs_btree_increment(cur, 0, &have);
> +		if (error)
> +			goto out_error;
> +		if (!have)
> +			goto done;
> +	}
> +	error = xfs_refcount_get_rec(cur, &tmp, &i);
> +	if (error)
> +		goto out_error;
> +	XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, i == 1, out_error);
> +
> +	/* If the extent ends before the start, look at the next one */
> +	if (tmp.rc_startblock + tmp.rc_blockcount <= agbno) {
> +		error = xfs_btree_increment(cur, 0, &have);
> +		if (error)
> +			goto out_error;
> +		if (!have)
> +			goto done;
> +		error = xfs_refcount_get_rec(cur, &tmp, &i);
> +		if (error)
> +			goto out_error;
> +		XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, i == 1, out_error);
> +	}
> +
> +	/* If the extent ends after the range we want, bail out */

Nit:			 starts

> +	if (tmp.rc_startblock >= agbno + aglen)
> +		goto done;
> +
> +	/* We found the start of a shared extent! */
> +	if (tmp.rc_startblock < agbno) {
> +		tmp.rc_blockcount -= (agbno - tmp.rc_startblock);
> +		tmp.rc_startblock = agbno;
> +	}
> +
> +	*fbno = tmp.rc_startblock;
> +	*flen = min(tmp.rc_blockcount, agbno + aglen - *fbno);
> +	if (!find_maximal)
> +		goto done;
> +
> +	/* Otherwise, find the end of this shared extent */
> +	while (*fbno + *flen < agbno + aglen) {
> +		error = xfs_btree_increment(cur, 0, &have);
> +		if (error)
> +			goto out_error;
> +		if (!have)
> +			break;
> +		error = xfs_refcount_get_rec(cur, &tmp, &i);
> +		if (error)
> +			goto out_error;
> +		XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, i == 1, out_error);
> +		if (tmp.rc_startblock >= agbno + aglen ||
> +		    tmp.rc_startblock != *fbno + *flen)
> +			break;

FWIW, my impression from the comment above the function is that
find_maximal means to find the longest shared extent in the range of
blocks. Rather, this appears to extend the first found shared range to
include separate, but physically contiguous records (which I assume
means that only the refcounts differ). Perhaps changing "find_maximal"
to "ignore_refcount" or something better might be more clear?

Brian
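
For what it's worth, the trimming and extension arithmetic in the hunk above
can be modeled in plain userspace C. This is an illustration only, not the
kernel code: simplified unsigned types and a flat, sorted record array stand
in for the btree cursor, and the behavior Brian describes (extending the
first hit across physically contiguous records) is what the loop implements:

```c
#include <assert.h>
#include <stdbool.h>

struct rec { unsigned start, len; };	/* stands in for xfs_refcount_irec */

/*
 * Find the first shared run inside [agbno, agbno + aglen) and, when
 * find_maximal is set, extend it across physically contiguous records.
 * Returns the run in *fbno/*flen; *flen == 0 means nothing shared.
 */
void find_shared(const struct rec *recs, int nrecs,
		 unsigned agbno, unsigned aglen,
		 unsigned *fbno, unsigned *flen, bool find_maximal)
{
	/* by default, skip the whole range */
	*fbno = agbno + aglen;
	*flen = 0;
	for (int i = 0; i < nrecs; i++) {
		struct rec tmp = recs[i];

		if (tmp.start + tmp.len <= agbno)
			continue;		/* ends before the range */
		if (tmp.start >= agbno + aglen)
			break;			/* starts after the range */
		if (*flen == 0) {
			/* clamp the first hit to the start of the range */
			if (tmp.start < agbno) {
				tmp.len -= agbno - tmp.start;
				tmp.start = agbno;
			}
			*fbno = tmp.start;
			*flen = tmp.len < agbno + aglen - *fbno ?
					tmp.len : agbno + aglen - *fbno;
			if (!find_maximal)
				return;
		} else if (tmp.start == *fbno + *flen) {
			/* physically contiguous record: extend the run */
			unsigned n = *flen + tmp.len;
			unsigned max = agbno + aglen - *fbno;
			*flen = n < max ? n : max;
		} else {
			break;			/* gap: the run ends here */
		}
	}
}
```

With records {10,5} and {15,5}, a query over [12,32) clamps the first hit to
start 12 and, with find_maximal set, extends it across the contiguous second
record, which matches the "separate but physically contiguous records"
reading of the flag.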

> +		*flen = min(*flen + tmp.rc_blockcount, agbno + aglen - *fbno);
> +	}
> +
> +done:
> +	trace_xfs_refcount_find_shared_result(cur->bc_mp,
> +			cur->bc_private.a.agno, *fbno, *flen);
> +
> +out_error:
> +	if (error)
> +		trace_xfs_refcount_find_shared_error(cur->bc_mp,
> +				cur->bc_private.a.agno, error, _RET_IP_);
> +	return error;
> +}
> diff --git a/fs/xfs/libxfs/xfs_refcount.h b/fs/xfs/libxfs/xfs_refcount.h
> index 7e750a5..48c576c 100644
> --- a/fs/xfs/libxfs/xfs_refcount.h
> +++ b/fs/xfs/libxfs/xfs_refcount.h
> @@ -54,4 +54,8 @@ extern int xfs_refcount_finish_one(struct xfs_trans *tp,
>  		xfs_fsblock_t startblock, xfs_extlen_t blockcount,
>  		xfs_extlen_t *adjusted, struct xfs_btree_cur **pcur);
>  
> +extern int xfs_refcount_find_shared(struct xfs_btree_cur *cur,
> +		xfs_agblock_t agbno, xfs_extlen_t aglen, xfs_agblock_t *fbno,
> +		xfs_extlen_t *flen, bool find_maximal);
> +
>  #endif	/* __XFS_REFCOUNT_H__ */
> 


* Re: [PATCH 20/63] xfs: log bmap intent items
  2016-09-30  3:07 ` [PATCH 20/63] xfs: log bmap intent items Darrick J. Wong
  2016-09-30  7:26   ` Christoph Hellwig
@ 2016-09-30 19:22   ` Brian Foster
  2016-09-30 19:52     ` Darrick J. Wong
  1 sibling, 1 reply; 187+ messages in thread
From: Brian Foster @ 2016-09-30 19:22 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs

On Thu, Sep 29, 2016 at 08:07:45PM -0700, Darrick J. Wong wrote:
> Provide a mechanism for higher levels to create BUI/BUD items, submit
> them to the log, and a stub function to deal with recovered BUI items.
> These parts will be connected to the rmapbt in a later patch.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
> v2: Only support one item per BUI.
> ---
>  fs/xfs/Makefile          |    1 
>  fs/xfs/libxfs/xfs_bmap.h |   14 ++++
>  fs/xfs/xfs_bmap_item.c   |   69 ++++++++++++++++++
>  fs/xfs/xfs_bmap_item.h   |    1 
>  fs/xfs/xfs_log_recover.c |  177 ++++++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_trans.h       |   13 +++
>  fs/xfs/xfs_trans_bmap.c  |   84 ++++++++++++++++++++++
>  7 files changed, 359 insertions(+)
>  create mode 100644 fs/xfs/xfs_trans_bmap.c
> 
> 
...
> diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
> index ea736af..4e46b63 100644
> --- a/fs/xfs/xfs_bmap_item.c
> +++ b/fs/xfs/xfs_bmap_item.c
...
> @@ -372,3 +378,66 @@ xfs_bud_init(
>  
>  	return budp;
>  }
> +
> +/*
> + * Process a bmap update intent item that was recovered from the log.
> + * We need to update some inode's bmbt.
> + */
> +int
> +xfs_bui_recover(
> +	struct xfs_mount		*mp,
> +	struct xfs_bui_log_item		*buip)
> +{
> +	int				error = 0;
> +	struct xfs_map_extent		*bmap;
> +	xfs_fsblock_t			startblock_fsb;
> +	xfs_fsblock_t			inode_fsb;
> +	bool				op_ok;
> +
> +	ASSERT(!test_bit(XFS_BUI_RECOVERED, &buip->bui_flags));
> +
> +	/* Only one mapping operation per BUI... */
> +	if (buip->bui_format.bui_nextents != XFS_BUI_MAX_FAST_EXTENTS) {
> +		set_bit(XFS_BUI_RECOVERED, &buip->bui_flags);
> +		xfs_bui_release(buip);
> +		return -EIO;
> +	}
> +
> +	/*
> +	 * First check the validity of the extent described by the
> +	 * BUI.  If anything is bad, then toss the BUI.
> +	 */
> +	bmap = &buip->bui_format.bui_extents[0];
> +	startblock_fsb = XFS_BB_TO_FSB(mp,
> +			   XFS_FSB_TO_DADDR(mp, bmap->me_startblock));
> +	inode_fsb = XFS_BB_TO_FSB(mp, XFS_FSB_TO_DADDR(mp,
> +			XFS_INO_TO_FSB(mp, bmap->me_owner)));
> +	switch (bmap->me_flags & XFS_BMAP_EXTENT_TYPE_MASK) {
> +	case XFS_BMAP_MAP:
> +	case XFS_BMAP_UNMAP:
> +		op_ok = true;
> +		break;
> +	default:
> +		op_ok = false;
> +		break;
> +	}
> +	if (!op_ok || startblock_fsb == 0 ||
> +	    bmap->me_len == 0 ||
> +	    inode_fsb == 0 ||
> +	    startblock_fsb >= mp->m_sb.sb_dblocks ||
> +	    bmap->me_len >= mp->m_sb.sb_agblocks ||
> +	    inode_fsb >= mp->m_sb.sb_agblocks ||

Did you mean sb_dblocks here?

Brian
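
To make the question concrete, here is a userspace sketch of the validity
checks in the hunk above, with the bound on inode_fsb taken against
sb_dblocks (the whole filesystem) rather than sb_agblocks, since inode
numbers address the whole filesystem. This is a hypothetical simplification
with stand-in types, not the kernel code:

```c
#include <stdbool.h>

/* simplified stand-in for the relevant superblock geometry */
struct geom {
	unsigned long long dblocks;	/* filesystem size in blocks */
	unsigned agblocks;		/* AG size in blocks */
};

/*
 * Mirror of the sanity checks in xfs_bui_recover(), with inode_fsb
 * bounded by dblocks per the review comment above.
 */
bool bui_extent_ok(const struct geom *g, bool op_ok,
		   unsigned long long startblock_fsb,
		   unsigned long long inode_fsb, unsigned len)
{
	if (!op_ok)
		return false;
	if (startblock_fsb == 0 || len == 0 || inode_fsb == 0)
		return false;
	if (startblock_fsb >= g->dblocks)
		return false;
	if (len >= g->agblocks)		/* one mapping can't span an AG */
		return false;
	if (inode_fsb >= g->dblocks)	/* sb_dblocks, not sb_agblocks */
		return false;
	return true;
}
```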

> +	    (bmap->me_flags & ~XFS_BMAP_EXTENT_FLAGS)) {
> +		/*
> +		 * This will pull the BUI from the AIL and
> +		 * free the memory associated with it.
> +		 */
> +		set_bit(XFS_BUI_RECOVERED, &buip->bui_flags);
> +		xfs_bui_release(buip);
> +		return -EIO;
> +	}
> +
> +	set_bit(XFS_BUI_RECOVERED, &buip->bui_flags);
> +	xfs_bui_release(buip);
> +	return error;
> +}
> diff --git a/fs/xfs/xfs_bmap_item.h b/fs/xfs/xfs_bmap_item.h
> index 57c13d3..c867daa 100644
> --- a/fs/xfs/xfs_bmap_item.h
> +++ b/fs/xfs/xfs_bmap_item.h
> @@ -93,5 +93,6 @@ struct xfs_bud_log_item *xfs_bud_init(struct xfs_mount *,
>  		struct xfs_bui_log_item *);
>  void xfs_bui_item_free(struct xfs_bui_log_item *);
>  void xfs_bui_release(struct xfs_bui_log_item *);
> +int xfs_bui_recover(struct xfs_mount *mp, struct xfs_bui_log_item *buip);
>  
>  #endif	/* __XFS_BMAP_ITEM_H__ */
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index 622881a..9697e94 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -46,6 +46,7 @@
>  #include "xfs_rmap_item.h"
>  #include "xfs_buf_item.h"
>  #include "xfs_refcount_item.h"
> +#include "xfs_bmap_item.h"
>  
>  #define BLK_AVG(blk1, blk2)	((blk1+blk2) >> 1)
>  
> @@ -1927,6 +1928,8 @@ xlog_recover_reorder_trans(
>  		case XFS_LI_RUD:
>  		case XFS_LI_CUI:
>  		case XFS_LI_CUD:
> +		case XFS_LI_BUI:
> +		case XFS_LI_BUD:
>  			trace_xfs_log_recover_item_reorder_tail(log,
>  							trans, item, pass);
>  			list_move_tail(&item->ri_list, &inode_list);
> @@ -3671,6 +3674,125 @@ xlog_recover_cud_pass2(
>  }
>  
>  /*
> + * Copy a BUI format buffer from the given buf into the destination
> + * BUI format structure.  The BUI/BUD items were designed not to need any
> + * special alignment handling.
> + */
> +static int
> +xfs_bui_copy_format(
> +	struct xfs_log_iovec		*buf,
> +	struct xfs_bui_log_format	*dst_bui_fmt)
> +{
> +	struct xfs_bui_log_format	*src_bui_fmt;
> +	uint				len;
> +
> +	src_bui_fmt = buf->i_addr;
> +	len = xfs_bui_log_format_sizeof(src_bui_fmt->bui_nextents);
> +
> +	if (buf->i_len == len) {
> +		memcpy(dst_bui_fmt, src_bui_fmt, len);
> +		return 0;
> +	}
> +	return -EFSCORRUPTED;
> +}
> +
> +/*
> + * This routine is called to create an in-core extent bmap update
> + * item from the bui format structure which was logged on disk.
> + * It allocates an in-core bui, copies the extents from the format
> + * structure into it, and adds the bui to the AIL with the given
> + * LSN.
> + */
> +STATIC int
> +xlog_recover_bui_pass2(
> +	struct xlog			*log,
> +	struct xlog_recover_item	*item,
> +	xfs_lsn_t			lsn)
> +{
> +	int				error;
> +	struct xfs_mount		*mp = log->l_mp;
> +	struct xfs_bui_log_item		*buip;
> +	struct xfs_bui_log_format	*bui_formatp;
> +
> +	bui_formatp = item->ri_buf[0].i_addr;
> +
> +	if (bui_formatp->bui_nextents != XFS_BUI_MAX_FAST_EXTENTS)
> +		return -EFSCORRUPTED;
> +	buip = xfs_bui_init(mp);
> +	error = xfs_bui_copy_format(&item->ri_buf[0], &buip->bui_format);
> +	if (error) {
> +		xfs_bui_item_free(buip);
> +		return error;
> +	}
> +	atomic_set(&buip->bui_next_extent, bui_formatp->bui_nextents);
> +
> +	spin_lock(&log->l_ailp->xa_lock);
> +	/*
> +	 * The BUI has two references. One for the BUD and one for the BUI to
> +	 * ensure it makes it into the AIL. Insert the BUI into the AIL directly
> +	 * and drop the BUI reference. Note that xfs_trans_ail_update() drops
> +	 * AIL lock.
> +	 */
> +	xfs_trans_ail_update(log->l_ailp, &buip->bui_item, lsn);
> +	xfs_bui_release(buip);
> +	return 0;
> +}
> +
> +
> +/*
> + * This routine is called when a BUD format structure is found in a committed
> + * transaction in the log. Its purpose is to cancel the corresponding BUI if it
> + * was still in the log. To do this it searches the AIL for the BUI with an id
> + * equal to that in the BUD format structure. If we find it we drop the BUD
> + * reference, which removes the BUI from the AIL and frees it.
> + */
> +STATIC int
> +xlog_recover_bud_pass2(
> +	struct xlog			*log,
> +	struct xlog_recover_item	*item)
> +{
> +	struct xfs_bud_log_format	*bud_formatp;
> +	struct xfs_bui_log_item		*buip = NULL;
> +	struct xfs_log_item		*lip;
> +	__uint64_t			bui_id;
> +	struct xfs_ail_cursor		cur;
> +	struct xfs_ail			*ailp = log->l_ailp;
> +
> +	bud_formatp = item->ri_buf[0].i_addr;
> +	if (item->ri_buf[0].i_len != sizeof(struct xfs_bud_log_format))
> +		return -EFSCORRUPTED;
> +	bui_id = bud_formatp->bud_bui_id;
> +
> +	/*
> +	 * Search for the BUI with the id in the BUD format structure in the
> +	 * AIL.
> +	 */
> +	spin_lock(&ailp->xa_lock);
> +	lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
> +	while (lip != NULL) {
> +		if (lip->li_type == XFS_LI_BUI) {
> +			buip = (struct xfs_bui_log_item *)lip;
> +			if (buip->bui_format.bui_id == bui_id) {
> +				/*
> +				 * Drop the BUD reference to the BUI. This
> +				 * removes the BUI from the AIL and frees it.
> +				 */
> +				spin_unlock(&ailp->xa_lock);
> +				xfs_bui_release(buip);
> +				spin_lock(&ailp->xa_lock);
> +				break;
> +			}
> +		}
> +		lip = xfs_trans_ail_cursor_next(ailp, &cur);
> +	}
> +
> +	xfs_trans_ail_cursor_done(&cur);
> +	spin_unlock(&ailp->xa_lock);
> +
> +	return 0;
> +}
> +
> +/*
>   * This routine is called when an inode create format structure is found in a
>   * committed transaction in the log.  It's purpose is to initialise the inodes
>   * being allocated on disk. This requires us to get inode cluster buffers that
> @@ -3899,6 +4021,8 @@ xlog_recover_ra_pass2(
>  	case XFS_LI_RUD:
>  	case XFS_LI_CUI:
>  	case XFS_LI_CUD:
> +	case XFS_LI_BUI:
> +	case XFS_LI_BUD:
>  	default:
>  		break;
>  	}
> @@ -3926,6 +4050,8 @@ xlog_recover_commit_pass1(
>  	case XFS_LI_RUD:
>  	case XFS_LI_CUI:
>  	case XFS_LI_CUD:
> +	case XFS_LI_BUI:
> +	case XFS_LI_BUD:
>  		/* nothing to do in pass 1 */
>  		return 0;
>  	default:
> @@ -3964,6 +4090,10 @@ xlog_recover_commit_pass2(
>  		return xlog_recover_cui_pass2(log, item, trans->r_lsn);
>  	case XFS_LI_CUD:
>  		return xlog_recover_cud_pass2(log, item);
> +	case XFS_LI_BUI:
> +		return xlog_recover_bui_pass2(log, item, trans->r_lsn);
> +	case XFS_LI_BUD:
> +		return xlog_recover_bud_pass2(log, item);
>  	case XFS_LI_DQUOT:
>  		return xlog_recover_dquot_pass2(log, buffer_list, item,
>  						trans->r_lsn);
> @@ -4591,6 +4721,46 @@ xlog_recover_cancel_cui(
>  	spin_lock(&ailp->xa_lock);
>  }
>  
> +/* Recover the BUI if necessary. */
> +STATIC int
> +xlog_recover_process_bui(
> +	struct xfs_mount		*mp,
> +	struct xfs_ail			*ailp,
> +	struct xfs_log_item		*lip)
> +{
> +	struct xfs_bui_log_item		*buip;
> +	int				error;
> +
> +	/*
> +	 * Skip BUIs that we've already processed.
> +	 */
> +	buip = container_of(lip, struct xfs_bui_log_item, bui_item);
> +	if (test_bit(XFS_BUI_RECOVERED, &buip->bui_flags))
> +		return 0;
> +
> +	spin_unlock(&ailp->xa_lock);
> +	error = xfs_bui_recover(mp, buip);
> +	spin_lock(&ailp->xa_lock);
> +
> +	return error;
> +}
> +
> +/* Release the BUI since we're cancelling everything. */
> +STATIC void
> +xlog_recover_cancel_bui(
> +	struct xfs_mount		*mp,
> +	struct xfs_ail			*ailp,
> +	struct xfs_log_item		*lip)
> +{
> +	struct xfs_bui_log_item		*buip;
> +
> +	buip = container_of(lip, struct xfs_bui_log_item, bui_item);
> +
> +	spin_unlock(&ailp->xa_lock);
> +	xfs_bui_release(buip);
> +	spin_lock(&ailp->xa_lock);
> +}
> +
>  /* Is this log item a deferred action intent? */
>  static inline bool xlog_item_is_intent(struct xfs_log_item *lip)
>  {
> @@ -4598,6 +4768,7 @@ static inline bool xlog_item_is_intent(struct xfs_log_item *lip)
>  	case XFS_LI_EFI:
>  	case XFS_LI_RUI:
>  	case XFS_LI_CUI:
> +	case XFS_LI_BUI:
>  		return true;
>  	default:
>  		return false;
> @@ -4664,6 +4835,9 @@ xlog_recover_process_intents(
>  		case XFS_LI_CUI:
>  			error = xlog_recover_process_cui(log->l_mp, ailp, lip);
>  			break;
> +		case XFS_LI_BUI:
> +			error = xlog_recover_process_bui(log->l_mp, ailp, lip);
> +			break;
>  		}
>  		if (error)
>  			goto out;
> @@ -4714,6 +4888,9 @@ xlog_recover_cancel_intents(
>  		case XFS_LI_CUI:
>  			xlog_recover_cancel_cui(log->l_mp, ailp, lip);
>  			break;
> +		case XFS_LI_BUI:
> +			xlog_recover_cancel_bui(log->l_mp, ailp, lip);
> +			break;
>  		}
>  
>  		lip = xfs_trans_ail_cursor_next(ailp, &cur);
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index a7a87d2..7cf02d3 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -39,6 +39,7 @@ struct xfs_btree_cur;
>  struct xfs_cui_log_item;
>  struct xfs_cud_log_item;
>  struct xfs_defer_ops;
> +struct xfs_bui_log_item;
>  
>  typedef struct xfs_log_item {
>  	struct list_head		li_ail;		/* AIL pointers */
> @@ -263,4 +264,16 @@ int xfs_trans_log_finish_refcount_update(struct xfs_trans *tp,
>  		xfs_extlen_t blockcount, xfs_extlen_t *adjusted,
>  		struct xfs_btree_cur **pcur);
>  
> +/* mapping updates */
> +enum xfs_bmap_intent_type;
> +
> +void xfs_bmap_update_init_defer_op(void);
> +struct xfs_bud_log_item *xfs_trans_get_bud(struct xfs_trans *tp,
> +		struct xfs_bui_log_item *buip);
> +int xfs_trans_log_finish_bmap_update(struct xfs_trans *tp,
> +		struct xfs_bud_log_item *budp, struct xfs_defer_ops *dfops,
> +		enum xfs_bmap_intent_type type, struct xfs_inode *ip,
> +		int whichfork, xfs_fileoff_t startoff, xfs_fsblock_t startblock,
> +		xfs_filblks_t blockcount, xfs_exntst_t state);
> +
>  #endif	/* __XFS_TRANS_H__ */
> diff --git a/fs/xfs/xfs_trans_bmap.c b/fs/xfs/xfs_trans_bmap.c
> new file mode 100644
> index 0000000..656d669
> --- /dev/null
> +++ b/fs/xfs/xfs_trans_bmap.c
> @@ -0,0 +1,84 @@
> +/*
> + * Copyright (C) 2016 Oracle.  All Rights Reserved.
> + *
> + * Author: Darrick J. Wong <darrick.wong@oracle.com>
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version 2
> + * of the License, or (at your option) any later version.
> + *
> + * This program is distributed in the hope that it would be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write the Free Software Foundation,
> + * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
> + */
> +#include "xfs.h"
> +#include "xfs_fs.h"
> +#include "xfs_shared.h"
> +#include "xfs_format.h"
> +#include "xfs_log_format.h"
> +#include "xfs_trans_resv.h"
> +#include "xfs_mount.h"
> +#include "xfs_defer.h"
> +#include "xfs_trans.h"
> +#include "xfs_trans_priv.h"
> +#include "xfs_bmap_item.h"
> +#include "xfs_alloc.h"
> +#include "xfs_bmap.h"
> +#include "xfs_inode.h"
> +
> +/*
> + * This routine is called to allocate a "bmap update done"
> + * log item.
> + */
> +struct xfs_bud_log_item *
> +xfs_trans_get_bud(
> +	struct xfs_trans		*tp,
> +	struct xfs_bui_log_item		*buip)
> +{
> +	struct xfs_bud_log_item		*budp;
> +
> +	budp = xfs_bud_init(tp->t_mountp, buip);
> +	xfs_trans_add_item(tp, &budp->bud_item);
> +	return budp;
> +}
> +
> +/*
> + * Finish a bmap update and log it to the BUD. Note that the
> + * transaction is marked dirty regardless of whether the bmap update
> + * succeeds or fails to support the BUI/BUD lifecycle rules.
> + */
> +int
> +xfs_trans_log_finish_bmap_update(
> +	struct xfs_trans		*tp,
> +	struct xfs_bud_log_item		*budp,
> +	struct xfs_defer_ops		*dop,
> +	enum xfs_bmap_intent_type	type,
> +	struct xfs_inode		*ip,
> +	int				whichfork,
> +	xfs_fileoff_t			startoff,
> +	xfs_fsblock_t			startblock,
> +	xfs_filblks_t			blockcount,
> +	xfs_exntst_t			state)
> +{
> +	int				error;
> +
> +	error = -EFSCORRUPTED;
> +
> +	/*
> +	 * Mark the transaction dirty, even on error. This ensures the
> +	 * transaction is aborted, which:
> +	 *
> +	 * 1.) releases the BUI and frees the BUD
> +	 * 2.) shuts down the filesystem
> +	 */
> +	tp->t_flags |= XFS_TRANS_DIRTY;
> +	budp->bud_item.li_desc->lid_flags |= XFS_LID_DIRTY;
> +
> +	return error;
> +}
> 


* Re: [PATCH 14/63] xfs: connect refcount adjust functions to upper layers
  2016-09-30 16:21   ` Brian Foster
@ 2016-09-30 19:40     ` Darrick J. Wong
  2016-09-30 20:11       ` Brian Foster
  0 siblings, 1 reply; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30 19:40 UTC (permalink / raw)
  To: Brian Foster; +Cc: david, linux-xfs, linux-fsdevel

On Fri, Sep 30, 2016 at 12:21:03PM -0400, Brian Foster wrote:
> On Thu, Sep 29, 2016 at 08:07:05PM -0700, Darrick J. Wong wrote:
> > Plumb in the upper level interface to schedule and finish deferred
> > refcount operations via the deferred ops mechanism.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/libxfs/xfs_defer.h    |    1 
> >  fs/xfs/libxfs/xfs_refcount.c |  170 ++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/libxfs/xfs_refcount.h |   12 +++
> >  fs/xfs/xfs_error.h           |    4 +
> >  fs/xfs/xfs_refcount_item.c   |   73 ++++++++++++++++
> >  fs/xfs/xfs_super.c           |    1 
> >  fs/xfs/xfs_trace.h           |    3 +
> >  fs/xfs/xfs_trans.h           |    8 +-
> >  fs/xfs/xfs_trans_refcount.c  |  186 ++++++++++++++++++++++++++++++++++++++++++
> >  9 files changed, 452 insertions(+), 6 deletions(-)
> > 
> > 
> ...
> > diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
> > index 599a8d2..e44007a 100644
> > --- a/fs/xfs/xfs_refcount_item.c
> > +++ b/fs/xfs/xfs_refcount_item.c
> > @@ -396,9 +396,19 @@ xfs_cui_recover(
> >  {
> >  	int				i;
> >  	int				error = 0;
> > +	unsigned int			refc_type;
> >  	struct xfs_phys_extent		*refc;
> >  	xfs_fsblock_t			startblock_fsb;
> >  	bool				op_ok;
> > +	struct xfs_cud_log_item		*cudp;
> > +	struct xfs_trans		*tp;
> > +	struct xfs_btree_cur		*rcur = NULL;
> > +	enum xfs_refcount_intent_type	type;
> > +	xfs_fsblock_t			firstfsb;
> > +	xfs_extlen_t			adjusted;
> > +	struct xfs_bmbt_irec		irec;
> > +	struct xfs_defer_ops		dfops;
> > +	bool				requeue_only = false;
> >  
> >  	ASSERT(!test_bit(XFS_CUI_RECOVERED, &cuip->cui_flags));
> >  
> > @@ -437,7 +447,68 @@ xfs_cui_recover(
> >  		}
> >  	}
> >  
> > +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
> > +	if (error)
> > +		return error;
> > +	cudp = xfs_trans_get_cud(tp, cuip);
> > +
> > +	xfs_defer_init(&dfops, &firstfsb);
> 
> A comment would be nice here to point out the approach. E.g., that
> refcount updates are initially deferred under normal runtime
> circumstances, they handle reservation usage internally/dynamically, and
> that since we're in recovery, we start the initial update directly and
> defer the rest that won't fit in the transaction (worded better and
> assuming I understand all that correctly ;P).

Yep, your understanding is correct.  I'll put that in as a comment.

> (Sorry for the comment requests and whatnot, BTW. I'm catching up from a
> couple weeks of PTO, probably late to the game and not up to speed on
> the latest status of the patchset. Feel free to defer, drop, or
> conditionalize any of the aesthetic stuff to whenever is opportune if
> this stuff is otherwise close to merge).

NP.  I appreciate review whenever I can get it. :)

(Plus, you found a bug! :) :))

> > +	for (i = 0; i < cuip->cui_format.cui_nextents; i++) {
> > +		refc = &cuip->cui_format.cui_extents[i];
> > +		refc_type = refc->pe_flags & XFS_REFCOUNT_EXTENT_TYPE_MASK;
> > +		switch (refc_type) {
> > +		case XFS_REFCOUNT_INCREASE:
> > +		case XFS_REFCOUNT_DECREASE:
> > +		case XFS_REFCOUNT_ALLOC_COW:
> > +		case XFS_REFCOUNT_FREE_COW:
> > +			type = refc_type;
> > +			break;
> > +		default:
> > +			error = -EFSCORRUPTED;
> > +			goto abort_error;
> > +		}
> > +		if (requeue_only)
> > +			adjusted = 0;
> > +		else
> > +			error = xfs_trans_log_finish_refcount_update(tp, cudp,
> > +				&dfops, type, refc->pe_startblock, refc->pe_len,
> > +				&adjusted, &rcur);
> > +		if (error)
> > +			goto abort_error;
> > +
> > +		/* Requeue what we didn't finish. */
> > +		if (adjusted < refc->pe_len) {
> > +			irec.br_startblock = refc->pe_startblock + adjusted;
> > +			irec.br_blockcount = refc->pe_len - adjusted;
> 
> Hmm, so it appears we walk the range of blocks from beginning to end,
> but the refcount update code doesn't necessarily always work that way.
> It merges the boundaries and walks the middle range from start to end.
> So what happens if the call above ends up doing a right merge and then
> skips out on any other changes due to the transaction reservation?

D'oh!  You've found a bug!  _refcount_adjust needs to communicate to
its caller how much work is left, which does by incrementing *adjusted
every time it finishes more work.  The caller then moves the start of
the extent upwards by *adjusted.  Unfortunately, as you point out, a
right merge actually does work at the upper end of the extent, and this
is not correctly accounted for.

To fix this, I'll change _refcount_adjust to report the unfinished
extent directly to the caller, which will simplify both the function and
its callers' accounting considerably.

Good catch!
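
To make the failure mode concrete: if progress is reported only as a scalar
"blocks adjusted" count and the caller requeues [start + adjusted,
len - adjusted), then work done at the right edge of the extent (a right
merge) gets accounted as if it happened at the left edge; the requeued range
re-covers blocks that were already adjusted and drops blocks that were not.
A toy userspace model with hypothetical numbers (not the kernel code):

```c
#include <assert.h>

struct range { unsigned start, len; };

/* What the current code does: advance the start by a scalar count. */
struct range requeue_by_count(struct range ext, unsigned adjusted)
{
	return (struct range){ ext.start + adjusted, ext.len - adjusted };
}

/* The proposed fix: report the unfinished extent itself. */
struct range requeue_unfinished(struct range unfinished)
{
	return unfinished;
}
```

With extent [100, 110) and a right merge that finished blocks [108, 110),
the scalar scheme requeues [102, 110): blocks 108-109 get processed twice
and blocks 100-101 never get processed, while reporting the unfinished
extent [100, 108) directly is correct by construction.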

> Brian
> 
> P.S., Even if I'm missing something and this is not an issue, do we have
> any log recovery oriented reflink xfstests in the current test pile? If
> not, I'd suggest that something as simple as a "do a bunch of reflinks +
> xfs_io -c 'shutdown -f' + umount/mount" loop could go a long way towards
> shaking out any issues. Log recovery can be a pita and otherwise
> problems therein can go undetected for a surprising amount of time.

xfs/{313,316,321,324,326} use the error injection mechanism to test log
recovery.

--D

> 
> > +			switch (type) {
> > +			case XFS_REFCOUNT_INCREASE:
> > +				error = xfs_refcount_increase_extent(
> > +						tp->t_mountp, &dfops, &irec);
> > +				break;
> > +			case XFS_REFCOUNT_DECREASE:
> > +				error = xfs_refcount_decrease_extent(
> > +						tp->t_mountp, &dfops, &irec);
> > +				break;
> > +			default:
> > +				ASSERT(0);
> > +			}
> > +			if (error)
> > +				goto abort_error;
> > +			requeue_only = true;
> > +		}
> > +	}
> > +
> > +	xfs_refcount_finish_one_cleanup(tp, rcur, error);
> > +	error = xfs_defer_finish(&tp, &dfops, NULL);
> > +	if (error)
> > +		goto abort_error;
> >  	set_bit(XFS_CUI_RECOVERED, &cuip->cui_flags);
> > -	xfs_cui_release(cuip);
> > +	error = xfs_trans_commit(tp);
> > +	return error;
> > +
> > +abort_error:
> > +	xfs_refcount_finish_one_cleanup(tp, rcur, error);
> > +	xfs_defer_cancel(&dfops);
> > +	xfs_trans_cancel(tp);
> >  	return error;
> >  }
> > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > index abe69c6..6234622 100644
> > --- a/fs/xfs/xfs_super.c
> > +++ b/fs/xfs/xfs_super.c
> > @@ -1903,6 +1903,7 @@ init_xfs_fs(void)
> >  
> >  	xfs_extent_free_init_defer_op();
> >  	xfs_rmap_update_init_defer_op();
> > +	xfs_refcount_update_init_defer_op();
> >  
> >  	xfs_dir_startup();
> >  
> > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > index fed1906..195a168 100644
> > --- a/fs/xfs/xfs_trace.h
> > +++ b/fs/xfs/xfs_trace.h
> > @@ -2931,6 +2931,9 @@ DEFINE_AG_ERROR_EVENT(xfs_refcount_find_right_extent_error);
> >  DEFINE_AG_EXTENT_EVENT(xfs_refcount_find_shared);
> >  DEFINE_AG_EXTENT_EVENT(xfs_refcount_find_shared_result);
> >  DEFINE_AG_ERROR_EVENT(xfs_refcount_find_shared_error);
> > +#define DEFINE_REFCOUNT_DEFERRED_EVENT DEFINE_PHYS_EXTENT_DEFERRED_EVENT
> > +DEFINE_REFCOUNT_DEFERRED_EVENT(xfs_refcount_defer);
> > +DEFINE_REFCOUNT_DEFERRED_EVENT(xfs_refcount_deferred);
> >  
> >  TRACE_EVENT(xfs_refcount_finish_one_leftover,
> >  	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
> > diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> > index fe69e20..a7a87d2 100644
> > --- a/fs/xfs/xfs_trans.h
> > +++ b/fs/xfs/xfs_trans.h
> > @@ -37,6 +37,8 @@ struct xfs_rud_log_item;
> >  struct xfs_rui_log_item;
> >  struct xfs_btree_cur;
> >  struct xfs_cui_log_item;
> > +struct xfs_cud_log_item;
> > +struct xfs_defer_ops;
> >  
> >  typedef struct xfs_log_item {
> >  	struct list_head		li_ail;		/* AIL pointers */
> > @@ -252,11 +254,13 @@ int xfs_trans_log_finish_rmap_update(struct xfs_trans *tp,
> >  /* refcount updates */
> >  enum xfs_refcount_intent_type;
> >  
> > +void xfs_refcount_update_init_defer_op(void);
> >  struct xfs_cud_log_item *xfs_trans_get_cud(struct xfs_trans *tp,
> >  		struct xfs_cui_log_item *cuip);
> >  int xfs_trans_log_finish_refcount_update(struct xfs_trans *tp,
> > -		struct xfs_cud_log_item *cudp,
> > +		struct xfs_cud_log_item *cudp, struct xfs_defer_ops *dfops,
> >  		enum xfs_refcount_intent_type type, xfs_fsblock_t startblock,
> > -		xfs_extlen_t blockcount, struct xfs_btree_cur **pcur);
> > +		xfs_extlen_t blockcount, xfs_extlen_t *adjusted,
> > +		struct xfs_btree_cur **pcur);
> >  
> >  #endif	/* __XFS_TRANS_H__ */
> > diff --git a/fs/xfs/xfs_trans_refcount.c b/fs/xfs/xfs_trans_refcount.c
> > index b18d548..e3ac994 100644
> > --- a/fs/xfs/xfs_trans_refcount.c
> > +++ b/fs/xfs/xfs_trans_refcount.c
> > @@ -56,15 +56,17 @@ int
> >  xfs_trans_log_finish_refcount_update(
> >  	struct xfs_trans		*tp,
> >  	struct xfs_cud_log_item		*cudp,
> > +	struct xfs_defer_ops		*dop,
> >  	enum xfs_refcount_intent_type	type,
> >  	xfs_fsblock_t			startblock,
> >  	xfs_extlen_t			blockcount,
> > +	xfs_extlen_t			*adjusted,
> >  	struct xfs_btree_cur		**pcur)
> >  {
> >  	int				error;
> >  
> > -	/* XXX: leave this empty for now */
> > -	error = -EFSCORRUPTED;
> > +	error = xfs_refcount_finish_one(tp, dop, type, startblock,
> > +			blockcount, adjusted, pcur);
> >  
> >  	/*
> >  	 * Mark the transaction dirty, even on error. This ensures the
> > @@ -78,3 +80,183 @@ xfs_trans_log_finish_refcount_update(
> >  
> >  	return error;
> >  }
> > +
> > +/* Sort refcount intents by AG. */
> > +static int
> > +xfs_refcount_update_diff_items(
> > +	void				*priv,
> > +	struct list_head		*a,
> > +	struct list_head		*b)
> > +{
> > +	struct xfs_mount		*mp = priv;
> > +	struct xfs_refcount_intent	*ra;
> > +	struct xfs_refcount_intent	*rb;
> > +
> > +	ra = container_of(a, struct xfs_refcount_intent, ri_list);
> > +	rb = container_of(b, struct xfs_refcount_intent, ri_list);
> > +	return  XFS_FSB_TO_AGNO(mp, ra->ri_startblock) -
> > +		XFS_FSB_TO_AGNO(mp, rb->ri_startblock);
> > +}
> > +
> > +/* Get an CUI. */
> > +STATIC void *
> > +xfs_refcount_update_create_intent(
> > +	struct xfs_trans		*tp,
> > +	unsigned int			count)
> > +{
> > +	struct xfs_cui_log_item		*cuip;
> > +
> > +	ASSERT(tp != NULL);
> > +	ASSERT(count > 0);
> > +
> > +	cuip = xfs_cui_init(tp->t_mountp, count);
> > +	ASSERT(cuip != NULL);
> > +
> > +	/*
> > +	 * Get a log_item_desc to point at the new item.
> > +	 */
> > +	xfs_trans_add_item(tp, &cuip->cui_item);
> > +	return cuip;
> > +}
> > +
> > +/* Set the phys extent flags for this reverse mapping. */
> > +static void
> > +xfs_trans_set_refcount_flags(
> > +	struct xfs_phys_extent		*refc,
> > +	enum xfs_refcount_intent_type	type)
> > +{
> > +	refc->pe_flags = 0;
> > +	switch (type) {
> > +	case XFS_REFCOUNT_INCREASE:
> > +	case XFS_REFCOUNT_DECREASE:
> > +	case XFS_REFCOUNT_ALLOC_COW:
> > +	case XFS_REFCOUNT_FREE_COW:
> > +		refc->pe_flags |= type;
> > +		break;
> > +	default:
> > +		ASSERT(0);
> > +	}
> > +}
> > +
> > +/* Log refcount updates in the intent item. */
> > +STATIC void
> > +xfs_refcount_update_log_item(
> > +	struct xfs_trans		*tp,
> > +	void				*intent,
> > +	struct list_head		*item)
> > +{
> > +	struct xfs_cui_log_item		*cuip = intent;
> > +	struct xfs_refcount_intent	*refc;
> > +	uint				next_extent;
> > +	struct xfs_phys_extent		*ext;
> > +
> > +	refc = container_of(item, struct xfs_refcount_intent, ri_list);
> > +
> > +	tp->t_flags |= XFS_TRANS_DIRTY;
> > +	cuip->cui_item.li_desc->lid_flags |= XFS_LID_DIRTY;
> > +
> > +	/*
> > +	 * atomic_inc_return gives us the value after the increment;
> > +	 * we want to use it as an array index so we need to subtract 1 from
> > +	 * it.
> > +	 */
> > +	next_extent = atomic_inc_return(&cuip->cui_next_extent) - 1;
> > +	ASSERT(next_extent < cuip->cui_format.cui_nextents);
> > +	ext = &cuip->cui_format.cui_extents[next_extent];
> > +	ext->pe_startblock = refc->ri_startblock;
> > +	ext->pe_len = refc->ri_blockcount;
> > +	xfs_trans_set_refcount_flags(ext, refc->ri_type);
> > +}
> > +
> > +/* Get a CUD so we can process all the deferred refcount updates. */
> > +STATIC void *
> > +xfs_refcount_update_create_done(
> > +	struct xfs_trans		*tp,
> > +	void				*intent,
> > +	unsigned int			count)
> > +{
> > +	return xfs_trans_get_cud(tp, intent);
> > +}
> > +
> > +/* Process a deferred refcount update. */
> > +STATIC int
> > +xfs_refcount_update_finish_item(
> > +	struct xfs_trans		*tp,
> > +	struct xfs_defer_ops		*dop,
> > +	struct list_head		*item,
> > +	void				*done_item,
> > +	void				**state)
> > +{
> > +	struct xfs_refcount_intent	*refc;
> > +	xfs_extlen_t			adjusted;
> > +	int				error;
> > +
> > +	refc = container_of(item, struct xfs_refcount_intent, ri_list);
> > +	error = xfs_trans_log_finish_refcount_update(tp, done_item, dop,
> > +			refc->ri_type,
> > +			refc->ri_startblock,
> > +			refc->ri_blockcount,
> > +			&adjusted,
> > +			(struct xfs_btree_cur **)state);
> > +	/* Did we run out of reservation?  Requeue what we didn't finish. */
> > +	if (!error && adjusted < refc->ri_blockcount) {
> > +		ASSERT(refc->ri_type == XFS_REFCOUNT_INCREASE ||
> > +		       refc->ri_type == XFS_REFCOUNT_DECREASE);
> > +		refc->ri_startblock += adjusted;
> > +		refc->ri_blockcount -= adjusted;
> > +		return -EAGAIN;
> > +	}
> > +	kmem_free(refc);
> > +	return error;
> > +}
> > +
> > +/* Clean up after processing deferred refcounts. */
> > +STATIC void
> > +xfs_refcount_update_finish_cleanup(
> > +	struct xfs_trans	*tp,
> > +	void			*state,
> > +	int			error)
> > +{
> > +	struct xfs_btree_cur	*rcur = state;
> > +
> > +	xfs_refcount_finish_one_cleanup(tp, rcur, error);
> > +}
> > +
> > +/* Abort all pending CUIs. */
> > +STATIC void
> > +xfs_refcount_update_abort_intent(
> > +	void				*intent)
> > +{
> > +	xfs_cui_release(intent);
> > +}
> > +
> > +/* Cancel a deferred refcount update. */
> > +STATIC void
> > +xfs_refcount_update_cancel_item(
> > +	struct list_head		*item)
> > +{
> > +	struct xfs_refcount_intent	*refc;
> > +
> > +	refc = container_of(item, struct xfs_refcount_intent, ri_list);
> > +	kmem_free(refc);
> > +}
> > +
> > +static const struct xfs_defer_op_type xfs_refcount_update_defer_type = {
> > +	.type		= XFS_DEFER_OPS_TYPE_REFCOUNT,
> > +	.max_items	= XFS_CUI_MAX_FAST_EXTENTS,
> > +	.diff_items	= xfs_refcount_update_diff_items,
> > +	.create_intent	= xfs_refcount_update_create_intent,
> > +	.abort_intent	= xfs_refcount_update_abort_intent,
> > +	.log_item	= xfs_refcount_update_log_item,
> > +	.create_done	= xfs_refcount_update_create_done,
> > +	.finish_item	= xfs_refcount_update_finish_item,
> > +	.finish_cleanup = xfs_refcount_update_finish_cleanup,
> > +	.cancel_item	= xfs_refcount_update_cancel_item,
> > +};
> > +
> > +/* Register the deferred op type. */
> > +void
> > +xfs_refcount_update_init_defer_op(void)
> > +{
> > +	xfs_defer_init_op_type(&xfs_refcount_update_defer_type);
> > +}
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH 18/63] xfs: introduce reflink utility functions
  2016-09-30 19:22   ` Brian Foster
@ 2016-09-30 19:50     ` Darrick J. Wong
  0 siblings, 0 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30 19:50 UTC (permalink / raw)
  To: Brian Foster; +Cc: david, linux-xfs

On Fri, Sep 30, 2016 at 03:22:04PM -0400, Brian Foster wrote:
> On Thu, Sep 29, 2016 at 08:07:32PM -0700, Darrick J. Wong wrote:
> > These functions will be used by the other reflink functions to find
> > the maximum length of a range of shared blocks.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/libxfs/xfs_refcount.c |  100 ++++++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/libxfs/xfs_refcount.h |    4 ++
> >  2 files changed, 104 insertions(+)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
> > index 49d8c6f..0748c9c 100644
> > --- a/fs/xfs/libxfs/xfs_refcount.c
> > +++ b/fs/xfs/libxfs/xfs_refcount.c
> > @@ -1142,3 +1142,103 @@ xfs_refcount_decrease_extent(
> >  	return __xfs_refcount_add(mp, dfops, XFS_REFCOUNT_DECREASE,
> >  			PREV->br_startblock, PREV->br_blockcount);
> >  }
> > +
> > +/*
> > + * Given an AG extent, find the lowest-numbered run of shared blocks within
> > + * that range and return the range in fbno/flen.  If find_maximal is set,
> > + * return the longest extent of shared blocks; if not, just return the first
> > + * extent we find.  If no shared blocks are found, flen will be set to zero.
> > + */
> > +int
> > +xfs_refcount_find_shared(
> > +	struct xfs_btree_cur		*cur,
> > +	xfs_agblock_t			agbno,
> > +	xfs_extlen_t			aglen,
> > +	xfs_agblock_t			*fbno,
> > +	xfs_extlen_t			*flen,
> > +	bool				find_maximal)
> > +{
> > +	struct xfs_refcount_irec	tmp;
> > +	int				i;
> > +	int				have;
> > +	int				error;
> > +
> > +	trace_xfs_refcount_find_shared(cur->bc_mp, cur->bc_private.a.agno,
> > +			agbno, aglen);
> > +
> > +	/* By default, skip the whole range */
> > +	*fbno = agbno + aglen;
> > +	*flen = 0;
> > +
> > +	/* Try to find a refcount extent that crosses the start */
> > +	error = xfs_refcount_lookup_le(cur, agbno, &have);
> > +	if (error)
> > +		goto out_error;
> > +	if (!have) {
> > +		/* No left extent, look at the next one */
> > +		error = xfs_btree_increment(cur, 0, &have);
> > +		if (error)
> > +			goto out_error;
> > +		if (!have)
> > +			goto done;
> > +	}
> > +	error = xfs_refcount_get_rec(cur, &tmp, &i);
> > +	if (error)
> > +		goto out_error;
> > +	XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, i == 1, out_error);
> > +
> > +	/* If the extent ends before the start, look at the next one */
> > +	if (tmp.rc_startblock + tmp.rc_blockcount <= agbno) {
> > +		error = xfs_btree_increment(cur, 0, &have);
> > +		if (error)
> > +			goto out_error;
> > +		if (!have)
> > +			goto done;
> > +		error = xfs_refcount_get_rec(cur, &tmp, &i);
> > +		if (error)
> > +			goto out_error;
> > +		XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, i == 1, out_error);
> > +	}
> > +
> > +	/* If the extent ends after the range we want, bail out */
> 
> Nit:			 starts

Fixed.

> > +	if (tmp.rc_startblock >= agbno + aglen)
> > +		goto done;
> > +
> > +	/* We found the start of a shared extent! */
> > +	if (tmp.rc_startblock < agbno) {
> > +		tmp.rc_blockcount -= (agbno - tmp.rc_startblock);
> > +		tmp.rc_startblock = agbno;
> > +	}
> > +
> > +	*fbno = tmp.rc_startblock;
> > +	*flen = min(tmp.rc_blockcount, agbno + aglen - *fbno);
> > +	if (!find_maximal)
> > +		goto done;
> > +
> > +	/* Otherwise, find the end of this shared extent */
> > +	while (*fbno + *flen < agbno + aglen) {
> > +		error = xfs_btree_increment(cur, 0, &have);
> > +		if (error)
> > +			goto out_error;
> > +		if (!have)
> > +			break;
> > +		error = xfs_refcount_get_rec(cur, &tmp, &i);
> > +		if (error)
> > +			goto out_error;
> > +		XFS_WANT_CORRUPTED_GOTO(cur->bc_mp, i == 1, out_error);
> > +		if (tmp.rc_startblock >= agbno + aglen ||
> > +		    tmp.rc_startblock != *fbno + *flen)
> > +			break;
> 
> FWIW, my impression from the comment above the function is that
> find_maximal means to find the longest shared extent in the range of
> blocks. Rather, this appears to extend the first found shared range to
> include separate, but physically contiguous records (which I assume
> means that only the refcounts differ). Perhaps changing "find_maximal"
> to "ignore_refcount" or something better might be more clear?

That's correct.

I'm going to change the name to "find_end_of_shared".

--D

> 
> Brian
> 
> > +		*flen = min(*flen + tmp.rc_blockcount, agbno + aglen - *fbno);
> > +	}
> > +
> > +done:
> > +	trace_xfs_refcount_find_shared_result(cur->bc_mp,
> > +			cur->bc_private.a.agno, *fbno, *flen);
> > +
> > +out_error:
> > +	if (error)
> > +		trace_xfs_refcount_find_shared_error(cur->bc_mp,
> > +				cur->bc_private.a.agno, error, _RET_IP_);
> > +	return error;
> > +}
> > diff --git a/fs/xfs/libxfs/xfs_refcount.h b/fs/xfs/libxfs/xfs_refcount.h
> > index 7e750a5..48c576c 100644
> > --- a/fs/xfs/libxfs/xfs_refcount.h
> > +++ b/fs/xfs/libxfs/xfs_refcount.h
> > @@ -54,4 +54,8 @@ extern int xfs_refcount_finish_one(struct xfs_trans *tp,
> >  		xfs_fsblock_t startblock, xfs_extlen_t blockcount,
> >  		xfs_extlen_t *adjusted, struct xfs_btree_cur **pcur);
> >  
> > +extern int xfs_refcount_find_shared(struct xfs_btree_cur *cur,
> > +		xfs_agblock_t agbno, xfs_extlen_t aglen, xfs_agblock_t *fbno,
> > +		xfs_extlen_t *flen, bool find_maximal);
> > +
> >  #endif	/* __XFS_REFCOUNT_H__ */
> > 

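The search behavior the two settle on above, finding the first shared run and then optionally extending it across physically contiguous records regardless of their refcounts, can be sketched in userspace over a sorted array of records (a hypothetical model, not the btree-cursor code; record and function names are invented):

```c
#include <assert.h>
#include <stddef.h>

/* A refcount record: (start block, block count, reference count). */
struct refc_rec {
	unsigned int	start;
	unsigned int	len;
	unsigned int	refcount;
};

/*
 * Find the lowest-numbered run of shared blocks (refcount > 1) that
 * overlaps [agbno, agbno + aglen).  If find_end_of_shared is set,
 * extend the result across physically contiguous shared records.
 * If nothing is shared, *flen is left at zero.
 */
static void find_shared(const struct refc_rec *recs, size_t nrecs,
			unsigned int agbno, unsigned int aglen,
			int find_end_of_shared,
			unsigned int *fbno, unsigned int *flen)
{
	size_t		i;

	/* By default, skip the whole range. */
	*fbno = agbno + aglen;
	*flen = 0;

	for (i = 0; i < nrecs; i++) {
		const struct refc_rec *r = &recs[i];

		if (r->start + r->len <= agbno || r->refcount < 2)
			continue;
		if (r->start >= agbno + aglen)
			return;

		/* Found the start of a shared extent; clamp to the range. */
		*fbno = r->start > agbno ? r->start : agbno;
		*flen = r->start + r->len - *fbno;
		if (*fbno + *flen > agbno + aglen)
			*flen = agbno + aglen - *fbno;
		break;
	}
	if (*flen == 0 || !find_end_of_shared)
		return;

	/* Extend across physically contiguous shared records. */
	for (i++; i < nrecs; i++) {
		const struct refc_rec *r = &recs[i];

		if (r->refcount < 2 || r->start != *fbno + *flen)
			break;
		*flen = r->start + r->len - *fbno;
		if (*fbno + *flen > agbno + aglen) {
			*flen = agbno + aglen - *fbno;
			break;
		}
	}
}

/* Helper for the tests below: fixed records, returns the found length. */
static unsigned int shared_len(unsigned int agbno, unsigned int aglen,
			       int find_end)
{
	static const struct refc_rec recs[] = {
		{ 10, 5, 2 }, { 15, 5, 3 }, { 30, 4, 2 },
	};
	unsigned int fbno, flen;

	find_shared(recs, 3, agbno, aglen, find_end, &fbno, &flen);
	return flen;
}
```

Note how the two adjacent records (10,5) and (15,5) are merged into one shared run even though their refcounts differ, which is exactly the ambiguity behind the rename to "find_end_of_shared".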

* Re: [PATCH 20/63] xfs: log bmap intent items
  2016-09-30 19:22   ` Brian Foster
@ 2016-09-30 19:52     ` Darrick J. Wong
  0 siblings, 0 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30 19:52 UTC (permalink / raw)
  To: Brian Foster; +Cc: david, linux-xfs

On Fri, Sep 30, 2016 at 03:22:14PM -0400, Brian Foster wrote:
> On Thu, Sep 29, 2016 at 08:07:45PM -0700, Darrick J. Wong wrote:
> > Provide a mechanism for higher levels to create BUI/BUD items, submit
> > them to the log, and a stub function to deal with recovered BUI items.
> > These parts will be connected to the rmapbt in a later patch.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> > v2: Only support one item per BUI.
> > ---
> >  fs/xfs/Makefile          |    1 
> >  fs/xfs/libxfs/xfs_bmap.h |   14 ++++
> >  fs/xfs/xfs_bmap_item.c   |   69 ++++++++++++++++++
> >  fs/xfs/xfs_bmap_item.h   |    1 
> >  fs/xfs/xfs_log_recover.c |  177 ++++++++++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/xfs_trans.h       |   13 +++
> >  fs/xfs/xfs_trans_bmap.c  |   84 ++++++++++++++++++++++
> >  7 files changed, 359 insertions(+)
> >  create mode 100644 fs/xfs/xfs_trans_bmap.c
> > 
> > 
> ...
> > diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
> > index ea736af..4e46b63 100644
> > --- a/fs/xfs/xfs_bmap_item.c
> > +++ b/fs/xfs/xfs_bmap_item.c
> ...
> > @@ -372,3 +378,66 @@ xfs_bud_init(
> >  
> >  	return budp;
> >  }
> > +
> > +/*
> > + * Process a bmap update intent item that was recovered from the log.
> > + * We need to update some inode's bmbt.
> > + */
> > +int
> > +xfs_bui_recover(
> > +	struct xfs_mount		*mp,
> > +	struct xfs_bui_log_item		*buip)
> > +{
> > +	int				error = 0;
> > +	struct xfs_map_extent		*bmap;
> > +	xfs_fsblock_t			startblock_fsb;
> > +	xfs_fsblock_t			inode_fsb;
> > +	bool				op_ok;
> > +
> > +	ASSERT(!test_bit(XFS_BUI_RECOVERED, &buip->bui_flags));
> > +
> > +	/* Only one mapping operation per BUI... */
> > +	if (buip->bui_format.bui_nextents != XFS_BUI_MAX_FAST_EXTENTS) {
> > +		set_bit(XFS_BUI_RECOVERED, &buip->bui_flags);
> > +		xfs_bui_release(buip);
> > +		return -EIO;
> > +	}
> > +
> > +	/*
> > +	 * First check the validity of the extent described by the
> > +	 * BUI.  If anything is bad, then toss the BUI.
> > +	 */
> > +	bmap = &buip->bui_format.bui_extents[0];
> > +	startblock_fsb = XFS_BB_TO_FSB(mp,
> > +			   XFS_FSB_TO_DADDR(mp, bmap->me_startblock));
> > +	inode_fsb = XFS_BB_TO_FSB(mp, XFS_FSB_TO_DADDR(mp,
> > +			XFS_INO_TO_FSB(mp, bmap->me_owner)));
> > +	switch (bmap->me_flags & XFS_BMAP_EXTENT_TYPE_MASK) {
> > +	case XFS_BMAP_MAP:
> > +	case XFS_BMAP_UNMAP:
> > +		op_ok = true;
> > +		break;
> > +	default:
> > +		op_ok = false;
> > +		break;
> > +	}
> > +	if (!op_ok || startblock_fsb == 0 ||
> > +	    bmap->me_len == 0 ||
> > +	    inode_fsb == 0 ||
> > +	    startblock_fsb >= mp->m_sb.sb_dblocks ||
> > +	    bmap->me_len >= mp->m_sb.sb_agblocks ||
> > +	    inode_fsb >= mp->m_sb.sb_agblocks ||
> 
> Did you mean sb_dblocks here?

Yes, thank you.  I'll fix that.

--D

> 
> Brian
> 
> > +	    (bmap->me_flags & ~XFS_BMAP_EXTENT_FLAGS)) {
> > +		/*
> > +		 * This will pull the BUI from the AIL and
> > +		 * free the memory associated with it.
> > +		 */
> > +		set_bit(XFS_BUI_RECOVERED, &buip->bui_flags);
> > +		xfs_bui_release(buip);
> > +		return -EIO;
> > +	}
> > +
> > +	set_bit(XFS_BUI_RECOVERED, &buip->bui_flags);
> > +	xfs_bui_release(buip);
> > +	return error;
> > +}
> > diff --git a/fs/xfs/xfs_bmap_item.h b/fs/xfs/xfs_bmap_item.h
> > index 57c13d3..c867daa 100644
> > --- a/fs/xfs/xfs_bmap_item.h
> > +++ b/fs/xfs/xfs_bmap_item.h
> > @@ -93,5 +93,6 @@ struct xfs_bud_log_item *xfs_bud_init(struct xfs_mount *,
> >  		struct xfs_bui_log_item *);
> >  void xfs_bui_item_free(struct xfs_bui_log_item *);
> >  void xfs_bui_release(struct xfs_bui_log_item *);
> > +int xfs_bui_recover(struct xfs_mount *mp, struct xfs_bui_log_item *buip);
> >  
> >  #endif	/* __XFS_BMAP_ITEM_H__ */
> > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> > index 622881a..9697e94 100644
> > --- a/fs/xfs/xfs_log_recover.c
> > +++ b/fs/xfs/xfs_log_recover.c
> > @@ -46,6 +46,7 @@
> >  #include "xfs_rmap_item.h"
> >  #include "xfs_buf_item.h"
> >  #include "xfs_refcount_item.h"
> > +#include "xfs_bmap_item.h"
> >  
> >  #define BLK_AVG(blk1, blk2)	((blk1+blk2) >> 1)
> >  
> > @@ -1927,6 +1928,8 @@ xlog_recover_reorder_trans(
> >  		case XFS_LI_RUD:
> >  		case XFS_LI_CUI:
> >  		case XFS_LI_CUD:
> > +		case XFS_LI_BUI:
> > +		case XFS_LI_BUD:
> >  			trace_xfs_log_recover_item_reorder_tail(log,
> >  							trans, item, pass);
> >  			list_move_tail(&item->ri_list, &inode_list);
> > @@ -3671,6 +3674,125 @@ xlog_recover_cud_pass2(
> >  }
> >  
> >  /*
> > + * Copy a BUI format buffer from the given buf into the destination
> > + * BUI format structure.  The BUI/BUD items were designed not to need any
> > + * special alignment handling.
> > + */
> > +static int
> > +xfs_bui_copy_format(
> > +	struct xfs_log_iovec		*buf,
> > +	struct xfs_bui_log_format	*dst_bui_fmt)
> > +{
> > +	struct xfs_bui_log_format	*src_bui_fmt;
> > +	uint				len;
> > +
> > +	src_bui_fmt = buf->i_addr;
> > +	len = xfs_bui_log_format_sizeof(src_bui_fmt->bui_nextents);
> > +
> > +	if (buf->i_len == len) {
> > +		memcpy(dst_bui_fmt, src_bui_fmt, len);
> > +		return 0;
> > +	}
> > +	return -EFSCORRUPTED;
> > +}
> > +
> > +/*
> > + * This routine is called to create an in-core extent bmap update
> > + * item from the bui format structure which was logged on disk.
> > + * It allocates an in-core bui, copies the extents from the format
> > + * structure into it, and adds the bui to the AIL with the given
> > + * LSN.
> > + */
> > +STATIC int
> > +xlog_recover_bui_pass2(
> > +	struct xlog			*log,
> > +	struct xlog_recover_item	*item,
> > +	xfs_lsn_t			lsn)
> > +{
> > +	int				error;
> > +	struct xfs_mount		*mp = log->l_mp;
> > +	struct xfs_bui_log_item		*buip;
> > +	struct xfs_bui_log_format	*bui_formatp;
> > +
> > +	bui_formatp = item->ri_buf[0].i_addr;
> > +
> > +	if (bui_formatp->bui_nextents != XFS_BUI_MAX_FAST_EXTENTS)
> > +		return -EFSCORRUPTED;
> > +	buip = xfs_bui_init(mp);
> > +	error = xfs_bui_copy_format(&item->ri_buf[0], &buip->bui_format);
> > +	if (error) {
> > +		xfs_bui_item_free(buip);
> > +		return error;
> > +	}
> > +	atomic_set(&buip->bui_next_extent, bui_formatp->bui_nextents);
> > +
> > +	spin_lock(&log->l_ailp->xa_lock);
> > +	/*
> > +	 * The BUI has two references. One for the BUD and one for the BUI to
> > +	 * ensure it makes it into the AIL. Insert the BUI into the AIL directly
> > +	 * and drop the BUI reference. Note that xfs_trans_ail_update() drops the
> > +	 * AIL lock.
> > +	 */
> > +	xfs_trans_ail_update(log->l_ailp, &buip->bui_item, lsn);
> > +	xfs_bui_release(buip);
> > +	return 0;
> > +}
> > +
> > +
> > +/*
> > + * This routine is called when a BUD format structure is found in a committed
> > + * transaction in the log. Its purpose is to cancel the corresponding BUI if it
> > + * was still in the log. To do this it searches the AIL for the BUI with an id
> > + * equal to that in the BUD format structure. If we find it we drop the BUD
> > + * reference, which removes the BUI from the AIL and frees it.
> > + */
> > +STATIC int
> > +xlog_recover_bud_pass2(
> > +	struct xlog			*log,
> > +	struct xlog_recover_item	*item)
> > +{
> > +	struct xfs_bud_log_format	*bud_formatp;
> > +	struct xfs_bui_log_item		*buip = NULL;
> > +	struct xfs_log_item		*lip;
> > +	__uint64_t			bui_id;
> > +	struct xfs_ail_cursor		cur;
> > +	struct xfs_ail			*ailp = log->l_ailp;
> > +
> > +	bud_formatp = item->ri_buf[0].i_addr;
> > +	if (item->ri_buf[0].i_len != sizeof(struct xfs_bud_log_format))
> > +		return -EFSCORRUPTED;
> > +	bui_id = bud_formatp->bud_bui_id;
> > +
> > +	/*
> > +	 * Search for the BUI with the id in the BUD format structure in the
> > +	 * AIL.
> > +	 */
> > +	spin_lock(&ailp->xa_lock);
> > +	lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
> > +	while (lip != NULL) {
> > +		if (lip->li_type == XFS_LI_BUI) {
> > +			buip = (struct xfs_bui_log_item *)lip;
> > +			if (buip->bui_format.bui_id == bui_id) {
> > +				/*
> > +				 * Drop the BUD reference to the BUI. This
> > +				 * removes the BUI from the AIL and frees it.
> > +				 */
> > +				spin_unlock(&ailp->xa_lock);
> > +				xfs_bui_release(buip);
> > +				spin_lock(&ailp->xa_lock);
> > +				break;
> > +			}
> > +		}
> > +		lip = xfs_trans_ail_cursor_next(ailp, &cur);
> > +	}
> > +
> > +	xfs_trans_ail_cursor_done(&cur);
> > +	spin_unlock(&ailp->xa_lock);
> > +
> > +	return 0;
> > +}
> > +
> > +/*
> >   * This routine is called when an inode create format structure is found in a
> >   * committed transaction in the log.  It's purpose is to initialise the inodes
> >   * being allocated on disk. This requires us to get inode cluster buffers that
> > @@ -3899,6 +4021,8 @@ xlog_recover_ra_pass2(
> >  	case XFS_LI_RUD:
> >  	case XFS_LI_CUI:
> >  	case XFS_LI_CUD:
> > +	case XFS_LI_BUI:
> > +	case XFS_LI_BUD:
> >  	default:
> >  		break;
> >  	}
> > @@ -3926,6 +4050,8 @@ xlog_recover_commit_pass1(
> >  	case XFS_LI_RUD:
> >  	case XFS_LI_CUI:
> >  	case XFS_LI_CUD:
> > +	case XFS_LI_BUI:
> > +	case XFS_LI_BUD:
> >  		/* nothing to do in pass 1 */
> >  		return 0;
> >  	default:
> > @@ -3964,6 +4090,10 @@ xlog_recover_commit_pass2(
> >  		return xlog_recover_cui_pass2(log, item, trans->r_lsn);
> >  	case XFS_LI_CUD:
> >  		return xlog_recover_cud_pass2(log, item);
> > +	case XFS_LI_BUI:
> > +		return xlog_recover_bui_pass2(log, item, trans->r_lsn);
> > +	case XFS_LI_BUD:
> > +		return xlog_recover_bud_pass2(log, item);
> >  	case XFS_LI_DQUOT:
> >  		return xlog_recover_dquot_pass2(log, buffer_list, item,
> >  						trans->r_lsn);
> > @@ -4591,6 +4721,46 @@ xlog_recover_cancel_cui(
> >  	spin_lock(&ailp->xa_lock);
> >  }
> >  
> > +/* Recover the BUI if necessary. */
> > +STATIC int
> > +xlog_recover_process_bui(
> > +	struct xfs_mount		*mp,
> > +	struct xfs_ail			*ailp,
> > +	struct xfs_log_item		*lip)
> > +{
> > +	struct xfs_bui_log_item		*buip;
> > +	int				error;
> > +
> > +	/*
> > +	 * Skip BUIs that we've already processed.
> > +	 */
> > +	buip = container_of(lip, struct xfs_bui_log_item, bui_item);
> > +	if (test_bit(XFS_BUI_RECOVERED, &buip->bui_flags))
> > +		return 0;
> > +
> > +	spin_unlock(&ailp->xa_lock);
> > +	error = xfs_bui_recover(mp, buip);
> > +	spin_lock(&ailp->xa_lock);
> > +
> > +	return error;
> > +}
> > +
> > +/* Release the BUI since we're cancelling everything. */
> > +STATIC void
> > +xlog_recover_cancel_bui(
> > +	struct xfs_mount		*mp,
> > +	struct xfs_ail			*ailp,
> > +	struct xfs_log_item		*lip)
> > +{
> > +	struct xfs_bui_log_item		*buip;
> > +
> > +	buip = container_of(lip, struct xfs_bui_log_item, bui_item);
> > +
> > +	spin_unlock(&ailp->xa_lock);
> > +	xfs_bui_release(buip);
> > +	spin_lock(&ailp->xa_lock);
> > +}
> > +
> >  /* Is this log item a deferred action intent? */
> >  static inline bool xlog_item_is_intent(struct xfs_log_item *lip)
> >  {
> > @@ -4598,6 +4768,7 @@ static inline bool xlog_item_is_intent(struct xfs_log_item *lip)
> >  	case XFS_LI_EFI:
> >  	case XFS_LI_RUI:
> >  	case XFS_LI_CUI:
> > +	case XFS_LI_BUI:
> >  		return true;
> >  	default:
> >  		return false;
> > @@ -4664,6 +4835,9 @@ xlog_recover_process_intents(
> >  		case XFS_LI_CUI:
> >  			error = xlog_recover_process_cui(log->l_mp, ailp, lip);
> >  			break;
> > +		case XFS_LI_BUI:
> > +			error = xlog_recover_process_bui(log->l_mp, ailp, lip);
> > +			break;
> >  		}
> >  		if (error)
> >  			goto out;
> > @@ -4714,6 +4888,9 @@ xlog_recover_cancel_intents(
> >  		case XFS_LI_CUI:
> >  			xlog_recover_cancel_cui(log->l_mp, ailp, lip);
> >  			break;
> > +		case XFS_LI_BUI:
> > +			xlog_recover_cancel_bui(log->l_mp, ailp, lip);
> > +			break;
> >  		}
> >  
> >  		lip = xfs_trans_ail_cursor_next(ailp, &cur);
> > diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> > index a7a87d2..7cf02d3 100644
> > --- a/fs/xfs/xfs_trans.h
> > +++ b/fs/xfs/xfs_trans.h
> > @@ -39,6 +39,7 @@ struct xfs_btree_cur;
> >  struct xfs_cui_log_item;
> >  struct xfs_cud_log_item;
> >  struct xfs_defer_ops;
> > +struct xfs_bui_log_item;
> >  
> >  typedef struct xfs_log_item {
> >  	struct list_head		li_ail;		/* AIL pointers */
> > @@ -263,4 +264,16 @@ int xfs_trans_log_finish_refcount_update(struct xfs_trans *tp,
> >  		xfs_extlen_t blockcount, xfs_extlen_t *adjusted,
> >  		struct xfs_btree_cur **pcur);
> >  
> > +/* mapping updates */
> > +enum xfs_bmap_intent_type;
> > +
> > +void xfs_bmap_update_init_defer_op(void);
> > +struct xfs_bud_log_item *xfs_trans_get_bud(struct xfs_trans *tp,
> > +		struct xfs_bui_log_item *buip);
> > +int xfs_trans_log_finish_bmap_update(struct xfs_trans *tp,
> > +		struct xfs_bud_log_item *budp, struct xfs_defer_ops *dfops,
> > +		enum xfs_bmap_intent_type type, struct xfs_inode *ip,
> > +		int whichfork, xfs_fileoff_t startoff, xfs_fsblock_t startblock,
> > +		xfs_filblks_t blockcount, xfs_exntst_t state);
> > +
> >  #endif	/* __XFS_TRANS_H__ */
> > diff --git a/fs/xfs/xfs_trans_bmap.c b/fs/xfs/xfs_trans_bmap.c
> > new file mode 100644
> > index 0000000..656d669
> > --- /dev/null
> > +++ b/fs/xfs/xfs_trans_bmap.c
> > @@ -0,0 +1,84 @@
> > +/*
> > + * Copyright (C) 2016 Oracle.  All Rights Reserved.
> > + *
> > + * Author: Darrick J. Wong <darrick.wong@oracle.com>
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU General Public License
> > + * as published by the Free Software Foundation; either version 2
> > + * of the License, or (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it would be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public License
> > + * along with this program; if not, write the Free Software Foundation,
> > + * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
> > + */
> > +#include "xfs.h"
> > +#include "xfs_fs.h"
> > +#include "xfs_shared.h"
> > +#include "xfs_format.h"
> > +#include "xfs_log_format.h"
> > +#include "xfs_trans_resv.h"
> > +#include "xfs_mount.h"
> > +#include "xfs_defer.h"
> > +#include "xfs_trans.h"
> > +#include "xfs_trans_priv.h"
> > +#include "xfs_bmap_item.h"
> > +#include "xfs_alloc.h"
> > +#include "xfs_bmap.h"
> > +#include "xfs_inode.h"
> > +
> > +/*
> > + * This routine is called to allocate a "bmap update done"
> > + * log item.
> > + */
> > +struct xfs_bud_log_item *
> > +xfs_trans_get_bud(
> > +	struct xfs_trans		*tp,
> > +	struct xfs_bui_log_item		*buip)
> > +{
> > +	struct xfs_bud_log_item		*budp;
> > +
> > +	budp = xfs_bud_init(tp->t_mountp, buip);
> > +	xfs_trans_add_item(tp, &budp->bud_item);
> > +	return budp;
> > +}
> > +
> > +/*
> > + * Finish a bmap update and log it to the BUD. Note that the
> > + * transaction is marked dirty regardless of whether the bmap update
> > + * succeeds or fails to support the BUI/BUD lifecycle rules.
> > + */
> > +int
> > +xfs_trans_log_finish_bmap_update(
> > +	struct xfs_trans		*tp,
> > +	struct xfs_bud_log_item		*budp,
> > +	struct xfs_defer_ops		*dop,
> > +	enum xfs_bmap_intent_type	type,
> > +	struct xfs_inode		*ip,
> > +	int				whichfork,
> > +	xfs_fileoff_t			startoff,
> > +	xfs_fsblock_t			startblock,
> > +	xfs_filblks_t			blockcount,
> > +	xfs_exntst_t			state)
> > +{
> > +	int				error;
> > +
> > +	error = -EFSCORRUPTED;
> > +
> > +	/*
> > +	 * Mark the transaction dirty, even on error. This ensures the
> > +	 * transaction is aborted, which:
> > +	 *
> > +	 * 1.) releases the BUI and frees the BUD
> > +	 * 2.) shuts down the filesystem
> > +	 */
> > +	tp->t_flags |= XFS_TRANS_DIRTY;
> > +	budp->bud_item.li_desc->lid_flags |= XFS_LID_DIRTY;
> > +
> > +	return error;
> > +}
> > 

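The BUI/BUD pairing that xlog_recover_bud_pass2() implements, looking up the intent by id and dropping it when the matching done item is seen, can be modeled with an ordinary linked list standing in for the AIL (a hedged sketch; the struct and function names are invented, and the kernel uses the AIL cursor API under a spinlock):

```c
#include <assert.h>
#include <stddef.h>

/* A logged intent awaiting its matching "done" item. */
struct intent {
	unsigned long long	id;
	struct intent		*next;
	int			released;
};

/*
 * Process a done item: search the list for the intent with the same id
 * and drop it.  Intents still on the list after all done items are
 * processed have no matching done record and must be replayed.
 */
static int recover_done(struct intent **head, unsigned long long id)
{
	struct intent **pp;

	for (pp = head; *pp != NULL; pp = &(*pp)->next) {
		if ((*pp)->id == id) {
			struct intent *ip = *pp;

			*pp = ip->next;		/* unlink from the "AIL" */
			ip->released = 1;	/* xfs_bui_release() stand-in */
			return 1;
		}
	}
	return 0;			/* no matching intent: nothing to do */
}

/* Helper for the tests below: three intents, one done item to match. */
static int demo_recover(unsigned long long done_id)
{
	struct intent c = { 3, NULL, 0 };
	struct intent b = { 2, &c, 0 };
	struct intent a = { 1, &b, 0 };
	struct intent *head = &a;

	return recover_done(&head, done_id);
}
```

As in the patch, an unmatched done item is not an error: the intent may have been committed and released in an earlier checkpoint.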

* Re: [PATCH 14/63] xfs: connect refcount adjust functions to upper layers
  2016-09-30 19:40     ` Darrick J. Wong
@ 2016-09-30 20:11       ` Brian Foster
  0 siblings, 0 replies; 187+ messages in thread
From: Brian Foster @ 2016-09-30 20:11 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs

On Fri, Sep 30, 2016 at 12:40:40PM -0700, Darrick J. Wong wrote:
> On Fri, Sep 30, 2016 at 12:21:03PM -0400, Brian Foster wrote:
> > On Thu, Sep 29, 2016 at 08:07:05PM -0700, Darrick J. Wong wrote:
> > > Plumb in the upper level interface to schedule and finish deferred
> > > refcount operations via the deferred ops mechanism.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_defer.h    |    1 
> > >  fs/xfs/libxfs/xfs_refcount.c |  170 ++++++++++++++++++++++++++++++++++++++
> > >  fs/xfs/libxfs/xfs_refcount.h |   12 +++
> > >  fs/xfs/xfs_error.h           |    4 +
> > >  fs/xfs/xfs_refcount_item.c   |   73 ++++++++++++++++
> > >  fs/xfs/xfs_super.c           |    1 
> > >  fs/xfs/xfs_trace.h           |    3 +
> > >  fs/xfs/xfs_trans.h           |    8 +-
> > >  fs/xfs/xfs_trans_refcount.c  |  186 ++++++++++++++++++++++++++++++++++++++++++
> > >  9 files changed, 452 insertions(+), 6 deletions(-)
> > > 
> > > 
> > ...
> > > diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
> > > index 599a8d2..e44007a 100644
> > > --- a/fs/xfs/xfs_refcount_item.c
> > > +++ b/fs/xfs/xfs_refcount_item.c
> > > @@ -396,9 +396,19 @@ xfs_cui_recover(
> > >  {
> > >  	int				i;
> > >  	int				error = 0;
> > > +	unsigned int			refc_type;
> > >  	struct xfs_phys_extent		*refc;
> > >  	xfs_fsblock_t			startblock_fsb;
> > >  	bool				op_ok;
> > > +	struct xfs_cud_log_item		*cudp;
> > > +	struct xfs_trans		*tp;
> > > +	struct xfs_btree_cur		*rcur = NULL;
> > > +	enum xfs_refcount_intent_type	type;
> > > +	xfs_fsblock_t			firstfsb;
> > > +	xfs_extlen_t			adjusted;
> > > +	struct xfs_bmbt_irec		irec;
> > > +	struct xfs_defer_ops		dfops;
> > > +	bool				requeue_only = false;
> > >  
> > >  	ASSERT(!test_bit(XFS_CUI_RECOVERED, &cuip->cui_flags));
> > >  
> > > @@ -437,7 +447,68 @@ xfs_cui_recover(
> > >  		}
> > >  	}
> > >  
> > > +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
> > > +	if (error)
> > > +		return error;
> > > +	cudp = xfs_trans_get_cud(tp, cuip);
> > > +
> > > +	xfs_defer_init(&dfops, &firstfsb);
> > 
> > A comment would be nice here to point out the approach. E.g., that
> > refcount updates are initially deferred under normal runtime
> > circumstances, they handle reservation usage internally/dynamically, and
> > that since we're in recovery, we start the initial update directly and
> > defer the rest that won't fit in the transaction (worded better and
> > assuming I understand all that correctly ;P).
> 
> Yep, your understanding is correct.  I'll put that in as a comment.
> 
> > (Sorry for the comment requests and whatnot, BTW. I'm catching up from a
> > couple weeks of PTO, probably late to the game and not up to speed on
> > the latest status of the patchset. Feel free to defer, drop, or
> > conditionalize any of the aesthetic stuff to whenever is opportune if
> > this stuff is otherwise close to merge).
> 
> NP.  I appreciate review whenever I can get it. :)
> 
> (Plus, you found a bug! :) :))
> 
> > > +	for (i = 0; i < cuip->cui_format.cui_nextents; i++) {
> > > +		refc = &cuip->cui_format.cui_extents[i];
> > > +		refc_type = refc->pe_flags & XFS_REFCOUNT_EXTENT_TYPE_MASK;
> > > +		switch (refc_type) {
> > > +		case XFS_REFCOUNT_INCREASE:
> > > +		case XFS_REFCOUNT_DECREASE:
> > > +		case XFS_REFCOUNT_ALLOC_COW:
> > > +		case XFS_REFCOUNT_FREE_COW:
> > > +			type = refc_type;
> > > +			break;
> > > +		default:
> > > +			error = -EFSCORRUPTED;
> > > +			goto abort_error;
> > > +		}
> > > +		if (requeue_only)
> > > +			adjusted = 0;
> > > +		else
> > > +			error = xfs_trans_log_finish_refcount_update(tp, cudp,
> > > +				&dfops, type, refc->pe_startblock, refc->pe_len,
> > > +				&adjusted, &rcur);
> > > +		if (error)
> > > +			goto abort_error;
> > > +
> > > +		/* Requeue what we didn't finish. */
> > > +		if (adjusted < refc->pe_len) {
> > > +			irec.br_startblock = refc->pe_startblock + adjusted;
> > > +			irec.br_blockcount = refc->pe_len - adjusted;
> > 
> > Hmm, so it appears we walk the range of blocks from beginning to end,
> > but the refcount update code doesn't necessarily always work that way.
> > It merges the boundaries and walks the middle range from start to end.
> > So what happens if the call above ends up doing a right merge and then
> > skips out on any other changes due to the transaction reservation?
> 
> D'oh!  You've found a bug!  _refcount_adjust needs to communicate to
> its caller how much work is left, which does by incrementing *adjusted
> every time it finishes more work.  The caller then moves the start of
> the extent upwards by *adjusted.  Unfortunately, as you point out, a
> right merge actually does work at the upper end of the extent, and this
> is not correctly accounted for.
> 
> To fix this, I'll change _refcount_adjust to report the unfinished
> extent directly to the caller, which will simplify both the function and
> its callers' accounting considerably.
> 
> Good catch!
> 

Ok. Another option might be to perform the refcount update work in
order, but whatever is easier/cleaner is probably fine (and it's
probably easier to update the interface than the mechanism at this
point).
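
The accounting fix Darrick proposes above, having the worker report the unfinished extent itself rather than a "blocks adjusted" count, can be illustrated with a toy model (nothing here is XFS code; names and the per-call "budget" are invented). The point is that when a right merge completes work at the far edge of the range, a single count is ambiguous, while an explicit leftover extent is not:

```c
#include <assert.h>

/* A physical extent: start block and length. */
struct extent {
	unsigned int	start;
	unsigned int	len;
};

/*
 * Pretend one transaction can afford 'budget' blocks of refcount work.
 * If a right merge happens, the rightmost block is consumed first.
 * Instead of returning a count, report the unfinished extent in *left
 * so the caller can requeue exactly that range.
 */
static void adjust_step(struct extent e, unsigned int budget,
			int right_merge, struct extent *left)
{
	if (right_merge && e.len > 0) {
		e.len--;			/* work done at the right edge */
		budget = budget > 0 ? budget - 1 : 0;
	}
	if (budget >= e.len) {			/* everything else finished */
		left->start = e.start + e.len;
		left->len = 0;
		return;
	}
	left->start = e.start + budget;		/* unfinished middle range */
	left->len = e.len - budget;
}

/* Helper for the tests below: leftover length after one step. */
static unsigned int leftover_len(unsigned int len, unsigned int budget,
				 int right_merge)
{
	struct extent left;

	adjust_step((struct extent){ 0, len }, budget, right_merge, &left);
	return left.len;
}
```

Had the caller instead advanced the start by the completed count, a right merge would make it requeue a range that includes the already-merged tail block and skips an unprocessed one, which is the bug Brian spotted.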

> > Brian
> > 
> > P.S., Even if I'm missing something and this is not an issue, do we have
> > any log recovery oriented reflink xfstests in the current test pile? If
> > not, I'd suggest that something as simple as a "do a bunch of reflinks +
> > xfs_io -c 'shutdown -f' + umount/mount" loop could go a long way towards
> > shaking out any issues. Log recovery can be a pita and otherwise
> > problems therein can go undetected for a surprising amount of time.
> 
> xfs/{313,316,321,324,326} use the error injection mechanism to test log
> recovery.
> 

Great, thanks. What about basic reflink support for fsstress? I suppose
if we had that, some of the existing fsstress->crash->recover tests
would provide coverage as well. One of the things I actually do every
now and then is run an infinite fsstress+remount loop in one thread and
a randomly timed (e.g., every 0-30s) fs shutdown trigger in another.
That helps catch recovery issues, corruption issues, etc.

Brian

> --D
> 
> > 
> > > +			switch (type) {
> > > +			case XFS_REFCOUNT_INCREASE:
> > > +				error = xfs_refcount_increase_extent(
> > > +						tp->t_mountp, &dfops, &irec);
> > > +				break;
> > > +			case XFS_REFCOUNT_DECREASE:
> > > +				error = xfs_refcount_decrease_extent(
> > > +						tp->t_mountp, &dfops, &irec);
> > > +				break;
> > > +			default:
> > > +				ASSERT(0);
> > > +			}
> > > +			if (error)
> > > +				goto abort_error;
> > > +			requeue_only = true;
> > > +		}
> > > +	}
> > > +
> > > +	xfs_refcount_finish_one_cleanup(tp, rcur, error);
> > > +	error = xfs_defer_finish(&tp, &dfops, NULL);
> > > +	if (error)
> > > +		goto abort_error;
> > >  	set_bit(XFS_CUI_RECOVERED, &cuip->cui_flags);
> > > -	xfs_cui_release(cuip);
> > > +	error = xfs_trans_commit(tp);
> > > +	return error;
> > > +
> > > +abort_error:
> > > +	xfs_refcount_finish_one_cleanup(tp, rcur, error);
> > > +	xfs_defer_cancel(&dfops);
> > > +	xfs_trans_cancel(tp);
> > >  	return error;
> > >  }
> > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > > index abe69c6..6234622 100644
> > > --- a/fs/xfs/xfs_super.c
> > > +++ b/fs/xfs/xfs_super.c
> > > @@ -1903,6 +1903,7 @@ init_xfs_fs(void)
> > >  
> > >  	xfs_extent_free_init_defer_op();
> > >  	xfs_rmap_update_init_defer_op();
> > > +	xfs_refcount_update_init_defer_op();
> > >  
> > >  	xfs_dir_startup();
> > >  
> > > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > > index fed1906..195a168 100644
> > > --- a/fs/xfs/xfs_trace.h
> > > +++ b/fs/xfs/xfs_trace.h
> > > @@ -2931,6 +2931,9 @@ DEFINE_AG_ERROR_EVENT(xfs_refcount_find_right_extent_error);
> > >  DEFINE_AG_EXTENT_EVENT(xfs_refcount_find_shared);
> > >  DEFINE_AG_EXTENT_EVENT(xfs_refcount_find_shared_result);
> > >  DEFINE_AG_ERROR_EVENT(xfs_refcount_find_shared_error);
> > > +#define DEFINE_REFCOUNT_DEFERRED_EVENT DEFINE_PHYS_EXTENT_DEFERRED_EVENT
> > > +DEFINE_REFCOUNT_DEFERRED_EVENT(xfs_refcount_defer);
> > > +DEFINE_REFCOUNT_DEFERRED_EVENT(xfs_refcount_deferred);
> > >  
> > >  TRACE_EVENT(xfs_refcount_finish_one_leftover,
> > >  	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
> > > diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> > > index fe69e20..a7a87d2 100644
> > > --- a/fs/xfs/xfs_trans.h
> > > +++ b/fs/xfs/xfs_trans.h
> > > @@ -37,6 +37,8 @@ struct xfs_rud_log_item;
> > >  struct xfs_rui_log_item;
> > >  struct xfs_btree_cur;
> > >  struct xfs_cui_log_item;
> > > +struct xfs_cud_log_item;
> > > +struct xfs_defer_ops;
> > >  
> > >  typedef struct xfs_log_item {
> > >  	struct list_head		li_ail;		/* AIL pointers */
> > > @@ -252,11 +254,13 @@ int xfs_trans_log_finish_rmap_update(struct xfs_trans *tp,
> > >  /* refcount updates */
> > >  enum xfs_refcount_intent_type;
> > >  
> > > +void xfs_refcount_update_init_defer_op(void);
> > >  struct xfs_cud_log_item *xfs_trans_get_cud(struct xfs_trans *tp,
> > >  		struct xfs_cui_log_item *cuip);
> > >  int xfs_trans_log_finish_refcount_update(struct xfs_trans *tp,
> > > -		struct xfs_cud_log_item *cudp,
> > > +		struct xfs_cud_log_item *cudp, struct xfs_defer_ops *dfops,
> > >  		enum xfs_refcount_intent_type type, xfs_fsblock_t startblock,
> > > -		xfs_extlen_t blockcount, struct xfs_btree_cur **pcur);
> > > +		xfs_extlen_t blockcount, xfs_extlen_t *adjusted,
> > > +		struct xfs_btree_cur **pcur);
> > >  
> > >  #endif	/* __XFS_TRANS_H__ */
> > > diff --git a/fs/xfs/xfs_trans_refcount.c b/fs/xfs/xfs_trans_refcount.c
> > > index b18d548..e3ac994 100644
> > > --- a/fs/xfs/xfs_trans_refcount.c
> > > +++ b/fs/xfs/xfs_trans_refcount.c
> > > @@ -56,15 +56,17 @@ int
> > >  xfs_trans_log_finish_refcount_update(
> > >  	struct xfs_trans		*tp,
> > >  	struct xfs_cud_log_item		*cudp,
> > > +	struct xfs_defer_ops		*dop,
> > >  	enum xfs_refcount_intent_type	type,
> > >  	xfs_fsblock_t			startblock,
> > >  	xfs_extlen_t			blockcount,
> > > +	xfs_extlen_t			*adjusted,
> > >  	struct xfs_btree_cur		**pcur)
> > >  {
> > >  	int				error;
> > >  
> > > -	/* XXX: leave this empty for now */
> > > -	error = -EFSCORRUPTED;
> > > +	error = xfs_refcount_finish_one(tp, dop, type, startblock,
> > > +			blockcount, adjusted, pcur);
> > >  
> > >  	/*
> > >  	 * Mark the transaction dirty, even on error. This ensures the
> > > @@ -78,3 +80,183 @@ xfs_trans_log_finish_refcount_update(
> > >  
> > >  	return error;
> > >  }
> > > +
> > > +/* Sort refcount intents by AG. */
> > > +static int
> > > +xfs_refcount_update_diff_items(
> > > +	void				*priv,
> > > +	struct list_head		*a,
> > > +	struct list_head		*b)
> > > +{
> > > +	struct xfs_mount		*mp = priv;
> > > +	struct xfs_refcount_intent	*ra;
> > > +	struct xfs_refcount_intent	*rb;
> > > +
> > > +	ra = container_of(a, struct xfs_refcount_intent, ri_list);
> > > +	rb = container_of(b, struct xfs_refcount_intent, ri_list);
> > > +	return  XFS_FSB_TO_AGNO(mp, ra->ri_startblock) -
> > > +		XFS_FSB_TO_AGNO(mp, rb->ri_startblock);
> > > +}
> > > +
> > > +/* Get a CUI. */
> > > +STATIC void *
> > > +xfs_refcount_update_create_intent(
> > > +	struct xfs_trans		*tp,
> > > +	unsigned int			count)
> > > +{
> > > +	struct xfs_cui_log_item		*cuip;
> > > +
> > > +	ASSERT(tp != NULL);
> > > +	ASSERT(count > 0);
> > > +
> > > +	cuip = xfs_cui_init(tp->t_mountp, count);
> > > +	ASSERT(cuip != NULL);
> > > +
> > > +	/*
> > > +	 * Get a log_item_desc to point at the new item.
> > > +	 */
> > > +	xfs_trans_add_item(tp, &cuip->cui_item);
> > > +	return cuip;
> > > +}
> > > +
> > > +/* Set the phys extent flags for this refcount update. */
> > > +static void
> > > +xfs_trans_set_refcount_flags(
> > > +	struct xfs_phys_extent		*refc,
> > > +	enum xfs_refcount_intent_type	type)
> > > +{
> > > +	refc->pe_flags = 0;
> > > +	switch (type) {
> > > +	case XFS_REFCOUNT_INCREASE:
> > > +	case XFS_REFCOUNT_DECREASE:
> > > +	case XFS_REFCOUNT_ALLOC_COW:
> > > +	case XFS_REFCOUNT_FREE_COW:
> > > +		refc->pe_flags |= type;
> > > +		break;
> > > +	default:
> > > +		ASSERT(0);
> > > +	}
> > > +}
> > > +
> > > +/* Log refcount updates in the intent item. */
> > > +STATIC void
> > > +xfs_refcount_update_log_item(
> > > +	struct xfs_trans		*tp,
> > > +	void				*intent,
> > > +	struct list_head		*item)
> > > +{
> > > +	struct xfs_cui_log_item		*cuip = intent;
> > > +	struct xfs_refcount_intent	*refc;
> > > +	uint				next_extent;
> > > +	struct xfs_phys_extent		*ext;
> > > +
> > > +	refc = container_of(item, struct xfs_refcount_intent, ri_list);
> > > +
> > > +	tp->t_flags |= XFS_TRANS_DIRTY;
> > > +	cuip->cui_item.li_desc->lid_flags |= XFS_LID_DIRTY;
> > > +
> > > +	/*
> > > +	 * atomic_inc_return gives us the value after the increment;
> > > +	 * we want to use it as an array index so we need to subtract 1 from
> > > +	 * it.
> > > +	 */
> > > +	next_extent = atomic_inc_return(&cuip->cui_next_extent) - 1;
> > > +	ASSERT(next_extent < cuip->cui_format.cui_nextents);
> > > +	ext = &cuip->cui_format.cui_extents[next_extent];
> > > +	ext->pe_startblock = refc->ri_startblock;
> > > +	ext->pe_len = refc->ri_blockcount;
> > > +	xfs_trans_set_refcount_flags(ext, refc->ri_type);
> > > +}
> > > +
> > > +/* Get a CUD so we can process all the deferred refcount updates. */
> > > +STATIC void *
> > > +xfs_refcount_update_create_done(
> > > +	struct xfs_trans		*tp,
> > > +	void				*intent,
> > > +	unsigned int			count)
> > > +{
> > > +	return xfs_trans_get_cud(tp, intent);
> > > +}
> > > +
> > > +/* Process a deferred refcount update. */
> > > +STATIC int
> > > +xfs_refcount_update_finish_item(
> > > +	struct xfs_trans		*tp,
> > > +	struct xfs_defer_ops		*dop,
> > > +	struct list_head		*item,
> > > +	void				*done_item,
> > > +	void				**state)
> > > +{
> > > +	struct xfs_refcount_intent	*refc;
> > > +	xfs_extlen_t			adjusted;
> > > +	int				error;
> > > +
> > > +	refc = container_of(item, struct xfs_refcount_intent, ri_list);
> > > +	error = xfs_trans_log_finish_refcount_update(tp, done_item, dop,
> > > +			refc->ri_type,
> > > +			refc->ri_startblock,
> > > +			refc->ri_blockcount,
> > > +			&adjusted,
> > > +			(struct xfs_btree_cur **)state);
> > > +	/* Did we run out of reservation?  Requeue what we didn't finish. */
> > > +	if (!error && adjusted < refc->ri_blockcount) {
> > > +		ASSERT(refc->ri_type == XFS_REFCOUNT_INCREASE ||
> > > +		       refc->ri_type == XFS_REFCOUNT_DECREASE);
> > > +		refc->ri_startblock += adjusted;
> > > +		refc->ri_blockcount -= adjusted;
> > > +		return -EAGAIN;
> > > +	}
> > > +	kmem_free(refc);
> > > +	return error;
> > > +}
> > > +
> > > +/* Clean up after processing deferred refcounts. */
> > > +STATIC void
> > > +xfs_refcount_update_finish_cleanup(
> > > +	struct xfs_trans	*tp,
> > > +	void			*state,
> > > +	int			error)
> > > +{
> > > +	struct xfs_btree_cur	*rcur = state;
> > > +
> > > +	xfs_refcount_finish_one_cleanup(tp, rcur, error);
> > > +}
> > > +
> > > +/* Abort all pending CUIs. */
> > > +STATIC void
> > > +xfs_refcount_update_abort_intent(
> > > +	void				*intent)
> > > +{
> > > +	xfs_cui_release(intent);
> > > +}
> > > +
> > > +/* Cancel a deferred refcount update. */
> > > +STATIC void
> > > +xfs_refcount_update_cancel_item(
> > > +	struct list_head		*item)
> > > +{
> > > +	struct xfs_refcount_intent	*refc;
> > > +
> > > +	refc = container_of(item, struct xfs_refcount_intent, ri_list);
> > > +	kmem_free(refc);
> > > +}
> > > +
> > > +static const struct xfs_defer_op_type xfs_refcount_update_defer_type = {
> > > +	.type		= XFS_DEFER_OPS_TYPE_REFCOUNT,
> > > +	.max_items	= XFS_CUI_MAX_FAST_EXTENTS,
> > > +	.diff_items	= xfs_refcount_update_diff_items,
> > > +	.create_intent	= xfs_refcount_update_create_intent,
> > > +	.abort_intent	= xfs_refcount_update_abort_intent,
> > > +	.log_item	= xfs_refcount_update_log_item,
> > > +	.create_done	= xfs_refcount_update_create_done,
> > > +	.finish_item	= xfs_refcount_update_finish_item,
> > > +	.finish_cleanup = xfs_refcount_update_finish_cleanup,
> > > +	.cancel_item	= xfs_refcount_update_cancel_item,
> > > +};
> > > +
> > > +/* Register the deferred op type. */
> > > +void
> > > +xfs_refcount_update_init_defer_op(void)
> > > +{
> > > +	xfs_defer_init_op_type(&xfs_refcount_update_defer_type);
> > > +}
> > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 23/63] xfs: implement deferred bmbt map/unmap operations
  2016-09-30 17:38     ` Darrick J. Wong
@ 2016-09-30 20:34       ` Roger Willcocks
  2016-09-30 21:08         ` Darrick J. Wong
  0 siblings, 1 reply; 187+ messages in thread
From: Roger Willcocks @ 2016-09-30 20:34 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Christoph Hellwig, david, linux-xfs, linux-fsdevel

On Fri, 2016-09-30 at 10:38 -0700, Darrick J. Wong wrote:
> On Fri, Sep 30, 2016 at 12:34:04AM -0700, Christoph Hellwig wrote:
> > > +/* Deferred mapping is only for real extents in the data fork. */
> > > +static bool
> > > +xfs_bmap_is_update_needed(
> > > +	int			whichfork,
> > > +	struct xfs_bmbt_irec	*bmap)
> > > +{
> > > +	ASSERT(whichfork == XFS_DATA_FORK);
> > > +
> > > +	return  bmap->br_startblock != HOLESTARTBLOCK &&
> > > +		bmap->br_startblock != DELAYSTARTBLOCK;
> > > +}
> > 
> > Passing in an argument just to assert on it seems weird.
> > And except for that a better name might be xfs_bmbt_is_real or similar,
> > and I bet we'd have other users for it as well.
> 
> xfs_bmap_*map_extent are the only callers, and the only whichfork
> values are XFS_DATA_FORK.  I might as well just tear out all those
> asserts since they're never going to trigger anyway.
> 

Um, isn't that the point of an assertion?

--
Roger





* Re: [PATCH 23/63] xfs: implement deferred bmbt map/unmap operations
  2016-09-30 20:34       ` Roger Willcocks
@ 2016-09-30 21:08         ` Darrick J. Wong
  0 siblings, 0 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-09-30 21:08 UTC (permalink / raw)
  To: Roger Willcocks; +Cc: Christoph Hellwig, david, linux-xfs, linux-fsdevel

On Fri, Sep 30, 2016 at 09:34:48PM +0100, Roger Willcocks wrote:
> On Fri, 2016-09-30 at 10:38 -0700, Darrick J. Wong wrote:
> > On Fri, Sep 30, 2016 at 12:34:04AM -0700, Christoph Hellwig wrote:
> > > > +/* Deferred mapping is only for real extents in the data fork. */
> > > > +static bool
> > > > +xfs_bmap_is_update_needed(
> > > > +	int			whichfork,
> > > > +	struct xfs_bmbt_irec	*bmap)
> > > > +{
> > > > +	ASSERT(whichfork == XFS_DATA_FORK);
> > > > +
> > > > +	return  bmap->br_startblock != HOLESTARTBLOCK &&
> > > > +		bmap->br_startblock != DELAYSTARTBLOCK;
> > > > +}
> > > 
> > > Passing in an argument just to assert on it seems weird.
> > > And except for that a better name might be xfs_bmbt_is_real or similar,
> > > and I bet we'd have other users for it as well.
> > 
> > xfs_bmap_*map_extent are the only callers, and the only whichfork
> > values are XFS_DATA_FORK.  I might as well just tear out all those
> > asserts since they're never going to trigger anyway.
> > 
> 
> Um, isn't that the point of an assertion?

There's no point in checking whichfork since all callers pass
XFS_DATA_FORK, making the function argument unnecessary.

--D

> 
> --
> Roger
> 
> 
> 


* Re: [PATCH 21/63] xfs: map an inode's offset to an exact physical block
  2016-09-30  3:07 ` [PATCH 21/63] xfs: map an inode's offset to an exact physical block Darrick J. Wong
  2016-09-30  7:31   ` Christoph Hellwig
@ 2016-10-03 19:03   ` Brian Foster
  2016-10-04  0:11     ` Darrick J. Wong
  1 sibling, 1 reply; 187+ messages in thread
From: Brian Foster @ 2016-10-03 19:03 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs

On Thu, Sep 29, 2016 at 08:07:56PM -0700, Darrick J. Wong wrote:
> Teach the bmap routine to know how to map a range of file blocks to a
> specific range of physical blocks, instead of simply allocating fresh
> blocks.  This enables reflink to map a file to blocks that are already
> in use.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_bmap.c |   63 ++++++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/libxfs/xfs_bmap.h |   10 +++++++
>  fs/xfs/xfs_trace.h       |   54 +++++++++++++++++++++++++++++++++++++++
>  3 files changed, 126 insertions(+), 1 deletion(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 907d7b8d..9f145ed 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -3877,6 +3877,55 @@ xfs_bmap_btalloc(
>  }
>  
>  /*
> + * For a remap operation, just "allocate" an extent at the address that the
> + * caller passed in, and ensure that the AGFL is the right size.  The caller
> + * will then map the "allocated" extent into the file somewhere.
> + */
> +STATIC int
> +xfs_bmap_remap_alloc(
> +	struct xfs_bmalloca	*ap)
> +{
> +	struct xfs_trans	*tp = ap->tp;
> +	struct xfs_mount	*mp = tp->t_mountp;
> +	xfs_agblock_t		bno;
> +	struct xfs_alloc_arg	args;
> +	int			error;
> +
> +	/*
> > +	 * validate that the block number is legal - this enables us to detect
> +	 * and handle a silent filesystem corruption rather than crashing.
> +	 */
> +	memset(&args, 0, sizeof(struct xfs_alloc_arg));
> +	args.tp = ap->tp;
> +	args.mp = ap->tp->t_mountp;
> +	bno = *ap->firstblock;
> +	args.agno = XFS_FSB_TO_AGNO(mp, bno);
> +	ASSERT(args.agno < mp->m_sb.sb_agcount);
> +	args.agbno = XFS_FSB_TO_AGBNO(mp, bno);
> +	ASSERT(args.agbno < mp->m_sb.sb_agblocks);
> +
> +	/* "Allocate" the extent from the range we passed in. */
> +	trace_xfs_bmap_remap_alloc(ap->ip, *ap->firstblock, ap->length);
> +	ap->blkno = bno;
> +	ap->ip->i_d.di_nblocks += ap->length;
> +	xfs_trans_log_inode(ap->tp, ap->ip, XFS_ILOG_CORE);
> +
> +	/* Fix the freelist, like a real allocator does. */
> +	args.datatype = ap->datatype;
> +	args.pag = xfs_perag_get(args.mp, args.agno);
> +	ASSERT(args.pag);
> +
> +	error = xfs_alloc_fix_freelist(&args, XFS_ALLOC_FLAG_FREEING);

Why the FREEING flag? 

> +	if (error)
> +		goto error0;
> +error0:
> +	xfs_perag_put(args.pag);
> +	if (error)
> +		trace_xfs_bmap_remap_alloc_error(ap->ip, error, _RET_IP_);
> +	return error;
> +}
> +
> +/*
>   * xfs_bmap_alloc is called by xfs_bmapi to allocate an extent for a file.
>   * It figures out where to ask the underlying allocator to put the new extent.
>   */
> @@ -3884,6 +3933,8 @@ STATIC int
>  xfs_bmap_alloc(
>  	struct xfs_bmalloca	*ap)	/* bmap alloc argument struct */
>  {
> +	if (ap->flags & XFS_BMAPI_REMAP)
> +		return xfs_bmap_remap_alloc(ap);
>  	if (XFS_IS_REALTIME_INODE(ap->ip) &&
>  	    xfs_alloc_is_userdata(ap->datatype))
>  		return xfs_bmap_rtalloc(ap);
> @@ -4442,6 +4493,12 @@ xfs_bmapi_write(
>  	ASSERT(len > 0);
>  	ASSERT(XFS_IFORK_FORMAT(ip, whichfork) != XFS_DINODE_FMT_LOCAL);
>  	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
> +	if (whichfork == XFS_ATTR_FORK)
> +		ASSERT(!(flags & XFS_BMAPI_REMAP));

I think it's better to avoid conditionals if the only affected code
consists of ASSERT() statements (which can be compiled out). E.g., 

	ASSERT(!((flags & XFS_BMAPI_REMAP) && whichfork == XFS_ATTR_FORK));

... and so on, but not a big deal.

Brian

> +	if (flags & XFS_BMAPI_REMAP) {
> +		ASSERT(!(flags & XFS_BMAPI_PREALLOC));
> +		ASSERT(!(flags & XFS_BMAPI_CONVERT));
> +	}
>  
>  	/* zeroing is for currently only for data extents, not metadata */
>  	ASSERT((flags & (XFS_BMAPI_METADATA | XFS_BMAPI_ZERO)) !=
> @@ -4503,6 +4560,12 @@ xfs_bmapi_write(
>  		wasdelay = !inhole && isnullstartblock(bma.got.br_startblock);
>  
>  		/*
> +		 * Make sure we only reflink into a hole.
> +		 */
> +		if (flags & XFS_BMAPI_REMAP)
> +			ASSERT(inhole);
> +
> +		/*
>  		 * First, deal with the hole before the allocated space
>  		 * that we found, if any.
>  		 */
> diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
> index fcdb094..877b6f9 100644
> --- a/fs/xfs/libxfs/xfs_bmap.h
> +++ b/fs/xfs/libxfs/xfs_bmap.h
> @@ -97,6 +97,13 @@ struct xfs_extent_free_item
>   */
>  #define XFS_BMAPI_ZERO		0x080
>  
> +/*
> + * Map the inode offset to the block given in ap->firstblock.  Primarily
> + * used for reflink.  The range must be in a hole, and this flag cannot be
> + * turned on with PREALLOC or CONVERT, and cannot be used on the attr fork.
> + */
> +#define XFS_BMAPI_REMAP		0x100
> +
>  #define XFS_BMAPI_FLAGS \
>  	{ XFS_BMAPI_ENTIRE,	"ENTIRE" }, \
>  	{ XFS_BMAPI_METADATA,	"METADATA" }, \
> @@ -105,7 +112,8 @@ struct xfs_extent_free_item
>  	{ XFS_BMAPI_IGSTATE,	"IGSTATE" }, \
>  	{ XFS_BMAPI_CONTIG,	"CONTIG" }, \
>  	{ XFS_BMAPI_CONVERT,	"CONVERT" }, \
> -	{ XFS_BMAPI_ZERO,	"ZERO" }
> +	{ XFS_BMAPI_ZERO,	"ZERO" }, \
> +	{ XFS_BMAPI_REMAP,	"REMAP" }
>  
>  
>  static inline int xfs_bmapi_aflag(int w)
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index 195a168..8485984 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -2965,6 +2965,60 @@ TRACE_EVENT(xfs_refcount_finish_one_leftover,
>  		  __entry->adjusted)
>  );
>  
> +/* simple inode-based error/%ip tracepoint class */
> +DECLARE_EVENT_CLASS(xfs_inode_error_class,
> +	TP_PROTO(struct xfs_inode *ip, int error, unsigned long caller_ip),
> +	TP_ARGS(ip, error, caller_ip),
> +	TP_STRUCT__entry(
> +		__field(dev_t, dev)
> +		__field(xfs_ino_t, ino)
> +		__field(int, error)
> +		__field(unsigned long, caller_ip)
> +	),
> +	TP_fast_assign(
> +		__entry->dev = VFS_I(ip)->i_sb->s_dev;
> +		__entry->ino = ip->i_ino;
> +		__entry->error = error;
> +		__entry->caller_ip = caller_ip;
> +	),
> +	TP_printk("dev %d:%d ino %llx error %d caller %ps",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> +		  __entry->ino,
> +		  __entry->error,
> +		  (char *)__entry->caller_ip)
> +);
> +
> +#define DEFINE_INODE_ERROR_EVENT(name) \
> +DEFINE_EVENT(xfs_inode_error_class, name, \
> +	TP_PROTO(struct xfs_inode *ip, int error, \
> +		 unsigned long caller_ip), \
> +	TP_ARGS(ip, error, caller_ip))
> +
> +/* reflink allocator */
> +TRACE_EVENT(xfs_bmap_remap_alloc,
> +	TP_PROTO(struct xfs_inode *ip, xfs_fsblock_t fsbno,
> +		 xfs_extlen_t len),
> +	TP_ARGS(ip, fsbno, len),
> +	TP_STRUCT__entry(
> +		__field(dev_t, dev)
> +		__field(xfs_ino_t, ino)
> +		__field(xfs_fsblock_t, fsbno)
> +		__field(xfs_extlen_t, len)
> +	),
> +	TP_fast_assign(
> +		__entry->dev = VFS_I(ip)->i_sb->s_dev;
> +		__entry->ino = ip->i_ino;
> +		__entry->fsbno = fsbno;
> +		__entry->len = len;
> +	),
> +	TP_printk("dev %d:%d ino 0x%llx fsbno 0x%llx len %x",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> +		  __entry->ino,
> +		  __entry->fsbno,
> +		  __entry->len)
> +);
> +DEFINE_INODE_ERROR_EVENT(xfs_bmap_remap_alloc_error);
> +
>  #endif /* _TRACE_XFS_H */
>  
>  #undef TRACE_INCLUDE_PATH
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH 24/63] xfs: when replaying bmap operations, don't let unlinked inodes get reaped
  2016-09-30  3:08 ` [PATCH 24/63] xfs: when replaying bmap operations, don't let unlinked inodes get reaped Darrick J. Wong
  2016-09-30  7:35   ` Christoph Hellwig
@ 2016-10-03 19:04   ` Brian Foster
  2016-10-04  0:29     ` Darrick J. Wong
  1 sibling, 1 reply; 187+ messages in thread
From: Brian Foster @ 2016-10-03 19:04 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs

On Thu, Sep 29, 2016 at 08:08:17PM -0700, Darrick J. Wong wrote:
> Log recovery will iget an inode to replay BUI items and iput the inode
> when it's done.  Unfortunately, the iput will see that i_nlink == 0
> and decide to truncate & free the inode, which prevents us from
> replaying subsequent BUIs.  We can't skip the BUIs because we have to
> replay all the redo items to ensure that atomic operations complete.
> 

The way this is written sort of implies that the inodes will
unconditionally have i_nlink == 0 in this context. That is not the case,
right? I.e., we're just trying to cover the case where bui recovery has
to deal with an inode that happens to be on the unlinked list..?

> Since unlinked inode recovery will reap the inode anyway, we can
> safely introduce a new inode flag to indicate that an inode is in this
> 'unlinked recovery' state and should not be auto-reaped in the
> drop_inode path.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/xfs_bmap_item.c   |    1 +
>  fs/xfs/xfs_inode.c       |    8 ++++++++
>  fs/xfs/xfs_inode.h       |    6 ++++++
>  fs/xfs/xfs_log_recover.c |    1 +
>  4 files changed, 16 insertions(+)
> 
> 
> diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
> index ddda7c3..b1a220f 100644
> --- a/fs/xfs/xfs_bmap_item.c
> +++ b/fs/xfs/xfs_bmap_item.c
> @@ -456,6 +456,7 @@ xfs_bui_recover(
>  	if (error)
>  		goto err_inode;
>  
> +	xfs_iflags_set(ip, XFS_IRECOVER_UNLINKED);

If so, I find the name of the flag a bit confusing because we set it
unconditionally. Perhaps just XFS_IRECOVER or XFS_IRECOVERY is
sufficient?

>  	xfs_defer_init(&dfops, &firstfsb);
>  
>  	/* Process deferred bmap item. */
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index e08eaea..0c25a76 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -1855,6 +1855,14 @@ xfs_inactive(
>  	if (mp->m_flags & XFS_MOUNT_RDONLY)
>  		return;
>  
> +	/*
> +	 * If this unlinked inode is in the middle of recovery, don't
> +	 * truncate and free the inode just yet; log recovery will take
> +	 * care of that.  See the comment for this inode flag.
> +	 */
> +	if (xfs_iflags_test(ip, XFS_IRECOVER_UNLINKED))
> +		return;
> +

Also, it might be better to push this one block of code down since the
following block still deals with i_nlink > 0 properly (not that it will
likely affect the code as it is now, since we only handle eofblocks
trimming atm).

Brian

>  	if (VFS_I(ip)->i_nlink != 0) {
>  		/*
>  		 * force is true because we are evicting an inode from the
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index a8658e6..46632f1 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -222,6 +222,12 @@ static inline bool xfs_is_reflink_inode(struct xfs_inode *ip)
>  #define XFS_IPINNED		(1 << __XFS_IPINNED_BIT)
>  #define XFS_IDONTCACHE		(1 << 9) /* don't cache the inode long term */
>  #define XFS_IEOFBLOCKS		(1 << 10)/* has the preallocblocks tag set */
> +/*
> + * If this unlinked inode is in the middle of recovery, don't let drop_inode
> + * truncate and free the inode.  This can happen if we iget the inode during
> + * log recovery to replay a bmap operation on the inode.
> + */
> +#define XFS_IRECOVER_UNLINKED	(1 << 11)
>  
>  /*
>   * Per-lifetime flags need to be reset when re-using a reclaimable inode during
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index 9697e94..b121f02 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -4969,6 +4969,7 @@ xlog_recover_process_one_iunlink(
>  	if (error)
>  		goto fail_iput;
>  
> +	xfs_iflags_clear(ip, XFS_IRECOVER_UNLINKED);
>  	ASSERT(VFS_I(ip)->i_nlink == 0);
>  	ASSERT(VFS_I(ip)->i_mode != 0);
>  
> 


* Re: [PATCH 25/63] xfs: return work remaining at the end of a bunmapi operation
  2016-09-30  3:08 ` [PATCH 25/63] xfs: return work remaining at the end of a bunmapi operation Darrick J. Wong
  2016-09-30  7:19   ` Christoph Hellwig
@ 2016-10-03 19:04   ` Brian Foster
  2016-10-04  0:30     ` Darrick J. Wong
  1 sibling, 1 reply; 187+ messages in thread
From: Brian Foster @ 2016-10-03 19:04 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs

On Thu, Sep 29, 2016 at 08:08:23PM -0700, Darrick J. Wong wrote:
> Return the range of file blocks that bunmapi didn't free.  This hint
> is used by CoW and reflink to figure out what part of an extent
> actually got freed so that it can set up the appropriate atomic
> remapping of just the freed range.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_bmap.c |   36 ++++++++++++++++++++++++++++++------
>  fs/xfs/libxfs/xfs_bmap.h |    4 ++++
>  2 files changed, 34 insertions(+), 6 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index a5e429e..1e4f1a1 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
...
> @@ -5478,6 +5481,27 @@ xfs_bunmapi(
>  	return error;
>  }
>  
> +/* Unmap a range of a file. */
> +int
> +xfs_bunmapi(
> +	xfs_trans_t		*tp,
> +	struct xfs_inode	*ip,
> +	xfs_fileoff_t		bno,
> +	xfs_filblks_t		len,
> +	int			flags,
> +	xfs_extnum_t		nexts,
> +	xfs_fsblock_t		*firstblock,
> +	struct xfs_defer_ops	*dfops,
> +	int			*done)
> +{
> +	int			error;
> +
> +	error = __xfs_bunmapi(tp, ip, bno, &len, flags, nexts, firstblock,
> +			dfops);
> +	*done = (len == 0);
> +	return error;
> +}
> +

I wonder if we really need such a wrapper for this. There aren't too
many xfs_bunmapi() callers and at least a couple of the few that I
checked don't even use 'done.' That can always get fixed up later
though.

Brian

>  /*
>   * Determine whether an extent shift can be accomplished by a merge with the
>   * extent that precedes the target hole of the shift.
> diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
> index 53970b1..48ba3ed 100644
> --- a/fs/xfs/libxfs/xfs_bmap.h
> +++ b/fs/xfs/libxfs/xfs_bmap.h
> @@ -197,6 +197,10 @@ int	xfs_bmapi_write(struct xfs_trans *tp, struct xfs_inode *ip,
>  		xfs_fsblock_t *firstblock, xfs_extlen_t total,
>  		struct xfs_bmbt_irec *mval, int *nmap,
>  		struct xfs_defer_ops *dfops);
> +int	__xfs_bunmapi(struct xfs_trans *tp, struct xfs_inode *ip,
> +		xfs_fileoff_t bno, xfs_filblks_t *rlen, int flags,
> +		xfs_extnum_t nexts, xfs_fsblock_t *firstblock,
> +		struct xfs_defer_ops *dfops);
>  int	xfs_bunmapi(struct xfs_trans *tp, struct xfs_inode *ip,
>  		xfs_fileoff_t bno, xfs_filblks_t len, int flags,
>  		xfs_extnum_t nexts, xfs_fsblock_t *firstblock,
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH 21/63] xfs: map an inode's offset to an exact physical block
  2016-10-03 19:03   ` Brian Foster
@ 2016-10-04  0:11     ` Darrick J. Wong
  2016-10-04 12:43       ` Brian Foster
  0 siblings, 1 reply; 187+ messages in thread
From: Darrick J. Wong @ 2016-10-04  0:11 UTC (permalink / raw)
  To: Brian Foster; +Cc: david, linux-xfs

On Mon, Oct 03, 2016 at 03:03:49PM -0400, Brian Foster wrote:
> On Thu, Sep 29, 2016 at 08:07:56PM -0700, Darrick J. Wong wrote:
> > Teach the bmap routine to know how to map a range of file blocks to a
> > specific range of physical blocks, instead of simply allocating fresh
> > blocks.  This enables reflink to map a file to blocks that are already
> > in use.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/libxfs/xfs_bmap.c |   63 ++++++++++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/libxfs/xfs_bmap.h |   10 +++++++
> >  fs/xfs/xfs_trace.h       |   54 +++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 126 insertions(+), 1 deletion(-)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > index 907d7b8d..9f145ed 100644
> > --- a/fs/xfs/libxfs/xfs_bmap.c
> > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > @@ -3877,6 +3877,55 @@ xfs_bmap_btalloc(
> >  }
> >  
> >  /*
> > + * For a remap operation, just "allocate" an extent at the address that the
> > + * caller passed in, and ensure that the AGFL is the right size.  The caller
> > + * will then map the "allocated" extent into the file somewhere.
> > + */
> > +STATIC int
> > +xfs_bmap_remap_alloc(
> > +	struct xfs_bmalloca	*ap)
> > +{
> > +	struct xfs_trans	*tp = ap->tp;
> > +	struct xfs_mount	*mp = tp->t_mountp;
> > +	xfs_agblock_t		bno;
> > +	struct xfs_alloc_arg	args;
> > +	int			error;
> > +
> > +	/*
> > > +	 * validate that the block number is legal - this enables us to detect
> > +	 * and handle a silent filesystem corruption rather than crashing.
> > +	 */
> > +	memset(&args, 0, sizeof(struct xfs_alloc_arg));
> > +	args.tp = ap->tp;
> > +	args.mp = ap->tp->t_mountp;
> > +	bno = *ap->firstblock;
> > +	args.agno = XFS_FSB_TO_AGNO(mp, bno);
> > +	ASSERT(args.agno < mp->m_sb.sb_agcount);
> > +	args.agbno = XFS_FSB_TO_AGBNO(mp, bno);
> > +	ASSERT(args.agbno < mp->m_sb.sb_agblocks);
> > +
> > +	/* "Allocate" the extent from the range we passed in. */
> > +	trace_xfs_bmap_remap_alloc(ap->ip, *ap->firstblock, ap->length);
> > +	ap->blkno = bno;
> > +	ap->ip->i_d.di_nblocks += ap->length;
> > +	xfs_trans_log_inode(ap->tp, ap->ip, XFS_ILOG_CORE);
> > +
> > +	/* Fix the freelist, like a real allocator does. */
> > +	args.datatype = ap->datatype;
> > +	args.pag = xfs_perag_get(args.mp, args.agno);
> > +	ASSERT(args.pag);
> > +
> > +	error = xfs_alloc_fix_freelist(&args, XFS_ALLOC_FLAG_FREEING);
> 
> Why the FREEING flag? 

/*
 * The freelist fixing code will decline the allocation if
 * the size and shape of the free space doesn't allow for
 * allocating the extent and updating all the metadata that
 * happens during an allocation.  We're remapping, not
 * allocating, so skip that check by pretending to be freeing.
 */

> > +	if (error)
> > +		goto error0;
> > +error0:
> > +	xfs_perag_put(args.pag);
> > +	if (error)
> > +		trace_xfs_bmap_remap_alloc_error(ap->ip, error, _RET_IP_);
> > +	return error;
> > +}
> > +
> > +/*
> >   * xfs_bmap_alloc is called by xfs_bmapi to allocate an extent for a file.
> >   * It figures out where to ask the underlying allocator to put the new extent.
> >   */
> > @@ -3884,6 +3933,8 @@ STATIC int
> >  xfs_bmap_alloc(
> >  	struct xfs_bmalloca	*ap)	/* bmap alloc argument struct */
> >  {
> > +	if (ap->flags & XFS_BMAPI_REMAP)
> > +		return xfs_bmap_remap_alloc(ap);
> >  	if (XFS_IS_REALTIME_INODE(ap->ip) &&
> >  	    xfs_alloc_is_userdata(ap->datatype))
> >  		return xfs_bmap_rtalloc(ap);
> > @@ -4442,6 +4493,12 @@ xfs_bmapi_write(
> >  	ASSERT(len > 0);
> >  	ASSERT(XFS_IFORK_FORMAT(ip, whichfork) != XFS_DINODE_FMT_LOCAL);
> >  	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
> > +	if (whichfork == XFS_ATTR_FORK)
> > +		ASSERT(!(flags & XFS_BMAPI_REMAP));
> 
> I think it's better to avoid conditionals if the only affected code
> consists of ASSERT() statements (which can be compiled out). E.g., 
> 
> 	ASSERT(!((flags & XFS_BMAPI_REMAP) && whichfork == XFS_ATTR_FORK));
> 
> ... and so on, but not a big deal.

<nod> I might as well fix that up too...

--D

> Brian
> 
> > +	if (flags & XFS_BMAPI_REMAP) {
> > +		ASSERT(!(flags & XFS_BMAPI_PREALLOC));
> > +		ASSERT(!(flags & XFS_BMAPI_CONVERT));
> > +	}
> >  
> >  	/* zeroing is currently only for data extents, not metadata */
> >  	ASSERT((flags & (XFS_BMAPI_METADATA | XFS_BMAPI_ZERO)) !=
> > @@ -4503,6 +4560,12 @@ xfs_bmapi_write(
> >  		wasdelay = !inhole && isnullstartblock(bma.got.br_startblock);
> >  
> >  		/*
> > +		 * Make sure we only reflink into a hole.
> > +		 */
> > +		if (flags & XFS_BMAPI_REMAP)
> > +			ASSERT(inhole);
> > +
> > +		/*
> >  		 * First, deal with the hole before the allocated space
> >  		 * that we found, if any.
> >  		 */
> > diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
> > index fcdb094..877b6f9 100644
> > --- a/fs/xfs/libxfs/xfs_bmap.h
> > +++ b/fs/xfs/libxfs/xfs_bmap.h
> > @@ -97,6 +97,13 @@ struct xfs_extent_free_item
> >   */
> >  #define XFS_BMAPI_ZERO		0x080
> >  
> > +/*
> > + * Map the inode offset to the block given in ap->firstblock.  Primarily
> > + * used for reflink.  The range must be in a hole, and this flag cannot be
> > + * turned on with PREALLOC or CONVERT, and cannot be used on the attr fork.
> > + */
> > +#define XFS_BMAPI_REMAP		0x100
> > +
> >  #define XFS_BMAPI_FLAGS \
> >  	{ XFS_BMAPI_ENTIRE,	"ENTIRE" }, \
> >  	{ XFS_BMAPI_METADATA,	"METADATA" }, \
> > @@ -105,7 +112,8 @@ struct xfs_extent_free_item
> >  	{ XFS_BMAPI_IGSTATE,	"IGSTATE" }, \
> >  	{ XFS_BMAPI_CONTIG,	"CONTIG" }, \
> >  	{ XFS_BMAPI_CONVERT,	"CONVERT" }, \
> > -	{ XFS_BMAPI_ZERO,	"ZERO" }
> > +	{ XFS_BMAPI_ZERO,	"ZERO" }, \
> > +	{ XFS_BMAPI_REMAP,	"REMAP" }
> >  
> >  
> >  static inline int xfs_bmapi_aflag(int w)
> > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > index 195a168..8485984 100644
> > --- a/fs/xfs/xfs_trace.h
> > +++ b/fs/xfs/xfs_trace.h
> > @@ -2965,6 +2965,60 @@ TRACE_EVENT(xfs_refcount_finish_one_leftover,
> >  		  __entry->adjusted)
> >  );
> >  
> > +/* simple inode-based error/%ip tracepoint class */
> > +DECLARE_EVENT_CLASS(xfs_inode_error_class,
> > +	TP_PROTO(struct xfs_inode *ip, int error, unsigned long caller_ip),
> > +	TP_ARGS(ip, error, caller_ip),
> > +	TP_STRUCT__entry(
> > +		__field(dev_t, dev)
> > +		__field(xfs_ino_t, ino)
> > +		__field(int, error)
> > +		__field(unsigned long, caller_ip)
> > +	),
> > +	TP_fast_assign(
> > +		__entry->dev = VFS_I(ip)->i_sb->s_dev;
> > +		__entry->ino = ip->i_ino;
> > +		__entry->error = error;
> > +		__entry->caller_ip = caller_ip;
> > +	),
> > +	TP_printk("dev %d:%d ino %llx error %d caller %ps",
> > +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > +		  __entry->ino,
> > +		  __entry->error,
> > +		  (char *)__entry->caller_ip)
> > +);
> > +
> > +#define DEFINE_INODE_ERROR_EVENT(name) \
> > +DEFINE_EVENT(xfs_inode_error_class, name, \
> > +	TP_PROTO(struct xfs_inode *ip, int error, \
> > +		 unsigned long caller_ip), \
> > +	TP_ARGS(ip, error, caller_ip))
> > +
> > +/* reflink allocator */
> > +TRACE_EVENT(xfs_bmap_remap_alloc,
> > +	TP_PROTO(struct xfs_inode *ip, xfs_fsblock_t fsbno,
> > +		 xfs_extlen_t len),
> > +	TP_ARGS(ip, fsbno, len),
> > +	TP_STRUCT__entry(
> > +		__field(dev_t, dev)
> > +		__field(xfs_ino_t, ino)
> > +		__field(xfs_fsblock_t, fsbno)
> > +		__field(xfs_extlen_t, len)
> > +	),
> > +	TP_fast_assign(
> > +		__entry->dev = VFS_I(ip)->i_sb->s_dev;
> > +		__entry->ino = ip->i_ino;
> > +		__entry->fsbno = fsbno;
> > +		__entry->len = len;
> > +	),
> > +	TP_printk("dev %d:%d ino 0x%llx fsbno 0x%llx len %x",
> > +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > +		  __entry->ino,
> > +		  __entry->fsbno,
> > +		  __entry->len)
> > +);
> > +DEFINE_INODE_ERROR_EVENT(xfs_bmap_remap_alloc_error);
> > +
> >  #endif /* _TRACE_XFS_H */
> >  
> >  #undef TRACE_INCLUDE_PATH
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 24/63] xfs: when replaying bmap operations, don't let unlinked inodes get reaped
  2016-10-03 19:04   ` Brian Foster
@ 2016-10-04  0:29     ` Darrick J. Wong
  2016-10-04 12:44       ` Brian Foster
  0 siblings, 1 reply; 187+ messages in thread
From: Darrick J. Wong @ 2016-10-04  0:29 UTC (permalink / raw)
  To: Brian Foster; +Cc: david, linux-xfs

On Mon, Oct 03, 2016 at 03:04:10PM -0400, Brian Foster wrote:
> On Thu, Sep 29, 2016 at 08:08:17PM -0700, Darrick J. Wong wrote:
> > Log recovery will iget an inode to replay BUI items and iput the inode
> > when it's done.  Unfortunately, the iput will see that i_nlink == 0
> > and decide to truncate & free the inode, which prevents us from
> > replaying subsequent BUIs.  We can't skip the BUIs because we have to
> > replay all the redo items to ensure that atomic operations complete.
> > 
> 
> The way this is written sort of implies that the inodes will
> unconditionally have i_nlink == 0 in this context. That is not the case,
> right? I.e., we're just trying to cover the case where bui recovery has
> to deal with an inode that happens to be on the unlinked list..?

Right.  I'll fix the commit message to clarify that.

> 
> > Since unlinked inode recovery will reap the inode anyway, we can
> > safely introduce a new inode flag to indicate that an inode is in this
> > 'unlinked recovery' state and should not be auto-reaped in the
> > drop_inode path.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/xfs_bmap_item.c   |    1 +
> >  fs/xfs/xfs_inode.c       |    8 ++++++++
> >  fs/xfs/xfs_inode.h       |    6 ++++++
> >  fs/xfs/xfs_log_recover.c |    1 +
> >  4 files changed, 16 insertions(+)
> > 
> > 
> > diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
> > index ddda7c3..b1a220f 100644
> > --- a/fs/xfs/xfs_bmap_item.c
> > +++ b/fs/xfs/xfs_bmap_item.c
> > @@ -456,6 +456,7 @@ xfs_bui_recover(
> >  	if (error)
> >  		goto err_inode;
> >  
> > +	xfs_iflags_set(ip, XFS_IRECOVER_UNLINKED);
> 
> If so, I find the name of the flag a bit confusing because we set it
> unconditionally. Perhaps just XFS_IRECOVER or XFS_IRECOVERY is
> sufficient?

Agreed, since it really just means "we're recovering inodes".

> >  	xfs_defer_init(&dfops, &firstfsb);
> >  
> >  	/* Process deferred bmap item. */
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index e08eaea..0c25a76 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -1855,6 +1855,14 @@ xfs_inactive(
> >  	if (mp->m_flags & XFS_MOUNT_RDONLY)
> >  		return;
> >  
> > +	/*
> > +	 * If this unlinked inode is in the middle of recovery, don't
> > +	 * truncate and free the inode just yet; log recovery will take
> > +	 * care of that.  See the comment for this inode flag.
> > +	 */
> > +	if (xfs_iflags_test(ip, XFS_IRECOVER_UNLINKED))
> > +		return;
> > +
> 
> Also, it might be better to push this one block of code down since the
> following block still deals with i_nlink > 0 properly (not that it will
> likely affect the code as it is now, since we only handle eofblocks
> trimming atm).

I put the jump-out case there so that we touch the inode's bmap as little
as possible while we're recovering the inode.  Since the inode is still
around in memory, we'll end up back there at a later point anyway.

--D

> 
> Brian
> 
> >  	if (VFS_I(ip)->i_nlink != 0) {
> >  		/*
> >  		 * force is true because we are evicting an inode from the
> > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> > index a8658e6..46632f1 100644
> > --- a/fs/xfs/xfs_inode.h
> > +++ b/fs/xfs/xfs_inode.h
> > @@ -222,6 +222,12 @@ static inline bool xfs_is_reflink_inode(struct xfs_inode *ip)
> >  #define XFS_IPINNED		(1 << __XFS_IPINNED_BIT)
> >  #define XFS_IDONTCACHE		(1 << 9) /* don't cache the inode long term */
> >  #define XFS_IEOFBLOCKS		(1 << 10)/* has the preallocblocks tag set */
> > +/*
> > + * If this unlinked inode is in the middle of recovery, don't let drop_inode
> > + * truncate and free the inode.  This can happen if we iget the inode during
> > + * log recovery to replay a bmap operation on the inode.
> > + */
> > +#define XFS_IRECOVER_UNLINKED	(1 << 11)
> >  
> >  /*
> >   * Per-lifetime flags need to be reset when re-using a reclaimable inode during
> > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> > index 9697e94..b121f02 100644
> > --- a/fs/xfs/xfs_log_recover.c
> > +++ b/fs/xfs/xfs_log_recover.c
> > @@ -4969,6 +4969,7 @@ xlog_recover_process_one_iunlink(
> >  	if (error)
> >  		goto fail_iput;
> >  
> > +	xfs_iflags_clear(ip, XFS_IRECOVER_UNLINKED);
> >  	ASSERT(VFS_I(ip)->i_nlink == 0);
> >  	ASSERT(VFS_I(ip)->i_mode != 0);
> >  
> > 

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 25/63] xfs: return work remaining at the end of a bunmapi operation
  2016-10-03 19:04   ` Brian Foster
@ 2016-10-04  0:30     ` Darrick J. Wong
  2016-10-04 12:44       ` Brian Foster
  0 siblings, 1 reply; 187+ messages in thread
From: Darrick J. Wong @ 2016-10-04  0:30 UTC (permalink / raw)
  To: Brian Foster; +Cc: david, linux-xfs

On Mon, Oct 03, 2016 at 03:04:31PM -0400, Brian Foster wrote:
> On Thu, Sep 29, 2016 at 08:08:23PM -0700, Darrick J. Wong wrote:
> > Return the range of file blocks that bunmapi didn't free.  This hint
> > is used by CoW and reflink to figure out what part of an extent
> > actually got freed so that it can set up the appropriate atomic
> > remapping of just the freed range.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/libxfs/xfs_bmap.c |   36 ++++++++++++++++++++++++++++++------
> >  fs/xfs/libxfs/xfs_bmap.h |    4 ++++
> >  2 files changed, 34 insertions(+), 6 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > index a5e429e..1e4f1a1 100644
> > --- a/fs/xfs/libxfs/xfs_bmap.c
> > +++ b/fs/xfs/libxfs/xfs_bmap.c
> ...
> > @@ -5478,6 +5481,27 @@ xfs_bunmapi(
> >  	return error;
> >  }
> >  
> > +/* Unmap a range of a file. */
> > +int
> > +xfs_bunmapi(
> > +	xfs_trans_t		*tp,
> > +	struct xfs_inode	*ip,
> > +	xfs_fileoff_t		bno,
> > +	xfs_filblks_t		len,
> > +	int			flags,
> > +	xfs_extnum_t		nexts,
> > +	xfs_fsblock_t		*firstblock,
> > +	struct xfs_defer_ops	*dfops,
> > +	int			*done)
> > +{
> > +	int			error;
> > +
> > +	error = __xfs_bunmapi(tp, ip, bno, &len, flags, nexts, firstblock,
> > +			dfops);
> > +	*done = (len == 0);
> > +	return error;
> > +}
> > +
> 
> I wonder if we really need such a wrapper for this. There aren't too
> many xfs_bunmapi() callers and at least a couple of the few that I
> checked don't even use 'done.' That can always get fixed up later
> though.

Perhaps not, but hch said he'll probably end up refactoring the bunmapi code
soon anyway.

--D

> 
> Brian
> 
> >  /*
> >   * Determine whether an extent shift can be accomplished by a merge with the
> >   * extent that precedes the target hole of the shift.
> > diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
> > index 53970b1..48ba3ed 100644
> > --- a/fs/xfs/libxfs/xfs_bmap.h
> > +++ b/fs/xfs/libxfs/xfs_bmap.h
> > @@ -197,6 +197,10 @@ int	xfs_bmapi_write(struct xfs_trans *tp, struct xfs_inode *ip,
> >  		xfs_fsblock_t *firstblock, xfs_extlen_t total,
> >  		struct xfs_bmbt_irec *mval, int *nmap,
> >  		struct xfs_defer_ops *dfops);
> > +int	__xfs_bunmapi(struct xfs_trans *tp, struct xfs_inode *ip,
> > +		xfs_fileoff_t bno, xfs_filblks_t *rlen, int flags,
> > +		xfs_extnum_t nexts, xfs_fsblock_t *firstblock,
> > +		struct xfs_defer_ops *dfops);
> >  int	xfs_bunmapi(struct xfs_trans *tp, struct xfs_inode *ip,
> >  		xfs_fileoff_t bno, xfs_filblks_t len, int flags,
> >  		xfs_extnum_t nexts, xfs_fsblock_t *firstblock,
> > 

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 21/63] xfs: map an inode's offset to an exact physical block
  2016-10-04  0:11     ` Darrick J. Wong
@ 2016-10-04 12:43       ` Brian Foster
  2016-10-04 17:28         ` Darrick J. Wong
  0 siblings, 1 reply; 187+ messages in thread
From: Brian Foster @ 2016-10-04 12:43 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs

On Mon, Oct 03, 2016 at 05:11:19PM -0700, Darrick J. Wong wrote:
> On Mon, Oct 03, 2016 at 03:03:49PM -0400, Brian Foster wrote:
> > On Thu, Sep 29, 2016 at 08:07:56PM -0700, Darrick J. Wong wrote:
> > > Teach the bmap routine to know how to map a range of file blocks to a
> > > specific range of physical blocks, instead of simply allocating fresh
> > > blocks.  This enables reflink to map a file to blocks that are already
> > > in use.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_bmap.c |   63 ++++++++++++++++++++++++++++++++++++++++++++++
> > >  fs/xfs/libxfs/xfs_bmap.h |   10 +++++++
> > >  fs/xfs/xfs_trace.h       |   54 +++++++++++++++++++++++++++++++++++++++
> > >  3 files changed, 126 insertions(+), 1 deletion(-)
> > > 
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > > index 907d7b8d..9f145ed 100644
> > > --- a/fs/xfs/libxfs/xfs_bmap.c
> > > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > > @@ -3877,6 +3877,55 @@ xfs_bmap_btalloc(
> > >  }
> > >  
> > >  /*
> > > + * For a remap operation, just "allocate" an extent at the address that the
> > > + * caller passed in, and ensure that the AGFL is the right size.  The caller
> > > + * will then map the "allocated" extent into the file somewhere.
> > > + */
> > > +STATIC int
> > > +xfs_bmap_remap_alloc(
> > > +	struct xfs_bmalloca	*ap)
> > > +{
> > > +	struct xfs_trans	*tp = ap->tp;
> > > +	struct xfs_mount	*mp = tp->t_mountp;
> > > +	xfs_agblock_t		bno;
> > > +	struct xfs_alloc_arg	args;
> > > +	int			error;
> > > +
> > > +	/*
> > > +	 * Validate that the block number is legal - this enables us to detect
> > > +	 * and handle silent filesystem corruption rather than crashing.
> > > +	 */
> > > +	memset(&args, 0, sizeof(struct xfs_alloc_arg));
> > > +	args.tp = ap->tp;
> > > +	args.mp = ap->tp->t_mountp;
> > > +	bno = *ap->firstblock;
> > > +	args.agno = XFS_FSB_TO_AGNO(mp, bno);
> > > +	ASSERT(args.agno < mp->m_sb.sb_agcount);
> > > +	args.agbno = XFS_FSB_TO_AGBNO(mp, bno);
> > > +	ASSERT(args.agbno < mp->m_sb.sb_agblocks);
> > > +
> > > +	/* "Allocate" the extent from the range we passed in. */
> > > +	trace_xfs_bmap_remap_alloc(ap->ip, *ap->firstblock, ap->length);
> > > +	ap->blkno = bno;
> > > +	ap->ip->i_d.di_nblocks += ap->length;
> > > +	xfs_trans_log_inode(ap->tp, ap->ip, XFS_ILOG_CORE);
> > > +
> > > +	/* Fix the freelist, like a real allocator does. */
> > > +	args.datatype = ap->datatype;
> > > +	args.pag = xfs_perag_get(args.mp, args.agno);
> > > +	ASSERT(args.pag);
> > > +
> > > +	error = xfs_alloc_fix_freelist(&args, XFS_ALLOC_FLAG_FREEING);
> > 
> > Why the FREEING flag? 
> 
> /*
>  * The freelist fixing code will decline the allocation if
>  * the size and shape of the free space doesn't allow for
>  * allocating the extent and updating all the metadata that
>  * happens during an allocation.  We're remapping, not
>  * allocating, so skip that check by pretending to be freeing.
>  */
> 

Thanks. This can also bypass fixing up the AG freelist. I suppose that
won't matter since we aren't going to update either allocbt, but are we
safe from the rmapbt perspective as well?

Brian

> > > +	if (error)
> > > +		goto error0;
> > > +error0:
> > > +	xfs_perag_put(args.pag);
> > > +	if (error)
> > > +		trace_xfs_bmap_remap_alloc_error(ap->ip, error, _RET_IP_);
> > > +	return error;
> > > +}
> > > +
> > > +/*
> > >   * xfs_bmap_alloc is called by xfs_bmapi to allocate an extent for a file.
> > >   * It figures out where to ask the underlying allocator to put the new extent.
> > >   */
> > > @@ -3884,6 +3933,8 @@ STATIC int
> > >  xfs_bmap_alloc(
> > >  	struct xfs_bmalloca	*ap)	/* bmap alloc argument struct */
> > >  {
> > > +	if (ap->flags & XFS_BMAPI_REMAP)
> > > +		return xfs_bmap_remap_alloc(ap);
> > >  	if (XFS_IS_REALTIME_INODE(ap->ip) &&
> > >  	    xfs_alloc_is_userdata(ap->datatype))
> > >  		return xfs_bmap_rtalloc(ap);
> > > @@ -4442,6 +4493,12 @@ xfs_bmapi_write(
> > >  	ASSERT(len > 0);
> > >  	ASSERT(XFS_IFORK_FORMAT(ip, whichfork) != XFS_DINODE_FMT_LOCAL);
> > >  	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
> > > +	if (whichfork == XFS_ATTR_FORK)
> > > +		ASSERT(!(flags & XFS_BMAPI_REMAP));
> > 
> > I think it's better to avoid conditionals if the only affected code
> > consists of ASSERT() statements (which can be compiled out). E.g., 
> > 
> > 	ASSERT(!((flags & XFS_BMAPI_REMAP) && whichfork == XFS_ATTR_FORK));
> > 
> > ... and so on, but not a big deal.
> 
> <nod> I might as well fix that up too...
> 
> --D
> 
> > Brian
> > 
> > > +	if (flags & XFS_BMAPI_REMAP) {
> > > +		ASSERT(!(flags & XFS_BMAPI_PREALLOC));
> > > +		ASSERT(!(flags & XFS_BMAPI_CONVERT));
> > > +	}
> > >  
> > >  	/* zeroing is currently only for data extents, not metadata */
> > >  	ASSERT((flags & (XFS_BMAPI_METADATA | XFS_BMAPI_ZERO)) !=
> > > @@ -4503,6 +4560,12 @@ xfs_bmapi_write(
> > >  		wasdelay = !inhole && isnullstartblock(bma.got.br_startblock);
> > >  
> > >  		/*
> > > +		 * Make sure we only reflink into a hole.
> > > +		 */
> > > +		if (flags & XFS_BMAPI_REMAP)
> > > +			ASSERT(inhole);
> > > +
> > > +		/*
> > >  		 * First, deal with the hole before the allocated space
> > >  		 * that we found, if any.
> > >  		 */
> > > diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
> > > index fcdb094..877b6f9 100644
> > > --- a/fs/xfs/libxfs/xfs_bmap.h
> > > +++ b/fs/xfs/libxfs/xfs_bmap.h
> > > @@ -97,6 +97,13 @@ struct xfs_extent_free_item
> > >   */
> > >  #define XFS_BMAPI_ZERO		0x080
> > >  
> > > +/*
> > > + * Map the inode offset to the block given in ap->firstblock.  Primarily
> > > + * used for reflink.  The range must be in a hole, and this flag cannot be
> > > + * turned on with PREALLOC or CONVERT, and cannot be used on the attr fork.
> > > + */
> > > +#define XFS_BMAPI_REMAP		0x100
> > > +
> > >  #define XFS_BMAPI_FLAGS \
> > >  	{ XFS_BMAPI_ENTIRE,	"ENTIRE" }, \
> > >  	{ XFS_BMAPI_METADATA,	"METADATA" }, \
> > > @@ -105,7 +112,8 @@ struct xfs_extent_free_item
> > >  	{ XFS_BMAPI_IGSTATE,	"IGSTATE" }, \
> > >  	{ XFS_BMAPI_CONTIG,	"CONTIG" }, \
> > >  	{ XFS_BMAPI_CONVERT,	"CONVERT" }, \
> > > -	{ XFS_BMAPI_ZERO,	"ZERO" }
> > > +	{ XFS_BMAPI_ZERO,	"ZERO" }, \
> > > +	{ XFS_BMAPI_REMAP,	"REMAP" }
> > >  
> > >  
> > >  static inline int xfs_bmapi_aflag(int w)
> > > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > > index 195a168..8485984 100644
> > > --- a/fs/xfs/xfs_trace.h
> > > +++ b/fs/xfs/xfs_trace.h
> > > @@ -2965,6 +2965,60 @@ TRACE_EVENT(xfs_refcount_finish_one_leftover,
> > >  		  __entry->adjusted)
> > >  );
> > >  
> > > +/* simple inode-based error/%ip tracepoint class */
> > > +DECLARE_EVENT_CLASS(xfs_inode_error_class,
> > > +	TP_PROTO(struct xfs_inode *ip, int error, unsigned long caller_ip),
> > > +	TP_ARGS(ip, error, caller_ip),
> > > +	TP_STRUCT__entry(
> > > +		__field(dev_t, dev)
> > > +		__field(xfs_ino_t, ino)
> > > +		__field(int, error)
> > > +		__field(unsigned long, caller_ip)
> > > +	),
> > > +	TP_fast_assign(
> > > +		__entry->dev = VFS_I(ip)->i_sb->s_dev;
> > > +		__entry->ino = ip->i_ino;
> > > +		__entry->error = error;
> > > +		__entry->caller_ip = caller_ip;
> > > +	),
> > > +	TP_printk("dev %d:%d ino %llx error %d caller %ps",
> > > +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > > +		  __entry->ino,
> > > +		  __entry->error,
> > > +		  (char *)__entry->caller_ip)
> > > +);
> > > +
> > > +#define DEFINE_INODE_ERROR_EVENT(name) \
> > > +DEFINE_EVENT(xfs_inode_error_class, name, \
> > > +	TP_PROTO(struct xfs_inode *ip, int error, \
> > > +		 unsigned long caller_ip), \
> > > +	TP_ARGS(ip, error, caller_ip))
> > > +
> > > +/* reflink allocator */
> > > +TRACE_EVENT(xfs_bmap_remap_alloc,
> > > +	TP_PROTO(struct xfs_inode *ip, xfs_fsblock_t fsbno,
> > > +		 xfs_extlen_t len),
> > > +	TP_ARGS(ip, fsbno, len),
> > > +	TP_STRUCT__entry(
> > > +		__field(dev_t, dev)
> > > +		__field(xfs_ino_t, ino)
> > > +		__field(xfs_fsblock_t, fsbno)
> > > +		__field(xfs_extlen_t, len)
> > > +	),
> > > +	TP_fast_assign(
> > > +		__entry->dev = VFS_I(ip)->i_sb->s_dev;
> > > +		__entry->ino = ip->i_ino;
> > > +		__entry->fsbno = fsbno;
> > > +		__entry->len = len;
> > > +	),
> > > +	TP_printk("dev %d:%d ino 0x%llx fsbno 0x%llx len %x",
> > > +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > > +		  __entry->ino,
> > > +		  __entry->fsbno,
> > > +		  __entry->len)
> > > +);
> > > +DEFINE_INODE_ERROR_EVENT(xfs_bmap_remap_alloc_error);
> > > +
> > >  #endif /* _TRACE_XFS_H */
> > >  
> > >  #undef TRACE_INCLUDE_PATH
> > > 

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 24/63] xfs: when replaying bmap operations, don't let unlinked inodes get reaped
  2016-10-04  0:29     ` Darrick J. Wong
@ 2016-10-04 12:44       ` Brian Foster
  2016-10-04 19:07         ` Dave Chinner
  0 siblings, 1 reply; 187+ messages in thread
From: Brian Foster @ 2016-10-04 12:44 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs

On Mon, Oct 03, 2016 at 05:29:25PM -0700, Darrick J. Wong wrote:
> On Mon, Oct 03, 2016 at 03:04:10PM -0400, Brian Foster wrote:
> > On Thu, Sep 29, 2016 at 08:08:17PM -0700, Darrick J. Wong wrote:
> > > Log recovery will iget an inode to replay BUI items and iput the inode
> > > when it's done.  Unfortunately, the iput will see that i_nlink == 0
> > > and decide to truncate & free the inode, which prevents us from
> > > replaying subsequent BUIs.  We can't skip the BUIs because we have to
> > > replay all the redo items to ensure that atomic operations complete.
> > > 
...
> > 
> > > Since unlinked inode recovery will reap the inode anyway, we can
> > > safely introduce a new inode flag to indicate that an inode is in this
> > > 'unlinked recovery' state and should not be auto-reaped in the
> > > drop_inode path.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > >  fs/xfs/xfs_bmap_item.c   |    1 +
> > >  fs/xfs/xfs_inode.c       |    8 ++++++++
> > >  fs/xfs/xfs_inode.h       |    6 ++++++
> > >  fs/xfs/xfs_log_recover.c |    1 +
> > >  4 files changed, 16 insertions(+)
> > > 
> > > 
...
> > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > > index e08eaea..0c25a76 100644
> > > --- a/fs/xfs/xfs_inode.c
> > > +++ b/fs/xfs/xfs_inode.c
> > > @@ -1855,6 +1855,14 @@ xfs_inactive(
> > >  	if (mp->m_flags & XFS_MOUNT_RDONLY)
> > >  		return;
> > >  
> > > +	/*
> > > +	 * If this unlinked inode is in the middle of recovery, don't
> > > +	 * truncate and free the inode just yet; log recovery will take
> > > +	 * care of that.  See the comment for this inode flag.
> > > +	 */
> > > +	if (xfs_iflags_test(ip, XFS_IRECOVER_UNLINKED))
> > > +		return;
> > > +
> > 
> > Also, it might be better to push this one block of code down since the
> > following block still deals with i_nlink > 0 properly (not that it will
> > likely affect the code as it is now, since we only handle eofblocks
> > trimming atm).
> 
> I put the jump-out case there so that we touch the inode's bmap as little
> as possible while we're recovering the inode.  Since the inode is still
> around in memory, we'll end up back there at a later point anyway.
> 

I'm not quite following... it looks like we set the reclaim tag on the
inode unconditionally after we get through xfs_inactive(). That implies
the in-memory inode can go away at any point thereafter, unless somebody
else comes along and happens to look for it. Hmm?

Brian

> --D
> 
> > 
> > Brian
> > 
> > >  	if (VFS_I(ip)->i_nlink != 0) {
> > >  		/*
> > >  		 * force is true because we are evicting an inode from the
> > > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> > > index a8658e6..46632f1 100644
> > > --- a/fs/xfs/xfs_inode.h
> > > +++ b/fs/xfs/xfs_inode.h
> > > @@ -222,6 +222,12 @@ static inline bool xfs_is_reflink_inode(struct xfs_inode *ip)
> > >  #define XFS_IPINNED		(1 << __XFS_IPINNED_BIT)
> > >  #define XFS_IDONTCACHE		(1 << 9) /* don't cache the inode long term */
> > >  #define XFS_IEOFBLOCKS		(1 << 10)/* has the preallocblocks tag set */
> > > +/*
> > > + * If this unlinked inode is in the middle of recovery, don't let drop_inode
> > > + * truncate and free the inode.  This can happen if we iget the inode during
> > > + * log recovery to replay a bmap operation on the inode.
> > > + */
> > > +#define XFS_IRECOVER_UNLINKED	(1 << 11)
> > >  
> > >  /*
> > >   * Per-lifetime flags need to be reset when re-using a reclaimable inode during
> > > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> > > index 9697e94..b121f02 100644
> > > --- a/fs/xfs/xfs_log_recover.c
> > > +++ b/fs/xfs/xfs_log_recover.c
> > > @@ -4969,6 +4969,7 @@ xlog_recover_process_one_iunlink(
> > >  	if (error)
> > >  		goto fail_iput;
> > >  
> > > +	xfs_iflags_clear(ip, XFS_IRECOVER_UNLINKED);
> > >  	ASSERT(VFS_I(ip)->i_nlink == 0);
> > >  	ASSERT(VFS_I(ip)->i_mode != 0);
> > >  
> > > 

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 25/63] xfs: return work remaining at the end of a bunmapi operation
  2016-10-04  0:30     ` Darrick J. Wong
@ 2016-10-04 12:44       ` Brian Foster
  0 siblings, 0 replies; 187+ messages in thread
From: Brian Foster @ 2016-10-04 12:44 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs

On Mon, Oct 03, 2016 at 05:30:56PM -0700, Darrick J. Wong wrote:
> On Mon, Oct 03, 2016 at 03:04:31PM -0400, Brian Foster wrote:
> > On Thu, Sep 29, 2016 at 08:08:23PM -0700, Darrick J. Wong wrote:
> > > Return the range of file blocks that bunmapi didn't free.  This hint
> > > is used by CoW and reflink to figure out what part of an extent
> > > actually got freed so that it can set up the appropriate atomic
> > > remapping of just the freed range.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_bmap.c |   36 ++++++++++++++++++++++++++++++------
> > >  fs/xfs/libxfs/xfs_bmap.h |    4 ++++
> > >  2 files changed, 34 insertions(+), 6 deletions(-)
> > > 
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > > index a5e429e..1e4f1a1 100644
> > > --- a/fs/xfs/libxfs/xfs_bmap.c
> > > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > ...
> > > @@ -5478,6 +5481,27 @@ xfs_bunmapi(
> > >  	return error;
> > >  }
> > >  
> > > +/* Unmap a range of a file. */
> > > +int
> > > +xfs_bunmapi(
> > > +	xfs_trans_t		*tp,
> > > +	struct xfs_inode	*ip,
> > > +	xfs_fileoff_t		bno,
> > > +	xfs_filblks_t		len,
> > > +	int			flags,
> > > +	xfs_extnum_t		nexts,
> > > +	xfs_fsblock_t		*firstblock,
> > > +	struct xfs_defer_ops	*dfops,
> > > +	int			*done)
> > > +{
> > > +	int			error;
> > > +
> > > +	error = __xfs_bunmapi(tp, ip, bno, &len, flags, nexts, firstblock,
> > > +			dfops);
> > > +	*done = (len == 0);
> > > +	return error;
> > > +}
> > > +
> > 
> > I wonder if we really need such a wrapper for this. There aren't too
> > many xfs_bunmapi() callers and at least a couple of the few that I
> > checked don't even use 'done.' That can always get fixed up later
> > though.
> 
> Perhaps not, but hch said he'll probably end up refactoring the bunmapi code
> soon anyway.
> 

Ok, sounds good.

Brian

> --D
> 
> > 
> > Brian
> > 
> > >  /*
> > >   * Determine whether an extent shift can be accomplished by a merge with the
> > >   * extent that precedes the target hole of the shift.
> > > diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
> > > index 53970b1..48ba3ed 100644
> > > --- a/fs/xfs/libxfs/xfs_bmap.h
> > > +++ b/fs/xfs/libxfs/xfs_bmap.h
> > > @@ -197,6 +197,10 @@ int	xfs_bmapi_write(struct xfs_trans *tp, struct xfs_inode *ip,
> > >  		xfs_fsblock_t *firstblock, xfs_extlen_t total,
> > >  		struct xfs_bmbt_irec *mval, int *nmap,
> > >  		struct xfs_defer_ops *dfops);
> > > +int	__xfs_bunmapi(struct xfs_trans *tp, struct xfs_inode *ip,
> > > +		xfs_fileoff_t bno, xfs_filblks_t *rlen, int flags,
> > > +		xfs_extnum_t nexts, xfs_fsblock_t *firstblock,
> > > +		struct xfs_defer_ops *dfops);
> > >  int	xfs_bunmapi(struct xfs_trans *tp, struct xfs_inode *ip,
> > >  		xfs_fileoff_t bno, xfs_filblks_t len, int flags,
> > >  		xfs_extnum_t nexts, xfs_fsblock_t *firstblock,
> > > 

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 31/63] xfs: create delalloc extents in CoW fork
  2016-09-30  3:09 ` [PATCH 31/63] xfs: create delalloc extents in " Darrick J. Wong
@ 2016-10-04 16:38   ` Brian Foster
  2016-10-04 17:39     ` Darrick J. Wong
  0 siblings, 1 reply; 187+ messages in thread
From: Brian Foster @ 2016-10-04 16:38 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, Christoph Hellwig

On Thu, Sep 29, 2016 at 08:09:01PM -0700, Darrick J. Wong wrote:
> Wire up iomap_begin to detect shared extents and create delayed allocation
> extents in the CoW fork:
> 
>  1) Check if we already have an extent in the COW fork for the area.
> >     If so, there is nothing to do; we can move along.
> >  2) Look up the block number for the current extent; if there is none,
> >     the extent is not shared, so move along.
>  3) Unshare the current extent as far as we are going to write into it.
>     For this we avoid an additional COW fork lookup and use the
>     information we set aside in step 1) above.
>  4) Goto 1) unless we've covered the whole range.
> 
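In user-space terms, the loop over those four steps might look like this (a simplified model for illustration only; the extent structure and helper names here are invented, not XFS code):

```c
#include <assert.h>
#include <stddef.h>

/* Invented, simplified stand-in for a CoW-fork extent record. */
struct extent {
	long start;	/* first file block covered */
	long count;	/* number of blocks */
};

/* Step 1: find the CoW-fork extent covering @bno, if any. */
static const struct extent *
cow_lookup(const struct extent *cow, size_t ncow, long bno)
{
	size_t i;

	for (i = 0; i < ncow; i++)
		if (bno >= cow[i].start && bno < cow[i].start + cow[i].count)
			return &cow[i];
	return NULL;
}

/*
 * Walk [bno, end): skip ranges already covered by a CoW reservation and
 * "reserve" everything else one block at a time.  Returns the number of
 * newly reserved blocks.
 */
static long
reserve_cow_range(const struct extent *cow, size_t ncow, long bno, long end)
{
	long reserved = 0;

	while (bno < end) {
		const struct extent *got = cow_lookup(cow, ncow, bno);

		if (got) {
			/* Step 1: already reserved, move along. */
			bno = got->start + got->count;
			continue;
		}
		/* Steps 2-3 would check sharing and create delalloc here. */
		reserved++;
		bno++;	/* Step 4: keep going until the range is covered. */
	}
	return reserved;
}
```

With one existing reservation covering blocks 2-4, reserving the range [0, 10) only has to create delalloc for the seven uncovered blocks.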
> Last but not least, this updates the xfs_reflink_reserve_cow_range calling
> convention to pass a byte offset and length, as that is what both callers
> expect anyway.  This patch has been refactored considerably as part of the
> iomap transition.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/xfs_iomap.c   |   12 ++-
>  fs/xfs/xfs_reflink.c |  202 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_reflink.h |    9 ++
>  3 files changed, 221 insertions(+), 2 deletions(-)
> 
> 
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 59c7beb..e8312b0 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -39,6 +39,7 @@
>  #include "xfs_quota.h"
>  #include "xfs_dquot_item.h"
>  #include "xfs_dquot.h"
> +#include "xfs_reflink.h"
>  
>  
>  #define XFS_WRITEIO_ALIGN(mp,off)	(((off) >> mp->m_writeio_log) \
> @@ -961,8 +962,15 @@ xfs_file_iomap_begin(
>  	if (XFS_FORCED_SHUTDOWN(mp))
>  		return -EIO;
>  
> -	if ((flags & IOMAP_WRITE) &&
> -	    !IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
> +	if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && xfs_is_reflink_inode(ip)) {
> +		error = xfs_reflink_reserve_cow_range(ip, offset, length);
> +		if (error < 0)
> +			return error;
> +	}
> +
> +	if ((flags & IOMAP_WRITE) && !IS_DAX(inode) &&
> +		   !xfs_get_extsz_hint(ip)) {
> +		/* Reserve delalloc blocks for regular writeback. */
>  		return xfs_file_iomap_begin_delay(inode, offset, length, flags,
>  				iomap);
>  	}

What about the short write case? E.g., do we have to clear out delalloc
blocks from the cow fork in iomap_end() if we don't end up using them?

> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index 7adbb83..05a7fe6 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -51,6 +51,7 @@
>  #include "xfs_btree.h"
>  #include "xfs_bmap_btree.h"
>  #include "xfs_reflink.h"
> +#include "xfs_iomap.h"
>  
>  /*
>   * Copy on Write of Shared Blocks
> @@ -112,3 +113,204 @@
>   * ioend structure.  Better yet, the more ground we can cover with one
>   * ioend, the better.
>   */
> +
> +/*
> + * Given an AG extent, find the lowest-numbered run of shared blocks within
> + * that range and return the range in fbno/flen.
> + */
> +int
> +xfs_reflink_find_shared(
> +	struct xfs_mount	*mp,
> +	xfs_agnumber_t		agno,
> +	xfs_agblock_t		agbno,
> +	xfs_extlen_t		aglen,
> +	xfs_agblock_t		*fbno,
> +	xfs_extlen_t		*flen,
> +	bool			find_maximal)
> +{
> +	struct xfs_buf		*agbp;
> +	struct xfs_btree_cur	*cur;
> +	int			error;
> +
> +	error = xfs_alloc_read_agf(mp, NULL, agno, 0, &agbp);
> +	if (error)
> +		return error;
> +
> +	cur = xfs_refcountbt_init_cursor(mp, NULL, agbp, agno, NULL);
> +
> +	error = xfs_refcount_find_shared(cur, agbno, aglen, fbno, flen,
> +			find_maximal);
> +
> +	xfs_btree_del_cursor(cur, error ? XFS_BTREE_ERROR : XFS_BTREE_NOERROR);
> +
> +	xfs_buf_relse(agbp);
> +	return error;
> +}
> +
> +/*
> + * Trim the mapping to the next block where there's a change in the
> + * shared/unshared status.  More specifically, this means that we
> + * find the lowest-numbered extent of shared blocks that coincides with
> + * the given block mapping.  If the shared extent overlaps the start of
> + * the mapping, trim the mapping to the end of the shared extent.  If
> + * the shared region intersects the mapping, trim the mapping to the
> + * start of the shared extent.  If there are no shared regions that
> + * overlap, just return the original extent.
> + */
> +int
> +xfs_reflink_trim_around_shared(
> +	struct xfs_inode	*ip,
> +	struct xfs_bmbt_irec	*irec,
> +	bool			*shared,
> +	bool			*trimmed)
> +{
> +	xfs_agnumber_t		agno;
> +	xfs_agblock_t		agbno;
> +	xfs_extlen_t		aglen;
> +	xfs_agblock_t		fbno;
> +	xfs_extlen_t		flen;
> +	int			error = 0;
> +
> +	/* Holes, unwritten, and delalloc extents cannot be shared */
> +	if (!xfs_is_reflink_inode(ip) ||
> +	    ISUNWRITTEN(irec) ||
> +	    irec->br_startblock == HOLESTARTBLOCK ||
> +	    irec->br_startblock == DELAYSTARTBLOCK) {
> +		*shared = false;
> +		return 0;
> +	}
> +
> +	trace_xfs_reflink_trim_around_shared(ip, irec);
> +
> +	agno = XFS_FSB_TO_AGNO(ip->i_mount, irec->br_startblock);
> +	agbno = XFS_FSB_TO_AGBNO(ip->i_mount, irec->br_startblock);
> +	aglen = irec->br_blockcount;
> +
> +	error = xfs_reflink_find_shared(ip->i_mount, agno, agbno,
> +			aglen, &fbno, &flen, true);
> +	if (error)
> +		return error;
> +
> +	*shared = *trimmed = false;
> +	if (flen == 0) {

Preferable to use NULLAGBLOCK for this, imo.

> +		/* No shared blocks at all. */
> +		return 0;
> +	} else if (fbno == agbno) {
> +		/* The start of this extent is shared. */
> +		irec->br_blockcount = flen;
> +		*shared = true;
> +		*trimmed = true;

Why do we set trimmed based solely on fbno == agbno? Is that valid if
the bmapbt extent exactly matches the refcntbt extent and we thus don't
actually modify the extent (e.g., br_blockcount == flen)? It's hard to
tell because trimmed looks unused (to this point?), so I could just
misunderstand the meaning.

> +		return 0;
> +	} else {
> +		/* There's a shared extent midway through this extent. */
> +		irec->br_blockcount = fbno - agbno;

Don't we have to push the startblock forward in this case?

Oh, I see. We trim the unshared length to push the fileoffset fsb to the
start of the shared region for the next iteration.

Brian

> +		*trimmed = true;
> +		return 0;
> +	}
> +}
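The three cases in the comment on xfs_reflink_trim_around_shared() can be checked with a tiny stand-alone model (a user-space sketch; the refcount btree lookup is replaced by the caller handing in the first shared run, with flen == 0 standing for "no shared blocks").  It mirrors the patch as posted, including setting *trimmed whenever the start of the mapping is shared, which is the behavior questioned in the review:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Simplified model of xfs_reflink_trim_around_shared(): @agbno/@*len is
 * the mapping, @fbno/@flen the lowest shared run overlapping it (flen == 0
 * means none).  All names and types here are illustrative only.
 */
static void
trim_around_shared(long agbno, long *len, long fbno, long flen,
		   bool *shared, bool *trimmed)
{
	*shared = *trimmed = false;
	if (flen == 0) {
		/* No shared blocks at all; leave the mapping alone. */
		return;
	}
	if (fbno == agbno) {
		/* The start of the mapping is shared; trim to its end. */
		*len = flen;
		*shared = true;
		*trimmed = true;
		return;
	}
	/* A shared run starts midway; trim to the unshared prefix. */
	*len = fbno - agbno;
	*trimmed = true;
}
```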
> +
> +/* Create a CoW reservation for a range of blocks within a file. */
> +static int
> +__xfs_reflink_reserve_cow(
> +	struct xfs_inode	*ip,
> +	xfs_fileoff_t		*offset_fsb,
> +	xfs_fileoff_t		end_fsb)
> +{
> +	struct xfs_bmbt_irec	got, prev, imap;
> +	xfs_fileoff_t		orig_end_fsb;
> +	int			nimaps, eof = 0, error = 0;
> +	bool			shared = false, trimmed = false;
> +	xfs_extnum_t		idx;
> +
> +	/* Already reserved?  Skip the refcount btree access. */
> +	xfs_bmap_search_extents(ip, *offset_fsb, XFS_COW_FORK, &eof, &idx,
> +			&got, &prev);
> +	if (!eof && got.br_startoff <= *offset_fsb) {
> +		end_fsb = orig_end_fsb = got.br_startoff + got.br_blockcount;
> +		trace_xfs_reflink_cow_found(ip, &got);
> +		goto done;
> +	}
> +
> +	/* Read extent from the source file. */
> +	nimaps = 1;
> +	error = xfs_bmapi_read(ip, *offset_fsb, end_fsb - *offset_fsb,
> +			&imap, &nimaps, 0);
> +	if (error)
> +		goto out_unlock;
> +	ASSERT(nimaps == 1);
> +
> +	/* Trim the mapping to the nearest shared extent boundary. */
> +	error = xfs_reflink_trim_around_shared(ip, &imap, &shared, &trimmed);
> +	if (error)
> +		goto out_unlock;
> +
> +	end_fsb = orig_end_fsb = imap.br_startoff + imap.br_blockcount;
> +
> +	/* Not shared?  Just report the (potentially capped) extent. */
> +	if (!shared)
> +		goto done;
> +
> +	/*
> +	 * Fork all the shared blocks from our write offset until the end of
> +	 * the extent.
> +	 */
> +	error = xfs_qm_dqattach_locked(ip, 0);
> +	if (error)
> +		goto out_unlock;
> +
> +retry:
> +	error = xfs_bmapi_reserve_delalloc(ip, XFS_COW_FORK, *offset_fsb,
> +			end_fsb - *offset_fsb, &got,
> +			&prev, &idx, eof);
> +	switch (error) {
> +	case 0:
> +		break;
> +	case -ENOSPC:
> +	case -EDQUOT:
> +		/* retry without any preallocation */
> +		trace_xfs_reflink_cow_enospc(ip, &imap);
> +		if (end_fsb != orig_end_fsb) {
> +			end_fsb = orig_end_fsb;
> +			goto retry;
> +		}
> +		/*FALLTHRU*/
> +	default:
> +		goto out_unlock;
> +	}
> +
> +	trace_xfs_reflink_cow_alloc(ip, &got);
> +done:
> +	*offset_fsb = end_fsb;
> +out_unlock:
> +	return error;
> +}
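The -ENOSPC/-EDQUOT handling above is the classic "retry without preallocation" pattern: ask for the speculatively extended range first, and fall back to the exact range before failing.  A minimal stand-alone model (toy allocator and invented names, for illustration only):

```c
#include <assert.h>

#define SIM_ENOSPC (-28)	/* stand-in for -ENOSPC */

/* Toy allocator: succeeds only if the request fits in @free blocks. */
static int
try_reserve(long free, long want)
{
	return want <= free ? 0 : SIM_ENOSPC;
}

/*
 * Try the enlarged range [off, end) first; on ENOSPC retry with the
 * original range [off, orig_end) before giving up.
 */
static int
reserve_with_fallback(long free, long off, long end, long orig_end,
		      long *granted)
{
	int error;

retry:
	error = try_reserve(free, end - off);
	if (error == SIM_ENOSPC && end != orig_end) {
		/* Retry without any preallocation. */
		end = orig_end;
		goto retry;
	}
	if (!error)
		*granted = end - off;
	return error;
}
```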
> +
> +/* Create a CoW reservation for part of a file. */
> +int
> +xfs_reflink_reserve_cow_range(
> +	struct xfs_inode	*ip,
> +	xfs_off_t		offset,
> +	xfs_off_t		count)
> +{
> +	struct xfs_mount	*mp = ip->i_mount;
> +	xfs_fileoff_t		offset_fsb, end_fsb;
> +	int			error;
> +
> +	trace_xfs_reflink_reserve_cow_range(ip, offset, count);
> +
> +	offset_fsb = XFS_B_TO_FSBT(mp, offset);
> +	end_fsb = XFS_B_TO_FSB(mp, offset + count);
> +
> +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> +	while (offset_fsb < end_fsb) {
> +		error = __xfs_reflink_reserve_cow(ip, &offset_fsb, end_fsb);
> +		if (error) {
> +			trace_xfs_reflink_reserve_cow_range_error(ip, error,
> +				_RET_IP_);
> +			break;
> +		}
> +	}
> +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +
> +	return error;
> +}
> diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
> index 820b151..f824f87 100644
> --- a/fs/xfs/xfs_reflink.h
> +++ b/fs/xfs/xfs_reflink.h
> @@ -20,4 +20,13 @@
>  #ifndef __XFS_REFLINK_H
>  #define __XFS_REFLINK_H 1
>  
> +extern int xfs_reflink_find_shared(struct xfs_mount *mp, xfs_agnumber_t agno,
> +		xfs_agblock_t agbno, xfs_extlen_t aglen, xfs_agblock_t *fbno,
> +		xfs_extlen_t *flen, bool find_maximal);
> +extern int xfs_reflink_trim_around_shared(struct xfs_inode *ip,
> +		struct xfs_bmbt_irec *irec, bool *shared, bool *trimmed);
> +
> +extern int xfs_reflink_reserve_cow_range(struct xfs_inode *ip,
> +		xfs_off_t offset, xfs_off_t count);
> +
>  #endif /* __XFS_REFLINK_H */
> 


* Re: [PATCH 32/63] xfs: support allocating delayed extents in CoW fork
  2016-09-30  3:09 ` [PATCH 32/63] xfs: support allocating delayed " Darrick J. Wong
  2016-09-30  7:42   ` Christoph Hellwig
@ 2016-10-04 16:38   ` Brian Foster
  1 sibling, 0 replies; 187+ messages in thread
From: Brian Foster @ 2016-10-04 16:38 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs

On Thu, Sep 29, 2016 at 08:09:08PM -0700, Darrick J. Wong wrote:
> Modify xfs_bmap_add_extent_delay_real() so that we can convert delayed
> allocation extents in the CoW fork to real allocations, and wire this
> up all the way back to xfs_iomap_write_allocate().  In a subsequent
> patch, we'll modify the writepage handler to call this.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_bmap.c |   51 ++++++++++++++++++++++++++++++++--------------
>  fs/xfs/xfs_aops.c        |    6 ++++-
>  fs/xfs/xfs_iomap.c       |    7 +++++-
>  fs/xfs/xfs_iomap.h       |    2 +-
>  4 files changed, 46 insertions(+), 20 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 5749618..85a0c86 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
...
> @@ -4510,6 +4522,11 @@ xfs_bmapi_write(
>  	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
>  	if (whichfork == XFS_ATTR_FORK)
>  		ASSERT(!(flags & XFS_BMAPI_REMAP));
> +	if (whichfork == XFS_COW_FORK) {
> +		ASSERT(!(flags & XFS_BMAPI_REMAP));
> +		ASSERT(!(flags & XFS_BMAPI_PREALLOC));
> +		ASSERT(!(flags & XFS_BMAPI_CONVERT));
> +	}

Just some more 'if (..) { ASSERT() }' stuff here and below...

Brian

>  	if (flags & XFS_BMAPI_REMAP) {
>  		ASSERT(!(flags & XFS_BMAPI_PREALLOC));
>  		ASSERT(!(flags & XFS_BMAPI_CONVERT));
> @@ -4579,6 +4596,8 @@ xfs_bmapi_write(
>  		 */
>  		if (flags & XFS_BMAPI_REMAP)
>  			ASSERT(inhole);
> +		if (flags & XFS_BMAPI_COWFORK)
> +			ASSERT(!inhole);
>  
>  		/*
>  		 * First, deal with the hole before the allocated space
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 4a28fa9..007a520 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -362,9 +362,11 @@ xfs_map_blocks(
>  
>  	if (type == XFS_IO_DELALLOC &&
>  	    (!nimaps || isnullstartblock(imap->br_startblock))) {
> -		error = xfs_iomap_write_allocate(ip, offset, imap);
> +		error = xfs_iomap_write_allocate(ip, XFS_DATA_FORK, offset,
> +				imap);
>  		if (!error)
> -			trace_xfs_map_blocks_alloc(ip, offset, count, type, imap);
> +			trace_xfs_map_blocks_alloc(ip, offset, count, type,
> +					imap);
>  		return error;
>  	}
>  
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index e8312b0..ad6939d 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -667,6 +667,7 @@ xfs_file_iomap_begin_delay(
>  int
>  xfs_iomap_write_allocate(
>  	xfs_inode_t	*ip,
> +	int		whichfork,
>  	xfs_off_t	offset,
>  	xfs_bmbt_irec_t *imap)
>  {
> @@ -679,8 +680,12 @@ xfs_iomap_write_allocate(
>  	xfs_trans_t	*tp;
>  	int		nimaps;
>  	int		error = 0;
> +	int		flags = 0;
>  	int		nres;
>  
> +	if (whichfork == XFS_COW_FORK)
> +		flags |= XFS_BMAPI_COWFORK;
> +
>  	/*
>  	 * Make sure that the dquots are there.
>  	 */
> @@ -774,7 +779,7 @@ xfs_iomap_write_allocate(
>  			 * pointer that the caller gave to us.
>  			 */
>  			error = xfs_bmapi_write(tp, ip, map_start_fsb,
> -						count_fsb, 0, &first_block,
> +						count_fsb, flags, &first_block,
>  						nres, imap, &nimaps,
>  						&dfops);
>  			if (error)
> diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
> index 6498be4..a16b956 100644
> --- a/fs/xfs/xfs_iomap.h
> +++ b/fs/xfs/xfs_iomap.h
> @@ -25,7 +25,7 @@ struct xfs_bmbt_irec;
>  
>  int xfs_iomap_write_direct(struct xfs_inode *, xfs_off_t, size_t,
>  			struct xfs_bmbt_irec *, int);
> -int xfs_iomap_write_allocate(struct xfs_inode *, xfs_off_t,
> +int xfs_iomap_write_allocate(struct xfs_inode *, int, xfs_off_t,
>  			struct xfs_bmbt_irec *);
>  int xfs_iomap_write_unwritten(struct xfs_inode *, xfs_off_t, xfs_off_t);
>  
> 


* Re: [PATCH 33/63] xfs: allocate delayed extents in CoW fork
  2016-09-30  3:09 ` [PATCH 33/63] xfs: allocate " Darrick J. Wong
@ 2016-10-04 16:38   ` Brian Foster
  2016-10-04 18:26     ` Darrick J. Wong
  0 siblings, 1 reply; 187+ messages in thread
From: Brian Foster @ 2016-10-04 16:38 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, Christoph Hellwig

On Thu, Sep 29, 2016 at 08:09:14PM -0700, Darrick J. Wong wrote:
> Modify the writepage handler to find and convert pending delalloc
> extents to real allocations.  Furthermore, when we're doing non-cow
> writes to a part of a file that already has a CoW reservation (the
> cowextsz hint that we set up in a subsequent patch facilitates this),
> promote the write to copy-on-write so that the entire extent can get
> written out as a single extent on disk, thereby reducing post-CoW
> fragmentation.
> 
> Christoph moved the CoW support code in _map_blocks to a separate helper
> function, refactored other functions, and reduced the number of CoW fork
> lookups, so I merged those changes here to reduce churn.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/xfs_aops.c    |  106 ++++++++++++++++++++++++++++++++++++++++----------
>  fs/xfs/xfs_aops.h    |    4 +-
>  fs/xfs/xfs_reflink.c |   86 +++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_reflink.h |    4 ++
>  4 files changed, 178 insertions(+), 22 deletions(-)
> 
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 007a520..7b1e9de 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
...
> @@ -645,13 +653,16 @@ xfs_check_page_type(
>  	bh = head = page_buffers(page);
>  	do {
>  		if (buffer_unwritten(bh)) {
> -			if (type == XFS_IO_UNWRITTEN)
> +			if (type == XFS_IO_UNWRITTEN ||
> +			    type == XFS_IO_COW)
>  				return true;
>  		} else if (buffer_delay(bh)) {
> -			if (type == XFS_IO_DELALLOC)
> +			if (type == XFS_IO_DELALLOC ||
> +			    type == XFS_IO_COW)
>  				return true;
>  		} else if (buffer_dirty(bh) && buffer_mapped(bh)) {
> -			if (type == XFS_IO_OVERWRITE)
> +			if (type == XFS_IO_OVERWRITE ||
> +			    type == XFS_IO_COW)
>  				return true;
>  		}

What's the purpose of this hunk? As it is, we don't appear to have any
non-XFS_IO_DELALLOC callers. This probably warrants an update to the
top-of-function comment at the very least.

Brian

>  
> @@ -739,6 +750,56 @@ xfs_aops_discard_page(
>  	return;
>  }
>  
> +static int
> +xfs_map_cow(
> +	struct xfs_writepage_ctx *wpc,
> +	struct inode		*inode,
> +	loff_t			offset,
> +	unsigned int		*new_type)
> +{
> +	struct xfs_inode	*ip = XFS_I(inode);
> +	struct xfs_bmbt_irec	imap;
> +	bool			is_cow = false, need_alloc = false;
> +	int			error;
> +
> +	/*
> +	 * If we already have a valid COW mapping keep using it.
> +	 */
> +	if (wpc->io_type == XFS_IO_COW) {
> +		wpc->imap_valid = xfs_imap_valid(inode, &wpc->imap, offset);
> +		if (wpc->imap_valid) {
> +			*new_type = XFS_IO_COW;
> +			return 0;
> +		}
> +	}
> +
> +	/*
> +	 * Else we need to check if there is a COW mapping at this offset.
> +	 */
> +	xfs_ilock(ip, XFS_ILOCK_SHARED);
> +	is_cow = xfs_reflink_find_cow_mapping(ip, offset, &imap, &need_alloc);
> +	xfs_iunlock(ip, XFS_ILOCK_SHARED);
> +
> +	if (!is_cow)
> +		return 0;
> +
> +	/*
> +	 * And if the COW mapping has a delayed extent here we need to
> +	 * allocate real space for it now.
> +	 */
> +	if (need_alloc) {
> +		error = xfs_iomap_write_allocate(ip, XFS_COW_FORK, offset,
> +				&imap);
> +		if (error)
> +			return error;
> +	}
> +
> +	wpc->io_type = *new_type = XFS_IO_COW;
> +	wpc->imap_valid = true;
> +	wpc->imap = imap;
> +	return 0;
> +}
> +
>  /*
>   * We implement an immediate ioend submission policy here to avoid needing to
>   * chain multiple ioends and hence nest mempool allocations which can violate
> @@ -771,6 +832,7 @@ xfs_writepage_map(
>  	int			error = 0;
>  	int			count = 0;
>  	int			uptodate = 1;
> +	unsigned int		new_type;
>  
>  	bh = head = page_buffers(page);
>  	offset = page_offset(page);
> @@ -791,22 +853,13 @@ xfs_writepage_map(
>  			continue;
>  		}
>  
> -		if (buffer_unwritten(bh)) {
> -			if (wpc->io_type != XFS_IO_UNWRITTEN) {
> -				wpc->io_type = XFS_IO_UNWRITTEN;
> -				wpc->imap_valid = false;
> -			}
> -		} else if (buffer_delay(bh)) {
> -			if (wpc->io_type != XFS_IO_DELALLOC) {
> -				wpc->io_type = XFS_IO_DELALLOC;
> -				wpc->imap_valid = false;
> -			}
> -		} else if (buffer_uptodate(bh)) {
> -			if (wpc->io_type != XFS_IO_OVERWRITE) {
> -				wpc->io_type = XFS_IO_OVERWRITE;
> -				wpc->imap_valid = false;
> -			}
> -		} else {
> +		if (buffer_unwritten(bh))
> +			new_type = XFS_IO_UNWRITTEN;
> +		else if (buffer_delay(bh))
> +			new_type = XFS_IO_DELALLOC;
> +		else if (buffer_uptodate(bh))
> +			new_type = XFS_IO_OVERWRITE;
> +		else {
>  			if (PageUptodate(page))
>  				ASSERT(buffer_mapped(bh));
>  			/*
> @@ -819,6 +872,17 @@ xfs_writepage_map(
>  			continue;
>  		}
>  
> +		if (xfs_is_reflink_inode(XFS_I(inode))) {
> +			error = xfs_map_cow(wpc, inode, offset, &new_type);
> +			if (error)
> +				goto out;
> +		}
> +
> +		if (wpc->io_type != new_type) {
> +			wpc->io_type = new_type;
> +			wpc->imap_valid = false;
> +		}
> +
>  		if (wpc->imap_valid)
>  			wpc->imap_valid = xfs_imap_valid(inode, &wpc->imap,
>  							 offset);
> diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
> index 1950e3b..b3c6634 100644
> --- a/fs/xfs/xfs_aops.h
> +++ b/fs/xfs/xfs_aops.h
> @@ -28,13 +28,15 @@ enum {
>  	XFS_IO_DELALLOC,	/* covers delalloc region */
>  	XFS_IO_UNWRITTEN,	/* covers allocated but uninitialized data */
>  	XFS_IO_OVERWRITE,	/* covers already allocated extent */
> +	XFS_IO_COW,		/* covers copy-on-write extent */
>  };
>  
>  #define XFS_IO_TYPES \
>  	{ XFS_IO_INVALID,		"invalid" }, \
>  	{ XFS_IO_DELALLOC,		"delalloc" }, \
>  	{ XFS_IO_UNWRITTEN,		"unwritten" }, \
> -	{ XFS_IO_OVERWRITE,		"overwrite" }
> +	{ XFS_IO_OVERWRITE,		"overwrite" }, \
> +	{ XFS_IO_COW,			"CoW" }
>  
>  /*
>   * Structure for buffered I/O completions.
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index 05a7fe6..e8c7c85 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -314,3 +314,89 @@ xfs_reflink_reserve_cow_range(
>  
>  	return error;
>  }
> +
> +/*
> + * Find the CoW reservation (and whether or not it needs block allocation)
> + * for a given byte offset of a file.
> + */
> +bool
> +xfs_reflink_find_cow_mapping(
> +	struct xfs_inode		*ip,
> +	xfs_off_t			offset,
> +	struct xfs_bmbt_irec		*imap,
> +	bool				*need_alloc)
> +{
> +	struct xfs_bmbt_irec		irec;
> +	struct xfs_ifork		*ifp;
> +	struct xfs_bmbt_rec_host	*gotp;
> +	xfs_fileoff_t			bno;
> +	xfs_extnum_t			idx;
> +
> +	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL | XFS_ILOCK_SHARED));
> +
> +	if (!xfs_is_reflink_inode(ip))
> +		return false;
> +
> +	/* Find the extent in the CoW fork. */
> +	ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
> +	bno = XFS_B_TO_FSBT(ip->i_mount, offset);
> +	gotp = xfs_iext_bno_to_ext(ifp, bno, &idx);
> +	if (!gotp)
> +		return false;
> +
> +	xfs_bmbt_get_all(gotp, &irec);
> +	if (bno >= irec.br_startoff + irec.br_blockcount ||
> +	    bno < irec.br_startoff)
> +		return false;
> +
> +	trace_xfs_reflink_find_cow_mapping(ip, offset, 1, XFS_IO_OVERWRITE,
> +			&irec);
> +
> +	/* If it's still delalloc, we must allocate later. */
> +	*imap = irec;
> +	*need_alloc = !!(isnullstartblock(irec.br_startblock));
> +
> +	return true;
> +}
> +
> +/*
> + * Trim an extent to end at the next CoW reservation past offset_fsb.
> + */
> +int
> +xfs_reflink_trim_irec_to_next_cow(
> +	struct xfs_inode		*ip,
> +	xfs_fileoff_t			offset_fsb,
> +	struct xfs_bmbt_irec		*imap)
> +{
> +	struct xfs_bmbt_irec		irec;
> +	struct xfs_ifork		*ifp;
> +	struct xfs_bmbt_rec_host	*gotp;
> +	xfs_extnum_t			idx;
> +
> +	if (!xfs_is_reflink_inode(ip))
> +		return 0;
> +
> +	/* Find the extent in the CoW fork. */
> +	ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
> +	gotp = xfs_iext_bno_to_ext(ifp, offset_fsb, &idx);
> +	if (!gotp)
> +		return 0;
> +	xfs_bmbt_get_all(gotp, &irec);
> +
> +	/* This is the extent before; try sliding up one. */
> +	if (irec.br_startoff < offset_fsb) {
> +		idx++;
> +		if (idx >= ifp->if_bytes / sizeof(xfs_bmbt_rec_t))
> +			return 0;
> +		gotp = xfs_iext_get_ext(ifp, idx);
> +		xfs_bmbt_get_all(gotp, &irec);
> +	}
> +
> +	if (irec.br_startoff >= imap->br_startoff + imap->br_blockcount)
> +		return 0;
> +
> +	imap->br_blockcount = irec.br_startoff - imap->br_startoff;
> +	trace_xfs_reflink_trim_irec(ip, imap);
> +
> +	return 0;
> +}
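The slide-up-and-trim logic above reduces to a short scan over a sorted extent list.  A user-space sketch (the CoW fork is just an array here, and the iext lookup helpers are replaced by a linear walk; names are invented for illustration):

```c
#include <assert.h>
#include <stddef.h>

struct ext {
	long start;	/* first file block of the reservation */
	long count;	/* blocks in the reservation */
};

/*
 * Simplified xfs_reflink_trim_irec_to_next_cow(): shorten the mapping
 * [start, start + *count) so it ends where the next CoW reservation at
 * or after @start begins.  Assumes the mapping does not itself start
 * inside a reservation.
 */
static void
trim_to_next_cow(const struct ext *cow, size_t ncow, long start, long *count)
{
	size_t i;

	for (i = 0; i < ncow; i++) {
		long cs = cow[i].start;

		if (cs + cow[i].count <= start)
			continue;	/* before the mapping; slide up */
		if (cs >= start + *count)
			return;		/* next reservation is past us */
		if (cs > start)
			*count = cs - start;
		return;
	}
}
```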
> diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
> index f824f87..11408c0 100644
> --- a/fs/xfs/xfs_reflink.h
> +++ b/fs/xfs/xfs_reflink.h
> @@ -28,5 +28,9 @@ extern int xfs_reflink_trim_around_shared(struct xfs_inode *ip,
>  
>  extern int xfs_reflink_reserve_cow_range(struct xfs_inode *ip,
>  		xfs_off_t offset, xfs_off_t count);
> +extern bool xfs_reflink_find_cow_mapping(struct xfs_inode *ip, xfs_off_t offset,
> +		struct xfs_bmbt_irec *imap, bool *need_alloc);
> +extern int xfs_reflink_trim_irec_to_next_cow(struct xfs_inode *ip,
> +		xfs_fileoff_t offset_fsb, struct xfs_bmbt_irec *imap);
>  
>  #endif /* __XFS_REFLINK_H */
> 


* Re: [PATCH 21/63] xfs: map an inode's offset to an exact physical block
  2016-10-04 12:43       ` Brian Foster
@ 2016-10-04 17:28         ` Darrick J. Wong
  0 siblings, 0 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-10-04 17:28 UTC (permalink / raw)
  To: Brian Foster; +Cc: david, linux-xfs

On Tue, Oct 04, 2016 at 08:43:34AM -0400, Brian Foster wrote:
> On Mon, Oct 03, 2016 at 05:11:19PM -0700, Darrick J. Wong wrote:
> > On Mon, Oct 03, 2016 at 03:03:49PM -0400, Brian Foster wrote:
> > > On Thu, Sep 29, 2016 at 08:07:56PM -0700, Darrick J. Wong wrote:
> > > > Teach the bmap routine to know how to map a range of file blocks to a
> > > > specific range of physical blocks, instead of simply allocating fresh
> > > > blocks.  This enables reflink to map a file to blocks that are already
> > > > in use.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > ---
> > > >  fs/xfs/libxfs/xfs_bmap.c |   63 ++++++++++++++++++++++++++++++++++++++++++++++
> > > >  fs/xfs/libxfs/xfs_bmap.h |   10 +++++++
> > > >  fs/xfs/xfs_trace.h       |   54 +++++++++++++++++++++++++++++++++++++++
> > > >  3 files changed, 126 insertions(+), 1 deletion(-)
> > > > 
> > > > 
> > > > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > > > index 907d7b8d..9f145ed 100644
> > > > --- a/fs/xfs/libxfs/xfs_bmap.c
> > > > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > > > @@ -3877,6 +3877,55 @@ xfs_bmap_btalloc(
> > > >  }
> > > >  
> > > >  /*
> > > > + * For a remap operation, just "allocate" an extent at the address that the
> > > > + * caller passed in, and ensure that the AGFL is the right size.  The caller
> > > > + * will then map the "allocated" extent into the file somewhere.
> > > > + */
> > > > +STATIC int
> > > > +xfs_bmap_remap_alloc(
> > > > +	struct xfs_bmalloca	*ap)
> > > > +{
> > > > +	struct xfs_trans	*tp = ap->tp;
> > > > +	struct xfs_mount	*mp = tp->t_mountp;
> > > > +	xfs_agblock_t		bno;
> > > > +	struct xfs_alloc_arg	args;
> > > > +	int			error;
> > > > +
> > > > +	/*
> > > > +	 * Validate that the block number is legal - this enables us to detect
> > > > +	 * and handle a silent filesystem corruption rather than crashing.
> > > > +	 */
> > > > +	memset(&args, 0, sizeof(struct xfs_alloc_arg));
> > > > +	args.tp = ap->tp;
> > > > +	args.mp = ap->tp->t_mountp;
> > > > +	bno = *ap->firstblock;
> > > > +	args.agno = XFS_FSB_TO_AGNO(mp, bno);
> > > > +	ASSERT(args.agno < mp->m_sb.sb_agcount);
> > > > +	args.agbno = XFS_FSB_TO_AGBNO(mp, bno);
> > > > +	ASSERT(args.agbno < mp->m_sb.sb_agblocks);
> > > > +
> > > > +	/* "Allocate" the extent from the range we passed in. */
> > > > +	trace_xfs_bmap_remap_alloc(ap->ip, *ap->firstblock, ap->length);
> > > > +	ap->blkno = bno;
> > > > +	ap->ip->i_d.di_nblocks += ap->length;
> > > > +	xfs_trans_log_inode(ap->tp, ap->ip, XFS_ILOG_CORE);
> > > > +
> > > > +	/* Fix the freelist, like a real allocator does. */
> > > > +	args.datatype = ap->datatype;
> > > > +	args.pag = xfs_perag_get(args.mp, args.agno);
> > > > +	ASSERT(args.pag);
> > > > +
> > > > +	error = xfs_alloc_fix_freelist(&args, XFS_ALLOC_FLAG_FREEING);
> > > 
> > > Why the FREEING flag? 
> > 
> > /*
> >  * The freelist fixing code will decline the allocation if
> >  * the size and shape of the free space doesn't allow for
> >  * allocating the extent and updating all the metadata that
> >  * happens during an allocation.  We're remapping, not
> >  * allocating, so skip that check by pretending to be freeing.
> >  */
> > 
> 
> Thanks. This also can bypass fixing up the AG freelist. I suppose that
> won't matter since we aren't going to update either allocbt, but are we
> safe from the rmapbt perspective as well?

In general we'll be fine since the AG reservation code reserves enough
space to feed the rmapbt.  If reflink (which invokes this remapping)
detects that we're low on AG reservation, it'll return ENOSPC and let
userspace do a regular copy (presumably into a less full AG) to avoid
adding pressure on that AG.  COW (the second remap user) should be fine
since the extent being (re)mapped in is a regular non-shared extent with
an rmapping (owned by XFS_REFC_OWN_COW) already in the relevant AG.
The same idea applies to the extent swapper (the third remap user),
though the owner is the donor file.

--D

> 
> Brian
> 
> > > > +	if (error)
> > > > +		goto error0;
> > > > +error0:
> > > > +	xfs_perag_put(args.pag);
> > > > +	if (error)
> > > > +		trace_xfs_bmap_remap_alloc_error(ap->ip, error, _RET_IP_);
> > > > +	return error;
> > > > +}
> > > > +
> > > > +/*
> > > >   * xfs_bmap_alloc is called by xfs_bmapi to allocate an extent for a file.
> > > >   * It figures out where to ask the underlying allocator to put the new extent.
> > > >   */
> > > > @@ -3884,6 +3933,8 @@ STATIC int
> > > >  xfs_bmap_alloc(
> > > >  	struct xfs_bmalloca	*ap)	/* bmap alloc argument struct */
> > > >  {
> > > > +	if (ap->flags & XFS_BMAPI_REMAP)
> > > > +		return xfs_bmap_remap_alloc(ap);
> > > >  	if (XFS_IS_REALTIME_INODE(ap->ip) &&
> > > >  	    xfs_alloc_is_userdata(ap->datatype))
> > > >  		return xfs_bmap_rtalloc(ap);
> > > > @@ -4442,6 +4493,12 @@ xfs_bmapi_write(
> > > >  	ASSERT(len > 0);
> > > >  	ASSERT(XFS_IFORK_FORMAT(ip, whichfork) != XFS_DINODE_FMT_LOCAL);
> > > >  	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
> > > > +	if (whichfork == XFS_ATTR_FORK)
> > > > +		ASSERT(!(flags & XFS_BMAPI_REMAP));
> > > 
> > > I think it's better to avoid conditionals if the only affected code
> > > consists of ASSERT() statements (which can be compiled out). E.g., 
> > > 
> > > 	ASSERT(!((flags & XFS_BMAPI_REMAP) && whichfork == XFS_ATTR_FORK));
> > > 
> > > ... and so on, but not a big deal.
> > 
> > <nod> I might as well fix that up too...
> > 
> > --D
> > 
> > > Brian
> > > 
> > > > +	if (flags & XFS_BMAPI_REMAP) {
> > > > +		ASSERT(!(flags & XFS_BMAPI_PREALLOC));
> > > > +		ASSERT(!(flags & XFS_BMAPI_CONVERT));
> > > > +	}
> > > >  
> > > >  	/* zeroing is for currently only for data extents, not metadata */
> > > >  	ASSERT((flags & (XFS_BMAPI_METADATA | XFS_BMAPI_ZERO)) !=
> > > > @@ -4503,6 +4560,12 @@ xfs_bmapi_write(
> > > >  		wasdelay = !inhole && isnullstartblock(bma.got.br_startblock);
> > > >  
> > > >  		/*
> > > > +		 * Make sure we only reflink into a hole.
> > > > +		 */
> > > > +		if (flags & XFS_BMAPI_REMAP)
> > > > +			ASSERT(inhole);
> > > > +
> > > > +		/*
> > > >  		 * First, deal with the hole before the allocated space
> > > >  		 * that we found, if any.
> > > >  		 */
> > > > diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
> > > > index fcdb094..877b6f9 100644
> > > > --- a/fs/xfs/libxfs/xfs_bmap.h
> > > > +++ b/fs/xfs/libxfs/xfs_bmap.h
> > > > @@ -97,6 +97,13 @@ struct xfs_extent_free_item
> > > >   */
> > > >  #define XFS_BMAPI_ZERO		0x080
> > > >  
> > > > +/*
> > > > + * Map the inode offset to the block given in ap->firstblock.  Primarily
> > > > + * used for reflink.  The range must be in a hole, and this flag cannot be
> > > > + * turned on with PREALLOC or CONVERT, and cannot be used on the attr fork.
> > > > + */
> > > > +#define XFS_BMAPI_REMAP		0x100
> > > > +
> > > >  #define XFS_BMAPI_FLAGS \
> > > >  	{ XFS_BMAPI_ENTIRE,	"ENTIRE" }, \
> > > >  	{ XFS_BMAPI_METADATA,	"METADATA" }, \
> > > > @@ -105,7 +112,8 @@ struct xfs_extent_free_item
> > > >  	{ XFS_BMAPI_IGSTATE,	"IGSTATE" }, \
> > > >  	{ XFS_BMAPI_CONTIG,	"CONTIG" }, \
> > > >  	{ XFS_BMAPI_CONVERT,	"CONVERT" }, \
> > > > -	{ XFS_BMAPI_ZERO,	"ZERO" }
> > > > +	{ XFS_BMAPI_ZERO,	"ZERO" }, \
> > > > +	{ XFS_BMAPI_REMAP,	"REMAP" }
> > > >  
> > > >  
> > > >  static inline int xfs_bmapi_aflag(int w)
> > > > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > > > index 195a168..8485984 100644
> > > > --- a/fs/xfs/xfs_trace.h
> > > > +++ b/fs/xfs/xfs_trace.h
> > > > @@ -2965,6 +2965,60 @@ TRACE_EVENT(xfs_refcount_finish_one_leftover,
> > > >  		  __entry->adjusted)
> > > >  );
> > > >  
> > > > +/* simple inode-based error/%ip tracepoint class */
> > > > +DECLARE_EVENT_CLASS(xfs_inode_error_class,
> > > > +	TP_PROTO(struct xfs_inode *ip, int error, unsigned long caller_ip),
> > > > +	TP_ARGS(ip, error, caller_ip),
> > > > +	TP_STRUCT__entry(
> > > > +		__field(dev_t, dev)
> > > > +		__field(xfs_ino_t, ino)
> > > > +		__field(int, error)
> > > > +		__field(unsigned long, caller_ip)
> > > > +	),
> > > > +	TP_fast_assign(
> > > > +		__entry->dev = VFS_I(ip)->i_sb->s_dev;
> > > > +		__entry->ino = ip->i_ino;
> > > > +		__entry->error = error;
> > > > +		__entry->caller_ip = caller_ip;
> > > > +	),
> > > > +	TP_printk("dev %d:%d ino %llx error %d caller %ps",
> > > > +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > > > +		  __entry->ino,
> > > > +		  __entry->error,
> > > > +		  (char *)__entry->caller_ip)
> > > > +);
> > > > +
> > > > +#define DEFINE_INODE_ERROR_EVENT(name) \
> > > > +DEFINE_EVENT(xfs_inode_error_class, name, \
> > > > +	TP_PROTO(struct xfs_inode *ip, int error, \
> > > > +		 unsigned long caller_ip), \
> > > > +	TP_ARGS(ip, error, caller_ip))
> > > > +
> > > > +/* reflink allocator */
> > > > +TRACE_EVENT(xfs_bmap_remap_alloc,
> > > > +	TP_PROTO(struct xfs_inode *ip, xfs_fsblock_t fsbno,
> > > > +		 xfs_extlen_t len),
> > > > +	TP_ARGS(ip, fsbno, len),
> > > > +	TP_STRUCT__entry(
> > > > +		__field(dev_t, dev)
> > > > +		__field(xfs_ino_t, ino)
> > > > +		__field(xfs_fsblock_t, fsbno)
> > > > +		__field(xfs_extlen_t, len)
> > > > +	),
> > > > +	TP_fast_assign(
> > > > +		__entry->dev = VFS_I(ip)->i_sb->s_dev;
> > > > +		__entry->ino = ip->i_ino;
> > > > +		__entry->fsbno = fsbno;
> > > > +		__entry->len = len;
> > > > +	),
> > > > +	TP_printk("dev %d:%d ino 0x%llx fsbno 0x%llx len %x",
> > > > +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > > > +		  __entry->ino,
> > > > +		  __entry->fsbno,
> > > > +		  __entry->len)
> > > > +);
> > > > +DEFINE_INODE_ERROR_EVENT(xfs_bmap_remap_alloc_error);
> > > > +
> > > >  #endif /* _TRACE_XFS_H */
> > > >  
> > > >  #undef TRACE_INCLUDE_PATH
> > > > 
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > the body of a message to majordomo@vger.kernel.org
> > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 31/63] xfs: create delalloc extents in CoW fork
  2016-10-04 16:38   ` Brian Foster
@ 2016-10-04 17:39     ` Darrick J. Wong
  2016-10-04 18:38       ` Brian Foster
  0 siblings, 1 reply; 187+ messages in thread
From: Darrick J. Wong @ 2016-10-04 17:39 UTC (permalink / raw)
  To: Brian Foster; +Cc: david, linux-xfs, Christoph Hellwig

On Tue, Oct 04, 2016 at 12:38:23PM -0400, Brian Foster wrote:
> On Thu, Sep 29, 2016 at 08:09:01PM -0700, Darrick J. Wong wrote:
> > Wire up iomap_begin to detect shared extents and create delayed allocation
> > extents in the CoW fork:
> > 
> >  1) Check if we already have an extent in the COW fork for the area.
> >     If so nothing to do, we can move along.
> >  2) Look up the block number for the current extent, and if there is
> >     none it's not shared, so move along.
> >  3) Unshare the current extent as far as we are going to write into it.
> >     For this we avoid an additional COW fork lookup and use the
> >     information we set aside in step 1) above.
> >  4) Goto 1) unless we've covered the whole range.
> > 
> > Last but not least, this updates the xfs_reflink_reserve_cow_range calling
> > convention to pass a byte offset and length, as that is what both callers
> > expect anyway.  This patch has been refactored considerably as part of the
> > iomap transition.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > ---
> >  fs/xfs/xfs_iomap.c   |   12 ++-
> >  fs/xfs/xfs_reflink.c |  202 ++++++++++++++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/xfs_reflink.h |    9 ++
> >  3 files changed, 221 insertions(+), 2 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > index 59c7beb..e8312b0 100644
> > --- a/fs/xfs/xfs_iomap.c
> > +++ b/fs/xfs/xfs_iomap.c
> > @@ -39,6 +39,7 @@
> >  #include "xfs_quota.h"
> >  #include "xfs_dquot_item.h"
> >  #include "xfs_dquot.h"
> > +#include "xfs_reflink.h"
> >  
> >  
> >  #define XFS_WRITEIO_ALIGN(mp,off)	(((off) >> mp->m_writeio_log) \
> > @@ -961,8 +962,15 @@ xfs_file_iomap_begin(
> >  	if (XFS_FORCED_SHUTDOWN(mp))
> >  		return -EIO;
> >  
> > -	if ((flags & IOMAP_WRITE) &&
> > -	    !IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
> > +	if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && xfs_is_reflink_inode(ip)) {
> > +		error = xfs_reflink_reserve_cow_range(ip, offset, length);
> > +		if (error < 0)
> > +			return error;
> > +	}
> > +
> > +	if ((flags & IOMAP_WRITE) && !IS_DAX(inode) &&
> > +		   !xfs_get_extsz_hint(ip)) {
> > +		/* Reserve delalloc blocks for regular writeback. */
> >  		return xfs_file_iomap_begin_delay(inode, offset, length, flags,
> >  				iomap);
> >  	}
> 
> What about the short write case? E.g., do we have to clear out delalloc
> blocks from the cow fork in iomap_end() if we don't end up using them?

Nope, unused blocks sit around in the CoW fork (with the cowextsize hint
set, this happens all the time) so that a subsequent write to an
adjacent file offset lands in the same place as the successful write.
The unused extents get cleaned out when the inode is evicted, we run out
of disk space, or the garbage collector triggers.
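
For reference, the four-step loop from the commit message above can be
modelled roughly like this (a simplified Python sketch with hypothetical
names and data structures, not the kernel code):

```python
def reserve_cow_range(extents, reserved, offset, end):
    """Model of the commit-message loop: walk [offset, end) and record
    a CoW reservation for each shared sub-range.

    extents  -- data fork mappings as (startoff, count, shared) tuples
                covering the range contiguously (hypothetical layout)
    reserved -- existing CoW fork reservations as (startoff, count);
                mutated in place as new reservations are added
    """
    new = []
    while offset < end:
        # 1) already have a CoW fork extent here?  skip past it.
        hit = next((r for r in reserved
                    if r[0] <= offset < r[0] + r[1]), None)
        if hit:
            offset = hit[0] + hit[1]
            continue
        # 2) look up the data fork mapping at this offset
        start, count, shared = next(e for e in extents
                                    if e[0] <= offset < e[0] + e[1])
        # stop at the mapping end, the range end, or the next reservation
        nxt = min((r[0] for r in reserved if r[0] > offset), default=end)
        ext_end = min(start + count, end, nxt)
        # 3) shared?  reserve (model) delalloc blocks in the CoW fork.
        if shared:
            rec = (offset, ext_end - offset)
            reserved.append(rec)
            new.append(rec)
        # 4) goto 1) unless we've covered the whole range
        offset = ext_end
    return new
```

For example, reserving over a fully shared mapping with one pre-existing
CoW reservation at blocks 4-7, `reserve_cow_range([(0, 16, True)],
[(4, 4)], 0, 12)`, only fills in the uncovered gaps and returns
`[(0, 4), (8, 4)]`.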

> > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > index 7adbb83..05a7fe6 100644
> > --- a/fs/xfs/xfs_reflink.c
> > +++ b/fs/xfs/xfs_reflink.c
> > @@ -51,6 +51,7 @@
> >  #include "xfs_btree.h"
> >  #include "xfs_bmap_btree.h"
> >  #include "xfs_reflink.h"
> > +#include "xfs_iomap.h"
> >  
> >  /*
> >   * Copy on Write of Shared Blocks
> > @@ -112,3 +113,204 @@
> >   * ioend structure.  Better yet, the more ground we can cover with one
> >   * ioend, the better.
> >   */
> > +
> > +/*
> > + * Given an AG extent, find the lowest-numbered run of shared blocks within
> > + * that range and return the range in fbno/flen.
> > + */
> > +int
> > +xfs_reflink_find_shared(
> > +	struct xfs_mount	*mp,
> > +	xfs_agnumber_t		agno,
> > +	xfs_agblock_t		agbno,
> > +	xfs_extlen_t		aglen,
> > +	xfs_agblock_t		*fbno,
> > +	xfs_extlen_t		*flen,
> > +	bool			find_maximal)
> > +{
> > +	struct xfs_buf		*agbp;
> > +	struct xfs_btree_cur	*cur;
> > +	int			error;
> > +
> > +	error = xfs_alloc_read_agf(mp, NULL, agno, 0, &agbp);
> > +	if (error)
> > +		return error;
> > +
> > +	cur = xfs_refcountbt_init_cursor(mp, NULL, agbp, agno, NULL);
> > +
> > +	error = xfs_refcount_find_shared(cur, agbno, aglen, fbno, flen,
> > +			find_maximal);
> > +
> > +	xfs_btree_del_cursor(cur, error ? XFS_BTREE_ERROR : XFS_BTREE_NOERROR);
> > +
> > +	xfs_buf_relse(agbp);
> > +	return error;
> > +}
> > +
> > +/*
> > + * Trim the mapping to the next block where there's a change in the
> > + * shared/unshared status.  More specifically, this means that we
> > + * find the lowest-numbered extent of shared blocks that coincides with
> > + * the given block mapping.  If the shared extent overlaps the start of
> > + * the mapping, trim the mapping to the end of the shared extent.  If
> > + * the shared region intersects the mapping, trim the mapping to the
> > + * start of the shared extent.  If there are no shared regions that
> > + * overlap, just return the original extent.
> > + */
> > +int
> > +xfs_reflink_trim_around_shared(
> > +	struct xfs_inode	*ip,
> > +	struct xfs_bmbt_irec	*irec,
> > +	bool			*shared,
> > +	bool			*trimmed)
> > +{
> > +	xfs_agnumber_t		agno;
> > +	xfs_agblock_t		agbno;
> > +	xfs_extlen_t		aglen;
> > +	xfs_agblock_t		fbno;
> > +	xfs_extlen_t		flen;
> > +	int			error = 0;
> > +
> > +	/* Holes, unwritten, and delalloc extents cannot be shared */
> > +	if (!xfs_is_reflink_inode(ip) ||
> > +	    ISUNWRITTEN(irec) ||
> > +	    irec->br_startblock == HOLESTARTBLOCK ||
> > +	    irec->br_startblock == DELAYSTARTBLOCK) {
> > +		*shared = false;
> > +		return 0;
> > +	}
> > +
> > +	trace_xfs_reflink_trim_around_shared(ip, irec);
> > +
> > +	agno = XFS_FSB_TO_AGNO(ip->i_mount, irec->br_startblock);
> > +	agbno = XFS_FSB_TO_AGBNO(ip->i_mount, irec->br_startblock);
> > +	aglen = irec->br_blockcount;
> > +
> > +	error = xfs_reflink_find_shared(ip->i_mount, agno, agbno,
> > +			aglen, &fbno, &flen, true);
> > +	if (error)
> > +		return error;
> > +
> > +	*shared = *trimmed = false;
> > +	if (flen == 0) {
> 
> Preferable to use NULLAGBLOCK for this, imo.

Yeah, I will look into changing this.

> > +		/* No shared blocks at all. */
> > +		return 0;
> > +	} else if (fbno == agbno) {
> > +		/* The start of this extent is shared. */
> > +		irec->br_blockcount = flen;
> > +		*shared = true;
> > +		*trimmed = true;
> 
> Why do we set trimmed based solely on fbno == agbno? Is that valid if
> the bmapbt extent exactly matches the refcntbt extent and we thus don't
> actually modify the extent (e.g., br_blockcount == flen)? It's hard to
> tell because trimmed looks unused (to this point?), so I could just
> misunderstand the meaning.

You're right, we don't have to set trimmed if flen == aglen.
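
Concretely, the trim logic with that fix folded in reduces to three
cases (a hypothetical Python model of the AG-block arithmetic, not the
kernel function):

```python
def trim_around_shared(agbno, aglen, fbno, flen):
    """Given a mapping at agbno spanning aglen blocks and the first
    shared run (fbno, flen) inside it (flen == 0 meaning no shared
    blocks), return (new_blockcount, shared, trimmed).  trimmed is
    only set when the mapping was actually shortened."""
    if flen == 0:
        # no shared blocks at all: return the mapping untouched
        return aglen, False, False
    if fbno == agbno:
        # the start of the mapping is shared: trim to the shared run
        return flen, True, flen < aglen
    # a shared run starts midway: trim to the unshared prefix so the
    # next iteration starts exactly at the shared region
    return fbno - agbno, False, True
```

e.g. `trim_around_shared(10, 8, 10, 8)` reports the whole mapping as
shared without marking it trimmed, while `trim_around_shared(10, 8, 13, 5)`
trims the mapping down to the three unshared leading blocks.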

> > +		return 0;
> > +	} else {
> > +		/* There's a shared extent midway through this extent. */
> > +		irec->br_blockcount = fbno - agbno;
> 
> Don't we have to push the startblock forward in this case?
> 
> Oh, I see. We trim the unshared length to push the fileoffset fsb to the
> start of the shared region for the next iteration.

Yep.  I'll clarify the comment.

--D

> 
> Brian
> 
> > +		*trimmed = true;
> > +		return 0;
> > +	}
> > +}
> > +
> > +/* Create a CoW reservation for a range of blocks within a file. */
> > +static int
> > +__xfs_reflink_reserve_cow(
> > +	struct xfs_inode	*ip,
> > +	xfs_fileoff_t		*offset_fsb,
> > +	xfs_fileoff_t		end_fsb)
> > +{
> > +	struct xfs_bmbt_irec	got, prev, imap;
> > +	xfs_fileoff_t		orig_end_fsb;
> > +	int			nimaps, eof = 0, error = 0;
> > +	bool			shared = false, trimmed = false;
> > +	xfs_extnum_t		idx;
> > +
> > +	/* Already reserved?  Skip the refcount btree access. */
> > +	xfs_bmap_search_extents(ip, *offset_fsb, XFS_COW_FORK, &eof, &idx,
> > +			&got, &prev);
> > +	if (!eof && got.br_startoff <= *offset_fsb) {
> > +		end_fsb = orig_end_fsb = got.br_startoff + got.br_blockcount;
> > +		trace_xfs_reflink_cow_found(ip, &got);
> > +		goto done;
> > +	}
> > +
> > +	/* Read extent from the source file. */
> > +	nimaps = 1;
> > +	error = xfs_bmapi_read(ip, *offset_fsb, end_fsb - *offset_fsb,
> > +			&imap, &nimaps, 0);
> > +	if (error)
> > +		goto out_unlock;
> > +	ASSERT(nimaps == 1);
> > +
> > +	/* Trim the mapping to the nearest shared extent boundary. */
> > +	error = xfs_reflink_trim_around_shared(ip, &imap, &shared, &trimmed);
> > +	if (error)
> > +		goto out_unlock;
> > +
> > +	end_fsb = orig_end_fsb = imap.br_startoff + imap.br_blockcount;
> > +
> > +	/* Not shared?  Just report the (potentially capped) extent. */
> > +	if (!shared)
> > +		goto done;
> > +
> > +	/*
> > +	 * Fork all the shared blocks from our write offset until the end of
> > +	 * the extent.
> > +	 */
> > +	error = xfs_qm_dqattach_locked(ip, 0);
> > +	if (error)
> > +		goto out_unlock;
> > +
> > +retry:
> > +	error = xfs_bmapi_reserve_delalloc(ip, XFS_COW_FORK, *offset_fsb,
> > +			end_fsb - *offset_fsb, &got,
> > +			&prev, &idx, eof);
> > +	switch (error) {
> > +	case 0:
> > +		break;
> > +	case -ENOSPC:
> > +	case -EDQUOT:
> > +		/* retry without any preallocation */
> > +		trace_xfs_reflink_cow_enospc(ip, &imap);
> > +		if (end_fsb != orig_end_fsb) {
> > +			end_fsb = orig_end_fsb;
> > +			goto retry;
> > +		}
> > +		/*FALLTHRU*/
> > +	default:
> > +		goto out_unlock;
> > +	}
> > +
> > +	trace_xfs_reflink_cow_alloc(ip, &got);
> > +done:
> > +	*offset_fsb = end_fsb;
> > +out_unlock:
> > +	return error;
> > +}
> > +
> > +/* Create a CoW reservation for part of a file. */
> > +int
> > +xfs_reflink_reserve_cow_range(
> > +	struct xfs_inode	*ip,
> > +	xfs_off_t		offset,
> > +	xfs_off_t		count)
> > +{
> > +	struct xfs_mount	*mp = ip->i_mount;
> > +	xfs_fileoff_t		offset_fsb, end_fsb;
> > +	int			error;
> > +
> > +	trace_xfs_reflink_reserve_cow_range(ip, offset, count);
> > +
> > +	offset_fsb = XFS_B_TO_FSBT(mp, offset);
> > +	end_fsb = XFS_B_TO_FSB(mp, offset + count);
> > +
> > +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > +	while (offset_fsb < end_fsb) {
> > +		error = __xfs_reflink_reserve_cow(ip, &offset_fsb, end_fsb);
> > +		if (error) {
> > +			trace_xfs_reflink_reserve_cow_range_error(ip, error,
> > +				_RET_IP_);
> > +			break;
> > +		}
> > +	}
> > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > +
> > +	return error;
> > +}
> > diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
> > index 820b151..f824f87 100644
> > --- a/fs/xfs/xfs_reflink.h
> > +++ b/fs/xfs/xfs_reflink.h
> > @@ -20,4 +20,13 @@
> >  #ifndef __XFS_REFLINK_H
> >  #define __XFS_REFLINK_H 1
> >  
> > +extern int xfs_reflink_find_shared(struct xfs_mount *mp, xfs_agnumber_t agno,
> > +		xfs_agblock_t agbno, xfs_extlen_t aglen, xfs_agblock_t *fbno,
> > +		xfs_extlen_t *flen, bool find_maximal);
> > +extern int xfs_reflink_trim_around_shared(struct xfs_inode *ip,
> > +		struct xfs_bmbt_irec *irec, bool *shared, bool *trimmed);
> > +
> > +extern int xfs_reflink_reserve_cow_range(struct xfs_inode *ip,
> > +		xfs_off_t offset, xfs_off_t count);
> > +
> >  #endif /* __XFS_REFLINK_H */
> > 

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 33/63] xfs: allocate delayed extents in CoW fork
  2016-10-04 16:38   ` Brian Foster
@ 2016-10-04 18:26     ` Darrick J. Wong
  2016-10-04 18:39       ` Brian Foster
  0 siblings, 1 reply; 187+ messages in thread
From: Darrick J. Wong @ 2016-10-04 18:26 UTC (permalink / raw)
  To: Brian Foster; +Cc: david, linux-xfs, Christoph Hellwig

On Tue, Oct 04, 2016 at 12:38:40PM -0400, Brian Foster wrote:
> On Thu, Sep 29, 2016 at 08:09:14PM -0700, Darrick J. Wong wrote:
> > Modify the writepage handler to find and convert pending delalloc
> > extents to real allocations.  Furthermore, when we're doing non-cow
> > writes to a part of a file that already has a CoW reservation (the
> > cowextsz hint that we set up in a subsequent patch facilitates this),
> > promote the write to copy-on-write so that the entire extent can get
> > written out as a single extent on disk, thereby reducing post-CoW
> > fragmentation.
> > 
> > Christoph moved the CoW support code in _map_blocks to a separate helper
> > function, refactored other functions, and reduced the number of CoW fork
> > lookups, so I merged those changes here to reduce churn.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > ---
> >  fs/xfs/xfs_aops.c    |  106 ++++++++++++++++++++++++++++++++++++++++----------
> >  fs/xfs/xfs_aops.h    |    4 +-
> >  fs/xfs/xfs_reflink.c |   86 +++++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/xfs_reflink.h |    4 ++
> >  4 files changed, 178 insertions(+), 22 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > index 007a520..7b1e9de 100644
> > --- a/fs/xfs/xfs_aops.c
> > +++ b/fs/xfs/xfs_aops.c
> ...
> > @@ -645,13 +653,16 @@ xfs_check_page_type(
> >  	bh = head = page_buffers(page);
> >  	do {
> >  		if (buffer_unwritten(bh)) {
> > -			if (type == XFS_IO_UNWRITTEN)
> > +			if (type == XFS_IO_UNWRITTEN ||
> > +			    type == XFS_IO_COW)
> >  				return true;
> >  		} else if (buffer_delay(bh)) {
> > -			if (type == XFS_IO_DELALLOC)
> > +			if (type == XFS_IO_DELALLOC ||
> > +			    type == XFS_IO_COW)
> >  				return true;
> >  		} else if (buffer_dirty(bh) && buffer_mapped(bh)) {
> > -			if (type == XFS_IO_OVERWRITE)
> > +			if (type == XFS_IO_OVERWRITE ||
> > +			    type == XFS_IO_COW)
> >  				return true;
> >  		}
> 
> What's the purpose of this hunk? As it is, we don't appear to have any
> non-XFS_IO_DELALLOC callers. This probably warrants an update to the
> top-of-function comment at the very least.

Hmmmmm, originally these hunks /did/ actually help us to promote any
write that also had a CoW fork reservation into a copy-on-write to
reduce fragmentation, but with the iomap rework I think I can drop
this hunk since there's only one caller of this function now.
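
The promotion that does survive in the writepage path (the new_type
hunks quoted below) amounts to a small classification step; roughly, as
a hypothetical sketch:

```python
def classify_buffer(bh_state, has_cow_mapping):
    """Pick the I/O type for a buffer during writeback.  bh_state is a
    simplified stand-in for the buffer-head flags; has_cow_mapping says
    whether a CoW fork extent covers this offset."""
    if bh_state == "unwritten":
        io_type = "unwritten"
    elif bh_state == "delay":
        io_type = "delalloc"
    elif bh_state == "uptodate":
        io_type = "overwrite"
    else:
        return None        # not mapped and dirty: nothing to write
    # any buffer backed by a CoW reservation gets promoted so the
    # whole extent is written out copy-on-write
    if has_cow_mapping:
        io_type = "cow"
    return io_type
```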

--D

> 
> Brian
> 
> >  
> > @@ -739,6 +750,56 @@ xfs_aops_discard_page(
> >  	return;
> >  }
> >  
> > +static int
> > +xfs_map_cow(
> > +	struct xfs_writepage_ctx *wpc,
> > +	struct inode		*inode,
> > +	loff_t			offset,
> > +	unsigned int		*new_type)
> > +{
> > +	struct xfs_inode	*ip = XFS_I(inode);
> > +	struct xfs_bmbt_irec	imap;
> > +	bool			is_cow = false, need_alloc = false;
> > +	int			error;
> > +
> > +	/*
> > +	 * If we already have a valid COW mapping keep using it.
> > +	 */
> > +	if (wpc->io_type == XFS_IO_COW) {
> > +		wpc->imap_valid = xfs_imap_valid(inode, &wpc->imap, offset);
> > +		if (wpc->imap_valid) {
> > +			*new_type = XFS_IO_COW;
> > +			return 0;
> > +		}
> > +	}
> > +
> > +	/*
> > +	 * Else we need to check if there is a COW mapping at this offset.
> > +	 */
> > +	xfs_ilock(ip, XFS_ILOCK_SHARED);
> > +	is_cow = xfs_reflink_find_cow_mapping(ip, offset, &imap, &need_alloc);
> > +	xfs_iunlock(ip, XFS_ILOCK_SHARED);
> > +
> > +	if (!is_cow)
> > +		return 0;
> > +
> > +	/*
> > +	 * And if the COW mapping has a delayed extent here we need to
> > +	 * allocate real space for it now.
> > +	 */
> > +	if (need_alloc) {
> > +		error = xfs_iomap_write_allocate(ip, XFS_COW_FORK, offset,
> > +				&imap);
> > +		if (error)
> > +			return error;
> > +	}
> > +
> > +	wpc->io_type = *new_type = XFS_IO_COW;
> > +	wpc->imap_valid = true;
> > +	wpc->imap = imap;
> > +	return 0;
> > +}
> > +
> >  /*
> >   * We implement an immediate ioend submission policy here to avoid needing to
> >   * chain multiple ioends and hence nest mempool allocations which can violate
> > @@ -771,6 +832,7 @@ xfs_writepage_map(
> >  	int			error = 0;
> >  	int			count = 0;
> >  	int			uptodate = 1;
> > +	unsigned int		new_type;
> >  
> >  	bh = head = page_buffers(page);
> >  	offset = page_offset(page);
> > @@ -791,22 +853,13 @@ xfs_writepage_map(
> >  			continue;
> >  		}
> >  
> > -		if (buffer_unwritten(bh)) {
> > -			if (wpc->io_type != XFS_IO_UNWRITTEN) {
> > -				wpc->io_type = XFS_IO_UNWRITTEN;
> > -				wpc->imap_valid = false;
> > -			}
> > -		} else if (buffer_delay(bh)) {
> > -			if (wpc->io_type != XFS_IO_DELALLOC) {
> > -				wpc->io_type = XFS_IO_DELALLOC;
> > -				wpc->imap_valid = false;
> > -			}
> > -		} else if (buffer_uptodate(bh)) {
> > -			if (wpc->io_type != XFS_IO_OVERWRITE) {
> > -				wpc->io_type = XFS_IO_OVERWRITE;
> > -				wpc->imap_valid = false;
> > -			}
> > -		} else {
> > +		if (buffer_unwritten(bh))
> > +			new_type = XFS_IO_UNWRITTEN;
> > +		else if (buffer_delay(bh))
> > +			new_type = XFS_IO_DELALLOC;
> > +		else if (buffer_uptodate(bh))
> > +			new_type = XFS_IO_OVERWRITE;
> > +		else {
> >  			if (PageUptodate(page))
> >  				ASSERT(buffer_mapped(bh));
> >  			/*
> > @@ -819,6 +872,17 @@ xfs_writepage_map(
> >  			continue;
> >  		}
> >  
> > +		if (xfs_is_reflink_inode(XFS_I(inode))) {
> > +			error = xfs_map_cow(wpc, inode, offset, &new_type);
> > +			if (error)
> > +				goto out;
> > +		}
> > +
> > +		if (wpc->io_type != new_type) {
> > +			wpc->io_type = new_type;
> > +			wpc->imap_valid = false;
> > +		}
> > +
> >  		if (wpc->imap_valid)
> >  			wpc->imap_valid = xfs_imap_valid(inode, &wpc->imap,
> >  							 offset);
> > diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
> > index 1950e3b..b3c6634 100644
> > --- a/fs/xfs/xfs_aops.h
> > +++ b/fs/xfs/xfs_aops.h
> > @@ -28,13 +28,15 @@ enum {
> >  	XFS_IO_DELALLOC,	/* covers delalloc region */
> >  	XFS_IO_UNWRITTEN,	/* covers allocated but uninitialized data */
> >  	XFS_IO_OVERWRITE,	/* covers already allocated extent */
> > +	XFS_IO_COW,		/* covers copy-on-write extent */
> >  };
> >  
> >  #define XFS_IO_TYPES \
> >  	{ XFS_IO_INVALID,		"invalid" }, \
> >  	{ XFS_IO_DELALLOC,		"delalloc" }, \
> >  	{ XFS_IO_UNWRITTEN,		"unwritten" }, \
> > -	{ XFS_IO_OVERWRITE,		"overwrite" }
> > +	{ XFS_IO_OVERWRITE,		"overwrite" }, \
> > +	{ XFS_IO_COW,			"CoW" }
> >  
> >  /*
> >   * Structure for buffered I/O completions.
> > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > index 05a7fe6..e8c7c85 100644
> > --- a/fs/xfs/xfs_reflink.c
> > +++ b/fs/xfs/xfs_reflink.c
> > @@ -314,3 +314,89 @@ xfs_reflink_reserve_cow_range(
> >  
> >  	return error;
> >  }
> > +
> > +/*
> > + * Find the CoW reservation (and whether or not it needs block allocation)
> > + * for a given byte offset of a file.
> > + */
> > +bool
> > +xfs_reflink_find_cow_mapping(
> > +	struct xfs_inode		*ip,
> > +	xfs_off_t			offset,
> > +	struct xfs_bmbt_irec		*imap,
> > +	bool				*need_alloc)
> > +{
> > +	struct xfs_bmbt_irec		irec;
> > +	struct xfs_ifork		*ifp;
> > +	struct xfs_bmbt_rec_host	*gotp;
> > +	xfs_fileoff_t			bno;
> > +	xfs_extnum_t			idx;
> > +
> > +	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL | XFS_ILOCK_SHARED));
> > +
> > +	if (!xfs_is_reflink_inode(ip))
> > +		return false;
> > +
> > +	/* Find the extent in the CoW fork. */
> > +	ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
> > +	bno = XFS_B_TO_FSBT(ip->i_mount, offset);
> > +	gotp = xfs_iext_bno_to_ext(ifp, bno, &idx);
> > +	if (!gotp)
> > +		return false;
> > +
> > +	xfs_bmbt_get_all(gotp, &irec);
> > +	if (bno >= irec.br_startoff + irec.br_blockcount ||
> > +	    bno < irec.br_startoff)
> > +		return false;
> > +
> > +	trace_xfs_reflink_find_cow_mapping(ip, offset, 1, XFS_IO_OVERWRITE,
> > +			&irec);
> > +
> > +	/* If it's still delalloc, we must allocate later. */
> > +	*imap = irec;
> > +	*need_alloc = !!(isnullstartblock(irec.br_startblock));
> > +
> > +	return true;
> > +}
> > +
> > +/*
> > + * Trim an extent to end at the next CoW reservation past offset_fsb.
> > + */
> > +int
> > +xfs_reflink_trim_irec_to_next_cow(
> > +	struct xfs_inode		*ip,
> > +	xfs_fileoff_t			offset_fsb,
> > +	struct xfs_bmbt_irec		*imap)
> > +{
> > +	struct xfs_bmbt_irec		irec;
> > +	struct xfs_ifork		*ifp;
> > +	struct xfs_bmbt_rec_host	*gotp;
> > +	xfs_extnum_t			idx;
> > +
> > +	if (!xfs_is_reflink_inode(ip))
> > +		return 0;
> > +
> > +	/* Find the extent in the CoW fork. */
> > +	ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
> > +	gotp = xfs_iext_bno_to_ext(ifp, offset_fsb, &idx);
> > +	if (!gotp)
> > +		return 0;
> > +	xfs_bmbt_get_all(gotp, &irec);
> > +
> > +	/* This is the extent before; try sliding up one. */
> > +	if (irec.br_startoff < offset_fsb) {
> > +		idx++;
> > +		if (idx >= ifp->if_bytes / sizeof(xfs_bmbt_rec_t))
> > +			return 0;
> > +		gotp = xfs_iext_get_ext(ifp, idx);
> > +		xfs_bmbt_get_all(gotp, &irec);
> > +	}
> > +
> > +	if (irec.br_startoff >= imap->br_startoff + imap->br_blockcount)
> > +		return 0;
> > +
> > +	imap->br_blockcount = irec.br_startoff - imap->br_startoff;
> > +	trace_xfs_reflink_trim_irec(ip, imap);
> > +
> > +	return 0;
> > +}
> > diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
> > index f824f87..11408c0 100644
> > --- a/fs/xfs/xfs_reflink.h
> > +++ b/fs/xfs/xfs_reflink.h
> > @@ -28,5 +28,9 @@ extern int xfs_reflink_trim_around_shared(struct xfs_inode *ip,
> >  
> >  extern int xfs_reflink_reserve_cow_range(struct xfs_inode *ip,
> >  		xfs_off_t offset, xfs_off_t count);
> > +extern bool xfs_reflink_find_cow_mapping(struct xfs_inode *ip, xfs_off_t offset,
> > +		struct xfs_bmbt_irec *imap, bool *need_alloc);
> > +extern int xfs_reflink_trim_irec_to_next_cow(struct xfs_inode *ip,
> > +		xfs_fileoff_t offset_fsb, struct xfs_bmbt_irec *imap);
> >  
> >  #endif /* __XFS_REFLINK_H */
> > 

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 31/63] xfs: create delalloc extents in CoW fork
  2016-10-04 17:39     ` Darrick J. Wong
@ 2016-10-04 18:38       ` Brian Foster
  0 siblings, 0 replies; 187+ messages in thread
From: Brian Foster @ 2016-10-04 18:38 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, Christoph Hellwig

On Tue, Oct 04, 2016 at 10:39:09AM -0700, Darrick J. Wong wrote:
> On Tue, Oct 04, 2016 at 12:38:23PM -0400, Brian Foster wrote:
> > On Thu, Sep 29, 2016 at 08:09:01PM -0700, Darrick J. Wong wrote:
> > > Wire up iomap_begin to detect shared extents and create delayed allocation
> > > extents in the CoW fork:
> > > 
> > >  1) Check if we already have an extent in the COW fork for the area.
> > >     If so nothing to do, we can move along.
> > >  2) Look up the block number for the current extent, and if there is
> > >     none it's not shared, so move along.
> > >  3) Unshare the current extent as far as we are going to write into it.
> > >     For this we avoid an additional COW fork lookup and use the
> > >     information we set aside in step 1) above.
> > >  4) Goto 1) unless we've covered the whole range.
> > > 
> > > Last but not least, this updates the xfs_reflink_reserve_cow_range calling
> > > convention to pass a byte offset and length, as that is what both callers
> > > expect anyway.  This patch has been refactored considerably as part of the
> > > iomap transition.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > > ---
> > >  fs/xfs/xfs_iomap.c   |   12 ++-
> > >  fs/xfs/xfs_reflink.c |  202 ++++++++++++++++++++++++++++++++++++++++++++++++++
> > >  fs/xfs/xfs_reflink.h |    9 ++
> > >  3 files changed, 221 insertions(+), 2 deletions(-)
> > > 
> > > 
> > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > > index 59c7beb..e8312b0 100644
> > > --- a/fs/xfs/xfs_iomap.c
> > > +++ b/fs/xfs/xfs_iomap.c
> > > @@ -39,6 +39,7 @@
> > >  #include "xfs_quota.h"
> > >  #include "xfs_dquot_item.h"
> > >  #include "xfs_dquot.h"
> > > +#include "xfs_reflink.h"
> > >  
> > >  
> > >  #define XFS_WRITEIO_ALIGN(mp,off)	(((off) >> mp->m_writeio_log) \
> > > @@ -961,8 +962,15 @@ xfs_file_iomap_begin(
> > >  	if (XFS_FORCED_SHUTDOWN(mp))
> > >  		return -EIO;
> > >  
> > > -	if ((flags & IOMAP_WRITE) &&
> > > -	    !IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
> > > +	if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && xfs_is_reflink_inode(ip)) {
> > > +		error = xfs_reflink_reserve_cow_range(ip, offset, length);
> > > +		if (error < 0)
> > > +			return error;
> > > +	}
> > > +
> > > +	if ((flags & IOMAP_WRITE) && !IS_DAX(inode) &&
> > > +		   !xfs_get_extsz_hint(ip)) {
> > > +		/* Reserve delalloc blocks for regular writeback. */
> > >  		return xfs_file_iomap_begin_delay(inode, offset, length, flags,
> > >  				iomap);
> > >  	}
> > 
> > What about the short write case? E.g., do we have to clear out delalloc
> > blocks from the cow fork in iomap_end() if we don't end up using them?
> 
> Nope, unused blocks sit around in the CoW fork (with the cowextsize hint
> set, this happens all the time) so that a subsequent write to an
> adjacent file offset lands in the same place as the successful write.
> The unused extents get cleaned out when the inode is evicted, we run out
> of disk space, or the garbage collector triggers.
> 

Interesting..  ok, I suppose I'll get to that bit eventually. :P Thanks.

Brian

> > > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > > index 7adbb83..05a7fe6 100644
> > > --- a/fs/xfs/xfs_reflink.c
> > > +++ b/fs/xfs/xfs_reflink.c
> > > @@ -51,6 +51,7 @@
> > >  #include "xfs_btree.h"
> > >  #include "xfs_bmap_btree.h"
> > >  #include "xfs_reflink.h"
> > > +#include "xfs_iomap.h"
> > >  
> > >  /*
> > >   * Copy on Write of Shared Blocks
> > > @@ -112,3 +113,204 @@
> > >   * ioend structure.  Better yet, the more ground we can cover with one
> > >   * ioend, the better.
> > >   */
> > > +
> > > +/*
> > > + * Given an AG extent, find the lowest-numbered run of shared blocks within
> > > + * that range and return the range in fbno/flen.
> > > + */
> > > +int
> > > +xfs_reflink_find_shared(
> > > +	struct xfs_mount	*mp,
> > > +	xfs_agnumber_t		agno,
> > > +	xfs_agblock_t		agbno,
> > > +	xfs_extlen_t		aglen,
> > > +	xfs_agblock_t		*fbno,
> > > +	xfs_extlen_t		*flen,
> > > +	bool			find_maximal)
> > > +{
> > > +	struct xfs_buf		*agbp;
> > > +	struct xfs_btree_cur	*cur;
> > > +	int			error;
> > > +
> > > +	error = xfs_alloc_read_agf(mp, NULL, agno, 0, &agbp);
> > > +	if (error)
> > > +		return error;
> > > +
> > > +	cur = xfs_refcountbt_init_cursor(mp, NULL, agbp, agno, NULL);
> > > +
> > > +	error = xfs_refcount_find_shared(cur, agbno, aglen, fbno, flen,
> > > +			find_maximal);
> > > +
> > > +	xfs_btree_del_cursor(cur, error ? XFS_BTREE_ERROR : XFS_BTREE_NOERROR);
> > > +
> > > +	xfs_buf_relse(agbp);
> > > +	return error;
> > > +}
> > > +
> > > +/*
> > > + * Trim the mapping to the next block where there's a change in the
> > > + * shared/unshared status.  More specifically, this means that we
> > > + * find the lowest-numbered extent of shared blocks that coincides with
> > > + * the given block mapping.  If the shared extent overlaps the start of
> > > + * the mapping, trim the mapping to the end of the shared extent.  If
> > > + * the shared region intersects the mapping, trim the mapping to the
> > > + * start of the shared extent.  If there are no shared regions that
> > > + * overlap, just return the original extent.
> > > + */
> > > +int
> > > +xfs_reflink_trim_around_shared(
> > > +	struct xfs_inode	*ip,
> > > +	struct xfs_bmbt_irec	*irec,
> > > +	bool			*shared,
> > > +	bool			*trimmed)
> > > +{
> > > +	xfs_agnumber_t		agno;
> > > +	xfs_agblock_t		agbno;
> > > +	xfs_extlen_t		aglen;
> > > +	xfs_agblock_t		fbno;
> > > +	xfs_extlen_t		flen;
> > > +	int			error = 0;
> > > +
> > > +	/* Holes, unwritten, and delalloc extents cannot be shared */
> > > +	if (!xfs_is_reflink_inode(ip) ||
> > > +	    ISUNWRITTEN(irec) ||
> > > +	    irec->br_startblock == HOLESTARTBLOCK ||
> > > +	    irec->br_startblock == DELAYSTARTBLOCK) {
> > > +		*shared = false;
> > > +		return 0;
> > > +	}
> > > +
> > > +	trace_xfs_reflink_trim_around_shared(ip, irec);
> > > +
> > > +	agno = XFS_FSB_TO_AGNO(ip->i_mount, irec->br_startblock);
> > > +	agbno = XFS_FSB_TO_AGBNO(ip->i_mount, irec->br_startblock);
> > > +	aglen = irec->br_blockcount;
> > > +
> > > +	error = xfs_reflink_find_shared(ip->i_mount, agno, agbno,
> > > +			aglen, &fbno, &flen, true);
> > > +	if (error)
> > > +		return error;
> > > +
> > > +	*shared = *trimmed = false;
> > > +	if (flen == 0) {
> > 
> > Preferable to use NULLAGBLOCK for this, imo.
> 
> Yeah, I will look into changing this.
> 
> > > +		/* No shared blocks at all. */
> > > +		return 0;
> > > +	} else if (fbno == agbno) {
> > > +		/* The start of this extent is shared. */
> > > +		irec->br_blockcount = flen;
> > > +		*shared = true;
> > > +		*trimmed = true;
> > 
> > Why do we set trimmed based solely on fbno == agbno? Is that valid if
> > the bmapbt extent exactly matches the refcntbt extent and we thus don't
> > actually modify the extent (e.g., br_blockcount == flen)? It's hard to
> > tell because trimmed looks unused (to this point?), so I could just
> > misunderstand the meaning.
> 
> You're right, we don't have to set trimmed if flen == aglen.
> 
> > > +		return 0;
> > > +	} else {
> > > +		/* There's a shared extent midway through this extent. */
> > > +		irec->br_blockcount = fbno - agbno;
> > 
> > Don't we have to push the startblock forward in this case?
> > 
> > Oh, I see. We trim the unshared length to push the fileoffset fsb to the
> > start of the shared region for the next iteration.
> 
> Yep.  I'll clarify the comment.
> 
> --D
> 
> > 
> > Brian
> > 
> > > +		*trimmed = true;
> > > +		return 0;
> > > +	}
> > > +}
> > > +
> > > +/* Create a CoW reservation for a range of blocks within a file. */
> > > +static int
> > > +__xfs_reflink_reserve_cow(
> > > +	struct xfs_inode	*ip,
> > > +	xfs_fileoff_t		*offset_fsb,
> > > +	xfs_fileoff_t		end_fsb)
> > > +{
> > > +	struct xfs_bmbt_irec	got, prev, imap;
> > > +	xfs_fileoff_t		orig_end_fsb;
> > > +	int			nimaps, eof = 0, error = 0;
> > > +	bool			shared = false, trimmed = false;
> > > +	xfs_extnum_t		idx;
> > > +
> > > +	/* Already reserved?  Skip the refcount btree access. */
> > > +	xfs_bmap_search_extents(ip, *offset_fsb, XFS_COW_FORK, &eof, &idx,
> > > +			&got, &prev);
> > > +	if (!eof && got.br_startoff <= *offset_fsb) {
> > > +		end_fsb = orig_end_fsb = got.br_startoff + got.br_blockcount;
> > > +		trace_xfs_reflink_cow_found(ip, &got);
> > > +		goto done;
> > > +	}
> > > +
> > > +	/* Read extent from the source file. */
> > > +	nimaps = 1;
> > > +	error = xfs_bmapi_read(ip, *offset_fsb, end_fsb - *offset_fsb,
> > > +			&imap, &nimaps, 0);
> > > +	if (error)
> > > +		goto out_unlock;
> > > +	ASSERT(nimaps == 1);
> > > +
> > > +	/* Trim the mapping to the nearest shared extent boundary. */
> > > +	error = xfs_reflink_trim_around_shared(ip, &imap, &shared, &trimmed);
> > > +	if (error)
> > > +		goto out_unlock;
> > > +
> > > +	end_fsb = orig_end_fsb = imap.br_startoff + imap.br_blockcount;
> > > +
> > > +	/* Not shared?  Just report the (potentially capped) extent. */
> > > +	if (!shared)
> > > +		goto done;
> > > +
> > > +	/*
> > > +	 * Fork all the shared blocks from our write offset until the end of
> > > +	 * the extent.
> > > +	 */
> > > +	error = xfs_qm_dqattach_locked(ip, 0);
> > > +	if (error)
> > > +		goto out_unlock;
> > > +
> > > +retry:
> > > +	error = xfs_bmapi_reserve_delalloc(ip, XFS_COW_FORK, *offset_fsb,
> > > +			end_fsb - *offset_fsb, &got,
> > > +			&prev, &idx, eof);
> > > +	switch (error) {
> > > +	case 0:
> > > +		break;
> > > +	case -ENOSPC:
> > > +	case -EDQUOT:
> > > +		/* retry without any preallocation */
> > > +		trace_xfs_reflink_cow_enospc(ip, &imap);
> > > +		if (end_fsb != orig_end_fsb) {
> > > +			end_fsb = orig_end_fsb;
> > > +			goto retry;
> > > +		}
> > > +		/*FALLTHRU*/
> > > +	default:
> > > +		goto out_unlock;
> > > +	}
> > > +
> > > +	trace_xfs_reflink_cow_alloc(ip, &got);
> > > +done:
> > > +	*offset_fsb = end_fsb;
> > > +out_unlock:
> > > +	return error;
> > > +}
> > > +
> > > +/* Create a CoW reservation for part of a file. */
> > > +int
> > > +xfs_reflink_reserve_cow_range(
> > > +	struct xfs_inode	*ip,
> > > +	xfs_off_t		offset,
> > > +	xfs_off_t		count)
> > > +{
> > > +	struct xfs_mount	*mp = ip->i_mount;
> > > +	xfs_fileoff_t		offset_fsb, end_fsb;
> > > +	int			error;
> > > +
> > > +	trace_xfs_reflink_reserve_cow_range(ip, offset, count);
> > > +
> > > +	offset_fsb = XFS_B_TO_FSBT(mp, offset);
> > > +	end_fsb = XFS_B_TO_FSB(mp, offset + count);
> > > +
> > > +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > > +	while (offset_fsb < end_fsb) {
> > > +		error = __xfs_reflink_reserve_cow(ip, &offset_fsb, end_fsb);
> > > +		if (error) {
> > > +			trace_xfs_reflink_reserve_cow_range_error(ip, error,
> > > +				_RET_IP_);
> > > +			break;
> > > +		}
> > > +	}
> > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > +
> > > +	return error;
> > > +}
> > > diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
> > > index 820b151..f824f87 100644
> > > --- a/fs/xfs/xfs_reflink.h
> > > +++ b/fs/xfs/xfs_reflink.h
> > > @@ -20,4 +20,13 @@
> > >  #ifndef __XFS_REFLINK_H
> > >  #define __XFS_REFLINK_H 1
> > >  
> > > +extern int xfs_reflink_find_shared(struct xfs_mount *mp, xfs_agnumber_t agno,
> > > +		xfs_agblock_t agbno, xfs_extlen_t aglen, xfs_agblock_t *fbno,
> > > +		xfs_extlen_t *flen, bool find_maximal);
> > > +extern int xfs_reflink_trim_around_shared(struct xfs_inode *ip,
> > > +		struct xfs_bmbt_irec *irec, bool *shared, bool *trimmed);
> > > +
> > > +extern int xfs_reflink_reserve_cow_range(struct xfs_inode *ip,
> > > +		xfs_off_t offset, xfs_off_t count);
> > > +
> > >  #endif /* __XFS_REFLINK_H */
> > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 187+ messages in thread
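The three cases discussed above (no shared blocks, shared at the start of the mapping, shared region starting midway through) can be modeled as a small standalone sketch. The names and types below (`trim_around_shared`, `struct extent`) are simplified stand-ins for the kernel's `xfs_bmbt_irec` and the refcount-btree lookup, not the actual XFS API; the sketch also folds in Brian's review point that `trimmed` should only be set when the mapping length actually changes:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified extent: an AG-relative start block and a length. */
struct extent {
	uint32_t start;
	uint32_t len;
};

/*
 * Model of the trim logic: given the mapping [irec->start, irec->start +
 * irec->len) and the first shared region [fbno, fbno + flen) reported by
 * the refcount lookup (flen == 0 means no overlap at all), trim the
 * mapping and report whether its head is shared.
 */
static void trim_around_shared(struct extent *irec,
			       uint32_t fbno, uint32_t flen,
			       bool *shared, bool *trimmed)
{
	*shared = *trimmed = false;
	if (flen == 0) {
		/* No shared blocks at all: leave the mapping untouched. */
		return;
	} else if (fbno == irec->start) {
		/* The start of the mapping is shared: trim to the shared run. */
		*shared = true;
		if (flen != irec->len) {
			irec->len = flen;
			*trimmed = true;
		}
	} else {
		/*
		 * A shared region starts midway through: trim the (unshared)
		 * head so the next iteration begins at the shared region.
		 */
		irec->len = fbno - irec->start;
		*trimmed = true;
	}
}
```

The midway case leaves `*shared` false on purpose: the caller reports the unshared head first, and the loop in `xfs_reflink_reserve_cow_range` then resumes at the start of the shared region.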

* Re: [PATCH 33/63] xfs: allocate delayed extents in CoW fork
  2016-10-04 18:26     ` Darrick J. Wong
@ 2016-10-04 18:39       ` Brian Foster
  0 siblings, 0 replies; 187+ messages in thread
From: Brian Foster @ 2016-10-04 18:39 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, Christoph Hellwig

On Tue, Oct 04, 2016 at 11:26:37AM -0700, Darrick J. Wong wrote:
> On Tue, Oct 04, 2016 at 12:38:40PM -0400, Brian Foster wrote:
> > On Thu, Sep 29, 2016 at 08:09:14PM -0700, Darrick J. Wong wrote:
> > > Modify the writepage handler to find and convert pending delalloc
> > > extents to real allocations.  Furthermore, when we're doing non-cow
> > > writes to a part of a file that already has a CoW reservation (the
> > > cowextsz hint that we set up in a subsequent patch facilitates this),
> > > promote the write to copy-on-write so that the entire extent can get
> > > written out as a single extent on disk, thereby reducing post-CoW
> > > fragmentation.
> > > 
> > > Christoph moved the CoW support code in _map_blocks to a separate helper
> > > function, refactored other functions, and reduced the number of CoW fork
> > > lookups, so I merged those changes here to reduce churn.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > > ---
> > >  fs/xfs/xfs_aops.c    |  106 ++++++++++++++++++++++++++++++++++++++++----------
> > >  fs/xfs/xfs_aops.h    |    4 +-
> > >  fs/xfs/xfs_reflink.c |   86 +++++++++++++++++++++++++++++++++++++++++
> > >  fs/xfs/xfs_reflink.h |    4 ++
> > >  4 files changed, 178 insertions(+), 22 deletions(-)
> > > 
> > > 
> > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > > index 007a520..7b1e9de 100644
> > > --- a/fs/xfs/xfs_aops.c
> > > +++ b/fs/xfs/xfs_aops.c
> > ...
> > > @@ -645,13 +653,16 @@ xfs_check_page_type(
> > >  	bh = head = page_buffers(page);
> > >  	do {
> > >  		if (buffer_unwritten(bh)) {
> > > -			if (type == XFS_IO_UNWRITTEN)
> > > +			if (type == XFS_IO_UNWRITTEN ||
> > > +			    type == XFS_IO_COW)
> > >  				return true;
> > >  		} else if (buffer_delay(bh)) {
> > > -			if (type == XFS_IO_DELALLOC)
> > > +			if (type == XFS_IO_DELALLOC ||
> > > +			    type == XFS_IO_COW)
> > >  				return true;
> > >  		} else if (buffer_dirty(bh) && buffer_mapped(bh)) {
> > > -			if (type == XFS_IO_OVERWRITE)
> > > +			if (type == XFS_IO_OVERWRITE ||
> > > +			    type == XFS_IO_COW)
> > >  				return true;
> > >  		}
> > 
> > What's the purpose of this hunk? As it is, we don't appear to have any
> > non-XFS_IO_DELALLOC callers. This probably warrants an update to the
> > top-of-function comment at the very least.
> 
> Hmmmmm, originally these hunks /did/ actually help us to promote any
> write that also had a CoW fork reservation into a copy-on-write to
> reduce fragmentation, but with the iomap rework I think I can drop
> this hunk since there's only one caller of this function now.
> 

Ah, right. That makes sense. Thanks.

Brian

> --D
> 
> > 
> > Brian
> > 
> > >  
> > > @@ -739,6 +750,56 @@ xfs_aops_discard_page(
> > >  	return;
> > >  }
> > >  
> > > +static int
> > > +xfs_map_cow(
> > > +	struct xfs_writepage_ctx *wpc,
> > > +	struct inode		*inode,
> > > +	loff_t			offset,
> > > +	unsigned int		*new_type)
> > > +{
> > > +	struct xfs_inode	*ip = XFS_I(inode);
> > > +	struct xfs_bmbt_irec	imap;
> > > +	bool			is_cow = false, need_alloc = false;
> > > +	int			error;
> > > +
> > > +	/*
> > > +	 * If we already have a valid COW mapping keep using it.
> > > +	 */
> > > +	if (wpc->io_type == XFS_IO_COW) {
> > > +		wpc->imap_valid = xfs_imap_valid(inode, &wpc->imap, offset);
> > > +		if (wpc->imap_valid) {
> > > +			*new_type = XFS_IO_COW;
> > > +			return 0;
> > > +		}
> > > +	}
> > > +
> > > +	/*
> > > +	 * Else we need to check if there is a COW mapping at this offset.
> > > +	 */
> > > +	xfs_ilock(ip, XFS_ILOCK_SHARED);
> > > +	is_cow = xfs_reflink_find_cow_mapping(ip, offset, &imap, &need_alloc);
> > > +	xfs_iunlock(ip, XFS_ILOCK_SHARED);
> > > +
> > > +	if (!is_cow)
> > > +		return 0;
> > > +
> > > +	/*
> > > +	 * And if the COW mapping has a delayed extent here we need to
> > > +	 * allocate real space for it now.
> > > +	 */
> > > +	if (need_alloc) {
> > > +		error = xfs_iomap_write_allocate(ip, XFS_COW_FORK, offset,
> > > +				&imap);
> > > +		if (error)
> > > +			return error;
> > > +	}
> > > +
> > > +	wpc->io_type = *new_type = XFS_IO_COW;
> > > +	wpc->imap_valid = true;
> > > +	wpc->imap = imap;
> > > +	return 0;
> > > +}
> > > +
> > >  /*
> > >   * We implement an immediate ioend submission policy here to avoid needing to
> > >   * chain multiple ioends and hence nest mempool allocations which can violate
> > > @@ -771,6 +832,7 @@ xfs_writepage_map(
> > >  	int			error = 0;
> > >  	int			count = 0;
> > >  	int			uptodate = 1;
> > > +	unsigned int		new_type;
> > >  
> > >  	bh = head = page_buffers(page);
> > >  	offset = page_offset(page);
> > > @@ -791,22 +853,13 @@ xfs_writepage_map(
> > >  			continue;
> > >  		}
> > >  
> > > -		if (buffer_unwritten(bh)) {
> > > -			if (wpc->io_type != XFS_IO_UNWRITTEN) {
> > > -				wpc->io_type = XFS_IO_UNWRITTEN;
> > > -				wpc->imap_valid = false;
> > > -			}
> > > -		} else if (buffer_delay(bh)) {
> > > -			if (wpc->io_type != XFS_IO_DELALLOC) {
> > > -				wpc->io_type = XFS_IO_DELALLOC;
> > > -				wpc->imap_valid = false;
> > > -			}
> > > -		} else if (buffer_uptodate(bh)) {
> > > -			if (wpc->io_type != XFS_IO_OVERWRITE) {
> > > -				wpc->io_type = XFS_IO_OVERWRITE;
> > > -				wpc->imap_valid = false;
> > > -			}
> > > -		} else {
> > > +		if (buffer_unwritten(bh))
> > > +			new_type = XFS_IO_UNWRITTEN;
> > > +		else if (buffer_delay(bh))
> > > +			new_type = XFS_IO_DELALLOC;
> > > +		else if (buffer_uptodate(bh))
> > > +			new_type = XFS_IO_OVERWRITE;
> > > +		else {
> > >  			if (PageUptodate(page))
> > >  				ASSERT(buffer_mapped(bh));
> > >  			/*
> > > @@ -819,6 +872,17 @@ xfs_writepage_map(
> > >  			continue;
> > >  		}
> > >  
> > > +		if (xfs_is_reflink_inode(XFS_I(inode))) {
> > > +			error = xfs_map_cow(wpc, inode, offset, &new_type);
> > > +			if (error)
> > > +				goto out;
> > > +		}
> > > +
> > > +		if (wpc->io_type != new_type) {
> > > +			wpc->io_type = new_type;
> > > +			wpc->imap_valid = false;
> > > +		}
> > > +
> > >  		if (wpc->imap_valid)
> > >  			wpc->imap_valid = xfs_imap_valid(inode, &wpc->imap,
> > >  							 offset);
> > > diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
> > > index 1950e3b..b3c6634 100644
> > > --- a/fs/xfs/xfs_aops.h
> > > +++ b/fs/xfs/xfs_aops.h
> > > @@ -28,13 +28,15 @@ enum {
> > >  	XFS_IO_DELALLOC,	/* covers delalloc region */
> > >  	XFS_IO_UNWRITTEN,	/* covers allocated but uninitialized data */
> > >  	XFS_IO_OVERWRITE,	/* covers already allocated extent */
> > > +	XFS_IO_COW,		/* covers copy-on-write extent */
> > >  };
> > >  
> > >  #define XFS_IO_TYPES \
> > >  	{ XFS_IO_INVALID,		"invalid" }, \
> > >  	{ XFS_IO_DELALLOC,		"delalloc" }, \
> > >  	{ XFS_IO_UNWRITTEN,		"unwritten" }, \
> > > -	{ XFS_IO_OVERWRITE,		"overwrite" }
> > > +	{ XFS_IO_OVERWRITE,		"overwrite" }, \
> > > +	{ XFS_IO_COW,			"CoW" }
> > >  
> > >  /*
> > >   * Structure for buffered I/O completions.
> > > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > > index 05a7fe6..e8c7c85 100644
> > > --- a/fs/xfs/xfs_reflink.c
> > > +++ b/fs/xfs/xfs_reflink.c
> > > @@ -314,3 +314,89 @@ xfs_reflink_reserve_cow_range(
> > >  
> > >  	return error;
> > >  }
> > > +
> > > +/*
> > > + * Find the CoW reservation (and whether or not it needs block allocation)
> > > + * for a given byte offset of a file.
> > > + */
> > > +bool
> > > +xfs_reflink_find_cow_mapping(
> > > +	struct xfs_inode		*ip,
> > > +	xfs_off_t			offset,
> > > +	struct xfs_bmbt_irec		*imap,
> > > +	bool				*need_alloc)
> > > +{
> > > +	struct xfs_bmbt_irec		irec;
> > > +	struct xfs_ifork		*ifp;
> > > +	struct xfs_bmbt_rec_host	*gotp;
> > > +	xfs_fileoff_t			bno;
> > > +	xfs_extnum_t			idx;
> > > +
> > > +	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL | XFS_ILOCK_SHARED));
> > > +
> > > +	if (!xfs_is_reflink_inode(ip))
> > > +		return false;
> > > +
> > > +	/* Find the extent in the CoW fork. */
> > > +	ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
> > > +	bno = XFS_B_TO_FSBT(ip->i_mount, offset);
> > > +	gotp = xfs_iext_bno_to_ext(ifp, bno, &idx);
> > > +	if (!gotp)
> > > +		return false;
> > > +
> > > +	xfs_bmbt_get_all(gotp, &irec);
> > > +	if (bno >= irec.br_startoff + irec.br_blockcount ||
> > > +	    bno < irec.br_startoff)
> > > +		return false;
> > > +
> > > +	trace_xfs_reflink_find_cow_mapping(ip, offset, 1, XFS_IO_OVERWRITE,
> > > +			&irec);
> > > +
> > > +	/* If it's still delalloc, we must allocate later. */
> > > +	*imap = irec;
> > > +	*need_alloc = !!(isnullstartblock(irec.br_startblock));
> > > +
> > > +	return true;
> > > +}
> > > +
> > > +/*
> > > + * Trim an extent to end at the next CoW reservation past offset_fsb.
> > > + */
> > > +int
> > > +xfs_reflink_trim_irec_to_next_cow(
> > > +	struct xfs_inode		*ip,
> > > +	xfs_fileoff_t			offset_fsb,
> > > +	struct xfs_bmbt_irec		*imap)
> > > +{
> > > +	struct xfs_bmbt_irec		irec;
> > > +	struct xfs_ifork		*ifp;
> > > +	struct xfs_bmbt_rec_host	*gotp;
> > > +	xfs_extnum_t			idx;
> > > +
> > > +	if (!xfs_is_reflink_inode(ip))
> > > +		return 0;
> > > +
> > > +	/* Find the extent in the CoW fork. */
> > > +	ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
> > > +	gotp = xfs_iext_bno_to_ext(ifp, offset_fsb, &idx);
> > > +	if (!gotp)
> > > +		return 0;
> > > +	xfs_bmbt_get_all(gotp, &irec);
> > > +
> > > +	/* This is the extent before; try sliding up one. */
> > > +	if (irec.br_startoff < offset_fsb) {
> > > +		idx++;
> > > +		if (idx >= ifp->if_bytes / sizeof(xfs_bmbt_rec_t))
> > > +			return 0;
> > > +		gotp = xfs_iext_get_ext(ifp, idx);
> > > +		xfs_bmbt_get_all(gotp, &irec);
> > > +	}
> > > +
> > > +	if (irec.br_startoff >= imap->br_startoff + imap->br_blockcount)
> > > +		return 0;
> > > +
> > > +	imap->br_blockcount = irec.br_startoff - imap->br_startoff;
> > > +	trace_xfs_reflink_trim_irec(ip, imap);
> > > +
> > > +	return 0;
> > > +}
> > > diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
> > > index f824f87..11408c0 100644
> > > --- a/fs/xfs/xfs_reflink.h
> > > +++ b/fs/xfs/xfs_reflink.h
> > > @@ -28,5 +28,9 @@ extern int xfs_reflink_trim_around_shared(struct xfs_inode *ip,
> > >  
> > >  extern int xfs_reflink_reserve_cow_range(struct xfs_inode *ip,
> > >  		xfs_off_t offset, xfs_off_t count);
> > > +extern bool xfs_reflink_find_cow_mapping(struct xfs_inode *ip, xfs_off_t offset,
> > > +		struct xfs_bmbt_irec *imap, bool *need_alloc);
> > > +extern int xfs_reflink_trim_irec_to_next_cow(struct xfs_inode *ip,
> > > +		xfs_fileoff_t offset_fsb, struct xfs_bmbt_irec *imap);
> > >  
> > >  #endif /* __XFS_REFLINK_H */
> > > 

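The trimming helper at the end of the patch above can likewise be modeled in isolation. The sketch below uses a simplified mapping type and takes the start of the next CoW reservation as a precomputed parameter, standing in for the CoW-fork extent lookup in `xfs_reflink_trim_irec_to_next_cow`; it assumes the reservation begins after the mapping's start offset (the kernel code slides past any reservation that begins earlier):

```c
#include <assert.h>
#include <stdint.h>

/* Simplified file mapping: starting file offset and block count. */
struct mapping {
	uint64_t startoff;
	uint64_t blockcount;
};

/*
 * Model of the trim: if a CoW reservation starts at cow_startoff inside
 * the mapping, shorten the mapping so that it ends exactly at the
 * reservation boundary, letting writeback switch to the CoW path there.
 */
static void trim_to_next_cow(struct mapping *imap, uint64_t cow_startoff)
{
	/* Simplification: ignore reservations at or before the start. */
	if (cow_startoff <= imap->startoff)
		return;
	/* Next reservation lies beyond this mapping: nothing to trim. */
	if (cow_startoff >= imap->startoff + imap->blockcount)
		return;
	imap->blockcount = cow_startoff - imap->startoff;
}
```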

* Re: [PATCH 24/63] xfs: when replaying bmap operations, don't let unlinked inodes get reaped
  2016-10-04 12:44       ` Brian Foster
@ 2016-10-04 19:07         ` Dave Chinner
  2016-10-04 21:44           ` Darrick J. Wong
  0 siblings, 1 reply; 187+ messages in thread
From: Dave Chinner @ 2016-10-04 19:07 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, linux-xfs

On Tue, Oct 04, 2016 at 08:44:01AM -0400, Brian Foster wrote:
> On Mon, Oct 03, 2016 at 05:29:25PM -0700, Darrick J. Wong wrote:
> > On Mon, Oct 03, 2016 at 03:04:10PM -0400, Brian Foster wrote:
> > > On Thu, Sep 29, 2016 at 08:08:17PM -0700, Darrick J. Wong wrote:
> > > > Log recovery will iget an inode to replay BUI items and iput the inode
> > > > when it's done.  Unfortunately, the iput will see that i_nlink == 0
> > > > and decide to truncate & free the inode, which prevents us from
> > > > replaying subsequent BUIs.  We can't skip the BUIs because we have to
> > > > replay all the redo items to ensure that atomic operations complete.
> > > > 
> ...
> > > 
> > > > Since unlinked inode recovery will reap the inode anyway, we can
> > > > safely introduce a new inode flag to indicate that an inode is in this
> > > > 'unlinked recovery' state and should not be auto-reaped in the
> > > > drop_inode path.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > ---
> > > >  fs/xfs/xfs_bmap_item.c   |    1 +
> > > >  fs/xfs/xfs_inode.c       |    8 ++++++++
> > > >  fs/xfs/xfs_inode.h       |    6 ++++++
> > > >  fs/xfs/xfs_log_recover.c |    1 +
> > > >  4 files changed, 16 insertions(+)
> > > > 
> > > > 
> ...
> > > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > > > index e08eaea..0c25a76 100644
> > > > --- a/fs/xfs/xfs_inode.c
> > > > +++ b/fs/xfs/xfs_inode.c
> > > > @@ -1855,6 +1855,14 @@ xfs_inactive(
> > > >  	if (mp->m_flags & XFS_MOUNT_RDONLY)
> > > >  		return;
> > > >  
> > > > +	/*
> > > > +	 * If this unlinked inode is in the middle of recovery, don't
> > > > +	 * truncate and free the inode just yet; log recovery will take
> > > > +	 * care of that.  See the comment for this inode flag.
> > > > +	 */
> > > > +	if (xfs_iflags_test(ip, XFS_IRECOVER_UNLINKED))
> > > > +		return;
> > > > +
> > > 
> > > Also, it might be better to push this one block of code down since the
> > > following block still deals with i_nlink > 0 properly (not that it will
> > > likely affect the code as it is now, since we only handle eofblocks
> > > trimming atm).
> > 
> > I put the jump-out case there so that we touch the inode's bmap as little
> > as possible while we're recovering the inode.  Since the inode is still
> > around in memory, we'll end up back there at a later point anyway.
> > 
> 
> I'm not quite following... it looks like we set the reclaim tag on the
> inode unconditionally after we get through xfs_inactive(). That implies
> the in-memory inode can go away at any point thereafter, unless somebody
> else comes along and happens to look for it. Hmm?

Yup - the iunlink recover check needs to go into xfs_fs_drop_inode()
to determine whether the inode should be dropped from the cache or
not by iput_final(). That way it will never get near xfs_inactive()
because the VFS won't try to evict it until the
XFS_IRECOVER_UNLINKED flag is cleared.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 24/63] xfs: when replaying bmap operations, don't let unlinked inodes get reaped
  2016-10-04 19:07         ` Dave Chinner
@ 2016-10-04 21:44           ` Darrick J. Wong
  0 siblings, 0 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-10-04 21:44 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Brian Foster, linux-xfs

On Wed, Oct 05, 2016 at 06:07:27AM +1100, Dave Chinner wrote:
> On Tue, Oct 04, 2016 at 08:44:01AM -0400, Brian Foster wrote:
> > On Mon, Oct 03, 2016 at 05:29:25PM -0700, Darrick J. Wong wrote:
> > > On Mon, Oct 03, 2016 at 03:04:10PM -0400, Brian Foster wrote:
> > > > On Thu, Sep 29, 2016 at 08:08:17PM -0700, Darrick J. Wong wrote:
> > > > > Log recovery will iget an inode to replay BUI items and iput the inode
> > > > > when it's done.  Unfortunately, the iput will see that i_nlink == 0
> > > > > and decide to truncate & free the inode, which prevents us from
> > > > > replaying subsequent BUIs.  We can't skip the BUIs because we have to
> > > > > replay all the redo items to ensure that atomic operations complete.
> > > > > 
> > ...
> > > > 
> > > > > Since unlinked inode recovery will reap the inode anyway, we can
> > > > > safely introduce a new inode flag to indicate that an inode is in this
> > > > > 'unlinked recovery' state and should not be auto-reaped in the
> > > > > drop_inode path.
> > > > > 
> > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > ---
> > > > >  fs/xfs/xfs_bmap_item.c   |    1 +
> > > > >  fs/xfs/xfs_inode.c       |    8 ++++++++
> > > > >  fs/xfs/xfs_inode.h       |    6 ++++++
> > > > >  fs/xfs/xfs_log_recover.c |    1 +
> > > > >  4 files changed, 16 insertions(+)
> > > > > 
> > > > > 
> > ...
> > > > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > > > > index e08eaea..0c25a76 100644
> > > > > --- a/fs/xfs/xfs_inode.c
> > > > > +++ b/fs/xfs/xfs_inode.c
> > > > > @@ -1855,6 +1855,14 @@ xfs_inactive(
> > > > >  	if (mp->m_flags & XFS_MOUNT_RDONLY)
> > > > >  		return;
> > > > >  
> > > > > +	/*
> > > > > +	 * If this unlinked inode is in the middle of recovery, don't
> > > > > +	 * truncate and free the inode just yet; log recovery will take
> > > > > +	 * care of that.  See the comment for this inode flag.
> > > > > +	 */
> > > > > +	if (xfs_iflags_test(ip, XFS_IRECOVER_UNLINKED))
> > > > > +		return;
> > > > > +
> > > > 
> > > > Also, it might be better to push this one block of code down since the
> > > > following block still deals with i_nlink > 0 properly (not that it will
> > > > likely affect the code as it is now, since we only handle eofblocks
> > > > trimming atm).
> > > 
> > > I put the jump-out case there so that we touch the inode's bmap as little
> > > as possible while we're recovering the inode.  Since the inode is still
> > > around in memory, we'll end up back there at a later point anyway.
> > > 
> > 
> > I'm not quite following... it looks like we set the reclaim tag on the
> > inode unconditionally after we get through xfs_inactive(). That implies
> > the in-memory inode can go away at any point thereafter, unless somebody
> > else comes along and happens to look for it. Hmm?
> 
> Yup - the iunlink recover check needs to go into xfs_fs_drop_inode()
> to determine whether the inode should be dropped from the cache or
> not by iput_final(). That way it will never get near xfs_inactive()
> because the VFS won't try to evict it until the
> XFS_IRECOVER_UNLINKED flag is cleared.

(Ick, the iput code...)

So.... paging some of my notes back into memory: if MS_ACTIVE is not
set on the superblock, iput_final() will still evict() an i_count == 0
inode even if op->drop_inode says not to drop it:

iput_final() {
  if (op->drop_inode)
  	drop = op->drop_inode(inode);
  else
  	drop = generic_drop_inode(inode);

  if (!drop && (sb->s_flags & MS_ACTIVE)) {
  	inode->i_state |= I_REFERENCED;
  	inode_add_lru(inode);
  	spin_unlock(&inode->i_lock);
  	return;
  }

  /* do stuff */

  evict(inode);
}

MS_ACTIVE isn't set on the superblock during recovery because the
VFS doesn't set it until fill_super succeeds, and fill_super doesn't
return until we're done with log recovery.  Therefore, we can end up
in xfs_inactive during recovery even if _drop_inode just told the VFS
not to evict the inode.

IIRC that's why the IRECOVERY check ended up in xfs_inactive.  I
don't mind adding a second IRECOVERY check to xfs_fs_drop_inode,
but removing the one in xfs_inactive breaks recovery (xfs/329).

(Or, per a suggestion of Dave, I could just set MS_ACTIVE prior to
second stage log recovery.)

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

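The interaction described above reduces to a small truth table. `keep_in_cache()` below is a hypothetical model of the quoted iput_final() excerpt, not VFS code; it shows why the IRECOVERY check is still needed in xfs_inactive() during log recovery:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Model of the iput_final() behaviour quoted above: an i_count == 0
 * inode is kept in the cache only when BOTH ->drop_inode said "don't
 * drop" AND the superblock has MS_ACTIVE set.  During log recovery
 * MS_ACTIVE is not yet set, so the inode is evicted regardless of what
 * ->drop_inode returns, and recovery can still reach xfs_inactive().
 */
static bool keep_in_cache(bool drop, bool ms_active)
{
	return !drop && ms_active;
}
```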

* Re: [PATCH 34/63] xfs: support removing extents from CoW fork
  2016-09-30  3:09 ` [PATCH 34/63] xfs: support removing extents from " Darrick J. Wong
  2016-09-30  7:46   ` Christoph Hellwig
@ 2016-10-05 18:26   ` Brian Foster
  1 sibling, 0 replies; 187+ messages in thread
From: Brian Foster @ 2016-10-05 18:26 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs

On Thu, Sep 29, 2016 at 08:09:21PM -0700, Darrick J. Wong wrote:
> Create a helper method to remove extents from the CoW fork without
> any of the side effects (rmapbt/bmbt updates) of the regular extent
> deletion routine.  We'll eventually use this to clear out the CoW fork
> during ioend processing.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
> v2: Use bmapi_read to iterate and trim the CoW extents instead of
> reading them raw via the iext code.
> ---
>  fs/xfs/libxfs/xfs_bmap.c |  176 ++++++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/libxfs/xfs_bmap.h |    1 
>  2 files changed, 177 insertions(+)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 85a0c86..451f3e4 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -4906,6 +4906,7 @@ xfs_bmap_del_extent(
>  		/*
>  		 * Matches the whole extent.  Delete the entry.
>  		 */
> +		trace_xfs_bmap_pre_update(ip, *idx, state, _THIS_IP_);
>  		xfs_iext_remove(ip, *idx, 1,
>  				whichfork == XFS_ATTR_FORK ? BMAP_ATTRFORK : 0);
>  		--*idx;
> @@ -5123,6 +5124,181 @@ xfs_bmap_del_extent(
>  }
>  
>  /*
> + * xfs_bunmapi_cow() -- Remove the relevant parts of the CoW fork.
> + *			See xfs_bmap_del_extent.
> + * @ip: XFS inode.
> + * @idx: Extent number to delete.
> + * @del: Extent to remove.
> + */
> +int
> +xfs_bunmapi_cow(
> +	xfs_inode_t		*ip,
> +	xfs_bmbt_irec_t		*del)
> +{
> +	xfs_filblks_t		da_new;	/* new delay-alloc indirect blocks */
> +	xfs_filblks_t		da_old;	/* old delay-alloc indirect blocks */
> +	xfs_fsblock_t		del_endblock = 0;/* first block past del */
> +	xfs_fileoff_t		del_endoff;	/* first offset past del */
> +	int			delay;	/* current block is delayed allocated */
> +	xfs_bmbt_rec_host_t	*ep;	/* current extent entry pointer */
> +	int			error;	/* error return value */
> +	xfs_bmbt_irec_t		got;	/* current extent entry */
> +	xfs_fileoff_t		got_endoff;	/* first offset past got */
> +	xfs_ifork_t		*ifp;	/* inode fork pointer */
> +	xfs_mount_t		*mp;	/* mount structure */
> +	xfs_filblks_t		nblks;	/* quota/sb block count */
> +	xfs_bmbt_irec_t		new;	/* new record to be inserted */
> +	/* REFERENCED */
> +	uint			qfield;	/* quota field to update */
> +	xfs_filblks_t		temp;	/* for indirect length calculations */
> +	xfs_filblks_t		temp2;	/* for indirect length calculations */
> +	int			state = BMAP_COWFORK;
> +	int			eof;
> +	xfs_extnum_t		eidx;
> +
> +	mp = ip->i_mount;
> +	XFS_STATS_INC(mp, xs_del_exlist);
> +
> +	ep = xfs_bmap_search_extents(ip, del->br_startoff, XFS_COW_FORK, &eof,
> +			&eidx, &got, &new);
> +
> +	ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK); ifp = ifp;
> +	ASSERT((eidx >= 0) && (eidx < ifp->if_bytes /
> +		(uint)sizeof(xfs_bmbt_rec_t)));

The alignment above is a little wonky. E.g.:

	ASSERT((eidx >= 0) &&
	       (eidx < ifp->if_bytes / (uint)sizeof(xfs_bmbt_rec_t)));

... but it sounds like hch plans to clean this up and rework it in the
near future.

> +	ASSERT(del->br_blockcount > 0);
> +	ASSERT(got.br_startoff <= del->br_startoff);
> +	del_endoff = del->br_startoff + del->br_blockcount;
> +	got_endoff = got.br_startoff + got.br_blockcount;
> +	ASSERT(got_endoff >= del_endoff);
> +	delay = isnullstartblock(got.br_startblock);
> +	ASSERT(isnullstartblock(del->br_startblock) == delay);
> +	qfield = 0;
> +	error = 0;
> +	/*
> +	 * If deleting a real allocation, must free up the disk space.
> +	 */
> +	if (!delay) {
> +		nblks = del->br_blockcount;
> +		qfield = XFS_TRANS_DQ_BCOUNT;
> +		/*
> +		 * Set up del_endblock and cur for later.
> +		 */
> +		del_endblock = del->br_startblock + del->br_blockcount;
> +		da_old = da_new = 0;
> +	} else {
> +		da_old = startblockval(got.br_startblock);
> +		da_new = 0;
> +		nblks = 0;
> +	}
> +	qfield = qfield;
> +	nblks = nblks;
> +
> +	/*
> +	 * Set flag value to use in switch statement.
> +	 * Left-contig is 2, right-contig is 1.
> +	 */
> +	switch (((got.br_startoff == del->br_startoff) << 1) |
> +		(got_endoff == del_endoff)) {
> +	case 3:
> +		/*
> +		 * Matches the whole extent.  Delete the entry.
> +		 */
> +		xfs_iext_remove(ip, eidx, 1, BMAP_COWFORK);
> +		--eidx;
> +		break;
> +
> +	case 2:
> +		/*
> +		 * Deleting the first part of the extent.
> +		 */
> +		trace_xfs_bmap_pre_update(ip, eidx, state, _THIS_IP_);
> +		xfs_bmbt_set_startoff(ep, del_endoff);
> +		temp = got.br_blockcount - del->br_blockcount;
> +		xfs_bmbt_set_blockcount(ep, temp);
> +		if (delay) {
> +			temp = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(ip, temp),
> +				da_old);
> +			xfs_bmbt_set_startblock(ep, nullstartblock((int)temp));
> +			trace_xfs_bmap_post_update(ip, eidx, state, _THIS_IP_);
> +			da_new = temp;
> +			break;
> +		}
> +		xfs_bmbt_set_startblock(ep, del_endblock);
> +		trace_xfs_bmap_post_update(ip, eidx, state, _THIS_IP_);
> +		break;
> +
> +	case 1:
> +		/*
> +		 * Deleting the last part of the extent.
> +		 */
> +		temp = got.br_blockcount - del->br_blockcount;
> +		trace_xfs_bmap_pre_update(ip, eidx, state, _THIS_IP_);
> +		xfs_bmbt_set_blockcount(ep, temp);
> +		if (delay) {
> +			temp = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(ip, temp),
> +				da_old);
> +			xfs_bmbt_set_startblock(ep, nullstartblock((int)temp));
> +			trace_xfs_bmap_post_update(ip, eidx, state, _THIS_IP_);
> +			da_new = temp;
> +			break;
> +		}
> +		trace_xfs_bmap_post_update(ip, eidx, state, _THIS_IP_);
> +		break;
> +
> +	case 0:
> +		/*
> +		 * Deleting the middle of the extent.
> +		 */
> +		temp = del->br_startoff - got.br_startoff;
> +		trace_xfs_bmap_pre_update(ip, eidx, state, _THIS_IP_);
> +		xfs_bmbt_set_blockcount(ep, temp);
> +		new.br_startoff = del_endoff;
> +		temp2 = got_endoff - del_endoff;
> +		new.br_blockcount = temp2;
> +		new.br_state = got.br_state;
> +		if (!delay) {
> +			new.br_startblock = del_endblock;
> +		} else {
> +			temp = xfs_bmap_worst_indlen(ip, temp);
> +			xfs_bmbt_set_startblock(ep, nullstartblock((int)temp));
> +			temp2 = xfs_bmap_worst_indlen(ip, temp2);
> +			new.br_startblock = nullstartblock((int)temp2);
> +			da_new = temp + temp2;
> +			while (da_new > da_old) {
> +				if (temp) {
> +					temp--;
> +					da_new--;
> +					xfs_bmbt_set_startblock(ep,
> +						nullstartblock((int)temp));
> +				}
> +				if (da_new == da_old)
> +					break;
> +				if (temp2) {
> +					temp2--;
> +					da_new--;
> +					new.br_startblock =
> +						nullstartblock((int)temp2);
> +				}
> +			}

xfs_bmap_split_indlen() ?

Brian

> +		}
> +		trace_xfs_bmap_post_update(ip, eidx, state, _THIS_IP_);
> +		xfs_iext_insert(ip, eidx + 1, 1, &new, state);
> +		++eidx;
> +		break;
> +	}
> +
> +	/*
> +	 * Account for change in delayed indirect blocks.
> +	 * Nothing to do for disk quota accounting here.
> +	 */
> +	ASSERT(da_old >= da_new);
> +	if (da_old > da_new)
> +		xfs_mod_fdblocks(mp, (int64_t)(da_old - da_new), false);
> +
> +	return error;
> +}
> +
> +/*
>   * Unmap (remove) blocks from a file.
>   * If nexts is nonzero then the number of extents to remove is limited to
>   * that value.  If not all extents in the block range can be removed then
> diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
> index 75b1a1f..7c4ad01 100644
> --- a/fs/xfs/libxfs/xfs_bmap.h
> +++ b/fs/xfs/libxfs/xfs_bmap.h
> @@ -221,6 +221,7 @@ int	xfs_bunmapi(struct xfs_trans *tp, struct xfs_inode *ip,
>  		xfs_fileoff_t bno, xfs_filblks_t len, int flags,
>  		xfs_extnum_t nexts, xfs_fsblock_t *firstblock,
>  		struct xfs_defer_ops *dfops, int *done);
> +int	xfs_bunmapi_cow(struct xfs_inode *ip, struct xfs_bmbt_irec *del);
>  int	xfs_check_nostate_extents(struct xfs_ifork *ifp, xfs_extnum_t idx,
>  		xfs_extnum_t num);
>  uint	xfs_default_attroffset(struct xfs_inode *ip);
> 

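The middle-delete case in the patch above shrinks two worst-case indirect-block estimates until they fit within the old delayed-allocation reservation, which is what Brian's xfs_bmap_split_indlen() suggestion would encapsulate. A standalone sketch of that alternating-decrement loop, with simplified integer types and a hypothetical name:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of the loop quoted above: the old reservation da_old must cover
 * the worst-case indirect-block needs of the two new extents (*len1,
 * *len2).  Shrink the two estimates one block at a time, alternating
 * between them, until their sum fits within da_old.
 */
static void split_indlen(uint64_t da_old, uint64_t *len1, uint64_t *len2)
{
	uint64_t da_new = *len1 + *len2;

	while (da_new > da_old) {
		if (*len1) {
			(*len1)--;
			da_new--;
		}
		if (da_new == da_old)
			break;
		if (*len2) {
			(*len2)--;
			da_new--;
		}
	}
}
```

Alternating the decrements keeps the shortfall spread roughly evenly across the two halves instead of exhausting one estimate first.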

* Re: [PATCH 35/63] xfs: move mappings from cow fork to data fork after copy-write
  2016-09-30  3:09 ` [PATCH 35/63] xfs: move mappings from cow fork to data fork after copy-write Darrick J. Wong
@ 2016-10-05 18:26   ` Brian Foster
  2016-10-05 21:22     ` Darrick J. Wong
  0 siblings, 1 reply; 187+ messages in thread
From: Brian Foster @ 2016-10-05 18:26 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, Christoph Hellwig

On Thu, Sep 29, 2016 at 08:09:27PM -0700, Darrick J. Wong wrote:
> After the write component of a copy-write operation finishes, clean up
> the bookkeeping left behind.  On error, we simply free the new blocks
> and pass the error up.  If we succeed, however, then we must remove
> the old data fork mapping and move the cow fork mapping to the data
> fork.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> [hch: Call the CoW failure function during xfs_cancel_ioend]
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
> v2: If CoW fails, we need to remove the CoW fork mapping and free the
> blocks.  Furthermore, if xfs_cancel_ioend happens, we also need to
> clean out all the CoW record keeping.
> 
> v3: When we're removing CoW extents, only free one extent per
> transaction to avoid running out of reservation.  Also,
> xfs_cancel_ioend mustn't clean out the CoW fork because it is called
> when async writeback can't get an inode lock and will try again.
> 
> v4: Use bmapi_read to iterate the CoW fork instead of calling the
> iext functions directly, and make the CoW remapping atomic by
> using the deferred ops mechanism which takes care of logging redo
> items for us.
> 
> v5: Unlock the inode if cancelling the CoW reservation fails.
> ---
>  fs/xfs/xfs_aops.c    |   22 ++++
>  fs/xfs/xfs_reflink.c |  271 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_reflink.h |    8 +
>  3 files changed, 299 insertions(+), 2 deletions(-)
> 
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 7b1e9de..aa23993 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -288,6 +288,23 @@ xfs_end_io(
>  		error = -EIO;
>  
>  	/*
> +	 * For a CoW extent, we need to move the mapping from the CoW fork
> +	 * to the data fork.  If instead an error happened, just dump the
> +	 * new blocks.
> +	 */
> +	if (ioend->io_type == XFS_IO_COW) {
> +		if (ioend->io_bio->bi_error) {
> +			error = xfs_reflink_cancel_cow_range(ip,
> +					ioend->io_offset, ioend->io_size);
> +			goto done;
> +		}

I'm a little confused about why we'd clear out delalloc blocks here but
not when the write fails in the first place (though I take it the
explanation for my previous comment still applies...).

> +		error = xfs_reflink_end_cow(ip, ioend->io_offset,
> +				ioend->io_size);
> +		if (error)
> +			goto done;

This hunk clobbers 'error' if it was already set by the preceding fs
shutdown check (not shown in the diff).
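[The fix being asked for is the usual keep-the-first-error idiom: only record a new error code when none has been recorded yet. A minimal userspace sketch (record_error() is a hypothetical helper, not a kernel function):]

```c
#include <assert.h>
#include <errno.h>

/*
 * Keep the first failure: a later completion step must not overwrite
 * an error an earlier check already recorded (e.g. -EIO from a
 * filesystem shutdown test).  Hypothetical helper, illustrative only.
 */
void record_error(int *error, int new_error)
{
	if (!*error)
		*error = new_error;
}
```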

> +	}
> +
> +	/*
>  	 * For unwritten extents we need to issue transactions to convert a
>  	 * range to normal written extens after the data I/O has finished.
>  	 * Detecting and handling completion IO errors is done individually
> @@ -302,7 +319,8 @@ xfs_end_io(
>  	} else if (ioend->io_append_trans) {
>  		error = xfs_setfilesize_ioend(ioend, error);
>  	} else {
> -		ASSERT(!xfs_ioend_is_append(ioend));
> +		ASSERT(!xfs_ioend_is_append(ioend) ||
> +		       ioend->io_type == XFS_IO_COW);
>  	}
>  
>  done:
> @@ -316,7 +334,7 @@ xfs_end_bio(
>  	struct xfs_ioend	*ioend = bio->bi_private;
>  	struct xfs_mount	*mp = XFS_I(ioend->io_inode)->i_mount;
>  
> -	if (ioend->io_type == XFS_IO_UNWRITTEN)
> +	if (ioend->io_type == XFS_IO_UNWRITTEN || ioend->io_type == XFS_IO_COW)
>  		queue_work(mp->m_unwritten_workqueue, &ioend->io_work);
>  	else if (ioend->io_append_trans)
>  		queue_work(mp->m_data_workqueue, &ioend->io_work);
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index e8c7c85..d913ad1 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -52,6 +52,7 @@
>  #include "xfs_bmap_btree.h"
>  #include "xfs_reflink.h"
>  #include "xfs_iomap.h"
> +#include "xfs_rmap_btree.h"
>  
>  /*
>   * Copy on Write of Shared Blocks
> @@ -114,6 +115,37 @@
>   * ioend, the better.
>   */
>  
> +/* Trim extent to fit a logical block range. */
> +static void
> +xfs_trim_extent(
> +	struct xfs_bmbt_irec	*irec,
> +	xfs_fileoff_t		bno,
> +	xfs_filblks_t		len)
> +{
> +	xfs_fileoff_t		distance;
> +	xfs_fileoff_t		end = bno + len;
> +
> +	if (irec->br_startoff + irec->br_blockcount <= bno ||
> +	    irec->br_startoff >= end) {

Hmm, this seems like slightly strange behavior. Why reset blockcount on
an extent for a request to trim it to something beyond the extent? Is
this primarily an error/sanity check or is this an expected case?

As it is, it looks like bno should point to the end of irec in most
cases unless the unmap happens to remove less from the data fork than
what has been allocated in the cow fork (which seems sane). I wonder if
we just want to ASSERT() that the extent and trim range are sane here?
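[For readers following the thread: the trim semantics under discussion can be modelled in userspace C. Type and field names below are simplified stand-ins for struct xfs_bmbt_irec, and the sketch deliberately skips the DELAYSTARTBLOCK/HOLESTARTBLOCK special cases in the quoted patch; note how a wholly disjoint range collapses the extent to blockcount == 0, which is the behavior being questioned above.]

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-ins for the kernel's types; names are hypothetical. */
typedef uint64_t fileoff_t;
typedef uint64_t filblks_t;

struct irec {
	fileoff_t startoff;	/* logical file offset of the extent */
	fileoff_t startblock;	/* physical start block (simplified) */
	filblks_t blockcount;	/* length in blocks */
};

/* Trim *irec to the logical block range [bno, bno + len). */
void trim_extent(struct irec *irec, fileoff_t bno, filblks_t len)
{
	fileoff_t end = bno + len;
	fileoff_t distance;

	/* Range does not overlap the extent at all: collapse to empty. */
	if (irec->startoff + irec->blockcount <= bno || irec->startoff >= end) {
		irec->blockcount = 0;
		return;
	}

	/* Trim the front of the extent forward to bno. */
	if (irec->startoff < bno) {
		distance = bno - irec->startoff;
		irec->startblock += distance;
		irec->startoff += distance;
		irec->blockcount -= distance;
	}

	/* Trim the back of the extent down to end. */
	if (end < irec->startoff + irec->blockcount)
		irec->blockcount -= irec->startoff + irec->blockcount - end;
}
```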

> +		irec->br_blockcount = 0;
> +		return;
> +	}
> +
> +	if (irec->br_startoff < bno) {
> +		distance = bno - irec->br_startoff;
> +		if (irec->br_startblock != DELAYSTARTBLOCK &&
> +		    irec->br_startblock != HOLESTARTBLOCK)
> +			irec->br_startblock += distance;
> +		irec->br_startoff += distance;
> +		irec->br_blockcount -= distance;
> +	}
> +
> +	if (end < irec->br_startoff + irec->br_blockcount) {
> +		distance = irec->br_startoff + irec->br_blockcount - end;
> +		irec->br_blockcount -= distance;
> +	}
> +}
> +
>  /*
>   * Given an AG extent, find the lowest-numbered run of shared blocks within
>   * that range and return the range in fbno/flen.
> @@ -400,3 +432,242 @@ xfs_reflink_trim_irec_to_next_cow(
>  
>  	return 0;
>  }
> +
> +/*
> + * Cancel all pending CoW reservations for some block range of an inode.
> + */
> +int
> +xfs_reflink_cancel_cow_blocks(
> +	struct xfs_inode		*ip,
> +	struct xfs_trans		**tpp,
> +	xfs_fileoff_t			offset_fsb,
> +	xfs_fileoff_t			end_fsb)
> +{
> +	struct xfs_bmbt_irec		irec;
> +	xfs_filblks_t			count_fsb;
> +	xfs_fsblock_t			firstfsb;
> +	struct xfs_defer_ops		dfops;
> +	int				error = 0;
> +	int				nimaps;
> +
> +	if (!xfs_is_reflink_inode(ip))
> +		return 0;
> +
> +	/* Go find the old extent in the CoW fork. */
> +	count_fsb = (xfs_filblks_t)(end_fsb - offset_fsb);
> +	while (count_fsb) {
> +		nimaps = 1;
> +		error = xfs_bmapi_read(ip, offset_fsb, count_fsb, &irec,
> +				&nimaps, XFS_BMAPI_COWFORK);
> +		if (error)
> +			break;
> +		ASSERT(nimaps == 1);
> +
> +		trace_xfs_reflink_cancel_cow(ip, &irec);
> +
> +		if (irec.br_startblock == DELAYSTARTBLOCK) {
> +			/* Free a delayed allocation. */
> +			xfs_mod_fdblocks(ip->i_mount, irec.br_blockcount,
> +					false);
> +			ip->i_delayed_blks -= irec.br_blockcount;
> +
> +			/* Remove the mapping from the CoW fork. */
> +			error = xfs_bunmapi_cow(ip, &irec);
> +			if (error)
> +				break;
> +		} else if (irec.br_startblock == HOLESTARTBLOCK) {
> +			/* empty */
> +		} else {
> +			xfs_trans_ijoin(*tpp, ip, 0);
> +			xfs_defer_init(&dfops, &firstfsb);
> +
> +			xfs_bmap_add_free(ip->i_mount, &dfops,
> +					irec.br_startblock, irec.br_blockcount,
> +					NULL);
> +
> +			/* Update quota accounting */
> +			xfs_trans_mod_dquot_byino(*tpp, ip, XFS_TRANS_DQ_BCOUNT,
> +					-(long)irec.br_blockcount);
> +
> +			/* Roll the transaction */
> +			error = xfs_defer_finish(tpp, &dfops, ip);
> +			if (error) {
> +				xfs_defer_cancel(&dfops);
> +				break;
> +			}
> +
> +			/* Remove the mapping from the CoW fork. */
> +			error = xfs_bunmapi_cow(ip, &irec);
> +			if (error)
> +				break;
> +		}
> +
> +		/* Roll on... */
> +		count_fsb -= irec.br_startoff + irec.br_blockcount - offset_fsb;

Might be wise to safeguard against the extent being larger than the
range (or just use offset_fsb and kill count_fsb)...
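[The hazard here is unsigned arithmetic: if the extent ends past the requested range, subtracting (extent_end - offset_fsb) from count_fsb wraps xfs_filblks_t to a huge value. Driving the loop on offset_fsb alone, clamped to end_fsb, makes termination trivially safe. A tiny sketch of the clamp (hypothetical helper name):]

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t fileoff_t;

/*
 * Advance the iteration cursor past an extent without letting an
 * extent that overshoots the range wrap a remaining-block counter:
 * the loop simply runs while (offset_fsb < end_fsb).
 */
fileoff_t advance_cursor(fileoff_t extent_end, fileoff_t end_fsb)
{
	return extent_end < end_fsb ? extent_end : end_fsb;
}
```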

> +		offset_fsb = irec.br_startoff + irec.br_blockcount;
> +	}
> +
> +	return error;
> +}
> +
> +/*
> + * Cancel all pending CoW reservations for some byte range of an inode.
> + */
> +int
> +xfs_reflink_cancel_cow_range(
> +	struct xfs_inode	*ip,
> +	xfs_off_t		offset,
> +	xfs_off_t		count)
> +{
> +	struct xfs_trans	*tp;
> +	xfs_fileoff_t		offset_fsb;
> +	xfs_fileoff_t		end_fsb;
> +	int			error;
> +
> +	trace_xfs_reflink_cancel_cow_range(ip, offset, count);
> +
> +	offset_fsb = XFS_B_TO_FSBT(ip->i_mount, offset);
> +	if (count == NULLFILEOFF)
> +		end_fsb = NULLFILEOFF;
> +	else
> +		end_fsb = XFS_B_TO_FSB(ip->i_mount, offset + count);
> +
> +	/* Start a rolling transaction to remove the mappings */
> +	error = xfs_trans_alloc(ip->i_mount, &M_RES(ip->i_mount)->tr_write,
> +			0, 0, 0, &tp);
> +	if (error)
> +		goto out;
> +
> +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> +	xfs_trans_ijoin(tp, ip, 0);
> +
> +	/* Scrape out the old CoW reservations */
> +	error = xfs_reflink_cancel_cow_blocks(ip, &tp, offset_fsb, end_fsb);
> +	if (error)
> +		goto out_defer;
> +
> +	error = xfs_trans_commit(tp);
> +
> +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +	return error;
> +
> +out_defer:

out_cancel ?

> +	xfs_trans_cancel(tp);
> +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +out:
> +	trace_xfs_reflink_cancel_cow_range_error(ip, error, _RET_IP_);
> +	return error;
> +}
> +
> +/*
> + * Remap parts of a file's data fork after a successful CoW.
> + */
> +int
> +xfs_reflink_end_cow(
> +	struct xfs_inode		*ip,
> +	xfs_off_t			offset,
> +	xfs_off_t			count)
> +{
> +	struct xfs_bmbt_irec		irec;
> +	struct xfs_bmbt_irec		uirec;
> +	struct xfs_trans		*tp;
> +	xfs_fileoff_t			offset_fsb;
> +	xfs_fileoff_t			end_fsb;
> +	xfs_filblks_t			count_fsb;
> +	xfs_fsblock_t			firstfsb;
> +	struct xfs_defer_ops		dfops;
> +	int				error;
> +	unsigned int			resblks;
> +	xfs_filblks_t			ilen;
> +	xfs_filblks_t			rlen;
> +	int				nimaps;
> +
> +	trace_xfs_reflink_end_cow(ip, offset, count);
> +
> +	offset_fsb = XFS_B_TO_FSBT(ip->i_mount, offset);
> +	end_fsb = XFS_B_TO_FSB(ip->i_mount, offset + count);
> +	count_fsb = (xfs_filblks_t)(end_fsb - offset_fsb);
> +
> +	/* Start a rolling transaction to switch the mappings */
> +	resblks = XFS_EXTENTADD_SPACE_RES(ip->i_mount, XFS_DATA_FORK);
> +	error = xfs_trans_alloc(ip->i_mount, &M_RES(ip->i_mount)->tr_write,
> +			resblks, 0, 0, &tp);
> +	if (error)
> +		goto out;

I forget the exact reason why we preallocate append transactions for I/O
completion, but it would be nice if Dave or somebody could chime in on
that to make sure we don't need to do something similar here (and for
the cancel case).

> +
> +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> +	xfs_trans_ijoin(tp, ip, 0);
> +
> +	/* Go find the old extent in the CoW fork. */
> +	while (count_fsb) {
> +		/* Read extent from the source file */
> +		nimaps = 1;
> +		error = xfs_bmapi_read(ip, offset_fsb, count_fsb, &irec,
> +				&nimaps, XFS_BMAPI_COWFORK);
> +		if (error)
> +			goto out_cancel;
> +		ASSERT(nimaps == 1);
> +
> +		ASSERT(irec.br_startblock != DELAYSTARTBLOCK);
> +		trace_xfs_reflink_cow_remap(ip, &irec);
> +
> +		/*
> +		 * We can have a hole in the CoW fork if part of a directio
> +		 * write is CoW but part of it isn't.
> +		 */
> +		rlen = ilen = irec.br_blockcount;
> +		if (irec.br_startblock == HOLESTARTBLOCK)
> +			goto next_extent;
> +
> +		/* Unmap the old blocks in the data fork. */
> +		while (rlen) {
> +			xfs_defer_init(&dfops, &firstfsb);
> +			error = __xfs_bunmapi(tp, ip, irec.br_startoff,
> +					&rlen, 0, 1, &firstfsb, &dfops);
> +			if (error)
> +				goto out_defer;
> +
> +			/* Trim the extent to whatever got unmapped. */
> +			uirec = irec;
> +			xfs_trim_extent(&uirec, irec.br_startoff + rlen,
> +					irec.br_blockcount - rlen);

We assign uirec = irec, then pass calculated values based on irec and
rlen. How about xfs_trim_extent(&uirec, rlen)?

Also, it took me a while to grok that we "trim" the beginning of the
extent because bunmapi works backwards. A comment would be appreciated
here.  ;)

Brian
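[To spell out the "bunmapi works backwards" point: __xfs_bunmapi() removes blocks from the *end* of the requested range and reports back how many blocks (rlen) remain mapped at the front, so the piece that actually got unmapped, and is now eligible for remapping from the CoW fork, is the tail of the original extent. A userspace sketch (helper name is hypothetical):]

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t fileoff_t;
typedef uint64_t filblks_t;

struct extent {
	fileoff_t startoff;
	filblks_t blockcount;
};

/*
 * After a partial backwards unmap leaves `rlen` blocks still mapped at
 * the front of `orig`, the unmapped piece is the tail
 * [startoff + rlen, startoff + blockcount).  Illustrative only.
 */
struct extent unmapped_tail(struct extent orig, filblks_t rlen)
{
	struct extent tail = {
		.startoff = orig.startoff + rlen,
		.blockcount = orig.blockcount - rlen,
	};
	return tail;
}
```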

> +			irec.br_blockcount = rlen;
> +			trace_xfs_reflink_cow_remap_piece(ip, &uirec);
> +
> +			/* Map the new blocks into the data fork. */
> +			error = xfs_bmap_map_extent(tp->t_mountp, &dfops,
> +					ip, XFS_DATA_FORK, &uirec);
> +			if (error)
> +				goto out_defer;
> +
> +			/* Remove the mapping from the CoW fork. */
> +			error = xfs_bunmapi_cow(ip, &uirec);
> +			if (error)
> +				goto out_defer;
> +
> +			error = xfs_defer_finish(&tp, &dfops, ip);
> +			if (error)
> +				goto out_defer;
> +		}
> +
> +next_extent:
> +		/* Roll on... */
> +		count_fsb -= irec.br_startoff + ilen - offset_fsb;
> +		offset_fsb = irec.br_startoff + ilen;
> +	}
> +
> +	error = xfs_trans_commit(tp);
> +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +	if (error)
> +		goto out;
> +	return 0;
> +
> +out_defer:
> +	xfs_defer_cancel(&dfops);
> +out_cancel:
> +	xfs_trans_cancel(tp);
> +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +out:
> +	trace_xfs_reflink_end_cow_error(ip, error, _RET_IP_);
> +	return error;
> +}
> diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
> index 11408c0..bffa4be 100644
> --- a/fs/xfs/xfs_reflink.h
> +++ b/fs/xfs/xfs_reflink.h
> @@ -33,4 +33,12 @@ extern bool xfs_reflink_find_cow_mapping(struct xfs_inode *ip, xfs_off_t offset,
>  extern int xfs_reflink_trim_irec_to_next_cow(struct xfs_inode *ip,
>  		xfs_fileoff_t offset_fsb, struct xfs_bmbt_irec *imap);
>  
> +extern int xfs_reflink_cancel_cow_blocks(struct xfs_inode *ip,
> +		struct xfs_trans **tpp, xfs_fileoff_t offset_fsb,
> +		xfs_fileoff_t end_fsb);
> +extern int xfs_reflink_cancel_cow_range(struct xfs_inode *ip, xfs_off_t offset,
> +		xfs_off_t count);
> +extern int xfs_reflink_end_cow(struct xfs_inode *ip, xfs_off_t offset,
> +		xfs_off_t count);
> +
>  #endif /* __XFS_REFLINK_H */
> 

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 37/63] xfs: implement CoW for directio writes
  2016-09-30  3:09 ` [PATCH 37/63] xfs: implement CoW for directio writes Darrick J. Wong
@ 2016-10-05 18:27   ` Brian Foster
  2016-10-05 20:55     ` Darrick J. Wong
  0 siblings, 1 reply; 187+ messages in thread
From: Brian Foster @ 2016-10-05 18:27 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, Christoph Hellwig

On Thu, Sep 29, 2016 at 08:09:40PM -0700, Darrick J. Wong wrote:
> For O_DIRECT writes to shared blocks, we have to CoW them just like
> we would with buffered writes.  For writes that are not block-aligned,
> just bounce them to the page cache.
> 
> For block-aligned writes, however, we can do better than that.  Use
> the same mechanisms that we employ for buffered CoW to set up a
> delalloc reservation, allocate all the blocks at once, issue the
> writes against the new blocks and use the same ioend functions to
> remap the blocks after the write.  This should be fairly performant.
> 
> Christoph discovered that xfs_reflink_allocate_cow_range may stumble
> over invalid entries in the extent array given that it drops the ilock
> but still expects the index to be stable.  Simply fixing it to do a
> new lookup for every iteration still isn't correct given that
> xfs_bmapi_allocate will trigger a BUG_ON() if it hits a hole, and
> there is nothing preventing an xfs_bunmapi_cow call from removing
> extents once we've dropped the ilock either.
> 
> This patch duplicates the inner loop of xfs_bmapi_allocate into a
> helper for xfs_reflink_allocate_cow_range so that it can be done under
> the same ilock critical section as our CoW fork delayed allocation.
> The directio CoW warts will be revisited in a later patch.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
> v2: Turns out that there's no way for xfs_end_io_direct_write to know
> if the write completed successfully.  Therefore, do /not/ use the
> ioend for dio cow post-processing; instead, move it to xfs_vm_do_dio
> where we *can* tell if the write succeeded or not.
> 
> v3: Update the file size if we do a directio CoW across EOF.  This
> can happen if the last block is shared, the cowextsize hint is set,
> and we do a dio write past the end of the file.
> 
> v4: Christoph rewrote the allocate code to fix some concurrency
> problems as part of migrating the code to support iomap.
> ---
>  fs/xfs/xfs_aops.c    |   91 +++++++++++++++++++++++++++++++++++++++----
>  fs/xfs/xfs_file.c    |   20 ++++++++-
>  fs/xfs/xfs_reflink.c |  107 +++++++++++++++++++++++++++++++++++++++++++++++++-
>  fs/xfs/xfs_reflink.h |    2 +
>  fs/xfs/xfs_trace.h   |    1 
>  5 files changed, 208 insertions(+), 13 deletions(-)
> 
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 1d0435a..62a95e4 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -40,6 +40,7 @@
>  /* flags for direct write completions */
>  #define XFS_DIO_FLAG_UNWRITTEN	(1 << 0)
>  #define XFS_DIO_FLAG_APPEND	(1 << 1)
> +#define XFS_DIO_FLAG_COW	(1 << 2)
>  
>  /*
>   * structure owned by writepages passed to individual writepage calls
> @@ -1191,18 +1192,24 @@ xfs_map_direct(
>  	struct inode		*inode,
>  	struct buffer_head	*bh_result,
>  	struct xfs_bmbt_irec	*imap,
> -	xfs_off_t		offset)
> +	xfs_off_t		offset,
> +	bool			is_cow)
>  {
>  	uintptr_t		*flags = (uintptr_t *)&bh_result->b_private;
>  	xfs_off_t		size = bh_result->b_size;
>  
>  	trace_xfs_get_blocks_map_direct(XFS_I(inode), offset, size,
> -		ISUNWRITTEN(imap) ? XFS_IO_UNWRITTEN : XFS_IO_OVERWRITE, imap);
> +		ISUNWRITTEN(imap) ? XFS_IO_UNWRITTEN : is_cow ? XFS_IO_COW :
> +		XFS_IO_OVERWRITE, imap);
>  
>  	if (ISUNWRITTEN(imap)) {
>  		*flags |= XFS_DIO_FLAG_UNWRITTEN;
>  		set_buffer_defer_completion(bh_result);
> -	} else if (offset + size > i_size_read(inode) || offset + size < 0) {
> +	} else if (is_cow) {
> +		*flags |= XFS_DIO_FLAG_COW;
> +		set_buffer_defer_completion(bh_result);
> +	}
> +	if (offset + size > i_size_read(inode) || offset + size < 0) {
>  		*flags |= XFS_DIO_FLAG_APPEND;
>  		set_buffer_defer_completion(bh_result);
>  	}
> @@ -1248,6 +1255,44 @@ xfs_map_trim_size(
>  	bh_result->b_size = mapping_size;
>  }
>  
> +/* Bounce unaligned directio writes to the page cache. */
> +static int
> +xfs_bounce_unaligned_dio_write(
> +	struct xfs_inode	*ip,
> +	xfs_fileoff_t		offset_fsb,
> +	struct xfs_bmbt_irec	*imap)
> +{
> +	struct xfs_bmbt_irec	irec;
> +	xfs_fileoff_t		delta;
> +	bool			shared;
> +	bool			x;
> +	int			error;
> +
> +	irec = *imap;
> +	if (offset_fsb > irec.br_startoff) {
> +		delta = offset_fsb - irec.br_startoff;
> +		irec.br_blockcount -= delta;
> +		irec.br_startblock += delta;
> +		irec.br_startoff = offset_fsb;
> +	}
> +	error = xfs_reflink_trim_around_shared(ip, &irec, &x, &shared);

'shared' is the 3rd parameter.

> +	if (error)
> +		return error;
> +	/*
> +	 * Are we doing a DIO write to a shared block?  In
> +	 * the ideal world we at least would fork full blocks,
> +	 * but for now just fall back to buffered mode.  Yuck.
> +	 * Use -EREMCHG ("remote address changed") to signal
> +	 * this, since in general XFS doesn't do this sort of
> +	 * fallback.
> +	 */
> +	if (shared) {
> +		trace_xfs_reflink_bounce_dio_write(ip, imap);
> +		return -EREMCHG;
> +	}

I get that this bumps the write back up to the buffered mechanism, but
the purpose is not very clear. We have the !unaligned check back up in
xfs_file_dio_aio_write(), in which case we do all of the cow allocation
stuff. I presume in that case we should never fail this check (?).

If the dio is unaligned, we skip that bit, get down to here and kick it
back only if the extent happens to be shared..? Unless I missed it, I
think this needs to be explained in the comments in both places,
including probably updating the comment at the end of dio_aio_write()
that states we don't fall back to buffered I/O on dio error. 
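[A rough decision table for the two paths being discussed, as an editorial aid (hypothetical helper; names and structure are not the kernel's): block-aligned dio on a reflink inode has its CoW blocks allocated up front in xfs_file_dio_aio_write(), so the shared check here should never fire; an unaligned dio that lands on a shared extent returns -EREMCHG so xfs_file_write_iter() retries through the page cache.]

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Sketch of the fallback decision:
 *  - aligned writes had CoW blocks allocated before get_blocks ran,
 *    so they proceed as direct I/O regardless of sharing;
 *  - unaligned writes on a shared extent bounce to buffered I/O.
 */
int classify_dio_write(bool aligned, bool shared)
{
	if (!aligned && shared)
		return -EREMCHG;	/* bounce to the page cache */
	return 0;			/* proceed with direct I/O */
}
```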

> +	return 0;
> +}
> +
>  STATIC int
>  __xfs_get_blocks(
>  	struct inode		*inode,
> @@ -1267,6 +1312,8 @@ __xfs_get_blocks(
>  	xfs_off_t		offset;
>  	ssize_t			size;
>  	int			new = 0;
> +	bool			is_cow = false;
> +	bool			need_alloc = false;
>  
>  	BUG_ON(create && !direct);
>  
> @@ -1292,8 +1339,27 @@ __xfs_get_blocks(
>  	end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + size);
>  	offset_fsb = XFS_B_TO_FSBT(mp, offset);
>  
> -	error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb,
> -				&imap, &nimaps, XFS_BMAPI_ENTIRE);
> +	if (create && direct) {
> +		is_cow = xfs_reflink_find_cow_mapping(ip, offset, &imap,
> +					&need_alloc);
> +	}

Nits: no need for braces here, and it might be cleaner to check
xfs_is_reflink_inode(ip) first, since !reflink is probably the common
case.

> +
> +	if (!is_cow) {
> +		error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb,
> +					&imap, &nimaps, XFS_BMAPI_ENTIRE);
> +		/*
> +		 * Truncate an overwrite extent if there's a pending CoW
> +		 * reservation before the end of this extent.  This forces us
> +		 * to come back to writepage to take care of the CoW.

writepage?

> +		 */
> +		if (create && direct && nimaps &&
> +		    imap.br_startblock != HOLESTARTBLOCK &&
> +		    imap.br_startblock != DELAYSTARTBLOCK &&
> +		    !ISUNWRITTEN(&imap))
> +			xfs_reflink_trim_irec_to_next_cow(ip, offset_fsb,
> +					&imap);
> +	}
> +	ASSERT(!need_alloc);
>  	if (error)
>  		goto out_unlock;
>  
> @@ -1345,6 +1411,13 @@ __xfs_get_blocks(
>  	if (imap.br_startblock != HOLESTARTBLOCK &&
>  	    imap.br_startblock != DELAYSTARTBLOCK &&
>  	    (create || !ISUNWRITTEN(&imap))) {
> +		if (create && direct && !is_cow) {
> +			error = xfs_bounce_unaligned_dio_write(ip, offset_fsb,
> +					&imap);
> +			if (error)
> +				return error;
> +		}
> +
>  		xfs_map_buffer(inode, bh_result, &imap, offset);
>  		if (ISUNWRITTEN(&imap))
>  			set_buffer_unwritten(bh_result);
> @@ -1353,7 +1426,8 @@ __xfs_get_blocks(
>  			if (dax_fault)
>  				ASSERT(!ISUNWRITTEN(&imap));
>  			else
> -				xfs_map_direct(inode, bh_result, &imap, offset);
> +				xfs_map_direct(inode, bh_result, &imap, offset,
> +						is_cow);
>  		}
>  	}
>  
> @@ -1479,7 +1553,10 @@ xfs_end_io_direct_write(
>  		trace_xfs_end_io_direct_write_unwritten(ip, offset, size);
>  
>  		error = xfs_iomap_write_unwritten(ip, offset, size);
> -	} else if (flags & XFS_DIO_FLAG_APPEND) {
> +	}
> +	if (flags & XFS_DIO_FLAG_COW)
> +		error = xfs_reflink_end_cow(ip, offset, size);
> +	if (flags & XFS_DIO_FLAG_APPEND) {
>  		trace_xfs_end_io_direct_write_append(ip, offset, size);
>  
>  		error = xfs_setfilesize(ip, offset, size);
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index f99d7fa..025d52f 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -38,6 +38,7 @@
>  #include "xfs_icache.h"
>  #include "xfs_pnfs.h"
>  #include "xfs_iomap.h"
> +#include "xfs_reflink.h"
>  
>  #include <linux/dcache.h>
>  #include <linux/falloc.h>
> @@ -672,6 +673,13 @@ xfs_file_dio_aio_write(
>  
>  	trace_xfs_file_direct_write(ip, count, iocb->ki_pos);
>  
> +	/* If this is a block-aligned directio CoW, remap immediately. */
> +	if (xfs_is_reflink_inode(ip) && !unaligned_io) {
> +		ret = xfs_reflink_allocate_cow_range(ip, iocb->ki_pos, count);
> +		if (ret)
> +			goto out;
> +	}

Is doing this allocation up front, rather than via get_blocks() like a
traditional direct write, one of the "warts" that needs cleaning, or is
it done this way for some other reason?

> +
>  	data = *from;
>  	ret = __blockdev_direct_IO(iocb, inode, target->bt_bdev, &data,
>  			xfs_get_blocks_direct, xfs_end_io_direct_write,
> @@ -812,10 +820,18 @@ xfs_file_write_iter(
>  
>  	if (IS_DAX(inode))
>  		ret = xfs_file_dax_write(iocb, from);
> -	else if (iocb->ki_flags & IOCB_DIRECT)
> +	else if (iocb->ki_flags & IOCB_DIRECT) {
> +		/*
> +		 * Allow DIO to fall back to buffered *only* in the case
> +		 * that we're doing a reflink CoW.
> +		 */
>  		ret = xfs_file_dio_aio_write(iocb, from);
> -	else
> +		if (ret == -EREMCHG)
> +			goto buffered;
> +	} else {
> +buffered:
>  		ret = xfs_file_buffered_aio_write(iocb, from);
> +	}
>  
>  	if (ret > 0) {
>  		XFS_STATS_ADD(ip->i_mount, xs_write_bytes, ret);
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index d913ad1..c95cdc3 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -246,7 +246,8 @@ static int
>  __xfs_reflink_reserve_cow(
>  	struct xfs_inode	*ip,
>  	xfs_fileoff_t		*offset_fsb,
> -	xfs_fileoff_t		end_fsb)
> +	xfs_fileoff_t		end_fsb,
> +	bool			*skipped)
>  {
>  	struct xfs_bmbt_irec	got, prev, imap;
>  	xfs_fileoff_t		orig_end_fsb;
> @@ -279,8 +280,10 @@ __xfs_reflink_reserve_cow(
>  	end_fsb = orig_end_fsb = imap.br_startoff + imap.br_blockcount;
>  
>  	/* Not shared?  Just report the (potentially capped) extent. */
> -	if (!shared)
> +	if (!shared) {
> +		*skipped = true;
>  		goto done;
> +	}
>  
>  	/*
>  	 * Fork all the shared blocks from our write offset until the end of
> @@ -326,6 +329,7 @@ xfs_reflink_reserve_cow_range(
>  {
>  	struct xfs_mount	*mp = ip->i_mount;
>  	xfs_fileoff_t		offset_fsb, end_fsb;
> +	bool			skipped = false;
>  	int			error;
>  
>  	trace_xfs_reflink_reserve_cow_range(ip, offset, count);
> @@ -335,7 +339,8 @@ xfs_reflink_reserve_cow_range(
>  
>  	xfs_ilock(ip, XFS_ILOCK_EXCL);
>  	while (offset_fsb < end_fsb) {
> -		error = __xfs_reflink_reserve_cow(ip, &offset_fsb, end_fsb);
> +		error = __xfs_reflink_reserve_cow(ip, &offset_fsb, end_fsb,
> +				&skipped);
>  		if (error) {
>  			trace_xfs_reflink_reserve_cow_range_error(ip, error,
>  				_RET_IP_);
> @@ -347,6 +352,102 @@ xfs_reflink_reserve_cow_range(
>  	return error;
>  }
>  
> +/* Allocate all CoW reservations covering a range of blocks in a file. */
> +static int
> +__xfs_reflink_allocate_cow(
> +	struct xfs_inode	*ip,
> +	xfs_fileoff_t		*offset_fsb,
> +	xfs_fileoff_t		end_fsb)
> +{
> +	struct xfs_mount	*mp = ip->i_mount;
> +	struct xfs_bmbt_irec	imap;
> +	struct xfs_defer_ops	dfops;
> +	struct xfs_trans	*tp;
> +	xfs_fsblock_t		first_block;
> +	xfs_fileoff_t		next_fsb;
> +	int			nimaps = 1, error;
> +	bool			skipped = false;
> +
> +	xfs_defer_init(&dfops, &first_block);
> +
> +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0,
> +			XFS_TRANS_RESERVE, &tp);
> +	if (error)
> +		return error;
> +
> +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> +
> +	next_fsb = *offset_fsb;
> +	error = __xfs_reflink_reserve_cow(ip, &next_fsb, end_fsb, &skipped);
> +	if (error)
> +		goto out_trans_cancel;

Do we really need the delayed allocation that results from this?
Couldn't we factor out the shared extent walking so that we can just
perform the real allocations below?

It looks like speculative preallocation for dio is at least one strange
side effect that can result from this...

> +
> +	if (skipped) {
> +		*offset_fsb = next_fsb;
> +		goto out_trans_cancel;
> +	}
> +
> +	xfs_trans_ijoin(tp, ip, 0);
> +	error = xfs_bmapi_write(tp, ip, *offset_fsb, next_fsb - *offset_fsb,
> +			XFS_BMAPI_COWFORK, &first_block,
> +			XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK),
> +			&imap, &nimaps, &dfops);
> +	if (error)
> +		goto out_trans_cancel;

Should we be using unwritten extents (BMAPI_PREALLOC) to avoid stale
data exposure, as with traditional direct writes (or is the CoW fork
extent never accessible until it is remapped)?

Brian

> +
> +	/* We might not have been able to map the whole delalloc extent */
> +	*offset_fsb = min(*offset_fsb + imap.br_blockcount, next_fsb);
> +
> +	error = xfs_defer_finish(&tp, &dfops, NULL);
> +	if (error)
> +		goto out_trans_cancel;
> +
> +	error = xfs_trans_commit(tp);
> +
> +out_unlock:
> +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +	return error;
> +out_trans_cancel:
> +	xfs_defer_cancel(&dfops);
> +	xfs_trans_cancel(tp);
> +	goto out_unlock;
> +}
> +
> +/* Allocate all CoW reservations covering a part of a file. */
> +int
> +xfs_reflink_allocate_cow_range(
> +	struct xfs_inode	*ip,
> +	xfs_off_t		offset,
> +	xfs_off_t		count)
> +{
> +	struct xfs_mount	*mp = ip->i_mount;
> +	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
> +	xfs_fileoff_t		end_fsb = XFS_B_TO_FSB(mp, offset + count);
> +	int			error;
> +
> +	ASSERT(xfs_is_reflink_inode(ip));
> +
> +	trace_xfs_reflink_allocate_cow_range(ip, offset, count);
> +
> +	/*
> +	 * Make sure that the dquots are there.
> +	 */
> +	error = xfs_qm_dqattach(ip, 0);
> +	if (error)
> +		return error;
> +
> +	while (offset_fsb < end_fsb) {
> +		error = __xfs_reflink_allocate_cow(ip, &offset_fsb, end_fsb);
> +		if (error) {
> +			trace_xfs_reflink_allocate_cow_range_error(ip, error,
> +					_RET_IP_);
> +			break;
> +		}
> +	}
> +
> +	return error;
> +}
> +
>  /*
>   * Find the CoW reservation (and whether or not it needs block allocation)
>   * for a given byte offset of a file.
> diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
> index bffa4be..c0c989a 100644
> --- a/fs/xfs/xfs_reflink.h
> +++ b/fs/xfs/xfs_reflink.h
> @@ -28,6 +28,8 @@ extern int xfs_reflink_trim_around_shared(struct xfs_inode *ip,
>  
>  extern int xfs_reflink_reserve_cow_range(struct xfs_inode *ip,
>  		xfs_off_t offset, xfs_off_t count);
> +extern int xfs_reflink_allocate_cow_range(struct xfs_inode *ip,
> +		xfs_off_t offset, xfs_off_t count);
>  extern bool xfs_reflink_find_cow_mapping(struct xfs_inode *ip, xfs_off_t offset,
>  		struct xfs_bmbt_irec *imap, bool *need_alloc);
>  extern int xfs_reflink_trim_irec_to_next_cow(struct xfs_inode *ip,
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index 7612096..8e89223 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -3332,7 +3332,6 @@ DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_enospc);
>  
>  DEFINE_RW_EVENT(xfs_reflink_reserve_cow_range);
>  DEFINE_RW_EVENT(xfs_reflink_allocate_cow_range);
> -DEFINE_INODE_IREC_EVENT(xfs_reflink_allocate_cow_extent);
>  
>  DEFINE_INODE_IREC_EVENT(xfs_reflink_bounce_dio_write);
>  DEFINE_IOMAP_EVENT(xfs_reflink_find_cow_mapping);
> 

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 37/63] xfs: implement CoW for directio writes
  2016-10-05 18:27   ` Brian Foster
@ 2016-10-05 20:55     ` Darrick J. Wong
  2016-10-06 12:20       ` Brian Foster
  0 siblings, 1 reply; 187+ messages in thread
From: Darrick J. Wong @ 2016-10-05 20:55 UTC (permalink / raw)
  To: Brian Foster; +Cc: david, linux-xfs, Christoph Hellwig

On Wed, Oct 05, 2016 at 02:27:10PM -0400, Brian Foster wrote:
> On Thu, Sep 29, 2016 at 08:09:40PM -0700, Darrick J. Wong wrote:
> > For O_DIRECT writes to shared blocks, we have to CoW them just like
> > we would with buffered writes.  For writes that are not block-aligned,
> > just bounce them to the page cache.
> > 
> > For block-aligned writes, however, we can do better than that.  Use
> > the same mechanisms that we employ for buffered CoW to set up a
> > delalloc reservation, allocate all the blocks at once, issue the
> > writes against the new blocks and use the same ioend functions to
> > remap the blocks after the write.  This should be fairly performant.
> > 
> > Christoph discovered that xfs_reflink_allocate_cow_range may stumble
> > over invalid entries in the extent array given that it drops the ilock
> > but still expects the index to be stable.  Simply fixing it to do a
> > new lookup for every iteration still isn't correct given that
> > xfs_bmapi_allocate will trigger a BUG_ON() if it hits a hole, and
> > there is nothing preventing an xfs_bunmapi_cow call from removing
> > extents once we've dropped the ilock either.
> > 
> > This patch duplicates the inner loop of xfs_bmapi_allocate into a
> > helper for xfs_reflink_allocate_cow_range so that it can be done under
> > the same ilock critical section as our CoW fork delayed allocation.
> > The directio CoW warts will be revisited in a later patch.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > ---
> > v2: Turns out that there's no way for xfs_end_io_direct_write to know
> > if the write completed successfully.  Therefore, do /not/ use the
> > ioend for dio cow post-processing; instead, move it to xfs_vm_do_dio
> > where we *can* tell if the write succeeded or not.
> > 
> > v3: Update the file size if we do a directio CoW across EOF.  This
> > can happen if the last block is shared, the cowextsize hint is set,
> > and we do a dio write past the end of the file.
> > 
> > v4: Christoph rewrote the allocate code to fix some concurrency
> > problems as part of migrating the code to support iomap.
> > ---
> >  fs/xfs/xfs_aops.c    |   91 +++++++++++++++++++++++++++++++++++++++----
> >  fs/xfs/xfs_file.c    |   20 ++++++++-
> >  fs/xfs/xfs_reflink.c |  107 +++++++++++++++++++++++++++++++++++++++++++++++++-
> >  fs/xfs/xfs_reflink.h |    2 +
> >  fs/xfs/xfs_trace.h   |    1 
> >  5 files changed, 208 insertions(+), 13 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > index 1d0435a..62a95e4 100644
> > --- a/fs/xfs/xfs_aops.c
> > +++ b/fs/xfs/xfs_aops.c
> > @@ -40,6 +40,7 @@
> >  /* flags for direct write completions */
> >  #define XFS_DIO_FLAG_UNWRITTEN	(1 << 0)
> >  #define XFS_DIO_FLAG_APPEND	(1 << 1)
> > +#define XFS_DIO_FLAG_COW	(1 << 2)
> >  
> >  /*
> >   * structure owned by writepages passed to individual writepage calls
> > @@ -1191,18 +1192,24 @@ xfs_map_direct(
> >  	struct inode		*inode,
> >  	struct buffer_head	*bh_result,
> >  	struct xfs_bmbt_irec	*imap,
> > -	xfs_off_t		offset)
> > +	xfs_off_t		offset,
> > +	bool			is_cow)
> >  {
> >  	uintptr_t		*flags = (uintptr_t *)&bh_result->b_private;
> >  	xfs_off_t		size = bh_result->b_size;
> >  
> >  	trace_xfs_get_blocks_map_direct(XFS_I(inode), offset, size,
> > -		ISUNWRITTEN(imap) ? XFS_IO_UNWRITTEN : XFS_IO_OVERWRITE, imap);
> > +		ISUNWRITTEN(imap) ? XFS_IO_UNWRITTEN : is_cow ? XFS_IO_COW :
> > +		XFS_IO_OVERWRITE, imap);
> >  
> >  	if (ISUNWRITTEN(imap)) {
> >  		*flags |= XFS_DIO_FLAG_UNWRITTEN;
> >  		set_buffer_defer_completion(bh_result);
> > -	} else if (offset + size > i_size_read(inode) || offset + size < 0) {
> > +	} else if (is_cow) {
> > +		*flags |= XFS_DIO_FLAG_COW;
> > +		set_buffer_defer_completion(bh_result);
> > +	}
> > +	if (offset + size > i_size_read(inode) || offset + size < 0) {
> >  		*flags |= XFS_DIO_FLAG_APPEND;
> >  		set_buffer_defer_completion(bh_result);
> >  	}
> > @@ -1248,6 +1255,44 @@ xfs_map_trim_size(
> >  	bh_result->b_size = mapping_size;
> >  }
> >  
> > +/* Bounce unaligned directio writes to the page cache. */
> > +static int
> > +xfs_bounce_unaligned_dio_write(
> > +	struct xfs_inode	*ip,
> > +	xfs_fileoff_t		offset_fsb,
> > +	struct xfs_bmbt_irec	*imap)
> > +{
> > +	struct xfs_bmbt_irec	irec;
> > +	xfs_fileoff_t		delta;
> > +	bool			shared;
> > +	bool			x;
> > +	int			error;
> > +
> > +	irec = *imap;
> > +	if (offset_fsb > irec.br_startoff) {
> > +		delta = offset_fsb - irec.br_startoff;
> > +		irec.br_blockcount -= delta;
> > +		irec.br_startblock += delta;
> > +		irec.br_startoff = offset_fsb;
> > +	}
> > +	error = xfs_reflink_trim_around_shared(ip, &irec, &x, &shared);
> 
> 'shared' is the 3rd parameter.

Fixed.  (I actually fixed it yesterday when I made the _find_shared
functions return NULLAGBLOCK to mean "no shared blocks here" and
generic/139 blew up.)

> > +	if (error)
> > +		return error;
> > +	/*
> > +	 * Are we doing a DIO write to a shared block?  In
> > +	 * the ideal world we at least would fork full blocks,
> > +	 * but for now just fall back to buffered mode.  Yuck.
> > +	 * Use -EREMCHG ("remote address changed") to signal
> > +	 * this, since in general XFS doesn't do this sort of
> > +	 * fallback.
> > +	 */
> > +	if (shared) {
> > +		trace_xfs_reflink_bounce_dio_write(ip, imap);
> > +		return -EREMCHG;
> > +	}
> 
> I get that this bumps the write back up to the buffered mechanism, but
> the purpose is not very clear. We have the !unaligned check back up in
> xfs_file_dio_aio_write(), in which case we do all of the cow allocation
> stuff. I presume in that case we should never fail this check (?).

Ewww, stale stinky comments!  Originally we were just going to demote
all dio writes to a shared region to the pagecache, but when I created
the CoW fork I realized that when the write request was block aligned
that it was easy enough to make the reservation and allocate it all in
one step.  The thing you pointed out in xfs_file_dio_aio_write() is
exactly that mechanism.

Unfortunately, it seems that I neglected to modify the comment to
mention that we only do the bouncing for unaligned directio cows.

/*
 * We're here because we're trying to do a directio write to a region
 * that isn't aligned to a filesystem block.  If any part of the extent
 * is shared, fall back to buffered mode to handle the RMW.  This is
 * done by returning -EREMCHG ("remote addr changed"), which is caught
 * further up the call stack.
 */

> If the dio is unaligned, we skip that bit, get down to here and kick it
> back only if the extent happens to be shared..? Unless I missed it, I
> think this needs to be explained in the comments in both places,
> including probably updating the comment at the end of dio_aio_write()
> that states we don't fall back to buffered I/O on dio error. 

Yep.

I'd assumed that everyone knew that, except for this one case, we never
fall back to buffered IO from any of the directio paths.  I'll update
the other EREMCHG comment in xfs_file_write_iter() to state that more
explicitly:

/*
 * Allow a directio write to fall back to a buffered
 * write *only* in the case that we're doing a reflink
 * CoW.  In all other directio scenarios we do not
 * allow an operation to fall back to buffered mode.
 */

> > +	return 0;
> > +}
> > +
> >  STATIC int
> >  __xfs_get_blocks(
> >  	struct inode		*inode,
> > @@ -1267,6 +1312,8 @@ __xfs_get_blocks(
> >  	xfs_off_t		offset;
> >  	ssize_t			size;
> >  	int			new = 0;
> > +	bool			is_cow = false;
> > +	bool			need_alloc = false;
> >  
> >  	BUG_ON(create && !direct);
> >  
> > @@ -1292,8 +1339,27 @@ __xfs_get_blocks(
> >  	end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + size);
> >  	offset_fsb = XFS_B_TO_FSBT(mp, offset);
> >  
> > -	error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb,
> > -				&imap, &nimaps, XFS_BMAPI_ENTIRE);
> > +	if (create && direct) {
> > +		is_cow = xfs_reflink_find_cow_mapping(ip, offset, &imap,
> > +					&need_alloc);
> > +	}
> 
> Nits: no need for braces here and might be cleaner to check
> xfs_is_reflink_inode(ip) here, since !reflink is probably the common
> case.

<nod>

> > +
> > +	if (!is_cow) {
> > +		error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb,
> > +					&imap, &nimaps, XFS_BMAPI_ENTIRE);
> > +		/*
> > +		 * Truncate an overwrite extent if there's a pending CoW
> > +		 * reservation before the end of this extent.  This forces us
> > +		 * to come back to writepage to take care of the CoW.
> 
> writepage?

"...come back to get_blocks to take care of..."

> > +		 */
> > +		if (create && direct && nimaps &&
> > +		    imap.br_startblock != HOLESTARTBLOCK &&
> > +		    imap.br_startblock != DELAYSTARTBLOCK &&
> > +		    !ISUNWRITTEN(&imap))
> > +			xfs_reflink_trim_irec_to_next_cow(ip, offset_fsb,
> > +					&imap);
> > +	}
> > +	ASSERT(!need_alloc);
> >  	if (error)
> >  		goto out_unlock;
> >  
> > @@ -1345,6 +1411,13 @@ __xfs_get_blocks(
> >  	if (imap.br_startblock != HOLESTARTBLOCK &&
> >  	    imap.br_startblock != DELAYSTARTBLOCK &&
> >  	    (create || !ISUNWRITTEN(&imap))) {
> > +		if (create && direct && !is_cow) {
> > +			error = xfs_bounce_unaligned_dio_write(ip, offset_fsb,
> > +					&imap);
> > +			if (error)
> > +				return error;
> > +		}
> > +
> >  		xfs_map_buffer(inode, bh_result, &imap, offset);
> >  		if (ISUNWRITTEN(&imap))
> >  			set_buffer_unwritten(bh_result);
> > @@ -1353,7 +1426,8 @@ __xfs_get_blocks(
> >  			if (dax_fault)
> >  				ASSERT(!ISUNWRITTEN(&imap));
> >  			else
> > -				xfs_map_direct(inode, bh_result, &imap, offset);
> > +				xfs_map_direct(inode, bh_result, &imap, offset,
> > +						is_cow);
> >  		}
> >  	}
> >  
> > @@ -1479,7 +1553,10 @@ xfs_end_io_direct_write(
> >  		trace_xfs_end_io_direct_write_unwritten(ip, offset, size);
> >  
> >  		error = xfs_iomap_write_unwritten(ip, offset, size);
> > -	} else if (flags & XFS_DIO_FLAG_APPEND) {
> > +	}
> > +	if (flags & XFS_DIO_FLAG_COW)
> > +		error = xfs_reflink_end_cow(ip, offset, size);
> > +	if (flags & XFS_DIO_FLAG_APPEND) {
> >  		trace_xfs_end_io_direct_write_append(ip, offset, size);
> >  
> >  		error = xfs_setfilesize(ip, offset, size);
> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index f99d7fa..025d52f 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -38,6 +38,7 @@
> >  #include "xfs_icache.h"
> >  #include "xfs_pnfs.h"
> >  #include "xfs_iomap.h"
> > +#include "xfs_reflink.h"
> >  
> >  #include <linux/dcache.h>
> >  #include <linux/falloc.h>
> > @@ -672,6 +673,13 @@ xfs_file_dio_aio_write(
> >  
> >  	trace_xfs_file_direct_write(ip, count, iocb->ki_pos);
> >  
> > +	/* If this is a block-aligned directio CoW, remap immediately. */
> > +	if (xfs_is_reflink_inode(ip) && !unaligned_io) {
> > +		ret = xfs_reflink_allocate_cow_range(ip, iocb->ki_pos, count);
> > +		if (ret)
> > +			goto out;
> > +	}
> 
> Is the fact that we do this allocation up front rather than via
> get_blocks() (like traditional direct write) one of the "warts" that
> needs cleaning, or for some other reason?

"Yes". :)

We do the allocation here because we know the exact size of the IO that
userspace is asking for, so we might as well do all the allocations
at once instead of repeatedly calling back into the allocator for each
shared segment that gets fed into get_blocks.  Sort of warty.

I think this could get moved to get_blocks, though TBH I've been
wondering if all this will just get replaced with iomap as part of
killing buffer heads.

> > +
> >  	data = *from;
> >  	ret = __blockdev_direct_IO(iocb, inode, target->bt_bdev, &data,
> >  			xfs_get_blocks_direct, xfs_end_io_direct_write,
> > @@ -812,10 +820,18 @@ xfs_file_write_iter(
> >  
> >  	if (IS_DAX(inode))
> >  		ret = xfs_file_dax_write(iocb, from);
> > -	else if (iocb->ki_flags & IOCB_DIRECT)
> > +	else if (iocb->ki_flags & IOCB_DIRECT) {
> > +		/*
> > +		 * Allow DIO to fall back to buffered *only* in the case
> > +		 * that we're doing a reflink CoW.
> > +		 */
> >  		ret = xfs_file_dio_aio_write(iocb, from);
> > -	else
> > +		if (ret == -EREMCHG)
> > +			goto buffered;
> > +	} else {
> > +buffered:
> >  		ret = xfs_file_buffered_aio_write(iocb, from);
> > +	}
> >  
> >  	if (ret > 0) {
> >  		XFS_STATS_ADD(ip->i_mount, xs_write_bytes, ret);
> > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > index d913ad1..c95cdc3 100644
> > --- a/fs/xfs/xfs_reflink.c
> > +++ b/fs/xfs/xfs_reflink.c
> > @@ -246,7 +246,8 @@ static int
> >  __xfs_reflink_reserve_cow(
> >  	struct xfs_inode	*ip,
> >  	xfs_fileoff_t		*offset_fsb,
> > -	xfs_fileoff_t		end_fsb)
> > +	xfs_fileoff_t		end_fsb,
> > +	bool			*skipped)
> >  {
> >  	struct xfs_bmbt_irec	got, prev, imap;
> >  	xfs_fileoff_t		orig_end_fsb;
> > @@ -279,8 +280,10 @@ __xfs_reflink_reserve_cow(
> >  	end_fsb = orig_end_fsb = imap.br_startoff + imap.br_blockcount;
> >  
> >  	/* Not shared?  Just report the (potentially capped) extent. */
> > -	if (!shared)
> > +	if (!shared) {
> > +		*skipped = true;
> >  		goto done;
> > +	}
> >  
> >  	/*
> >  	 * Fork all the shared blocks from our write offset until the end of
> > @@ -326,6 +329,7 @@ xfs_reflink_reserve_cow_range(
> >  {
> >  	struct xfs_mount	*mp = ip->i_mount;
> >  	xfs_fileoff_t		offset_fsb, end_fsb;
> > +	bool			skipped = false;
> >  	int			error;
> >  
> >  	trace_xfs_reflink_reserve_cow_range(ip, offset, count);
> > @@ -335,7 +339,8 @@ xfs_reflink_reserve_cow_range(
> >  
> >  	xfs_ilock(ip, XFS_ILOCK_EXCL);
> >  	while (offset_fsb < end_fsb) {
> > -		error = __xfs_reflink_reserve_cow(ip, &offset_fsb, end_fsb);
> > +		error = __xfs_reflink_reserve_cow(ip, &offset_fsb, end_fsb,
> > +				&skipped);
> >  		if (error) {
> >  			trace_xfs_reflink_reserve_cow_range_error(ip, error,
> >  				_RET_IP_);
> > @@ -347,6 +352,102 @@ xfs_reflink_reserve_cow_range(
> >  	return error;
> >  }
> >  
> > +/* Allocate all CoW reservations covering a range of blocks in a file. */
> > +static int
> > +__xfs_reflink_allocate_cow(
> > +	struct xfs_inode	*ip,
> > +	xfs_fileoff_t		*offset_fsb,
> > +	xfs_fileoff_t		end_fsb)
> > +{
> > +	struct xfs_mount	*mp = ip->i_mount;
> > +	struct xfs_bmbt_irec	imap;
> > +	struct xfs_defer_ops	dfops;
> > +	struct xfs_trans	*tp;
> > +	xfs_fsblock_t		first_block;
> > +	xfs_fileoff_t		next_fsb;
> > +	int			nimaps = 1, error;
> > +	bool			skipped = false;
> > +
> > +	xfs_defer_init(&dfops, &first_block);
> > +
> > +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0,
> > +			XFS_TRANS_RESERVE, &tp);
> > +	if (error)
> > +		return error;
> > +
> > +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > +
> > +	next_fsb = *offset_fsb;
> > +	error = __xfs_reflink_reserve_cow(ip, &next_fsb, end_fsb, &skipped);
> > +	if (error)
> > +		goto out_trans_cancel;
> 
> Do we really need to do the delayed allocation that results from this?
> Couldn't we factor out the shared extent walking that allows us to just
> perform the real allocations below?

The delayed reservation -> allocation two-step is necessary to create
replacement extents that are aligned to the CoW extent size hint.  This
is important for aligning extents in the same way as the regular extent
size hint, and critical for detecting random writes and landing them all
as close to one contiguous physical extent as possible.  This helps us
to reduce CoW-related fragmentation to manageable levels, which is
necessary to avoid ENOMEM problems with the current incore extent tree.

Reducing fragmentation also helps us avoid problems seen on some other
filesystem where reflinking of a 64G root image takes minutes after a
couple of weeks of normal operations because the average extent size is
now 2 blocks.

(By contrast we're still averaging ~800 blocks per extent.)

> It looks like speculative preallocation for dio is at least one strange
> side effect that can result from this...

Christoph split the delalloc reservation into separate functions for
the data fork and the CoW fork.  xfs_file_iomap_begin_delay() is for the
data fork (and does speculative prealloc), whereas
__xfs_reflink_reserve_cow() is for the CoW fork and doesn't know about
speculative prealloc.

> > +
> > +	if (skipped) {
> > +		*offset_fsb = next_fsb;
> > +		goto out_trans_cancel;
> > +	}
> > +
> > +	xfs_trans_ijoin(tp, ip, 0);
> > +	error = xfs_bmapi_write(tp, ip, *offset_fsb, next_fsb - *offset_fsb,
> > +			XFS_BMAPI_COWFORK, &first_block,
> > +			XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK),
> > +			&imap, &nimaps, &dfops);
> > +	if (error)
> > +		goto out_trans_cancel;
> 
> Should we be using unwritten extents (BMAPI_PREALLOC) to avoid stale
> data exposure similar to traditional direct write (or is the cow fork
> extent never accessible until it is remapped)?

Correct.  CoW fork extents are not accessible until after remapping.

--D

> 
> Brian
> 
> > +
> > +	/* We might not have been able to map the whole delalloc extent */
> > +	*offset_fsb = min(*offset_fsb + imap.br_blockcount, next_fsb);
> > +
> > +	error = xfs_defer_finish(&tp, &dfops, NULL);
> > +	if (error)
> > +		goto out_trans_cancel;
> > +
> > +	error = xfs_trans_commit(tp);
> > +
> > +out_unlock:
> > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > +	return error;
> > +out_trans_cancel:
> > +	xfs_defer_cancel(&dfops);
> > +	xfs_trans_cancel(tp);
> > +	goto out_unlock;
> > +}
> > +
> > +/* Allocate all CoW reservations covering a part of a file. */
> > +int
> > +xfs_reflink_allocate_cow_range(
> > +	struct xfs_inode	*ip,
> > +	xfs_off_t		offset,
> > +	xfs_off_t		count)
> > +{
> > +	struct xfs_mount	*mp = ip->i_mount;
> > +	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
> > +	xfs_fileoff_t		end_fsb = XFS_B_TO_FSB(mp, offset + count);
> > +	int			error;
> > +
> > +	ASSERT(xfs_is_reflink_inode(ip));
> > +
> > +	trace_xfs_reflink_allocate_cow_range(ip, offset, count);
> > +
> > +	/*
> > +	 * Make sure that the dquots are there.
> > +	 */
> > +	error = xfs_qm_dqattach(ip, 0);
> > +	if (error)
> > +		return error;
> > +
> > +	while (offset_fsb < end_fsb) {
> > +		error = __xfs_reflink_allocate_cow(ip, &offset_fsb, end_fsb);
> > +		if (error) {
> > +			trace_xfs_reflink_allocate_cow_range_error(ip, error,
> > +					_RET_IP_);
> > +			break;
> > +		}
> > +	}
> > +
> > +	return error;
> > +}
> > +
> >  /*
> >   * Find the CoW reservation (and whether or not it needs block allocation)
> >   * for a given byte offset of a file.
> > diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
> > index bffa4be..c0c989a 100644
> > --- a/fs/xfs/xfs_reflink.h
> > +++ b/fs/xfs/xfs_reflink.h
> > @@ -28,6 +28,8 @@ extern int xfs_reflink_trim_around_shared(struct xfs_inode *ip,
> >  
> >  extern int xfs_reflink_reserve_cow_range(struct xfs_inode *ip,
> >  		xfs_off_t offset, xfs_off_t count);
> > +extern int xfs_reflink_allocate_cow_range(struct xfs_inode *ip,
> > +		xfs_off_t offset, xfs_off_t count);
> >  extern bool xfs_reflink_find_cow_mapping(struct xfs_inode *ip, xfs_off_t offset,
> >  		struct xfs_bmbt_irec *imap, bool *need_alloc);
> >  extern int xfs_reflink_trim_irec_to_next_cow(struct xfs_inode *ip,
> > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > index 7612096..8e89223 100644
> > --- a/fs/xfs/xfs_trace.h
> > +++ b/fs/xfs/xfs_trace.h
> > @@ -3332,7 +3332,6 @@ DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_enospc);
> >  
> >  DEFINE_RW_EVENT(xfs_reflink_reserve_cow_range);
> >  DEFINE_RW_EVENT(xfs_reflink_allocate_cow_range);
> > -DEFINE_INODE_IREC_EVENT(xfs_reflink_allocate_cow_extent);
> >  
> >  DEFINE_INODE_IREC_EVENT(xfs_reflink_bounce_dio_write);
> >  DEFINE_IOMAP_EVENT(xfs_reflink_find_cow_mapping);
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 35/63] xfs: move mappings from cow fork to data fork after copy-write
  2016-10-05 18:26   ` Brian Foster
@ 2016-10-05 21:22     ` Darrick J. Wong
  0 siblings, 0 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-10-05 21:22 UTC (permalink / raw)
  To: Brian Foster; +Cc: david, linux-xfs, Christoph Hellwig

On Wed, Oct 05, 2016 at 02:26:48PM -0400, Brian Foster wrote:
> On Thu, Sep 29, 2016 at 08:09:27PM -0700, Darrick J. Wong wrote:
> > After the write component of a copy-write operation finishes, clean up
> > the bookkeeping left behind.  On error, we simply free the new blocks
> > and pass the error up.  If we succeed, however, then we must remove
> > the old data fork mapping and move the cow fork mapping to the data
> > fork.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > [hch: Call the CoW failure function during xfs_cancel_ioend]
> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > ---
> > v2: If CoW fails, we need to remove the CoW fork mapping and free the
> > blocks.  Furthermore, if xfs_cancel_ioend happens, we also need to
> > clean out all the CoW record keeping.
> > 
> > v3: When we're removing CoW extents, only free one extent per
> > transaction to avoid running out of reservation.  Also,
> > xfs_cancel_ioend mustn't clean out the CoW fork because it is called
> > when async writeback can't get an inode lock and will try again.
> > 
> > v4: Use bmapi_read to iterate the CoW fork instead of calling the
> > iext functions directly, and make the CoW remapping atomic by
> > using the deferred ops mechanism which takes care of logging redo
> > items for us.
> > 
> > v5: Unlock the inode if cancelling the CoW reservation fails.
> > ---
> >  fs/xfs/xfs_aops.c    |   22 ++++
> >  fs/xfs/xfs_reflink.c |  271 ++++++++++++++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/xfs_reflink.h |    8 +
> >  3 files changed, 299 insertions(+), 2 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > index 7b1e9de..aa23993 100644
> > --- a/fs/xfs/xfs_aops.c
> > +++ b/fs/xfs/xfs_aops.c
> > @@ -288,6 +288,23 @@ xfs_end_io(
> >  		error = -EIO;
> >  
> >  	/*
> > +	 * For a CoW extent, we need to move the mapping from the CoW fork
> > +	 * to the data fork.  If instead an error happened, just dump the
> > +	 * new blocks.
> > +	 */
> > +	if (ioend->io_type == XFS_IO_COW) {
> > +		if (ioend->io_bio->bi_error) {
> > +			error = xfs_reflink_cancel_cow_range(ip,
> > +					ioend->io_offset, ioend->io_size);
> > +			goto done;
> > +		}
> 
> I'm a little confused why we'd clear out delalloc blocks here but not if
> the write happens to fail in the first place (but I take it the
> explanation for my previous comment still applies..).

Originally I thought that it would be a good idea to get rid of blocks
if the actual write IO fails.  Maybe we should just leave it, in case
a subsequent retry does not also fail.

> > +		error = xfs_reflink_end_cow(ip, ioend->io_offset,
> > +				ioend->io_size);
> > +		if (error)
> > +			goto done;
> 
> This hunk clobbers error if set by the previous check (not shown in the
> diff) due to fs shutdown.

D'oh.  Well, if the FS is shutdown there's little point in updating
metadata, so I guess we can just jump out.

> > +	}
> > +
> > +	/*
> >  	 * For unwritten extents we need to issue transactions to convert a
> >  	 * range to normal written extens after the data I/O has finished.
> >  	 * Detecting and handling completion IO errors is done individually
> > @@ -302,7 +319,8 @@ xfs_end_io(
> >  	} else if (ioend->io_append_trans) {
> >  		error = xfs_setfilesize_ioend(ioend, error);
> >  	} else {
> > -		ASSERT(!xfs_ioend_is_append(ioend));
> > +		ASSERT(!xfs_ioend_is_append(ioend) ||
> > +		       ioend->io_type == XFS_IO_COW);
> >  	}
> >  
> >  done:
> > @@ -316,7 +334,7 @@ xfs_end_bio(
> >  	struct xfs_ioend	*ioend = bio->bi_private;
> >  	struct xfs_mount	*mp = XFS_I(ioend->io_inode)->i_mount;
> >  
> > -	if (ioend->io_type == XFS_IO_UNWRITTEN)
> > +	if (ioend->io_type == XFS_IO_UNWRITTEN || ioend->io_type == XFS_IO_COW)
> >  		queue_work(mp->m_unwritten_workqueue, &ioend->io_work);
> >  	else if (ioend->io_append_trans)
> >  		queue_work(mp->m_data_workqueue, &ioend->io_work);
> > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > index e8c7c85..d913ad1 100644
> > --- a/fs/xfs/xfs_reflink.c
> > +++ b/fs/xfs/xfs_reflink.c
> > @@ -52,6 +52,7 @@
> >  #include "xfs_bmap_btree.h"
> >  #include "xfs_reflink.h"
> >  #include "xfs_iomap.h"
> > +#include "xfs_rmap_btree.h"
> >  
> >  /*
> >   * Copy on Write of Shared Blocks
> > @@ -114,6 +115,37 @@
> >   * ioend, the better.
> >   */
> >  
> > +/* Trim extent to fit a logical block range. */
> > +static void
> > +xfs_trim_extent(
> > +	struct xfs_bmbt_irec	*irec,
> > +	xfs_fileoff_t		bno,
> > +	xfs_filblks_t		len)
> > +{
> > +	xfs_fileoff_t		distance;
> > +	xfs_fileoff_t		end = bno + len;
> > +
> > +	if (irec->br_startoff + irec->br_blockcount <= bno ||
> > +	    irec->br_startoff >= end) {
> 
> Hmm, this seems like slightly strange behavior. Why reset blockcount on
> an extent for a request to trim it to something beyond the extent? Is
> this primarily an error/sanity check or is this an expected case?
> 
> As it is, it looks like bno should point to the end of irec in most
> cases unless the unmap happens to remove less from the data fork than
> what has been allocated in the cow fork (which seems sane). I wonder if
> we just want to ASSERT() that the extent and trim range are sane here?

There are only two callers of this function, both of which can be
replaced by:

	uirec.br_startblock = irec->br_startblock + rlen;
	uirec.br_startoff = irec->br_startoff + rlen;
	uirec.br_blockcount = irec->br_blockcount - rlen;

so I'll just purge this function entirely.

> > +		irec->br_blockcount = 0;
> > +		return;
> > +	}
> > +
> > +	if (irec->br_startoff < bno) {
> > +		distance = bno - irec->br_startoff;
> > +		if (irec->br_startblock != DELAYSTARTBLOCK &&
> > +		    irec->br_startblock != HOLESTARTBLOCK)
> > +			irec->br_startblock += distance;
> > +		irec->br_startoff += distance;
> > +		irec->br_blockcount -= distance;
> > +	}
> > +
> > +	if (end < irec->br_startoff + irec->br_blockcount) {
> > +		distance = irec->br_startoff + irec->br_blockcount - end;
> > +		irec->br_blockcount -= distance;
> > +	}
> > +}
> > +
> >  /*
> >   * Given an AG extent, find the lowest-numbered run of shared blocks within
> >   * that range and return the range in fbno/flen.
> > @@ -400,3 +432,242 @@ xfs_reflink_trim_irec_to_next_cow(
> >  
> >  	return 0;
> >  }
> > +
> > +/*
> > + * Cancel all pending CoW reservations for some block range of an inode.
> > + */
> > +int
> > +xfs_reflink_cancel_cow_blocks(
> > +	struct xfs_inode		*ip,
> > +	struct xfs_trans		**tpp,
> > +	xfs_fileoff_t			offset_fsb,
> > +	xfs_fileoff_t			end_fsb)
> > +{
> > +	struct xfs_bmbt_irec		irec;
> > +	xfs_filblks_t			count_fsb;
> > +	xfs_fsblock_t			firstfsb;
> > +	struct xfs_defer_ops		dfops;
> > +	int				error = 0;
> > +	int				nimaps;
> > +
> > +	if (!xfs_is_reflink_inode(ip))
> > +		return 0;
> > +
> > +	/* Go find the old extent in the CoW fork. */
> > +	count_fsb = (xfs_filblks_t)(end_fsb - offset_fsb);
> > +	while (count_fsb) {
> > +		nimaps = 1;
> > +		error = xfs_bmapi_read(ip, offset_fsb, count_fsb, &irec,
> > +				&nimaps, XFS_BMAPI_COWFORK);
> > +		if (error)
> > +			break;
> > +		ASSERT(nimaps == 1);
> > +
> > +		trace_xfs_reflink_cancel_cow(ip, &irec);
> > +
> > +		if (irec.br_startblock == DELAYSTARTBLOCK) {
> > +			/* Free a delayed allocation. */
> > +			xfs_mod_fdblocks(ip->i_mount, irec.br_blockcount,
> > +					false);
> > +			ip->i_delayed_blks -= irec.br_blockcount;
> > +
> > +			/* Remove the mapping from the CoW fork. */
> > +			error = xfs_bunmapi_cow(ip, &irec);
> > +			if (error)
> > +				break;
> > +		} else if (irec.br_startblock == HOLESTARTBLOCK) {
> > +			/* empty */
> > +		} else {
> > +			xfs_trans_ijoin(*tpp, ip, 0);
> > +			xfs_defer_init(&dfops, &firstfsb);
> > +
> > +			xfs_bmap_add_free(ip->i_mount, &dfops,
> > +					irec.br_startblock, irec.br_blockcount,
> > +					NULL);
> > +
> > +			/* Update quota accounting */
> > +			xfs_trans_mod_dquot_byino(*tpp, ip, XFS_TRANS_DQ_BCOUNT,
> > +					-(long)irec.br_blockcount);
> > +
> > +			/* Roll the transaction */
> > +			error = xfs_defer_finish(tpp, &dfops, ip);
> > +			if (error) {
> > +				xfs_defer_cancel(&dfops);
> > +				break;
> > +			}
> > +
> > +			/* Remove the mapping from the CoW fork. */
> > +			error = xfs_bunmapi_cow(ip, &irec);
> > +			if (error)
> > +				break;
> > +		}
> > +
> > +		/* Roll on... */
> > +		count_fsb -= irec.br_startoff + irec.br_blockcount - offset_fsb;
> 
> Might be wise to safeguard against the extent being larger than the
> range (or just use offset_fsb and kill count_fsb)...

I think it's fine, since bmapi_read trims irec to fit the offset/count
you feed it, but OTOH it /does/ remove a few lines.

> > +		offset_fsb = irec.br_startoff + irec.br_blockcount;
> > +	}
> > +
> > +	return error;
> > +}
> > +
> > +/*
> > + * Cancel all pending CoW reservations for some byte range of an inode.
> > + */
> > +int
> > +xfs_reflink_cancel_cow_range(
> > +	struct xfs_inode	*ip,
> > +	xfs_off_t		offset,
> > +	xfs_off_t		count)
> > +{
> > +	struct xfs_trans	*tp;
> > +	xfs_fileoff_t		offset_fsb;
> > +	xfs_fileoff_t		end_fsb;
> > +	int			error;
> > +
> > +	trace_xfs_reflink_cancel_cow_range(ip, offset, count);
> > +
> > +	offset_fsb = XFS_B_TO_FSBT(ip->i_mount, offset);
> > +	if (count == NULLFILEOFF)
> > +		end_fsb = NULLFILEOFF;
> > +	else
> > +		end_fsb = XFS_B_TO_FSB(ip->i_mount, offset + count);
> > +
> > +	/* Start a rolling transaction to remove the mappings */
> > +	error = xfs_trans_alloc(ip->i_mount, &M_RES(ip->i_mount)->tr_write,
> > +			0, 0, 0, &tp);
> > +	if (error)
> > +		goto out;
> > +
> > +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > +	xfs_trans_ijoin(tp, ip, 0);
> > +
> > +	/* Scrape out the old CoW reservations */
> > +	error = xfs_reflink_cancel_cow_blocks(ip, &tp, offset_fsb, end_fsb);
> > +	if (error)
> > +		goto out_defer;
> > +
> > +	error = xfs_trans_commit(tp);
> > +
> > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > +	return error;
> > +
> > +out_defer:
> 
> out_cancel ?

Yeah.

> > +	xfs_trans_cancel(tp);
> > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > +out:
> > +	trace_xfs_reflink_cancel_cow_range_error(ip, error, _RET_IP_);
> > +	return error;
> > +}
> > +
> > +/*
> > + * Remap parts of a file's data fork after a successful CoW.
> > + */
> > +int
> > +xfs_reflink_end_cow(
> > +	struct xfs_inode		*ip,
> > +	xfs_off_t			offset,
> > +	xfs_off_t			count)
> > +{
> > +	struct xfs_bmbt_irec		irec;
> > +	struct xfs_bmbt_irec		uirec;
> > +	struct xfs_trans		*tp;
> > +	xfs_fileoff_t			offset_fsb;
> > +	xfs_fileoff_t			end_fsb;
> > +	xfs_filblks_t			count_fsb;
> > +	xfs_fsblock_t			firstfsb;
> > +	struct xfs_defer_ops		dfops;
> > +	int				error;
> > +	unsigned int			resblks;
> > +	xfs_filblks_t			ilen;
> > +	xfs_filblks_t			rlen;
> > +	int				nimaps;
> > +
> > +	trace_xfs_reflink_end_cow(ip, offset, count);
> > +
> > +	offset_fsb = XFS_B_TO_FSBT(ip->i_mount, offset);
> > +	end_fsb = XFS_B_TO_FSB(ip->i_mount, offset + count);
> > +	count_fsb = (xfs_filblks_t)(end_fsb - offset_fsb);
> > +
> > +	/* Start a rolling transaction to switch the mappings */
> > +	resblks = XFS_EXTENTADD_SPACE_RES(ip->i_mount, XFS_DATA_FORK);
> > +	error = xfs_trans_alloc(ip->i_mount, &M_RES(ip->i_mount)->tr_write,
> > +			resblks, 0, 0, &tp);
> > +	if (error)
> > +		goto out;
> 
> I forget the exact reason why we preallocate append transactions for I/O
> completion, but it would be nice if Dave or somebody could chime in on
> that to make sure we don't need to do something similar here (and for
> the cancel case).
> 
> > +
> > +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > +	xfs_trans_ijoin(tp, ip, 0);
> > +
> > +	/* Go find the old extent in the CoW fork. */
> > +	while (count_fsb) {
> > +		/* Read extent from the source file */
> > +		nimaps = 1;
> > +		error = xfs_bmapi_read(ip, offset_fsb, count_fsb, &irec,
> > +				&nimaps, XFS_BMAPI_COWFORK);
> > +		if (error)
> > +			goto out_cancel;
> > +		ASSERT(nimaps == 1);
> > +
> > +		ASSERT(irec.br_startblock != DELAYSTARTBLOCK);
> > +		trace_xfs_reflink_cow_remap(ip, &irec);
> > +
> > +		/*
> > +		 * We can have a hole in the CoW fork if part of a directio
> > +		 * write is CoW but part of it isn't.
> > +		 */
> > +		rlen = ilen = irec.br_blockcount;
> > +		if (irec.br_startblock == HOLESTARTBLOCK)
> > +			goto next_extent;
> > +
> > +		/* Unmap the old blocks in the data fork. */
> > +		while (rlen) {
> > +			xfs_defer_init(&dfops, &firstfsb);
> > +			error = __xfs_bunmapi(tp, ip, irec.br_startoff,
> > +					&rlen, 0, 1, &firstfsb, &dfops);
> > +			if (error)
> > +				goto out_defer;
> > +
> > +			/* Trim the extent to whatever got unmapped. */
> > +			uirec = irec;
> > +			xfs_trim_extent(&uirec, irec.br_startoff + rlen,
> > +					irec.br_blockcount - rlen);
> 
> We assign uirec = irec, then pass calculated values based on irec and
> rlen. How about xfs_trim_extent(&uirec, rlen)?
> 
> Also, it took me a while to grok that we "trim" the beginning of the
> extent because bunmapi works backwards. A comment would be appreciated
> here.  ;)

Ok.

> Brian
> 
> > +			irec.br_blockcount = rlen;
> > +			trace_xfs_reflink_cow_remap_piece(ip, &uirec);
> > +
> > +			/* Map the new blocks into the data fork. */
> > +			error = xfs_bmap_map_extent(tp->t_mountp, &dfops,
> > +					ip, XFS_DATA_FORK, &uirec);
> > +			if (error)
> > +				goto out_defer;
> > +
> > +			/* Remove the mapping from the CoW fork. */
> > +			error = xfs_bunmapi_cow(ip, &uirec);
> > +			if (error)
> > +				goto out_defer;
> > +
> > +			error = xfs_defer_finish(&tp, &dfops, ip);
> > +			if (error)
> > +				goto out_defer;
> > +		}
> > +
> > +next_extent:
> > +		/* Roll on... */
> > +		count_fsb -= irec.br_startoff + ilen - offset_fsb;
> > +		offset_fsb = irec.br_startoff + ilen;
> > +	}
> > +
> > +	error = xfs_trans_commit(tp);
> > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > +	if (error)
> > +		goto out;
> > +	return 0;
> > +
> > +out_defer:
> > +	xfs_defer_cancel(&dfops);
> > +out_cancel:
> > +	xfs_trans_cancel(tp);
> > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > +out:
> > +	trace_xfs_reflink_end_cow_error(ip, error, _RET_IP_);
> > +	return error;
> > +}
> > diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
> > index 11408c0..bffa4be 100644
> > --- a/fs/xfs/xfs_reflink.h
> > +++ b/fs/xfs/xfs_reflink.h
> > @@ -33,4 +33,12 @@ extern bool xfs_reflink_find_cow_mapping(struct xfs_inode *ip, xfs_off_t offset,
> >  extern int xfs_reflink_trim_irec_to_next_cow(struct xfs_inode *ip,
> >  		xfs_fileoff_t offset_fsb, struct xfs_bmbt_irec *imap);
> >  
> > +extern int xfs_reflink_cancel_cow_blocks(struct xfs_inode *ip,
> > +		struct xfs_trans **tpp, xfs_fileoff_t offset_fsb,
> > +		xfs_fileoff_t end_fsb);
> > +extern int xfs_reflink_cancel_cow_range(struct xfs_inode *ip, xfs_off_t offset,
> > +		xfs_off_t count);
> > +extern int xfs_reflink_end_cow(struct xfs_inode *ip, xfs_off_t offset,
> > +		xfs_off_t count);
> > +
> >  #endif /* __XFS_REFLINK_H */
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 37/63] xfs: implement CoW for directio writes
  2016-10-05 20:55     ` Darrick J. Wong
@ 2016-10-06 12:20       ` Brian Foster
  2016-10-07  1:02         ` Darrick J. Wong
  0 siblings, 1 reply; 187+ messages in thread
From: Brian Foster @ 2016-10-06 12:20 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, Christoph Hellwig

On Wed, Oct 05, 2016 at 01:55:42PM -0700, Darrick J. Wong wrote:
> On Wed, Oct 05, 2016 at 02:27:10PM -0400, Brian Foster wrote:
> > On Thu, Sep 29, 2016 at 08:09:40PM -0700, Darrick J. Wong wrote:
> > > For O_DIRECT writes to shared blocks, we have to CoW them just like
> > > we would with buffered writes.  For writes that are not block-aligned,
> > > just bounce them to the page cache.
> > > 
> > > For block-aligned writes, however, we can do better than that.  Use
> > > the same mechanisms that we employ for buffered CoW to set up a
> > > delalloc reservation, allocate all the blocks at once, issue the
> > > writes against the new blocks and use the same ioend functions to
> > > remap the blocks after the write.  This should be fairly performant.
> > > 
> > > Christoph discovered that xfs_reflink_allocate_cow_range may stumble
> > > over invalid entries in the extent array given that it drops the ilock
> > > but still expects the index to be stable.  Simply fixing it to do a
> > > new lookup for every iteration still isn't correct given that
> > > xfs_bmapi_allocate will trigger a BUG_ON() if hitting a hole, and
> > > there is nothing preventing an xfs_bunmapi_cow call from removing
> > > extents once we dropped the ilock either.
> > > 
> > > This patch duplicates the inner loop of xfs_bmapi_allocate into a
> > > helper for xfs_reflink_allocate_cow_range so that it can be done under
> > > the same ilock critical section as our CoW fork delayed allocation.
> > > The directio CoW warts will be revisited in a later patch.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > > ---
> > > v2: Turns out that there's no way for xfs_end_io_direct_write to know
> > > if the write completed successfully.  Therefore, do /not/ use the
> > > ioend for dio cow post-processing; instead, move it to xfs_vm_do_dio
> > > where we *can* tell if the write succeeded or not.
> > > 
> > > v3: Update the file size if we do a directio CoW across EOF.  This
> > > can happen if the last block is shared, the cowextsize hint is set,
> > > and we do a dio write past the end of the file.
> > > 
> > > v4: Christoph rewrote the allocate code to fix some concurrency
> > > problems as part of migrating the code to support iomap.
> > > ---
> > >  fs/xfs/xfs_aops.c    |   91 +++++++++++++++++++++++++++++++++++++++----
> > >  fs/xfs/xfs_file.c    |   20 ++++++++-
> > >  fs/xfs/xfs_reflink.c |  107 +++++++++++++++++++++++++++++++++++++++++++++++++-
> > >  fs/xfs/xfs_reflink.h |    2 +
> > >  fs/xfs/xfs_trace.h   |    1 
> > >  5 files changed, 208 insertions(+), 13 deletions(-)
> > > 
> > > 
...
> > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > index f99d7fa..025d52f 100644
> > > --- a/fs/xfs/xfs_file.c
> > > +++ b/fs/xfs/xfs_file.c
> > > @@ -38,6 +38,7 @@
> > >  #include "xfs_icache.h"
> > >  #include "xfs_pnfs.h"
> > >  #include "xfs_iomap.h"
> > > +#include "xfs_reflink.h"
> > >  
> > >  #include <linux/dcache.h>
> > >  #include <linux/falloc.h>
> > > @@ -672,6 +673,13 @@ xfs_file_dio_aio_write(
> > >  
> > >  	trace_xfs_file_direct_write(ip, count, iocb->ki_pos);
> > >  
> > > +	/* If this is a block-aligned directio CoW, remap immediately. */
> > > +	if (xfs_is_reflink_inode(ip) && !unaligned_io) {
> > > +		ret = xfs_reflink_allocate_cow_range(ip, iocb->ki_pos, count);
> > > +		if (ret)
> > > +			goto out;
> > > +	}
> > 
> > Is the fact that we do this allocation up front rather than via
> > get_blocks() (like traditional direct write) one of the "warts" that
> > needs cleaning, or for some other reason?
> 
> "Yes". :)
> 
> We do the allocation here because we know the exact size of the IO that
> userspace is asking for, so we might as well do all the allocations
> at once instead of repeatedly calling back into the allocator for each
> shared segment that gets fed into get_blocks.  Sort of warty.
> 
> I think this could get moved to get_blocks, though TBH I've been
> wondering if all this will just get replaced with iomap as part of
> killing buffer heads.
> 

Ok, kind of nasty with all of the various paths through get_blocks(),
but hopefully that dies off with buffer heads.

> > > +
> > >  	data = *from;
> > >  	ret = __blockdev_direct_IO(iocb, inode, target->bt_bdev, &data,
> > >  			xfs_get_blocks_direct, xfs_end_io_direct_write,
...
> > > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > > index d913ad1..c95cdc3 100644
> > > --- a/fs/xfs/xfs_reflink.c
> > > +++ b/fs/xfs/xfs_reflink.c
...
> > > @@ -347,6 +352,102 @@ xfs_reflink_reserve_cow_range(
> > >  	return error;
> > >  }
> > >  
> > > +/* Allocate all CoW reservations covering a range of blocks in a file. */
> > > +static int
> > > +__xfs_reflink_allocate_cow(
> > > +	struct xfs_inode	*ip,
> > > +	xfs_fileoff_t		*offset_fsb,
> > > +	xfs_fileoff_t		end_fsb)
> > > +{
> > > +	struct xfs_mount	*mp = ip->i_mount;
> > > +	struct xfs_bmbt_irec	imap;
> > > +	struct xfs_defer_ops	dfops;
> > > +	struct xfs_trans	*tp;
> > > +	xfs_fsblock_t		first_block;
> > > +	xfs_fileoff_t		next_fsb;
> > > +	int			nimaps = 1, error;
> > > +	bool			skipped = false;
> > > +
> > > +	xfs_defer_init(&dfops, &first_block);
> > > +
> > > +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0,
> > > +			XFS_TRANS_RESERVE, &tp);
> > > +	if (error)
> > > +		return error;
> > > +
> > > +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > > +
> > > +	next_fsb = *offset_fsb;
> > > +	error = __xfs_reflink_reserve_cow(ip, &next_fsb, end_fsb, &skipped);
> > > +	if (error)
> > > +		goto out_trans_cancel;
> > 
> > Do we really need to do the delayed allocation that results from this?
> > Couldn't we factor out the shared extent walking that allows us to just
> > perform the real allocations below?
> 
> The delayed reservation -> allocation two-step is necessary to create
> replacement extents that are aligned to the CoW extent size hint.  This is
> important for aligning extents in the same way as the regular extent
> size hint, and critical for detecting random writes and landing them all
> in as close to a contiguous physical extent as possible.  This helps us
> to reduce cow-related fragmentation to manageable levels, which is
> necessary to avoid ENOMEM problems with the current incore extent tree.
> 

The cow extent size hint thing makes sense, but I don't see why we need
to do delayed allocation to incorporate it. Can we not accommodate a cow
extent size hint for a real allocation in the cow fork the same way a
direct write accommodates a traditional extent size hint in the data
fork? In fact, we've had logic for a while now that explicitly avoids
delayed allocation when a traditional extent size hint is set.

> Reducing fragmentation also helps us avoid problems seen on some other
> filesystem where reflinking of a 64G root image takes minutes after a
> couple of weeks of normal operations because the average extent size is
> now 2 blocks.
> 
> (By contrast we're still averaging ~800 blocks per extent.)
> 
> > It looks like speculative preallocation for dio is at least one strange
> > side effect that can result from this...
> 
> Christoph separated the delalloc reservation into separate functions for
> the data fork and the CoW fork.  xfs_file_iomap_begin_delay() is for the
> data fork (and does speculative prealloc), whereas
> __xfs_reflink_reserve_cow() is for the CoW fork and doesn't know about
> speculative prealloc.
> 

Ah, right. Then there's a bit of boilerplate code in
__xfs_reflink_reserve_cow() associated with 'orig_end_fsb' that can be
removed.

> > > +
> > > +	if (skipped) {
> > > +		*offset_fsb = next_fsb;
> > > +		goto out_trans_cancel;
> > > +	}
> > > +
> > > +	xfs_trans_ijoin(tp, ip, 0);
> > > +	error = xfs_bmapi_write(tp, ip, *offset_fsb, next_fsb - *offset_fsb,
> > > +			XFS_BMAPI_COWFORK, &first_block,
> > > +			XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK),
> > > +			&imap, &nimaps, &dfops);
> > > +	if (error)
> > > +		goto out_trans_cancel;
> > 
> > Should we be using unwritten extents (BMAPI_PREALLOC) to avoid stale
> > data exposure similar to traditional direct write (or is the cow fork
> > extent never accessible until it is remapped)?
> 
> Correct.  CoW fork extents are not accessible until after remapping.
> 

Got it, thanks.

Brian

> --D
> 
> > 
> > Brian
> > 
> > > +
> > > +	/* We might not have been able to map the whole delalloc extent */
> > > +	*offset_fsb = min(*offset_fsb + imap.br_blockcount, next_fsb);
> > > +
> > > +	error = xfs_defer_finish(&tp, &dfops, NULL);
> > > +	if (error)
> > > +		goto out_trans_cancel;
> > > +
> > > +	error = xfs_trans_commit(tp);
> > > +
> > > +out_unlock:
> > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > +	return error;
> > > +out_trans_cancel:
> > > +	xfs_defer_cancel(&dfops);
> > > +	xfs_trans_cancel(tp);
> > > +	goto out_unlock;
> > > +}
> > > +
> > > +/* Allocate all CoW reservations covering a part of a file. */
> > > +int
> > > +xfs_reflink_allocate_cow_range(
> > > +	struct xfs_inode	*ip,
> > > +	xfs_off_t		offset,
> > > +	xfs_off_t		count)
> > > +{
> > > +	struct xfs_mount	*mp = ip->i_mount;
> > > +	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
> > > +	xfs_fileoff_t		end_fsb = XFS_B_TO_FSB(mp, offset + count);
> > > +	int			error;
> > > +
> > > +	ASSERT(xfs_is_reflink_inode(ip));
> > > +
> > > +	trace_xfs_reflink_allocate_cow_range(ip, offset, count);
> > > +
> > > +	/*
> > > +	 * Make sure that the dquots are there.
> > > +	 */
> > > +	error = xfs_qm_dqattach(ip, 0);
> > > +	if (error)
> > > +		return error;
> > > +
> > > +	while (offset_fsb < end_fsb) {
> > > +		error = __xfs_reflink_allocate_cow(ip, &offset_fsb, end_fsb);
> > > +		if (error) {
> > > +			trace_xfs_reflink_allocate_cow_range_error(ip, error,
> > > +					_RET_IP_);
> > > +			break;
> > > +		}
> > > +	}
> > > +
> > > +	return error;
> > > +}
> > > +
> > >  /*
> > >   * Find the CoW reservation (and whether or not it needs block allocation)
> > >   * for a given byte offset of a file.
> > > diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
> > > index bffa4be..c0c989a 100644
> > > --- a/fs/xfs/xfs_reflink.h
> > > +++ b/fs/xfs/xfs_reflink.h
> > > @@ -28,6 +28,8 @@ extern int xfs_reflink_trim_around_shared(struct xfs_inode *ip,
> > >  
> > >  extern int xfs_reflink_reserve_cow_range(struct xfs_inode *ip,
> > >  		xfs_off_t offset, xfs_off_t count);
> > > +extern int xfs_reflink_allocate_cow_range(struct xfs_inode *ip,
> > > +		xfs_off_t offset, xfs_off_t count);
> > >  extern bool xfs_reflink_find_cow_mapping(struct xfs_inode *ip, xfs_off_t offset,
> > >  		struct xfs_bmbt_irec *imap, bool *need_alloc);
> > >  extern int xfs_reflink_trim_irec_to_next_cow(struct xfs_inode *ip,
> > > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > > index 7612096..8e89223 100644
> > > --- a/fs/xfs/xfs_trace.h
> > > +++ b/fs/xfs/xfs_trace.h
> > > @@ -3332,7 +3332,6 @@ DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_enospc);
> > >  
> > >  DEFINE_RW_EVENT(xfs_reflink_reserve_cow_range);
> > >  DEFINE_RW_EVENT(xfs_reflink_allocate_cow_range);
> > > -DEFINE_INODE_IREC_EVENT(xfs_reflink_allocate_cow_extent);
> > >  
> > >  DEFINE_INODE_IREC_EVENT(xfs_reflink_bounce_dio_write);
> > >  DEFINE_IOMAP_EVENT(xfs_reflink_find_cow_mapping);
> > > 

* Re: [PATCH 38/63] xfs: cancel CoW reservations and clear inode reflink flag when freeing blocks
  2016-09-30  3:09 ` [PATCH 38/63] xfs: cancel CoW reservations and clear inode reflink flag when freeing blocks Darrick J. Wong
  2016-09-30  7:47   ` Christoph Hellwig
@ 2016-10-06 16:44   ` Brian Foster
  2016-10-07  0:40     ` Darrick J. Wong
  1 sibling, 1 reply; 187+ messages in thread
From: Brian Foster @ 2016-10-06 16:44 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs

On Thu, Sep 29, 2016 at 08:09:46PM -0700, Darrick J. Wong wrote:
> When we're freeing blocks (truncate, punch, etc.), clear all CoW
> reservations in the range being freed.  If the file block count
> drops to zero, also clear the inode reflink flag.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/xfs_inode.c |   13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 0c25a76..8c971fd 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -49,6 +49,7 @@
>  #include "xfs_trans_priv.h"
>  #include "xfs_log.h"
>  #include "xfs_bmap_btree.h"
> +#include "xfs_reflink.h"
>  
>  kmem_zone_t *xfs_inode_zone;
>  
> @@ -1586,6 +1587,18 @@ xfs_itruncate_extents(
>  			goto out;
>  	}
>  
> +	/* Remove all pending CoW reservations. */
> +	error = xfs_reflink_cancel_cow_blocks(ip, &tp, first_unmap_block,
> +			last_block);
> +	if (error)
> +		goto out;
> +
> +	/*
> +	 * Clear the reflink flag if we truncated everything.
> +	 */
> +	if (ip->i_d.di_nblocks == 0 && xfs_is_reflink_inode(ip))
> +		ip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
> +

So this implies the reflink flag is more of an optimization than an
accurate assessment of reflink state (e.g., we can have DIFLAG2_REFLINK
w/o shared extents, but we should never have shared extents w/o the
flag) because this wouldn't clear the flag on a partial truncate that
happened to unmap all of the shared extents from the file.

That seems fine, I'll just note there's one small potential conflict
ahead where we (dis)allow setting the realtime state based on the
reflink flag.

Brian

>  	/*
>  	 * Always re-log the inode so that our permanent transaction can keep
>  	 * on rolling it forward in the log.
> 

* Re: [PATCH 39/63] xfs: cancel pending CoW reservations when destroying inodes
  2016-09-30  3:09 ` [PATCH 39/63] xfs: cancel pending CoW reservations when destroying inodes Darrick J. Wong
  2016-09-30  7:47   ` Christoph Hellwig
@ 2016-10-06 16:44   ` Brian Foster
  2016-10-07  0:42     ` Darrick J. Wong
  1 sibling, 1 reply; 187+ messages in thread
From: Brian Foster @ 2016-10-06 16:44 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs

On Thu, Sep 29, 2016 at 08:09:52PM -0700, Darrick J. Wong wrote:
> When destroying the inode, cancel all pending reservations in the CoW
> fork so that all the reserved blocks go back to the free pile.  In
> theory this sort of cleanup is only needed to clean up after write
> errors.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/xfs_super.c |    8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> 
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 204b794..26b45b3 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -49,6 +49,7 @@
>  #include "xfs_rmap_item.h"
>  #include "xfs_refcount_item.h"
>  #include "xfs_bmap_item.h"
> +#include "xfs_reflink.h"
>  
>  #include <linux/namei.h>
>  #include <linux/init.h>
> @@ -938,6 +939,7 @@ xfs_fs_destroy_inode(
>  	struct inode		*inode)
>  {
>  	struct xfs_inode	*ip = XFS_I(inode);
> +	int			error;
>  
>  	trace_xfs_destroy_inode(ip);
>  
> @@ -945,6 +947,12 @@ xfs_fs_destroy_inode(
>  	XFS_STATS_INC(ip->i_mount, vn_rele);
>  	XFS_STATS_INC(ip->i_mount, vn_remove);
>  
> +	error = xfs_reflink_cancel_cow_range(ip, 0, NULLFILEOFF);
> +	if (error && !XFS_FORCED_SHUTDOWN(ip->i_mount))
> +		xfs_warn(ip->i_mount, "Error %d while evicting CoW blocks "
> +				"for inode %llu.",
> +				error, ip->i_ino);
> +

We don't actually check the inode reflink flag until down in
xfs_reflink_cancel_cow_blocks(), after we've allocated a transaction.
That seems potentially heavyweight for reclaim: if there's a
performance impact (which I haven't tested), it would affect all
filesystems afaict (i.e., regardless of whether reflink is supported).

Further, if this isn't a reflink inode, it looks like we actually commit
the transaction rather than cancel it as well.

Brian

>  	xfs_inactive(ip);
>  
>  	ASSERT(XFS_FORCED_SHUTDOWN(ip->i_mount) || ip->i_delayed_blks == 0);
> 

* Re: [PATCH 38/63] xfs: cancel CoW reservations and clear inode reflink flag when freeing blocks
  2016-10-06 16:44   ` Brian Foster
@ 2016-10-07  0:40     ` Darrick J. Wong
  0 siblings, 0 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-10-07  0:40 UTC (permalink / raw)
  To: Brian Foster; +Cc: david, linux-xfs

On Thu, Oct 06, 2016 at 12:44:33PM -0400, Brian Foster wrote:
> On Thu, Sep 29, 2016 at 08:09:46PM -0700, Darrick J. Wong wrote:
> > When we're freeing blocks (truncate, punch, etc.), clear all CoW
> > reservations in the range being freed.  If the file block count
> > drops to zero, also clear the inode reflink flag.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/xfs_inode.c |   13 +++++++++++++
> >  1 file changed, 13 insertions(+)
> > 
> > 
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index 0c25a76..8c971fd 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -49,6 +49,7 @@
> >  #include "xfs_trans_priv.h"
> >  #include "xfs_log.h"
> >  #include "xfs_bmap_btree.h"
> > +#include "xfs_reflink.h"
> >  
> >  kmem_zone_t *xfs_inode_zone;
> >  
> > @@ -1586,6 +1587,18 @@ xfs_itruncate_extents(
> >  			goto out;
> >  	}
> >  
> > +	/* Remove all pending CoW reservations. */
> > +	error = xfs_reflink_cancel_cow_blocks(ip, &tp, first_unmap_block,
> > +			last_block);
> > +	if (error)
> > +		goto out;
> > +
> > +	/*
> > +	 * Clear the reflink flag if we truncated everything.
> > +	 */
> > +	if (ip->i_d.di_nblocks == 0 && xfs_is_reflink_inode(ip))
> > +		ip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
> > +
> 
> So this implies the reflink flag is more of an optimization than an
> accurate assessment of reflink state (e.g., we can have DIFLAG2_REFLINK
> w/o shared extents, but we should never have shared extents w/o the
> flag) because this wouldn't clear the flag on a partial truncate that
> happened to unmap all of the shared extents from the file.

Yes, not to mention that (bugs notwithstanding) all the CoW stuff
shuts off if the inode flag isn't set.

> That seems fine, I'll just note there's one small potential conflict
> ahead where we (dis)allow setting the realtime state based on the
> reflink flag.

Hmm.  You're right, if we're able to set the rt flag (i.e. data size is zero)
then we ought to just clear the reflink flag.

--D

> 
> Brian
> 
> >  	/*
> >  	 * Always re-log the inode so that our permanent transaction can keep
> >  	 * on rolling it forward in the log.
> > 

* Re: [PATCH 39/63] xfs: cancel pending CoW reservations when destroying inodes
  2016-10-06 16:44   ` Brian Foster
@ 2016-10-07  0:42     ` Darrick J. Wong
  0 siblings, 0 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-10-07  0:42 UTC (permalink / raw)
  To: Brian Foster; +Cc: david, linux-xfs

On Thu, Oct 06, 2016 at 12:44:39PM -0400, Brian Foster wrote:
> On Thu, Sep 29, 2016 at 08:09:52PM -0700, Darrick J. Wong wrote:
> > When destroying the inode, cancel all pending reservations in the CoW
> > fork so that all the reserved blocks go back to the free pile.  In
> > theory this sort of cleanup is only needed to clean up after write
> > errors.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/xfs_super.c |    8 ++++++++
> >  1 file changed, 8 insertions(+)
> > 
> > 
> > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > index 204b794..26b45b3 100644
> > --- a/fs/xfs/xfs_super.c
> > +++ b/fs/xfs/xfs_super.c
> > @@ -49,6 +49,7 @@
> >  #include "xfs_rmap_item.h"
> >  #include "xfs_refcount_item.h"
> >  #include "xfs_bmap_item.h"
> > +#include "xfs_reflink.h"
> >  
> >  #include <linux/namei.h>
> >  #include <linux/init.h>
> > @@ -938,6 +939,7 @@ xfs_fs_destroy_inode(
> >  	struct inode		*inode)
> >  {
> >  	struct xfs_inode	*ip = XFS_I(inode);
> > +	int			error;
> >  
> >  	trace_xfs_destroy_inode(ip);
> >  
> > @@ -945,6 +947,12 @@ xfs_fs_destroy_inode(
> >  	XFS_STATS_INC(ip->i_mount, vn_rele);
> >  	XFS_STATS_INC(ip->i_mount, vn_remove);
> >  
> > +	error = xfs_reflink_cancel_cow_range(ip, 0, NULLFILEOFF);
> > +	if (error && !XFS_FORCED_SHUTDOWN(ip->i_mount))
> > +		xfs_warn(ip->i_mount, "Error %d while evicting CoW blocks "
> > +				"for inode %llu.",
> > +				error, ip->i_ino);
> > +
> 
> We don't actually check the inode reflink flag until down in
> xfs_reflink_cancel_cow_blocks(), after we've allocated a transaction.
> That seems potentially heavyweight for reclaim: if there's a
> performance impact (which I haven't tested), it would affect all
> filesystems afaict (i.e., regardless of whether reflink is supported).
> 
> Further, if this isn't a reflink inode, it looks like we actually commit
> the transaction rather than cancel it as well.

Eeek.  xfs_reflink_cancel_cow_range (or something) should jump out if
it's not a reflink inode.

--D

> 
> Brian
> 
> >  	xfs_inactive(ip);
> >  
> >  	ASSERT(XFS_FORCED_SHUTDOWN(ip->i_mount) || ip->i_delayed_blks == 0);
> > 

* Re: [PATCH 37/63] xfs: implement CoW for directio writes
  2016-10-06 12:20       ` Brian Foster
@ 2016-10-07  1:02         ` Darrick J. Wong
  2016-10-07  6:17           ` Christoph Hellwig
  2016-10-07 12:15           ` Brian Foster
  0 siblings, 2 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-10-07  1:02 UTC (permalink / raw)
  To: Brian Foster; +Cc: david, linux-xfs, Christoph Hellwig

On Thu, Oct 06, 2016 at 08:20:08AM -0400, Brian Foster wrote:
> On Wed, Oct 05, 2016 at 01:55:42PM -0700, Darrick J. Wong wrote:
> > On Wed, Oct 05, 2016 at 02:27:10PM -0400, Brian Foster wrote:
> > > On Thu, Sep 29, 2016 at 08:09:40PM -0700, Darrick J. Wong wrote:
> > > > For O_DIRECT writes to shared blocks, we have to CoW them just like
> > > > we would with buffered writes.  For writes that are not block-aligned,
> > > > just bounce them to the page cache.
> > > > 
> > > > For block-aligned writes, however, we can do better than that.  Use
> > > > the same mechanisms that we employ for buffered CoW to set up a
> > > > delalloc reservation, allocate all the blocks at once, issue the
> > > > writes against the new blocks and use the same ioend functions to
> > > > remap the blocks after the write.  This should be fairly performant.
> > > > 
> > > > Christoph discovered that xfs_reflink_allocate_cow_range may stumble
> > > > over invalid entries in the extent array given that it drops the ilock
> > > > but still expects the index to be stable.  Simply fixing it to do a
> > > > new lookup for every iteration still isn't correct given that
> > > > xfs_bmapi_allocate will trigger a BUG_ON() if hitting a hole, and
> > > > there is nothing preventing an xfs_bunmapi_cow call from removing
> > > > extents once we dropped the ilock either.
> > > > 
> > > > This patch duplicates the inner loop of xfs_bmapi_allocate into a
> > > > helper for xfs_reflink_allocate_cow_range so that it can be done under
> > > > the same ilock critical section as our CoW fork delayed allocation.
> > > > The directio CoW warts will be revisited in a later patch.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > > > ---
> > > > v2: Turns out that there's no way for xfs_end_io_direct_write to know
> > > > if the write completed successfully.  Therefore, do /not/ use the
> > > > ioend for dio cow post-processing; instead, move it to xfs_vm_do_dio
> > > > where we *can* tell if the write succeeded or not.
> > > > 
> > > > v3: Update the file size if we do a directio CoW across EOF.  This
> > > > can happen if the last block is shared, the cowextsize hint is set,
> > > > and we do a dio write past the end of the file.
> > > > 
> > > > v4: Christoph rewrote the allocate code to fix some concurrency
> > > > problems as part of migrating the code to support iomap.
> > > > ---
> > > >  fs/xfs/xfs_aops.c    |   91 +++++++++++++++++++++++++++++++++++++++----
> > > >  fs/xfs/xfs_file.c    |   20 ++++++++-
> > > >  fs/xfs/xfs_reflink.c |  107 +++++++++++++++++++++++++++++++++++++++++++++++++-
> > > >  fs/xfs/xfs_reflink.h |    2 +
> > > >  fs/xfs/xfs_trace.h   |    1 
> > > >  5 files changed, 208 insertions(+), 13 deletions(-)
> > > > 
> > > > 
> ...
> > > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > > index f99d7fa..025d52f 100644
> > > > --- a/fs/xfs/xfs_file.c
> > > > +++ b/fs/xfs/xfs_file.c
> > > > @@ -38,6 +38,7 @@
> > > >  #include "xfs_icache.h"
> > > >  #include "xfs_pnfs.h"
> > > >  #include "xfs_iomap.h"
> > > > +#include "xfs_reflink.h"
> > > >  
> > > >  #include <linux/dcache.h>
> > > >  #include <linux/falloc.h>
> > > > @@ -672,6 +673,13 @@ xfs_file_dio_aio_write(
> > > >  
> > > >  	trace_xfs_file_direct_write(ip, count, iocb->ki_pos);
> > > >  
> > > > +	/* If this is a block-aligned directio CoW, remap immediately. */
> > > > +	if (xfs_is_reflink_inode(ip) && !unaligned_io) {
> > > > +		ret = xfs_reflink_allocate_cow_range(ip, iocb->ki_pos, count);
> > > > +		if (ret)
> > > > +			goto out;
> > > > +	}
> > > 
> > > Is the fact that we do this allocation up front rather than via
> > > get_blocks() (like traditional direct write) one of the "warts" that
> > > needs cleaning, or for some other reason?
> > 
> > "Yes". :)
> > 
> > We do the allocation here because we know the exact size of the IO that
> > userspace is asking for, so we might as well do all the allocations
> > at once instead of repeatedly calling back into the allocator for each
> > shared segment that gets fed into get_blocks.  Sort of warty.
> > 
> > I think this could get moved to get_blocks, though TBH I've been
> > wondering if all this will just get replaced with iomap as part of
> > killing buffer heads.
> > 
> 
> Ok, kind of nasty with all of the various paths through get_blocks(),
> but hopefully that dies off with buffer heads.

It's possible Christoph might have further cleanup patches for reflink
that fix this?  <shrug>

> > > > +
> > > >  	data = *from;
> > > >  	ret = __blockdev_direct_IO(iocb, inode, target->bt_bdev, &data,
> > > >  			xfs_get_blocks_direct, xfs_end_io_direct_write,
> ...
> > > > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > > > index d913ad1..c95cdc3 100644
> > > > --- a/fs/xfs/xfs_reflink.c
> > > > +++ b/fs/xfs/xfs_reflink.c
> ...
> > > > @@ -347,6 +352,102 @@ xfs_reflink_reserve_cow_range(
> > > >  	return error;
> > > >  }
> > > >  
> > > > +/* Allocate all CoW reservations covering a range of blocks in a file. */
> > > > +static int
> > > > +__xfs_reflink_allocate_cow(
> > > > +	struct xfs_inode	*ip,
> > > > +	xfs_fileoff_t		*offset_fsb,
> > > > +	xfs_fileoff_t		end_fsb)
> > > > +{
> > > > +	struct xfs_mount	*mp = ip->i_mount;
> > > > +	struct xfs_bmbt_irec	imap;
> > > > +	struct xfs_defer_ops	dfops;
> > > > +	struct xfs_trans	*tp;
> > > > +	xfs_fsblock_t		first_block;
> > > > +	xfs_fileoff_t		next_fsb;
> > > > +	int			nimaps = 1, error;
> > > > +	bool			skipped = false;
> > > > +
> > > > +	xfs_defer_init(&dfops, &first_block);
> > > > +
> > > > +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0,
> > > > +			XFS_TRANS_RESERVE, &tp);
> > > > +	if (error)
> > > > +		return error;
> > > > +
> > > > +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > > > +
> > > > +	next_fsb = *offset_fsb;
> > > > +	error = __xfs_reflink_reserve_cow(ip, &next_fsb, end_fsb, &skipped);
> > > > +	if (error)
> > > > +		goto out_trans_cancel;
> > > 
> > > Do we really need to do the delayed allocation that results from this?
> > > Couldn't we factor out the shared extent walking that allows us to just
> > > perform the real allocations below?
> > 
> > The delayed reservation -> allocation two-step is necessary to create
> > replacement extents that are aligned to the CoW extent size hint.  This is
> > important for aligning extents in the same way as the regular extent
> > size hint, and critical for detecting random writes and landing them all
> > in as close to a contiguous physical extent as possible.  This helps us
> > to reduce cow-related fragmentation to manageable levels, which is
> > necessary to avoid ENOMEM problems with the current incore extent tree.
> > 
> 
> The cow extent size hint thing makes sense, but I don't see why we need
> to do delayed allocation to incorporate it. Can we not accommodate a cow
> extent size hint for a real allocation in the cow fork the same way a
> direct write accommodates a traditional extent size hint in the data
> fork? In fact, we've had logic for a while now that explicitly avoids
> delayed allocation when a traditional extent size hint is set.

Yes, that would have been another way to implement it.  I think I
finally see your point about using the delalloc mechanism -- since we've
converted the buffered write path to iomap and therefore know exactly
how much userspace wants to write in both buffered and directio cases,
we could just allocate the cow extent right then and there, skipping the
overhead of writing a delalloc reservation and then changing it.

For buffered writes, though, it's nice to be able to use the DA
mechanism so that we can ask the allocator for as big of an extent as we
have contiguous dirty pages.  Hm.  I guess for directio then we could
just fill in the holes directly and convert any delalloc reservations
that happen to be there already, which requires only a single loop.

Will ponder this some more, thx for the pushback. :)

> > Reducing fragmentation also helps us avoid problems seen on some other
> > filesystem where reflinking of a 64G root image takes minutes after a
> > couple of weeks of normal operations because the average extent size is
> > now 2 blocks.
> > 
> > (By contrast we're still averaging ~800 blocks per extent.)
> > 
> > > It looks like speculative preallocation for dio is at least one strange
> > > side effect that can result from this...
> > 
> > Christoph separated the delalloc reservation into separate functions for
> > the data fork and the CoW fork.  xfs_file_iomap_begin_delay() is for the
> > data fork (and does speculative prealloc), whereas
> > __xfs_reflink_reserve_cow() is for the CoW fork and doesn't know about
> > speculative prealloc.
> > 
> 
> Ah, right. Then there's a bit of boilerplate code in
> __xfs_reflink_reserve_cow() associated with 'orig_end_fsb' that can be
> removed.

The CoW extent size hint code will use orig_end_fsb to tag the inode
as potentially needing to gc any CoW leftovers during its periodic
scans.

--D

> 
> > > > +
> > > > +	if (skipped) {
> > > > +		*offset_fsb = next_fsb;
> > > > +		goto out_trans_cancel;
> > > > +	}
> > > > +
> > > > +	xfs_trans_ijoin(tp, ip, 0);
> > > > +	error = xfs_bmapi_write(tp, ip, *offset_fsb, next_fsb - *offset_fsb,
> > > > +			XFS_BMAPI_COWFORK, &first_block,
> > > > +			XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK),
> > > > +			&imap, &nimaps, &dfops);
> > > > +	if (error)
> > > > +		goto out_trans_cancel;
> > > 
> > > Should we be using unwritten extents (BMAPI_PREALLOC) to avoid stale
> > > data exposure similar to traditional direct write (or is the cow fork
> > > extent never accessible until it is remapped)?
> > 
> > Correct.  CoW fork extents are not accessible until after remapping.
> > 
> 
> Got it, thanks.
> 
> Brian
> 
> > --D
> > 
> > > 
> > > Brian
> > > 
> > > > +
> > > > +	/* We might not have been able to map the whole delalloc extent */
> > > > +	*offset_fsb = min(*offset_fsb + imap.br_blockcount, next_fsb);
> > > > +
> > > > +	error = xfs_defer_finish(&tp, &dfops, NULL);
> > > > +	if (error)
> > > > +		goto out_trans_cancel;
> > > > +
> > > > +	error = xfs_trans_commit(tp);
> > > > +
> > > > +out_unlock:
> > > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > +	return error;
> > > > +out_trans_cancel:
> > > > +	xfs_defer_cancel(&dfops);
> > > > +	xfs_trans_cancel(tp);
> > > > +	goto out_unlock;
> > > > +}
> > > > +
> > > > +/* Allocate all CoW reservations covering a part of a file. */
> > > > +int
> > > > +xfs_reflink_allocate_cow_range(
> > > > +	struct xfs_inode	*ip,
> > > > +	xfs_off_t		offset,
> > > > +	xfs_off_t		count)
> > > > +{
> > > > +	struct xfs_mount	*mp = ip->i_mount;
> > > > +	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
> > > > +	xfs_fileoff_t		end_fsb = XFS_B_TO_FSB(mp, offset + count);
> > > > +	int			error;
> > > > +
> > > > +	ASSERT(xfs_is_reflink_inode(ip));
> > > > +
> > > > +	trace_xfs_reflink_allocate_cow_range(ip, offset, count);
> > > > +
> > > > +	/*
> > > > +	 * Make sure that the dquots are there.
> > > > +	 */
> > > > +	error = xfs_qm_dqattach(ip, 0);
> > > > +	if (error)
> > > > +		return error;
> > > > +
> > > > +	while (offset_fsb < end_fsb) {
> > > > +		error = __xfs_reflink_allocate_cow(ip, &offset_fsb, end_fsb);
> > > > +		if (error) {
> > > > +			trace_xfs_reflink_allocate_cow_range_error(ip, error,
> > > > +					_RET_IP_);
> > > > +			break;
> > > > +		}
> > > > +	}
> > > > +
> > > > +	return error;
> > > > +}
> > > > +
> > > >  /*
> > > >   * Find the CoW reservation (and whether or not it needs block allocation)
> > > >   * for a given byte offset of a file.
> > > > diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
> > > > index bffa4be..c0c989a 100644
> > > > --- a/fs/xfs/xfs_reflink.h
> > > > +++ b/fs/xfs/xfs_reflink.h
> > > > @@ -28,6 +28,8 @@ extern int xfs_reflink_trim_around_shared(struct xfs_inode *ip,
> > > >  
> > > >  extern int xfs_reflink_reserve_cow_range(struct xfs_inode *ip,
> > > >  		xfs_off_t offset, xfs_off_t count);
> > > > +extern int xfs_reflink_allocate_cow_range(struct xfs_inode *ip,
> > > > +		xfs_off_t offset, xfs_off_t count);
> > > >  extern bool xfs_reflink_find_cow_mapping(struct xfs_inode *ip, xfs_off_t offset,
> > > >  		struct xfs_bmbt_irec *imap, bool *need_alloc);
> > > >  extern int xfs_reflink_trim_irec_to_next_cow(struct xfs_inode *ip,
> > > > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > > > index 7612096..8e89223 100644
> > > > --- a/fs/xfs/xfs_trace.h
> > > > +++ b/fs/xfs/xfs_trace.h
> > > > @@ -3332,7 +3332,6 @@ DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_enospc);
> > > >  
> > > >  DEFINE_RW_EVENT(xfs_reflink_reserve_cow_range);
> > > >  DEFINE_RW_EVENT(xfs_reflink_allocate_cow_range);
> > > > -DEFINE_INODE_IREC_EVENT(xfs_reflink_allocate_cow_extent);
> > > >  
> > > >  DEFINE_INODE_IREC_EVENT(xfs_reflink_bounce_dio_write);
> > > >  DEFINE_IOMAP_EVENT(xfs_reflink_find_cow_mapping);
> > > > 
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > the body of a message to majordomo@vger.kernel.org
> > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 37/63] xfs: implement CoW for directio writes
  2016-10-07  1:02         ` Darrick J. Wong
@ 2016-10-07  6:17           ` Christoph Hellwig
  2016-10-07 12:16             ` Brian Foster
  2016-10-07 12:15           ` Brian Foster
  1 sibling, 1 reply; 187+ messages in thread
From: Christoph Hellwig @ 2016-10-07  6:17 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, david, linux-xfs, Christoph Hellwig

On Thu, Oct 06, 2016 at 06:02:25PM -0700, Darrick J. Wong wrote:
> > Ok, kind of nasty with all of the various paths through get_blocks(),
> > but hopefully that dies off with buffer heads.
> 
> It's possible Christoph might have further cleanup patches for reflink
> that fix this?  <shrug>

I do.  It's stuck on not getting one of the corner cases right at the
moment, and I had to take a break from it yesterday to go insane, but
I hope I'll be able to post it soon.

> Yes, that would have been another way to implement it.  I think I
> finally see your point about using the delalloc mechanism -- since we've
> converted the buffered write path to iomap and therefore know exactly
> how much userspace wants to write in both buffered and directio cases,
> we could just allocate the cow extent right then and there, skipping the
> overhead of writing a delalloc reservation and then changing it.
> 
> For buffered writes, though, it's nice to be able to use the DA
> mechanism so that we can ask the allocator for as big of an extent as we
> have contiguous dirty pages.  Hm.  I guess for directio then we could
> just fill in the holes directly and convert any delalloc reservations
> that happen to be there already, which requires only a single loop.
> 
> Will ponder this some more, thx for the pushback. :)

Having spent a lot of time with the COW and non-COW I/O paths
lately, here is my 2 cents: delalloc for the buffered write path
inherently makes sense and mirrors what we do for non-COW I/O.  I
have patches to rewrite parts of how we do it, but I see no
reason to change that we are doing delayed allocations.

For the direct I/O path there is absolutely no point in doing delayed
allocation and I plan to get rid of them for the next merge window.

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 37/63] xfs: implement CoW for directio writes
  2016-10-07  1:02         ` Darrick J. Wong
  2016-10-07  6:17           ` Christoph Hellwig
@ 2016-10-07 12:15           ` Brian Foster
  2016-10-13 18:14             ` Darrick J. Wong
  1 sibling, 1 reply; 187+ messages in thread
From: Brian Foster @ 2016-10-07 12:15 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, Christoph Hellwig

On Thu, Oct 06, 2016 at 06:02:25PM -0700, Darrick J. Wong wrote:
> On Thu, Oct 06, 2016 at 08:20:08AM -0400, Brian Foster wrote:
> > On Wed, Oct 05, 2016 at 01:55:42PM -0700, Darrick J. Wong wrote:
> > > On Wed, Oct 05, 2016 at 02:27:10PM -0400, Brian Foster wrote:
> > > > On Thu, Sep 29, 2016 at 08:09:40PM -0700, Darrick J. Wong wrote:
> > > > > For O_DIRECT writes to shared blocks, we have to CoW them just like
> > > > > we would with buffered writes.  For writes that are not block-aligned,
> > > > > just bounce them to the page cache.
> > > > > 
> > > > > For block-aligned writes, however, we can do better than that.  Use
> > > > > the same mechanisms that we employ for buffered CoW to set up a
> > > > > delalloc reservation, allocate all the blocks at once, issue the
> > > > > writes against the new blocks and use the same ioend functions to
> > > > > remap the blocks after the write.  This should be fairly performant.
> > > > > 
> > > > > Christoph discovered that xfs_reflink_allocate_cow_range may stumble
> > > > > over invalid entries in the extent array given that it drops the ilock
> > > > > but still expects the index to be stable.  Simply fixing it to do a
> > > > > new lookup for every iteration still isn't correct given that
> > > > > xfs_bmapi_allocate will trigger a BUG_ON() if it hits a hole, and
> > > > > there is nothing preventing an xfs_bunmapi_cow call from removing
> > > > > extents once we've dropped the ilock either.
> > > > > 
> > > > > This patch duplicates the inner loop of xfs_bmapi_allocate into a
> > > > > helper for xfs_reflink_allocate_cow_range so that it can be done under
> > > > > the same ilock critical section as our CoW fork delayed allocation.
> > > > > The directio CoW warts will be revisited in a later patch.
> > > > > 
> > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > > > > ---
> > > > > v2: Turns out that there's no way for xfs_end_io_direct_write to know
> > > > > if the write completed successfully.  Therefore, do /not/ use the
> > > > > ioend for dio cow post-processing; instead, move it to xfs_vm_do_dio
> > > > > where we *can* tell if the write succeeded or not.
> > > > > 
> > > > > v3: Update the file size if we do a directio CoW across EOF.  This
> > > > > can happen if the last block is shared, the cowextsize hint is set,
> > > > > and we do a dio write past the end of the file.
> > > > > 
> > > > > v4: Christoph rewrote the allocate code to fix some concurrency
> > > > > problems as part of migrating the code to support iomap.
> > > > > ---
> > > > >  fs/xfs/xfs_aops.c    |   91 +++++++++++++++++++++++++++++++++++++++----
> > > > >  fs/xfs/xfs_file.c    |   20 ++++++++-
> > > > >  fs/xfs/xfs_reflink.c |  107 +++++++++++++++++++++++++++++++++++++++++++++++++-
> > > > >  fs/xfs/xfs_reflink.h |    2 +
> > > > >  fs/xfs/xfs_trace.h   |    1 
> > > > >  5 files changed, 208 insertions(+), 13 deletions(-)
> > > > > 
> > > > > 
> > ...
...
> 
> > > > > +
> > > > >  	data = *from;
> > > > >  	ret = __blockdev_direct_IO(iocb, inode, target->bt_bdev, &data,
> > > > >  			xfs_get_blocks_direct, xfs_end_io_direct_write,
> > ...
> > > > > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > > > > index d913ad1..c95cdc3 100644
> > > > > --- a/fs/xfs/xfs_reflink.c
> > > > > +++ b/fs/xfs/xfs_reflink.c
> > ...
> > > > > @@ -347,6 +352,102 @@ xfs_reflink_reserve_cow_range(
> > > > >  	return error;
> > > > >  }
> > > > >  
> > > > > +/* Allocate all CoW reservations covering a range of blocks in a file. */
> > > > > +static int
> > > > > +__xfs_reflink_allocate_cow(
> > > > > +	struct xfs_inode	*ip,
> > > > > +	xfs_fileoff_t		*offset_fsb,
> > > > > +	xfs_fileoff_t		end_fsb)
> > > > > +{
> > > > > +	struct xfs_mount	*mp = ip->i_mount;
> > > > > +	struct xfs_bmbt_irec	imap;
> > > > > +	struct xfs_defer_ops	dfops;
> > > > > +	struct xfs_trans	*tp;
> > > > > +	xfs_fsblock_t		first_block;
> > > > > +	xfs_fileoff_t		next_fsb;
> > > > > +	int			nimaps = 1, error;
> > > > > +	bool			skipped = false;
> > > > > +
> > > > > +	xfs_defer_init(&dfops, &first_block);
> > > > > +
> > > > > +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0,
> > > > > +			XFS_TRANS_RESERVE, &tp);
> > > > > +	if (error)
> > > > > +		return error;
> > > > > +
> > > > > +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > > > > +
> > > > > +	next_fsb = *offset_fsb;
> > > > > +	error = __xfs_reflink_reserve_cow(ip, &next_fsb, end_fsb, &skipped);
> > > > > +	if (error)
> > > > > +		goto out_trans_cancel;
> > > > 
> > > > Do we really need to do the delayed allocation that results from this?
> > > > Couldn't we factor out the shared extent walking that allows us to just
> > > > perform the real allocations below?
> > > 
> > > The delayed reservation -> allocation two-step is necessary to create
> > > replacements that are aligned to the CoW extent size hint.  This is
> > > important for aligning extents in the same way as the regular extent
> > > size hint, and critical for detecting random writes and landing them all
> > > in as close to a contiguous physical extent as possible.  This helps us
> > > to reduce cow-related fragmentation to manageable levels, which is
> > > necessary to avoid ENOMEM problems with the current incore extent tree.
> > > 
> > 
> > The cow extent size hint thing makes sense, but I don't see why we need
> > to do delayed allocation to incorporate it. Can we not accommodate a cow
> > extent size hint for a real allocation in the cow fork the same way a
> > direct write accommodates a traditional extent size hint in the data
> > fork? In fact, we've had logic for a while now that explicitly avoids
> > delayed allocation when a traditional extent size hint is set.
> 
> Yes, that would have been another way to implement it.  I think I
> finally see your point about using the delalloc mechanism -- since we've
> converted the buffered write path to iomap and therefore know exactly
> how much userspace wants to write in both buffered and directio cases,
> we could just allocate the cow extent right then and there, skipping the
> overhead of writing a delalloc reservation and then changing it.
> 

Pretty much...

> For buffered writes, though, it's nice to be able to use the DA
> mechanism so that we can ask the allocator for as big of an extent as we
> have contiguous dirty pages.  Hm.  I guess for directio then we could
> just fill in the holes directly and convert any delalloc reservations
> that happen to be there already, which requires only a single loop.
> 

Sure. I'm basically just poking at why we appear to take a different
approach for each of the buffered/direct I/O mechanisms to the cow fork
as opposed to the data fork (with regard to block allocation, at least).

So using delayed allocation for cow buffered I/O certainly makes sense
to me for basically the same reasons we use it for normal buffered
I/O...

> Will ponder this some more, thx for the pushback. :)
> 
> > > Reducing fragmentation also helps us avoid problems seen on some other
> > > filesystem where reflinking of a 64G root image takes minutes after a
> > > couple of weeks of normal operations because the average extent size is
> > > now 2 blocks.
> > > 
> > > (By contrast we're still averaging ~800 blocks per extent.)
> > > 
> > > > It looks like speculative preallocation for dio is at least one strange
> > > > side effect that can result from this...
> > > 
> > > Christoph separated the delalloc reservation into separate functions for
> > > the data fork and the CoW fork.  xfs_file_iomap_begin_delay() is for the
> > > data fork (and does speculative prealloc), whereas
> > > __xfs_reflink_reserve_cow() is for the CoW fork and doesn't know about
> > > speculative prealloc.
> > > 
> > 
> > Ah, right. Then there's a bit of boilerplate code in
> > __xfs_reflink_reserve_cow() associated with 'orig_end_fsb' that can be
> > removed.
> 
> The CoW extent size hint code will use orig_end_fsb to tag the inode
> as potentially needing to gc any CoW leftovers during its periodic
> scans.
> 

Oops, missed that. Hmm, this seems like kind of confused behavior
overall because (I thought) an extent size hint should force aligned
(start and end) mapping of extents. In the normal case, extsz forces
real block allocation, but I don't think that was always the case so
I'll ignore that for the moment.

So here, we apply a (cow) extent size hint to a delayed allocation but
sort of treat it like speculative preallocation (or the allocsize mount
option) in that we try to trim off the end and retry the
request in the event of ENOSPC. AFAICT, xfs_bmapi_reserve_delalloc()
still does the start/end alignment for cow fork allocations, so really
how useful is a truncate and retry in this case? In fact, it looks like
*_reserve_delalloc() would just repeat the same allocation request again
because the cow extent size hint is still set...

Am I missing something?

Brian

> --D
> 
> > 
> > > > > +
> > > > > +	if (skipped) {
> > > > > +		*offset_fsb = next_fsb;
> > > > > +		goto out_trans_cancel;
> > > > > +	}
> > > > > +
> > > > > +	xfs_trans_ijoin(tp, ip, 0);
> > > > > +	error = xfs_bmapi_write(tp, ip, *offset_fsb, next_fsb - *offset_fsb,
> > > > > +			XFS_BMAPI_COWFORK, &first_block,
> > > > > +			XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK),
> > > > > +			&imap, &nimaps, &dfops);
> > > > > +	if (error)
> > > > > +		goto out_trans_cancel;
> > > > 
> > > > Should we be using unwritten extents (BMAPI_PREALLOC) to avoid stale
> > > > data exposure similar to traditional direct write (or is the cow fork
> > > > extent never accessible until it is remapped)?
> > > 
> > > Correct.  CoW fork extents are not accessible until after remapping.
> > > 
> > 
> > Got it, thanks.
> > 
> > Brian
> > 
> > > --D
> > > 
> > > > 
> > > > Brian
> > > > 
> > > > > +
> > > > > +	/* We might not have been able to map the whole delalloc extent */
> > > > > +	*offset_fsb = min(*offset_fsb + imap.br_blockcount, next_fsb);
> > > > > +
> > > > > +	error = xfs_defer_finish(&tp, &dfops, NULL);
> > > > > +	if (error)
> > > > > +		goto out_trans_cancel;
> > > > > +
> > > > > +	error = xfs_trans_commit(tp);
> > > > > +
> > > > > +out_unlock:
> > > > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > > +	return error;
> > > > > +out_trans_cancel:
> > > > > +	xfs_defer_cancel(&dfops);
> > > > > +	xfs_trans_cancel(tp);
> > > > > +	goto out_unlock;
> > > > > +}
> > > > > +
> > > > > +/* Allocate all CoW reservations covering a part of a file. */
> > > > > +int
> > > > > +xfs_reflink_allocate_cow_range(
> > > > > +	struct xfs_inode	*ip,
> > > > > +	xfs_off_t		offset,
> > > > > +	xfs_off_t		count)
> > > > > +{
> > > > > +	struct xfs_mount	*mp = ip->i_mount;
> > > > > +	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
> > > > > +	xfs_fileoff_t		end_fsb = XFS_B_TO_FSB(mp, offset + count);
> > > > > +	int			error;
> > > > > +
> > > > > +	ASSERT(xfs_is_reflink_inode(ip));
> > > > > +
> > > > > +	trace_xfs_reflink_allocate_cow_range(ip, offset, count);
> > > > > +
> > > > > +	/*
> > > > > +	 * Make sure that the dquots are there.
> > > > > +	 */
> > > > > +	error = xfs_qm_dqattach(ip, 0);
> > > > > +	if (error)
> > > > > +		return error;
> > > > > +
> > > > > +	while (offset_fsb < end_fsb) {
> > > > > +		error = __xfs_reflink_allocate_cow(ip, &offset_fsb, end_fsb);
> > > > > +		if (error) {
> > > > > +			trace_xfs_reflink_allocate_cow_range_error(ip, error,
> > > > > +					_RET_IP_);
> > > > > +			break;
> > > > > +		}
> > > > > +	}
> > > > > +
> > > > > +	return error;
> > > > > +}
> > > > > +
> > > > >  /*
> > > > >   * Find the CoW reservation (and whether or not it needs block allocation)
> > > > >   * for a given byte offset of a file.
> > > > > diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
> > > > > index bffa4be..c0c989a 100644
> > > > > --- a/fs/xfs/xfs_reflink.h
> > > > > +++ b/fs/xfs/xfs_reflink.h
> > > > > @@ -28,6 +28,8 @@ extern int xfs_reflink_trim_around_shared(struct xfs_inode *ip,
> > > > >  
> > > > >  extern int xfs_reflink_reserve_cow_range(struct xfs_inode *ip,
> > > > >  		xfs_off_t offset, xfs_off_t count);
> > > > > +extern int xfs_reflink_allocate_cow_range(struct xfs_inode *ip,
> > > > > +		xfs_off_t offset, xfs_off_t count);
> > > > >  extern bool xfs_reflink_find_cow_mapping(struct xfs_inode *ip, xfs_off_t offset,
> > > > >  		struct xfs_bmbt_irec *imap, bool *need_alloc);
> > > > >  extern int xfs_reflink_trim_irec_to_next_cow(struct xfs_inode *ip,
> > > > > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > > > > index 7612096..8e89223 100644
> > > > > --- a/fs/xfs/xfs_trace.h
> > > > > +++ b/fs/xfs/xfs_trace.h
> > > > > @@ -3332,7 +3332,6 @@ DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_enospc);
> > > > >  
> > > > >  DEFINE_RW_EVENT(xfs_reflink_reserve_cow_range);
> > > > >  DEFINE_RW_EVENT(xfs_reflink_allocate_cow_range);
> > > > > -DEFINE_INODE_IREC_EVENT(xfs_reflink_allocate_cow_extent);
> > > > >  
> > > > >  DEFINE_INODE_IREC_EVENT(xfs_reflink_bounce_dio_write);
> > > > >  DEFINE_IOMAP_EVENT(xfs_reflink_find_cow_mapping);
> > > > > 

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 37/63] xfs: implement CoW for directio writes
  2016-10-07  6:17           ` Christoph Hellwig
@ 2016-10-07 12:16             ` Brian Foster
  0 siblings, 0 replies; 187+ messages in thread
From: Brian Foster @ 2016-10-07 12:16 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Darrick J. Wong, david, linux-xfs

On Fri, Oct 07, 2016 at 08:17:29AM +0200, Christoph Hellwig wrote:
> On Thu, Oct 06, 2016 at 06:02:25PM -0700, Darrick J. Wong wrote:
> > > Ok, kind of nasty with all of the various paths through get_blocks(),
> > > but hopefully that dies off with buffer heads.
> > 
> > It's possible Christoph might have further cleanup patches for reflink
> > that fix this?  <shrug>
> 
> I do.  It's stuck on not getting one of the corner cases right at the
> moment, and I had to take a break from it yesterday to go insane, but
> I hope I'll be able to post it soon.
> 
> > Yes, that would have been another way to implement it.  I think I
> > finally see your point about using the delalloc mechanism -- since we've
> > converted the buffered write path to iomap and therefore know exactly
> > how much userspace wants to write in both buffered and directio cases,
> > we could just allocate the cow extent right then and there, skipping the
> > overhead of writing a delalloc reservation and then changing it.
> > 
> > For buffered writes, though, it's nice to be able to use the DA
> > mechanism so that we can ask the allocator for as big of an extent as we
> > have contiguous dirty pages.  Hm.  I guess for directio then we could
> > just fill in the holes directly and convert any delalloc reservations
> > that happen to be there already, which requires only a single loop.
> > 
> > Will ponder this some more, thx for the pushback. :)
> 
> Having spent a lot of time with the COW and non-COW I/O paths
> lately, here is my 2 cents: delalloc for the buffered write path
> inherently makes sense and mirrors what we do for non-COW I/O.  I
> have patches to rewrite parts of how we do it, but I see no
> reason to change that we are doing delayed allocations.
> 
> For the direct I/O path there is absolutely no point in doing delayed
> allocation and I plan to get rid of them for the next merge window.

Agreed, thanks.

Brian


^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 40/63] xfs: store in-progress CoW allocations in the refcount btree
  2016-09-30  3:09 ` [PATCH 40/63] xfs: store in-progress CoW allocations in the refcount btree Darrick J. Wong
  2016-09-30  7:49   ` Christoph Hellwig
@ 2016-10-07 18:04   ` Brian Foster
  2016-10-07 19:18     ` Darrick J. Wong
  1 sibling, 1 reply; 187+ messages in thread
From: Brian Foster @ 2016-10-07 18:04 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs

On Thu, Sep 29, 2016 at 08:09:59PM -0700, Darrick J. Wong wrote:
> Due to the way the CoW algorithm in XFS works, there's an interval
> during which blocks allocated to handle a CoW can be lost -- if the FS
> goes down after the blocks are allocated but before the block
> remapping takes place.  This is exacerbated by the cowextsz hint --
> allocated reservations can sit around for a while, waiting to get
> used.
> 
> Since the refcount btree doesn't normally store records with refcount
> of 1, we can use it to record these in-progress extents.  In-progress
> blocks cannot be shared because they're not user-visible, so there
> shouldn't be any conflicts with other programs.  This is a better
> solution than holding EFIs during writeback because (a) EFIs can't be
> relogged currently, (b) even if they could, EFIs are bound by
> available log space, which puts an unnecessary upper bound on how much
> CoW we can have in flight, and (c) we already have a mechanism to
> track blocks.
> 
> At mount time, read the refcount records and free anything we find
> with a refcount of 1 because those were in-progress when the FS went
> down.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
> v2: Use the deferred operations system to avoid deadlocks and blowing
> out the transaction reservation.  This allows us to unmap a CoW
> extent from the refcountbt and into a file atomically.
> ---
>  fs/xfs/libxfs/xfs_bmap.c     |   11 +
>  fs/xfs/libxfs/xfs_format.h   |    3 
>  fs/xfs/libxfs/xfs_refcount.c |  336 +++++++++++++++++++++++++++++++++++++++++-
>  fs/xfs/libxfs/xfs_refcount.h |   10 +
>  fs/xfs/xfs_mount.c           |   12 ++
>  fs/xfs/xfs_refcount_item.c   |   12 ++
>  fs/xfs/xfs_reflink.c         |  150 +++++++++++++++++++
>  fs/xfs/xfs_reflink.h         |    1 
>  fs/xfs/xfs_super.c           |    9 +
>  fs/xfs/xfs_trace.h           |    4 +
>  10 files changed, 537 insertions(+), 11 deletions(-)
> 
> 
...
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index c95cdc3..673ecc1 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
...
> @@ -772,3 +787,138 @@ xfs_reflink_end_cow(
>  	trace_xfs_reflink_end_cow_error(ip, error, _RET_IP_);
>  	return error;
>  }
...
> +STATIC int
> +xfs_reflink_recover_cow_ag(
> +	struct xfs_mount		*mp,
> +	xfs_agnumber_t			agno)
> +{
> +	struct xfs_trans		*tp;
> +	struct xfs_btree_cur		*cur;
> +	struct xfs_buf			*agbp;
> +	struct xfs_reflink_recovery	*rr, *n;
> +	struct list_head		debris;
> +	union xfs_btree_irec		low;
> +	union xfs_btree_irec		high;
> +	struct xfs_defer_ops		dfops;
> +	xfs_fsblock_t			fsb;
> +	int				error;
> +
> +	error = xfs_alloc_read_agf(mp, NULL, agno, 0, &agbp);
> +	if (error)
> +		return error;
> +	cur = xfs_refcountbt_init_cursor(mp, NULL, agbp, agno, NULL);
> +
> +	/* Find all the leftover CoW staging extents. */
> +	INIT_LIST_HEAD(&debris);
> +	memset(&low, 0, sizeof(low));
> +	memset(&high, 0, sizeof(high));
> +	low.rc.rc_startblock = 0;
> +	high.rc.rc_startblock = -1U;
> +	error = xfs_btree_query_range(cur, &low, &high,
> +			xfs_reflink_recover_extent, &debris);
> +	if (error)
> +		goto out_error;

Potential memory leak of debris list entries on error.

> +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
> +	xfs_buf_relse(agbp);
> +
> +	/* Now iterate the list to free the leftovers */
> +	list_for_each_entry(rr, &debris, rr_list) {
> +		/* Set up transaction. */
> +		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, 0, &tp);
> +		if (error)
> +			goto out_free;
> +
> +		trace_xfs_reflink_recover_extent(mp, agno, &rr->rr_rrec);
> +
> +		/* Free the orphan record */
> +		xfs_defer_init(&dfops, &fsb);
> +		fsb = XFS_AGB_TO_FSB(mp, agno, rr->rr_rrec.rc_startblock);
> +		error = xfs_refcount_free_cow_extent(mp, &dfops, fsb,
> +				rr->rr_rrec.rc_blockcount);
> +		if (error)
> +			goto out_defer;
> +
> +		/* Free the block. */
> +		xfs_bmap_add_free(mp, &dfops, fsb,
> +				rr->rr_rrec.rc_blockcount, NULL);
> +
> +		error = xfs_defer_finish(&tp, &dfops, NULL);
> +		if (error)
> +			goto out_defer;
> +
> +		error = xfs_trans_commit(tp);
> +		if (error)
> +			goto out_cancel;
> +	}
> +	goto out_free;

return 0 ?

Brian

> +
> +out_defer:
> +	xfs_defer_cancel(&dfops);
> +out_cancel:
> +	xfs_trans_cancel(tp);
> +
> +out_free:
> +	/* Free the leftover list */
> +	list_for_each_entry_safe(rr, n, &debris, rr_list) {
> +		list_del(&rr->rr_list);
> +		kmem_free(rr);
> +	}
> +
> +	return error;
> +
> +out_error:
> +	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
> +	xfs_buf_relse(agbp);
> +	return error;
> +}
> +
> +/*
> + * Free leftover CoW reservations that didn't get cleaned out.
> + */
> +int
> +xfs_reflink_recover_cow(
> +	struct xfs_mount	*mp)
> +{
> +	xfs_agnumber_t		agno;
> +	int			error = 0;
> +
> +	if (!xfs_sb_version_hasreflink(&mp->m_sb))
> +		return 0;
> +
> +	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
> +		error = xfs_reflink_recover_cow_ag(mp, agno);
> +		if (error)
> +			break;
> +	}
> +
> +	return error;
> +}
> diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
> index c0c989a..1d2f180 100644
> --- a/fs/xfs/xfs_reflink.h
> +++ b/fs/xfs/xfs_reflink.h
> @@ -42,5 +42,6 @@ extern int xfs_reflink_cancel_cow_range(struct xfs_inode *ip, xfs_off_t offset,
>  		xfs_off_t count);
>  extern int xfs_reflink_end_cow(struct xfs_inode *ip, xfs_off_t offset,
>  		xfs_off_t count);
> +extern int xfs_reflink_recover_cow(struct xfs_mount *mp);
>  
>  #endif /* __XFS_REFLINK_H */
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 26b45b3..e6aaa91 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1306,6 +1306,15 @@ xfs_fs_remount(
>  		xfs_restore_resvblks(mp);
>  		xfs_log_work_queue(mp);
>  		xfs_queue_eofblocks(mp);
> +
> +		/* Recover any CoW blocks that never got remapped. */
> +		error = xfs_reflink_recover_cow(mp);
> +		if (error) {
> +			xfs_err(mp,
> +	"Error %d recovering leftover CoW allocations.", error);
> +			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> +			return error;
> +		}
>  	}
>  
>  	/* rw -> ro */
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index 8e89223..ca0930b 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -2916,14 +2916,18 @@ DEFINE_AG_ERROR_EVENT(xfs_refcount_update_error);
>  /* refcount adjustment tracepoints */
>  DEFINE_AG_EXTENT_EVENT(xfs_refcount_increase);
>  DEFINE_AG_EXTENT_EVENT(xfs_refcount_decrease);
> +DEFINE_AG_EXTENT_EVENT(xfs_refcount_cow_increase);
> +DEFINE_AG_EXTENT_EVENT(xfs_refcount_cow_decrease);
>  DEFINE_REFCOUNT_TRIPLE_EXTENT_EVENT(xfs_refcount_merge_center_extents);
>  DEFINE_REFCOUNT_EXTENT_EVENT(xfs_refcount_modify_extent);
> +DEFINE_REFCOUNT_EXTENT_EVENT(xfs_reflink_recover_extent);
>  DEFINE_REFCOUNT_EXTENT_AT_EVENT(xfs_refcount_split_extent);
>  DEFINE_REFCOUNT_DOUBLE_EXTENT_EVENT(xfs_refcount_merge_left_extent);
>  DEFINE_REFCOUNT_DOUBLE_EXTENT_EVENT(xfs_refcount_merge_right_extent);
>  DEFINE_REFCOUNT_DOUBLE_EXTENT_AT_EVENT(xfs_refcount_find_left_extent);
>  DEFINE_REFCOUNT_DOUBLE_EXTENT_AT_EVENT(xfs_refcount_find_right_extent);
>  DEFINE_AG_ERROR_EVENT(xfs_refcount_adjust_error);
> +DEFINE_AG_ERROR_EVENT(xfs_refcount_adjust_cow_error);
>  DEFINE_AG_ERROR_EVENT(xfs_refcount_merge_center_extents_error);
>  DEFINE_AG_ERROR_EVENT(xfs_refcount_modify_extent_error);
>  DEFINE_AG_ERROR_EVENT(xfs_refcount_split_extent_error);
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 41/63] xfs: reflink extents from one file to another
  2016-09-30  3:10 ` [PATCH 41/63] xfs: reflink extents from one file to another Darrick J. Wong
  2016-09-30  7:50   ` Christoph Hellwig
@ 2016-10-07 18:04   ` Brian Foster
  2016-10-07 19:44     ` Darrick J. Wong
  1 sibling, 1 reply; 187+ messages in thread
From: Brian Foster @ 2016-10-07 18:04 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs

On Thu, Sep 29, 2016 at 08:10:05PM -0700, Darrick J. Wong wrote:
> Reflink extents from one file to another; that is to say, iteratively
> remove the mappings from the destination file, copy the mappings from
> the source file to the destination file, and increment the reference
> count of all the blocks that got remapped.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
> v2: Call xfs_defer_cancel before cancelling the transaction if the
> remap operation fails.  Use the deferred operations system to avoid
> deadlocks or blowing out the transaction reservation, and make the
> entire reflink operation atomic for each extent being remapped.  The
> destination file's i_size will be updated if necessary to avoid
> violating the assumption that there are no shared blocks past the EOF
> block.
> ---
>  fs/xfs/xfs_reflink.c |  425 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_reflink.h |    2 
>  2 files changed, 427 insertions(+)
> 
> 
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index 673ecc1..94c19fff 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -922,3 +922,428 @@ xfs_reflink_recover_cow(
>  
>  	return error;
>  }
...
> +/*
> + * Unmap a range of blocks from a file, then map other blocks into the hole.
> + * The range to unmap is (destoff : destoff + srcioff + irec->br_blockcount).
> + * The extent irec is mapped into dest at irec->br_startoff.
> + */
> +STATIC int
> +xfs_reflink_remap_extent(
> +	struct xfs_inode	*ip,
> +	struct xfs_bmbt_irec	*irec,
> +	xfs_fileoff_t		destoff,
> +	xfs_off_t		new_isize)
> +{
> +	struct xfs_mount	*mp = ip->i_mount;
> +	struct xfs_trans	*tp;
> +	xfs_fsblock_t		firstfsb;
> +	unsigned int		resblks;
> +	struct xfs_defer_ops	dfops;
> +	struct xfs_bmbt_irec	uirec;
> +	bool			real_extent;
> +	xfs_filblks_t		rlen;
> +	xfs_filblks_t		unmap_len;
> +	xfs_off_t		newlen;
> +	int			error;
> +
> +	unmap_len = irec->br_startoff + irec->br_blockcount - destoff;
> +	trace_xfs_reflink_punch_range(ip, destoff, unmap_len);
> +
> +	/* Only remap normal extents. */
> +	real_extent =  (irec->br_startblock != HOLESTARTBLOCK &&
> +			irec->br_startblock != DELAYSTARTBLOCK &&
> +			!ISUNWRITTEN(irec));
> +
> +	/* Start a rolling transaction to switch the mappings */
> +	resblks = XFS_EXTENTADD_SPACE_RES(ip->i_mount, XFS_DATA_FORK);
> +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0, &tp);
> +	if (error)
> +		goto out;
> +
> +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> +	xfs_trans_ijoin(tp, ip, 0);
> +
> +	/* If we're not just clearing space, then do we have enough quota? */
> +	if (real_extent) {
> +		error = xfs_trans_reserve_quota_nblks(tp, ip,
> +				irec->br_blockcount, 0, XFS_QMOPT_RES_REGBLKS);
> +		if (error)
> +			goto out_cancel;
> +	}
> +
> +	trace_xfs_reflink_remap(ip, irec->br_startoff,
> +				irec->br_blockcount, irec->br_startblock);
> +
> +	/* Unmap the old blocks in the data fork. */
> +	rlen = unmap_len;
> +	while (rlen) {
> +		xfs_defer_init(&dfops, &firstfsb);
> +		error = __xfs_bunmapi(tp, ip, destoff, &rlen, 0, 1,
> +				&firstfsb, &dfops);
> +		if (error)
> +			goto out_defer;
> +
> +		/* Trim the extent to whatever got unmapped. */
> +		uirec = *irec;
> +		xfs_trim_extent(&uirec, destoff + rlen, unmap_len - rlen);
> +		unmap_len = rlen;
> +
> +		/* If this isn't a real mapping, we're done. */
> +		if (!real_extent || uirec.br_blockcount == 0)
> +			goto next_extent;
> +

Any reason we couldn't reuse existing mechanisms for this? E.g., hole
punch the dest file range before we remap the source file extents. That
might change behavior in the event of a partial/failed reflink, but it's
not clear to me that matters.

> +		trace_xfs_reflink_remap(ip, uirec.br_startoff,
> +				uirec.br_blockcount, uirec.br_startblock);
> +
...
> +}
> +
> +/*
> + * Iteratively remap one file's extents (and holes) to another's.
> + */
> +STATIC int
> +xfs_reflink_remap_blocks(
> +	struct xfs_inode	*src,
> +	xfs_fileoff_t		srcoff,
> +	struct xfs_inode	*dest,
> +	xfs_fileoff_t		destoff,
> +	xfs_filblks_t		len,
> +	xfs_off_t		new_isize)
> +{
> +	struct xfs_bmbt_irec	imap;
> +	int			nimaps;
> +	int			error = 0;
> +	xfs_filblks_t		range_len;
> +
> +	/* drange = (destoff, destoff + len); srange = (srcoff, srcoff + len) */
> +	while (len) {
> +		trace_xfs_reflink_remap_blocks_loop(src, srcoff, len,
> +				dest, destoff);
> +		/* Read extent from the source file */
> +		nimaps = 1;
> +		xfs_ilock(src, XFS_ILOCK_EXCL);
> +		error = xfs_bmapi_read(src, srcoff, len, &imap, &nimaps, 0);
> +		xfs_iunlock(src, XFS_ILOCK_EXCL);
> +		if (error)
> +			goto err;
> +		ASSERT(nimaps == 1);
> +
> +		trace_xfs_reflink_remap_imap(src, srcoff, len, XFS_IO_OVERWRITE,
> +				&imap);
> +
> +		/* Translate imap into the destination file. */
> +		range_len = imap.br_startoff + imap.br_blockcount - srcoff;
> +		imap.br_startoff += destoff - srcoff;
> +

Just FYI... these are all unsigned vars...

Brian

> +		/* Clear dest from destoff to the end of imap and map it in. */
> +		error = xfs_reflink_remap_extent(dest, &imap, destoff,
> +				new_isize);
> +		if (error)
> +			goto err;
> +
> +		if (fatal_signal_pending(current)) {
> +			error = -EINTR;
> +			goto err;
> +		}
> +
> +		/* Advance drange/srange */
> +		srcoff += range_len;
> +		destoff += range_len;
> +		len -= range_len;
> +	}
> +
> +	return 0;
> +
> +err:
> +	trace_xfs_reflink_remap_blocks_error(dest, error, _RET_IP_);
> +	return error;
> +}
> +
> +/*
> + * Link a range of blocks from one file to another.
> + */
> +int
> +xfs_reflink_remap_range(
> +	struct xfs_inode	*src,
> +	xfs_off_t		srcoff,
> +	struct xfs_inode	*dest,
> +	xfs_off_t		destoff,
> +	xfs_off_t		len)
> +{
> +	struct xfs_mount	*mp = src->i_mount;
> +	xfs_fileoff_t		sfsbno, dfsbno;
> +	xfs_filblks_t		fsblen;
> +	int			error;
> +
> +	if (!xfs_sb_version_hasreflink(&mp->m_sb))
> +		return -EOPNOTSUPP;
> +
> +	if (XFS_FORCED_SHUTDOWN(mp))
> +		return -EIO;
> +
> +	/* Don't reflink realtime inodes */
> +	if (XFS_IS_REALTIME_INODE(src) || XFS_IS_REALTIME_INODE(dest))
> +		return -EINVAL;
> +
> +	trace_xfs_reflink_remap_range(src, srcoff, len, dest, destoff);
> +
> +	/* Lock both files against IO */
> +	if (src->i_ino == dest->i_ino) {
> +		xfs_ilock(src, XFS_IOLOCK_EXCL);
> +		xfs_ilock(src, XFS_MMAPLOCK_EXCL);
> +	} else {
> +		xfs_lock_two_inodes(src, dest, XFS_IOLOCK_EXCL);
> +		xfs_lock_two_inodes(src, dest, XFS_MMAPLOCK_EXCL);
> +	}
> +
> +	error = xfs_reflink_set_inode_flag(src, dest);
> +	if (error)
> +		goto out_error;
> +
> +	/*
> +	 * Invalidate the page cache so that we can clear any CoW mappings
> +	 * in the destination file.
> +	 */
> +	truncate_inode_pages_range(&VFS_I(dest)->i_data, destoff,
> +				   PAGE_ALIGN(destoff + len) - 1);
> +
> +	dfsbno = XFS_B_TO_FSBT(mp, destoff);
> +	sfsbno = XFS_B_TO_FSBT(mp, srcoff);
> +	fsblen = XFS_B_TO_FSB(mp, len);
> +	error = xfs_reflink_remap_blocks(src, sfsbno, dest, dfsbno, fsblen,
> +			destoff + len);
> +	if (error)
> +		goto out_error;
> +
> +	error = xfs_reflink_update_dest(dest, destoff + len);
> +	if (error)
> +		goto out_error;
> +
> +out_error:
> +	xfs_iunlock(src, XFS_MMAPLOCK_EXCL);
> +	xfs_iunlock(src, XFS_IOLOCK_EXCL);
> +	if (src->i_ino != dest->i_ino) {
> +		xfs_iunlock(dest, XFS_MMAPLOCK_EXCL);
> +		xfs_iunlock(dest, XFS_IOLOCK_EXCL);
> +	}
> +	if (error)
> +		trace_xfs_reflink_remap_range_error(dest, error, _RET_IP_);
> +	return error;
> +}
> diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
> index 1d2f180..c35ce29 100644
> --- a/fs/xfs/xfs_reflink.h
> +++ b/fs/xfs/xfs_reflink.h
> @@ -43,5 +43,7 @@ extern int xfs_reflink_cancel_cow_range(struct xfs_inode *ip, xfs_off_t offset,
>  extern int xfs_reflink_end_cow(struct xfs_inode *ip, xfs_off_t offset,
>  		xfs_off_t count);
>  extern int xfs_reflink_recover_cow(struct xfs_mount *mp);
> +extern int xfs_reflink_remap_range(struct xfs_inode *src, xfs_off_t srcoff,
> +		struct xfs_inode *dest, xfs_off_t destoff, xfs_off_t len);
>  
>  #endif /* __XFS_REFLINK_H */
> 


* Re: [PATCH 42/63] xfs: add clone file and clone range vfs functions
  2016-09-30  3:10 ` [PATCH 42/63] xfs: add clone file and clone range vfs functions Darrick J. Wong
  2016-09-30  7:51   ` Christoph Hellwig
@ 2016-10-07 18:04   ` Brian Foster
  2016-10-07 20:31     ` Darrick J. Wong
  1 sibling, 1 reply; 187+ messages in thread
From: Brian Foster @ 2016-10-07 18:04 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs

On Thu, Sep 29, 2016 at 08:10:14PM -0700, Darrick J. Wong wrote:
> Define two VFS functions which allow userspace to reflink a range of
> blocks between two files or to reflink one file's contents to another.
> These functions fit the new VFS ioctls that standardize the checking
> for the btrfs CLONE and CLONE RANGE ioctls.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
> v2: Plug into the VFS function pointers instead of handling ioctls
> directly.
> ---
>  fs/xfs/xfs_file.c |  142 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 142 insertions(+)
> 
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 025d52f..3db3f34 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -974,6 +974,146 @@ xfs_file_fallocate(
...
> +/* Hook up to the VFS reflink function */
> +STATIC int
> +xfs_file_share_range(
> +	struct file	*file_in,
> +	loff_t		pos_in,
> +	struct file	*file_out,
> +	loff_t		pos_out,
> +	u64		len)
> +{
...
> +	/* Wait for the completion of any pending IOs on srcfile */
> +	ret = xfs_file_wait_for_io(inode_in, pos_in, len);
> +	if (ret)
> +		goto out_unlock;
> +	ret = xfs_file_wait_for_io(inode_out, pos_out, len);
> +	if (ret)
> +		goto out_unlock;
> +
> +	ret = xfs_reflink_remap_range(XFS_I(inode_in), pos_in, XFS_I(inode_out),
> +			pos_out, len);
> +	if (ret < 0)
> +		goto out_unlock;
> +
> +out_unlock:

Nit: 'out:'

Brian

> +	return ret;
> +}
> +
> +STATIC ssize_t
> +xfs_file_copy_range(
> +	struct file	*file_in,
> +	loff_t		pos_in,
> +	struct file	*file_out,
> +	loff_t		pos_out,
> +	size_t		len,
> +	unsigned int	flags)
> +{
> +	int		error;
> +
> +	error = xfs_file_share_range(file_in, pos_in, file_out, pos_out,
> +				     len);
> +	if (error)
> +		return error;
> +	return len;
> +}
> +
> +STATIC int
> +xfs_file_clone_range(
> +	struct file	*file_in,
> +	loff_t		pos_in,
> +	struct file	*file_out,
> +	loff_t		pos_out,
> +	u64		len)
> +{
> +	return xfs_file_share_range(file_in, pos_in, file_out, pos_out,
> +				     len);
> +}
>  
>  STATIC int
>  xfs_file_open(
> @@ -1634,6 +1774,8 @@ const struct file_operations xfs_file_operations = {
>  	.release	= xfs_file_release,
>  	.fsync		= xfs_file_fsync,
>  	.fallocate	= xfs_file_fallocate,
> +	.copy_file_range = xfs_file_copy_range,
> +	.clone_file_range = xfs_file_clone_range,
>  };
>  
>  const struct file_operations xfs_dir_file_operations = {
> 


* Re: [PATCH 46/63] xfs: unshare a range of blocks via fallocate
  2016-09-30  3:10 ` [PATCH 46/63] xfs: unshare a range of blocks via fallocate Darrick J. Wong
  2016-09-30  7:54   ` Christoph Hellwig
@ 2016-10-07 18:05   ` Brian Foster
  2016-10-07 20:26     ` Darrick J. Wong
  1 sibling, 1 reply; 187+ messages in thread
From: Brian Foster @ 2016-10-07 18:05 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, Christoph Hellwig

On Thu, Sep 29, 2016 at 08:10:39PM -0700, Darrick J. Wong wrote:
> Unshare all shared extents if the user calls fallocate with the new
> unshare mode flag set, so that we can guarantee that a subsequent
> write will not ENOSPC.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> [hch: pass inode instead of file to xfs_reflink_dirty_range,
>       use iomap infrastructure for copy up]
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/xfs_file.c    |   10 ++
>  fs/xfs/xfs_reflink.c |  237 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_reflink.h |    2 
>  3 files changed, 247 insertions(+), 2 deletions(-)
> 
> 
...
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index 77ac810..065e836 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -1472,3 +1472,240 @@ xfs_reflink_remap_range(
>  		trace_xfs_reflink_remap_range_error(dest, error, _RET_IP_);
>  	return error;
>  }
...
> +/* Iterate the extents; if there are no reflinked blocks, clear the flag. */
> +STATIC int
> +xfs_reflink_try_clear_inode_flag(
> +	struct xfs_inode	*ip,
> +	xfs_off_t		old_isize)
> +{
> +	struct xfs_mount	*mp = ip->i_mount;
> +	struct xfs_trans	*tp;
> +	xfs_fileoff_t		fbno;
> +	xfs_filblks_t		end;
> +	xfs_agnumber_t		agno;
> +	xfs_agblock_t		agbno;
> +	xfs_extlen_t		aglen;
> +	xfs_agblock_t		rbno;
> +	xfs_extlen_t		rlen;
> +	struct xfs_bmbt_irec	map[2];
> +	int			nmaps;
> +	int			error = 0;
> +
> +	/* Start a rolling transaction to remove the mappings */
> +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, 0, &tp);
> +	if (error)
> +		return error;
> +
> +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> +	xfs_trans_ijoin(tp, ip, 0);
> +
> +	if (old_isize != i_size_read(VFS_I(ip)))
> +		goto cancel;
> +	if (!(ip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK))
> +		goto cancel;
> +

The code that has been merged is now different from this code :/, but
just a heads up that the code in the tree looks like it has another one
of those potentially blind transaction commit sequences between
xfs_reflink_try_clear_inode_flag() and xfs_reflink_clear_inode_flag().

It doesn't appear to be a problem in how it is actually used in this
patch, but for reference, I think it's better practice for lower level
functions like xfs_reflink_clear_inode_flag() to assert that the flag is
set and make it the responsibility of the caller to check for it and do
the right thing. Just my .02 though.

> +	fbno = 0;
> +	end = XFS_B_TO_FSB(mp, old_isize);
> +	while (end - fbno > 0) {
> +		nmaps = 1;
> +		/*
> +		 * Look for extents in the file.  Skip holes, delalloc, or
> +		 * unwritten extents; they can't be reflinked.
> +		 */
> +		error = xfs_bmapi_read(ip, fbno, end - fbno, map, &nmaps, 0);
> +		if (error)
> +			goto cancel;
> +		if (nmaps == 0)
> +			break;
> +		if (map[0].br_startblock == HOLESTARTBLOCK ||
> +		    map[0].br_startblock == DELAYSTARTBLOCK ||
> +		    ISUNWRITTEN(&map[0]))
> +			goto next;
> +
> +		map[1] = map[0];
> +		while (map[1].br_blockcount) {
> +			agno = XFS_FSB_TO_AGNO(mp, map[1].br_startblock);
> +			agbno = XFS_FSB_TO_AGBNO(mp, map[1].br_startblock);
> +			aglen = map[1].br_blockcount;
> +
> +			error = xfs_reflink_find_shared(mp, agno, agbno, aglen,
> +					&rbno, &rlen, false);
> +			if (error)
> +				goto cancel;
> +			/* Is there still a shared block here? */
> +			if (rlen > 0) {
> +				error = 0;
> +				goto cancel;
> +			}
> +
> +			map[1].br_blockcount -= aglen;
> +			map[1].br_startoff += aglen;
> +			map[1].br_startblock += aglen;

This is basically doing:

	map[1] = map[0];
	while (map[1].br_blockcount) {
		aglen = map[1].br_blockcount;
		...
		map[1].br_blockcount -= aglen;
	}

So the loop appears to be completely superfluous.

> +		}
> +
> +next:
> +		fbno = map[0].br_startoff + map[0].br_blockcount;
> +	}
> +
> +	/*
> +	 * We didn't find any shared blocks so turn off the reflink flag.
> +	 * First, get rid of any leftover CoW mappings.
> +	 */
> +	error = xfs_reflink_cancel_cow_blocks(ip, &tp, 0, NULLFILEOFF);
> +	if (error)
> +		goto cancel;
> +
> +	/* Clear the inode flag. */
> +	trace_xfs_reflink_unset_inode_flag(ip);
> +	ip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
> +	xfs_trans_ijoin(tp, ip, 0);
> +	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> +
> +	error = xfs_trans_commit(tp);
> +	if (error)
> +		goto out;
> +
> +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +	return 0;
> +cancel:
> +	xfs_trans_cancel(tp);
> +out:
> +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +	return error;
> +}
> +
> +/*
> + * Pre-COW all shared blocks within a given byte range of a file and turn off
> + * the reflink flag if we unshare all of the file's blocks.
> + */
> +int
> +xfs_reflink_unshare(
> +	struct xfs_inode	*ip,
> +	xfs_off_t		offset,
> +	xfs_off_t		len)
> +{
> +	struct xfs_mount	*mp = ip->i_mount;
> +	xfs_fileoff_t		fbno;
> +	xfs_filblks_t		end;
> +	xfs_off_t		old_isize, isize;
> +	int			error;
> +
> +	if (!xfs_is_reflink_inode(ip))
> +		return 0;
> +
> +	trace_xfs_reflink_unshare(ip, offset, len);
> +
> +	inode_dio_wait(VFS_I(ip));
> +
> +	/* Try to CoW the selected ranges */
> +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> +	fbno = XFS_B_TO_FSB(mp, offset);

XFS_B_TO_FSBT() ?

> +	old_isize = isize = i_size_read(VFS_I(ip));
> +	end = XFS_B_TO_FSB(mp, offset + len);
> +	error = xfs_reflink_dirty_extents(ip, fbno, end, isize);
> +	if (error)
> +		goto out_unlock;
> +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +
> +	/* Wait for the IO to finish */
> +	error = filemap_write_and_wait(VFS_I(ip)->i_mapping);
> +	if (error)
> +		goto out;
> +
> +	/* Turn off the reflink flag if we unshared the whole file */
> +	if (offset == 0 && len == isize) {

Isn't this valid if len is larger than isize (similar check in
xfs_reflink_try_clear_inode_flag() might defeat this as well)?

FWIW, this has a similar issue as the earlier truncate code in that we
might just unshare the shared regions and thus retain the flag.

Brian

> +		error = xfs_reflink_try_clear_inode_flag(ip, old_isize);
> +		if (error)
> +			goto out;
> +	}
> +
> +	return 0;
> +
> +out_unlock:
> +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +out:
> +	trace_xfs_reflink_unshare_error(ip, error, _RET_IP_);
> +	return error;
> +}
> diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
> index df82b20..ad4fc61 100644
> --- a/fs/xfs/xfs_reflink.h
> +++ b/fs/xfs/xfs_reflink.h
> @@ -48,5 +48,7 @@ extern int xfs_reflink_recover_cow(struct xfs_mount *mp);
>  extern int xfs_reflink_remap_range(struct xfs_inode *src, xfs_off_t srcoff,
>  		struct xfs_inode *dest, xfs_off_t destoff, xfs_off_t len,
>  		unsigned int flags);
> +extern int xfs_reflink_unshare(struct xfs_inode *ip, xfs_off_t offset,
> +		xfs_off_t len);
>  
>  #endif /* __XFS_REFLINK_H */
> 


* Re: [PATCH 40/63] xfs: store in-progress CoW allocations in the refcount btree
  2016-10-07 18:04   ` Brian Foster
@ 2016-10-07 19:18     ` Darrick J. Wong
  0 siblings, 0 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-10-07 19:18 UTC (permalink / raw)
  To: Brian Foster; +Cc: david, linux-xfs

On Fri, Oct 07, 2016 at 02:04:05PM -0400, Brian Foster wrote:
> On Thu, Sep 29, 2016 at 08:09:59PM -0700, Darrick J. Wong wrote:
> > Due to the way the CoW algorithm in XFS works, there's an interval
> > during which blocks allocated to handle a CoW can be lost -- if the FS
> > goes down after the blocks are allocated but before the block
> > remapping takes place.  This is exacerbated by the cowextsz hint --
> > allocated reservations can sit around for a while, waiting to get
> > used.
> > 
> > Since the refcount btree doesn't normally store records with refcount
> > of 1, we can use it to record these in-progress extents.  In-progress
> > blocks cannot be shared because they're not user-visible, so there
> > shouldn't be any conflicts with other programs.  This is a better
> > solution than holding EFIs during writeback because (a) EFIs can't be
> > relogged currently, (b) even if they could, EFIs are bound by
> > available log space, which puts an unnecessary upper bound on how much
> > CoW we can have in flight, and (c) we already have a mechanism to
> > track blocks.
> > 
> > At mount time, read the refcount records and free anything we find
> > with a refcount of 1 because those were in-progress when the FS went
> > down.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> > v2: Use the deferred operations system to avoid deadlocks and blowing
> > out the transaction reservation.  This allows us to unmap a CoW
> > extent from the refcountbt and into a file atomically.
> > ---
> >  fs/xfs/libxfs/xfs_bmap.c     |   11 +
> >  fs/xfs/libxfs/xfs_format.h   |    3 
> >  fs/xfs/libxfs/xfs_refcount.c |  336 +++++++++++++++++++++++++++++++++++++++++-
> >  fs/xfs/libxfs/xfs_refcount.h |   10 +
> >  fs/xfs/xfs_mount.c           |   12 ++
> >  fs/xfs/xfs_refcount_item.c   |   12 ++
> >  fs/xfs/xfs_reflink.c         |  150 +++++++++++++++++++
> >  fs/xfs/xfs_reflink.h         |    1 
> >  fs/xfs/xfs_super.c           |    9 +
> >  fs/xfs/xfs_trace.h           |    4 +
> >  10 files changed, 537 insertions(+), 11 deletions(-)
> > 
> > 
> ...
> > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > index c95cdc3..673ecc1 100644
> > --- a/fs/xfs/xfs_reflink.c
> > +++ b/fs/xfs/xfs_reflink.c
> ...
> > @@ -772,3 +787,138 @@ xfs_reflink_end_cow(
> >  	trace_xfs_reflink_end_cow_error(ip, error, _RET_IP_);
> >  	return error;
> >  }
> ...
> > +STATIC int
> > +xfs_reflink_recover_cow_ag(
> > +	struct xfs_mount		*mp,
> > +	xfs_agnumber_t			agno)
> > +{
> > +	struct xfs_trans		*tp;
> > +	struct xfs_btree_cur		*cur;
> > +	struct xfs_buf			*agbp;
> > +	struct xfs_reflink_recovery	*rr, *n;
> > +	struct list_head		debris;
> > +	union xfs_btree_irec		low;
> > +	union xfs_btree_irec		high;
> > +	struct xfs_defer_ops		dfops;
> > +	xfs_fsblock_t			fsb;
> > +	int				error;
> > +
> > +	error = xfs_alloc_read_agf(mp, NULL, agno, 0, &agbp);
> > +	if (error)
> > +		return error;
> > +	cur = xfs_refcountbt_init_cursor(mp, NULL, agbp, agno, NULL);
> > +
> > +	/* Find all the leftover CoW staging extents. */
> > +	INIT_LIST_HEAD(&debris);
> > +	memset(&low, 0, sizeof(low));
> > +	memset(&high, 0, sizeof(high));
> > +	low.rc.rc_startblock = 0;
> > +	high.rc.rc_startblock = -1U;
> > +	error = xfs_btree_query_range(cur, &low, &high,
> > +			xfs_reflink_recover_extent, &debris);
> > +	if (error)
> > +		goto out_error;
> 
> Potential memory leak of debris list entries on error.

Ugh, the error handling in this whole function needs reworking.

> > +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
> > +	xfs_buf_relse(agbp);
> > +
> > +	/* Now iterate the list to free the leftovers */
> > +	list_for_each_entry(rr, &debris, rr_list) {
> > +		/* Set up transaction. */
> > +		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, 0, &tp);
> > +		if (error)
> > +			goto out_free;
> > +
> > +		trace_xfs_reflink_recover_extent(mp, agno, &rr->rr_rrec);
> > +
> > +		/* Free the orphan record */
> > +		xfs_defer_init(&dfops, &fsb);
> > +		fsb = XFS_AGB_TO_FSB(mp, agno, rr->rr_rrec.rc_startblock);
> > +		error = xfs_refcount_free_cow_extent(mp, &dfops, fsb,
> > +				rr->rr_rrec.rc_blockcount);
> > +		if (error)
> > +			goto out_defer;
> > +
> > +		/* Free the block. */
> > +		xfs_bmap_add_free(mp, &dfops, fsb,
> > +				rr->rr_rrec.rc_blockcount, NULL);
> > +
> > +		error = xfs_defer_finish(&tp, &dfops, NULL);
> > +		if (error)
> > +			goto out_defer;
> > +
> > +		error = xfs_trans_commit(tp);
> > +		if (error)
> > +			goto out_cancel;
> > +	}
> > +	goto out_free;
> 
> return 0 ?

Yep.  Good catch.

--D

> 
> Brian
> 
> > +
> > +out_defer:
> > +	xfs_defer_cancel(&dfops);
> > +out_cancel:
> > +	xfs_trans_cancel(tp);
> > +
> > +out_free:
> > +	/* Free the leftover list */
> > +	list_for_each_entry_safe(rr, n, &debris, rr_list) {
> > +		list_del(&rr->rr_list);
> > +		kmem_free(rr);
> > +	}
> > +
> > +	return error;
> > +
> > +out_error:
> > +	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
> > +	xfs_buf_relse(agbp);
> > +	return error;
> > +}
> > +
> > +/*
> > + * Free leftover CoW reservations that didn't get cleaned out.
> > + */
> > +int
> > +xfs_reflink_recover_cow(
> > +	struct xfs_mount	*mp)
> > +{
> > +	xfs_agnumber_t		agno;
> > +	int			error = 0;
> > +
> > +	if (!xfs_sb_version_hasreflink(&mp->m_sb))
> > +		return 0;
> > +
> > +	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
> > +		error = xfs_reflink_recover_cow_ag(mp, agno);
> > +		if (error)
> > +			break;
> > +	}
> > +
> > +	return error;
> > +}
> > diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
> > index c0c989a..1d2f180 100644
> > --- a/fs/xfs/xfs_reflink.h
> > +++ b/fs/xfs/xfs_reflink.h
> > @@ -42,5 +42,6 @@ extern int xfs_reflink_cancel_cow_range(struct xfs_inode *ip, xfs_off_t offset,
> >  		xfs_off_t count);
> >  extern int xfs_reflink_end_cow(struct xfs_inode *ip, xfs_off_t offset,
> >  		xfs_off_t count);
> > +extern int xfs_reflink_recover_cow(struct xfs_mount *mp);
> >  
> >  #endif /* __XFS_REFLINK_H */
> > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > index 26b45b3..e6aaa91 100644
> > --- a/fs/xfs/xfs_super.c
> > +++ b/fs/xfs/xfs_super.c
> > @@ -1306,6 +1306,15 @@ xfs_fs_remount(
> >  		xfs_restore_resvblks(mp);
> >  		xfs_log_work_queue(mp);
> >  		xfs_queue_eofblocks(mp);
> > +
> > +		/* Recover any CoW blocks that never got remapped. */
> > +		error = xfs_reflink_recover_cow(mp);
> > +		if (error) {
> > +			xfs_err(mp,
> > +	"Error %d recovering leftover CoW allocations.", error);
> > +			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> > +			return error;
> > +		}
> >  	}
> >  
> >  	/* rw -> ro */
> > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > index 8e89223..ca0930b 100644
> > --- a/fs/xfs/xfs_trace.h
> > +++ b/fs/xfs/xfs_trace.h
> > @@ -2916,14 +2916,18 @@ DEFINE_AG_ERROR_EVENT(xfs_refcount_update_error);
> >  /* refcount adjustment tracepoints */
> >  DEFINE_AG_EXTENT_EVENT(xfs_refcount_increase);
> >  DEFINE_AG_EXTENT_EVENT(xfs_refcount_decrease);
> > +DEFINE_AG_EXTENT_EVENT(xfs_refcount_cow_increase);
> > +DEFINE_AG_EXTENT_EVENT(xfs_refcount_cow_decrease);
> >  DEFINE_REFCOUNT_TRIPLE_EXTENT_EVENT(xfs_refcount_merge_center_extents);
> >  DEFINE_REFCOUNT_EXTENT_EVENT(xfs_refcount_modify_extent);
> > +DEFINE_REFCOUNT_EXTENT_EVENT(xfs_reflink_recover_extent);
> >  DEFINE_REFCOUNT_EXTENT_AT_EVENT(xfs_refcount_split_extent);
> >  DEFINE_REFCOUNT_DOUBLE_EXTENT_EVENT(xfs_refcount_merge_left_extent);
> >  DEFINE_REFCOUNT_DOUBLE_EXTENT_EVENT(xfs_refcount_merge_right_extent);
> >  DEFINE_REFCOUNT_DOUBLE_EXTENT_AT_EVENT(xfs_refcount_find_left_extent);
> >  DEFINE_REFCOUNT_DOUBLE_EXTENT_AT_EVENT(xfs_refcount_find_right_extent);
> >  DEFINE_AG_ERROR_EVENT(xfs_refcount_adjust_error);
> > +DEFINE_AG_ERROR_EVENT(xfs_refcount_adjust_cow_error);
> >  DEFINE_AG_ERROR_EVENT(xfs_refcount_merge_center_extents_error);
> >  DEFINE_AG_ERROR_EVENT(xfs_refcount_modify_extent_error);
> >  DEFINE_AG_ERROR_EVENT(xfs_refcount_split_extent_error);
> > 


* Re: [PATCH 41/63] xfs: reflink extents from one file to another
  2016-10-07 18:04   ` Brian Foster
@ 2016-10-07 19:44     ` Darrick J. Wong
  2016-10-07 20:48       ` Brian Foster
  0 siblings, 1 reply; 187+ messages in thread
From: Darrick J. Wong @ 2016-10-07 19:44 UTC (permalink / raw)
  To: Brian Foster; +Cc: david, linux-xfs

On Fri, Oct 07, 2016 at 02:04:15PM -0400, Brian Foster wrote:
> On Thu, Sep 29, 2016 at 08:10:05PM -0700, Darrick J. Wong wrote:
> > Reflink extents from one file to another; that is to say, iteratively
> > remove the mappings from the destination file, copy the mappings from
> > the source file to the destination file, and increment the reference
> > count of all the blocks that got remapped.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> > v2: Call xfs_defer_cancel before cancelling the transaction if the
> > remap operation fails.  Use the deferred operations system to avoid
> > deadlocks or blowing out the transaction reservation, and make the
> > entire reflink operation atomic for each extent being remapped.  The
> > destination file's i_size will be updated if necessary to avoid
> > violating the assumption that there are no shared blocks past the EOF
> > block.
> > ---
> >  fs/xfs/xfs_reflink.c |  425 ++++++++++++++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/xfs_reflink.h |    2 
> >  2 files changed, 427 insertions(+)
> > 
> > 
> > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > index 673ecc1..94c19fff 100644
> > --- a/fs/xfs/xfs_reflink.c
> > +++ b/fs/xfs/xfs_reflink.c
> > @@ -922,3 +922,428 @@ xfs_reflink_recover_cow(
> >  
> >  	return error;
> >  }
> ...
> > +/*
> > + * Unmap a range of blocks from a file, then map other blocks into the hole.
> > + * The range to unmap is (destoff : destoff + srcioff + irec->br_blockcount).
> > + * The extent irec is mapped into dest at irec->br_startoff.
> > + */
> > +STATIC int
> > +xfs_reflink_remap_extent(
> > +	struct xfs_inode	*ip,
> > +	struct xfs_bmbt_irec	*irec,
> > +	xfs_fileoff_t		destoff,
> > +	xfs_off_t		new_isize)
> > +{
> > +	struct xfs_mount	*mp = ip->i_mount;
> > +	struct xfs_trans	*tp;
> > +	xfs_fsblock_t		firstfsb;
> > +	unsigned int		resblks;
> > +	struct xfs_defer_ops	dfops;
> > +	struct xfs_bmbt_irec	uirec;
> > +	bool			real_extent;
> > +	xfs_filblks_t		rlen;
> > +	xfs_filblks_t		unmap_len;
> > +	xfs_off_t		newlen;
> > +	int			error;
> > +
> > +	unmap_len = irec->br_startoff + irec->br_blockcount - destoff;
> > +	trace_xfs_reflink_punch_range(ip, destoff, unmap_len);
> > +
> > +	/* Only remap normal extents. */
> > +	real_extent =  (irec->br_startblock != HOLESTARTBLOCK &&
> > +			irec->br_startblock != DELAYSTARTBLOCK &&
> > +			!ISUNWRITTEN(irec));
> > +
> > +	/* Start a rolling transaction to switch the mappings */
> > +	resblks = XFS_EXTENTADD_SPACE_RES(ip->i_mount, XFS_DATA_FORK);
> > +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0, &tp);
> > +	if (error)
> > +		goto out;
> > +
> > +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > +	xfs_trans_ijoin(tp, ip, 0);
> > +
> > +	/* If we're not just clearing space, then do we have enough quota? */
> > +	if (real_extent) {
> > +		error = xfs_trans_reserve_quota_nblks(tp, ip,
> > +				irec->br_blockcount, 0, XFS_QMOPT_RES_REGBLKS);
> > +		if (error)
> > +			goto out_cancel;
> > +	}
> > +
> > +	trace_xfs_reflink_remap(ip, irec->br_startoff,
> > +				irec->br_blockcount, irec->br_startblock);
> > +
> > +	/* Unmap the old blocks in the data fork. */
> > +	rlen = unmap_len;
> > +	while (rlen) {
> > +		xfs_defer_init(&dfops, &firstfsb);
> > +		error = __xfs_bunmapi(tp, ip, destoff, &rlen, 0, 1,
> > +				&firstfsb, &dfops);
> > +		if (error)
> > +			goto out_defer;
> > +
> > +		/* Trim the extent to whatever got unmapped. */
> > +		uirec = *irec;
> > +		xfs_trim_extent(&uirec, destoff + rlen, unmap_len - rlen);
> > +		unmap_len = rlen;
> > +
> > +		/* If this isn't a real mapping, we're done. */
> > +		if (!real_extent || uirec.br_blockcount == 0)
> > +			goto next_extent;
> > +
> 
> Any reason we couldn't reuse existing mechanisms for this? E.g., hole
> punch the dest file range before we remap the source file extents. That
> might change behavior in the event of a partial/failed reflink, but it's
> not clear to me that matters.

It matters a lot for the dedupe operation -- the unmap and remap
operations must be atomic with each other so that if the dedupe
operation fails, the user will still see the same file contents after
reboot/recovery.  We don't want users to find their files suddenly full
of zeroes.

For reflink I suspect that you're right, but we already guarantee that
the user sees either the old contents or the new contents, so yay. :)

> 
> > +		trace_xfs_reflink_remap(ip, uirec.br_startoff,
> > +				uirec.br_blockcount, uirec.br_startblock);
> > +
> ...
> > +}
> > +
> > +/*
> > + * Iteratively remap one file's extents (and holes) to another's.
> > + */
> > +STATIC int
> > +xfs_reflink_remap_blocks(
> > +	struct xfs_inode	*src,
> > +	xfs_fileoff_t		srcoff,
> > +	struct xfs_inode	*dest,
> > +	xfs_fileoff_t		destoff,
> > +	xfs_filblks_t		len,
> > +	xfs_off_t		new_isize)
> > +{
> > +	struct xfs_bmbt_irec	imap;
> > +	int			nimaps;
> > +	int			error = 0;
> > +	xfs_filblks_t		range_len;
> > +
> > +	/* drange = (destoff, destoff + len); srange = (srcoff, srcoff + len) */
> > +	while (len) {
> > +		trace_xfs_reflink_remap_blocks_loop(src, srcoff, len,
> > +				dest, destoff);
> > +		/* Read extent from the source file */
> > +		nimaps = 1;
> > +		xfs_ilock(src, XFS_ILOCK_EXCL);
> > +		error = xfs_bmapi_read(src, srcoff, len, &imap, &nimaps, 0);
> > +		xfs_iunlock(src, XFS_ILOCK_EXCL);
> > +		if (error)
> > +			goto err;
> > +		ASSERT(nimaps == 1);
> > +
> > +		trace_xfs_reflink_remap_imap(src, srcoff, len, XFS_IO_OVERWRITE,
> > +				&imap);
> > +
> > +		/* Translate imap into the destination file. */
> > +		range_len = imap.br_startoff + imap.br_blockcount - srcoff;
> > +		imap.br_startoff += destoff - srcoff;
> > +
> 
> Just FYI... these are all unsigned vars...

Yeah.  It should handle that correctly.  See generic/30[34].

--D

> 
> Brian
> 
> > +		/* Clear dest from destoff to the end of imap and map it in. */
> > +		error = xfs_reflink_remap_extent(dest, &imap, destoff,
> > +				new_isize);
> > +		if (error)
> > +			goto err;
> > +
> > +		if (fatal_signal_pending(current)) {
> > +			error = -EINTR;
> > +			goto err;
> > +		}
> > +
> > +		/* Advance drange/srange */
> > +		srcoff += range_len;
> > +		destoff += range_len;
> > +		len -= range_len;
> > +	}
> > +
> > +	return 0;
> > +
> > +err:
> > +	trace_xfs_reflink_remap_blocks_error(dest, error, _RET_IP_);
> > +	return error;
> > +}
> > +
> > +/*
> > + * Link a range of blocks from one file to another.
> > + */
> > +int
> > +xfs_reflink_remap_range(
> > +	struct xfs_inode	*src,
> > +	xfs_off_t		srcoff,
> > +	struct xfs_inode	*dest,
> > +	xfs_off_t		destoff,
> > +	xfs_off_t		len)
> > +{
> > +	struct xfs_mount	*mp = src->i_mount;
> > +	xfs_fileoff_t		sfsbno, dfsbno;
> > +	xfs_filblks_t		fsblen;
> > +	int			error;
> > +
> > +	if (!xfs_sb_version_hasreflink(&mp->m_sb))
> > +		return -EOPNOTSUPP;
> > +
> > +	if (XFS_FORCED_SHUTDOWN(mp))
> > +		return -EIO;
> > +
> > +	/* Don't reflink realtime inodes */
> > +	if (XFS_IS_REALTIME_INODE(src) || XFS_IS_REALTIME_INODE(dest))
> > +		return -EINVAL;
> > +
> > +	trace_xfs_reflink_remap_range(src, srcoff, len, dest, destoff);
> > +
> > +	/* Lock both files against IO */
> > +	if (src->i_ino == dest->i_ino) {
> > +		xfs_ilock(src, XFS_IOLOCK_EXCL);
> > +		xfs_ilock(src, XFS_MMAPLOCK_EXCL);
> > +	} else {
> > +		xfs_lock_two_inodes(src, dest, XFS_IOLOCK_EXCL);
> > +		xfs_lock_two_inodes(src, dest, XFS_MMAPLOCK_EXCL);
> > +	}
> > +
> > +	error = xfs_reflink_set_inode_flag(src, dest);
> > +	if (error)
> > +		goto out_error;
> > +
> > +	/*
> > +	 * Invalidate the page cache so that we can clear any CoW mappings
> > +	 * in the destination file.
> > +	 */
> > +	truncate_inode_pages_range(&VFS_I(dest)->i_data, destoff,
> > +				   PAGE_ALIGN(destoff + len) - 1);
> > +
> > +	dfsbno = XFS_B_TO_FSBT(mp, destoff);
> > +	sfsbno = XFS_B_TO_FSBT(mp, srcoff);
> > +	fsblen = XFS_B_TO_FSB(mp, len);
> > +	error = xfs_reflink_remap_blocks(src, sfsbno, dest, dfsbno, fsblen,
> > +			destoff + len);
> > +	if (error)
> > +		goto out_error;
> > +
> > +	error = xfs_reflink_update_dest(dest, destoff + len);
> > +	if (error)
> > +		goto out_error;
> > +
> > +out_error:
> > +	xfs_iunlock(src, XFS_MMAPLOCK_EXCL);
> > +	xfs_iunlock(src, XFS_IOLOCK_EXCL);
> > +	if (src->i_ino != dest->i_ino) {
> > +		xfs_iunlock(dest, XFS_MMAPLOCK_EXCL);
> > +		xfs_iunlock(dest, XFS_IOLOCK_EXCL);
> > +	}
> > +	if (error)
> > +		trace_xfs_reflink_remap_range_error(dest, error, _RET_IP_);
> > +	return error;
> > +}
> > diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
> > index 1d2f180..c35ce29 100644
> > --- a/fs/xfs/xfs_reflink.h
> > +++ b/fs/xfs/xfs_reflink.h
> > @@ -43,5 +43,7 @@ extern int xfs_reflink_cancel_cow_range(struct xfs_inode *ip, xfs_off_t offset,
> >  extern int xfs_reflink_end_cow(struct xfs_inode *ip, xfs_off_t offset,
> >  		xfs_off_t count);
> >  extern int xfs_reflink_recover_cow(struct xfs_mount *mp);
> > +extern int xfs_reflink_remap_range(struct xfs_inode *src, xfs_off_t srcoff,
> > +		struct xfs_inode *dest, xfs_off_t destoff, xfs_off_t len);
> >  
> >  #endif /* __XFS_REFLINK_H */
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 46/63] xfs: unshare a range of blocks via fallocate
  2016-10-07 18:05   ` Brian Foster
@ 2016-10-07 20:26     ` Darrick J. Wong
  2016-10-07 20:58       ` Brian Foster
  0 siblings, 1 reply; 187+ messages in thread
From: Darrick J. Wong @ 2016-10-07 20:26 UTC (permalink / raw)
  To: Brian Foster; +Cc: david, linux-xfs, Christoph Hellwig

On Fri, Oct 07, 2016 at 02:05:07PM -0400, Brian Foster wrote:
> On Thu, Sep 29, 2016 at 08:10:39PM -0700, Darrick J. Wong wrote:
> > Unshare all shared extents if the user calls fallocate with the new
> > unshare mode flag set, so that we can guarantee that a subsequent
> > write will not ENOSPC.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > [hch: pass inode instead of file to xfs_reflink_dirty_range,
> >       use iomap infrastructure for copy up]
> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > ---
> >  fs/xfs/xfs_file.c    |   10 ++
> >  fs/xfs/xfs_reflink.c |  237 ++++++++++++++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/xfs_reflink.h |    2 
> >  3 files changed, 247 insertions(+), 2 deletions(-)
> > 
> > 
> ...
> > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > index 77ac810..065e836 100644
> > --- a/fs/xfs/xfs_reflink.c
> > +++ b/fs/xfs/xfs_reflink.c
> > @@ -1472,3 +1472,240 @@ xfs_reflink_remap_range(
> >  		trace_xfs_reflink_remap_range_error(dest, error, _RET_IP_);
> >  	return error;
> >  }
> ...
> > +/* Iterate the extents; if there are no reflinked blocks, clear the flag. */
> > +STATIC int
> > +xfs_reflink_try_clear_inode_flag(
> > +	struct xfs_inode	*ip,
> > +	xfs_off_t		old_isize)
> > +{
> > +	struct xfs_mount	*mp = ip->i_mount;
> > +	struct xfs_trans	*tp;
> > +	xfs_fileoff_t		fbno;
> > +	xfs_filblks_t		end;
> > +	xfs_agnumber_t		agno;
> > +	xfs_agblock_t		agbno;
> > +	xfs_extlen_t		aglen;
> > +	xfs_agblock_t		rbno;
> > +	xfs_extlen_t		rlen;
> > +	struct xfs_bmbt_irec	map[2];
> > +	int			nmaps;
> > +	int			error = 0;
> > +
> > +	/* Start a rolling transaction to remove the mappings */
> > +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, 0, &tp);
> > +	if (error)
> > +		return error;
> > +
> > +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > +	xfs_trans_ijoin(tp, ip, 0);
> > +
> > +	if (old_isize != i_size_read(VFS_I(ip)))
> > +		goto cancel;
> > +	if (!(ip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK))
> > +		goto cancel;
> > +
> 
> The code that has been merged is now different from this code :/, but
> just a heads up that the code in the tree looks like it has another one
> of those potentially blind transaction commit sequences between
> xfs_reflink_try_clear_inode_flag() and xfs_reflink_clear_inode_flag().

_reflink_unshare jumps out if it's not a reflink inode before
calling _reflink_try_clear_inode_flag -> _reflink_clear_inode_flag.
We do not call _reflink_clear_inode_flag with a non-reflink inode.
As for blindly committing a transaction with no dirty data, that's
fine: _trans_commit checks for that case and simply frees everything
attached to the transaction.

> It doesn't appear to be a problem in how it is actually used in this
> patch, but for reference, I think it's better practice for lower level
> functions like xfs_reflink_clear_inode_flag() to assert that the flag is
> set and make it the responsibility of the caller to check for it and do
> the right thing. Just my .02 though.

Ok, I'll add an assert.

> > +	fbno = 0;
> > +	end = XFS_B_TO_FSB(mp, old_isize);
> > +	while (end - fbno > 0) {
> > +		nmaps = 1;
> > +		/*
> > +		 * Look for extents in the file.  Skip holes, delalloc, or
> > +		 * unwritten extents; they can't be reflinked.
> > +		 */
> > +		error = xfs_bmapi_read(ip, fbno, end - fbno, map, &nmaps, 0);
> > +		if (error)
> > +			goto cancel;
> > +		if (nmaps == 0)
> > +			break;
> > +		if (map[0].br_startblock == HOLESTARTBLOCK ||
> > +		    map[0].br_startblock == DELAYSTARTBLOCK ||
> > +		    ISUNWRITTEN(&map[0]))
> > +			goto next;
> > +
> > +		map[1] = map[0];
> > +		while (map[1].br_blockcount) {
> > +			agno = XFS_FSB_TO_AGNO(mp, map[1].br_startblock);
> > +			agbno = XFS_FSB_TO_AGBNO(mp, map[1].br_startblock);
> > +			aglen = map[1].br_blockcount;
> > +
> > +			error = xfs_reflink_find_shared(mp, agno, agbno, aglen,
> > +					&rbno, &rlen, false);
> > +			if (error)
> > +				goto cancel;
> > +			/* Is there still a shared block here? */
> > +			if (rlen > 0) {
> > +				error = 0;
> > +				goto cancel;
> > +			}
> > +
> > +			map[1].br_blockcount -= aglen;
> > +			map[1].br_startoff += aglen;
> > +			map[1].br_startblock += aglen;
> 
> This is basically doing:
> 
> 	map[1] = map[0];
> 	while (map[1].br_blockcount) {
> 		aglen = map[1].br_blockcount;
> 		...
> 		map[1].br_blockcount -= aglen;
> 	}
> 
> So the loop appears to be completely superfluous.

<nod>

> 
> > +		}
> > +
> > +next:
> > +		fbno = map[0].br_startoff + map[0].br_blockcount;
> > +	}
> > +
> > +	/*
> > +	 * We didn't find any shared blocks so turn off the reflink flag.
> > +	 * First, get rid of any leftover CoW mappings.
> > +	 */
> > +	error = xfs_reflink_cancel_cow_blocks(ip, &tp, 0, NULLFILEOFF);
> > +	if (error)
> > +		goto cancel;
> > +
> > +	/* Clear the inode flag. */
> > +	trace_xfs_reflink_unset_inode_flag(ip);
> > +	ip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
> > +	xfs_trans_ijoin(tp, ip, 0);
> > +	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> > +
> > +	error = xfs_trans_commit(tp);
> > +	if (error)
> > +		goto out;
> > +
> > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > +	return 0;
> > +cancel:
> > +	xfs_trans_cancel(tp);
> > +out:
> > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > +	return error;
> > +}
> > +
> > +/*
> > + * Pre-COW all shared blocks within a given byte range of a file and turn off
> > + * the reflink flag if we unshare all of the file's blocks.
> > + */
> > +int
> > +xfs_reflink_unshare(
> > +	struct xfs_inode	*ip,
> > +	xfs_off_t		offset,
> > +	xfs_off_t		len)
> > +{
> > +	struct xfs_mount	*mp = ip->i_mount;
> > +	xfs_fileoff_t		fbno;
> > +	xfs_filblks_t		end;
> > +	xfs_off_t		old_isize, isize;
> > +	int			error;
> > +
> > +	if (!xfs_is_reflink_inode(ip))
> > +		return 0;
> > +
> > +	trace_xfs_reflink_unshare(ip, offset, len);
> > +
> > +	inode_dio_wait(VFS_I(ip));
> > +
> > +	/* Try to CoW the selected ranges */
> > +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > +	fbno = XFS_B_TO_FSB(mp, offset);
> 
> XFS_B_TO_FSBT() ?
> 
> > +	old_isize = isize = i_size_read(VFS_I(ip));
> > +	end = XFS_B_TO_FSB(mp, offset + len);
> > +	error = xfs_reflink_dirty_extents(ip, fbno, end, isize);
> > +	if (error)
> > +		goto out_unlock;
> > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > +
> > +	/* Wait for the IO to finish */
> > +	error = filemap_write_and_wait(VFS_I(ip)->i_mapping);
> > +	if (error)
> > +		goto out;
> > +
> > +	/* Turn off the reflink flag if we unshared the whole file */
> > +	if (offset == 0 && len == isize) {
> 
> Isn't this valid if len is larger than isize (similar check in
> xfs_reflink_try_clear_inode_flag() might defeat this as well)?
> 
> FWIW, this has a similar issue as the earlier truncate code in that we
> might just unshare the shared regions and thus retain the flag.

Yes, it is suboptimal for the flag to be set when there are no shared
extents.  I'm not sure when is a good opportunity to try to turn off the
flag -- certainly we don't want to do that after every CoW operation.
Doing it as part of a fallocate operation seems reasonable enough.
Prior to the removal of the UNSHARE flag it would try to clear the flag
any time the user asked for an unshare, but when I removed the flag from
the interface I decided we should only do that if the user fallocated
the entire file.

Now that UNSHARE has been re-added to the interface, I'll just take out
these weird checks.  Note that the upcoming online repair patchset will
try to unset the flag.

--D

> 
> Brian
> 
> > +		error = xfs_reflink_try_clear_inode_flag(ip, old_isize);
> > +		if (error)
> > +			goto out;
> > +	}
> > +
> > +	return 0;
> > +
> > +out_unlock:
> > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > +out:
> > +	trace_xfs_reflink_unshare_error(ip, error, _RET_IP_);
> > +	return error;
> > +}
> > diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
> > index df82b20..ad4fc61 100644
> > --- a/fs/xfs/xfs_reflink.h
> > +++ b/fs/xfs/xfs_reflink.h
> > @@ -48,5 +48,7 @@ extern int xfs_reflink_recover_cow(struct xfs_mount *mp);
> >  extern int xfs_reflink_remap_range(struct xfs_inode *src, xfs_off_t srcoff,
> >  		struct xfs_inode *dest, xfs_off_t destoff, xfs_off_t len,
> >  		unsigned int flags);
> > +extern int xfs_reflink_unshare(struct xfs_inode *ip, xfs_off_t offset,
> > +		xfs_off_t len);
> >  
> >  #endif /* __XFS_REFLINK_H */
> > 

* Re: [PATCH 42/63] xfs: add clone file and clone range vfs functions
  2016-10-07 18:04   ` Brian Foster
@ 2016-10-07 20:31     ` Darrick J. Wong
  0 siblings, 0 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-10-07 20:31 UTC (permalink / raw)
  To: Brian Foster; +Cc: david, linux-xfs

On Fri, Oct 07, 2016 at 02:04:30PM -0400, Brian Foster wrote:
> On Thu, Sep 29, 2016 at 08:10:14PM -0700, Darrick J. Wong wrote:
> > Define two VFS functions which allow userspace to reflink a range of
> > blocks between two files or to reflink one file's contents to another.
> > These functions fit the new VFS ioctls that standardize the checking
> > for the btrfs CLONE and CLONE RANGE ioctls.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> > v2: Plug into the VFS function pointers instead of handling ioctls
> > directly.
> > ---
> >  fs/xfs/xfs_file.c |  142 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 142 insertions(+)
> > 
> > 
> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index 025d52f..3db3f34 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -974,6 +974,146 @@ xfs_file_fallocate(
> ...
> > +/* Hook up to the VFS reflink function */
> > +STATIC int
> > +xfs_file_share_range(
> > +	struct file	*file_in,
> > +	loff_t		pos_in,
> > +	struct file	*file_out,
> > +	loff_t		pos_out,
> > +	u64		len)
> > +{
> ...
> > +	/* Wait for the completion of any pending IOs on srcfile */
> > +	ret = xfs_file_wait_for_io(inode_in, pos_in, len);
> > +	if (ret)
> > +		goto out_unlock;
> > +	ret = xfs_file_wait_for_io(inode_out, pos_out, len);
> > +	if (ret)
> > +		goto out_unlock;
> > +
> > +	ret = xfs_reflink_remap_range(XFS_I(inode_in), pos_in, XFS_I(inode_out),
> > +			pos_out, len);
> > +	if (ret < 0)
> > +		goto out_unlock;
> > +
> > +out_unlock:
> 
> Nit: 'out:'

Fixed.

--D

> 
> Brian
> 
> > +	return ret;
> > +}
> > +
> > +STATIC ssize_t
> > +xfs_file_copy_range(
> > +	struct file	*file_in,
> > +	loff_t		pos_in,
> > +	struct file	*file_out,
> > +	loff_t		pos_out,
> > +	size_t		len,
> > +	unsigned int	flags)
> > +{
> > +	int		error;
> > +
> > +	error = xfs_file_share_range(file_in, pos_in, file_out, pos_out,
> > +				     len);
> > +	if (error)
> > +		return error;
> > +	return len;
> > +}
> > +
> > +STATIC int
> > +xfs_file_clone_range(
> > +	struct file	*file_in,
> > +	loff_t		pos_in,
> > +	struct file	*file_out,
> > +	loff_t		pos_out,
> > +	u64		len)
> > +{
> > +	return xfs_file_share_range(file_in, pos_in, file_out, pos_out,
> > +				     len);
> > +}
> >  
> >  STATIC int
> >  xfs_file_open(
> > @@ -1634,6 +1774,8 @@ const struct file_operations xfs_file_operations = {
> >  	.release	= xfs_file_release,
> >  	.fsync		= xfs_file_fsync,
> >  	.fallocate	= xfs_file_fallocate,
> > +	.copy_file_range = xfs_file_copy_range,
> > +	.clone_file_range = xfs_file_clone_range,
> >  };
> >  
> >  const struct file_operations xfs_dir_file_operations = {
> > 

* Re: [PATCH 41/63] xfs: reflink extents from one file to another
  2016-10-07 19:44     ` Darrick J. Wong
@ 2016-10-07 20:48       ` Brian Foster
  2016-10-07 21:41         ` Darrick J. Wong
  0 siblings, 1 reply; 187+ messages in thread
From: Brian Foster @ 2016-10-07 20:48 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs

On Fri, Oct 07, 2016 at 12:44:30PM -0700, Darrick J. Wong wrote:
> On Fri, Oct 07, 2016 at 02:04:15PM -0400, Brian Foster wrote:
> > On Thu, Sep 29, 2016 at 08:10:05PM -0700, Darrick J. Wong wrote:
> > > Reflink extents from one file to another; that is to say, iteratively
> > > remove the mappings from the destination file, copy the mappings from
> > > the source file to the destination file, and increment the reference
> > > count of all the blocks that got remapped.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > > v2: Call xfs_defer_cancel before cancelling the transaction if the
> > > remap operation fails.  Use the deferred operations system to avoid
> > > deadlocks or blowing out the transaction reservation, and make the
> > > entire reflink operation atomic for each extent being remapped.  The
> > > destination file's i_size will be updated if necessary to avoid
> > > violating the assumption that there are no shared blocks past the EOF
> > > block.
> > > ---
> > >  fs/xfs/xfs_reflink.c |  425 ++++++++++++++++++++++++++++++++++++++++++++++++++
> > >  fs/xfs/xfs_reflink.h |    2 
> > >  2 files changed, 427 insertions(+)
> > > 
> > > 
> > > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > > index 673ecc1..94c19fff 100644
> > > --- a/fs/xfs/xfs_reflink.c
> > > +++ b/fs/xfs/xfs_reflink.c
> > > @@ -922,3 +922,428 @@ xfs_reflink_recover_cow(
> > >  
> > >  	return error;
> > >  }
> > ...
> > > +/*
> > > + * Unmap a range of blocks from a file, then map other blocks into the hole.
> > > + * The range to unmap is (destoff : destoff + srcioff + irec->br_blockcount).
> > > + * The extent irec is mapped into dest at irec->br_startoff.
> > > + */
> > > +STATIC int
> > > +xfs_reflink_remap_extent(
> > > +	struct xfs_inode	*ip,
> > > +	struct xfs_bmbt_irec	*irec,
> > > +	xfs_fileoff_t		destoff,
> > > +	xfs_off_t		new_isize)
> > > +{
> > > +	struct xfs_mount	*mp = ip->i_mount;
> > > +	struct xfs_trans	*tp;
> > > +	xfs_fsblock_t		firstfsb;
> > > +	unsigned int		resblks;
> > > +	struct xfs_defer_ops	dfops;
> > > +	struct xfs_bmbt_irec	uirec;
> > > +	bool			real_extent;
> > > +	xfs_filblks_t		rlen;
> > > +	xfs_filblks_t		unmap_len;
> > > +	xfs_off_t		newlen;
> > > +	int			error;
> > > +
> > > +	unmap_len = irec->br_startoff + irec->br_blockcount - destoff;
> > > +	trace_xfs_reflink_punch_range(ip, destoff, unmap_len);
> > > +
> > > +	/* Only remap normal extents. */
> > > +	real_extent =  (irec->br_startblock != HOLESTARTBLOCK &&
> > > +			irec->br_startblock != DELAYSTARTBLOCK &&
> > > +			!ISUNWRITTEN(irec));
> > > +
> > > +	/* Start a rolling transaction to switch the mappings */
> > > +	resblks = XFS_EXTENTADD_SPACE_RES(ip->i_mount, XFS_DATA_FORK);
> > > +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0, &tp);
> > > +	if (error)
> > > +		goto out;
> > > +
> > > +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > > +	xfs_trans_ijoin(tp, ip, 0);
> > > +
> > > +	/* If we're not just clearing space, then do we have enough quota? */
> > > +	if (real_extent) {
> > > +		error = xfs_trans_reserve_quota_nblks(tp, ip,
> > > +				irec->br_blockcount, 0, XFS_QMOPT_RES_REGBLKS);
> > > +		if (error)
> > > +			goto out_cancel;
> > > +	}
> > > +
> > > +	trace_xfs_reflink_remap(ip, irec->br_startoff,
> > > +				irec->br_blockcount, irec->br_startblock);
> > > +
> > > +	/* Unmap the old blocks in the data fork. */
> > > +	rlen = unmap_len;
> > > +	while (rlen) {
> > > +		xfs_defer_init(&dfops, &firstfsb);
> > > +		error = __xfs_bunmapi(tp, ip, destoff, &rlen, 0, 1,
> > > +				&firstfsb, &dfops);
> > > +		if (error)
> > > +			goto out_defer;
> > > +
> > > +		/* Trim the extent to whatever got unmapped. */
> > > +		uirec = *irec;
> > > +		xfs_trim_extent(&uirec, destoff + rlen, unmap_len - rlen);
> > > +		unmap_len = rlen;
> > > +
> > > +		/* If this isn't a real mapping, we're done. */
> > > +		if (!real_extent || uirec.br_blockcount == 0)
> > > +			goto next_extent;
> > > +
> > 
> > Any reason we couldn't reuse existing mechanisms for this? E.g., hole
> > punch the dest file range before we remap the source file extents. That
> > might change behavior in the event of a partial/failed reflink, but it's
> > not clear to me that matters.
> 
> It matters a lot for the dedupe operation -- the unmap and remap
> operations must be atomic with each other so that if the dedupe
> operation fails, the user will still see the same file contents after
> reboot/recovery.  We don't want users to find their files suddenly full
> of zeroes.
> 

Ok, that makes sense. Though the dedup atomicity is provided simply by
doing each unmap/remap within the same transaction, right? I'm kind of
wondering if we could do something like refactor/reuse
xfs_unmap_extent(), pull the trans alloc/commit and the unmap call up
into xfs_reflink_remap_blocks(), then clean out
xfs_reflink_remap_extent() a bit as a result.

But meh, this stuff is already merged so maybe I should just send a
patch. :P

Brian

> For reflink I suspect that you're right, but we already guarantee that
> the user sees either the old contents or the new contents, so yay. :)
> 
> > 
> > > +		trace_xfs_reflink_remap(ip, uirec.br_startoff,
> > > +				uirec.br_blockcount, uirec.br_startblock);
> > > +
> > ...
> > > +}
> > > +
> > > +/*
> > > + * Iteratively remap one file's extents (and holes) to another's.
> > > + */
> > > +STATIC int
> > > +xfs_reflink_remap_blocks(
> > > +	struct xfs_inode	*src,
> > > +	xfs_fileoff_t		srcoff,
> > > +	struct xfs_inode	*dest,
> > > +	xfs_fileoff_t		destoff,
> > > +	xfs_filblks_t		len,
> > > +	xfs_off_t		new_isize)
> > > +{
> > > +	struct xfs_bmbt_irec	imap;
> > > +	int			nimaps;
> > > +	int			error = 0;
> > > +	xfs_filblks_t		range_len;
> > > +
> > > +	/* drange = (destoff, destoff + len); srange = (srcoff, srcoff + len) */
> > > +	while (len) {
> > > +		trace_xfs_reflink_remap_blocks_loop(src, srcoff, len,
> > > +				dest, destoff);
> > > +		/* Read extent from the source file */
> > > +		nimaps = 1;
> > > +		xfs_ilock(src, XFS_ILOCK_EXCL);
> > > +		error = xfs_bmapi_read(src, srcoff, len, &imap, &nimaps, 0);
> > > +		xfs_iunlock(src, XFS_ILOCK_EXCL);
> > > +		if (error)
> > > +			goto err;
> > > +		ASSERT(nimaps == 1);
> > > +
> > > +		trace_xfs_reflink_remap_imap(src, srcoff, len, XFS_IO_OVERWRITE,
> > > +				&imap);
> > > +
> > > +		/* Translate imap into the destination file. */
> > > +		range_len = imap.br_startoff + imap.br_blockcount - srcoff;
> > > +		imap.br_startoff += destoff - srcoff;
> > > +
> > 
> > Just FYI... these are all unsigned vars...
> 
> Yeah.  It should handle that correctly.  See generic/30[34].
> 
> --D
> 
> > 
> > Brian
> > 
> > > +		/* Clear dest from destoff to the end of imap and map it in. */
> > > +		error = xfs_reflink_remap_extent(dest, &imap, destoff,
> > > +				new_isize);
> > > +		if (error)
> > > +			goto err;
> > > +
> > > +		if (fatal_signal_pending(current)) {
> > > +			error = -EINTR;
> > > +			goto err;
> > > +		}
> > > +
> > > +		/* Advance drange/srange */
> > > +		srcoff += range_len;
> > > +		destoff += range_len;
> > > +		len -= range_len;
> > > +	}
> > > +
> > > +	return 0;
> > > +
> > > +err:
> > > +	trace_xfs_reflink_remap_blocks_error(dest, error, _RET_IP_);
> > > +	return error;
> > > +}
> > > +
> > > +/*
> > > + * Link a range of blocks from one file to another.
> > > + */
> > > +int
> > > +xfs_reflink_remap_range(
> > > +	struct xfs_inode	*src,
> > > +	xfs_off_t		srcoff,
> > > +	struct xfs_inode	*dest,
> > > +	xfs_off_t		destoff,
> > > +	xfs_off_t		len)
> > > +{
> > > +	struct xfs_mount	*mp = src->i_mount;
> > > +	xfs_fileoff_t		sfsbno, dfsbno;
> > > +	xfs_filblks_t		fsblen;
> > > +	int			error;
> > > +
> > > +	if (!xfs_sb_version_hasreflink(&mp->m_sb))
> > > +		return -EOPNOTSUPP;
> > > +
> > > +	if (XFS_FORCED_SHUTDOWN(mp))
> > > +		return -EIO;
> > > +
> > > +	/* Don't reflink realtime inodes */
> > > +	if (XFS_IS_REALTIME_INODE(src) || XFS_IS_REALTIME_INODE(dest))
> > > +		return -EINVAL;
> > > +
> > > +	trace_xfs_reflink_remap_range(src, srcoff, len, dest, destoff);
> > > +
> > > +	/* Lock both files against IO */
> > > +	if (src->i_ino == dest->i_ino) {
> > > +		xfs_ilock(src, XFS_IOLOCK_EXCL);
> > > +		xfs_ilock(src, XFS_MMAPLOCK_EXCL);
> > > +	} else {
> > > +		xfs_lock_two_inodes(src, dest, XFS_IOLOCK_EXCL);
> > > +		xfs_lock_two_inodes(src, dest, XFS_MMAPLOCK_EXCL);
> > > +	}
> > > +
> > > +	error = xfs_reflink_set_inode_flag(src, dest);
> > > +	if (error)
> > > +		goto out_error;
> > > +
> > > +	/*
> > > +	 * Invalidate the page cache so that we can clear any CoW mappings
> > > +	 * in the destination file.
> > > +	 */
> > > +	truncate_inode_pages_range(&VFS_I(dest)->i_data, destoff,
> > > +				   PAGE_ALIGN(destoff + len) - 1);
> > > +
> > > +	dfsbno = XFS_B_TO_FSBT(mp, destoff);
> > > +	sfsbno = XFS_B_TO_FSBT(mp, srcoff);
> > > +	fsblen = XFS_B_TO_FSB(mp, len);
> > > +	error = xfs_reflink_remap_blocks(src, sfsbno, dest, dfsbno, fsblen,
> > > +			destoff + len);
> > > +	if (error)
> > > +		goto out_error;
> > > +
> > > +	error = xfs_reflink_update_dest(dest, destoff + len);
> > > +	if (error)
> > > +		goto out_error;
> > > +
> > > +out_error:
> > > +	xfs_iunlock(src, XFS_MMAPLOCK_EXCL);
> > > +	xfs_iunlock(src, XFS_IOLOCK_EXCL);
> > > +	if (src->i_ino != dest->i_ino) {
> > > +		xfs_iunlock(dest, XFS_MMAPLOCK_EXCL);
> > > +		xfs_iunlock(dest, XFS_IOLOCK_EXCL);
> > > +	}
> > > +	if (error)
> > > +		trace_xfs_reflink_remap_range_error(dest, error, _RET_IP_);
> > > +	return error;
> > > +}
> > > diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
> > > index 1d2f180..c35ce29 100644
> > > --- a/fs/xfs/xfs_reflink.h
> > > +++ b/fs/xfs/xfs_reflink.h
> > > @@ -43,5 +43,7 @@ extern int xfs_reflink_cancel_cow_range(struct xfs_inode *ip, xfs_off_t offset,
> > >  extern int xfs_reflink_end_cow(struct xfs_inode *ip, xfs_off_t offset,
> > >  		xfs_off_t count);
> > >  extern int xfs_reflink_recover_cow(struct xfs_mount *mp);
> > > +extern int xfs_reflink_remap_range(struct xfs_inode *src, xfs_off_t srcoff,
> > > +		struct xfs_inode *dest, xfs_off_t destoff, xfs_off_t len);
> > >  
> > >  #endif /* __XFS_REFLINK_H */
> > > 

* Re: [PATCH 46/63] xfs: unshare a range of blocks via fallocate
  2016-10-07 20:26     ` Darrick J. Wong
@ 2016-10-07 20:58       ` Brian Foster
  2016-10-07 21:15         ` Darrick J. Wong
  0 siblings, 1 reply; 187+ messages in thread
From: Brian Foster @ 2016-10-07 20:58 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, Christoph Hellwig

On Fri, Oct 07, 2016 at 01:26:39PM -0700, Darrick J. Wong wrote:
> On Fri, Oct 07, 2016 at 02:05:07PM -0400, Brian Foster wrote:
> > On Thu, Sep 29, 2016 at 08:10:39PM -0700, Darrick J. Wong wrote:
> > > Unshare all shared extents if the user calls fallocate with the new
> > > unshare mode flag set, so that we can guarantee that a subsequent
> > > write will not ENOSPC.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > [hch: pass inode instead of file to xfs_reflink_dirty_range,
> > >       use iomap infrastructure for copy up]
> > > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > > ---
> > >  fs/xfs/xfs_file.c    |   10 ++
> > >  fs/xfs/xfs_reflink.c |  237 ++++++++++++++++++++++++++++++++++++++++++++++++++
> > >  fs/xfs/xfs_reflink.h |    2 
> > >  3 files changed, 247 insertions(+), 2 deletions(-)
> > > 
> > > 
> > ...
> > > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > > index 77ac810..065e836 100644
> > > --- a/fs/xfs/xfs_reflink.c
> > > +++ b/fs/xfs/xfs_reflink.c
> > > @@ -1472,3 +1472,240 @@ xfs_reflink_remap_range(
> > >  		trace_xfs_reflink_remap_range_error(dest, error, _RET_IP_);
> > >  	return error;
> > >  }
> > ...
> > > +/* Iterate the extents; if there are no reflinked blocks, clear the flag. */
> > > +STATIC int
> > > +xfs_reflink_try_clear_inode_flag(
> > > +	struct xfs_inode	*ip,
> > > +	xfs_off_t		old_isize)
> > > +{
> > > +	struct xfs_mount	*mp = ip->i_mount;
> > > +	struct xfs_trans	*tp;
> > > +	xfs_fileoff_t		fbno;
> > > +	xfs_filblks_t		end;
> > > +	xfs_agnumber_t		agno;
> > > +	xfs_agblock_t		agbno;
> > > +	xfs_extlen_t		aglen;
> > > +	xfs_agblock_t		rbno;
> > > +	xfs_extlen_t		rlen;
> > > +	struct xfs_bmbt_irec	map[2];
> > > +	int			nmaps;
> > > +	int			error = 0;
> > > +
> > > +	/* Start a rolling transaction to remove the mappings */
> > > +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, 0, &tp);
> > > +	if (error)
> > > +		return error;
> > > +
> > > +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > > +	xfs_trans_ijoin(tp, ip, 0);
> > > +
> > > +	if (old_isize != i_size_read(VFS_I(ip)))
> > > +		goto cancel;
> > > +	if (!(ip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK))
> > > +		goto cancel;
> > > +
> > 
> > The code that has been merged is now different from this code :/, but
> > just a heads up that the code in the tree looks like it has another one
> > of those potentially blind transaction commit sequences between
> > xfs_reflink_try_clear_inode_flag() and xfs_reflink_clear_inode_flag().
> 
> _reflink_unshare jumps out if it's not a reflink inode before
> calling _reflink_try_clear_inode_flag -> _reflink_clear_inode_flag.
> We do not call _reflink_clear_inode_flag with a non-reflink inode.
> As for blindly committing a transaction with no dirty data, that's
> fine, _trans_commit checks for that case and simply frees everything
> attached to the transaction.
> 

Yeah, I saw that. That's what I was alluding to below wrt the usage
being fine in the patch. It's just the pattern that's used that stands
out.

With regard to the transaction... sure, that situation may not be broken,
but it's still not ideal if it's a log reservation we didn't have to
make in the first place.

> > It doesn't appear to be a problem in how it is actually used in this
> > patch, but for reference, I think it's better practice for lower level
> > functions like xfs_reflink_clear_inode_flag() to assert that the flag is
> > set and make it the responsibility of the caller to check for it and do
> > the right thing. Just my .02 though.
> 
> Ok, I'll add an assert.
> 
...
> > > +		}
> > > +
> > > +next:
> > > +		fbno = map[0].br_startoff + map[0].br_blockcount;
> > > +	}
> > > +
> > > +	/*
> > > +	 * We didn't find any shared blocks so turn off the reflink flag.
> > > +	 * First, get rid of any leftover CoW mappings.
> > > +	 */
> > > +	error = xfs_reflink_cancel_cow_blocks(ip, &tp, 0, NULLFILEOFF);
> > > +	if (error)
> > > +		goto cancel;
> > > +
> > > +	/* Clear the inode flag. */
> > > +	trace_xfs_reflink_unset_inode_flag(ip);
> > > +	ip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
> > > +	xfs_trans_ijoin(tp, ip, 0);
> > > +	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> > > +
> > > +	error = xfs_trans_commit(tp);
> > > +	if (error)
> > > +		goto out;
> > > +
> > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > +	return 0;
> > > +cancel:
> > > +	xfs_trans_cancel(tp);
> > > +out:
> > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > +	return error;
> > > +}
> > > +
> > > +/*
> > > + * Pre-COW all shared blocks within a given byte range of a file and turn off
> > > + * the reflink flag if we unshare all of the file's blocks.
> > > + */
> > > +int
> > > +xfs_reflink_unshare(
> > > +	struct xfs_inode	*ip,
> > > +	xfs_off_t		offset,
> > > +	xfs_off_t		len)
> > > +{
> > > +	struct xfs_mount	*mp = ip->i_mount;
> > > +	xfs_fileoff_t		fbno;
> > > +	xfs_filblks_t		end;
> > > +	xfs_off_t		old_isize, isize;
> > > +	int			error;
> > > +
> > > +	if (!xfs_is_reflink_inode(ip))
> > > +		return 0;
> > > +
> > > +	trace_xfs_reflink_unshare(ip, offset, len);
> > > +
> > > +	inode_dio_wait(VFS_I(ip));
> > > +
> > > +	/* Try to CoW the selected ranges */
> > > +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > > +	fbno = XFS_B_TO_FSB(mp, offset);
> > 
> > XFS_B_TO_FSBT() ?
> > 
> > > +	old_isize = isize = i_size_read(VFS_I(ip));
> > > +	end = XFS_B_TO_FSB(mp, offset + len);
> > > +	error = xfs_reflink_dirty_extents(ip, fbno, end, isize);
> > > +	if (error)
> > > +		goto out_unlock;
> > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > +
> > > +	/* Wait for the IO to finish */
> > > +	error = filemap_write_and_wait(VFS_I(ip)->i_mapping);
> > > +	if (error)
> > > +		goto out;
> > > +
> > > +	/* Turn off the reflink flag if we unshared the whole file */
> > > +	if (offset == 0 && len == isize) {
> > 
> > Isn't this valid if len is larger than isize (similar check in
> > xfs_reflink_try_clear_inode_flag() might defeat this as well)?
> > 
> > FWIW, this has a similar issue as the earlier truncate code in that we
> > might just unshare the shared regions and thus retain the flag.
> 
> Yes, it is suboptimal for the flag to be set when there are no shared
> extents.  I'm not sure when is a good opportunity to try to turn off the
> flag -- certainly we don't want to do that after every CoW operation.

Indeed..

> Doing it as part of a fallocate operation seems reasonable enough.
> Prior to the removal of the UNSHARE flag it would try to clear the flag
> any time the user asked for an unshare, but when I removed the flag from
> the interface I decided we should only do that if the user fallocated
> the entire file.
> 
> Now that UNSHARE has been re-added to the interface, I'll just take out
> these weird checks.  Note that the upcoming online repair patchset will
> try to unset the flag.
> 

Ok, it sounds reasonable to me to try the removal on fallocate.

Brian

> --D
> 
> > 
> > Brian
> > 
> > > +		error = xfs_reflink_try_clear_inode_flag(ip, old_isize);
> > > +		if (error)
> > > +			goto out;
> > > +	}
> > > +
> > > +	return 0;
> > > +
> > > +out_unlock:
> > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > +out:
> > > +	trace_xfs_reflink_unshare_error(ip, error, _RET_IP_);
> > > +	return error;
> > > +}
> > > diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
> > > index df82b20..ad4fc61 100644
> > > --- a/fs/xfs/xfs_reflink.h
> > > +++ b/fs/xfs/xfs_reflink.h
> > > @@ -48,5 +48,7 @@ extern int xfs_reflink_recover_cow(struct xfs_mount *mp);
> > >  extern int xfs_reflink_remap_range(struct xfs_inode *src, xfs_off_t srcoff,
> > >  		struct xfs_inode *dest, xfs_off_t destoff, xfs_off_t len,
> > >  		unsigned int flags);
> > > +extern int xfs_reflink_unshare(struct xfs_inode *ip, xfs_off_t offset,
> > > +		xfs_off_t len);
> > >  
> > >  #endif /* __XFS_REFLINK_H */
> > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 46/63] xfs: unshare a range of blocks via fallocate
  2016-10-07 20:58       ` Brian Foster
@ 2016-10-07 21:15         ` Darrick J. Wong
  2016-10-07 22:25           ` Dave Chinner
  0 siblings, 1 reply; 187+ messages in thread
From: Darrick J. Wong @ 2016-10-07 21:15 UTC (permalink / raw)
  To: Brian Foster; +Cc: david, linux-xfs, Christoph Hellwig

On Fri, Oct 07, 2016 at 04:58:38PM -0400, Brian Foster wrote:
> On Fri, Oct 07, 2016 at 01:26:39PM -0700, Darrick J. Wong wrote:
> > On Fri, Oct 07, 2016 at 02:05:07PM -0400, Brian Foster wrote:
> > > On Thu, Sep 29, 2016 at 08:10:39PM -0700, Darrick J. Wong wrote:
> > > > Unshare all shared extents if the user calls fallocate with the new
> > > > unshare mode flag set, so that we can guarantee that a subsequent
> > > > write will not ENOSPC.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > [hch: pass inode instead of file to xfs_reflink_dirty_range,
> > > >       use iomap infrastructure for copy up]
> > > > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > > > ---
> > > >  fs/xfs/xfs_file.c    |   10 ++
> > > >  fs/xfs/xfs_reflink.c |  237 ++++++++++++++++++++++++++++++++++++++++++++++++++
> > > >  fs/xfs/xfs_reflink.h |    2 
> > > >  3 files changed, 247 insertions(+), 2 deletions(-)
> > > > 
> > > > 
> > > ...
> > > > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > > > index 77ac810..065e836 100644
> > > > --- a/fs/xfs/xfs_reflink.c
> > > > +++ b/fs/xfs/xfs_reflink.c
> > > > @@ -1472,3 +1472,240 @@ xfs_reflink_remap_range(
> > > >  		trace_xfs_reflink_remap_range_error(dest, error, _RET_IP_);
> > > >  	return error;
> > > >  }
> > > ...
> > > > +/* Iterate the extents; if there are no reflinked blocks, clear the flag. */
> > > > +STATIC int
> > > > +xfs_reflink_try_clear_inode_flag(
> > > > +	struct xfs_inode	*ip,
> > > > +	xfs_off_t		old_isize)
> > > > +{
> > > > +	struct xfs_mount	*mp = ip->i_mount;
> > > > +	struct xfs_trans	*tp;
> > > > +	xfs_fileoff_t		fbno;
> > > > +	xfs_filblks_t		end;
> > > > +	xfs_agnumber_t		agno;
> > > > +	xfs_agblock_t		agbno;
> > > > +	xfs_extlen_t		aglen;
> > > > +	xfs_agblock_t		rbno;
> > > > +	xfs_extlen_t		rlen;
> > > > +	struct xfs_bmbt_irec	map[2];
> > > > +	int			nmaps;
> > > > +	int			error = 0;
> > > > +
> > > > +	/* Start a rolling transaction to remove the mappings */
> > > > +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, 0, &tp);
> > > > +	if (error)
> > > > +		return error;
> > > > +
> > > > +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > > > +	xfs_trans_ijoin(tp, ip, 0);
> > > > +
> > > > +	if (old_isize != i_size_read(VFS_I(ip)))
> > > > +		goto cancel;
> > > > +	if (!(ip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK))
> > > > +		goto cancel;
> > > > +
> > > 
> > > The code that has been merged is now different from this code :/, but
> > > just a heads up that the code in the tree looks like it has another one
> > > of those potentially blind transaction commit sequences between
> > > xfs_reflink_try_clear_inode_flag() and xfs_reflink_clear_inode_flag().
> > 
> > _reflink_unshare jumps out if it's not a reflink inode before
> > calling _reflink_try_clear_inode_flag -> _reflink_clear_inode_flag.
> > We do not call _reflink_clear_inode_flag with a non-reflink inode.
> > As for blindly committing a transaction with no dirty data, that's
> > fine, _trans_commit checks for that case and simply frees everything
> > attached to the transaction.
> > 
> 
> Yeah, I saw that. That's what I was alluding to below wrt the usage
> being fine in the patch. It's just the pattern that's used that stands
> out.
> 
> With regard to the transaction.. sure, that situation may not be broken,
> but it's still not ideal if it's a log reservation we didn't have to
> make in the first place.

Yeah.  We must hold the ilock from the start of the extent iteration
until we clear (or not) the inode flag, but we have to allocate the
transaction before grabbing the ilock.  In other words, we don't know if
we need the transaction until it's too late to get one, hence this
suboptimal thing where we sometimes get a reservation and never commit
anything.  I don't know of any way to avoid that.

--D

> > > It doesn't appear to be a problem in how it is actually used in this
> > > patch, but for reference, I think it's better practice for lower level
> > > functions like xfs_reflink_clear_inode_flag() to assert that the flag is
> > > set and make it the responsibility of the caller to check for it and do
> > > the right thing. Just my .02 though.
> > 
> > Ok, I'll add an assert.
> > 
> ...
> > > > +		}
> > > > +
> > > > +next:
> > > > +		fbno = map[0].br_startoff + map[0].br_blockcount;
> > > > +	}
> > > > +
> > > > +	/*
> > > > +	 * We didn't find any shared blocks so turn off the reflink flag.
> > > > +	 * First, get rid of any leftover CoW mappings.
> > > > +	 */
> > > > +	error = xfs_reflink_cancel_cow_blocks(ip, &tp, 0, NULLFILEOFF);
> > > > +	if (error)
> > > > +		goto cancel;
> > > > +
> > > > +	/* Clear the inode flag. */
> > > > +	trace_xfs_reflink_unset_inode_flag(ip);
> > > > +	ip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
> > > > +	xfs_trans_ijoin(tp, ip, 0);
> > > > +	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> > > > +
> > > > +	error = xfs_trans_commit(tp);
> > > > +	if (error)
> > > > +		goto out;
> > > > +
> > > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > +	return 0;
> > > > +cancel:
> > > > +	xfs_trans_cancel(tp);
> > > > +out:
> > > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > +	return error;
> > > > +}
> > > > +
> > > > +/*
> > > > + * Pre-COW all shared blocks within a given byte range of a file and turn off
> > > > + * the reflink flag if we unshare all of the file's blocks.
> > > > + */
> > > > +int
> > > > +xfs_reflink_unshare(
> > > > +	struct xfs_inode	*ip,
> > > > +	xfs_off_t		offset,
> > > > +	xfs_off_t		len)
> > > > +{
> > > > +	struct xfs_mount	*mp = ip->i_mount;
> > > > +	xfs_fileoff_t		fbno;
> > > > +	xfs_filblks_t		end;
> > > > +	xfs_off_t		old_isize, isize;
> > > > +	int			error;
> > > > +
> > > > +	if (!xfs_is_reflink_inode(ip))
> > > > +		return 0;
> > > > +
> > > > +	trace_xfs_reflink_unshare(ip, offset, len);
> > > > +
> > > > +	inode_dio_wait(VFS_I(ip));
> > > > +
> > > > +	/* Try to CoW the selected ranges */
> > > > +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > > > +	fbno = XFS_B_TO_FSB(mp, offset);
> > > 
> > > XFS_B_TO_FSBT() ?
> > > 
> > > > +	old_isize = isize = i_size_read(VFS_I(ip));
> > > > +	end = XFS_B_TO_FSB(mp, offset + len);
> > > > +	error = xfs_reflink_dirty_extents(ip, fbno, end, isize);
> > > > +	if (error)
> > > > +		goto out_unlock;
> > > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > +
> > > > +	/* Wait for the IO to finish */
> > > > +	error = filemap_write_and_wait(VFS_I(ip)->i_mapping);
> > > > +	if (error)
> > > > +		goto out;
> > > > +
> > > > +	/* Turn off the reflink flag if we unshared the whole file */
> > > > +	if (offset == 0 && len == isize) {
> > > 
> > > Isn't this valid if len is larger than isize (similar check in
> > > xfs_reflink_try_clear_inode_flag() might defeat this as well)?
> > > 
> > > FWIW, this has a similar issue as the earlier truncate code in that we
> > > might just unshare the shared regions and thus retain the flag.
> > 
> > Yes, it is suboptimal for the flag to be set when there are no shared
> > extents.  I'm not sure when is a good opportunity to try to turn off the
> > flag -- certainly we don't want to do that after every CoW operation.
> 
> Indeed..
> 
> > Doing it as part of a fallocate operation seems reasonable enough.
> > Prior to the removal of the UNSHARE flag it would try to clear the flag
> > any time the user asked for an unshare, but when I removed the flag from
> > the interface I decided we should only do that if the user fallocated
> > the entire file.
> > 
> > Now that UNSHARE has been re-added to the interface, I'll just take out
> > these weird checks.  Note that the upcoming online repair patchset will
> > try to unset the flag.
> > 
> 
> Ok, it sounds reasonable to me to try the removal on fallocate.
> 
> Brian
> 
> > --D
> > 
> > > 
> > > Brian
> > > 
> > > > +		error = xfs_reflink_try_clear_inode_flag(ip, old_isize);
> > > > +		if (error)
> > > > +			goto out;
> > > > +	}
> > > > +
> > > > +	return 0;
> > > > +
> > > > +out_unlock:
> > > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > +out:
> > > > +	trace_xfs_reflink_unshare_error(ip, error, _RET_IP_);
> > > > +	return error;
> > > > +}
> > > > diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
> > > > index df82b20..ad4fc61 100644
> > > > --- a/fs/xfs/xfs_reflink.h
> > > > +++ b/fs/xfs/xfs_reflink.h
> > > > @@ -48,5 +48,7 @@ extern int xfs_reflink_recover_cow(struct xfs_mount *mp);
> > > >  extern int xfs_reflink_remap_range(struct xfs_inode *src, xfs_off_t srcoff,
> > > >  		struct xfs_inode *dest, xfs_off_t destoff, xfs_off_t len,
> > > >  		unsigned int flags);
> > > > +extern int xfs_reflink_unshare(struct xfs_inode *ip, xfs_off_t offset,
> > > > +		xfs_off_t len);
> > > >  
> > > >  #endif /* __XFS_REFLINK_H */
> > > > 


* Re: [PATCH 41/63] xfs: reflink extents from one file to another
  2016-10-07 20:48       ` Brian Foster
@ 2016-10-07 21:41         ` Darrick J. Wong
  2016-10-10 13:17           ` Brian Foster
  0 siblings, 1 reply; 187+ messages in thread
From: Darrick J. Wong @ 2016-10-07 21:41 UTC (permalink / raw)
  To: Brian Foster; +Cc: david, linux-xfs

On Fri, Oct 07, 2016 at 04:48:52PM -0400, Brian Foster wrote:
> On Fri, Oct 07, 2016 at 12:44:30PM -0700, Darrick J. Wong wrote:
> > On Fri, Oct 07, 2016 at 02:04:15PM -0400, Brian Foster wrote:
> > > On Thu, Sep 29, 2016 at 08:10:05PM -0700, Darrick J. Wong wrote:
> > > > Reflink extents from one file to another; that is to say, iteratively
> > > > remove the mappings from the destination file, copy the mappings from
> > > > the source file to the destination file, and increment the reference
> > > > count of all the blocks that got remapped.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > ---
> > > > v2: Call xfs_defer_cancel before cancelling the transaction if the
> > > > remap operation fails.  Use the deferred operations system to avoid
> > > > deadlocks or blowing out the transaction reservation, and make the
> > > > entire reflink operation atomic for each extent being remapped.  The
> > > > destination file's i_size will be updated if necessary to avoid
> > > > violating the assumption that there are no shared blocks past the EOF
> > > > block.
> > > > ---
> > > >  fs/xfs/xfs_reflink.c |  425 ++++++++++++++++++++++++++++++++++++++++++++++++++
> > > >  fs/xfs/xfs_reflink.h |    2 
> > > >  2 files changed, 427 insertions(+)
> > > > 
> > > > 
> > > > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > > > index 673ecc1..94c19fff 100644
> > > > --- a/fs/xfs/xfs_reflink.c
> > > > +++ b/fs/xfs/xfs_reflink.c
> > > > @@ -922,3 +922,428 @@ xfs_reflink_recover_cow(
> > > >  
> > > >  	return error;
> > > >  }
> > > ...
> > > > +/*
> > > > + * Unmap a range of blocks from a file, then map other blocks into the hole.
> > > > + * The range to unmap is (destoff : destoff + srcioff + irec->br_blockcount).
> > > > + * The extent irec is mapped into dest at irec->br_startoff.
> > > > + */
> > > > +STATIC int
> > > > +xfs_reflink_remap_extent(
> > > > +	struct xfs_inode	*ip,
> > > > +	struct xfs_bmbt_irec	*irec,
> > > > +	xfs_fileoff_t		destoff,
> > > > +	xfs_off_t		new_isize)
> > > > +{
> > > > +	struct xfs_mount	*mp = ip->i_mount;
> > > > +	struct xfs_trans	*tp;
> > > > +	xfs_fsblock_t		firstfsb;
> > > > +	unsigned int		resblks;
> > > > +	struct xfs_defer_ops	dfops;
> > > > +	struct xfs_bmbt_irec	uirec;
> > > > +	bool			real_extent;
> > > > +	xfs_filblks_t		rlen;
> > > > +	xfs_filblks_t		unmap_len;
> > > > +	xfs_off_t		newlen;
> > > > +	int			error;
> > > > +
> > > > +	unmap_len = irec->br_startoff + irec->br_blockcount - destoff;
> > > > +	trace_xfs_reflink_punch_range(ip, destoff, unmap_len);
> > > > +
> > > > +	/* Only remap normal extents. */
> > > > +	real_extent =  (irec->br_startblock != HOLESTARTBLOCK &&
> > > > +			irec->br_startblock != DELAYSTARTBLOCK &&
> > > > +			!ISUNWRITTEN(irec));
> > > > +
> > > > +	/* Start a rolling transaction to switch the mappings */
> > > > +	resblks = XFS_EXTENTADD_SPACE_RES(ip->i_mount, XFS_DATA_FORK);
> > > > +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0, &tp);
> > > > +	if (error)
> > > > +		goto out;
> > > > +
> > > > +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > > > +	xfs_trans_ijoin(tp, ip, 0);
> > > > +
> > > > +	/* If we're not just clearing space, then do we have enough quota? */
> > > > +	if (real_extent) {
> > > > +		error = xfs_trans_reserve_quota_nblks(tp, ip,
> > > > +				irec->br_blockcount, 0, XFS_QMOPT_RES_REGBLKS);
> > > > +		if (error)
> > > > +			goto out_cancel;
> > > > +	}
> > > > +
> > > > +	trace_xfs_reflink_remap(ip, irec->br_startoff,
> > > > +				irec->br_blockcount, irec->br_startblock);
> > > > +
> > > > +	/* Unmap the old blocks in the data fork. */
> > > > +	rlen = unmap_len;
> > > > +	while (rlen) {
> > > > +		xfs_defer_init(&dfops, &firstfsb);
> > > > +		error = __xfs_bunmapi(tp, ip, destoff, &rlen, 0, 1,
> > > > +				&firstfsb, &dfops);
> > > > +		if (error)
> > > > +			goto out_defer;
> > > > +
> > > > +		/* Trim the extent to whatever got unmapped. */
> > > > +		uirec = *irec;
> > > > +		xfs_trim_extent(&uirec, destoff + rlen, unmap_len - rlen);
> > > > +		unmap_len = rlen;
> > > > +
> > > > +		/* If this isn't a real mapping, we're done. */
> > > > +		if (!real_extent || uirec.br_blockcount == 0)
> > > > +			goto next_extent;
> > > > +
> > > 
> > > Any reason we couldn't reuse existing mechanisms for this? E.g., hole
> > > punch the dest file range before we remap the source file extents. That
> > > might change behavior in the event of a partial/failed reflink, but it's
> > > not clear to me that matters.
> > 
> > It matters a lot for the dedupe operation -- the unmap and remap
> > operations must be atomic with each other so that if the dedupe
> > operation fails, the user will still see the same file contents after
> > reboot/recovery.  We don't want users to find their files suddenly full
> > of zeroes.
> > 
> 
> Ok, that makes sense. Though the dedup atomicity is provided simply by
> doing each unmap/remap within the same transaction, right? I'm kind of

The unmap/remap are done within the same defer_ops, but in different transactions.

> wondering if we could do something like refactor/reuse
> xfs_unmap_extent(), pull the trans alloc/commit and the unmap call up
> into xfs_reflink_remap_blocks(), then clean out
> xfs_reflink_remap_extent() a bit as a result.

Hm.  Let's start with the current structure:

for each extent in the source file,
  alloc transaction
  for each extent in the dest file that bunmapi tells us is now empty,
    log refcount increase intent
    log bmap remap intent
    update quota
    update isize if needed
    _defer_finish
  commit transaction

You could flatten _remap_extent and _remap_blocks into a single
function with a double loop, I suppose.  I don't think trying to reuse
_unmap_extent buys us much, however -- for the truncate case we simply
unmapi and _defer_finish, but for reflink we have all those extra steps
that have to go between the bunmapi and the defer_finish.  Furthermore
we still have to use __xfs_bunmapi for reflink because we have to know
exactly which part to remap since we can only unmap one extent per
transaction.

> But meh, this stuff is already merged so maybe I should just send a
> patch. :P

That said, if you send a patch I'll have a look. :)

--D

> 
> Brian
> 
> > For reflink I suspect that you're right, but we already guarantee that
> > the user sees either the old contents or the new contents, so yay. :)
> > 
> > > 
> > > > +		trace_xfs_reflink_remap(ip, uirec.br_startoff,
> > > > +				uirec.br_blockcount, uirec.br_startblock);
> > > > +
> > > ...
> > > > +}
> > > > +
> > > > +/*
> > > > + * Iteratively remap one file's extents (and holes) to another's.
> > > > + */
> > > > +STATIC int
> > > > +xfs_reflink_remap_blocks(
> > > > +	struct xfs_inode	*src,
> > > > +	xfs_fileoff_t		srcoff,
> > > > +	struct xfs_inode	*dest,
> > > > +	xfs_fileoff_t		destoff,
> > > > +	xfs_filblks_t		len,
> > > > +	xfs_off_t		new_isize)
> > > > +{
> > > > +	struct xfs_bmbt_irec	imap;
> > > > +	int			nimaps;
> > > > +	int			error = 0;
> > > > +	xfs_filblks_t		range_len;
> > > > +
> > > > +	/* drange = (destoff, destoff + len); srange = (srcoff, srcoff + len) */
> > > > +	while (len) {
> > > > +		trace_xfs_reflink_remap_blocks_loop(src, srcoff, len,
> > > > +				dest, destoff);
> > > > +		/* Read extent from the source file */
> > > > +		nimaps = 1;
> > > > +		xfs_ilock(src, XFS_ILOCK_EXCL);
> > > > +		error = xfs_bmapi_read(src, srcoff, len, &imap, &nimaps, 0);
> > > > +		xfs_iunlock(src, XFS_ILOCK_EXCL);
> > > > +		if (error)
> > > > +			goto err;
> > > > +		ASSERT(nimaps == 1);
> > > > +
> > > > +		trace_xfs_reflink_remap_imap(src, srcoff, len, XFS_IO_OVERWRITE,
> > > > +				&imap);
> > > > +
> > > > +		/* Translate imap into the destination file. */
> > > > +		range_len = imap.br_startoff + imap.br_blockcount - srcoff;
> > > > +		imap.br_startoff += destoff - srcoff;
> > > > +
> > > 
> > > Just FYI... these are all unsigned vars...
> > 
> > Yeah.  It should handle that correctly.  See generic/30[34].
> > 
> > --D
> > 
> > > 
> > > Brian
> > > 
> > > > +		/* Clear dest from destoff to the end of imap and map it in. */
> > > > +		error = xfs_reflink_remap_extent(dest, &imap, destoff,
> > > > +				new_isize);
> > > > +		if (error)
> > > > +			goto err;
> > > > +
> > > > +		if (fatal_signal_pending(current)) {
> > > > +			error = -EINTR;
> > > > +			goto err;
> > > > +		}
> > > > +
> > > > +		/* Advance drange/srange */
> > > > +		srcoff += range_len;
> > > > +		destoff += range_len;
> > > > +		len -= range_len;
> > > > +	}
> > > > +
> > > > +	return 0;
> > > > +
> > > > +err:
> > > > +	trace_xfs_reflink_remap_blocks_error(dest, error, _RET_IP_);
> > > > +	return error;
> > > > +}
> > > > +
> > > > +/*
> > > > + * Link a range of blocks from one file to another.
> > > > + */
> > > > +int
> > > > +xfs_reflink_remap_range(
> > > > +	struct xfs_inode	*src,
> > > > +	xfs_off_t		srcoff,
> > > > +	struct xfs_inode	*dest,
> > > > +	xfs_off_t		destoff,
> > > > +	xfs_off_t		len)
> > > > +{
> > > > +	struct xfs_mount	*mp = src->i_mount;
> > > > +	xfs_fileoff_t		sfsbno, dfsbno;
> > > > +	xfs_filblks_t		fsblen;
> > > > +	int			error;
> > > > +
> > > > +	if (!xfs_sb_version_hasreflink(&mp->m_sb))
> > > > +		return -EOPNOTSUPP;
> > > > +
> > > > +	if (XFS_FORCED_SHUTDOWN(mp))
> > > > +		return -EIO;
> > > > +
> > > > +	/* Don't reflink realtime inodes */
> > > > +	if (XFS_IS_REALTIME_INODE(src) || XFS_IS_REALTIME_INODE(dest))
> > > > +		return -EINVAL;
> > > > +
> > > > +	trace_xfs_reflink_remap_range(src, srcoff, len, dest, destoff);
> > > > +
> > > > +	/* Lock both files against IO */
> > > > +	if (src->i_ino == dest->i_ino) {
> > > > +		xfs_ilock(src, XFS_IOLOCK_EXCL);
> > > > +		xfs_ilock(src, XFS_MMAPLOCK_EXCL);
> > > > +	} else {
> > > > +		xfs_lock_two_inodes(src, dest, XFS_IOLOCK_EXCL);
> > > > +		xfs_lock_two_inodes(src, dest, XFS_MMAPLOCK_EXCL);
> > > > +	}
> > > > +
> > > > +	error = xfs_reflink_set_inode_flag(src, dest);
> > > > +	if (error)
> > > > +		goto out_error;
> > > > +
> > > > +	/*
> > > > +	 * Invalidate the page cache so that we can clear any CoW mappings
> > > > +	 * in the destination file.
> > > > +	 */
> > > > +	truncate_inode_pages_range(&VFS_I(dest)->i_data, destoff,
> > > > +				   PAGE_ALIGN(destoff + len) - 1);
> > > > +
> > > > +	dfsbno = XFS_B_TO_FSBT(mp, destoff);
> > > > +	sfsbno = XFS_B_TO_FSBT(mp, srcoff);
> > > > +	fsblen = XFS_B_TO_FSB(mp, len);
> > > > +	error = xfs_reflink_remap_blocks(src, sfsbno, dest, dfsbno, fsblen,
> > > > +			destoff + len);
> > > > +	if (error)
> > > > +		goto out_error;
> > > > +
> > > > +	error = xfs_reflink_update_dest(dest, destoff + len);
> > > > +	if (error)
> > > > +		goto out_error;
> > > > +
> > > > +out_error:
> > > > +	xfs_iunlock(src, XFS_MMAPLOCK_EXCL);
> > > > +	xfs_iunlock(src, XFS_IOLOCK_EXCL);
> > > > +	if (src->i_ino != dest->i_ino) {
> > > > +		xfs_iunlock(dest, XFS_MMAPLOCK_EXCL);
> > > > +		xfs_iunlock(dest, XFS_IOLOCK_EXCL);
> > > > +	}
> > > > +	if (error)
> > > > +		trace_xfs_reflink_remap_range_error(dest, error, _RET_IP_);
> > > > +	return error;
> > > > +}
> > > > diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
> > > > index 1d2f180..c35ce29 100644
> > > > --- a/fs/xfs/xfs_reflink.h
> > > > +++ b/fs/xfs/xfs_reflink.h
> > > > @@ -43,5 +43,7 @@ extern int xfs_reflink_cancel_cow_range(struct xfs_inode *ip, xfs_off_t offset,
> > > >  extern int xfs_reflink_end_cow(struct xfs_inode *ip, xfs_off_t offset,
> > > >  		xfs_off_t count);
> > > >  extern int xfs_reflink_recover_cow(struct xfs_mount *mp);
> > > > +extern int xfs_reflink_remap_range(struct xfs_inode *src, xfs_off_t srcoff,
> > > > +		struct xfs_inode *dest, xfs_off_t destoff, xfs_off_t len);
> > > >  
> > > >  #endif /* __XFS_REFLINK_H */
> > > > 


* Re: [PATCH 46/63] xfs: unshare a range of blocks via fallocate
  2016-10-07 21:15         ` Darrick J. Wong
@ 2016-10-07 22:25           ` Dave Chinner
  2016-10-10 17:05             ` Darrick J. Wong
  0 siblings, 1 reply; 187+ messages in thread
From: Dave Chinner @ 2016-10-07 22:25 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, linux-xfs, Christoph Hellwig

On Fri, Oct 07, 2016 at 02:15:40PM -0700, Darrick J. Wong wrote:
> On Fri, Oct 07, 2016 at 04:58:38PM -0400, Brian Foster wrote:
> > On Fri, Oct 07, 2016 at 01:26:39PM -0700, Darrick J. Wong wrote:
> > > On Fri, Oct 07, 2016 at 02:05:07PM -0400, Brian Foster wrote:
> > > > On Thu, Sep 29, 2016 at 08:10:39PM -0700, Darrick J. Wong wrote:
> > > > The code that has been merged is now different from this code :/, but
> > > > just a heads up that the code in the tree looks like it has another one
> > > > of those potentially blind transaction commit sequences between
> > > > xfs_reflink_try_clear_inode_flag() and xfs_reflink_clear_inode_flag().
> > > 
> > > _reflink_unshare jumps out if it's not a reflink inode before
> > > calling _reflink_try_clear_inode_flag -> _reflink_clear_inode_flag.
> > > We do not call _reflink_clear_inode_flag with a non-reflink inode.
> > > As for blindly committing a transaction with no dirty data, that's
> > > fine, _trans_commit checks for that case and simply frees everything
> > > attached to the transaction.
> > > 
> > 
> > Yeah, I saw that. That's what I was alluding to below wrt the usage
> > being fine in the patch. It's just the pattern that's used that stands
> > out.
> > 
> > With regard to the transaction.. sure, that situation may not be broken,
> > but it's still not ideal if it's a log reservation we didn't have to
> > make in the first place.
> 
> Yeah.  We must hold the ilock from the start of the extent iteration
> until we clear (or not) the inode flag, but we have to allocate the
> transaction before grabbing the ilock.  In other words, we don't know if
> we need the transaction until it's too late to get one, hence this
> suboptimal thing where we sometimes get a reservation and never commit
> anything.  I don't know of any way to avoid that.

Getting a transaction we don't use isn't the end of the world -
in most cases it's just a bit of wasted CPU time. Similarly to
committing an empty transaction it has no actual effect except to
increment the empty transaction stat. In this case, commit is just
fine as xfs_trans_commit will detect that it is empty and do the
cancel work directly.

If I start to see too many empty transaction commits in my
performance test runs, I'll let you know and we can start to look
for solutions. But right now I wouldn't worry about it.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 41/63] xfs: reflink extents from one file to another
  2016-10-07 21:41         ` Darrick J. Wong
@ 2016-10-10 13:17           ` Brian Foster
  0 siblings, 0 replies; 187+ messages in thread
From: Brian Foster @ 2016-10-10 13:17 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs

On Fri, Oct 07, 2016 at 02:41:50PM -0700, Darrick J. Wong wrote:
> On Fri, Oct 07, 2016 at 04:48:52PM -0400, Brian Foster wrote:
> > On Fri, Oct 07, 2016 at 12:44:30PM -0700, Darrick J. Wong wrote:
> > > On Fri, Oct 07, 2016 at 02:04:15PM -0400, Brian Foster wrote:
> > > > On Thu, Sep 29, 2016 at 08:10:05PM -0700, Darrick J. Wong wrote:
> > > > > Reflink extents from one file to another; that is to say, iteratively
> > > > > remove the mappings from the destination file, copy the mappings from
> > > > > the source file to the destination file, and increment the reference
> > > > > count of all the blocks that got remapped.
> > > > > 
> > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > ---
> > > > > v2: Call xfs_defer_cancel before cancelling the transaction if the
> > > > > remap operation fails.  Use the deferred operations system to avoid
> > > > > deadlocks or blowing out the transaction reservation, and make the
> > > > > entire reflink operation atomic for each extent being remapped.  The
> > > > > destination file's i_size will be updated if necessary to avoid
> > > > > violating the assumption that there are no shared blocks past the EOF
> > > > > block.
> > > > > ---
> > > > >  fs/xfs/xfs_reflink.c |  425 ++++++++++++++++++++++++++++++++++++++++++++++++++
> > > > >  fs/xfs/xfs_reflink.h |    2 
> > > > >  2 files changed, 427 insertions(+)
> > > > > 
> > > > > 
> > > > > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > > > > index 673ecc1..94c19fff 100644
> > > > > --- a/fs/xfs/xfs_reflink.c
> > > > > +++ b/fs/xfs/xfs_reflink.c
> > > > > @@ -922,3 +922,428 @@ xfs_reflink_recover_cow(
> > > > >  
> > > > >  	return error;
> > > > >  }
> > > > ...
> > > > > +/*
> > > > > + * Unmap a range of blocks from a file, then map other blocks into the hole.
> > > > > + * The range to unmap is (destoff : destoff + srcioff + irec->br_blockcount).
> > > > > + * The extent irec is mapped into dest at irec->br_startoff.
> > > > > + */
> > > > > +STATIC int
> > > > > +xfs_reflink_remap_extent(
> > > > > +	struct xfs_inode	*ip,
> > > > > +	struct xfs_bmbt_irec	*irec,
> > > > > +	xfs_fileoff_t		destoff,
> > > > > +	xfs_off_t		new_isize)
> > > > > +{
> > > > > +	struct xfs_mount	*mp = ip->i_mount;
> > > > > +	struct xfs_trans	*tp;
> > > > > +	xfs_fsblock_t		firstfsb;
> > > > > +	unsigned int		resblks;
> > > > > +	struct xfs_defer_ops	dfops;
> > > > > +	struct xfs_bmbt_irec	uirec;
> > > > > +	bool			real_extent;
> > > > > +	xfs_filblks_t		rlen;
> > > > > +	xfs_filblks_t		unmap_len;
> > > > > +	xfs_off_t		newlen;
> > > > > +	int			error;
> > > > > +
> > > > > +	unmap_len = irec->br_startoff + irec->br_blockcount - destoff;
> > > > > +	trace_xfs_reflink_punch_range(ip, destoff, unmap_len);
> > > > > +
> > > > > +	/* Only remap normal extents. */
> > > > > +	real_extent =  (irec->br_startblock != HOLESTARTBLOCK &&
> > > > > +			irec->br_startblock != DELAYSTARTBLOCK &&
> > > > > +			!ISUNWRITTEN(irec));
> > > > > +
> > > > > +	/* Start a rolling transaction to switch the mappings */
> > > > > +	resblks = XFS_EXTENTADD_SPACE_RES(ip->i_mount, XFS_DATA_FORK);
> > > > > +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0, &tp);
> > > > > +	if (error)
> > > > > +		goto out;
> > > > > +
> > > > > +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > > > > +	xfs_trans_ijoin(tp, ip, 0);
> > > > > +
> > > > > +	/* If we're not just clearing space, then do we have enough quota? */
> > > > > +	if (real_extent) {
> > > > > +		error = xfs_trans_reserve_quota_nblks(tp, ip,
> > > > > +				irec->br_blockcount, 0, XFS_QMOPT_RES_REGBLKS);
> > > > > +		if (error)
> > > > > +			goto out_cancel;
> > > > > +	}
> > > > > +
> > > > > +	trace_xfs_reflink_remap(ip, irec->br_startoff,
> > > > > +				irec->br_blockcount, irec->br_startblock);
> > > > > +
> > > > > +	/* Unmap the old blocks in the data fork. */
> > > > > +	rlen = unmap_len;
> > > > > +	while (rlen) {
> > > > > +		xfs_defer_init(&dfops, &firstfsb);
> > > > > +		error = __xfs_bunmapi(tp, ip, destoff, &rlen, 0, 1,
> > > > > +				&firstfsb, &dfops);
> > > > > +		if (error)
> > > > > +			goto out_defer;
> > > > > +
> > > > > +		/* Trim the extent to whatever got unmapped. */
> > > > > +		uirec = *irec;
> > > > > +		xfs_trim_extent(&uirec, destoff + rlen, unmap_len - rlen);
> > > > > +		unmap_len = rlen;
> > > > > +
> > > > > +		/* If this isn't a real mapping, we're done. */
> > > > > +		if (!real_extent || uirec.br_blockcount == 0)
> > > > > +			goto next_extent;
> > > > > +
> > > > 
> > > > Any reason we couldn't reuse existing mechanisms for this? E.g., hole
> > > > punch the dest file range before we remap the source file extents. That
> > > > might change behavior in the event of a partial/failed reflink, but it's
> > > > not clear to me that matters.
> > > 
> > > It matters a lot for the dedupe operation -- the unmap and remap
> > > operations must be atomic with each other so that if the dedupe
> > > operation fails, the user will still see the same file contents after
> > > reboot/recovery.  We don't want users to find their files suddenly full
> > > of zeroes.
> > > 
> > 
> > Ok, that makes sense. Though the dedup atomicity is provided simply by
> > doing each unmap/remap within the same transaction, right? I'm kind of
> 
> The unmap/remap are done within the same defer_ops, but different transactions.
> 

Ok.

> > wondering if we could do something like refactor/reuse
> > xfs_unmap_extent(), pull the trans alloc/commit and the unmap call up
> > into xfs_reflink_remap_blocks(), then clean out
> > xfs_reflink_remap_extent() a bit as a result.
> 
> Hm.  Let's start with the current structure:
> 
> for each extent in the source file,
>   alloc transaction
>   for each extent in the dest file that bunmapi tells us is now empty,
>     log refcount increase intent
>     log bmap remap intent
>     update quota
>     update isize if needed
>     _defer_finish
>   commit transaction
> 
> You could flatten _remap_extent and _remap_blocks into a single
> function with a double loop, I suppose.  I don't think trying to reuse
> _unmap_extent buys us much, however -- for the truncate case we simply
> unmapi and _defer_finish, but for reflink we have all those extra steps
> that have to go between the bunmapi and the defer_finish.  Furthermore
> we still have to use __xfs_bunmapi for reflink because we have to know
> exactly which part to remap since we can only unmap one extent per
> transaction.
> 

Hmm, Ok. I was really just aiming for some cleanup/reuse, but the
requirements here might make it not worthwhile.
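
For what it's worth, the shape of that double loop can be sketched as a
userspace toy, with plain arrays standing in for the mappings.  Every
name below is made up for illustration; none of these are kernel APIs,
and it works per-block rather than per-extent:

```c
#include <assert.h>

/* Toy model of the structure outlined above: walk the source file's
 * mappings, punch out the corresponding dest mapping, remap the shared
 * block, and bump its reference count.  0 means a hole.  All names
 * here are invented for illustration. */
#define FLEN 6

static int src[FLEN]  = { 11, 12, 0, 0, 15, 16 };
static int dest[FLEN] = { 91, 92, 93, 94, 95, 96 };
static int refcount[128];

static void remap_range(int srcoff, int destoff, int len)
{
	for (int i = 0; i < len; i++) {	/* "for each source extent" */
		/* alloc transaction; bunmapi the old dest block */
		dest[destoff + i] = 0;
		if (src[srcoff + i] != 0) {
			/* log refcount-increase and bmap-remap intents */
			refcount[src[srcoff + i]]++;
			dest[destoff + i] = src[srcoff + i];
		}
		/* defer_finish; commit transaction */
	}
}
```

(Obviously none of the transaction machinery is real here; the point is
just that holes are skipped while real blocks get remapped and
refcounted.)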

> > But meh, this stuff is already merged so maybe I should just send a
> > patch. :P
> 
> That said, if you send a patch I'll have a look. :)
> 

I'll play around with it when I have a chance. If nothing else it will
probably help me understand it better. ;) Thanks.

Brian

> --D
> 
> > 
> > Brian
> > 
> > > For reflink I suspect that you're right, but we already guarantee that
> > > the user sees either the old contents or the new contents, so yay. :)
> > > 
> > > > 
> > > > > +		trace_xfs_reflink_remap(ip, uirec.br_startoff,
> > > > > +				uirec.br_blockcount, uirec.br_startblock);
> > > > > +
> > > > ...
> > > > > +}
> > > > > +
> > > > > +/*
> > > > > + * Iteratively remap one file's extents (and holes) to another's.
> > > > > + */
> > > > > +STATIC int
> > > > > +xfs_reflink_remap_blocks(
> > > > > +	struct xfs_inode	*src,
> > > > > +	xfs_fileoff_t		srcoff,
> > > > > +	struct xfs_inode	*dest,
> > > > > +	xfs_fileoff_t		destoff,
> > > > > +	xfs_filblks_t		len,
> > > > > +	xfs_off_t		new_isize)
> > > > > +{
> > > > > +	struct xfs_bmbt_irec	imap;
> > > > > +	int			nimaps;
> > > > > +	int			error = 0;
> > > > > +	xfs_filblks_t		range_len;
> > > > > +
> > > > > +	/* drange = (destoff, destoff + len); srange = (srcoff, srcoff + len) */
> > > > > +	while (len) {
> > > > > +		trace_xfs_reflink_remap_blocks_loop(src, srcoff, len,
> > > > > +				dest, destoff);
> > > > > +		/* Read extent from the source file */
> > > > > +		nimaps = 1;
> > > > > +		xfs_ilock(src, XFS_ILOCK_EXCL);
> > > > > +		error = xfs_bmapi_read(src, srcoff, len, &imap, &nimaps, 0);
> > > > > +		xfs_iunlock(src, XFS_ILOCK_EXCL);
> > > > > +		if (error)
> > > > > +			goto err;
> > > > > +		ASSERT(nimaps == 1);
> > > > > +
> > > > > +		trace_xfs_reflink_remap_imap(src, srcoff, len, XFS_IO_OVERWRITE,
> > > > > +				&imap);
> > > > > +
> > > > > +		/* Translate imap into the destination file. */
> > > > > +		range_len = imap.br_startoff + imap.br_blockcount - srcoff;
> > > > > +		imap.br_startoff += destoff - srcoff;
> > > > > +
> > > > 
> > > > Just FYI... these are all unsigned vars...
> > > 
> > > Yeah.  It should handle that correctly.  See generic/30[34].
> > > 
> > > --D
> > > 
> > > > 
> > > > Brian
> > > > 
> > > > > +		/* Clear dest from destoff to the end of imap and map it in. */
> > > > > +		error = xfs_reflink_remap_extent(dest, &imap, destoff,
> > > > > +				new_isize);
> > > > > +		if (error)
> > > > > +			goto err;
> > > > > +
> > > > > +		if (fatal_signal_pending(current)) {
> > > > > +			error = -EINTR;
> > > > > +			goto err;
> > > > > +		}
> > > > > +
> > > > > +		/* Advance drange/srange */
> > > > > +		srcoff += range_len;
> > > > > +		destoff += range_len;
> > > > > +		len -= range_len;
> > > > > +	}
> > > > > +
> > > > > +	return 0;
> > > > > +
> > > > > +err:
> > > > > +	trace_xfs_reflink_remap_blocks_error(dest, error, _RET_IP_);
> > > > > +	return error;
> > > > > +}
> > > > > +
> > > > > +/*
> > > > > + * Link a range of blocks from one file to another.
> > > > > + */
> > > > > +int
> > > > > +xfs_reflink_remap_range(
> > > > > +	struct xfs_inode	*src,
> > > > > +	xfs_off_t		srcoff,
> > > > > +	struct xfs_inode	*dest,
> > > > > +	xfs_off_t		destoff,
> > > > > +	xfs_off_t		len)
> > > > > +{
> > > > > +	struct xfs_mount	*mp = src->i_mount;
> > > > > +	xfs_fileoff_t		sfsbno, dfsbno;
> > > > > +	xfs_filblks_t		fsblen;
> > > > > +	int			error;
> > > > > +
> > > > > +	if (!xfs_sb_version_hasreflink(&mp->m_sb))
> > > > > +		return -EOPNOTSUPP;
> > > > > +
> > > > > +	if (XFS_FORCED_SHUTDOWN(mp))
> > > > > +		return -EIO;
> > > > > +
> > > > > +	/* Don't reflink realtime inodes */
> > > > > +	if (XFS_IS_REALTIME_INODE(src) || XFS_IS_REALTIME_INODE(dest))
> > > > > +		return -EINVAL;
> > > > > +
> > > > > +	trace_xfs_reflink_remap_range(src, srcoff, len, dest, destoff);
> > > > > +
> > > > > +	/* Lock both files against IO */
> > > > > +	if (src->i_ino == dest->i_ino) {
> > > > > +		xfs_ilock(src, XFS_IOLOCK_EXCL);
> > > > > +		xfs_ilock(src, XFS_MMAPLOCK_EXCL);
> > > > > +	} else {
> > > > > +		xfs_lock_two_inodes(src, dest, XFS_IOLOCK_EXCL);
> > > > > +		xfs_lock_two_inodes(src, dest, XFS_MMAPLOCK_EXCL);
> > > > > +	}
> > > > > +
> > > > > +	error = xfs_reflink_set_inode_flag(src, dest);
> > > > > +	if (error)
> > > > > +		goto out_error;
> > > > > +
> > > > > +	/*
> > > > > +	 * Invalidate the page cache so that we can clear any CoW mappings
> > > > > +	 * in the destination file.
> > > > > +	 */
> > > > > +	truncate_inode_pages_range(&VFS_I(dest)->i_data, destoff,
> > > > > +				   PAGE_ALIGN(destoff + len) - 1);
> > > > > +
> > > > > +	dfsbno = XFS_B_TO_FSBT(mp, destoff);
> > > > > +	sfsbno = XFS_B_TO_FSBT(mp, srcoff);
> > > > > +	fsblen = XFS_B_TO_FSB(mp, len);
> > > > > +	error = xfs_reflink_remap_blocks(src, sfsbno, dest, dfsbno, fsblen,
> > > > > +			destoff + len);
> > > > > +	if (error)
> > > > > +		goto out_error;
> > > > > +
> > > > > +	error = xfs_reflink_update_dest(dest, destoff + len);
> > > > > +	if (error)
> > > > > +		goto out_error;
> > > > > +
> > > > > +out_error:
> > > > > +	xfs_iunlock(src, XFS_MMAPLOCK_EXCL);
> > > > > +	xfs_iunlock(src, XFS_IOLOCK_EXCL);
> > > > > +	if (src->i_ino != dest->i_ino) {
> > > > > +		xfs_iunlock(dest, XFS_MMAPLOCK_EXCL);
> > > > > +		xfs_iunlock(dest, XFS_IOLOCK_EXCL);
> > > > > +	}
> > > > > +	if (error)
> > > > > +		trace_xfs_reflink_remap_range_error(dest, error, _RET_IP_);
> > > > > +	return error;
> > > > > +}
> > > > > diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
> > > > > index 1d2f180..c35ce29 100644
> > > > > --- a/fs/xfs/xfs_reflink.h
> > > > > +++ b/fs/xfs/xfs_reflink.h
> > > > > @@ -43,5 +43,7 @@ extern int xfs_reflink_cancel_cow_range(struct xfs_inode *ip, xfs_off_t offset,
> > > > >  extern int xfs_reflink_end_cow(struct xfs_inode *ip, xfs_off_t offset,
> > > > >  		xfs_off_t count);
> > > > >  extern int xfs_reflink_recover_cow(struct xfs_mount *mp);
> > > > > +extern int xfs_reflink_remap_range(struct xfs_inode *src, xfs_off_t srcoff,
> > > > > +		struct xfs_inode *dest, xfs_off_t destoff, xfs_off_t len);
> > > > >  
> > > > >  #endif /* __XFS_REFLINK_H */
> > > > > 
> > > > > --
> > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > the body of a message to majordomo@vger.kernel.org
> > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 46/63] xfs: unshare a range of blocks via fallocate
  2016-10-07 22:25           ` Dave Chinner
@ 2016-10-10 17:05             ` Darrick J. Wong
  0 siblings, 0 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-10-10 17:05 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Brian Foster, linux-xfs, Christoph Hellwig

On Sat, Oct 08, 2016 at 09:25:08AM +1100, Dave Chinner wrote:
> On Fri, Oct 07, 2016 at 02:15:40PM -0700, Darrick J. Wong wrote:
> > On Fri, Oct 07, 2016 at 04:58:38PM -0400, Brian Foster wrote:
> > > On Fri, Oct 07, 2016 at 01:26:39PM -0700, Darrick J. Wong wrote:
> > > > On Fri, Oct 07, 2016 at 02:05:07PM -0400, Brian Foster wrote:
> > > > > On Thu, Sep 29, 2016 at 08:10:39PM -0700, Darrick J. Wong wrote:
> > > > > The code that has been merged is now different from this code :/, but
> > > > > just a heads up that the code in the tree looks like it has another one
> > > > > of those potentially blind transaction commit sequences between
> > > > > xfs_reflink_try_clear_inode_flag() and xfs_reflink_clear_inode_flag().
> > > > 
> > > > _reflink_unshare jumps out if it's not a reflink inode before
> > > > calling _reflink_try_clear_inode_flag -> _reflink_clear_inode_flag.
> > > > We do not call _reflink_clear_inode_flag with a non-reflink inode.
> > > > As for blindly committing a transaction with no dirty data, that's
> > > > fine, _trans_commit checks for that case and simply frees everything
> > > > attached to the transaction.
> > > > 
> > > 
> > > Yeah, I saw that. That's what I was alluding to below wrt the usage
> > > being fine in the patch. It's just the pattern that's used that stands
> > > out.
> > > 
> > > With regard to the transaction.. sure, that situation may not be broken,
> > > but it's still not ideal if it's a log reservation we didn't have to
> > > make in the first place.
> > 
> > Yeah.  We must hold the ilock from the start of the extent iteration
> > until we clear (or not) the inode flag, but we have to allocate the
> > transaction before grabbing the ilock.  In other words, we don't know if
> > we need the transaction until it's too late to get one, hence this
> > suboptimal thing where we sometimes get a reservation and never commit
> > anything.  I don't know of any way to avoid that.
> 
> Getting a transaction we don't use isn't the end of the world -
> in most cases it's just a bit of wasted CPU time. Similarly to
> committing an empty transaction, it has no actual effect except to
> increment the empty transaction stat. In this case, commit is just
> fine as xfs_trans_commit will detect that it is empty and do the
> cancel work directly.

This is going to become a bigger thing once we get to online scrub
because I use empty transactions to avoid deadlock problems.  I observed
that the routine to grab a buffer will lock the buffer and (optionally)
attach it to a transaction.  Subsequent attempts to re-grab a still
locked buffer succeed if the buffer is attached to the transaction, and
are made to wait for the lock if not.

We can use this as a strategy to detect tree cycles:

n0 (root) -> n1 -> n2 -> n3
             ^------------+

By the time we hit the bad pointer in n3, we've locked n0-3 and attached
it to the empty transaction.  Next, we read the bad pointer in n3 and
try to grab n1 again.  Since it's locked and attached to our empty
transaction, we can read the buffer, notice that the level is wrong,
and declare the tree to be corrupt.  On our way out, we call
xfs_trans_cancel to unlock everything.  It's a little uncomfortable to
be (ab)using transactions for their ability to track locked buffers, but
oh well.
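
Roughly, as a userspace toy (all names invented for illustration; the
"empty transaction" degenerates to a locked-buffer set):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the trick above: "attach the buffer to the empty
 * transaction" becomes membership in locked[], and re-grabbing a node
 * that's already a member means the tree has a pointer cycle.  None of
 * these names are XFS functions. */
#define NNODES 4

static bool locked[NNODES];		/* the "empty transaction" */

/* child[i] is node i's downward pointer; -1 terminates the walk */
static int walk_btree(const int *child, int root)
{
	for (int i = 0; i < NNODES; i++)
		locked[i] = false;	/* fresh transaction per walk */

	for (int n = root; n != -1; n = child[n]) {
		if (locked[n])
			return -1;	/* already held: declare corruption */
		locked[n] = true;	/* grab and attach the buffer */
	}
	return 0;			/* "cancel" unlocks everything */
}
```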

Note we can also use this for escaping crosslinked btrees:

bno0 -> b1 -> b2 ---------+
                          V
          rmap0 -> r1 -> r2 -> r3

Let's say we're checking rmap records out of r3 and we want to make sure
that the bnobt does not have a record for this rmapping.  We start down
the bnobt until we hit the bad pointer in b2 that points to a block
we already locked while reading the rmapbt.  Having the transaction
allows the bnobt cursor to read r2 and fail the read verifier, after
which we can cancel the transaction and tell userspace that there's
something wrong.  If we didn't have the transaction, we'd try to lock a
buffer that we already locked, which deadlocks the system.

I suppose I had better write all this down in xfs_scrub.c before I send
out patches for review.

--D

> 
> If I start to see too many empty transaction commits in my
> performance test runs, I'll let you know and we can start to look
> for solutions. But right now I wouldn't worry about it.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 48/63] xfs: preallocate blocks for worst-case btree expansion
  2016-09-30  3:10 ` [PATCH 48/63] xfs: preallocate blocks for worst-case btree expansion Darrick J. Wong
  2016-09-30  8:19   ` Christoph Hellwig
@ 2016-10-12 18:44   ` Brian Foster
  2016-10-12 20:52     ` Darrick J. Wong
  1 sibling, 1 reply; 187+ messages in thread
From: Brian Foster @ 2016-10-12 18:44 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, Christoph Hellwig

On Thu, Sep 29, 2016 at 08:10:52PM -0700, Darrick J. Wong wrote:
> To gracefully handle the situation where a CoW operation turns a
> single refcount extent into a lot of tiny ones and we then run out of
> space when a tree split has to happen, use the per-AG reserved block
> pool to pre-allocate all the space we'll ever need for a maximal
> btree.  For a 4K block size, this only costs an overhead of 0.3% of
> available disk space.
> 
> When reflink is enabled, we have an unfortunate problem with rmap --
> since we can share a block billions of times, this means that the
> reverse mapping btree can expand basically infinitely.  When an AG is
> so full that there are no free blocks with which to expand the rmapbt,
> the filesystem will shut down hard.
> 
> This is rather annoying to the user, so use the AG reservation code to
> reserve a "reasonable" amount of space for rmap.  We'll prevent
> reflinks and CoW operations if we think we're getting close to
> exhausting an AG's free space rather than shutting down, but this
> permanent reservation should be enough for "most" users.  Hopefully.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> [hch@lst.de: ensure that we invalidate the freed btree buffer]
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
> v2: Simplify the return value from xfs_perag_pool_free_block to a bool
> so that we can easily call xfs_trans_binval for both the per-AG pool
> and the real freeing case.  Without this we fail to invalidate the
> btree buffer and will trip over the write verifier on a shrinking
> refcount btree.
> 
> v3: Convert to the new per-AG reservation code.
> 
> v4: Combine this patch with the one that adds the rmapbt reservation,
> since the rmapbt reservation is only needed for reflink filesystems.
> 
> v5: If we detect errors while counting the refcount or rmap btrees,
> shut down the filesystem to avoid the scenario where the fs shuts down
> mid-transaction due to btree corruption, repair refuses to run until
> the log is clean, and the log cannot be cleaned because replay hits
> btree corruption and shuts down.
> ---
>  fs/xfs/libxfs/xfs_ag_resv.c        |   11 ++++++
>  fs/xfs/libxfs/xfs_refcount_btree.c |   45 ++++++++++++++++++++++++-
>  fs/xfs/libxfs/xfs_refcount_btree.h |    3 ++
>  fs/xfs/libxfs/xfs_rmap_btree.c     |   60 ++++++++++++++++++++++++++++++++++
>  fs/xfs/libxfs/xfs_rmap_btree.h     |    7 ++++
>  fs/xfs/xfs_fsops.c                 |   64 ++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_fsops.h                 |    3 ++
>  fs/xfs/xfs_mount.c                 |    8 +++++
>  fs/xfs/xfs_super.c                 |   12 +++++++
>  9 files changed, 210 insertions(+), 3 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c
> index e3ae0f2..adf770f 100644
> --- a/fs/xfs/libxfs/xfs_ag_resv.c
> +++ b/fs/xfs/libxfs/xfs_ag_resv.c
> @@ -38,6 +38,7 @@
>  #include "xfs_trans_space.h"
>  #include "xfs_rmap_btree.h"
>  #include "xfs_btree.h"
> +#include "xfs_refcount_btree.h"
>  
>  /*
>   * Per-AG Block Reservations
> @@ -228,6 +229,11 @@ xfs_ag_resv_init(
>  	if (pag->pag_meta_resv.ar_asked == 0) {
>  		ask = used = 0;
>  
> +		error = xfs_refcountbt_calc_reserves(pag->pag_mount,
> +				pag->pag_agno, &ask, &used);
> +		if (error)
> +			goto out;
> +
>  		error = __xfs_ag_resv_init(pag, XFS_AG_RESV_METADATA,
>  				ask, used);

Now that I get here, I see we have these per-ag reservation structures
and whatnot, but __xfs_ag_resv_init() (from a previous patch) calls
xfs_mod_fdblocks() for the reservation. AFAICT, that reserves from the
"global pool." Based on the commit log, isn't the intent here to reserve
blocks within each AG? What am I missing?

Brian

>  		if (error)
> @@ -238,6 +244,11 @@ xfs_ag_resv_init(
>  	if (pag->pag_agfl_resv.ar_asked == 0) {
>  		ask = used = 0;
>  
> +		error = xfs_rmapbt_calc_reserves(pag->pag_mount, pag->pag_agno,
> +				&ask, &used);
> +		if (error)
> +			goto out;
> +
>  		error = __xfs_ag_resv_init(pag, XFS_AG_RESV_AGFL, ask, used);
>  		if (error)
>  			goto out;
> diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
> index 6b5e82b9..453bb27 100644
> --- a/fs/xfs/libxfs/xfs_refcount_btree.c
> +++ b/fs/xfs/libxfs/xfs_refcount_btree.c
> @@ -79,6 +79,8 @@ xfs_refcountbt_alloc_block(
>  	struct xfs_alloc_arg	args;		/* block allocation args */
>  	int			error;		/* error return value */
>  
> +	XFS_BTREE_TRACE_CURSOR(cur, XBT_ENTRY);
> +
>  	memset(&args, 0, sizeof(args));
>  	args.tp = cur->bc_tp;
>  	args.mp = cur->bc_mp;
> @@ -88,6 +90,7 @@ xfs_refcountbt_alloc_block(
>  	args.firstblock = args.fsbno;
>  	xfs_rmap_ag_owner(&args.oinfo, XFS_RMAP_OWN_REFC);
>  	args.minlen = args.maxlen = args.prod = 1;
> +	args.resv = XFS_AG_RESV_METADATA;
>  
>  	error = xfs_alloc_vextent(&args);
>  	if (error)
> @@ -125,16 +128,19 @@ xfs_refcountbt_free_block(
>  	struct xfs_agf		*agf = XFS_BUF_TO_AGF(agbp);
>  	xfs_fsblock_t		fsbno = XFS_DADDR_TO_FSB(mp, XFS_BUF_ADDR(bp));
>  	struct xfs_owner_info	oinfo;
> +	int			error;
>  
>  	trace_xfs_refcountbt_free_block(cur->bc_mp, cur->bc_private.a.agno,
>  			XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno), 1);
>  	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_REFC);
>  	be32_add_cpu(&agf->agf_refcount_blocks, -1);
>  	xfs_alloc_log_agf(cur->bc_tp, agbp, XFS_AGF_REFCOUNT_BLOCKS);
> -	xfs_bmap_add_free(mp, cur->bc_private.a.dfops, fsbno, 1,
> -			&oinfo);
> +	error = xfs_free_extent(cur->bc_tp, fsbno, 1, &oinfo,
> +			XFS_AG_RESV_METADATA);
> +	if (error)
> +		return error;
>  
> -	return 0;
> +	return error;
>  }
>  
>  STATIC int
> @@ -410,3 +416,36 @@ xfs_refcountbt_max_size(
>  
>  	return xfs_refcountbt_calc_size(mp, mp->m_sb.sb_agblocks);
>  }
> +
> +/*
> + * Figure out how many blocks to reserve and how many are used by this btree.
> + */
> +int
> +xfs_refcountbt_calc_reserves(
> +	struct xfs_mount	*mp,
> +	xfs_agnumber_t		agno,
> +	xfs_extlen_t		*ask,
> +	xfs_extlen_t		*used)
> +{
> +	struct xfs_buf		*agbp;
> +	struct xfs_agf		*agf;
> +	xfs_extlen_t		tree_len;
> +	int			error;
> +
> +	if (!xfs_sb_version_hasreflink(&mp->m_sb))
> +		return 0;
> +
> +	*ask += xfs_refcountbt_max_size(mp);
> +
> +	error = xfs_alloc_read_agf(mp, NULL, agno, 0, &agbp);
> +	if (error)
> +		return error;
> +
> +	agf = XFS_BUF_TO_AGF(agbp);
> +	tree_len = be32_to_cpu(agf->agf_refcount_blocks);
> +	xfs_buf_relse(agbp);
> +
> +	*used += tree_len;
> +
> +	return error;
> +}
> diff --git a/fs/xfs/libxfs/xfs_refcount_btree.h b/fs/xfs/libxfs/xfs_refcount_btree.h
> index 780b02f..3be7768 100644
> --- a/fs/xfs/libxfs/xfs_refcount_btree.h
> +++ b/fs/xfs/libxfs/xfs_refcount_btree.h
> @@ -68,4 +68,7 @@ extern xfs_extlen_t xfs_refcountbt_calc_size(struct xfs_mount *mp,
>  		unsigned long long len);
>  extern xfs_extlen_t xfs_refcountbt_max_size(struct xfs_mount *mp);
>  
> +extern int xfs_refcountbt_calc_reserves(struct xfs_mount *mp,
> +		xfs_agnumber_t agno, xfs_extlen_t *ask, xfs_extlen_t *used);
> +
>  #endif	/* __XFS_REFCOUNT_BTREE_H__ */
> diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
> index 9c0585e..83e672f 100644
> --- a/fs/xfs/libxfs/xfs_rmap_btree.c
> +++ b/fs/xfs/libxfs/xfs_rmap_btree.c
> @@ -35,6 +35,7 @@
>  #include "xfs_cksum.h"
>  #include "xfs_error.h"
>  #include "xfs_extent_busy.h"
> +#include "xfs_ag_resv.h"
>  
>  /*
>   * Reverse map btree.
> @@ -533,3 +534,62 @@ xfs_rmapbt_compute_maxlevels(
>  		mp->m_rmap_maxlevels = xfs_btree_compute_maxlevels(mp,
>  				mp->m_rmap_mnr, mp->m_sb.sb_agblocks);
>  }
> +
> +/* Calculate the rmap btree size for some records. */
> +xfs_extlen_t
> +xfs_rmapbt_calc_size(
> +	struct xfs_mount	*mp,
> +	unsigned long long	len)
> +{
> +	return xfs_btree_calc_size(mp, mp->m_rmap_mnr, len);
> +}
> +
> +/*
> + * Calculate the maximum rmap btree size.
> + */
> +xfs_extlen_t
> +xfs_rmapbt_max_size(
> +	struct xfs_mount	*mp)
> +{
> +	/* Bail out if we're uninitialized, which can happen in mkfs. */
> +	if (mp->m_rmap_mxr[0] == 0)
> +		return 0;
> +
> +	return xfs_rmapbt_calc_size(mp, mp->m_sb.sb_agblocks);
> +}
> +
> +/*
> + * Figure out how many blocks to reserve and how many are used by this btree.
> + */
> +int
> +xfs_rmapbt_calc_reserves(
> +	struct xfs_mount	*mp,
> +	xfs_agnumber_t		agno,
> +	xfs_extlen_t		*ask,
> +	xfs_extlen_t		*used)
> +{
> +	struct xfs_buf		*agbp;
> +	struct xfs_agf		*agf;
> +	xfs_extlen_t		pool_len;
> +	xfs_extlen_t		tree_len;
> +	int			error;
> +
> +	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
> +		return 0;
> +
> +	/* Reserve 1% of the AG or enough for 1 block per record. */
> +	pool_len = max(mp->m_sb.sb_agblocks / 100, xfs_rmapbt_max_size(mp));
> +	*ask += pool_len;
> +
> +	error = xfs_alloc_read_agf(mp, NULL, agno, 0, &agbp);
> +	if (error)
> +		return error;
> +
> +	agf = XFS_BUF_TO_AGF(agbp);
> +	tree_len = be32_to_cpu(agf->agf_rmap_blocks);
> +	xfs_buf_relse(agbp);
> +
> +	*used += tree_len;
> +
> +	return error;
> +}
> diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
> index e73a553..2a9ac47 100644
> --- a/fs/xfs/libxfs/xfs_rmap_btree.h
> +++ b/fs/xfs/libxfs/xfs_rmap_btree.h
> @@ -58,4 +58,11 @@ struct xfs_btree_cur *xfs_rmapbt_init_cursor(struct xfs_mount *mp,
>  int xfs_rmapbt_maxrecs(struct xfs_mount *mp, int blocklen, int leaf);
>  extern void xfs_rmapbt_compute_maxlevels(struct xfs_mount *mp);
>  
> +extern xfs_extlen_t xfs_rmapbt_calc_size(struct xfs_mount *mp,
> +		unsigned long long len);
> +extern xfs_extlen_t xfs_rmapbt_max_size(struct xfs_mount *mp);
> +
> +extern int xfs_rmapbt_calc_reserves(struct xfs_mount *mp,
> +		xfs_agnumber_t agno, xfs_extlen_t *ask, xfs_extlen_t *used);
> +
>  #endif	/* __XFS_RMAP_BTREE_H__ */
> diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> index 3acbf4e0..93d12fa 100644
> --- a/fs/xfs/xfs_fsops.c
> +++ b/fs/xfs/xfs_fsops.c
> @@ -43,6 +43,7 @@
>  #include "xfs_log.h"
>  #include "xfs_filestream.h"
>  #include "xfs_rmap.h"
> +#include "xfs_ag_resv.h"
>  
>  /*
>   * File system operations
> @@ -630,6 +631,11 @@ xfs_growfs_data_private(
>  	xfs_set_low_space_thresholds(mp);
>  	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
>  
> +	/* Reserve AG metadata blocks. */
> +	error = xfs_fs_reserve_ag_blocks(mp);
> +	if (error && error != -ENOSPC)
> +		goto out;
> +
>  	/* update secondary superblocks. */
>  	for (agno = 1; agno < nagcount; agno++) {
>  		error = 0;
> @@ -680,6 +686,8 @@ xfs_growfs_data_private(
>  			continue;
>  		}
>  	}
> +
> + out:
>  	return saved_error ? saved_error : error;
>  
>   error0:
> @@ -989,3 +997,59 @@ xfs_do_force_shutdown(
>  	"Please umount the filesystem and rectify the problem(s)");
>  	}
>  }
> +
> +/*
> + * Reserve free space for per-AG metadata.
> + */
> +int
> +xfs_fs_reserve_ag_blocks(
> +	struct xfs_mount	*mp)
> +{
> +	xfs_agnumber_t		agno;
> +	struct xfs_perag	*pag;
> +	int			error = 0;
> +	int			err2;
> +
> +	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
> +		pag = xfs_perag_get(mp, agno);
> +		err2 = xfs_ag_resv_init(pag);
> +		xfs_perag_put(pag);
> +		if (err2 && !error)
> +			error = err2;
> +	}
> +
> +	if (error && error != -ENOSPC) {
> +		xfs_warn(mp,
> +	"Error %d reserving per-AG metadata reserve pool.", error);
> +		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> +	}
> +
> +	return error;
> +}
> +
> +/*
> + * Free space reserved for per-AG metadata.
> + */
> +int
> +xfs_fs_unreserve_ag_blocks(
> +	struct xfs_mount	*mp)
> +{
> +	xfs_agnumber_t		agno;
> +	struct xfs_perag	*pag;
> +	int			error = 0;
> +	int			err2;
> +
> +	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
> +		pag = xfs_perag_get(mp, agno);
> +		err2 = xfs_ag_resv_free(pag);
> +		xfs_perag_put(pag);
> +		if (err2 && !error)
> +			error = err2;
> +	}
> +
> +	if (error)
> +		xfs_warn(mp,
> +	"Error %d freeing per-AG metadata reserve pool.", error);
> +
> +	return error;
> +}
> diff --git a/fs/xfs/xfs_fsops.h b/fs/xfs/xfs_fsops.h
> index f32713f..f349158 100644
> --- a/fs/xfs/xfs_fsops.h
> +++ b/fs/xfs/xfs_fsops.h
> @@ -26,4 +26,7 @@ extern int xfs_reserve_blocks(xfs_mount_t *mp, __uint64_t *inval,
>  				xfs_fsop_resblks_t *outval);
>  extern int xfs_fs_goingdown(xfs_mount_t *mp, __uint32_t inflags);
>  
> +extern int xfs_fs_reserve_ag_blocks(struct xfs_mount *mp);
> +extern int xfs_fs_unreserve_ag_blocks(struct xfs_mount *mp);
> +
>  #endif	/* __XFS_FSOPS_H__ */
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index caecbd2..b5da81d 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -986,10 +986,17 @@ xfs_mountfs(
>  			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
>  			goto out_quota;
>  		}
> +
> +		/* Reserve AG blocks for future btree expansion. */
> +		error = xfs_fs_reserve_ag_blocks(mp);
> +		if (error && error != -ENOSPC)
> +			goto out_agresv;
>  	}
>  
>  	return 0;
>  
> + out_agresv:
> +	xfs_fs_unreserve_ag_blocks(mp);
>   out_quota:
>  	xfs_qm_unmount_quotas(mp);
>   out_rtunmount:
> @@ -1034,6 +1041,7 @@ xfs_unmountfs(
>  
>  	cancel_delayed_work_sync(&mp->m_eofblocks_work);
>  
> +	xfs_fs_unreserve_ag_blocks(mp);
>  	xfs_qm_unmount_quotas(mp);
>  	xfs_rtunmount_inodes(mp);
>  	IRELE(mp->m_rootip);
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index e6aaa91..875ab9f 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1315,10 +1315,22 @@ xfs_fs_remount(
>  			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
>  			return error;
>  		}
> +
> +		/* Create the per-AG metadata reservation pool. */
> +		error = xfs_fs_reserve_ag_blocks(mp);
> +		if (error && error != -ENOSPC)
> +			return error;
>  	}
>  
>  	/* rw -> ro */
>  	if (!(mp->m_flags & XFS_MOUNT_RDONLY) && (*flags & MS_RDONLY)) {
> +		/* Free the per-AG metadata reservation pool. */
> +		error = xfs_fs_unreserve_ag_blocks(mp);
> +		if (error) {
> +			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> +			return error;
> +		}
> +
>  		/*
>  		 * Before we sync the metadata, we need to free up the reserve
>  		 * block pool so that the used block count in the superblock on
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 48/63] xfs: preallocate blocks for worst-case btree expansion
  2016-10-12 18:44   ` Brian Foster
@ 2016-10-12 20:52     ` Darrick J. Wong
  2016-10-12 22:42       ` Brian Foster
  0 siblings, 1 reply; 187+ messages in thread
From: Darrick J. Wong @ 2016-10-12 20:52 UTC (permalink / raw)
  To: Brian Foster; +Cc: david, linux-xfs, Christoph Hellwig

On Wed, Oct 12, 2016 at 02:44:51PM -0400, Brian Foster wrote:
> On Thu, Sep 29, 2016 at 08:10:52PM -0700, Darrick J. Wong wrote:
> > To gracefully handle the situation where a CoW operation turns a
> > single refcount extent into a lot of tiny ones and then runs out of
> > space when a tree split has to happen, use the per-AG reserved block
> > pool to pre-allocate all the space we'll ever need for a maximal
> > btree.  For a 4K block size, this only costs an overhead of 0.3% of
> > available disk space.
> > 
> > When reflink is enabled, we have an unfortunate problem with rmap --
> > since we can share a block billions of times, this means that the
> > reverse mapping btree can expand basically infinitely.  When an AG is
> > so full that there are no free blocks with which to expand the rmapbt,
> > the filesystem will shut down hard.
> > 
> > This is rather annoying to the user, so use the AG reservation code to
> > reserve a "reasonable" amount of space for rmap.  We'll prevent
> > reflinks and CoW operations if we think we're getting close to
> > exhausting an AG's free space rather than shutting down, but this
> > permanent reservation should be enough for "most" users.  Hopefully.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > [hch@lst.de: ensure that we invalidate the freed btree buffer]
> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > ---
> > v2: Simplify the return value from xfs_perag_pool_free_block to a bool
> > so that we can easily call xfs_trans_binval for both the per-AG pool
> > and the real freeing case.  Without this we fail to invalidate the
> > btree buffer and will trip over the write verifier on a shrinking
> > refcount btree.
> > 
> > v3: Convert to the new per-AG reservation code.
> > 
> > v4: Combine this patch with the one that adds the rmapbt reservation,
> > since the rmapbt reservation is only needed for reflink filesystems.
> > 
> > v5: If we detect errors while counting the refcount or rmap btrees,
> > shut down the filesystem to avoid the scenario where the fs shuts down
> > mid-transaction due to btree corruption, repair refuses to run until
> > the log is clean, and the log cannot be cleaned because replay hits
> > btree corruption and shuts down.
> > ---
> >  fs/xfs/libxfs/xfs_ag_resv.c        |   11 ++++++
> >  fs/xfs/libxfs/xfs_refcount_btree.c |   45 ++++++++++++++++++++++++-
> >  fs/xfs/libxfs/xfs_refcount_btree.h |    3 ++
> >  fs/xfs/libxfs/xfs_rmap_btree.c     |   60 ++++++++++++++++++++++++++++++++++
> >  fs/xfs/libxfs/xfs_rmap_btree.h     |    7 ++++
> >  fs/xfs/xfs_fsops.c                 |   64 ++++++++++++++++++++++++++++++++++++
> >  fs/xfs/xfs_fsops.h                 |    3 ++
> >  fs/xfs/xfs_mount.c                 |    8 +++++
> >  fs/xfs/xfs_super.c                 |   12 +++++++
> >  9 files changed, 210 insertions(+), 3 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c
> > index e3ae0f2..adf770f 100644
> > --- a/fs/xfs/libxfs/xfs_ag_resv.c
> > +++ b/fs/xfs/libxfs/xfs_ag_resv.c
> > @@ -38,6 +38,7 @@
> >  #include "xfs_trans_space.h"
> >  #include "xfs_rmap_btree.h"
> >  #include "xfs_btree.h"
> > +#include "xfs_refcount_btree.h"
> >  
> >  /*
> >   * Per-AG Block Reservations
> > @@ -228,6 +229,11 @@ xfs_ag_resv_init(
> >  	if (pag->pag_meta_resv.ar_asked == 0) {
> >  		ask = used = 0;
> >  
> > +		error = xfs_refcountbt_calc_reserves(pag->pag_mount,
> > +				pag->pag_agno, &ask, &used);
> > +		if (error)
> > +			goto out;
> > +
> >  		error = __xfs_ag_resv_init(pag, XFS_AG_RESV_METADATA,
> >  				ask, used);
> 
> Now that I get here, I see we have these per-ag reservation structures
> and whatnot, but __xfs_ag_resv_init() (from a previous patch) calls
> xfs_mod_fdblocks() for the reservation. AFAICT, that reserves from the
> "global pool." Based on the commit log, isn't the intent here to reserve
> blocks within each AG? What am I missing?

The AG reservation code "reserves" blocks in each AG by hiding them from
the allocator.  They're all still there in the bnobt, but we underreport
the length of the longest free extent and the free block count in that
AG to make it look like there's less free space than there is.  Since
those blocks are no longer generally available, we have to decrease the
in-core free block count so we can't create delalloc reservations that
the allocator won't (or can't) satisfy.

Maybe a more concrete way to put that is: say we have 4 AGs with 4 agresv
blocks each, and no other free space left anywhere.  The in-core fdblocks
count should be 0 so that starting a write into a hole returns ENOSPC even
if the write could be done without any btree shape changes.  Otherwise,
writepages tries to allocate the delalloc reservation, fails to find any
space because we've hidden it, and kaboom.

--D

> 
> Brian
> 
> >  		if (error)
> > @@ -238,6 +244,11 @@ xfs_ag_resv_init(
> >  	if (pag->pag_agfl_resv.ar_asked == 0) {
> >  		ask = used = 0;
> >  
> > +		error = xfs_rmapbt_calc_reserves(pag->pag_mount, pag->pag_agno,
> > +				&ask, &used);
> > +		if (error)
> > +			goto out;
> > +
> >  		error = __xfs_ag_resv_init(pag, XFS_AG_RESV_AGFL, ask, used);
> >  		if (error)
> >  			goto out;
> > diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
> > index 6b5e82b9..453bb27 100644
> > --- a/fs/xfs/libxfs/xfs_refcount_btree.c
> > +++ b/fs/xfs/libxfs/xfs_refcount_btree.c
> > @@ -79,6 +79,8 @@ xfs_refcountbt_alloc_block(
> >  	struct xfs_alloc_arg	args;		/* block allocation args */
> >  	int			error;		/* error return value */
> >  
> > +	XFS_BTREE_TRACE_CURSOR(cur, XBT_ENTRY);
> > +
> >  	memset(&args, 0, sizeof(args));
> >  	args.tp = cur->bc_tp;
> >  	args.mp = cur->bc_mp;
> > @@ -88,6 +90,7 @@ xfs_refcountbt_alloc_block(
> >  	args.firstblock = args.fsbno;
> >  	xfs_rmap_ag_owner(&args.oinfo, XFS_RMAP_OWN_REFC);
> >  	args.minlen = args.maxlen = args.prod = 1;
> > +	args.resv = XFS_AG_RESV_METADATA;
> >  
> >  	error = xfs_alloc_vextent(&args);
> >  	if (error)
> > @@ -125,16 +128,19 @@ xfs_refcountbt_free_block(
> >  	struct xfs_agf		*agf = XFS_BUF_TO_AGF(agbp);
> >  	xfs_fsblock_t		fsbno = XFS_DADDR_TO_FSB(mp, XFS_BUF_ADDR(bp));
> >  	struct xfs_owner_info	oinfo;
> > +	int			error;
> >  
> >  	trace_xfs_refcountbt_free_block(cur->bc_mp, cur->bc_private.a.agno,
> >  			XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno), 1);
> >  	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_REFC);
> >  	be32_add_cpu(&agf->agf_refcount_blocks, -1);
> >  	xfs_alloc_log_agf(cur->bc_tp, agbp, XFS_AGF_REFCOUNT_BLOCKS);
> > -	xfs_bmap_add_free(mp, cur->bc_private.a.dfops, fsbno, 1,
> > -			&oinfo);
> > +	error = xfs_free_extent(cur->bc_tp, fsbno, 1, &oinfo,
> > +			XFS_AG_RESV_METADATA);
> > +	if (error)
> > +		return error;
> >  
> > -	return 0;
> > +	return error;
> >  }
> >  
> >  STATIC int
> > @@ -410,3 +416,36 @@ xfs_refcountbt_max_size(
> >  
> >  	return xfs_refcountbt_calc_size(mp, mp->m_sb.sb_agblocks);
> >  }
> > +
> > +/*
> > + * Figure out how many blocks to reserve and how many are used by this btree.
> > + */
> > +int
> > +xfs_refcountbt_calc_reserves(
> > +	struct xfs_mount	*mp,
> > +	xfs_agnumber_t		agno,
> > +	xfs_extlen_t		*ask,
> > +	xfs_extlen_t		*used)
> > +{
> > +	struct xfs_buf		*agbp;
> > +	struct xfs_agf		*agf;
> > +	xfs_extlen_t		tree_len;
> > +	int			error;
> > +
> > +	if (!xfs_sb_version_hasreflink(&mp->m_sb))
> > +		return 0;
> > +
> > +	*ask += xfs_refcountbt_max_size(mp);
> > +
> > +	error = xfs_alloc_read_agf(mp, NULL, agno, 0, &agbp);
> > +	if (error)
> > +		return error;
> > +
> > +	agf = XFS_BUF_TO_AGF(agbp);
> > +	tree_len = be32_to_cpu(agf->agf_refcount_blocks);
> > +	xfs_buf_relse(agbp);
> > +
> > +	*used += tree_len;
> > +
> > +	return error;
> > +}
> > diff --git a/fs/xfs/libxfs/xfs_refcount_btree.h b/fs/xfs/libxfs/xfs_refcount_btree.h
> > index 780b02f..3be7768 100644
> > --- a/fs/xfs/libxfs/xfs_refcount_btree.h
> > +++ b/fs/xfs/libxfs/xfs_refcount_btree.h
> > @@ -68,4 +68,7 @@ extern xfs_extlen_t xfs_refcountbt_calc_size(struct xfs_mount *mp,
> >  		unsigned long long len);
> >  extern xfs_extlen_t xfs_refcountbt_max_size(struct xfs_mount *mp);
> >  
> > +extern int xfs_refcountbt_calc_reserves(struct xfs_mount *mp,
> > +		xfs_agnumber_t agno, xfs_extlen_t *ask, xfs_extlen_t *used);
> > +
> >  #endif	/* __XFS_REFCOUNT_BTREE_H__ */
> > diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
> > index 9c0585e..83e672f 100644
> > --- a/fs/xfs/libxfs/xfs_rmap_btree.c
> > +++ b/fs/xfs/libxfs/xfs_rmap_btree.c
> > @@ -35,6 +35,7 @@
> >  #include "xfs_cksum.h"
> >  #include "xfs_error.h"
> >  #include "xfs_extent_busy.h"
> > +#include "xfs_ag_resv.h"
> >  
> >  /*
> >   * Reverse map btree.
> > @@ -533,3 +534,62 @@ xfs_rmapbt_compute_maxlevels(
> >  		mp->m_rmap_maxlevels = xfs_btree_compute_maxlevels(mp,
> >  				mp->m_rmap_mnr, mp->m_sb.sb_agblocks);
> >  }
> > +
> > +/* Calculate the rmap btree size for some records. */
> > +xfs_extlen_t
> > +xfs_rmapbt_calc_size(
> > +	struct xfs_mount	*mp,
> > +	unsigned long long	len)
> > +{
> > +	return xfs_btree_calc_size(mp, mp->m_rmap_mnr, len);
> > +}
> > +
> > +/*
> > + * Calculate the maximum rmap btree size.
> > + */
> > +xfs_extlen_t
> > +xfs_rmapbt_max_size(
> > +	struct xfs_mount	*mp)
> > +{
> > +	/* Bail out if we're uninitialized, which can happen in mkfs. */
> > +	if (mp->m_rmap_mxr[0] == 0)
> > +		return 0;
> > +
> > +	return xfs_rmapbt_calc_size(mp, mp->m_sb.sb_agblocks);
> > +}
> > +
> > +/*
> > + * Figure out how many blocks to reserve and how many are used by this btree.
> > + */
> > +int
> > +xfs_rmapbt_calc_reserves(
> > +	struct xfs_mount	*mp,
> > +	xfs_agnumber_t		agno,
> > +	xfs_extlen_t		*ask,
> > +	xfs_extlen_t		*used)
> > +{
> > +	struct xfs_buf		*agbp;
> > +	struct xfs_agf		*agf;
> > +	xfs_extlen_t		pool_len;
> > +	xfs_extlen_t		tree_len;
> > +	int			error;
> > +
> > +	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
> > +		return 0;
> > +
> > +	/* Reserve 1% of the AG or enough for 1 block per record. */
> > +	pool_len = max(mp->m_sb.sb_agblocks / 100, xfs_rmapbt_max_size(mp));
> > +	*ask += pool_len;
> > +
> > +	error = xfs_alloc_read_agf(mp, NULL, agno, 0, &agbp);
> > +	if (error)
> > +		return error;
> > +
> > +	agf = XFS_BUF_TO_AGF(agbp);
> > +	tree_len = be32_to_cpu(agf->agf_rmap_blocks);
> > +	xfs_buf_relse(agbp);
> > +
> > +	*used += tree_len;
> > +
> > +	return error;
> > +}
> > diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
> > index e73a553..2a9ac47 100644
> > --- a/fs/xfs/libxfs/xfs_rmap_btree.h
> > +++ b/fs/xfs/libxfs/xfs_rmap_btree.h
> > @@ -58,4 +58,11 @@ struct xfs_btree_cur *xfs_rmapbt_init_cursor(struct xfs_mount *mp,
> >  int xfs_rmapbt_maxrecs(struct xfs_mount *mp, int blocklen, int leaf);
> >  extern void xfs_rmapbt_compute_maxlevels(struct xfs_mount *mp);
> >  
> > +extern xfs_extlen_t xfs_rmapbt_calc_size(struct xfs_mount *mp,
> > +		unsigned long long len);
> > +extern xfs_extlen_t xfs_rmapbt_max_size(struct xfs_mount *mp);
> > +
> > +extern int xfs_rmapbt_calc_reserves(struct xfs_mount *mp,
> > +		xfs_agnumber_t agno, xfs_extlen_t *ask, xfs_extlen_t *used);
> > +
> >  #endif	/* __XFS_RMAP_BTREE_H__ */
> > diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> > index 3acbf4e0..93d12fa 100644
> > --- a/fs/xfs/xfs_fsops.c
> > +++ b/fs/xfs/xfs_fsops.c
> > @@ -43,6 +43,7 @@
> >  #include "xfs_log.h"
> >  #include "xfs_filestream.h"
> >  #include "xfs_rmap.h"
> > +#include "xfs_ag_resv.h"
> >  
> >  /*
> >   * File system operations
> > @@ -630,6 +631,11 @@ xfs_growfs_data_private(
> >  	xfs_set_low_space_thresholds(mp);
> >  	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
> >  
> > +	/* Reserve AG metadata blocks. */
> > +	error = xfs_fs_reserve_ag_blocks(mp);
> > +	if (error && error != -ENOSPC)
> > +		goto out;
> > +
> >  	/* update secondary superblocks. */
> >  	for (agno = 1; agno < nagcount; agno++) {
> >  		error = 0;
> > @@ -680,6 +686,8 @@ xfs_growfs_data_private(
> >  			continue;
> >  		}
> >  	}
> > +
> > + out:
> >  	return saved_error ? saved_error : error;
> >  
> >   error0:
> > @@ -989,3 +997,59 @@ xfs_do_force_shutdown(
> >  	"Please umount the filesystem and rectify the problem(s)");
> >  	}
> >  }
> > +
> > +/*
> > + * Reserve free space for per-AG metadata.
> > + */
> > +int
> > +xfs_fs_reserve_ag_blocks(
> > +	struct xfs_mount	*mp)
> > +{
> > +	xfs_agnumber_t		agno;
> > +	struct xfs_perag	*pag;
> > +	int			error = 0;
> > +	int			err2;
> > +
> > +	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
> > +		pag = xfs_perag_get(mp, agno);
> > +		err2 = xfs_ag_resv_init(pag);
> > +		xfs_perag_put(pag);
> > +		if (err2 && !error)
> > +			error = err2;
> > +	}
> > +
> > +	if (error && error != -ENOSPC) {
> > +		xfs_warn(mp,
> > +	"Error %d reserving per-AG metadata reserve pool.", error);
> > +		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> > +	}
> > +
> > +	return error;
> > +}
> > +
> > +/*
> > + * Free space reserved for per-AG metadata.
> > + */
> > +int
> > +xfs_fs_unreserve_ag_blocks(
> > +	struct xfs_mount	*mp)
> > +{
> > +	xfs_agnumber_t		agno;
> > +	struct xfs_perag	*pag;
> > +	int			error = 0;
> > +	int			err2;
> > +
> > +	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
> > +		pag = xfs_perag_get(mp, agno);
> > +		err2 = xfs_ag_resv_free(pag);
> > +		xfs_perag_put(pag);
> > +		if (err2 && !error)
> > +			error = err2;
> > +	}
> > +
> > +	if (error)
> > +		xfs_warn(mp,
> > +	"Error %d freeing per-AG metadata reserve pool.", error);
> > +
> > +	return error;
> > +}
> > diff --git a/fs/xfs/xfs_fsops.h b/fs/xfs/xfs_fsops.h
> > index f32713f..f349158 100644
> > --- a/fs/xfs/xfs_fsops.h
> > +++ b/fs/xfs/xfs_fsops.h
> > @@ -26,4 +26,7 @@ extern int xfs_reserve_blocks(xfs_mount_t *mp, __uint64_t *inval,
> >  				xfs_fsop_resblks_t *outval);
> >  extern int xfs_fs_goingdown(xfs_mount_t *mp, __uint32_t inflags);
> >  
> > +extern int xfs_fs_reserve_ag_blocks(struct xfs_mount *mp);
> > +extern int xfs_fs_unreserve_ag_blocks(struct xfs_mount *mp);
> > +
> >  #endif	/* __XFS_FSOPS_H__ */
> > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> > index caecbd2..b5da81d 100644
> > --- a/fs/xfs/xfs_mount.c
> > +++ b/fs/xfs/xfs_mount.c
> > @@ -986,10 +986,17 @@ xfs_mountfs(
> >  			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> >  			goto out_quota;
> >  		}
> > +
> > +		/* Reserve AG blocks for future btree expansion. */
> > +		error = xfs_fs_reserve_ag_blocks(mp);
> > +		if (error && error != -ENOSPC)
> > +			goto out_agresv;
> >  	}
> >  
> >  	return 0;
> >  
> > + out_agresv:
> > +	xfs_fs_unreserve_ag_blocks(mp);
> >   out_quota:
> >  	xfs_qm_unmount_quotas(mp);
> >   out_rtunmount:
> > @@ -1034,6 +1041,7 @@ xfs_unmountfs(
> >  
> >  	cancel_delayed_work_sync(&mp->m_eofblocks_work);
> >  
> > +	xfs_fs_unreserve_ag_blocks(mp);
> >  	xfs_qm_unmount_quotas(mp);
> >  	xfs_rtunmount_inodes(mp);
> >  	IRELE(mp->m_rootip);
> > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > index e6aaa91..875ab9f 100644
> > --- a/fs/xfs/xfs_super.c
> > +++ b/fs/xfs/xfs_super.c
> > @@ -1315,10 +1315,22 @@ xfs_fs_remount(
> >  			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> >  			return error;
> >  		}
> > +
> > +		/* Create the per-AG metadata reservation pool. */
> > +		error = xfs_fs_reserve_ag_blocks(mp);
> > +		if (error && error != -ENOSPC)
> > +			return error;
> >  	}
> >  
> >  	/* rw -> ro */
> >  	if (!(mp->m_flags & XFS_MOUNT_RDONLY) && (*flags & MS_RDONLY)) {
> > +		/* Free the per-AG metadata reservation pool. */
> > +		error = xfs_fs_unreserve_ag_blocks(mp);
> > +		if (error) {
> > +			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> > +			return error;
> > +		}
> > +
> >  		/*
> >  		 * Before we sync the metadata, we need to free up the reserve
> >  		 * block pool so that the used block count in the superblock on
> > 

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 48/63] xfs: preallocate blocks for worst-case btree expansion
  2016-10-12 20:52     ` Darrick J. Wong
@ 2016-10-12 22:42       ` Brian Foster
  2016-12-06 19:32         ` Darrick J. Wong
  0 siblings, 1 reply; 187+ messages in thread
From: Brian Foster @ 2016-10-12 22:42 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, Christoph Hellwig

On Wed, Oct 12, 2016 at 01:52:57PM -0700, Darrick J. Wong wrote:
> On Wed, Oct 12, 2016 at 02:44:51PM -0400, Brian Foster wrote:
> > On Thu, Sep 29, 2016 at 08:10:52PM -0700, Darrick J. Wong wrote:
> > > To gracefully handle the situation where a CoW operation turns a
> > > single refcount extent into a lot of tiny ones and then runs out of
> > > space when a tree split has to happen, use the per-AG reserved block
> > > pool to pre-allocate all the space we'll ever need for a maximal
> > > btree.  For a 4K block size, this only costs an overhead of 0.3% of
> > > available disk space.
> > > 
> > > When reflink is enabled, we have an unfortunate problem with rmap --
> > > since we can share a block billions of times, this means that the
> > > reverse mapping btree can expand basically infinitely.  When an AG is
> > > so full that there are no free blocks with which to expand the rmapbt,
> > > the filesystem will shut down hard.
> > > 
> > > This is rather annoying to the user, so use the AG reservation code to
> > > reserve a "reasonable" amount of space for rmap.  We'll prevent
> > > reflinks and CoW operations if we think we're getting close to
> > > exhausting an AG's free space rather than shutting down, but this
> > > permanent reservation should be enough for "most" users.  Hopefully.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > [hch@lst.de: ensure that we invalidate the freed btree buffer]
> > > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > > ---
> > > v2: Simplify the return value from xfs_perag_pool_free_block to a bool
> > > so that we can easily call xfs_trans_binval for both the per-AG pool
> > > and the real freeing case.  Without this we fail to invalidate the
> > > btree buffer and will trip over the write verifier on a shrinking
> > > refcount btree.
> > > 
> > > v3: Convert to the new per-AG reservation code.
> > > 
> > > v4: Combine this patch with the one that adds the rmapbt reservation,
> > > since the rmapbt reservation is only needed for reflink filesystems.
> > > 
> > > v5: If we detect errors while counting the refcount or rmap btrees,
> > > shut down the filesystem to avoid the scenario where the fs shuts down
> > > mid-transaction due to btree corruption, repair refuses to run until
> > > the log is clean, and the log cannot be cleaned because replay hits
> > > btree corruption and shuts down.
> > > ---
> > >  fs/xfs/libxfs/xfs_ag_resv.c        |   11 ++++++
> > >  fs/xfs/libxfs/xfs_refcount_btree.c |   45 ++++++++++++++++++++++++-
> > >  fs/xfs/libxfs/xfs_refcount_btree.h |    3 ++
> > >  fs/xfs/libxfs/xfs_rmap_btree.c     |   60 ++++++++++++++++++++++++++++++++++
> > >  fs/xfs/libxfs/xfs_rmap_btree.h     |    7 ++++
> > >  fs/xfs/xfs_fsops.c                 |   64 ++++++++++++++++++++++++++++++++++++
> > >  fs/xfs/xfs_fsops.h                 |    3 ++
> > >  fs/xfs/xfs_mount.c                 |    8 +++++
> > >  fs/xfs/xfs_super.c                 |   12 +++++++
> > >  9 files changed, 210 insertions(+), 3 deletions(-)
> > > 
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c
> > > index e3ae0f2..adf770f 100644
> > > --- a/fs/xfs/libxfs/xfs_ag_resv.c
> > > +++ b/fs/xfs/libxfs/xfs_ag_resv.c
> > > @@ -38,6 +38,7 @@
> > >  #include "xfs_trans_space.h"
> > >  #include "xfs_rmap_btree.h"
> > >  #include "xfs_btree.h"
> > > +#include "xfs_refcount_btree.h"
> > >  
> > >  /*
> > >   * Per-AG Block Reservations
> > > @@ -228,6 +229,11 @@ xfs_ag_resv_init(
> > >  	if (pag->pag_meta_resv.ar_asked == 0) {
> > >  		ask = used = 0;
> > >  
> > > +		error = xfs_refcountbt_calc_reserves(pag->pag_mount,
> > > +				pag->pag_agno, &ask, &used);
> > > +		if (error)
> > > +			goto out;
> > > +
> > >  		error = __xfs_ag_resv_init(pag, XFS_AG_RESV_METADATA,
> > >  				ask, used);
> > 
> > Now that I get here, I see we have these per-ag reservation structures
> > and whatnot, but __xfs_ag_resv_init() (from a previous patch) calls
> > xfs_mod_fdblocks() for the reservation. AFAICT, that reserves from the
> > "global pool." Based on the commit log, isn't the intent here to reserve
> > blocks within each AG? What am I missing?
> 
> The AG reservation code "reserves" blocks in each AG by hiding them from
> the allocator.  They're all still there in the bnobt, but we underreport
> the length of the longest free extent and the free block count in that
> AG to make it look like there's less free space than there is.  Since
> those blocks are no longer generally available, we have to decrease the
> in-core free block count so we can't create delalloc reservations that
> the allocator won't (or can't) satisfy.
> 

Yep, I think I get the idea/purpose in principle. It sounds similar to
the global reserve pool, where we set aside a count of unallocated
blocks via accounting magic so that some remain available in cases such
as needing to allocate a block in order to free an extent under low
free space conditions.

In this case, it looks like we reserve blocks in the same manner (via
xfs_mod_fdblocks()) and record the reservation in a new per-ag
reservation structure. The part I'm missing is how we guarantee those
blocks are accessible in the particular AG (or am I entirely mistaken
about the requirement that the per-AG reservation must reside within
that specific AG?).

An example might clarify where my confusion lies... suppose we have a
non-standard configuration with a 1TB ag size and just barely enough
total filesystem size for a second AG, e.g., we have two AGs where AG 0
is 1TB and AG 1 is 16MB. Suppose that the reservation requirement (for
the sake of example, at least) based on sb_agblocks is larger than the
entire size of AG 1. Yet, the xfs_mod_fdblocks() call for the AG 1 res
struct will apparently succeed because there are plenty of blocks in
mp->m_fdblocks. Unless I'm mistaken, shouldn't we not be able to reserve
this many blocks out of AG 1?

Even in the case where AG 1 is large enough for the reservation, what
actually prevents a sequence of single block allocations from using all
of the space in the AG? 

Brian

> Maybe a more concrete way to put that is: say we have 4 AGs with 4 agresv
> blocks each, and no other free space left anywhere.  The in-core fdblocks count
> should be 0 so that starting a write into a hole returns ENOSPC even if the
> write could be done without any btree shape changes.   Otherwise, writepages
> tries to allocate the delalloc reservation, fails to find any space because
> we've hidden it, and kaboom.
> 
> --D
> 
> > 
> > Brian
> > 
> > >  		if (error)
> > > @@ -238,6 +244,11 @@ xfs_ag_resv_init(
> > >  	if (pag->pag_agfl_resv.ar_asked == 0) {
> > >  		ask = used = 0;
> > >  
> > > +		error = xfs_rmapbt_calc_reserves(pag->pag_mount, pag->pag_agno,
> > > +				&ask, &used);
> > > +		if (error)
> > > +			goto out;
> > > +
> > >  		error = __xfs_ag_resv_init(pag, XFS_AG_RESV_AGFL, ask, used);
> > >  		if (error)
> > >  			goto out;
> > > diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
> > > index 6b5e82b9..453bb27 100644
> > > --- a/fs/xfs/libxfs/xfs_refcount_btree.c
> > > +++ b/fs/xfs/libxfs/xfs_refcount_btree.c
> > > @@ -79,6 +79,8 @@ xfs_refcountbt_alloc_block(
> > >  	struct xfs_alloc_arg	args;		/* block allocation args */
> > >  	int			error;		/* error return value */
> > >  
> > > +	XFS_BTREE_TRACE_CURSOR(cur, XBT_ENTRY);
> > > +
> > >  	memset(&args, 0, sizeof(args));
> > >  	args.tp = cur->bc_tp;
> > >  	args.mp = cur->bc_mp;
> > > @@ -88,6 +90,7 @@ xfs_refcountbt_alloc_block(
> > >  	args.firstblock = args.fsbno;
> > >  	xfs_rmap_ag_owner(&args.oinfo, XFS_RMAP_OWN_REFC);
> > >  	args.minlen = args.maxlen = args.prod = 1;
> > > +	args.resv = XFS_AG_RESV_METADATA;
> > >  
> > >  	error = xfs_alloc_vextent(&args);
> > >  	if (error)
> > > @@ -125,16 +128,19 @@ xfs_refcountbt_free_block(
> > >  	struct xfs_agf		*agf = XFS_BUF_TO_AGF(agbp);
> > >  	xfs_fsblock_t		fsbno = XFS_DADDR_TO_FSB(mp, XFS_BUF_ADDR(bp));
> > >  	struct xfs_owner_info	oinfo;
> > > +	int			error;
> > >  
> > >  	trace_xfs_refcountbt_free_block(cur->bc_mp, cur->bc_private.a.agno,
> > >  			XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno), 1);
> > >  	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_REFC);
> > >  	be32_add_cpu(&agf->agf_refcount_blocks, -1);
> > >  	xfs_alloc_log_agf(cur->bc_tp, agbp, XFS_AGF_REFCOUNT_BLOCKS);
> > > -	xfs_bmap_add_free(mp, cur->bc_private.a.dfops, fsbno, 1,
> > > -			&oinfo);
> > > +	error = xfs_free_extent(cur->bc_tp, fsbno, 1, &oinfo,
> > > +			XFS_AG_RESV_METADATA);
> > > +	if (error)
> > > +		return error;
> > >  
> > > -	return 0;
> > > +	return error;
> > >  }
> > >  
> > >  STATIC int
> > > @@ -410,3 +416,36 @@ xfs_refcountbt_max_size(
> > >  
> > >  	return xfs_refcountbt_calc_size(mp, mp->m_sb.sb_agblocks);
> > >  }
> > > +
> > > +/*
> > > + * Figure out how many blocks to reserve and how many are used by this btree.
> > > + */
> > > +int
> > > +xfs_refcountbt_calc_reserves(
> > > +	struct xfs_mount	*mp,
> > > +	xfs_agnumber_t		agno,
> > > +	xfs_extlen_t		*ask,
> > > +	xfs_extlen_t		*used)
> > > +{
> > > +	struct xfs_buf		*agbp;
> > > +	struct xfs_agf		*agf;
> > > +	xfs_extlen_t		tree_len;
> > > +	int			error;
> > > +
> > > +	if (!xfs_sb_version_hasreflink(&mp->m_sb))
> > > +		return 0;
> > > +
> > > +	*ask += xfs_refcountbt_max_size(mp);
> > > +
> > > +	error = xfs_alloc_read_agf(mp, NULL, agno, 0, &agbp);
> > > +	if (error)
> > > +		return error;
> > > +
> > > +	agf = XFS_BUF_TO_AGF(agbp);
> > > +	tree_len = be32_to_cpu(agf->agf_refcount_blocks);
> > > +	xfs_buf_relse(agbp);
> > > +
> > > +	*used += tree_len;
> > > +
> > > +	return error;
> > > +}
> > > diff --git a/fs/xfs/libxfs/xfs_refcount_btree.h b/fs/xfs/libxfs/xfs_refcount_btree.h
> > > index 780b02f..3be7768 100644
> > > --- a/fs/xfs/libxfs/xfs_refcount_btree.h
> > > +++ b/fs/xfs/libxfs/xfs_refcount_btree.h
> > > @@ -68,4 +68,7 @@ extern xfs_extlen_t xfs_refcountbt_calc_size(struct xfs_mount *mp,
> > >  		unsigned long long len);
> > >  extern xfs_extlen_t xfs_refcountbt_max_size(struct xfs_mount *mp);
> > >  
> > > +extern int xfs_refcountbt_calc_reserves(struct xfs_mount *mp,
> > > +		xfs_agnumber_t agno, xfs_extlen_t *ask, xfs_extlen_t *used);
> > > +
> > >  #endif	/* __XFS_REFCOUNT_BTREE_H__ */
> > > diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
> > > index 9c0585e..83e672f 100644
> > > --- a/fs/xfs/libxfs/xfs_rmap_btree.c
> > > +++ b/fs/xfs/libxfs/xfs_rmap_btree.c
> > > @@ -35,6 +35,7 @@
> > >  #include "xfs_cksum.h"
> > >  #include "xfs_error.h"
> > >  #include "xfs_extent_busy.h"
> > > +#include "xfs_ag_resv.h"
> > >  
> > >  /*
> > >   * Reverse map btree.
> > > @@ -533,3 +534,62 @@ xfs_rmapbt_compute_maxlevels(
> > >  		mp->m_rmap_maxlevels = xfs_btree_compute_maxlevels(mp,
> > >  				mp->m_rmap_mnr, mp->m_sb.sb_agblocks);
> > >  }
> > > +
> > > +/* Calculate the rmap btree size for some records. */
> > > +xfs_extlen_t
> > > +xfs_rmapbt_calc_size(
> > > +	struct xfs_mount	*mp,
> > > +	unsigned long long	len)
> > > +{
> > > +	return xfs_btree_calc_size(mp, mp->m_rmap_mnr, len);
> > > +}
> > > +
> > > +/*
> > > + * Calculate the maximum rmap btree size.
> > > + */
> > > +xfs_extlen_t
> > > +xfs_rmapbt_max_size(
> > > +	struct xfs_mount	*mp)
> > > +{
> > > +	/* Bail out if we're uninitialized, which can happen in mkfs. */
> > > +	if (mp->m_rmap_mxr[0] == 0)
> > > +		return 0;
> > > +
> > > +	return xfs_rmapbt_calc_size(mp, mp->m_sb.sb_agblocks);
> > > +}
> > > +
> > > +/*
> > > + * Figure out how many blocks to reserve and how many are used by this btree.
> > > + */
> > > +int
> > > +xfs_rmapbt_calc_reserves(
> > > +	struct xfs_mount	*mp,
> > > +	xfs_agnumber_t		agno,
> > > +	xfs_extlen_t		*ask,
> > > +	xfs_extlen_t		*used)
> > > +{
> > > +	struct xfs_buf		*agbp;
> > > +	struct xfs_agf		*agf;
> > > +	xfs_extlen_t		pool_len;
> > > +	xfs_extlen_t		tree_len;
> > > +	int			error;
> > > +
> > > +	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
> > > +		return 0;
> > > +
> > > +	/* Reserve 1% of the AG or enough for 1 block per record. */
> > > +	pool_len = max(mp->m_sb.sb_agblocks / 100, xfs_rmapbt_max_size(mp));
> > > +	*ask += pool_len;
> > > +
> > > +	error = xfs_alloc_read_agf(mp, NULL, agno, 0, &agbp);
> > > +	if (error)
> > > +		return error;
> > > +
> > > +	agf = XFS_BUF_TO_AGF(agbp);
> > > +	tree_len = be32_to_cpu(agf->agf_rmap_blocks);
> > > +	xfs_buf_relse(agbp);
> > > +
> > > +	*used += tree_len;
> > > +
> > > +	return error;
> > > +}
> > > diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
> > > index e73a553..2a9ac47 100644
> > > --- a/fs/xfs/libxfs/xfs_rmap_btree.h
> > > +++ b/fs/xfs/libxfs/xfs_rmap_btree.h
> > > @@ -58,4 +58,11 @@ struct xfs_btree_cur *xfs_rmapbt_init_cursor(struct xfs_mount *mp,
> > >  int xfs_rmapbt_maxrecs(struct xfs_mount *mp, int blocklen, int leaf);
> > >  extern void xfs_rmapbt_compute_maxlevels(struct xfs_mount *mp);
> > >  
> > > +extern xfs_extlen_t xfs_rmapbt_calc_size(struct xfs_mount *mp,
> > > +		unsigned long long len);
> > > +extern xfs_extlen_t xfs_rmapbt_max_size(struct xfs_mount *mp);
> > > +
> > > +extern int xfs_rmapbt_calc_reserves(struct xfs_mount *mp,
> > > +		xfs_agnumber_t agno, xfs_extlen_t *ask, xfs_extlen_t *used);
> > > +
> > >  #endif	/* __XFS_RMAP_BTREE_H__ */
> > > diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> > > index 3acbf4e0..93d12fa 100644
> > > --- a/fs/xfs/xfs_fsops.c
> > > +++ b/fs/xfs/xfs_fsops.c
> > > @@ -43,6 +43,7 @@
> > >  #include "xfs_log.h"
> > >  #include "xfs_filestream.h"
> > >  #include "xfs_rmap.h"
> > > +#include "xfs_ag_resv.h"
> > >  
> > >  /*
> > >   * File system operations
> > > @@ -630,6 +631,11 @@ xfs_growfs_data_private(
> > >  	xfs_set_low_space_thresholds(mp);
> > >  	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
> > >  
> > > +	/* Reserve AG metadata blocks. */
> > > +	error = xfs_fs_reserve_ag_blocks(mp);
> > > +	if (error && error != -ENOSPC)
> > > +		goto out;
> > > +
> > >  	/* update secondary superblocks. */
> > >  	for (agno = 1; agno < nagcount; agno++) {
> > >  		error = 0;
> > > @@ -680,6 +686,8 @@ xfs_growfs_data_private(
> > >  			continue;
> > >  		}
> > >  	}
> > > +
> > > + out:
> > >  	return saved_error ? saved_error : error;
> > >  
> > >   error0:
> > > @@ -989,3 +997,59 @@ xfs_do_force_shutdown(
> > >  	"Please umount the filesystem and rectify the problem(s)");
> > >  	}
> > >  }
> > > +
> > > +/*
> > > + * Reserve free space for per-AG metadata.
> > > + */
> > > +int
> > > +xfs_fs_reserve_ag_blocks(
> > > +	struct xfs_mount	*mp)
> > > +{
> > > +	xfs_agnumber_t		agno;
> > > +	struct xfs_perag	*pag;
> > > +	int			error = 0;
> > > +	int			err2;
> > > +
> > > +	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
> > > +		pag = xfs_perag_get(mp, agno);
> > > +		err2 = xfs_ag_resv_init(pag);
> > > +		xfs_perag_put(pag);
> > > +		if (err2 && !error)
> > > +			error = err2;
> > > +	}
> > > +
> > > +	if (error && error != -ENOSPC) {
> > > +		xfs_warn(mp,
> > > +	"Error %d reserving per-AG metadata reserve pool.", error);
> > > +		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> > > +	}
> > > +
> > > +	return error;
> > > +}
> > > +
> > > +/*
> > > + * Free space reserved for per-AG metadata.
> > > + */
> > > +int
> > > +xfs_fs_unreserve_ag_blocks(
> > > +	struct xfs_mount	*mp)
> > > +{
> > > +	xfs_agnumber_t		agno;
> > > +	struct xfs_perag	*pag;
> > > +	int			error = 0;
> > > +	int			err2;
> > > +
> > > +	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
> > > +		pag = xfs_perag_get(mp, agno);
> > > +		err2 = xfs_ag_resv_free(pag);
> > > +		xfs_perag_put(pag);
> > > +		if (err2 && !error)
> > > +			error = err2;
> > > +	}
> > > +
> > > +	if (error)
> > > +		xfs_warn(mp,
> > > +	"Error %d freeing per-AG metadata reserve pool.", error);
> > > +
> > > +	return error;
> > > +}
> > > diff --git a/fs/xfs/xfs_fsops.h b/fs/xfs/xfs_fsops.h
> > > index f32713f..f349158 100644
> > > --- a/fs/xfs/xfs_fsops.h
> > > +++ b/fs/xfs/xfs_fsops.h
> > > @@ -26,4 +26,7 @@ extern int xfs_reserve_blocks(xfs_mount_t *mp, __uint64_t *inval,
> > >  				xfs_fsop_resblks_t *outval);
> > >  extern int xfs_fs_goingdown(xfs_mount_t *mp, __uint32_t inflags);
> > >  
> > > +extern int xfs_fs_reserve_ag_blocks(struct xfs_mount *mp);
> > > +extern int xfs_fs_unreserve_ag_blocks(struct xfs_mount *mp);
> > > +
> > >  #endif	/* __XFS_FSOPS_H__ */
> > > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> > > index caecbd2..b5da81d 100644
> > > --- a/fs/xfs/xfs_mount.c
> > > +++ b/fs/xfs/xfs_mount.c
> > > @@ -986,10 +986,17 @@ xfs_mountfs(
> > >  			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> > >  			goto out_quota;
> > >  		}
> > > +
> > > +		/* Reserve AG blocks for future btree expansion. */
> > > +		error = xfs_fs_reserve_ag_blocks(mp);
> > > +		if (error && error != -ENOSPC)
> > > +			goto out_agresv;
> > >  	}
> > >  
> > >  	return 0;
> > >  
> > > + out_agresv:
> > > +	xfs_fs_unreserve_ag_blocks(mp);
> > >   out_quota:
> > >  	xfs_qm_unmount_quotas(mp);
> > >   out_rtunmount:
> > > @@ -1034,6 +1041,7 @@ xfs_unmountfs(
> > >  
> > >  	cancel_delayed_work_sync(&mp->m_eofblocks_work);
> > >  
> > > +	xfs_fs_unreserve_ag_blocks(mp);
> > >  	xfs_qm_unmount_quotas(mp);
> > >  	xfs_rtunmount_inodes(mp);
> > >  	IRELE(mp->m_rootip);
> > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > > index e6aaa91..875ab9f 100644
> > > --- a/fs/xfs/xfs_super.c
> > > +++ b/fs/xfs/xfs_super.c
> > > @@ -1315,10 +1315,22 @@ xfs_fs_remount(
> > >  			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> > >  			return error;
> > >  		}
> > > +
> > > +		/* Create the per-AG metadata reservation pool. */
> > > +		error = xfs_fs_reserve_ag_blocks(mp);
> > > +		if (error && error != -ENOSPC)
> > > +			return error;
> > >  	}
> > >  
> > >  	/* rw -> ro */
> > >  	if (!(mp->m_flags & XFS_MOUNT_RDONLY) && (*flags & MS_RDONLY)) {
> > > +		/* Free the per-AG metadata reservation pool. */
> > > +		error = xfs_fs_unreserve_ag_blocks(mp);
> > > +		if (error) {
> > > +			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> > > +			return error;
> > > +		}
> > > +
> > >  		/*
> > >  		 * Before we sync the metadata, we need to free up the reserve
> > >  		 * block pool so that the used block count in the superblock on
> > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 37/63] xfs: implement CoW for directio writes
  2016-10-07 12:15           ` Brian Foster
@ 2016-10-13 18:14             ` Darrick J. Wong
  2016-10-13 19:01               ` Brian Foster
  0 siblings, 1 reply; 187+ messages in thread
From: Darrick J. Wong @ 2016-10-13 18:14 UTC (permalink / raw)
  To: Brian Foster; +Cc: david, linux-xfs, Christoph Hellwig

On Fri, Oct 07, 2016 at 08:15:06AM -0400, Brian Foster wrote:
> On Thu, Oct 06, 2016 at 06:02:25PM -0700, Darrick J. Wong wrote:
> > On Thu, Oct 06, 2016 at 08:20:08AM -0400, Brian Foster wrote:
> > > On Wed, Oct 05, 2016 at 01:55:42PM -0700, Darrick J. Wong wrote:
> > > > On Wed, Oct 05, 2016 at 02:27:10PM -0400, Brian Foster wrote:
> > > > > On Thu, Sep 29, 2016 at 08:09:40PM -0700, Darrick J. Wong wrote:
> > > > > > For O_DIRECT writes to shared blocks, we have to CoW them just like
> > > > > > we would with buffered writes.  For writes that are not block-aligned,
> > > > > > just bounce them to the page cache.
> > > > > > 
> > > > > > For block-aligned writes, however, we can do better than that.  Use
> > > > > > the same mechanisms that we employ for buffered CoW to set up a
> > > > > > delalloc reservation, allocate all the blocks at once, issue the
> > > > > > writes against the new blocks and use the same ioend functions to
> > > > > > remap the blocks after the write.  This should be fairly performant.
> > > > > > 
> > > > > > Christoph discovered that xfs_reflink_allocate_cow_range may stumble
> > > > > > over invalid entries in the extent array given that it drops the ilock
> > > > > > but still expects the index to be stable.  Simply fixing it to do a
> > > > > > new lookup for every iteration still isn't correct given that
> > > > > > xfs_bmapi_allocate will trigger a BUG_ON() if hitting a hole, and
> > > > > > there is nothing preventing an xfs_bunmapi_cow call from removing
> > > > > > extents once we dropped the ilock either.
> > > > > > 
> > > > > > This patch duplicates the inner loop of xfs_bmapi_allocate into a
> > > > > > helper for xfs_reflink_allocate_cow_range so that it can be done under
> > > > > > the same ilock critical section as our CoW fork delayed allocation.
> > > > > > The directio CoW warts will be revisited in a later patch.
> > > > > > 
> > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > > > > > ---
> > > > > > v2: Turns out that there's no way for xfs_end_io_direct_write to know
> > > > > > if the write completed successfully.  Therefore, do /not/ use the
> > > > > > ioend for dio cow post-processing; instead, move it to xfs_vm_do_dio
> > > > > > where we *can* tell if the write succeeded or not.
> > > > > > 
> > > > > > v3: Update the file size if we do a directio CoW across EOF.  This
> > > > > > can happen if the last block is shared, the cowextsize hint is set,
> > > > > > and we do a dio write past the end of the file.
> > > > > > 
> > > > > > v4: Christoph rewrote the allocate code to fix some concurrency
> > > > > > problems as part of migrating the code to support iomap.
> > > > > > ---
> > > > > >  fs/xfs/xfs_aops.c    |   91 +++++++++++++++++++++++++++++++++++++++----
> > > > > >  fs/xfs/xfs_file.c    |   20 ++++++++-
> > > > > >  fs/xfs/xfs_reflink.c |  107 +++++++++++++++++++++++++++++++++++++++++++++++++-
> > > > > >  fs/xfs/xfs_reflink.h |    2 +
> > > > > >  fs/xfs/xfs_trace.h   |    1 
> > > > > >  5 files changed, 208 insertions(+), 13 deletions(-)
> > > > > > 
> > > > > > 
> > > ...
> ...
> > 
> > > > > > +
> > > > > >  	data = *from;
> > > > > >  	ret = __blockdev_direct_IO(iocb, inode, target->bt_bdev, &data,
> > > > > >  			xfs_get_blocks_direct, xfs_end_io_direct_write,
> > > ...
> > > > > > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > > > > > index d913ad1..c95cdc3 100644
> > > > > > --- a/fs/xfs/xfs_reflink.c
> > > > > > +++ b/fs/xfs/xfs_reflink.c
> > > ...
> > > > > > @@ -347,6 +352,102 @@ xfs_reflink_reserve_cow_range(
> > > > > >  	return error;
> > > > > >  }
> > > > > >  
> > > > > > +/* Allocate all CoW reservations covering a range of blocks in a file. */
> > > > > > +static int
> > > > > > +__xfs_reflink_allocate_cow(
> > > > > > +	struct xfs_inode	*ip,
> > > > > > +	xfs_fileoff_t		*offset_fsb,
> > > > > > +	xfs_fileoff_t		end_fsb)
> > > > > > +{
> > > > > > +	struct xfs_mount	*mp = ip->i_mount;
> > > > > > +	struct xfs_bmbt_irec	imap;
> > > > > > +	struct xfs_defer_ops	dfops;
> > > > > > +	struct xfs_trans	*tp;
> > > > > > +	xfs_fsblock_t		first_block;
> > > > > > +	xfs_fileoff_t		next_fsb;
> > > > > > +	int			nimaps = 1, error;
> > > > > > +	bool			skipped = false;
> > > > > > +
> > > > > > +	xfs_defer_init(&dfops, &first_block);
> > > > > > +
> > > > > > +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0,
> > > > > > +			XFS_TRANS_RESERVE, &tp);
> > > > > > +	if (error)
> > > > > > +		return error;
> > > > > > +
> > > > > > +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > > > > > +
> > > > > > +	next_fsb = *offset_fsb;
> > > > > > +	error = __xfs_reflink_reserve_cow(ip, &next_fsb, end_fsb, &skipped);
> > > > > > +	if (error)
> > > > > > +		goto out_trans_cancel;
> > > > > 
> > > > > Do we really need to do the delayed allocation that results from this?
> > > > > Couldn't we factor out the shared extent walking that allows us to just
> > > > > perform the real allocations below?
> > > > 
> > > > The delayed reservation -> allocation two-step is necessary to create
> > > > replacement extents that are aligned to the CoW extent size hint.  This is
> > > > important for aligning extents in the same way as the regular extent
> > > > size hint, and critical for detecting random writes and landing them all
> > > > in as close to a contiguous physical extent as possible.  This helps us
> > > > to reduce cow-related fragmentation to manageable levels, which is
> > > > necessary to avoid ENOMEM problems with the current incore extent tree.
> > > > 
> > > 
> > > The cow extent size hint thing makes sense, but I don't see why we need
> > > to do delayed allocation to incorporate it. Can we not accommodate a cow
> > > extent size hint for a real allocation in the cow fork the same way a
> > > direct write accommodates a traditional extent size hint in the data
> > > fork? In fact, we've had logic for a while now that explicitly avoids
> > > delayed allocation when a traditional extent size hint is set.
> > 
> > Yes, that would have been another way to implement it.  I think I
> > finally see your point about using the delalloc mechanism -- since we've
> > converted the buffered write path to iomap and therefore know exactly
> > how much userspace wants to write in both buffered and directio cases,
> > we could just allocate the cow extent right then and there, skipping the
> > overhead of writing a delalloc reservation and then changing it.
> > 
> 
> Pretty much...
> 
> > For buffered writes, though, it's nice to be able to use the DA
> > mechanism so that we can ask the allocator for as big of an extent as we
> > have contiguous dirty pages.  Hm.  I guess for directio then we could
> > just fill in the holes directly and convert any delalloc reservations
> > that happened already to be there, which requires only a single loop.
> > 
> 
> Sure. I'm basically just poking at why we appear to take a different
> approach for each of the buffered/direct I/O mechanisms to the cow fork
> as opposed to the data fork (with regard to block allocation, at least).
> 
> So using delayed allocation for cow buffered I/O certainly makes sense
> to me for basically the same reasons we use it for normal buffered
> I/O...
> 
> > Will ponder this some more, thx for the pushback. :)
> > 
> > > > Reducing fragmentation also helps us avoid problems seen on some other
> > > > filesystem where reflinking of a 64G root image takes minutes after a
> > > > couple of weeks of normal operations because the average extent size is
> > > > now 2 blocks.
> > > > 
> > > > (By contrast we're still averaging ~800 blocks per extent.)
> > > > 
> > > > > It looks like speculative preallocation for dio is at least one strange
> > > > > side effect that can result from this...
> > > > 
> > > > Christoph separated the delalloc reservation into separate functions for
> > > > the data fork and the CoW fork.  xfs_file_iomap_begin_delay() is for the
> > > > data fork (and does speculative prealloc), whereas
> > > > __xfs_reflink_reserve_cow() is for the CoW fork and doesn't know about
> > > > speculative prealloc.
> > > > 
> > > 
> > > Ah, right. Then there's a bit of boilerplate code in
> > > __xfs_reflink_reserve_cow() associated with 'orig_end_fsb' that can be
> > > removed.
> > 
> > The CoW extent size hint code will use orig_end_fsb to tag the inode
> > as potentially needing to gc any CoW leftovers during its periodic
> > scans.
> > 
> 
> Oops, missed that. Hmm, this seems like kind of confused behavior
> overall because (I thought) an extent size hint should force aligned
> (start and end) mapping of extents. In the normal case, extsz forces
> real block allocation, but I don't think that was always the case so
> I'll ignore that for the moment.
> 
> So here, we apply an (cow) extent size hint to a delayed allocation but
> sort of treat it like speculative preallocation (or the allocation size
> mount time option) in that we try to trim off the end and retry the
> request in the event of ENOSPC. AFAICT, xfs_bmapi_reserve_delalloc()
> still does the start/end alignment for cow fork allocations, so really
> how useful is a truncate and retry in this case? In fact, it looks like
> *_reserve_delalloc() would just repeat the same allocation request again
> because the cow extent size hint is still set...
> 
> Am I missing something?

Yeah, it's a little silly and could use some cleanup.
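
As a toy illustration of the trim-and-retry pattern under discussion, here
is a minimal userspace sketch.  reserve(), the halving policy, and the
block counts are illustrative assumptions only -- the real code path is
xfs_bmapi_reserve_delalloc(), and Brian's point above is precisely that it
would re-apply the cowextsize alignment rather than shrink the request
like this:

```c
#include <assert.h>
#include <errno.h>

/* Pretend free space, in blocks (illustrative only). */
static long avail = 48;

/* Stand-in for a delalloc reservation attempt. */
static int reserve(long len)
{
	if (len > avail)
		return -ENOSPC;
	avail -= len;
	return 0;
}

/*
 * Hypothetical trim-and-retry: ask for the hint-sized reservation,
 * halving the request each time we hit ENOSPC.
 */
static long reserve_with_retry(long hint)
{
	long len;

	for (len = hint; len > 0; len >>= 1) {
		if (reserve(len) == 0)
			return len;	/* reservation succeeded */
	}
	return 0;			/* truly out of space */
}
```

With avail = 48, a first request for 128 blocks falls back to a 32-block
reservation and a second one to 16 -- whereas re-applying the hint each
time, as the current code does, would just repeat the same request.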

While we're on the topic of silly trim/realloc loops, something else
I've never resolved:

What to do if we write out a CoW and the ioend receives an IO error?
Right now we just cancel the CoW blocks and hope the caller tries again.
On the one hand it might make more sense just to leave the CoW fork
alone in the hopes that a retry will succeed, but on the other hand we
could just allocate a replacement extent, drop the original allocation,
and retry the write immediately.

For 4.9 I suspect that we probably shouldn't be calling cancel_cow_blocks
when the ioend encounters some sort of IO error.
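
To make the two options concrete, here is a toy model of the policy
choice; finish_ioend(), cow_fork, and the enum are hypothetical names for
illustration, not XFS symbols:

```c
#include <assert.h>

/* Hypothetical ioend-error policies, not actual XFS code. */
enum ioend_policy { CANCEL_COW, KEEP_COW };

struct cow_fork {
	int	staged_blocks;	/* blocks staged for the copy-on-write */
};

static int finish_ioend(struct cow_fork *cow, int io_error,
			enum ioend_policy policy)
{
	if (!io_error) {
		cow->staged_blocks = 0;	/* remapped into the data fork */
		return 0;
	}
	if (policy == CANCEL_COW)
		cow->staged_blocks = 0;	/* current behavior: drop staging */
	/* KEEP_COW: leave the staging extent in place for a later retry. */
	return io_error;
}
```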

--D

> 
> Brian
> 
> > --D
> > 
> > > 
> > > > > > +
> > > > > > +	if (skipped) {
> > > > > > +		*offset_fsb = next_fsb;
> > > > > > +		goto out_trans_cancel;
> > > > > > +	}
> > > > > > +
> > > > > > +	xfs_trans_ijoin(tp, ip, 0);
> > > > > > +	error = xfs_bmapi_write(tp, ip, *offset_fsb, next_fsb - *offset_fsb,
> > > > > > +			XFS_BMAPI_COWFORK, &first_block,
> > > > > > +			XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK),
> > > > > > +			&imap, &nimaps, &dfops);
> > > > > > +	if (error)
> > > > > > +		goto out_trans_cancel;
> > > > > 
> > > > > Should we be using unwritten extents (BMAPI_PREALLOC) to avoid stale
> > > > > data exposure similar to traditional direct write (or is the cow fork
> > > > > extent never accessible until it is remapped)?
> > > > 
> > > > Correct.  CoW fork extents are not accessible until after remapping.
> > > > 
> > > 
> > > Got it, thanks.
> > > 
> > > Brian
> > > 
> > > > --D
> > > > 
> > > > > 
> > > > > Brian
> > > > > 
> > > > > > +
> > > > > > +	/* We might not have been able to map the whole delalloc extent */
> > > > > > +	*offset_fsb = min(*offset_fsb + imap.br_blockcount, next_fsb);
> > > > > > +
> > > > > > +	error = xfs_defer_finish(&tp, &dfops, NULL);
> > > > > > +	if (error)
> > > > > > +		goto out_trans_cancel;
> > > > > > +
> > > > > > +	error = xfs_trans_commit(tp);
> > > > > > +
> > > > > > +out_unlock:
> > > > > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > > > +	return error;
> > > > > > +out_trans_cancel:
> > > > > > +	xfs_defer_cancel(&dfops);
> > > > > > +	xfs_trans_cancel(tp);
> > > > > > +	goto out_unlock;
> > > > > > +}
> > > > > > +
> > > > > > +/* Allocate all CoW reservations covering a part of a file. */
> > > > > > +int
> > > > > > +xfs_reflink_allocate_cow_range(
> > > > > > +	struct xfs_inode	*ip,
> > > > > > +	xfs_off_t		offset,
> > > > > > +	xfs_off_t		count)
> > > > > > +{
> > > > > > +	struct xfs_mount	*mp = ip->i_mount;
> > > > > > +	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
> > > > > > +	xfs_fileoff_t		end_fsb = XFS_B_TO_FSB(mp, offset + count);
> > > > > > +	int			error;
> > > > > > +
> > > > > > +	ASSERT(xfs_is_reflink_inode(ip));
> > > > > > +
> > > > > > +	trace_xfs_reflink_allocate_cow_range(ip, offset, count);
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * Make sure that the dquots are there.
> > > > > > +	 */
> > > > > > +	error = xfs_qm_dqattach(ip, 0);
> > > > > > +	if (error)
> > > > > > +		return error;
> > > > > > +
> > > > > > +	while (offset_fsb < end_fsb) {
> > > > > > +		error = __xfs_reflink_allocate_cow(ip, &offset_fsb, end_fsb);
> > > > > > +		if (error) {
> > > > > > +			trace_xfs_reflink_allocate_cow_range_error(ip, error,
> > > > > > +					_RET_IP_);
> > > > > > +			break;
> > > > > > +		}
> > > > > > +	}
> > > > > > +
> > > > > > +	return error;
> > > > > > +}
> > > > > > +
> > > > > >  /*
> > > > > >   * Find the CoW reservation (and whether or not it needs block allocation)
> > > > > >   * for a given byte offset of a file.
> > > > > > diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
> > > > > > index bffa4be..c0c989a 100644
> > > > > > --- a/fs/xfs/xfs_reflink.h
> > > > > > +++ b/fs/xfs/xfs_reflink.h
> > > > > > @@ -28,6 +28,8 @@ extern int xfs_reflink_trim_around_shared(struct xfs_inode *ip,
> > > > > >  
> > > > > >  extern int xfs_reflink_reserve_cow_range(struct xfs_inode *ip,
> > > > > >  		xfs_off_t offset, xfs_off_t count);
> > > > > > +extern int xfs_reflink_allocate_cow_range(struct xfs_inode *ip,
> > > > > > +		xfs_off_t offset, xfs_off_t count);
> > > > > >  extern bool xfs_reflink_find_cow_mapping(struct xfs_inode *ip, xfs_off_t offset,
> > > > > >  		struct xfs_bmbt_irec *imap, bool *need_alloc);
> > > > > >  extern int xfs_reflink_trim_irec_to_next_cow(struct xfs_inode *ip,
> > > > > > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > > > > > index 7612096..8e89223 100644
> > > > > > --- a/fs/xfs/xfs_trace.h
> > > > > > +++ b/fs/xfs/xfs_trace.h
> > > > > > @@ -3332,7 +3332,6 @@ DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_enospc);
> > > > > >  
> > > > > >  DEFINE_RW_EVENT(xfs_reflink_reserve_cow_range);
> > > > > >  DEFINE_RW_EVENT(xfs_reflink_allocate_cow_range);
> > > > > > -DEFINE_INODE_IREC_EVENT(xfs_reflink_allocate_cow_extent);
> > > > > >  
> > > > > >  DEFINE_INODE_IREC_EVENT(xfs_reflink_bounce_dio_write);
> > > > > >  DEFINE_IOMAP_EVENT(xfs_reflink_find_cow_mapping);
> > > > > > 
> > > > > > --
> > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > > the body of a message to majordomo@vger.kernel.org
> > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH 37/63] xfs: implement CoW for directio writes
  2016-10-13 18:14             ` Darrick J. Wong
@ 2016-10-13 19:01               ` Brian Foster
  0 siblings, 0 replies; 187+ messages in thread
From: Brian Foster @ 2016-10-13 19:01 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, Christoph Hellwig

On Thu, Oct 13, 2016 at 11:14:22AM -0700, Darrick J. Wong wrote:
> On Fri, Oct 07, 2016 at 08:15:06AM -0400, Brian Foster wrote:
> > On Thu, Oct 06, 2016 at 06:02:25PM -0700, Darrick J. Wong wrote:
> > > On Thu, Oct 06, 2016 at 08:20:08AM -0400, Brian Foster wrote:
> > > > On Wed, Oct 05, 2016 at 01:55:42PM -0700, Darrick J. Wong wrote:
> > > > > On Wed, Oct 05, 2016 at 02:27:10PM -0400, Brian Foster wrote:
> > > > > > On Thu, Sep 29, 2016 at 08:09:40PM -0700, Darrick J. Wong wrote:
> > > > > > > For O_DIRECT writes to shared blocks, we have to CoW them just like
> > > > > > > we would with buffered writes.  For writes that are not block-aligned,
> > > > > > > just bounce them to the page cache.
> > > > > > > 
> > > > > > > For block-aligned writes, however, we can do better than that.  Use
> > > > > > > the same mechanisms that we employ for buffered CoW to set up a
> > > > > > > delalloc reservation, allocate all the blocks at once, issue the
> > > > > > > writes against the new blocks and use the same ioend functions to
> > > > > > > remap the blocks after the write.  This should be fairly performant.
> > > > > > > 
> > > > > > > Christoph discovered that xfs_reflink_allocate_cow_range may stumble
> > > > > > > over invalid entries in the extent array given that it drops the ilock
> > > > > > > but still expects the index to be stable.  Simply fixing it to do a
> > > > > > > new lookup for every iteration still isn't correct given that
> > > > > > > xfs_bmapi_allocate will trigger a BUG_ON() if hitting a hole, and
> > > > > > > there is nothing preventing an xfs_bunmapi_cow call from removing
> > > > > > > extents once we dropped the ilock either.
> > > > > > > 
> > > > > > > This patch duplicates the inner loop of xfs_bmapi_allocate into a
> > > > > > > helper for xfs_reflink_allocate_cow_range so that it can be done under
> > > > > > > the same ilock critical section as our CoW fork delayed allocation.
> > > > > > > The directio CoW warts will be revisited in a later patch.
> > > > > > > 
> > > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > > > > > > ---
> > > > > > > v2: Turns out that there's no way for xfs_end_io_direct_write to know
> > > > > > > if the write completed successfully.  Therefore, do /not/ use the
> > > > > > > ioend for dio cow post-processing; instead, move it to xfs_vm_do_dio
> > > > > > > where we *can* tell if the write succeeded or not.
> > > > > > > 
> > > > > > > v3: Update the file size if we do a directio CoW across EOF.  This
> > > > > > > can happen if the last block is shared, the cowextsize hint is set,
> > > > > > > and we do a dio write past the end of the file.
> > > > > > > 
> > > > > > > v4: Christoph rewrote the allocate code to fix some concurrency
> > > > > > > problems as part of migrating the code to support iomap.
> > > > > > > ---
> > > > > > >  fs/xfs/xfs_aops.c    |   91 +++++++++++++++++++++++++++++++++++++++----
> > > > > > >  fs/xfs/xfs_file.c    |   20 ++++++++-
> > > > > > >  fs/xfs/xfs_reflink.c |  107 +++++++++++++++++++++++++++++++++++++++++++++++++-
> > > > > > >  fs/xfs/xfs_reflink.h |    2 +
> > > > > > >  fs/xfs/xfs_trace.h   |    1 
> > > > > > >  5 files changed, 208 insertions(+), 13 deletions(-)
> > > > > > > 
> > > > > > > 
> > > > ...
> > ...
> > > 
> > > > > > > +
> > > > > > >  	data = *from;
> > > > > > >  	ret = __blockdev_direct_IO(iocb, inode, target->bt_bdev, &data,
> > > > > > >  			xfs_get_blocks_direct, xfs_end_io_direct_write,
> > > > ...
> > > > > > > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > > > > > > index d913ad1..c95cdc3 100644
> > > > > > > --- a/fs/xfs/xfs_reflink.c
> > > > > > > +++ b/fs/xfs/xfs_reflink.c
> > > > ...
> > > > > > > @@ -347,6 +352,102 @@ xfs_reflink_reserve_cow_range(
> > > > > > >  	return error;
> > > > > > >  }
> > > > > > >  
> > > > > > > +/* Allocate all CoW reservations covering a range of blocks in a file. */
> > > > > > > +static int
> > > > > > > +__xfs_reflink_allocate_cow(
> > > > > > > +	struct xfs_inode	*ip,
> > > > > > > +	xfs_fileoff_t		*offset_fsb,
> > > > > > > +	xfs_fileoff_t		end_fsb)
> > > > > > > +{
> > > > > > > +	struct xfs_mount	*mp = ip->i_mount;
> > > > > > > +	struct xfs_bmbt_irec	imap;
> > > > > > > +	struct xfs_defer_ops	dfops;
> > > > > > > +	struct xfs_trans	*tp;
> > > > > > > +	xfs_fsblock_t		first_block;
> > > > > > > +	xfs_fileoff_t		next_fsb;
> > > > > > > +	int			nimaps = 1, error;
> > > > > > > +	bool			skipped = false;
> > > > > > > +
> > > > > > > +	xfs_defer_init(&dfops, &first_block);
> > > > > > > +
> > > > > > > +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0,
> > > > > > > +			XFS_TRANS_RESERVE, &tp);
> > > > > > > +	if (error)
> > > > > > > +		return error;
> > > > > > > +
> > > > > > > +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > > > > > > +
> > > > > > > +	next_fsb = *offset_fsb;
> > > > > > > +	error = __xfs_reflink_reserve_cow(ip, &next_fsb, end_fsb, &skipped);
> > > > > > > +	if (error)
> > > > > > > +		goto out_trans_cancel;
> > > > > > 
> > > > > > Do we really need to do the delayed allocation that results from this?
> > > > > > Couldn't we factor out the shared extent walking that allows us to just
> > > > > > perform the real allocations below?
> > > > > 
> > > > > The delayed reservation -> allocation two-step is necessary to create
> > > > > replacement extents that are aligned to the CoW extent size hint.  This is
> > > > > important for aligning extents in the same way as the regular extent
> > > > > size hint, and critical for detecting random writes and landing them all
> > > > > in as close to a contiguous physical extent as possible.  This helps us
> > > > > to reduce cow-related fragmentation to manageable levels, which is
> > > > > necessary to avoid ENOMEM problems with the current incore extent tree.
> > > > > 
> > > > 
> > > > The cow extent size hint thing makes sense, but I don't see why we need
> > > > to do delayed allocation to incorporate it. Can we not accommodate a cow
> > > > extent size hint for a real allocation in the cow fork the same way a
> > > > direct write accommodates a traditional extent size hint in the data
> > > > fork? In fact, we've had logic for a while now that explicitly avoids
> > > > delayed allocation when a traditional extent size hint is set.
> > > 
> > > Yes, that would have been another way to implement it.  I think I
> > > finally see your point about using the delalloc mechanism -- since we've
> > > converted the buffered write path to iomap and therefore know exactly
> > > how much userspace wants to write in both buffered and directio cases,
> > > we could just allocate the cow extent right then and there, skipping the
> > > overhead of writing a delalloc reservation and then changing it.
> > > 
> > 
> > Pretty much...
> > 
> > > For buffered writes, though, it's nice to be able to use the DA
> > > mechanism so that we can ask the allocator for as big of an extent as we
> > > have contiguous dirty pages.  Hm.  I guess for directio then we could
> > > just fill in the holes directly and convert any delalloc reservations
> > > that happened already to be there, which requires only a single loop.
> > > 
> > 
> > Sure. I'm basically just poking at why we appear to take a different
> > approach for each of the buffered/direct I/O mechanisms to the cow fork
> > as opposed to the data fork (with regard to block allocation, at least).
> > 
> > So using delayed allocation for cow buffered I/O certainly makes sense
> > to me for basically the same reasons we use it for normal buffered
> > I/O...
> > 
> > > Will ponder this some more, thx for the pushback. :)
> > > 
> > > > > Reducing fragmentation also helps us avoid problems seen on some other
> > > > > filesystem where reflinking of a 64G root image takes minutes after a
> > > > > couple of weeks of normal operations because the average extent size is
> > > > > now 2 blocks.
> > > > > 
> > > > > (By contrast we're still averaging ~800 blocks per extent.)
> > > > > 
> > > > > > It looks like speculative preallocation for dio is at least one strange
> > > > > > side effect that can result from this...
> > > > > 
> > > > > Christoph separated the delalloc reservation into separate functions for
> > > > > the data fork and the CoW fork.  xfs_file_iomap_begin_delay() is for the
> > > > > data fork (and does speculative prealloc), whereas
> > > > > __xfs_reflink_reserve_cow() is for the CoW fork and doesn't know about
> > > > > speculative prealloc.
> > > > > 
> > > > 
> > > > Ah, right. Then there's a bit of boilerplate code in
> > > > __xfs_reflink_reserve_cow() associated with 'orig_end_fsb' that can be
> > > > removed.
> > > 
> > > The CoW extent size hint code will use orig_end_fsb to tag the inode
> > > as potentially needing to gc any CoW leftovers during its periodic
> > > scans.
> > > 
> > 
> > Oops, missed that. Hmm, this seems like kind of confused behavior
> > overall because (I thought) an extent size hint should force aligned
> > (start and end) mapping of extents. In the normal case, extsz forces
> > real block allocation, but I don't think that was always the case so
> > I'll ignore that for the moment.
> > 
> > So here, we apply an (cow) extent size hint to a delayed allocation but
> > sort of treat it like speculative preallocation (or the allocation size
> > mount time option) in that we try to trim off the end and retry the
> > request in the event of ENOSPC. AFAICT, xfs_bmapi_reserve_delalloc()
> > still does the start/end alignment for cow fork allocations, so really
> > how useful is a truncate and retry in this case? In fact, it looks like
> > *_reserve_delalloc() would just repeat the same allocation request again
> > because the cow extent size hint is still set...
> > 
> > Am I missing something?
> 
> Yeah, it's a little silly and could use some cleanup.
> 
> While we're on the topic of silly trim/realloc loops, something else
> I've never resolved:
> 
> What to do if we write out a CoW and the ioend receives an IO error?
> Right now we just cancel the CoW blocks and hope the caller tries again.
> On the one hand it might make more sense just to leave the CoW fork
> alone in the hopes that a retry will succeed, but on the other hand we
> could just allocate a replacement extent, drop the original allocation,
> and retry the write immediately.
> 

Hmm, yeah to cancel there does seem like kind of a strange behavior.
What might be interesting with that approach is whether we expose
ourselves to allocation or other problems on the next writeback due to
killing off cow fork blocks. I haven't dug into it, but at that point
wouldn't we have dirty pagecache pages around without any underlying
block allocation (delalloc or otherwise..)?

> For 4.9 I suspect that we probably shouldn't be calling cancel_cow_blocks
> when the ioend encounters some sort of IO error.
> 

Indeed. It probably requires more thought, but my inclination would be
to retain as similar behavior as possible to the normal failed write
case, which would be to preserve the actual allocation and let whatever
mechanism handles the retry to do so.

Brian

> --D
> 
> > 
> > Brian
> > 
> > > --D
> > > 
> > > > 
> > > > > > > +
> > > > > > > +	if (skipped) {
> > > > > > > +		*offset_fsb = next_fsb;
> > > > > > > +		goto out_trans_cancel;
> > > > > > > +	}
> > > > > > > +
> > > > > > > +	xfs_trans_ijoin(tp, ip, 0);
> > > > > > > +	error = xfs_bmapi_write(tp, ip, *offset_fsb, next_fsb - *offset_fsb,
> > > > > > > +			XFS_BMAPI_COWFORK, &first_block,
> > > > > > > +			XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK),
> > > > > > > +			&imap, &nimaps, &dfops);
> > > > > > > +	if (error)
> > > > > > > +		goto out_trans_cancel;
> > > > > > 
> > > > > > Should we be using unwritten extents (BMAPI_PREALLOC) to avoid stale
> > > > > > data exposure similar to traditional direct write (or is the cow fork
> > > > > > extent never accessible until it is remapped)?
> > > > > 
> > > > > Correct.  CoW fork extents are not accessible until after remapping.
> > > > > 
> > > > 
> > > > Got it, thanks.
> > > > 
> > > > Brian
> > > > 
> > > > > --D
> > > > > 
> > > > > > 
> > > > > > Brian
> > > > > > 
> > > > > > > +
> > > > > > > +	/* We might not have been able to map the whole delalloc extent */
> > > > > > > +	*offset_fsb = min(*offset_fsb + imap.br_blockcount, next_fsb);
> > > > > > > +
> > > > > > > +	error = xfs_defer_finish(&tp, &dfops, NULL);
> > > > > > > +	if (error)
> > > > > > > +		goto out_trans_cancel;
> > > > > > > +
> > > > > > > +	error = xfs_trans_commit(tp);
> > > > > > > +
> > > > > > > +out_unlock:
> > > > > > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > > > > +	return error;
> > > > > > > +out_trans_cancel:
> > > > > > > +	xfs_defer_cancel(&dfops);
> > > > > > > +	xfs_trans_cancel(tp);
> > > > > > > +	goto out_unlock;
> > > > > > > +}
> > > > > > > +
> > > > > > > +/* Allocate all CoW reservations covering a part of a file. */
> > > > > > > +int
> > > > > > > +xfs_reflink_allocate_cow_range(
> > > > > > > +	struct xfs_inode	*ip,
> > > > > > > +	xfs_off_t		offset,
> > > > > > > +	xfs_off_t		count)
> > > > > > > +{
> > > > > > > +	struct xfs_mount	*mp = ip->i_mount;
> > > > > > > +	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
> > > > > > > +	xfs_fileoff_t		end_fsb = XFS_B_TO_FSB(mp, offset + count);
> > > > > > > +	int			error;
> > > > > > > +
> > > > > > > +	ASSERT(xfs_is_reflink_inode(ip));
> > > > > > > +
> > > > > > > +	trace_xfs_reflink_allocate_cow_range(ip, offset, count);
> > > > > > > +
> > > > > > > +	/*
> > > > > > > +	 * Make sure that the dquots are there.
> > > > > > > +	 */
> > > > > > > +	error = xfs_qm_dqattach(ip, 0);
> > > > > > > +	if (error)
> > > > > > > +		return error;
> > > > > > > +
> > > > > > > +	while (offset_fsb < end_fsb) {
> > > > > > > +		error = __xfs_reflink_allocate_cow(ip, &offset_fsb, end_fsb);
> > > > > > > +		if (error) {
> > > > > > > +			trace_xfs_reflink_allocate_cow_range_error(ip, error,
> > > > > > > +					_RET_IP_);
> > > > > > > +			break;
> > > > > > > +		}
> > > > > > > +	}
> > > > > > > +
> > > > > > > +	return error;
> > > > > > > +}
> > > > > > > +
> > > > > > >  /*
> > > > > > >   * Find the CoW reservation (and whether or not it needs block allocation)
> > > > > > >   * for a given byte offset of a file.
> > > > > > > diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
> > > > > > > index bffa4be..c0c989a 100644
> > > > > > > --- a/fs/xfs/xfs_reflink.h
> > > > > > > +++ b/fs/xfs/xfs_reflink.h
> > > > > > > @@ -28,6 +28,8 @@ extern int xfs_reflink_trim_around_shared(struct xfs_inode *ip,
> > > > > > >  
> > > > > > >  extern int xfs_reflink_reserve_cow_range(struct xfs_inode *ip,
> > > > > > >  		xfs_off_t offset, xfs_off_t count);
> > > > > > > +extern int xfs_reflink_allocate_cow_range(struct xfs_inode *ip,
> > > > > > > +		xfs_off_t offset, xfs_off_t count);
> > > > > > >  extern bool xfs_reflink_find_cow_mapping(struct xfs_inode *ip, xfs_off_t offset,
> > > > > > >  		struct xfs_bmbt_irec *imap, bool *need_alloc);
> > > > > > >  extern int xfs_reflink_trim_irec_to_next_cow(struct xfs_inode *ip,
> > > > > > > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > > > > > > index 7612096..8e89223 100644
> > > > > > > --- a/fs/xfs/xfs_trace.h
> > > > > > > +++ b/fs/xfs/xfs_trace.h
> > > > > > > @@ -3332,7 +3332,6 @@ DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_enospc);
> > > > > > >  
> > > > > > >  DEFINE_RW_EVENT(xfs_reflink_reserve_cow_range);
> > > > > > >  DEFINE_RW_EVENT(xfs_reflink_allocate_cow_range);
> > > > > > > -DEFINE_INODE_IREC_EVENT(xfs_reflink_allocate_cow_extent);
> > > > > > >  
> > > > > > >  DEFINE_INODE_IREC_EVENT(xfs_reflink_bounce_dio_write);
> > > > > > >  DEFINE_IOMAP_EVENT(xfs_reflink_find_cow_mapping);
> > > > > > > 
> > > > > > > --
> > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > > > the body of a message to majordomo@vger.kernel.org
> > > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 187+ messages in thread

* Re: [PATCH 48/63] xfs: preallocate blocks for worst-case btree expansion
  2016-10-12 22:42       ` Brian Foster
@ 2016-12-06 19:32         ` Darrick J. Wong
  2016-12-07 11:53           ` Brian Foster
  0 siblings, 1 reply; 187+ messages in thread
From: Darrick J. Wong @ 2016-12-06 19:32 UTC (permalink / raw)
  To: Brian Foster; +Cc: david, linux-xfs, Christoph Hellwig

On Wed, Oct 12, 2016 at 06:42:36PM -0400, Brian Foster wrote:
> On Wed, Oct 12, 2016 at 01:52:57PM -0700, Darrick J. Wong wrote:
> > On Wed, Oct 12, 2016 at 02:44:51PM -0400, Brian Foster wrote:
> > > On Thu, Sep 29, 2016 at 08:10:52PM -0700, Darrick J. Wong wrote:
> > > > To gracefully handle the situation where a CoW operation turns a
> > > > single refcount extent into a lot of tiny ones and then runs out of
> > > > space when a tree split has to happen, use the per-AG reserved block
> > > > pool to pre-allocate all the space we'll ever need for a maximal
> > > > btree.  For a 4K block size, this only costs an overhead of 0.3% of
> > > > available disk space.
> > > > 
> > > > When reflink is enabled, we have an unfortunate problem with rmap --
> > > > since we can share a block billions of times, this means that the
> > > > reverse mapping btree can expand basically infinitely.  When an AG is
> > > > so full that there are no free blocks with which to expand the rmapbt,
> > > > the filesystem will shut down hard.
> > > > 
> > > > This is rather annoying to the user, so use the AG reservation code to
> > > > reserve a "reasonable" amount of space for rmap.  We'll prevent
> > > > reflinks and CoW operations if we think we're getting close to
> > > > exhausting an AG's free space rather than shutting down, but this
> > > > permanent reservation should be enough for "most" users.  Hopefully.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > [hch@lst.de: ensure that we invalidate the freed btree buffer]
> > > > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > > > ---
> > > > v2: Simplify the return value from xfs_perag_pool_free_block to a bool
> > > > so that we can easily call xfs_trans_binval for both the per-AG pool
> > > > and the real freeing case.  Without this we fail to invalidate the
> > > > btree buffer and will trip over the write verifier on a shrinking
> > > > refcount btree.
> > > > 
> > > > v3: Convert to the new per-AG reservation code.
> > > > 
> > > > v4: Combine this patch with the one that adds the rmapbt reservation,
> > > > since the rmapbt reservation is only needed for reflink filesystems.
> > > > 
> > > > v5: If we detect errors while counting the refcount or rmap btrees,
> > > > shut down the filesystem to avoid the scenario where the fs shuts down
> > > > mid-transaction due to btree corruption, repair refuses to run until
> > > > the log is clean, and the log cannot be cleaned because replay hits
> > > > btree corruption and shuts down.
> > > > ---
> > > >  fs/xfs/libxfs/xfs_ag_resv.c        |   11 ++++++
> > > >  fs/xfs/libxfs/xfs_refcount_btree.c |   45 ++++++++++++++++++++++++-
> > > >  fs/xfs/libxfs/xfs_refcount_btree.h |    3 ++
> > > >  fs/xfs/libxfs/xfs_rmap_btree.c     |   60 ++++++++++++++++++++++++++++++++++
> > > >  fs/xfs/libxfs/xfs_rmap_btree.h     |    7 ++++
> > > >  fs/xfs/xfs_fsops.c                 |   64 ++++++++++++++++++++++++++++++++++++
> > > >  fs/xfs/xfs_fsops.h                 |    3 ++
> > > >  fs/xfs/xfs_mount.c                 |    8 +++++
> > > >  fs/xfs/xfs_super.c                 |   12 +++++++
> > > >  9 files changed, 210 insertions(+), 3 deletions(-)
> > > > 
> > > > 
> > > > diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c
> > > > index e3ae0f2..adf770f 100644
> > > > --- a/fs/xfs/libxfs/xfs_ag_resv.c
> > > > +++ b/fs/xfs/libxfs/xfs_ag_resv.c
> > > > @@ -38,6 +38,7 @@
> > > >  #include "xfs_trans_space.h"
> > > >  #include "xfs_rmap_btree.h"
> > > >  #include "xfs_btree.h"
> > > > +#include "xfs_refcount_btree.h"
> > > >  
> > > >  /*
> > > >   * Per-AG Block Reservations
> > > > @@ -228,6 +229,11 @@ xfs_ag_resv_init(
> > > >  	if (pag->pag_meta_resv.ar_asked == 0) {
> > > >  		ask = used = 0;
> > > >  
> > > > +		error = xfs_refcountbt_calc_reserves(pag->pag_mount,
> > > > +				pag->pag_agno, &ask, &used);
> > > > +		if (error)
> > > > +			goto out;
> > > > +
> > > >  		error = __xfs_ag_resv_init(pag, XFS_AG_RESV_METADATA,
> > > >  				ask, used);
> > > 
> > > Now that I get here, I see we have these per-ag reservation structures
> > > and whatnot, but __xfs_ag_resv_init() (from a previous patch) calls
> > > xfs_mod_fdblocks() for the reservation. AFAICT, that reserves from the
> > > "global pool." Based on the commit log, isn't the intent here to reserve
> > > blocks within each AG? What am I missing?
> > 
> > The AG reservation code "reserves" blocks in each AG by hiding them from
> > the allocator.  They're all still there in the bnobt, but we underreport
> > the length of the longest free extent and the free block count in that
> > AG to make it look like there's less free space than there is.  Since
> > those blocks are no longer generally available, we have to decrease the
> > in-core free block count so we can't create delalloc reservations that
> > the allocator won't (or can't) satisfy.
> > 
> 
> Yep, I think I get the idea/purpose in principle. It sounds similar to
> global reserve pool, where we set aside a count of unallocated blocks
> via accounting magic such that we have some available in cases such as
> the need to allocate a block to free an extent in low free space
> conditions.

Correct.

> In this case, it looks like we reserve blocks in the same manner (via
> xfs_mod_fdblocks()) and record the reservation in a new per-ag
> reservation structure. The part I'm missing is how we guarantee those
> blocks are accessible in the particular AG (or am I entirely mistaken
> about the requirement that the per-AG reservation must reside within
> that specific AG?).

You're correct there too.

> An example might clarify where my confusion lies... suppose we have a
> non-standard configuration with a 1TB ag size and just barely enough
> total filesystem size for a second AG, e.g., we have two AGs where AG 0
> is 1TB and AG 1 is 16MB. Suppose that the reservation requirement (for
> the sake of example, at least) based on sb_agblocks is larger than the
> entire size of AG 1. Yet, the xfs_mod_fdblocks() call for the AG 1 res
> struct will apparently succeed because there are plenty of blocks in
> mp->m_fdblocks. Unless I'm mistaken, shouldn't we not be able to reserve
> this many blocks out of AG 1?

You're right, that is a bug.  We /ought/ to be calculating the
reservation ask based on agf_length, not sb_agblocks.  I'll also have to
fix growfs to change the reservation if the length of the last AG
changes.

> Even in the case where AG 1 is large enough for the reservation, what
> actually prevents a sequence of single block allocations from using all
> of the space in the AG? 

AFAICT, the allocator picks an AG and tries to fix the freelist before
allocating blocks.  As part of ensuring the AGFL, we call
xfs_alloc_space_available to decide if there's enough space in the AG
both to satisfy the allocation request and to fix the freelist.

_a_s_a starts by determining the number of blocks that have to stay
reserved in that AG for the given allocation type.  Then it calls
xfs_alloc_longest_free_extent to find the longest free extent in the AG.

_a_l_f_e finds the longest extent and subtracts whatever part of the AG
reservation it can't satisfy out of the non-longest free extents.

Upon returning from _a_l_f_e, _a_s_a rejects the allocation if the
longest extent cannot satisfy the required minimum allocation with the
given alignment constraints.

Next it calculates the space that would remain after the allocation,
which is:

(free space + agfl blocks) - (ag reservation) - (minimum agfl length) -
     (total blocks requested)

If this quantity is less than zero (or less than args->minleft) then the
allocation is also rejected.  I believe this should be sufficient to
prevent a series of single block alloc requests from exhausting the AG
since we're stopped from giving away reserved blocks that we're not
entitled to, even if there are still records in the bnobt.

--D

> 
> Brian
> 
> > Maybe a more concrete way to put that is: say we have 4 AGs with 4 agresv
> > blocks each, and no other free space left anywhere.  The in-core fdblocks count
> > should be 0 so that starting a write into a hole returns ENOSPC even if the
> > write could be done without any btree shape changes.   Otherwise, writepages
> > tries to allocate the delalloc reservation, fails to find any space because
> > we've hidden it, and kaboom.
> > 
> > --D
> > 
> > > 
> > > Brian
> > > 
> > > >  		if (error)
> > > > @@ -238,6 +244,11 @@ xfs_ag_resv_init(
> > > >  	if (pag->pag_agfl_resv.ar_asked == 0) {
> > > >  		ask = used = 0;
> > > >  
> > > > +		error = xfs_rmapbt_calc_reserves(pag->pag_mount, pag->pag_agno,
> > > > +				&ask, &used);
> > > > +		if (error)
> > > > +			goto out;
> > > > +
> > > >  		error = __xfs_ag_resv_init(pag, XFS_AG_RESV_AGFL, ask, used);
> > > >  		if (error)
> > > >  			goto out;
> > > > diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
> > > > index 6b5e82b9..453bb27 100644
> > > > --- a/fs/xfs/libxfs/xfs_refcount_btree.c
> > > > +++ b/fs/xfs/libxfs/xfs_refcount_btree.c
> > > > @@ -79,6 +79,8 @@ xfs_refcountbt_alloc_block(
> > > >  	struct xfs_alloc_arg	args;		/* block allocation args */
> > > >  	int			error;		/* error return value */
> > > >  
> > > > +	XFS_BTREE_TRACE_CURSOR(cur, XBT_ENTRY);
> > > > +
> > > >  	memset(&args, 0, sizeof(args));
> > > >  	args.tp = cur->bc_tp;
> > > >  	args.mp = cur->bc_mp;
> > > > @@ -88,6 +90,7 @@ xfs_refcountbt_alloc_block(
> > > >  	args.firstblock = args.fsbno;
> > > >  	xfs_rmap_ag_owner(&args.oinfo, XFS_RMAP_OWN_REFC);
> > > >  	args.minlen = args.maxlen = args.prod = 1;
> > > > +	args.resv = XFS_AG_RESV_METADATA;
> > > >  
> > > >  	error = xfs_alloc_vextent(&args);
> > > >  	if (error)
> > > > @@ -125,16 +128,19 @@ xfs_refcountbt_free_block(
> > > >  	struct xfs_agf		*agf = XFS_BUF_TO_AGF(agbp);
> > > >  	xfs_fsblock_t		fsbno = XFS_DADDR_TO_FSB(mp, XFS_BUF_ADDR(bp));
> > > >  	struct xfs_owner_info	oinfo;
> > > > +	int			error;
> > > >  
> > > >  	trace_xfs_refcountbt_free_block(cur->bc_mp, cur->bc_private.a.agno,
> > > >  			XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno), 1);
> > > >  	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_REFC);
> > > >  	be32_add_cpu(&agf->agf_refcount_blocks, -1);
> > > >  	xfs_alloc_log_agf(cur->bc_tp, agbp, XFS_AGF_REFCOUNT_BLOCKS);
> > > > -	xfs_bmap_add_free(mp, cur->bc_private.a.dfops, fsbno, 1,
> > > > -			&oinfo);
> > > > +	error = xfs_free_extent(cur->bc_tp, fsbno, 1, &oinfo,
> > > > +			XFS_AG_RESV_METADATA);
> > > > +	if (error)
> > > > +		return error;
> > > >  
> > > > -	return 0;
> > > > +	return error;
> > > >  }
> > > >  
> > > >  STATIC int
> > > > @@ -410,3 +416,36 @@ xfs_refcountbt_max_size(
> > > >  
> > > >  	return xfs_refcountbt_calc_size(mp, mp->m_sb.sb_agblocks);
> > > >  }
> > > > +
> > > > +/*
> > > > + * Figure out how many blocks to reserve and how many are used by this btree.
> > > > + */
> > > > +int
> > > > +xfs_refcountbt_calc_reserves(
> > > > +	struct xfs_mount	*mp,
> > > > +	xfs_agnumber_t		agno,
> > > > +	xfs_extlen_t		*ask,
> > > > +	xfs_extlen_t		*used)
> > > > +{
> > > > +	struct xfs_buf		*agbp;
> > > > +	struct xfs_agf		*agf;
> > > > +	xfs_extlen_t		tree_len;
> > > > +	int			error;
> > > > +
> > > > +	if (!xfs_sb_version_hasreflink(&mp->m_sb))
> > > > +		return 0;
> > > > +
> > > > +	*ask += xfs_refcountbt_max_size(mp);
> > > > +
> > > > +	error = xfs_alloc_read_agf(mp, NULL, agno, 0, &agbp);
> > > > +	if (error)
> > > > +		return error;
> > > > +
> > > > +	agf = XFS_BUF_TO_AGF(agbp);
> > > > +	tree_len = be32_to_cpu(agf->agf_refcount_blocks);
> > > > +	xfs_buf_relse(agbp);
> > > > +
> > > > +	*used += tree_len;
> > > > +
> > > > +	return error;
> > > > +}
> > > > diff --git a/fs/xfs/libxfs/xfs_refcount_btree.h b/fs/xfs/libxfs/xfs_refcount_btree.h
> > > > index 780b02f..3be7768 100644
> > > > --- a/fs/xfs/libxfs/xfs_refcount_btree.h
> > > > +++ b/fs/xfs/libxfs/xfs_refcount_btree.h
> > > > @@ -68,4 +68,7 @@ extern xfs_extlen_t xfs_refcountbt_calc_size(struct xfs_mount *mp,
> > > >  		unsigned long long len);
> > > >  extern xfs_extlen_t xfs_refcountbt_max_size(struct xfs_mount *mp);
> > > >  
> > > > +extern int xfs_refcountbt_calc_reserves(struct xfs_mount *mp,
> > > > +		xfs_agnumber_t agno, xfs_extlen_t *ask, xfs_extlen_t *used);
> > > > +
> > > >  #endif	/* __XFS_REFCOUNT_BTREE_H__ */
> > > > diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
> > > > index 9c0585e..83e672f 100644
> > > > --- a/fs/xfs/libxfs/xfs_rmap_btree.c
> > > > +++ b/fs/xfs/libxfs/xfs_rmap_btree.c
> > > > @@ -35,6 +35,7 @@
> > > >  #include "xfs_cksum.h"
> > > >  #include "xfs_error.h"
> > > >  #include "xfs_extent_busy.h"
> > > > +#include "xfs_ag_resv.h"
> > > >  
> > > >  /*
> > > >   * Reverse map btree.
> > > > @@ -533,3 +534,62 @@ xfs_rmapbt_compute_maxlevels(
> > > >  		mp->m_rmap_maxlevels = xfs_btree_compute_maxlevels(mp,
> > > >  				mp->m_rmap_mnr, mp->m_sb.sb_agblocks);
> > > >  }
> > > > +
> > > > +/* Calculate the rmap btree size for some records. */
> > > > +xfs_extlen_t
> > > > +xfs_rmapbt_calc_size(
> > > > +	struct xfs_mount	*mp,
> > > > +	unsigned long long	len)
> > > > +{
> > > > +	return xfs_btree_calc_size(mp, mp->m_rmap_mnr, len);
> > > > +}
> > > > +
> > > > +/*
> > > > + * Calculate the maximum rmap btree size.
> > > > + */
> > > > +xfs_extlen_t
> > > > +xfs_rmapbt_max_size(
> > > > +	struct xfs_mount	*mp)
> > > > +{
> > > > +	/* Bail out if we're uninitialized, which can happen in mkfs. */
> > > > +	if (mp->m_rmap_mxr[0] == 0)
> > > > +		return 0;
> > > > +
> > > > +	return xfs_rmapbt_calc_size(mp, mp->m_sb.sb_agblocks);
> > > > +}
> > > > +
> > > > +/*
> > > > + * Figure out how many blocks to reserve and how many are used by this btree.
> > > > + */
> > > > +int
> > > > +xfs_rmapbt_calc_reserves(
> > > > +	struct xfs_mount	*mp,
> > > > +	xfs_agnumber_t		agno,
> > > > +	xfs_extlen_t		*ask,
> > > > +	xfs_extlen_t		*used)
> > > > +{
> > > > +	struct xfs_buf		*agbp;
> > > > +	struct xfs_agf		*agf;
> > > > +	xfs_extlen_t		pool_len;
> > > > +	xfs_extlen_t		tree_len;
> > > > +	int			error;
> > > > +
> > > > +	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
> > > > +		return 0;
> > > > +
> > > > +	/* Reserve 1% of the AG or enough for 1 block per record. */
> > > > +	pool_len = max(mp->m_sb.sb_agblocks / 100, xfs_rmapbt_max_size(mp));
> > > > +	*ask += pool_len;
> > > > +
> > > > +	error = xfs_alloc_read_agf(mp, NULL, agno, 0, &agbp);
> > > > +	if (error)
> > > > +		return error;
> > > > +
> > > > +	agf = XFS_BUF_TO_AGF(agbp);
> > > > +	tree_len = be32_to_cpu(agf->agf_rmap_blocks);
> > > > +	xfs_buf_relse(agbp);
> > > > +
> > > > +	*used += tree_len;
> > > > +
> > > > +	return error;
> > > > +}
> > > > diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
> > > > index e73a553..2a9ac47 100644
> > > > --- a/fs/xfs/libxfs/xfs_rmap_btree.h
> > > > +++ b/fs/xfs/libxfs/xfs_rmap_btree.h
> > > > @@ -58,4 +58,11 @@ struct xfs_btree_cur *xfs_rmapbt_init_cursor(struct xfs_mount *mp,
> > > >  int xfs_rmapbt_maxrecs(struct xfs_mount *mp, int blocklen, int leaf);
> > > >  extern void xfs_rmapbt_compute_maxlevels(struct xfs_mount *mp);
> > > >  
> > > > +extern xfs_extlen_t xfs_rmapbt_calc_size(struct xfs_mount *mp,
> > > > +		unsigned long long len);
> > > > +extern xfs_extlen_t xfs_rmapbt_max_size(struct xfs_mount *mp);
> > > > +
> > > > +extern int xfs_rmapbt_calc_reserves(struct xfs_mount *mp,
> > > > +		xfs_agnumber_t agno, xfs_extlen_t *ask, xfs_extlen_t *used);
> > > > +
> > > >  #endif	/* __XFS_RMAP_BTREE_H__ */
> > > > diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> > > > index 3acbf4e0..93d12fa 100644
> > > > --- a/fs/xfs/xfs_fsops.c
> > > > +++ b/fs/xfs/xfs_fsops.c
> > > > @@ -43,6 +43,7 @@
> > > >  #include "xfs_log.h"
> > > >  #include "xfs_filestream.h"
> > > >  #include "xfs_rmap.h"
> > > > +#include "xfs_ag_resv.h"
> > > >  
> > > >  /*
> > > >   * File system operations
> > > > @@ -630,6 +631,11 @@ xfs_growfs_data_private(
> > > >  	xfs_set_low_space_thresholds(mp);
> > > >  	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
> > > >  
> > > > +	/* Reserve AG metadata blocks. */
> > > > +	error = xfs_fs_reserve_ag_blocks(mp);
> > > > +	if (error && error != -ENOSPC)
> > > > +		goto out;
> > > > +
> > > >  	/* update secondary superblocks. */
> > > >  	for (agno = 1; agno < nagcount; agno++) {
> > > >  		error = 0;
> > > > @@ -680,6 +686,8 @@ xfs_growfs_data_private(
> > > >  			continue;
> > > >  		}
> > > >  	}
> > > > +
> > > > + out:
> > > >  	return saved_error ? saved_error : error;
> > > >  
> > > >   error0:
> > > > @@ -989,3 +997,59 @@ xfs_do_force_shutdown(
> > > >  	"Please umount the filesystem and rectify the problem(s)");
> > > >  	}
> > > >  }
> > > > +
> > > > +/*
> > > > + * Reserve free space for per-AG metadata.
> > > > + */
> > > > +int
> > > > +xfs_fs_reserve_ag_blocks(
> > > > +	struct xfs_mount	*mp)
> > > > +{
> > > > +	xfs_agnumber_t		agno;
> > > > +	struct xfs_perag	*pag;
> > > > +	int			error = 0;
> > > > +	int			err2;
> > > > +
> > > > +	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
> > > > +		pag = xfs_perag_get(mp, agno);
> > > > +		err2 = xfs_ag_resv_init(pag);
> > > > +		xfs_perag_put(pag);
> > > > +		if (err2 && !error)
> > > > +			error = err2;
> > > > +	}
> > > > +
> > > > +	if (error && error != -ENOSPC) {
> > > > +		xfs_warn(mp,
> > > > +	"Error %d reserving per-AG metadata reserve pool.", error);
> > > > +		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> > > > +	}
> > > > +
> > > > +	return error;
> > > > +}
> > > > +
> > > > +/*
> > > > + * Free space reserved for per-AG metadata.
> > > > + */
> > > > +int
> > > > +xfs_fs_unreserve_ag_blocks(
> > > > +	struct xfs_mount	*mp)
> > > > +{
> > > > +	xfs_agnumber_t		agno;
> > > > +	struct xfs_perag	*pag;
> > > > +	int			error = 0;
> > > > +	int			err2;
> > > > +
> > > > +	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
> > > > +		pag = xfs_perag_get(mp, agno);
> > > > +		err2 = xfs_ag_resv_free(pag);
> > > > +		xfs_perag_put(pag);
> > > > +		if (err2 && !error)
> > > > +			error = err2;
> > > > +	}
> > > > +
> > > > +	if (error)
> > > > +		xfs_warn(mp,
> > > > +	"Error %d freeing per-AG metadata reserve pool.", error);
> > > > +
> > > > +	return error;
> > > > +}
> > > > diff --git a/fs/xfs/xfs_fsops.h b/fs/xfs/xfs_fsops.h
> > > > index f32713f..f349158 100644
> > > > --- a/fs/xfs/xfs_fsops.h
> > > > +++ b/fs/xfs/xfs_fsops.h
> > > > @@ -26,4 +26,7 @@ extern int xfs_reserve_blocks(xfs_mount_t *mp, __uint64_t *inval,
> > > >  				xfs_fsop_resblks_t *outval);
> > > >  extern int xfs_fs_goingdown(xfs_mount_t *mp, __uint32_t inflags);
> > > >  
> > > > +extern int xfs_fs_reserve_ag_blocks(struct xfs_mount *mp);
> > > > +extern int xfs_fs_unreserve_ag_blocks(struct xfs_mount *mp);
> > > > +
> > > >  #endif	/* __XFS_FSOPS_H__ */
> > > > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> > > > index caecbd2..b5da81d 100644
> > > > --- a/fs/xfs/xfs_mount.c
> > > > +++ b/fs/xfs/xfs_mount.c
> > > > @@ -986,10 +986,17 @@ xfs_mountfs(
> > > >  			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> > > >  			goto out_quota;
> > > >  		}
> > > > +
> > > > +		/* Reserve AG blocks for future btree expansion. */
> > > > +		error = xfs_fs_reserve_ag_blocks(mp);
> > > > +		if (error && error != -ENOSPC)
> > > > +			goto out_agresv;
> > > >  	}
> > > >  
> > > >  	return 0;
> > > >  
> > > > + out_agresv:
> > > > +	xfs_fs_unreserve_ag_blocks(mp);
> > > >   out_quota:
> > > >  	xfs_qm_unmount_quotas(mp);
> > > >   out_rtunmount:
> > > > @@ -1034,6 +1041,7 @@ xfs_unmountfs(
> > > >  
> > > >  	cancel_delayed_work_sync(&mp->m_eofblocks_work);
> > > >  
> > > > +	xfs_fs_unreserve_ag_blocks(mp);
> > > >  	xfs_qm_unmount_quotas(mp);
> > > >  	xfs_rtunmount_inodes(mp);
> > > >  	IRELE(mp->m_rootip);
> > > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > > > index e6aaa91..875ab9f 100644
> > > > --- a/fs/xfs/xfs_super.c
> > > > +++ b/fs/xfs/xfs_super.c
> > > > @@ -1315,10 +1315,22 @@ xfs_fs_remount(
> > > >  			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> > > >  			return error;
> > > >  		}
> > > > +
> > > > +		/* Create the per-AG metadata reservation pool. */
> > > > +		error = xfs_fs_reserve_ag_blocks(mp);
> > > > +		if (error && error != -ENOSPC)
> > > > +			return error;
> > > >  	}
> > > >  
> > > >  	/* rw -> ro */
> > > >  	if (!(mp->m_flags & XFS_MOUNT_RDONLY) && (*flags & MS_RDONLY)) {
> > > > +		/* Free the per-AG metadata reservation pool. */
> > > > +		error = xfs_fs_unreserve_ag_blocks(mp);
> > > > +		if (error) {
> > > > +			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> > > > +			return error;
> > > > +		}
> > > > +
> > > >  		/*
> > > >  		 * Before we sync the metadata, we need to free up the reserve
> > > >  		 * block pool so that the used block count in the superblock on
> > > > 
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > the body of a message to majordomo@vger.kernel.org
> > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH 48/63] xfs: preallocate blocks for worst-case btree expansion
  2016-12-06 19:32         ` Darrick J. Wong
@ 2016-12-07 11:53           ` Brian Foster
  2016-12-08  6:14             ` Darrick J. Wong
  0 siblings, 1 reply; 187+ messages in thread
From: Brian Foster @ 2016-12-07 11:53 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs, Christoph Hellwig

On Tue, Dec 06, 2016 at 11:32:29AM -0800, Darrick J. Wong wrote:
> On Wed, Oct 12, 2016 at 06:42:36PM -0400, Brian Foster wrote:
> > On Wed, Oct 12, 2016 at 01:52:57PM -0700, Darrick J. Wong wrote:
> > > On Wed, Oct 12, 2016 at 02:44:51PM -0400, Brian Foster wrote:
> > > > On Thu, Sep 29, 2016 at 08:10:52PM -0700, Darrick J. Wong wrote:
> > > > > To gracefully handle the situation where a CoW operation turns a
> > > > > single refcount extent into a lot of tiny ones and then runs out of
> > > > > space when a tree split has to happen, use the per-AG reserved block
> > > > > pool to pre-allocate all the space we'll ever need for a maximal
> > > > > btree.  For a 4K block size, this only costs an overhead of 0.3% of
> > > > > available disk space.
> > > > > 
> > > > > When reflink is enabled, we have an unfortunate problem with rmap --
> > > > > since we can share a block billions of times, this means that the
> > > > > reverse mapping btree can expand basically infinitely.  When an AG is
> > > > > so full that there are no free blocks with which to expand the rmapbt,
> > > > > the filesystem will shut down hard.
> > > > > 
> > > > > This is rather annoying to the user, so use the AG reservation code to
> > > > > reserve a "reasonable" amount of space for rmap.  We'll prevent
> > > > > reflinks and CoW operations if we think we're getting close to
> > > > > exhausting an AG's free space rather than shutting down, but this
> > > > > permanent reservation should be enough for "most" users.  Hopefully.
> > > > > 
> > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > [hch@lst.de: ensure that we invalidate the freed btree buffer]
> > > > > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > > > > ---
> > > > > v2: Simplify the return value from xfs_perag_pool_free_block to a bool
> > > > > so that we can easily call xfs_trans_binval for both the per-AG pool
> > > > > and the real freeing case.  Without this we fail to invalidate the
> > > > > btree buffer and will trip over the write verifier on a shrinking
> > > > > refcount btree.
> > > > > 
> > > > > v3: Convert to the new per-AG reservation code.
> > > > > 
> > > > > v4: Combine this patch with the one that adds the rmapbt reservation,
> > > > > since the rmapbt reservation is only needed for reflink filesystems.
> > > > > 
> > > > > v5: If we detect errors while counting the refcount or rmap btrees,
> > > > > shut down the filesystem to avoid the scenario where the fs shuts down
> > > > > mid-transaction due to btree corruption, repair refuses to run until
> > > > > the log is clean, and the log cannot be cleaned because replay hits
> > > > > btree corruption and shuts down.
> > > > > ---
> > > > >  fs/xfs/libxfs/xfs_ag_resv.c        |   11 ++++++
> > > > >  fs/xfs/libxfs/xfs_refcount_btree.c |   45 ++++++++++++++++++++++++-
> > > > >  fs/xfs/libxfs/xfs_refcount_btree.h |    3 ++
> > > > >  fs/xfs/libxfs/xfs_rmap_btree.c     |   60 ++++++++++++++++++++++++++++++++++
> > > > >  fs/xfs/libxfs/xfs_rmap_btree.h     |    7 ++++
> > > > >  fs/xfs/xfs_fsops.c                 |   64 ++++++++++++++++++++++++++++++++++++
> > > > >  fs/xfs/xfs_fsops.h                 |    3 ++
> > > > >  fs/xfs/xfs_mount.c                 |    8 +++++
> > > > >  fs/xfs/xfs_super.c                 |   12 +++++++
> > > > >  9 files changed, 210 insertions(+), 3 deletions(-)
> > > > > 
> > > > > 
> > > > > diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c
> > > > > index e3ae0f2..adf770f 100644
> > > > > --- a/fs/xfs/libxfs/xfs_ag_resv.c
> > > > > +++ b/fs/xfs/libxfs/xfs_ag_resv.c
> > > > > @@ -38,6 +38,7 @@
> > > > >  #include "xfs_trans_space.h"
> > > > >  #include "xfs_rmap_btree.h"
> > > > >  #include "xfs_btree.h"
> > > > > +#include "xfs_refcount_btree.h"
> > > > >  
> > > > >  /*
> > > > >   * Per-AG Block Reservations
> > > > > @@ -228,6 +229,11 @@ xfs_ag_resv_init(
> > > > >  	if (pag->pag_meta_resv.ar_asked == 0) {
> > > > >  		ask = used = 0;
> > > > >  
> > > > > +		error = xfs_refcountbt_calc_reserves(pag->pag_mount,
> > > > > +				pag->pag_agno, &ask, &used);
> > > > > +		if (error)
> > > > > +			goto out;
> > > > > +
> > > > >  		error = __xfs_ag_resv_init(pag, XFS_AG_RESV_METADATA,
> > > > >  				ask, used);
> > > > 
> > > > Now that I get here, I see we have these per-ag reservation structures
> > > > and whatnot, but __xfs_ag_resv_init() (from a previous patch) calls
> > > > xfs_mod_fdblocks() for the reservation. AFAICT, that reserves from the
> > > > "global pool." Based on the commit log, isn't the intent here to reserve
> > > > blocks within each AG? What am I missing?
> > > 
> > > The AG reservation code "reserves" blocks in each AG by hiding them from
> > > the allocator.  They're all still there in the bnobt, but we underreport
> > > the length of the longest free extent and the free block count in that
> > > AG to make it look like there's less free space than there is.  Since
> > > those blocks are no longer generally available, we have to decrease the
> > > in-core free block count so we can't create delalloc reservations that
> > > the allocator won't (or can't) satisfy.
> > > 
> > 
> > Yep, I think I get the idea/purpose in principle. It sounds similar to
> > global reserve pool, where we set aside a count of unallocated blocks
> > via accounting magic such that we have some available in cases such as
> > the need to allocate a block to free an extent in low free space
> > conditions.
> 
> Correct.
> 
> > In this case, it looks like we reserve blocks in the same manner (via
> > xfs_mod_fdblocks()) and record the reservation in a new per-ag
> > reservation structure. The part I'm missing is how we guarantee those
> > blocks are accessible in the particular AG (or am I entirely mistaken
> > about the requirement that the per-AG reservation must reside within
> > that specific AG?).
> 
> You're correct there too.
> 
> > An example might clarify where my confusion lies... suppose we have a
> > non-standard configuration with a 1TB ag size and just barely enough
> > total filesystem size for a second AG, e.g., we have two AGs where AG 0
> > is 1TB and AG 1 is 16MB. Suppose that the reservation requirement (for
> > the sake of example, at least) based on sb_agblocks is larger than the
> > entire size of AG 1. Yet, the xfs_mod_fdblocks() call for the AG 1 res
> > struct will apparently succeed because there are plenty of blocks in
> > mp->m_fdblocks. Unless I'm mistaken, shouldn't it be impossible to
> > reserve this many blocks out of AG 1?
> 
> You're right, that is a bug.  We /ought/ to be calculating the
> reservation ask based on agf_length, not sb_agblocks.  I'll also have to
> fix growfs to change the reservation if the length of the last AG
> changes.
> 

Yep, makes sense.

IMO it would also be nice to see some kind of assertion at reservation
time that the AG can honor the reservation at the time it is made, since
IIUC that should always be enforced to be true (whether that be DEBUG
code or a simple warning or whatever... just a thought).
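(A toy version of the assertion being suggested here, with invented
names; the real check would presumably live in the per-AG reservation
init path and compare the ask against that AG's own free block count.)

```c
#include <assert.h>

/* Hypothetical per-AG state for the sketch; not the actual XFS API. */
struct toy_perag {
	unsigned long	ag_freeblks;	/* free blocks in this AG */
};

static int toy_ag_resv_init(struct toy_perag *pag, unsigned long ask)
{
	/* DEBUG-style check: the reservation must fit in this AG. */
	assert(ask <= pag->ag_freeblks);
	pag->ag_freeblks -= ask;	/* hide the reserved blocks */
	return 0;
}
```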

> > Even in the case where AG 1 is large enough for the reservation, what
> > actually prevents a sequence of single block allocations from using all
> > of the space in the AG? 
> 
> AFAICT, the allocator picks an AG and tries to fix the freelist before
> allocating blocks.  As part of ensuring the AGFL, we call
> xfs_alloc_space_available to decide if there's enough space in the AG
> both to satisfy the allocation request and to fix the freelist.
> 
> _a_s_a starts by determining the number of blocks that have to stay
> reserved in that AG for the given allocation type.  Then it calls
> xfs_alloc_longest_free_extent to find the longest free extent in the AG.
> 
> _a_l_f_e finds the longest extent and subtracts whatever part of the AG
> reservation it can't satisfy out of the non-longest free extents.
> 
> Upon returning from _a_l_f_e, _a_s_a rejects the allocation if the
> longest extent cannot satisfy the required minimum allocation with the
> given alignment constraints.
> 
> Next it calculates the space that would remain after the allocation,
> which is:
> 
> (free space + agfl blocks) - (ag reservation) - (minimum agfl length) -
>      (total blocks requested)
> 
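(The formula above, together with the minleft rejection Darrick
describes a little further down, condenses to a small standalone
predicate.  The function and parameter names here are invented for the
sketch; in the kernel this logic is part of xfs_alloc_space_available.)

```c
/*
 * Remaining space after a proposed allocation:
 *   (free space + agfl blocks) - (ag reservation)
 *     - (minimum agfl length) - (total blocks requested)
 * Reject the allocation if this is negative or below minleft.
 */
static int toy_space_would_remain(long freeblks, long agfl_blocks,
				  long ag_resv, long min_agfl_len,
				  long requested, long minleft)
{
	long remaining = (freeblks + agfl_blocks) - ag_resv
			 - min_agfl_len - requested;

	return remaining >= 0 && remaining >= minleft;
}
```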

Ah, Ok. I think I missed that this calculation was tweaked, I'm guessing
because that doesn't appear to have been changed in this patch (granted
this is an old series). Thus I didn't see how the reservation was
ultimately enforced on a particular AG. Makes sense now, thanks for the
explanation!

Brian

> If this quantity is less than zero (or less than args->minleft) then the
> allocation is also rejected.  I believe this should be sufficient to
> prevent a series of single block alloc requests from exhausting the AG
> since we're stopped from giving away reserved blocks that we're not
> entitled to, even if there are still records in the bnobt.
> 
> --D
> 
> > 
> > Brian
> > 
> > > Maybe a more concrete way to put that is: say we have 4 AGs with 4 agresv
> > > blocks each, and no other free space left anywhere.  The in-core fdblocks count
> > > should be 0 so that starting a write into a hole returns ENOSPC even if the
> > > write could be done without any btree shape changes.   Otherwise, writepages
> > > tries to allocate the delalloc reservation, fails to find any space because
> > > we've hidden it, and kaboom.
> > > 
> > > --D
> > > 
> > > > 
> > > > Brian
> > > > 
> > > > >  		if (error)
> > > > > @@ -238,6 +244,11 @@ xfs_ag_resv_init(
> > > > >  	if (pag->pag_agfl_resv.ar_asked == 0) {
> > > > >  		ask = used = 0;
> > > > >  
> > > > > +		error = xfs_rmapbt_calc_reserves(pag->pag_mount, pag->pag_agno,
> > > > > +				&ask, &used);
> > > > > +		if (error)
> > > > > +			goto out;
> > > > > +
> > > > >  		error = __xfs_ag_resv_init(pag, XFS_AG_RESV_AGFL, ask, used);
> > > > >  		if (error)
> > > > >  			goto out;
> > > > > diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
> > > > > index 6b5e82b9..453bb27 100644
> > > > > --- a/fs/xfs/libxfs/xfs_refcount_btree.c
> > > > > +++ b/fs/xfs/libxfs/xfs_refcount_btree.c
> > > > > @@ -79,6 +79,8 @@ xfs_refcountbt_alloc_block(
> > > > >  	struct xfs_alloc_arg	args;		/* block allocation args */
> > > > >  	int			error;		/* error return value */
> > > > >  
> > > > > +	XFS_BTREE_TRACE_CURSOR(cur, XBT_ENTRY);
> > > > > +
> > > > >  	memset(&args, 0, sizeof(args));
> > > > >  	args.tp = cur->bc_tp;
> > > > >  	args.mp = cur->bc_mp;
> > > > > @@ -88,6 +90,7 @@ xfs_refcountbt_alloc_block(
> > > > >  	args.firstblock = args.fsbno;
> > > > >  	xfs_rmap_ag_owner(&args.oinfo, XFS_RMAP_OWN_REFC);
> > > > >  	args.minlen = args.maxlen = args.prod = 1;
> > > > > +	args.resv = XFS_AG_RESV_METADATA;
> > > > >  
> > > > >  	error = xfs_alloc_vextent(&args);
> > > > >  	if (error)
> > > > > @@ -125,16 +128,19 @@ xfs_refcountbt_free_block(
> > > > >  	struct xfs_agf		*agf = XFS_BUF_TO_AGF(agbp);
> > > > >  	xfs_fsblock_t		fsbno = XFS_DADDR_TO_FSB(mp, XFS_BUF_ADDR(bp));
> > > > >  	struct xfs_owner_info	oinfo;
> > > > > +	int			error;
> > > > >  
> > > > >  	trace_xfs_refcountbt_free_block(cur->bc_mp, cur->bc_private.a.agno,
> > > > >  			XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno), 1);
> > > > >  	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_REFC);
> > > > >  	be32_add_cpu(&agf->agf_refcount_blocks, -1);
> > > > >  	xfs_alloc_log_agf(cur->bc_tp, agbp, XFS_AGF_REFCOUNT_BLOCKS);
> > > > > -	xfs_bmap_add_free(mp, cur->bc_private.a.dfops, fsbno, 1,
> > > > > -			&oinfo);
> > > > > +	error = xfs_free_extent(cur->bc_tp, fsbno, 1, &oinfo,
> > > > > +			XFS_AG_RESV_METADATA);
> > > > > +	if (error)
> > > > > +		return error;
> > > > >  
> > > > > -	return 0;
> > > > > +	return error;
> > > > >  }
> > > > >  
> > > > >  STATIC int
> > > > > @@ -410,3 +416,36 @@ xfs_refcountbt_max_size(
> > > > >  
> > > > >  	return xfs_refcountbt_calc_size(mp, mp->m_sb.sb_agblocks);
> > > > >  }
> > > > > +
> > > > > +/*
> > > > > + * Figure out how many blocks to reserve and how many are used by this btree.
> > > > > + */
> > > > > +int
> > > > > +xfs_refcountbt_calc_reserves(
> > > > > +	struct xfs_mount	*mp,
> > > > > +	xfs_agnumber_t		agno,
> > > > > +	xfs_extlen_t		*ask,
> > > > > +	xfs_extlen_t		*used)
> > > > > +{
> > > > > +	struct xfs_buf		*agbp;
> > > > > +	struct xfs_agf		*agf;
> > > > > +	xfs_extlen_t		tree_len;
> > > > > +	int			error;
> > > > > +
> > > > > +	if (!xfs_sb_version_hasreflink(&mp->m_sb))
> > > > > +		return 0;
> > > > > +
> > > > > +	*ask += xfs_refcountbt_max_size(mp);
> > > > > +
> > > > > +	error = xfs_alloc_read_agf(mp, NULL, agno, 0, &agbp);
> > > > > +	if (error)
> > > > > +		return error;
> > > > > +
> > > > > +	agf = XFS_BUF_TO_AGF(agbp);
> > > > > +	tree_len = be32_to_cpu(agf->agf_refcount_blocks);
> > > > > +	xfs_buf_relse(agbp);
> > > > > +
> > > > > +	*used += tree_len;
> > > > > +
> > > > > +	return error;
> > > > > +}
> > > > > diff --git a/fs/xfs/libxfs/xfs_refcount_btree.h b/fs/xfs/libxfs/xfs_refcount_btree.h
> > > > > index 780b02f..3be7768 100644
> > > > > --- a/fs/xfs/libxfs/xfs_refcount_btree.h
> > > > > +++ b/fs/xfs/libxfs/xfs_refcount_btree.h
> > > > > @@ -68,4 +68,7 @@ extern xfs_extlen_t xfs_refcountbt_calc_size(struct xfs_mount *mp,
> > > > >  		unsigned long long len);
> > > > >  extern xfs_extlen_t xfs_refcountbt_max_size(struct xfs_mount *mp);
> > > > >  
> > > > > +extern int xfs_refcountbt_calc_reserves(struct xfs_mount *mp,
> > > > > +		xfs_agnumber_t agno, xfs_extlen_t *ask, xfs_extlen_t *used);
> > > > > +
> > > > >  #endif	/* __XFS_REFCOUNT_BTREE_H__ */
> > > > > diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
> > > > > index 9c0585e..83e672f 100644
> > > > > --- a/fs/xfs/libxfs/xfs_rmap_btree.c
> > > > > +++ b/fs/xfs/libxfs/xfs_rmap_btree.c
> > > > > @@ -35,6 +35,7 @@
> > > > >  #include "xfs_cksum.h"
> > > > >  #include "xfs_error.h"
> > > > >  #include "xfs_extent_busy.h"
> > > > > +#include "xfs_ag_resv.h"
> > > > >  
> > > > >  /*
> > > > >   * Reverse map btree.
> > > > > @@ -533,3 +534,62 @@ xfs_rmapbt_compute_maxlevels(
> > > > >  		mp->m_rmap_maxlevels = xfs_btree_compute_maxlevels(mp,
> > > > >  				mp->m_rmap_mnr, mp->m_sb.sb_agblocks);
> > > > >  }
> > > > > +
> > > > > +/* Calculate the rmap btree size for some records. */
> > > > > +xfs_extlen_t
> > > > > +xfs_rmapbt_calc_size(
> > > > > +	struct xfs_mount	*mp,
> > > > > +	unsigned long long	len)
> > > > > +{
> > > > > +	return xfs_btree_calc_size(mp, mp->m_rmap_mnr, len);
> > > > > +}
> > > > > +
> > > > > +/*
> > > > > + * Calculate the maximum rmap btree size.
> > > > > + */
> > > > > +xfs_extlen_t
> > > > > +xfs_rmapbt_max_size(
> > > > > +	struct xfs_mount	*mp)
> > > > > +{
> > > > > +	/* Bail out if we're uninitialized, which can happen in mkfs. */
> > > > > +	if (mp->m_rmap_mxr[0] == 0)
> > > > > +		return 0;
> > > > > +
> > > > > +	return xfs_rmapbt_calc_size(mp, mp->m_sb.sb_agblocks);
> > > > > +}
> > > > > +
> > > > > +/*
> > > > > + * Figure out how many blocks to reserve and how many are used by this btree.
> > > > > + */
> > > > > +int
> > > > > +xfs_rmapbt_calc_reserves(
> > > > > +	struct xfs_mount	*mp,
> > > > > +	xfs_agnumber_t		agno,
> > > > > +	xfs_extlen_t		*ask,
> > > > > +	xfs_extlen_t		*used)
> > > > > +{
> > > > > +	struct xfs_buf		*agbp;
> > > > > +	struct xfs_agf		*agf;
> > > > > +	xfs_extlen_t		pool_len;
> > > > > +	xfs_extlen_t		tree_len;
> > > > > +	int			error;
> > > > > +
> > > > > +	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
> > > > > +		return 0;
> > > > > +
> > > > > +	/* Reserve 1% of the AG or enough for 1 block per record. */
> > > > > +	pool_len = max(mp->m_sb.sb_agblocks / 100, xfs_rmapbt_max_size(mp));
> > > > > +	*ask += pool_len;
> > > > > +
> > > > > +	error = xfs_alloc_read_agf(mp, NULL, agno, 0, &agbp);
> > > > > +	if (error)
> > > > > +		return error;
> > > > > +
> > > > > +	agf = XFS_BUF_TO_AGF(agbp);
> > > > > +	tree_len = be32_to_cpu(agf->agf_rmap_blocks);
> > > > > +	xfs_buf_relse(agbp);
> > > > > +
> > > > > +	*used += tree_len;
> > > > > +
> > > > > +	return error;
> > > > > +}
> > > > > diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
> > > > > index e73a553..2a9ac47 100644
> > > > > --- a/fs/xfs/libxfs/xfs_rmap_btree.h
> > > > > +++ b/fs/xfs/libxfs/xfs_rmap_btree.h
> > > > > @@ -58,4 +58,11 @@ struct xfs_btree_cur *xfs_rmapbt_init_cursor(struct xfs_mount *mp,
> > > > >  int xfs_rmapbt_maxrecs(struct xfs_mount *mp, int blocklen, int leaf);
> > > > >  extern void xfs_rmapbt_compute_maxlevels(struct xfs_mount *mp);
> > > > >  
> > > > > +extern xfs_extlen_t xfs_rmapbt_calc_size(struct xfs_mount *mp,
> > > > > +		unsigned long long len);
> > > > > +extern xfs_extlen_t xfs_rmapbt_max_size(struct xfs_mount *mp);
> > > > > +
> > > > > +extern int xfs_rmapbt_calc_reserves(struct xfs_mount *mp,
> > > > > +		xfs_agnumber_t agno, xfs_extlen_t *ask, xfs_extlen_t *used);
> > > > > +
> > > > >  #endif	/* __XFS_RMAP_BTREE_H__ */
> > > > > diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> > > > > index 3acbf4e0..93d12fa 100644
> > > > > --- a/fs/xfs/xfs_fsops.c
> > > > > +++ b/fs/xfs/xfs_fsops.c
> > > > > @@ -43,6 +43,7 @@
> > > > >  #include "xfs_log.h"
> > > > >  #include "xfs_filestream.h"
> > > > >  #include "xfs_rmap.h"
> > > > > +#include "xfs_ag_resv.h"
> > > > >  
> > > > >  /*
> > > > >   * File system operations
> > > > > @@ -630,6 +631,11 @@ xfs_growfs_data_private(
> > > > >  	xfs_set_low_space_thresholds(mp);
> > > > >  	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
> > > > >  
> > > > > +	/* Reserve AG metadata blocks. */
> > > > > +	error = xfs_fs_reserve_ag_blocks(mp);
> > > > > +	if (error && error != -ENOSPC)
> > > > > +		goto out;
> > > > > +
> > > > >  	/* update secondary superblocks. */
> > > > >  	for (agno = 1; agno < nagcount; agno++) {
> > > > >  		error = 0;
> > > > > @@ -680,6 +686,8 @@ xfs_growfs_data_private(
> > > > >  			continue;
> > > > >  		}
> > > > >  	}
> > > > > +
> > > > > + out:
> > > > >  	return saved_error ? saved_error : error;
> > > > >  
> > > > >   error0:
> > > > > @@ -989,3 +997,59 @@ xfs_do_force_shutdown(
> > > > >  	"Please umount the filesystem and rectify the problem(s)");
> > > > >  	}
> > > > >  }
> > > > > +
> > > > > +/*
> > > > > + * Reserve free space for per-AG metadata.
> > > > > + */
> > > > > +int
> > > > > +xfs_fs_reserve_ag_blocks(
> > > > > +	struct xfs_mount	*mp)
> > > > > +{
> > > > > +	xfs_agnumber_t		agno;
> > > > > +	struct xfs_perag	*pag;
> > > > > +	int			error = 0;
> > > > > +	int			err2;
> > > > > +
> > > > > +	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
> > > > > +		pag = xfs_perag_get(mp, agno);
> > > > > +		err2 = xfs_ag_resv_init(pag);
> > > > > +		xfs_perag_put(pag);
> > > > > +		if (err2 && !error)
> > > > > +			error = err2;
> > > > > +	}
> > > > > +
> > > > > +	if (error && error != -ENOSPC) {
> > > > > +		xfs_warn(mp,
> > > > > +	"Error %d reserving per-AG metadata reserve pool.", error);
> > > > > +		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> > > > > +	}
> > > > > +
> > > > > +	return error;
> > > > > +}
> > > > > +
> > > > > +/*
> > > > > + * Free space reserved for per-AG metadata.
> > > > > + */
> > > > > +int
> > > > > +xfs_fs_unreserve_ag_blocks(
> > > > > +	struct xfs_mount	*mp)
> > > > > +{
> > > > > +	xfs_agnumber_t		agno;
> > > > > +	struct xfs_perag	*pag;
> > > > > +	int			error = 0;
> > > > > +	int			err2;
> > > > > +
> > > > > +	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
> > > > > +		pag = xfs_perag_get(mp, agno);
> > > > > +		err2 = xfs_ag_resv_free(pag);
> > > > > +		xfs_perag_put(pag);
> > > > > +		if (err2 && !error)
> > > > > +			error = err2;
> > > > > +	}
> > > > > +
> > > > > +	if (error)
> > > > > +		xfs_warn(mp,
> > > > > +	"Error %d freeing per-AG metadata reserve pool.", error);
> > > > > +
> > > > > +	return error;
> > > > > +}
> > > > > diff --git a/fs/xfs/xfs_fsops.h b/fs/xfs/xfs_fsops.h
> > > > > index f32713f..f349158 100644
> > > > > --- a/fs/xfs/xfs_fsops.h
> > > > > +++ b/fs/xfs/xfs_fsops.h
> > > > > @@ -26,4 +26,7 @@ extern int xfs_reserve_blocks(xfs_mount_t *mp, __uint64_t *inval,
> > > > >  				xfs_fsop_resblks_t *outval);
> > > > >  extern int xfs_fs_goingdown(xfs_mount_t *mp, __uint32_t inflags);
> > > > >  
> > > > > +extern int xfs_fs_reserve_ag_blocks(struct xfs_mount *mp);
> > > > > +extern int xfs_fs_unreserve_ag_blocks(struct xfs_mount *mp);
> > > > > +
> > > > >  #endif	/* __XFS_FSOPS_H__ */
> > > > > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> > > > > index caecbd2..b5da81d 100644
> > > > > --- a/fs/xfs/xfs_mount.c
> > > > > +++ b/fs/xfs/xfs_mount.c
> > > > > @@ -986,10 +986,17 @@ xfs_mountfs(
> > > > >  			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> > > > >  			goto out_quota;
> > > > >  		}
> > > > > +
> > > > > +		/* Reserve AG blocks for future btree expansion. */
> > > > > +		error = xfs_fs_reserve_ag_blocks(mp);
> > > > > +		if (error && error != -ENOSPC)
> > > > > +			goto out_agresv;
> > > > >  	}
> > > > >  
> > > > >  	return 0;
> > > > >  
> > > > > + out_agresv:
> > > > > +	xfs_fs_unreserve_ag_blocks(mp);
> > > > >   out_quota:
> > > > >  	xfs_qm_unmount_quotas(mp);
> > > > >   out_rtunmount:
> > > > > @@ -1034,6 +1041,7 @@ xfs_unmountfs(
> > > > >  
> > > > >  	cancel_delayed_work_sync(&mp->m_eofblocks_work);
> > > > >  
> > > > > +	xfs_fs_unreserve_ag_blocks(mp);
> > > > >  	xfs_qm_unmount_quotas(mp);
> > > > >  	xfs_rtunmount_inodes(mp);
> > > > >  	IRELE(mp->m_rootip);
> > > > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > > > > index e6aaa91..875ab9f 100644
> > > > > --- a/fs/xfs/xfs_super.c
> > > > > +++ b/fs/xfs/xfs_super.c
> > > > > @@ -1315,10 +1315,22 @@ xfs_fs_remount(
> > > > >  			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> > > > >  			return error;
> > > > >  		}
> > > > > +
> > > > > +		/* Create the per-AG metadata reservation pool. */
> > > > > +		error = xfs_fs_reserve_ag_blocks(mp);
> > > > > +		if (error && error != -ENOSPC)
> > > > > +			return error;
> > > > >  	}
> > > > >  
> > > > >  	/* rw -> ro */
> > > > >  	if (!(mp->m_flags & XFS_MOUNT_RDONLY) && (*flags & MS_RDONLY)) {
> > > > > +		/* Free the per-AG metadata reservation pool. */
> > > > > +		error = xfs_fs_unreserve_ag_blocks(mp);
> > > > > +		if (error) {
> > > > > +			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> > > > > +			return error;
> > > > > +		}
> > > > > +
> > > > >  		/*
> > > > >  		 * Before we sync the metadata, we need to free up the reserve
> > > > >  		 * block pool so that the used block count in the superblock on
> > > > > 
> > > > > --
> > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > the body of a message to majordomo@vger.kernel.org
> > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH 48/63] xfs: preallocate blocks for worst-case btree expansion
  2016-12-07 11:53           ` Brian Foster
@ 2016-12-08  6:14             ` Darrick J. Wong
  0 siblings, 0 replies; 187+ messages in thread
From: Darrick J. Wong @ 2016-12-08  6:14 UTC (permalink / raw)
  To: Brian Foster; +Cc: david, linux-xfs, Christoph Hellwig

On Wed, Dec 07, 2016 at 06:53:24AM -0500, Brian Foster wrote:
> On Tue, Dec 06, 2016 at 11:32:29AM -0800, Darrick J. Wong wrote:
> > On Wed, Oct 12, 2016 at 06:42:36PM -0400, Brian Foster wrote:
> > > On Wed, Oct 12, 2016 at 01:52:57PM -0700, Darrick J. Wong wrote:
> > > > On Wed, Oct 12, 2016 at 02:44:51PM -0400, Brian Foster wrote:
> > > > > On Thu, Sep 29, 2016 at 08:10:52PM -0700, Darrick J. Wong wrote:
> > > > > > To gracefully handle the situation where a CoW operation turns a
> > > > > > single refcount extent into a lot of tiny ones and then run out of
> > > > > > space when a tree split has to happen, use the per-AG reserved block
> > > > > > pool to pre-allocate all the space we'll ever need for a maximal
> > > > > > btree.  For a 4K block size, this only costs an overhead of 0.3% of
> > > > > > available disk space.
> > > > > > 
> > > > > > When reflink is enabled, we have an unfortunate problem with rmap --
> > > > > > since we can share a block billions of times, this means that the
> > > > > > reverse mapping btree can expand basically infinitely.  When an AG is
> > > > > > so full that there are no free blocks with which to expand the rmapbt,
> > > > > > the filesystem will shut down hard.
> > > > > > 
> > > > > > This is rather annoying to the user, so use the AG reservation code to
> > > > > > reserve a "reasonable" amount of space for rmap.  We'll prevent
> > > > > > reflinks and CoW operations if we think we're getting close to
> > > > > > exhausting an AG's free space rather than shutting down, but this
> > > > > > permanent reservation should be enough for "most" users.  Hopefully.
> > > > > > 
> > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > [hch@lst.de: ensure that we invalidate the freed btree buffer]
> > > > > > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > > > > > ---
> > > > > > v2: Simplify the return value from xfs_perag_pool_free_block to a bool
> > > > > > so that we can easily call xfs_trans_binval for both the per-AG pool
> > > > > > and the real freeing case.  Without this we fail to invalidate the
> > > > > > btree buffer and will trip over the write verifier on a shrinking
> > > > > > refcount btree.
> > > > > > 
> > > > > > v3: Convert to the new per-AG reservation code.
> > > > > > 
> > > > > > v4: Combine this patch with the one that adds the rmapbt reservation,
> > > > > > since the rmapbt reservation is only needed for reflink filesystems.
> > > > > > 
> > > > > > v5: If we detect errors while counting the refcount or rmap btrees,
> > > > > > shut down the filesystem to avoid the scenario where the fs shuts down
> > > > > > mid-transaction due to btree corruption, repair refuses to run until
> > > > > > the log is clean, and the log cannot be cleaned because replay hits
> > > > > > btree corruption and shuts down.
> > > > > > ---
> > > > > >  fs/xfs/libxfs/xfs_ag_resv.c        |   11 ++++++
> > > > > >  fs/xfs/libxfs/xfs_refcount_btree.c |   45 ++++++++++++++++++++++++-
> > > > > >  fs/xfs/libxfs/xfs_refcount_btree.h |    3 ++
> > > > > >  fs/xfs/libxfs/xfs_rmap_btree.c     |   60 ++++++++++++++++++++++++++++++++++
> > > > > >  fs/xfs/libxfs/xfs_rmap_btree.h     |    7 ++++
> > > > > >  fs/xfs/xfs_fsops.c                 |   64 ++++++++++++++++++++++++++++++++++++
> > > > > >  fs/xfs/xfs_fsops.h                 |    3 ++
> > > > > >  fs/xfs/xfs_mount.c                 |    8 +++++
> > > > > >  fs/xfs/xfs_super.c                 |   12 +++++++
> > > > > >  9 files changed, 210 insertions(+), 3 deletions(-)
> > > > > > 
> > > > > > 
> > > > > > diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c
> > > > > > index e3ae0f2..adf770f 100644
> > > > > > --- a/fs/xfs/libxfs/xfs_ag_resv.c
> > > > > > +++ b/fs/xfs/libxfs/xfs_ag_resv.c
> > > > > > @@ -38,6 +38,7 @@
> > > > > >  #include "xfs_trans_space.h"
> > > > > >  #include "xfs_rmap_btree.h"
> > > > > >  #include "xfs_btree.h"
> > > > > > +#include "xfs_refcount_btree.h"
> > > > > >  
> > > > > >  /*
> > > > > >   * Per-AG Block Reservations
> > > > > > @@ -228,6 +229,11 @@ xfs_ag_resv_init(
> > > > > >  	if (pag->pag_meta_resv.ar_asked == 0) {
> > > > > >  		ask = used = 0;
> > > > > >  
> > > > > > +		error = xfs_refcountbt_calc_reserves(pag->pag_mount,
> > > > > > +				pag->pag_agno, &ask, &used);
> > > > > > +		if (error)
> > > > > > +			goto out;
> > > > > > +
> > > > > >  		error = __xfs_ag_resv_init(pag, XFS_AG_RESV_METADATA,
> > > > > >  				ask, used);
> > > > > 
> > > > > Now that I get here, I see we have these per-ag reservation structures
> > > > > and whatnot, but __xfs_ag_resv_init() (from a previous patch) calls
> > > > > xfs_mod_fdblocks() for the reservation. AFAICT, that reserves from the
> > > > > "global pool." Based on the commit log, isn't the intent here to reserve
> > > > > blocks within each AG? What am I missing?
> > > > 
> > > > The AG reservation code "reserves" blocks in each AG by hiding them from
> > > > the allocator.  They're all still there in the bnobt, but we underreport
> > > > the length of the longest free extent and the free block count in that
> > > > AG to make it look like there's less free space than there is.  Since
> > > > those blocks are no longer generally available, we have to decrease the
> > > > in-core free block count so we can't create delalloc reservations that
> > > > the allocator won't (or can't) satisfy.
> > > > 
> > > 
> > > Yep, I think I get the idea/purpose in principle. It sounds similar to
> > > the global reserve pool, where we set aside a count of unallocated blocks
> > > via accounting magic such that we have some available in cases such as
> > > the need to allocate a block to free an extent in low free space
> > > conditions.
> > 
> > Correct.
> > 
> > > In this case, it looks like we reserve blocks in the same manner (via
> > > xfs_mod_fdblocks()) and record the reservation in a new per-ag
> > > reservation structure. The part I'm missing is how we guarantee those
> > > blocks are accessible in the particular AG (or am I entirely mistaken
> > > about the requirement that the per-AG reservation must reside within
> > > that specific AG?).
> > 
> > You're correct there too.
> > 
> > > An example might clarify where my confusion lies... suppose we have a
> > > non-standard configuration with a 1TB ag size and just barely enough
> > > total filesystem size for a second AG, e.g., we have two AGs where AG 0
> > > is 1TB and AG 1 is 16MB. Suppose that the reservation requirement (for
> > > the sake of example, at least) based on sb_agblocks is larger than the
> > > entire size of AG 1. Yet, the xfs_mod_fdblocks() call for the AG 1 res
> > > struct will apparently succeed because there are plenty of blocks in
> > > mp->m_fdblocks. Unless I'm mistaken, shouldn't it be impossible to
> > > reserve this many blocks out of AG 1?
> > 
> > You're right, that is a bug.  We /ought/ to be calculating the
> > reservation ask based on agf_length, not sb_agblocks.  I'll also have to
> > fix growfs to change the reservation if the length of the last AG
> > changes.
> > 
> 
> Yep, makes sense.
> 
> IMO it would also be nice to see some kind of assertion at reservation
> time that the AG can honor the reservation at the time it is made, since
> IIUC that should always be enforced to be true (whether that be DEBUG
> code or a simple warning or whatever... just a thought).

Ok, I'll put an ASSERT into the patch.

--D

> > > Even in the case where AG 1 is large enough for the reservation, what
> > > actually prevents a sequence of single block allocations from using all
> > > of the space in the AG? 
> > 
> > AFAICT, the allocator picks an AG and tries to fix the freelist before
> > allocating blocks.  As part of ensuring the AGFL, we call
> > xfs_alloc_space_available to decide if there's enough space in the AG
> > both to satisfy the allocation request and to fix the freelist.
> > 
> > _a_s_a starts by determining the number of blocks that have to stay
> > reserved in that AG for the given allocation type.  Then it calls
> > xfs_alloc_longest_free_extent to find the longest free extent in the AG.
> > 
> > _a_l_f_e finds the longest extent and subtracts whatever part of the AG
> > reservation it can't satisfy out of the non-longest free extents.
> > 
> > Upon returning from _a_l_f_e, _a_s_a rejects the allocation if the
> > longest extent cannot satisfy the required minimum allocation with the
> > given alignment constraints.
> > 
> > Next it calculates the space that would remain after the allocation,
> > which is:
> > 
> > (free space + agfl blocks) - (ag reservation) - (minimum agfl length) -
> >      (total blocks requested)
> > 
> 
> Ah, Ok. I think I missed that this calculation was tweaked, I'm guessing
> because that doesn't appear to have been changed in this patch (granted
> this is an old series). Thus I didn't see how the reservation was
> ultimately enforced on a particular AG. Makes sense now, thanks for the
> explanation!
> 
> Brian
> 
> > If this quantity is less than zero (or less than args->minleft) then the
> > allocation is also rejected.  I believe this should be sufficient to
> > prevent a series of single block alloc requests from exhausting the AG
> > since we're stopped from giving away reserved blocks that we're not
> > entitled to, even if there are still records in the bnobt.
> > 
> > --D
> > 
> > > 
> > > Brian
> > > 
> > > > Maybe a more concrete way to put that is: say we have 4 AGs with 4 agresv
> > > > blocks each, and no other free space left anywhere.  The in-core fdblocks count
> > > > should be 0 so that starting a write into a hole returns ENOSPC even if the
> > > > write could be done without any btree shape changes.   Otherwise, writepages
> > > > tries to allocate the delalloc reservation, fails to find any space because
> > > > we've hidden it, and kaboom.
> > > > 
> > > > --D
> > > > 
> > > > > 
> > > > > Brian
> > > > > 
> > > > > >  		if (error)
> > > > > > @@ -238,6 +244,11 @@ xfs_ag_resv_init(
> > > > > >  	if (pag->pag_agfl_resv.ar_asked == 0) {
> > > > > >  		ask = used = 0;
> > > > > >  
> > > > > > +		error = xfs_rmapbt_calc_reserves(pag->pag_mount, pag->pag_agno,
> > > > > > +				&ask, &used);
> > > > > > +		if (error)
> > > > > > +			goto out;
> > > > > > +
> > > > > >  		error = __xfs_ag_resv_init(pag, XFS_AG_RESV_AGFL, ask, used);
> > > > > >  		if (error)
> > > > > >  			goto out;
> > > > > > diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
> > > > > > index 6b5e82b9..453bb27 100644
> > > > > > --- a/fs/xfs/libxfs/xfs_refcount_btree.c
> > > > > > +++ b/fs/xfs/libxfs/xfs_refcount_btree.c
> > > > > > @@ -79,6 +79,8 @@ xfs_refcountbt_alloc_block(
> > > > > >  	struct xfs_alloc_arg	args;		/* block allocation args */
> > > > > >  	int			error;		/* error return value */
> > > > > >  
> > > > > > +	XFS_BTREE_TRACE_CURSOR(cur, XBT_ENTRY);
> > > > > > +
> > > > > >  	memset(&args, 0, sizeof(args));
> > > > > >  	args.tp = cur->bc_tp;
> > > > > >  	args.mp = cur->bc_mp;
> > > > > > @@ -88,6 +90,7 @@ xfs_refcountbt_alloc_block(
> > > > > >  	args.firstblock = args.fsbno;
> > > > > >  	xfs_rmap_ag_owner(&args.oinfo, XFS_RMAP_OWN_REFC);
> > > > > >  	args.minlen = args.maxlen = args.prod = 1;
> > > > > > +	args.resv = XFS_AG_RESV_METADATA;
> > > > > >  
> > > > > >  	error = xfs_alloc_vextent(&args);
> > > > > >  	if (error)
> > > > > > @@ -125,16 +128,19 @@ xfs_refcountbt_free_block(
> > > > > >  	struct xfs_agf		*agf = XFS_BUF_TO_AGF(agbp);
> > > > > >  	xfs_fsblock_t		fsbno = XFS_DADDR_TO_FSB(mp, XFS_BUF_ADDR(bp));
> > > > > >  	struct xfs_owner_info	oinfo;
> > > > > > +	int			error;
> > > > > >  
> > > > > >  	trace_xfs_refcountbt_free_block(cur->bc_mp, cur->bc_private.a.agno,
> > > > > >  			XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno), 1);
> > > > > >  	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_REFC);
> > > > > >  	be32_add_cpu(&agf->agf_refcount_blocks, -1);
> > > > > >  	xfs_alloc_log_agf(cur->bc_tp, agbp, XFS_AGF_REFCOUNT_BLOCKS);
> > > > > > -	xfs_bmap_add_free(mp, cur->bc_private.a.dfops, fsbno, 1,
> > > > > > -			&oinfo);
> > > > > > +	error = xfs_free_extent(cur->bc_tp, fsbno, 1, &oinfo,
> > > > > > +			XFS_AG_RESV_METADATA);
> > > > > > +	if (error)
> > > > > > +		return error;
> > > > > >  
> > > > > > -	return 0;
> > > > > > +	return error;
> > > > > >  }
> > > > > >  
> > > > > >  STATIC int
> > > > > > @@ -410,3 +416,36 @@ xfs_refcountbt_max_size(
> > > > > >  
> > > > > >  	return xfs_refcountbt_calc_size(mp, mp->m_sb.sb_agblocks);
> > > > > >  }
> > > > > > +
> > > > > > +/*
> > > > > > + * Figure out how many blocks to reserve and how many are used by this btree.
> > > > > > + */
> > > > > > +int
> > > > > > +xfs_refcountbt_calc_reserves(
> > > > > > +	struct xfs_mount	*mp,
> > > > > > +	xfs_agnumber_t		agno,
> > > > > > +	xfs_extlen_t		*ask,
> > > > > > +	xfs_extlen_t		*used)
> > > > > > +{
> > > > > > +	struct xfs_buf		*agbp;
> > > > > > +	struct xfs_agf		*agf;
> > > > > > +	xfs_extlen_t		tree_len;
> > > > > > +	int			error;
> > > > > > +
> > > > > > +	if (!xfs_sb_version_hasreflink(&mp->m_sb))
> > > > > > +		return 0;
> > > > > > +
> > > > > > +	*ask += xfs_refcountbt_max_size(mp);
> > > > > > +
> > > > > > +	error = xfs_alloc_read_agf(mp, NULL, agno, 0, &agbp);
> > > > > > +	if (error)
> > > > > > +		return error;
> > > > > > +
> > > > > > +	agf = XFS_BUF_TO_AGF(agbp);
> > > > > > +	tree_len = be32_to_cpu(agf->agf_refcount_blocks);
> > > > > > +	xfs_buf_relse(agbp);
> > > > > > +
> > > > > > +	*used += tree_len;
> > > > > > +
> > > > > > +	return error;
> > > > > > +}
> > > > > > diff --git a/fs/xfs/libxfs/xfs_refcount_btree.h b/fs/xfs/libxfs/xfs_refcount_btree.h
> > > > > > index 780b02f..3be7768 100644
> > > > > > --- a/fs/xfs/libxfs/xfs_refcount_btree.h
> > > > > > +++ b/fs/xfs/libxfs/xfs_refcount_btree.h
> > > > > > @@ -68,4 +68,7 @@ extern xfs_extlen_t xfs_refcountbt_calc_size(struct xfs_mount *mp,
> > > > > >  		unsigned long long len);
> > > > > >  extern xfs_extlen_t xfs_refcountbt_max_size(struct xfs_mount *mp);
> > > > > >  
> > > > > > +extern int xfs_refcountbt_calc_reserves(struct xfs_mount *mp,
> > > > > > +		xfs_agnumber_t agno, xfs_extlen_t *ask, xfs_extlen_t *used);
> > > > > > +
> > > > > >  #endif	/* __XFS_REFCOUNT_BTREE_H__ */
> > > > > > diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
> > > > > > index 9c0585e..83e672f 100644
> > > > > > --- a/fs/xfs/libxfs/xfs_rmap_btree.c
> > > > > > +++ b/fs/xfs/libxfs/xfs_rmap_btree.c
> > > > > > @@ -35,6 +35,7 @@
> > > > > >  #include "xfs_cksum.h"
> > > > > >  #include "xfs_error.h"
> > > > > >  #include "xfs_extent_busy.h"
> > > > > > +#include "xfs_ag_resv.h"
> > > > > >  
> > > > > >  /*
> > > > > >   * Reverse map btree.
> > > > > > @@ -533,3 +534,62 @@ xfs_rmapbt_compute_maxlevels(
> > > > > >  		mp->m_rmap_maxlevels = xfs_btree_compute_maxlevels(mp,
> > > > > >  				mp->m_rmap_mnr, mp->m_sb.sb_agblocks);
> > > > > >  }
> > > > > > +
> > > > > > +/* Calculate the rmap btree size for some records. */
> > > > > > +xfs_extlen_t
> > > > > > +xfs_rmapbt_calc_size(
> > > > > > +	struct xfs_mount	*mp,
> > > > > > +	unsigned long long	len)
> > > > > > +{
> > > > > > +	return xfs_btree_calc_size(mp, mp->m_rmap_mnr, len);
> > > > > > +}
> > > > > > +
> > > > > > +/*
> > > > > > + * Calculate the maximum rmap btree size.
> > > > > > + */
> > > > > > +xfs_extlen_t
> > > > > > +xfs_rmapbt_max_size(
> > > > > > +	struct xfs_mount	*mp)
> > > > > > +{
> > > > > > +	/* Bail out if we're uninitialized, which can happen in mkfs. */
> > > > > > +	if (mp->m_rmap_mxr[0] == 0)
> > > > > > +		return 0;
> > > > > > +
> > > > > > +	return xfs_rmapbt_calc_size(mp, mp->m_sb.sb_agblocks);
> > > > > > +}
> > > > > > +
> > > > > > +/*
> > > > > > + * Figure out how many blocks to reserve and how many are used by this btree.
> > > > > > + */
> > > > > > +int
> > > > > > +xfs_rmapbt_calc_reserves(
> > > > > > +	struct xfs_mount	*mp,
> > > > > > +	xfs_agnumber_t		agno,
> > > > > > +	xfs_extlen_t		*ask,
> > > > > > +	xfs_extlen_t		*used)
> > > > > > +{
> > > > > > +	struct xfs_buf		*agbp;
> > > > > > +	struct xfs_agf		*agf;
> > > > > > +	xfs_extlen_t		pool_len;
> > > > > > +	xfs_extlen_t		tree_len;
> > > > > > +	int			error;
> > > > > > +
> > > > > > +	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
> > > > > > +		return 0;
> > > > > > +
> > > > > > +	/* Reserve 1% of the AG or enough for 1 block per record. */
> > > > > > +	pool_len = max(mp->m_sb.sb_agblocks / 100, xfs_rmapbt_max_size(mp));
> > > > > > +	*ask += pool_len;
> > > > > > +
> > > > > > +	error = xfs_alloc_read_agf(mp, NULL, agno, 0, &agbp);
> > > > > > +	if (error)
> > > > > > +		return error;
> > > > > > +
> > > > > > +	agf = XFS_BUF_TO_AGF(agbp);
> > > > > > +	tree_len = be32_to_cpu(agf->agf_rmap_blocks);
> > > > > > +	xfs_buf_relse(agbp);
> > > > > > +
> > > > > > +	*used += tree_len;
> > > > > > +
> > > > > > +	return error;
> > > > > > +}
> > > > > > diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
> > > > > > index e73a553..2a9ac47 100644
> > > > > > --- a/fs/xfs/libxfs/xfs_rmap_btree.h
> > > > > > +++ b/fs/xfs/libxfs/xfs_rmap_btree.h
> > > > > > @@ -58,4 +58,11 @@ struct xfs_btree_cur *xfs_rmapbt_init_cursor(struct xfs_mount *mp,
> > > > > >  int xfs_rmapbt_maxrecs(struct xfs_mount *mp, int blocklen, int leaf);
> > > > > >  extern void xfs_rmapbt_compute_maxlevels(struct xfs_mount *mp);
> > > > > >  
> > > > > > +extern xfs_extlen_t xfs_rmapbt_calc_size(struct xfs_mount *mp,
> > > > > > +		unsigned long long len);
> > > > > > +extern xfs_extlen_t xfs_rmapbt_max_size(struct xfs_mount *mp);
> > > > > > +
> > > > > > +extern int xfs_rmapbt_calc_reserves(struct xfs_mount *mp,
> > > > > > +		xfs_agnumber_t agno, xfs_extlen_t *ask, xfs_extlen_t *used);
> > > > > > +
> > > > > >  #endif	/* __XFS_RMAP_BTREE_H__ */
> > > > > > diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> > > > > > index 3acbf4e0..93d12fa 100644
> > > > > > --- a/fs/xfs/xfs_fsops.c
> > > > > > +++ b/fs/xfs/xfs_fsops.c
> > > > > > @@ -43,6 +43,7 @@
> > > > > >  #include "xfs_log.h"
> > > > > >  #include "xfs_filestream.h"
> > > > > >  #include "xfs_rmap.h"
> > > > > > +#include "xfs_ag_resv.h"
> > > > > >  
> > > > > >  /*
> > > > > >   * File system operations
> > > > > > @@ -630,6 +631,11 @@ xfs_growfs_data_private(
> > > > > >  	xfs_set_low_space_thresholds(mp);
> > > > > >  	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
> > > > > >  
> > > > > > +	/* Reserve AG metadata blocks. */
> > > > > > +	error = xfs_fs_reserve_ag_blocks(mp);
> > > > > > +	if (error && error != -ENOSPC)
> > > > > > +		goto out;
> > > > > > +
> > > > > >  	/* update secondary superblocks. */
> > > > > >  	for (agno = 1; agno < nagcount; agno++) {
> > > > > >  		error = 0;
> > > > > > @@ -680,6 +686,8 @@ xfs_growfs_data_private(
> > > > > >  			continue;
> > > > > >  		}
> > > > > >  	}
> > > > > > +
> > > > > > + out:
> > > > > >  	return saved_error ? saved_error : error;
> > > > > >  
> > > > > >   error0:
> > > > > > @@ -989,3 +997,59 @@ xfs_do_force_shutdown(
> > > > > >  	"Please umount the filesystem and rectify the problem(s)");
> > > > > >  	}
> > > > > >  }
> > > > > > +
> > > > > > +/*
> > > > > > + * Reserve free space for per-AG metadata.
> > > > > > + */
> > > > > > +int
> > > > > > +xfs_fs_reserve_ag_blocks(
> > > > > > +	struct xfs_mount	*mp)
> > > > > > +{
> > > > > > +	xfs_agnumber_t		agno;
> > > > > > +	struct xfs_perag	*pag;
> > > > > > +	int			error = 0;
> > > > > > +	int			err2;
> > > > > > +
> > > > > > +	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
> > > > > > +		pag = xfs_perag_get(mp, agno);
> > > > > > +		err2 = xfs_ag_resv_init(pag);
> > > > > > +		xfs_perag_put(pag);
> > > > > > +		if (err2 && !error)
> > > > > > +			error = err2;
> > > > > > +	}
> > > > > > +
> > > > > > +	if (error && error != -ENOSPC) {
> > > > > > +		xfs_warn(mp,
> > > > > > +	"Error %d reserving per-AG metadata reserve pool.", error);
> > > > > > +		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> > > > > > +	}
> > > > > > +
> > > > > > +	return error;
> > > > > > +}
> > > > > > +
> > > > > > +/*
> > > > > > + * Free space reserved for per-AG metadata.
> > > > > > + */
> > > > > > +int
> > > > > > +xfs_fs_unreserve_ag_blocks(
> > > > > > +	struct xfs_mount	*mp)
> > > > > > +{
> > > > > > +	xfs_agnumber_t		agno;
> > > > > > +	struct xfs_perag	*pag;
> > > > > > +	int			error = 0;
> > > > > > +	int			err2;
> > > > > > +
> > > > > > +	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
> > > > > > +		pag = xfs_perag_get(mp, agno);
> > > > > > +		err2 = xfs_ag_resv_free(pag);
> > > > > > +		xfs_perag_put(pag);
> > > > > > +		if (err2 && !error)
> > > > > > +			error = err2;
> > > > > > +	}
> > > > > > +
> > > > > > +	if (error)
> > > > > > +		xfs_warn(mp,
> > > > > > +	"Error %d freeing per-AG metadata reserve pool.", error);
> > > > > > +
> > > > > > +	return error;
> > > > > > +}
> > > > > > diff --git a/fs/xfs/xfs_fsops.h b/fs/xfs/xfs_fsops.h
> > > > > > index f32713f..f349158 100644
> > > > > > --- a/fs/xfs/xfs_fsops.h
> > > > > > +++ b/fs/xfs/xfs_fsops.h
> > > > > > @@ -26,4 +26,7 @@ extern int xfs_reserve_blocks(xfs_mount_t *mp, __uint64_t *inval,
> > > > > >  				xfs_fsop_resblks_t *outval);
> > > > > >  extern int xfs_fs_goingdown(xfs_mount_t *mp, __uint32_t inflags);
> > > > > >  
> > > > > > +extern int xfs_fs_reserve_ag_blocks(struct xfs_mount *mp);
> > > > > > +extern int xfs_fs_unreserve_ag_blocks(struct xfs_mount *mp);
> > > > > > +
> > > > > >  #endif	/* __XFS_FSOPS_H__ */
> > > > > > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> > > > > > index caecbd2..b5da81d 100644
> > > > > > --- a/fs/xfs/xfs_mount.c
> > > > > > +++ b/fs/xfs/xfs_mount.c
> > > > > > @@ -986,10 +986,17 @@ xfs_mountfs(
> > > > > >  			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> > > > > >  			goto out_quota;
> > > > > >  		}
> > > > > > +
> > > > > > +		/* Reserve AG blocks for future btree expansion. */
> > > > > > +		error = xfs_fs_reserve_ag_blocks(mp);
> > > > > > +		if (error && error != -ENOSPC)
> > > > > > +			goto out_agresv;
> > > > > >  	}
> > > > > >  
> > > > > >  	return 0;
> > > > > >  
> > > > > > + out_agresv:
> > > > > > +	xfs_fs_unreserve_ag_blocks(mp);
> > > > > >   out_quota:
> > > > > >  	xfs_qm_unmount_quotas(mp);
> > > > > >   out_rtunmount:
> > > > > > @@ -1034,6 +1041,7 @@ xfs_unmountfs(
> > > > > >  
> > > > > >  	cancel_delayed_work_sync(&mp->m_eofblocks_work);
> > > > > >  
> > > > > > +	xfs_fs_unreserve_ag_blocks(mp);
> > > > > >  	xfs_qm_unmount_quotas(mp);
> > > > > >  	xfs_rtunmount_inodes(mp);
> > > > > >  	IRELE(mp->m_rootip);
> > > > > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > > > > > index e6aaa91..875ab9f 100644
> > > > > > --- a/fs/xfs/xfs_super.c
> > > > > > +++ b/fs/xfs/xfs_super.c
> > > > > > @@ -1315,10 +1315,22 @@ xfs_fs_remount(
> > > > > >  			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> > > > > >  			return error;
> > > > > >  		}
> > > > > > +
> > > > > > +		/* Create the per-AG metadata reservation pool. */
> > > > > > +		error = xfs_fs_reserve_ag_blocks(mp);
> > > > > > +		if (error && error != -ENOSPC)
> > > > > > +			return error;
> > > > > >  	}
> > > > > >  
> > > > > >  	/* rw -> ro */
> > > > > >  	if (!(mp->m_flags & XFS_MOUNT_RDONLY) && (*flags & MS_RDONLY)) {
> > > > > > +		/* Free the per-AG metadata reservation pool. */
> > > > > > +		error = xfs_fs_unreserve_ag_blocks(mp);
> > > > > > +		if (error) {
> > > > > > +			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> > > > > > +			return error;
> > > > > > +		}
> > > > > > +
> > > > > >  		/*
> > > > > >  		 * Before we sync the metadata, we need to free up the reserve
> > > > > >  		 * block pool so that the used block count in the superblock on
> > > > > > 
> > > > > > --
> > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > > the body of a message to majordomo@vger.kernel.org
> > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html


Thread overview: 187+ messages
2016-09-30  3:05 [PATCH v10 00/63] xfs: add reflink and dedupe support Darrick J. Wong
2016-09-30  3:05 ` [PATCH 01/63] vfs: support FS_XFLAG_COWEXTSIZE and get/set of CoW extent size hint Darrick J. Wong
2016-09-30  3:05 ` [PATCH 02/63] vfs: add a FALLOC_FL_UNSHARE mode to fallocate to unshare a range of blocks Darrick J. Wong
2016-09-30  7:08   ` Christoph Hellwig
2016-09-30  3:05 ` [PATCH 03/63] xfs: return an error when an inline directory is too small Darrick J. Wong
2016-09-30  3:06 ` [PATCH 04/63] xfs: define tracepoints for refcount btree activities Darrick J. Wong
2016-09-30  3:06 ` [PATCH 05/63] xfs: introduce refcount btree definitions Darrick J. Wong
2016-09-30  3:06 ` [PATCH 06/63] xfs: refcount btree add more reserved blocks Darrick J. Wong
2016-09-30  3:06 ` [PATCH 07/63] xfs: define the on-disk refcount btree format Darrick J. Wong
2016-09-30  3:06 ` [PATCH 08/63] xfs: add refcount btree support to growfs Darrick J. Wong
2016-09-30  3:06 ` [PATCH 09/63] xfs: account for the refcount btree in the alloc/free log reservation Darrick J. Wong
2016-09-30  3:06 ` [PATCH 10/63] xfs: add refcount btree operations Darrick J. Wong
2016-09-30  3:06 ` [PATCH 11/63] xfs: create refcount update intent log items Darrick J. Wong
2016-09-30  3:06 ` [PATCH 12/63] xfs: log refcount intent items Darrick J. Wong
2016-09-30  3:06 ` [PATCH 13/63] xfs: adjust refcount of an extent of blocks in refcount btree Darrick J. Wong
2016-09-30  7:11   ` Christoph Hellwig
2016-09-30 17:53     ` Darrick J. Wong
2016-09-30  3:07 ` [PATCH 14/63] xfs: connect refcount adjust functions to upper layers Darrick J. Wong
2016-09-30  7:13   ` Christoph Hellwig
2016-09-30 16:21   ` Brian Foster
2016-09-30 19:40     ` Darrick J. Wong
2016-09-30 20:11       ` Brian Foster
2016-09-30  3:07 ` [PATCH 15/63] xfs: adjust refcount when unmapping file blocks Darrick J. Wong
2016-09-30  7:14   ` Christoph Hellwig
2016-09-30  3:07 ` [PATCH 16/63] xfs: add refcount btree block detection to log recovery Darrick J. Wong
2016-09-30  7:15   ` Christoph Hellwig
2016-09-30  3:07 ` [PATCH 17/63] xfs: refcount btree requires more reserved space Darrick J. Wong
2016-09-30  7:15   ` Christoph Hellwig
2016-09-30 16:46   ` Brian Foster
2016-09-30 18:41     ` Darrick J. Wong
2016-09-30  3:07 ` [PATCH 18/63] xfs: introduce reflink utility functions Darrick J. Wong
2016-09-30  3:07   ` Darrick J. Wong
2016-09-30  7:16   ` Christoph Hellwig
2016-09-30 19:22   ` Brian Foster
2016-09-30 19:50     ` Darrick J. Wong
2016-09-30  3:07 ` [PATCH 19/63] xfs: create bmbt update intent log items Darrick J. Wong
2016-09-30  7:24   ` Christoph Hellwig
2016-09-30 17:24     ` Darrick J. Wong
2016-09-30  3:07 ` [PATCH 20/63] xfs: log bmap intent items Darrick J. Wong
2016-09-30  7:26   ` Christoph Hellwig
2016-09-30 17:26     ` Darrick J. Wong
2016-09-30 19:22   ` Brian Foster
2016-09-30 19:52     ` Darrick J. Wong
2016-09-30  3:07 ` [PATCH 21/63] xfs: map an inode's offset to an exact physical block Darrick J. Wong
2016-09-30  7:31   ` Christoph Hellwig
2016-09-30 17:30     ` Darrick J. Wong
2016-10-03 19:03   ` Brian Foster
2016-10-04  0:11     ` Darrick J. Wong
2016-10-04 12:43       ` Brian Foster
2016-10-04 17:28         ` Darrick J. Wong
2016-09-30  3:08 ` [PATCH 22/63] xfs: pass bmapi flags through to bmap_del_extent Darrick J. Wong
2016-09-30  7:16   ` Christoph Hellwig
2016-09-30  3:08 ` [PATCH 23/63] xfs: implement deferred bmbt map/unmap operations Darrick J. Wong
2016-09-30  7:34   ` Christoph Hellwig
2016-09-30 17:38     ` Darrick J. Wong
2016-09-30 20:34       ` Roger Willcocks
2016-09-30 21:08         ` Darrick J. Wong
2016-09-30  3:08 ` [PATCH 24/63] xfs: when replaying bmap operations, don't let unlinked inodes get reaped Darrick J. Wong
2016-09-30  7:35   ` Christoph Hellwig
2016-10-03 19:04   ` Brian Foster
2016-10-04  0:29     ` Darrick J. Wong
2016-10-04 12:44       ` Brian Foster
2016-10-04 19:07         ` Dave Chinner
2016-10-04 21:44           ` Darrick J. Wong
2016-09-30  3:08 ` [PATCH 25/63] xfs: return work remaining at the end of a bunmapi operation Darrick J. Wong
2016-09-30  7:19   ` Christoph Hellwig
2016-10-03 19:04   ` Brian Foster
2016-10-04  0:30     ` Darrick J. Wong
2016-10-04 12:44       ` Brian Foster
2016-09-30  3:08 ` [PATCH 26/63] xfs: define tracepoints for reflink activities Darrick J. Wong
2016-09-30  7:20   ` Christoph Hellwig
2016-09-30  3:08 ` [PATCH 27/63] xfs: add reflink feature flag to geometry Darrick J. Wong
2016-09-30  7:20   ` Christoph Hellwig
2016-09-30  3:08 ` [PATCH 28/63] xfs: don't allow reflinked dir/dev/fifo/socket/pipe files Darrick J. Wong
2016-09-30  7:20   ` Christoph Hellwig
2016-09-30  3:08 ` [PATCH 29/63] xfs: introduce the CoW fork Darrick J. Wong
2016-09-30  7:39   ` Christoph Hellwig
2016-09-30 17:48     ` Darrick J. Wong
2016-09-30  3:08 ` [PATCH 30/63] xfs: support bmapping delalloc extents in " Darrick J. Wong
2016-09-30  7:42   ` Christoph Hellwig
2016-09-30  3:09 ` [PATCH 31/63] xfs: create delalloc extents in " Darrick J. Wong
2016-10-04 16:38   ` Brian Foster
2016-10-04 17:39     ` Darrick J. Wong
2016-10-04 18:38       ` Brian Foster
2016-09-30  3:09 ` [PATCH 32/63] xfs: support allocating delayed " Darrick J. Wong
2016-09-30  7:42   ` Christoph Hellwig
2016-10-04 16:38   ` Brian Foster
2016-09-30  3:09 ` [PATCH 33/63] xfs: allocate " Darrick J. Wong
2016-10-04 16:38   ` Brian Foster
2016-10-04 18:26     ` Darrick J. Wong
2016-10-04 18:39       ` Brian Foster
2016-09-30  3:09 ` [PATCH 34/63] xfs: support removing extents from " Darrick J. Wong
2016-09-30  7:46   ` Christoph Hellwig
2016-09-30 18:00     ` Darrick J. Wong
2016-10-05 18:26   ` Brian Foster
2016-09-30  3:09 ` [PATCH 35/63] xfs: move mappings from cow fork to data fork after copy-write Darrick J. Wong
2016-10-05 18:26   ` Brian Foster
2016-10-05 21:22     ` Darrick J. Wong
2016-09-30  3:09 ` [PATCH 36/63] xfs: report shared extent mappings to userspace correctly Darrick J. Wong
2016-09-30  3:09 ` [PATCH 37/63] xfs: implement CoW for directio writes Darrick J. Wong
2016-10-05 18:27   ` Brian Foster
2016-10-05 20:55     ` Darrick J. Wong
2016-10-06 12:20       ` Brian Foster
2016-10-07  1:02         ` Darrick J. Wong
2016-10-07  6:17           ` Christoph Hellwig
2016-10-07 12:16             ` Brian Foster
2016-10-07 12:15           ` Brian Foster
2016-10-13 18:14             ` Darrick J. Wong
2016-10-13 19:01               ` Brian Foster
2016-09-30  3:09 ` [PATCH 38/63] xfs: cancel CoW reservations and clear inode reflink flag when freeing blocks Darrick J. Wong
2016-09-30  7:47   ` Christoph Hellwig
2016-10-06 16:44   ` Brian Foster
2016-10-07  0:40     ` Darrick J. Wong
2016-09-30  3:09 ` [PATCH 39/63] xfs: cancel pending CoW reservations when destroying inodes Darrick J. Wong
2016-09-30  7:47   ` Christoph Hellwig
2016-10-06 16:44   ` Brian Foster
2016-10-07  0:42     ` Darrick J. Wong
2016-09-30  3:09 ` [PATCH 40/63] xfs: store in-progress CoW allocations in the refcount btree Darrick J. Wong
2016-09-30  7:49   ` Christoph Hellwig
2016-10-07 18:04   ` Brian Foster
2016-10-07 19:18     ` Darrick J. Wong
2016-09-30  3:10 ` [PATCH 41/63] xfs: reflink extents from one file to another Darrick J. Wong
2016-09-30  7:50   ` Christoph Hellwig
2016-10-07 18:04   ` Brian Foster
2016-10-07 19:44     ` Darrick J. Wong
2016-10-07 20:48       ` Brian Foster
2016-10-07 21:41         ` Darrick J. Wong
2016-10-10 13:17           ` Brian Foster
2016-09-30  3:10 ` [PATCH 42/63] xfs: add clone file and clone range vfs functions Darrick J. Wong
2016-09-30  7:51   ` Christoph Hellwig
2016-09-30 18:04     ` Darrick J. Wong
2016-10-07 18:04   ` Brian Foster
2016-10-07 20:31     ` Darrick J. Wong
2016-09-30  3:10 ` [PATCH 43/63] xfs: add dedupe range vfs function Darrick J. Wong
2016-09-30  7:53   ` Christoph Hellwig
2016-09-30  3:10 ` [PATCH 44/63] xfs: teach get_bmapx about shared extents and the CoW fork Darrick J. Wong
2016-09-30  7:53   ` Christoph Hellwig
2016-09-30  3:10 ` [PATCH 45/63] xfs: swap inode reflink flags when swapping inode extents Darrick J. Wong
2016-09-30  7:54   ` Christoph Hellwig
2016-09-30  3:10 ` [PATCH 46/63] xfs: unshare a range of blocks via fallocate Darrick J. Wong
2016-09-30  7:54   ` Christoph Hellwig
2016-10-07 18:05   ` Brian Foster
2016-10-07 20:26     ` Darrick J. Wong
2016-10-07 20:58       ` Brian Foster
2016-10-07 21:15         ` Darrick J. Wong
2016-10-07 22:25           ` Dave Chinner
2016-10-10 17:05             ` Darrick J. Wong
2016-09-30  3:10 ` [PATCH 47/63] xfs: create a separate cow extent size hint for the allocator Darrick J. Wong
2016-09-30  7:55   ` Christoph Hellwig
2016-09-30  3:10 ` [PATCH 48/63] xfs: preallocate blocks for worst-case btree expansion Darrick J. Wong
2016-09-30  8:19   ` Christoph Hellwig
2016-10-12 18:44   ` Brian Foster
2016-10-12 20:52     ` Darrick J. Wong
2016-10-12 22:42       ` Brian Foster
2016-12-06 19:32         ` Darrick J. Wong
2016-12-07 11:53           ` Brian Foster
2016-12-08  6:14             ` Darrick J. Wong
2016-09-30  3:10 ` [PATCH 49/63] xfs: don't allow reflink when the AG is low on space Darrick J. Wong
2016-09-30  8:19   ` Christoph Hellwig
2016-09-30  3:11 ` [PATCH 50/63] xfs: try other AGs to allocate a BMBT block Darrick J. Wong
2016-09-30  8:20   ` Christoph Hellwig
2016-09-30  3:11 ` [PATCH 51/63] xfs: garbage collect old cowextsz reservations Darrick J. Wong
2016-09-30  8:23   ` Christoph Hellwig
2016-09-30  3:11 ` [PATCH 52/63] xfs: increase log reservations for reflink Darrick J. Wong
2016-09-30  8:23   ` Christoph Hellwig
2016-09-30  3:11 ` [PATCH 53/63] xfs: add shared rmap map/unmap/convert log item types Darrick J. Wong
2016-09-30  8:24   ` Christoph Hellwig
2016-09-30  3:11 ` [PATCH 54/63] xfs: use interval query for rmap alloc operations on shared files Darrick J. Wong
2016-09-30  8:24   ` Christoph Hellwig
2016-09-30  3:11 ` [PATCH 55/63] xfs: convert unwritten status of reverse mappings for " Darrick J. Wong
2016-09-30  8:25   ` Christoph Hellwig
2016-09-30  3:11 ` [PATCH 56/63] xfs: set a default CoW extent size of 32 blocks Darrick J. Wong
2016-09-30  8:25   ` Christoph Hellwig
2016-09-30  3:11 ` [PATCH 57/63] xfs: check for invalid inode reflink flags Darrick J. Wong
2016-09-30  8:26   ` Christoph Hellwig
2016-09-30  3:11 ` [PATCH 58/63] xfs: don't mix reflink and DAX mode for now Darrick J. Wong
2016-09-30  8:26   ` Christoph Hellwig
2016-09-30  3:12 ` [PATCH 59/63] xfs: simulate per-AG reservations being critically low Darrick J. Wong
2016-09-30  8:27   ` Christoph Hellwig
2016-09-30  3:12 ` [PATCH 60/63] xfs: recognize the reflink feature bit Darrick J. Wong
2016-09-30  8:27   ` Christoph Hellwig
2016-09-30  3:12 ` [PATCH 61/63] xfs: various swapext cleanups Darrick J. Wong
2016-09-30  8:28   ` Christoph Hellwig
2016-09-30  3:12 ` [PATCH 62/63] xfs: refactor swapext code Darrick J. Wong
2016-09-30  8:28   ` Christoph Hellwig
2016-09-30  3:12 ` [PATCH 63/63] xfs: implement swapext for rmap filesystems Darrick J. Wong
2016-09-30  9:00 ` [PATCH v10 00/63] xfs: add reflink and dedupe support Christoph Hellwig
