* [PATCH 00/11] xfs: reflink/scrub/quota fixes
@ 2018-01-24  2:17 Darrick J. Wong
  2018-01-24  2:18 ` [PATCH 01/11] xfs: reflink should break pnfs leases before sharing blocks Darrick J. Wong
                   ` (11 more replies)
  0 siblings, 12 replies; 60+ messages in thread
From: Darrick J. Wong @ 2018-01-24  2:17 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

Hi all,

This is a rollup of all the patches I've sent in the past 9 days or so.
If all goes well I hope to land this during the 4.16 merge window.

Running generic/232 with quotas and reflink demonstrated that there was
something wrong with the way we did quota accounting -- on an otherwise
idle system, fs-wide du block count numbers didn't match the quota
reports.  I started digging into why the quota accounting was wrong, and
the following are the results of my bug hunt.

The first patch teaches the reflink code to break layout leases before
commencing the block remapping work.  This time we avoid the "looping
trying to get a lock" that Christoph complained about, in favor of
dropping both locks and retrying if we can't cleanly break the layouts
without waiting.
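
To illustrate the retry shape described above, here is a small userspace
model.  It is only a sketch, not the kernel code: the toy_* names, the
lock-depth counter, and the lease counter are all invented for this
example, and the blocking break is modelled as simply clearing the
leases.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for an inode: a lock depth and a lease count. */
struct toy_inode {
	int	locked;		/* nonzero while the "iolock" is held */
	int	leases;		/* outstanding layout leases */
};

static void toy_lock(struct toy_inode *ip)   { ip->locked++; }
static void toy_unlock(struct toy_inode *ip) { ip->locked--; }

/*
 * Models break_layout(): the non-blocking form fails with -EWOULDBLOCK
 * while leases exist; the blocking form "waits" (here it just clears
 * the leases) and succeeds.
 */
static int toy_break_layout(struct toy_inode *ip, bool wait)
{
	if (ip->leases == 0)
		return 0;
	if (!wait)
		return -EWOULDBLOCK;
	ip->leases = 0;
	return 0;
}

/* Lock both inodes in pointer order, breaking dest's leases first. */
static int toy_lock_two_and_break_layout(struct toy_inode *src,
					 struct toy_inode *dest,
					 int *retries)
{
	bool	src_first = (uintptr_t)src < (uintptr_t)dest;
	bool	src_last = (uintptr_t)src > (uintptr_t)dest;
	int	error;

	for (;;) {
		if (src_first)
			toy_lock(src);
		toy_lock(dest);

		/* Try the cheap, non-blocking break with locks held. */
		error = toy_break_layout(dest, false);
		if (error == 0)
			break;

		/* Back out every lock taken so far before waiting. */
		toy_unlock(dest);
		if (src_first)
			toy_unlock(src);
		if (error != -EWOULDBLOCK)
			return error;

		/* Wait for the leases with no locks held, then retry. */
		error = toy_break_layout(dest, true);
		if (error)
			return error;
		(*retries)++;
	}

	if (src_last)
		toy_lock(src);
	return 0;
}
```

A caller would pass two distinct toy_inode pointers plus a retry
counter; on success both inodes are locked exactly once and dest has no
leases left.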

The second patch changes the source file locking (if src != dest) during
a reflink operation to take the shared locks when possible.  The only
thing changing in the source file is the setting of the reflink iflag,
for which we will still take ILOCK_EXCL.  The net result of this is
less lock contention during fsstress and a 30% lower runtime, not that
anyone cares about fsstress benchmarking. :)
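
One detail of mixing lock modes is that each mode has to stay attached
to its inode when the pair is reordered for locking.  The sketch below
shows that idea in isolation; the toy_* types are made up for the
example, and it assumes the ordering key is the inode number with the
lower number locked first.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical lock modes for the sketch. */
enum toy_lock_mode { TOY_SHARED = 1, TOY_EXCL = 2 };

/* One lock request: which inode, and in which mode. */
struct toy_lock_req {
	uint64_t		ino;	/* inode number, the ordering key */
	enum toy_lock_mode	mode;
};

/*
 * Arrange two requests so that the lower inode number comes first.
 * Swapping the whole struct keeps the mode paired with its inode, so
 * the source stays SHARED and the destination stays EXCL no matter
 * which one ends up being locked first.
 */
static void toy_order_lock_reqs(struct toy_lock_req *first,
				struct toy_lock_req *second)
{
	if (first->ino > second->ino) {
		struct toy_lock_req tmp = *first;

		*first = *second;
		*second = tmp;
	}
}
```

Swapping only the inode pointers and not the modes would silently take
the source EXCL and the destination SHARED whenever the destination has
the lower inode number.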

Patch three ensures that we attach dquots to inodes before we start
reflinking their blocks.  Without this, quota could be undercharged; an
fstest to check this will be sent separately.

Patch four reorganizes the copy on write quota updating code to reflect
how the CoW fork works now.  In short, the CoW fork is entirely in
memory, so we can only use the in-memory quota reservation counters for
all CoW blocks; the accounting only becomes permanent if we remap an
extent into the data fork.
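
A toy model of the counters may make the difference easier to see.
This is a sketch under simplifying assumptions, not the real dquot
code: the struct and function names are invented, only the four
counters q_res_bcount, d_bcount, qt_blk_res and qt_bcount are modelled,
and "commit" does nothing except fold the transaction deltas into the
dquot.

```c
#include <assert.h>

/* Hypothetical dquot: in-core reservation and on-disk block count. */
struct toy_dquot {
	long	q_res_bcount;	/* reserved in-core (includes on-disk) */
	long	d_bcount;	/* accounted for by on-disk metadata */
};

/* Hypothetical per-transaction quota deltas. */
struct toy_trans {
	long	qt_blk_res;	/* reserved for this transaction */
	long	qt_bcount;	/* blocks actually charged on disk */
};

/* Reserve n blocks: both the dquot and the transaction remember it. */
static void toy_reserve(struct toy_dquot *dq, struct toy_trans *tp, long n)
{
	dq->q_res_bcount += n;
	tp->qt_blk_res += n;
}

/* Data fork allocation: charge the blocks to the on-disk count. */
static void toy_alloc_data(struct toy_trans *tp, long len)
{
	tp->qt_bcount += len;
}

/*
 * CoW fork allocation: shrink qt_blk_res so that commit does not hand
 * these blocks back, and leave the on-disk charge untouched.
 */
static void toy_alloc_cow(struct toy_trans *tp, long len)
{
	tp->qt_blk_res -= len;
}

/* Commit: charge d_bcount and release whatever reservation is unused. */
static void toy_commit(struct toy_dquot *dq, struct toy_trans *tp)
{
	dq->d_bcount += tp->qt_bcount;
	dq->q_res_bcount -= tp->qt_blk_res - tp->qt_bcount;
	tp->qt_blk_res = 0;
	tp->qt_bcount = 0;
}
```

Reserving ten blocks and allocating four of them leaves q_res_bcount at
four in both cases, but only the data fork allocation moves d_bcount;
the CoW blocks stay a purely in-memory reservation.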

Patch five creates a separate i_cow_blocks counter to track all the CoW
blocks assigned to a file, which makes changing a file's uid/gid/prjid
easier, makes reporting cow blocks via stat easy, and enables various
cleanups.

Patch six fixes a serious potential corruption problem with the cow
extent allocation -- when we allocate into the CoW fork with the cow
extent size hint set, the allocator enlarges the allocation request to
try to hit alignment goals.  However, if the allocated extent does not
actually fulfill any of the requested range, we send a garbage
zero-length extent back to the iomap code (which also doesn't notice),
and the write lands at the startblock of the garbage extent.  The fix is
to detect that we didn't fill the entire requested range and fix up the
returned mapping so that we always fill the first block of the
requested allocation.
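
One plausible shape for that fixup is sketched below.  Treat all of it
as assumption: the toy_map type, the function name, and the exact
policy (slide the mapping to the requested offset and clamp its length)
are invented for illustration and are not taken from the patch itself.

```c
#include <assert.h>

/* Hypothetical file-offset mapping, in blocks: [off, off + len). */
struct toy_map {
	unsigned long	off;
	unsigned long	len;
};

/*
 * The allocator may have widened the request for alignment and then
 * filled only a part of the widened range.  If what it returned does
 * not cover the first block the caller asked for, slide the mapping
 * so that it starts at the requested offset and clamp it to the
 * requested length.  Returns 1 if the mapping was changed.
 */
static int toy_fixup_shortfall(struct toy_map *got,
			       unsigned long req_off,
			       unsigned long req_len)
{
	unsigned long	got_end = got->off + got->len;

	/* Nothing was allocated; there is nothing to fix up. */
	if (got->len == 0)
		return 0;

	/* The first requested block is already covered. */
	if (got->off <= req_off && req_off < got_end)
		return 0;

	got->off = req_off;
	if (got->len > req_len)
		got->len = req_len;
	return 1;
}
```

With a request for blocks [8, 10) and an allocation that only filled
[0, 4), the helper returns a mapping for [8, 10) rather than leaving
the caller with a zero-length result.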

The seventh patch fixes a minor problem where we fail to clear di_flags2
when we're freeing an inode.

The eighth and ninth patches fix inconsistent and incorrect print format
specifier usage in the tracepoints.  In tracepoint land, %p is sufficient
to print a pointer as 0x12345678, so just do that.  %pS and %pF (printk
training wheels) are wrong here.  Also fix inode numbers to always use
0x%llx (we've been lax about printing them as numbers).

The tenth patch creates an xfs_inode_verifier_error helper so that we
can complain about inode corruption problems in a standard way but leave
the format string details to the kernel/xfsprogs.  We really can't have
%pS and other stuff escaping to userspace.

The eleventh patch fixes a NULL pointer deref because we incorrectly
freed the inode btree cursor if there's an error while counting the
blocks in the inode btree for rmapbt cross-referencing.

Anyway, with this set applied I think we're ready to remove the reflink
EXPERIMENTAL tag during the 4.16 cycle.

--D


* [PATCH 01/11] xfs: reflink should break pnfs leases before sharing blocks
  2018-01-24  2:17 [PATCH 00/11] xfs: reflink/scrub/quota fixes Darrick J. Wong
@ 2018-01-24  2:18 ` Darrick J. Wong
  2018-01-24 14:16   ` Brian Foster
  2018-01-26  9:06   ` Christoph Hellwig
  2018-01-24  2:18 ` [PATCH 02/11] xfs: only grab shared inode locks for source file during reflink Darrick J. Wong
                   ` (10 subsequent siblings)
  11 siblings, 2 replies; 60+ messages in thread
From: Darrick J. Wong @ 2018-01-24  2:18 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Before we share blocks between files, we need to break the pnfs layout
leases on the destination file so that we can start slicing and dicing
the block map.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_reflink.c |   48 +++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 47 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 47aea2e..f89a725 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1245,6 +1245,50 @@ xfs_reflink_remap_blocks(
 }
 
 /*
+ * Grab the exclusive iolock for a data copy from src to dest, making
+ * sure to abide by vfs locking order (lowest pointer value goes first) and
+ * breaking the pnfs layout leases on dest before proceeding.  The loop
+ * is needed because we cannot call the blocking break_layout() with the
+ * src iolock held, and therefore have to back out both locks.
+ */
+static int
+xfs_iolock_two_inodes_and_break_layout(
+	struct inode		*src,
+	struct inode		*dest)
+{
+	bool			src_first = src < dest;
+	bool			src_last = src > dest;
+	int			error;
+
+retry:
+	if (src_first) {
+		inode_lock(src);
+		inode_lock_nested(dest, I_MUTEX_NONDIR2);
+	} else {
+		inode_lock(dest);
+	}
+
+	error = break_layout(dest, false);
+	if (error == -EWOULDBLOCK) {
+		inode_unlock(dest);
+		if (src_first)
+			inode_unlock(src);
+		error = break_layout(dest, true);
+		if (error)
+			return error;
+		goto retry;
+	} else if (error) {
+		inode_unlock(dest);
+		if (src_first)
+			inode_unlock(src);
+		return error;
+	}
+	if (src_last)
+		inode_lock_nested(src, I_MUTEX_NONDIR2);
+	return 0;
+}
+
+/*
  * Link a range of blocks from one file to another.
  */
 int
@@ -1274,7 +1318,9 @@ xfs_reflink_remap_range(
 		return -EIO;
 
 	/* Lock both files against IO */
-	lock_two_nondirectories(inode_in, inode_out);
+	ret = xfs_iolock_two_inodes_and_break_layout(inode_in, inode_out);
+	if (ret)
+		return ret;
 	if (same_inode)
 		xfs_ilock(src, XFS_MMAPLOCK_EXCL);
 	else



* [PATCH 02/11] xfs: only grab shared inode locks for source file during reflink
  2018-01-24  2:17 [PATCH 00/11] xfs: reflink/scrub/quota fixes Darrick J. Wong
  2018-01-24  2:18 ` [PATCH 01/11] xfs: reflink should break pnfs leases before sharing blocks Darrick J. Wong
@ 2018-01-24  2:18 ` Darrick J. Wong
  2018-01-24 14:18   ` Brian Foster
  2018-01-26 12:07   ` Christoph Hellwig
  2018-01-24  2:18 ` [PATCH 03/11] xfs: call xfs_qm_dqattach before performing reflink operations Darrick J. Wong
                   ` (9 subsequent siblings)
  11 siblings, 2 replies; 60+ messages in thread
From: Darrick J. Wong @ 2018-01-24  2:18 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Reflink and dedupe operations remap blocks from a source file into a
destination file.  The destination file needs exclusive locks on all
levels because we're updating its block map, but the source file isn't
undergoing any block map changes so we can use a shared lock.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_inode.c   |   49 ++++++++++++++++++++++++++++++++-----------------
 fs/xfs/xfs_inode.h   |   12 +++++++++++-
 fs/xfs/xfs_reflink.c |   26 ++++++++++++++++----------
 3 files changed, 59 insertions(+), 28 deletions(-)


diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index c9e40d4..4a38cfc 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -545,24 +545,37 @@ xfs_lock_inodes(
 }
 
 /*
- * xfs_lock_two_inodes() can only be used to lock one type of lock at a time -
- * the iolock, the mmaplock or the ilock, but not more than one at a time. If we
- * lock more than one at a time, lockdep will report false positives saying we
- * have violated locking orders.
+ * xfs_lock_two_inodes_separately() can only be used to lock one type of lock
+ * at a time - the mmaplock or the ilock, but not more than one type at a
+ * time. If we lock more than one at a time, lockdep will report false
+ * positives saying we have violated locking orders.  The iolock must be
+ * double-locked separately since we use i_rwsem for that.  We now support
+ * taking one lock EXCL and the other SHARED.
  */
 void
-xfs_lock_two_inodes(
-	xfs_inode_t		*ip0,
-	xfs_inode_t		*ip1,
-	uint			lock_mode)
+xfs_lock_two_inodes_separately(
+	struct xfs_inode	*ip0,
+	uint			ip0_mode,
+	struct xfs_inode	*ip1,
+	uint			ip1_mode)
 {
-	xfs_inode_t		*temp;
+	struct xfs_inode	*temp;
+	uint			mode_temp;
 	int			attempts = 0;
 	xfs_log_item_t		*lp;
 
-	ASSERT(!(lock_mode & (XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL)));
-	if (lock_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL))
-		ASSERT(!(lock_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL)));
+	ASSERT(hweight32(ip0_mode) == 1);
+	ASSERT(hweight32(ip1_mode) == 1);
+	ASSERT(!(ip0_mode & (XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL)));
+	ASSERT(!(ip1_mode & (XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL)));
+	ASSERT(!(ip0_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL)) ||
+	       !(ip0_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL)));
+	ASSERT(!(ip1_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL)) ||
+	       !(ip1_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL)));
+	ASSERT(!(ip1_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL)) ||
+	       !(ip0_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL)));
+	ASSERT(!(ip0_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL)) ||
+	       !(ip1_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL)));
 
 	ASSERT(ip0->i_ino != ip1->i_ino);
 
@@ -570,10 +583,13 @@ xfs_lock_two_inodes(
 		temp = ip0;
 		ip0 = ip1;
 		ip1 = temp;
+		mode_temp = ip0_mode;
+		ip0_mode = ip1_mode;
+		ip1_mode = mode_temp;
 	}
 
  again:
-	xfs_ilock(ip0, xfs_lock_inumorder(lock_mode, 0));
+	xfs_ilock(ip0, xfs_lock_inumorder(ip0_mode, 0));
 
 	/*
 	 * If the first lock we have locked is in the AIL, we must TRY to get
@@ -582,18 +598,17 @@ xfs_lock_two_inodes(
 	 */
 	lp = (xfs_log_item_t *)ip0->i_itemp;
 	if (lp && (lp->li_flags & XFS_LI_IN_AIL)) {
-		if (!xfs_ilock_nowait(ip1, xfs_lock_inumorder(lock_mode, 1))) {
-			xfs_iunlock(ip0, lock_mode);
+		if (!xfs_ilock_nowait(ip1, xfs_lock_inumorder(ip1_mode, 1))) {
+			xfs_iunlock(ip0, ip0_mode);
 			if ((++attempts % 5) == 0)
 				delay(1); /* Don't just spin the CPU */
 			goto again;
 		}
 	} else {
-		xfs_ilock(ip1, xfs_lock_inumorder(lock_mode, 1));
+		xfs_ilock(ip1, xfs_lock_inumorder(ip1_mode, 1));
 	}
 }
 
-
 void
 __xfs_iflock(
 	struct xfs_inode	*ip)
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 386b0bb..ff56486 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -423,7 +423,17 @@ void		xfs_iunpin_wait(xfs_inode_t *);
 #define xfs_ipincount(ip)	((unsigned int) atomic_read(&ip->i_pincount))
 
 int		xfs_iflush(struct xfs_inode *, struct xfs_buf **);
-void		xfs_lock_two_inodes(xfs_inode_t *, xfs_inode_t *, uint);
+void		xfs_lock_two_inodes_separately(struct xfs_inode *ip0,
+				uint ip0_mode, struct xfs_inode *ip1,
+				uint ip1_mode);
+static inline void
+xfs_lock_two_inodes(
+	struct xfs_inode	*ip0,
+	struct xfs_inode	*ip1,
+	uint			lock_mode)
+{
+	xfs_lock_two_inodes_separately(ip0, lock_mode, ip1, lock_mode);
+}
 
 xfs_extlen_t	xfs_get_extsz_hint(struct xfs_inode *ip);
 xfs_extlen_t	xfs_get_cowextsz_hint(struct xfs_inode *ip);
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index f89a725..f5a43b2 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1202,13 +1202,16 @@ xfs_reflink_remap_blocks(
 
 	/* drange = (destoff, destoff + len); srange = (srcoff, srcoff + len) */
 	while (len) {
+		uint		lock_mode;
+
 		trace_xfs_reflink_remap_blocks_loop(src, srcoff, len,
 				dest, destoff);
+
 		/* Read extent from the source file */
 		nimaps = 1;
-		xfs_ilock(src, XFS_ILOCK_EXCL);
+		lock_mode = xfs_ilock_data_map_shared(src);
 		error = xfs_bmapi_read(src, srcoff, len, &imap, &nimaps, 0);
-		xfs_iunlock(src, XFS_ILOCK_EXCL);
+		xfs_iunlock(src, lock_mode);
 		if (error)
 			goto err;
 		ASSERT(nimaps == 1);
@@ -1262,7 +1265,7 @@ xfs_iolock_two_inodes_and_break_layout(
 
 retry:
 	if (src_first) {
-		inode_lock(src);
+		inode_lock_shared(src);
 		inode_lock_nested(dest, I_MUTEX_NONDIR2);
 	} else {
 		inode_lock(dest);
@@ -1272,7 +1275,7 @@ xfs_iolock_two_inodes_and_break_layout(
 	if (error == -EWOULDBLOCK) {
 		inode_unlock(dest);
 		if (src_first)
-			inode_unlock(src);
+			inode_unlock_shared(src);
 		error = break_layout(dest, true);
 		if (error)
 			return error;
@@ -1280,11 +1283,11 @@ xfs_iolock_two_inodes_and_break_layout(
 	} else if (error) {
 		inode_unlock(dest);
 		if (src_first)
-			inode_unlock(src);
+			inode_unlock_shared(src);
 		return error;
 	}
 	if (src_last)
-		inode_lock_nested(src, I_MUTEX_NONDIR2);
+		down_read_nested(&src->i_rwsem, I_MUTEX_NONDIR2);
 	return 0;
 }
 
@@ -1324,7 +1327,8 @@ xfs_reflink_remap_range(
 	if (same_inode)
 		xfs_ilock(src, XFS_MMAPLOCK_EXCL);
 	else
-		xfs_lock_two_inodes(src, dest, XFS_MMAPLOCK_EXCL);
+		xfs_lock_two_inodes_separately(src, XFS_MMAPLOCK_SHARED,
+				dest, XFS_MMAPLOCK_EXCL);
 
 	/* Check file eligibility and prepare for block sharing. */
 	ret = -EINVAL;
@@ -1387,10 +1391,12 @@ xfs_reflink_remap_range(
 			is_dedupe);
 
 out_unlock:
-	xfs_iunlock(src, XFS_MMAPLOCK_EXCL);
+	xfs_iunlock(dest, XFS_MMAPLOCK_EXCL);
+	if (!same_inode)
+		xfs_iunlock(src, XFS_MMAPLOCK_SHARED);
+	inode_unlock(inode_out);
 	if (!same_inode)
-		xfs_iunlock(dest, XFS_MMAPLOCK_EXCL);
-	unlock_two_nondirectories(inode_in, inode_out);
+		inode_unlock_shared(inode_in);
 	if (ret)
 		trace_xfs_reflink_remap_range_error(dest, ret, _RET_IP_);
 	return ret;



* [PATCH 03/11] xfs: call xfs_qm_dqattach before performing reflink operations
  2018-01-24  2:17 [PATCH 00/11] xfs: reflink/scrub/quota fixes Darrick J. Wong
  2018-01-24  2:18 ` [PATCH 01/11] xfs: reflink should break pnfs leases before sharing blocks Darrick J. Wong
  2018-01-24  2:18 ` [PATCH 02/11] xfs: only grab shared inode locks for source file during reflink Darrick J. Wong
@ 2018-01-24  2:18 ` Darrick J. Wong
  2018-01-24 14:18   ` Brian Foster
  2018-01-26  9:07   ` Christoph Hellwig
  2018-01-24  2:18 ` [PATCH 04/11] xfs: CoW fork operations should only update quota reservations Darrick J. Wong
                   ` (8 subsequent siblings)
  11 siblings, 2 replies; 60+ messages in thread
From: Darrick J. Wong @ 2018-01-24  2:18 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Ensure that we've attached all the necessary dquots before performing
reflink operations so that quota accounting is accurate.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_reflink.c |    5 +++++
 1 file changed, 5 insertions(+)


diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index f5a43b2..82abff6 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1345,6 +1345,11 @@ xfs_reflink_remap_range(
 	if (ret <= 0)
 		goto out_unlock;
 
+	/* Attach dquots to dest inode before changing block map */
+	ret = xfs_qm_dqattach(dest, 0);
+	if (ret)
+		goto out_unlock;
+
 	trace_xfs_reflink_remap_range(src, pos_in, len, dest, pos_out);
 
 	/*



* [PATCH 04/11] xfs: CoW fork operations should only update quota reservations
  2018-01-24  2:17 [PATCH 00/11] xfs: reflink/scrub/quota fixes Darrick J. Wong
                   ` (2 preceding siblings ...)
  2018-01-24  2:18 ` [PATCH 03/11] xfs: call xfs_qm_dqattach before performing reflink operations Darrick J. Wong
@ 2018-01-24  2:18 ` Darrick J. Wong
  2018-01-24 14:22   ` Brian Foster
  2018-01-25  1:20   ` [PATCH v2 " Darrick J. Wong
  2018-01-24  2:18 ` [PATCH 05/11] xfs: track CoW blocks separately in the inode Darrick J. Wong
                   ` (7 subsequent siblings)
  11 siblings, 2 replies; 60+ messages in thread
From: Darrick J. Wong @ 2018-01-24  2:18 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Since the CoW fork only exists in memory, it is incorrect to update the
on-disk quota block counts when we modify the CoW fork.  Unlike the data
fork, even real extents in the CoW fork are only reservations (on-disk
they're owned by the refcountbt) so they must not be tracked in the
on-disk quota info.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c |  203 ++++++++++++++++++++++++++++++++++++++++++++--
 fs/xfs/xfs_reflink.c     |    8 +-
 2 files changed, 196 insertions(+), 15 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 6e6f3cb..e3e8f7c 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -52,6 +52,145 @@
 #include "xfs_refcount.h"
 #include "xfs_icache.h"
 
+/*
+ * Data/Attr Fork Mapping Lifecycle
+ *
+ * The data fork contains the block mappings between logical blocks in a file
+ * and physical blocks on the disk.  The XFS notions of delayed allocation
+ * reservations, unwritten extents, and real extents follow well known
+ * conventions in the filesystem world.
+ *
+ * As a side note, the attribute fork does the same for extended attribute
+ * blocks, though the logical block offsets are not available to userspace and
+ * the only valid states are HOLE and REAL.
+ *
+ * Metadata involved outside of the block mapping itself are as follows:
+ *
+ * - i_delayed_blks: Number of blocks that are reserved for delayed allocation.
+ * - i_cow_blocks: Number of blocks reserved for copy on write staging.
+ *
+ * - di_nblocks: Number of blocks (on-disk) assigned to the inode.
+ *
+ * - d_bcount: Number of quota blocks accounted for by on-disk metadata.
+ * - q_res_bcount: Number of quota blocks reserved in-core for future writes +
+ *           blocks mentioned by on-disk metadata.
+ *
+ * - qt_blk_res: Number of quota blocks reserved in-core for this transaction.
+ *           Unused reservation is given back to q_res_bcount on commit.
+ * - qt_bcount: Number of quota blocks used by this transaction from
+ *           qt_blk_res.  d_bcount is increased by this on commit.
+ * - qt_delbcount: Number of quota blocks used by this transaction from
+ *           q_res_bcount but not qt_blk_res.  d_bcount is increased by this
+ *           on commit.
+ *
+ * - sb_fdblocks: Number of free blocks recorded in the superblock on disk.
+ * - fdblocks: Number of free blocks recorded in the superblock minus any
+ *           in-core reservations made in anticipation of future writes.
+ *
+ * - t_blk_res: Number of blocks reserved out of fdblocks for a transaction.
+ *           When the transaction commits, t_blk_res - t_blk_res_used is given
+ *           back to fdblocks.
+ * - t_blk_res_used: Number of blocks used by this transaction that were
+ *           reserved for this transaction.
+ * - t_fdblocks_del: Number of blocks by which fdblocks and sb_fdblocks will
+ *           have to decrease at commit.
+ * - t_res_fdblocks_delta: Number of blocks by which sb_fdblocks will have to
+ *           decrease at commit.  We assume that fdblocks was decreased
+ *           prior to the transaction.
+ *
+ * Data fork block mappings have four logical states:
+ *
+ *    +--------> UNWRITTEN <------+
+ *    |              ^            |
+ *    |              v            v
+ * DELALLOC <----> HOLE <------> REAL
+ *    |                           ^
+ *    |                           |
+ *    +---------------------------+
+ *
+ * The state transitions and required metadata updates are as follows:
+ *
+ * - HOLE to DELALLOC: Increase i_delayed_blks and q_res_bcount, and decrease
+ *           fdblocks.
+ * - HOLE to REAL: Increase di_nblocks and qt_bcount, and decrease fdblocks.
+ * - HOLE to UNWRITTEN: Same as above.
+ *
+ * - DELALLOC to UNWRITTEN: Increase di_nblocks and qt_delbcount, and decrease
+ *           i_delayed_blks.
+ * - DELALLOC to REAL: Same as above.
+ * - DELALLOC to HOLE: Increase fdblocks, and decrease i_delayed_blks and
+ *           q_res_bcount.
+ *
+ * - UNWRITTEN to HOLE: Decrease di_nblocks and q_bcount, and increase fdblocks.
+ * - UNWRITTEN to REAL: No change.
+ *
+ * - REAL to UNWRITTEN: No change.
+ * - REAL to HOLE: Decrease di_nblocks and q_bcount, and increase fdblocks.
+ *
+ * Note in particular that delalloc reservations have "transaction-less"
+ * quota reservations via q_res_bcount.  If the reservation is allocated,
+ * qt_delbcount is used to increment d_bcount without touching q_res_bcount.
+ * Filling a hole with an allocated extent, by contrast, uses qt_blk_res
+ * to make a reservation in q_res_bcount, qt_bcount to record the number
+ * of allocated blocks; at commit qt_bcount is added to d_bcount and
+ * qt_blk_res - qt_bcount is added back to q_res_bcount.
+ *
+ * Copy on Write Fork Mapping Lifecycle
+ *
+ * The CoW fork handles things differently from the data fork because its
+ * mappings only exist in memory-- the refcount btree is the on-disk owner of
+ * the extents until they're remapped into the data fork.  Therefore,
+ * unwritten and real extents in the CoW fork are treated the same way as
+ * delayed allocation extents.  Quota and fdblock changes only exist in
+ * memory, which requires some twists in the bmap functions.
+ *
+ * The CoW fork extent state diagram looks like this:
+ *
+ *    +--------> UNWRITTEN -------+
+ *    |              ^            |
+ *    |              v            v
+ * DELALLOC <----> HOLE <------- REAL
+ *
+ * Holes are still holes.  Delayed allocation extents reserve blocks for
+ * landing future writes, just like they do in the data fork.  However, unlike
+ * the data fork, unwritten extents signal an extent that has been allocated
+ * but is not currently undergoing writeback.  Real extents are undergoing
+ * writeback, and when that writeback finishes the corresponding data fork
+ * extent will be punched out and the CoW fork counterpart moved to the new
+ * hole in the data fork.
+ *
+ * The state transitions and required metadata updates are as follows:
+ *
+ * - HOLE to DELALLOC: Increase i_cow_blocks and q_res_bcount, and decrease
+ *           fdblocks.
+ * - HOLE to UNWRITTEN: Same as above, but since we reserved quota via
+ *           qt_blk_res (which increased q_res_bcount) when we allocate the
+ *           extent we have to decrease qt_blk_res so that the commit doesn't
+ *           give the allocated CoW blocks back.
+ *
+ * - DELALLOC to UNWRITTEN: No change.
+ * - DELALLOC to HOLE: Decrease i_cow_blocks and q_res_bcount, and increase
+ *           fdblocks.
+ *
+ * - UNWRITTEN to HOLE: Same as DELALLOC to HOLE.
+ * - UNWRITTEN to REAL: No change.
+ *
+ * - REAL to HOLE: This transition happens when we've finished a write
+ *           operation and need to move the mapping to the data fork.  We
+ *           punch the corresponding data fork mappings, which decreases
+ *           qt_bcount.  Then we map the CoW fork mapping into the hole we
+ *           just cleared out of the data fork, which increases qt_bcount.
+ *           There's a subtlety here -- if we promoted a write over a hole to
+ *           CoW, there will be a net increase in qt_bcount, which is fine
+ *           because we already reserved the quota when we filled the CoW
+ *           fork.  Finally, we punch the CoW fork mapping, which decreases
+ *           q_res_bcount.
+ *
+ * Notice how all CoW fork extents use transactionless quota reservations and
+ * the in-core fdblocks to maintain state, and we avoid updating any on-disk
+ * metadata.  This is essential to maintain metadata correctness if the system
+ * goes down.
+ */
 
 kmem_zone_t		*xfs_bmap_free_item_zone;
 
@@ -3337,6 +3476,39 @@ xfs_bmap_btalloc_filestreams(
 	return 0;
 }
 
+/* Deal with CoW fork accounting when we allocate a block. */
+static void
+xfs_bmap_btalloc_cow(
+	struct xfs_bmalloca	*ap,
+	struct xfs_alloc_arg	*args)
+{
+	/* Filling a previously reserved extent; nothing to do here. */
+	if (ap->wasdel)
+		return;
+
+	/*
+	 * The CoW fork only exists in memory, so the on-disk quota accounting
+	 * must not include any CoW fork extents.  Therefore, CoW blocks are
+	 * only tracked in the in-core dquot block count (q_res_bcount).
+	 *
+	 * If we get here, we're filling a CoW hole with a real (non-delalloc)
+	 * CoW extent having reserved enough blocks from both q_res_bcount and
+	 * qt_blk_res to guarantee that we won't run out of space.  The unused
+	 * qt_blk_res is given back to q_res_bcount when the transaction
+	 * commits.
+	 *
+	 * We don't want the quota accounting for our newly allocated blocks
+	 * to be given back, so we must decrease qt_blk_res without decreasing
+	 * q_res_bcount.
+	 *
+	 * Note: If we're allocating a delalloc extent, we already reserved
+	 * the q_res_bcount blocks, so no quota accounting update is needed
+	 * here.
+	 */
+	xfs_trans_mod_dquot_byino(ap->tp, ap->ip, XFS_TRANS_DQ_RES_BLKS,
+			-(long)args->len);
+}
+
 STATIC int
 xfs_bmap_btalloc(
 	struct xfs_bmalloca	*ap)	/* bmap alloc argument struct */
@@ -3571,19 +3743,22 @@ xfs_bmap_btalloc(
 			*ap->firstblock = args.fsbno;
 		ASSERT(nullfb || fb_agno <= args.agno);
 		ap->length = args.len;
-		if (!(ap->flags & XFS_BMAPI_COWFORK))
-			ap->ip->i_d.di_nblocks += args.len;
-		xfs_trans_log_inode(ap->tp, ap->ip, XFS_ILOG_CORE);
 		if (ap->wasdel)
 			ap->ip->i_delayed_blks -= args.len;
-		/*
-		 * Adjust the disk quota also. This was reserved
-		 * earlier.
-		 */
-		xfs_trans_mod_dquot_byino(ap->tp, ap->ip,
-			ap->wasdel ? XFS_TRANS_DQ_DELBCOUNT :
-					XFS_TRANS_DQ_BCOUNT,
-			(long) args.len);
+		if (ap->flags & XFS_BMAPI_COWFORK) {
+			xfs_bmap_btalloc_cow(ap, &args);
+		} else {
+			ap->ip->i_d.di_nblocks += args.len;
+			xfs_trans_log_inode(ap->tp, ap->ip, XFS_ILOG_CORE);
+			/*
+			 * Adjust the disk quota also. This was reserved
+			 * earlier.
+			 */
+			xfs_trans_mod_dquot_byino(ap->tp, ap->ip,
+				ap->wasdel ? XFS_TRANS_DQ_DELBCOUNT :
+						XFS_TRANS_DQ_BCOUNT,
+				(long) args.len);
+		}
 	} else {
 		ap->blkno = NULLFSBLOCK;
 		ap->length = 0;
@@ -4776,6 +4951,7 @@ xfs_bmap_del_extent_cow(
 	struct xfs_bmbt_irec	new;
 	xfs_fileoff_t		del_endoff, got_endoff;
 	int			state = BMAP_COWFORK;
+	int			error;
 
 	XFS_STATS_INC(mp, xs_del_exlist);
 
@@ -4832,6 +5008,11 @@ xfs_bmap_del_extent_cow(
 		xfs_iext_insert(ip, icur, &new, state);
 		break;
 	}
+
+	/* Remove the quota reservation */
+	error = xfs_trans_reserve_quota_nblks(NULL, ip,
+			-(long)del->br_blockcount, 0, XFS_QMOPT_RES_REGBLKS);
+	ASSERT(error == 0);
 }
 
 /*
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 82abff6..e367351 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -599,10 +599,6 @@ xfs_reflink_cancel_cow_blocks(
 					del.br_startblock, del.br_blockcount,
 					NULL);
 
-			/* Update quota accounting */
-			xfs_trans_mod_dquot_byino(*tpp, ip, XFS_TRANS_DQ_BCOUNT,
-					-(long)del.br_blockcount);
-
 			/* Roll the transaction */
 			xfs_defer_ijoin(&dfops, ip);
 			error = xfs_defer_finish(tpp, &dfops);
@@ -795,6 +791,10 @@ xfs_reflink_end_cow(
 		if (error)
 			goto out_defer;
 
+		/* Charge this new data fork mapping to the on-disk quota. */
+		xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT,
+				(long)del.br_blockcount);
+
 		/* Remove the mapping from the CoW fork. */
 		xfs_bmap_del_extent_cow(ip, &icur, &got, &del);
 



* [PATCH 05/11] xfs: track CoW blocks separately in the inode
  2018-01-24  2:17 [PATCH 00/11] xfs: reflink/scrub/quota fixes Darrick J. Wong
                   ` (3 preceding siblings ...)
  2018-01-24  2:18 ` [PATCH 04/11] xfs: CoW fork operations should only update quota reservations Darrick J. Wong
@ 2018-01-24  2:18 ` Darrick J. Wong
  2018-01-25 13:06   ` Brian Foster
  2018-01-26 12:15   ` Christoph Hellwig
  2018-01-24  2:18 ` [PATCH 06/11] xfs: fix up cowextsz allocation shortfalls Darrick J. Wong
                   ` (6 subsequent siblings)
  11 siblings, 2 replies; 60+ messages in thread
From: Darrick J. Wong @ 2018-01-24  2:18 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Track the number of blocks reserved in the CoW fork so that we can
move the quota reservations whenever we chown, and don't account for
CoW fork delalloc reservations in i_delayed_blks.  This should make
chown work properly for quota reservations, enables us to fully
account for real extents in the cow fork in the file stat info, and
improves the post-eof scanning decisions because we're no longer
confusing data fork delalloc extents with cow fork delalloc extents.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c      |   16 ++++++++++++----
 fs/xfs/libxfs/xfs_inode_buf.c |    1 +
 fs/xfs/xfs_bmap_util.c        |    5 +++++
 fs/xfs/xfs_icache.c           |    3 ++-
 fs/xfs/xfs_inode.c            |   11 +++++------
 fs/xfs/xfs_inode.h            |    1 +
 fs/xfs/xfs_iops.c             |    3 ++-
 fs/xfs/xfs_itable.c           |    3 ++-
 fs/xfs/xfs_qm.c               |    2 +-
 fs/xfs/xfs_reflink.c          |    4 ++--
 fs/xfs/xfs_super.c            |    1 +
 11 files changed, 34 insertions(+), 16 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index e3e8f7c..93ce2c6 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3505,6 +3505,7 @@ xfs_bmap_btalloc_cow(
 	 * the q_res_bcount blocks, so no quota accounting update is needed
 	 * here.
 	 */
+	ap->ip->i_cow_blocks += args->len;
 	xfs_trans_mod_dquot_byino(ap->tp, ap->ip, XFS_TRANS_DQ_RES_BLKS,
 			-(long)args->len);
 }
@@ -3743,13 +3744,13 @@ xfs_bmap_btalloc(
 			*ap->firstblock = args.fsbno;
 		ASSERT(nullfb || fb_agno <= args.agno);
 		ap->length = args.len;
-		if (ap->wasdel)
-			ap->ip->i_delayed_blks -= args.len;
 		if (ap->flags & XFS_BMAPI_COWFORK) {
 			xfs_bmap_btalloc_cow(ap, &args);
 		} else {
 			ap->ip->i_d.di_nblocks += args.len;
 			xfs_trans_log_inode(ap->tp, ap->ip, XFS_ILOG_CORE);
+			if (ap->wasdel)
+				ap->ip->i_delayed_blks -= args.len;
 			/*
 			 * Adjust the disk quota also. This was reserved
 			 * earlier.
@@ -4116,7 +4117,10 @@ xfs_bmapi_reserve_delalloc(
 		goto out_unreserve_blocks;
 
 
-	ip->i_delayed_blks += alen;
+	if (whichfork == XFS_COW_FORK)
+		ip->i_cow_blocks += alen;
+	else
+		ip->i_delayed_blks += alen;
 
 	got->br_startoff = aoff;
 	got->br_startblock = nullstartblock(indlen);
@@ -4859,7 +4863,10 @@ xfs_bmap_del_extent_delay(
 			isrt ? XFS_QMOPT_RES_RTBLKS : XFS_QMOPT_RES_REGBLKS);
 	if (error)
 		return error;
-	ip->i_delayed_blks -= del->br_blockcount;
+	if (whichfork == XFS_COW_FORK)
+		ip->i_cow_blocks -= del->br_blockcount;
+	else
+		ip->i_delayed_blks -= del->br_blockcount;
 
 	if (got->br_startoff == del->br_startoff)
 		state |= BMAP_LEFT_FILLING;
@@ -5010,6 +5017,7 @@ xfs_bmap_del_extent_cow(
 	}
 
 	/* Remove the quota reservation */
+	ip->i_cow_blocks -= del->br_blockcount;
 	error = xfs_trans_reserve_quota_nblks(NULL, ip,
 			-(long)del->br_blockcount, 0, XFS_QMOPT_RES_REGBLKS);
 	ASSERT(error == 0);
diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index 4035b5d..6e9dcdb 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -624,6 +624,7 @@ xfs_iread(
 
 	ASSERT(ip->i_d.di_version >= 2);
 	ip->i_delayed_blks = 0;
+	ip->i_cow_blocks = 0;
 
 	/*
 	 * Mark the buffer containing the inode as something to keep
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 6d37ab4..c572789 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1991,6 +1991,7 @@ xfs_swap_extents(
 	/* Swap the cow forks. */
 	if (xfs_sb_version_hasreflink(&mp->m_sb)) {
 		xfs_extnum_t	extnum;
+		unsigned int	cowblocks;
 
 		ASSERT(ip->i_cformat == XFS_DINODE_FMT_EXTENTS);
 		ASSERT(tip->i_cformat == XFS_DINODE_FMT_EXTENTS);
@@ -2011,6 +2012,10 @@ xfs_swap_extents(
 			xfs_inode_set_cowblocks_tag(tip);
 		else
 			xfs_inode_clear_cowblocks_tag(tip);
+
+		cowblocks = tip->i_cow_blocks;
+		tip->i_cow_blocks = ip->i_cow_blocks;
+		ip->i_cow_blocks = cowblocks;
 	}
 
 	xfs_trans_log_inode(tp, ip,  src_log_flags);
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 2da7a2e..1344206 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -80,6 +80,7 @@ xfs_inode_alloc(
 	memset(&ip->i_df, 0, sizeof(xfs_ifork_t));
 	ip->i_flags = 0;
 	ip->i_delayed_blks = 0;
+	ip->i_cow_blocks = 0;
 	memset(&ip->i_d, 0, sizeof(ip->i_d));
 
 	return ip;
@@ -1668,7 +1669,7 @@ xfs_prep_free_cowblocks(
 	 * Just clear the tag if we have an empty cow fork or none at all. It's
 	 * possible the inode was fully unshared since it was originally tagged.
 	 */
-	if (!xfs_is_reflink_inode(ip) || !ifp->if_bytes) {
+	if (!xfs_is_reflink_inode(ip) || ip->i_cow_blocks == 0) {
 		trace_xfs_inode_free_cowblocks_invalid(ip);
 		xfs_inode_clear_cowblocks_tag(ip);
 		return false;
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 4a38cfc..a208825 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1508,15 +1508,13 @@ xfs_itruncate_clear_reflink_flags(
 	struct xfs_inode	*ip)
 {
 	struct xfs_ifork	*dfork;
-	struct xfs_ifork	*cfork;
 
 	if (!xfs_is_reflink_inode(ip))
 		return;
 	dfork = XFS_IFORK_PTR(ip, XFS_DATA_FORK);
-	cfork = XFS_IFORK_PTR(ip, XFS_COW_FORK);
-	if (dfork->if_bytes == 0 && cfork->if_bytes == 0)
+	if (dfork->if_bytes == 0 && ip->i_cow_blocks == 0)
 		ip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
-	if (cfork->if_bytes == 0)
+	if (ip->i_cow_blocks == 0)
 		xfs_inode_clear_cowblocks_tag(ip);
 }
 
@@ -1669,7 +1667,7 @@ xfs_release(
 		truncated = xfs_iflags_test_and_clear(ip, XFS_ITRUNCATED);
 		if (truncated) {
 			xfs_iflags_clear(ip, XFS_IDIRTY_RELEASE);
-			if (ip->i_delayed_blks > 0) {
+			if (ip->i_delayed_blks > 0 || ip->i_cow_blocks > 0) {
 				error = filemap_flush(VFS_I(ip)->i_mapping);
 				if (error)
 					return error;
@@ -1909,7 +1907,8 @@ xfs_inactive(
 
 	if (S_ISREG(VFS_I(ip)->i_mode) &&
 	    (ip->i_d.di_size != 0 || XFS_ISIZE(ip) != 0 ||
-	     ip->i_d.di_nextents > 0 || ip->i_delayed_blks > 0))
+	     ip->i_d.di_nextents > 0 || ip->i_delayed_blks > 0 ||
+	     ip->i_cow_blocks > 0))
 		truncate = 1;
 
 	error = xfs_qm_dqattach(ip, 0);
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index ff56486..6feee8a 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -62,6 +62,7 @@ typedef struct xfs_inode {
 	/* Miscellaneous state. */
 	unsigned long		i_flags;	/* see defined flags below */
 	unsigned int		i_delayed_blks;	/* count of delay alloc blks */
+	unsigned int		i_cow_blocks;	/* count of cow fork blocks */
 
 	struct xfs_icdinode	i_d;		/* most of ondisk inode */
 
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 56475fc..6c3381c 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -513,7 +513,8 @@ xfs_vn_getattr(
 	stat->mtime = inode->i_mtime;
 	stat->ctime = inode->i_ctime;
 	stat->blocks =
-		XFS_FSB_TO_BB(mp, ip->i_d.di_nblocks + ip->i_delayed_blks);
+		XFS_FSB_TO_BB(mp, ip->i_d.di_nblocks + ip->i_delayed_blks +
+				  ip->i_cow_blocks);
 
 	if (ip->i_d.di_version == 3) {
 		if (request_mask & STATX_BTIME) {
diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
index d583105..412d7eb 100644
--- a/fs/xfs/xfs_itable.c
+++ b/fs/xfs/xfs_itable.c
@@ -122,7 +122,8 @@ xfs_bulkstat_one_int(
 	case XFS_DINODE_FMT_BTREE:
 		buf->bs_rdev = 0;
 		buf->bs_blksize = mp->m_sb.sb_blocksize;
-		buf->bs_blocks = dic->di_nblocks + ip->i_delayed_blks;
+		buf->bs_blocks = dic->di_nblocks + ip->i_delayed_blks +
+				 ip->i_cow_blocks;
 		break;
 	}
 	xfs_iunlock(ip, XFS_ILOCK_SHARED);
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index 5b848f4..28f12f8 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -1847,7 +1847,7 @@ xfs_qm_vop_chown_reserve(
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL|XFS_ILOCK_SHARED));
 	ASSERT(XFS_IS_QUOTA_RUNNING(mp));
 
-	delblks = ip->i_delayed_blks;
+	delblks = ip->i_delayed_blks + ip->i_cow_blocks;
 	blkflags = XFS_IS_REALTIME_INODE(ip) ?
 			XFS_QMOPT_RES_RTBLKS : XFS_QMOPT_RES_REGBLKS;
 
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index e367351..f875ea7 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -619,7 +619,7 @@ xfs_reflink_cancel_cow_blocks(
 	}
 
 	/* clear tag if cow fork is emptied */
-	if (!ifp->if_bytes)
+	if (ip->i_cow_blocks == 0)
 		xfs_inode_clear_cowblocks_tag(ip);
 
 	return error;
@@ -704,7 +704,7 @@ xfs_reflink_end_cow(
 	trace_xfs_reflink_end_cow(ip, offset, count);
 
 	/* No COW extents?  That's easy! */
-	if (ifp->if_bytes == 0)
+	if (ip->i_cow_blocks == 0)
 		return 0;
 
 	offset_fsb = XFS_B_TO_FSBT(ip->i_mount, offset);
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index f3e0001..9d04cfb 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -989,6 +989,7 @@ xfs_fs_destroy_inode(
 	xfs_inactive(ip);
 
 	ASSERT(XFS_FORCED_SHUTDOWN(ip->i_mount) || ip->i_delayed_blks == 0);
+	ASSERT(XFS_FORCED_SHUTDOWN(ip->i_mount) || ip->i_cow_blocks == 0);
 	XFS_STATS_INC(ip->i_mount, vn_reclaim);
 
 	/*



* [PATCH 06/11] xfs: fix up cowextsz allocation shortfalls
  2018-01-24  2:17 [PATCH 00/11] xfs: reflink/scrub/quota fixes Darrick J. Wong
                   ` (4 preceding siblings ...)
  2018-01-24  2:18 ` [PATCH 05/11] xfs: track CoW blocks separately in the inode Darrick J. Wong
@ 2018-01-24  2:18 ` Darrick J. Wong
  2018-01-25 17:31   ` Brian Foster
  2018-01-26  9:11   ` Christoph Hellwig
  2018-01-24  2:18 ` [PATCH 07/11] xfs: always zero di_flags2 when we free the inode Darrick J. Wong
                   ` (5 subsequent siblings)
  11 siblings, 2 replies; 60+ messages in thread
From: Darrick J. Wong @ 2018-01-24  2:18 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

In xfs_bmap_btalloc, we use the CoW extent size hint to align
allocations (offset-wise) to cowextsz granularity, which reduces CoW
fragmentation.  This works fine until the allocator cannot find enough
blocks to cover both the requested range and the alignment hints.  When
that happens, return an unaligned region instead; otherwise the extent
trim functions cause us to return a zero-length extent to iomap, which
iomap doesn't catch and thus blows up.
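
For readers following along, here is a minimal userspace sketch of the
offset fixup described above.  It is not the kernel code: the type
names and cow_fixup_offset() are hypothetical stand-ins, and it assumes
that by the time the fixup runs, the offset is the cowextsz-aligned one
and the length is whatever the allocator actually returned.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t fileoff_t;	/* hypothetical stand-in for xfs_fileoff_t */
typedef uint32_t extlen_t;	/* hypothetical stand-in for xfs_extlen_t */

/*
 * Model of the offset fixup (illustration only).
 *
 * aligned_off:       file offset of the cowextsz-aligned request
 * got_len:           number of blocks the allocator actually returned
 * orig_off/orig_len: the range the caller originally asked for
 *
 * Returns the file offset at which the got_len blocks should be mapped.
 */
static fileoff_t
cow_fixup_offset(fileoff_t aligned_off, extlen_t got_len,
		 fileoff_t orig_off, extlen_t orig_len)
{
	assert(got_len > 0 && orig_len > 0);

	/* No more than the original request: start at the original offset. */
	if (got_len <= orig_len)
		return orig_off;

	/* Mapping would end short of the original range: line up the ends. */
	if (aligned_off + got_len < orig_off + orig_len)
		return orig_off + orig_len - got_len;

	/* Otherwise the aligned mapping already covers the original range. */
	return aligned_off;
}
```

As a worked case, take an original request for blocks 20-23 that was
widened to an aligned request starting at block 16.  If only 3 blocks
come back, mapping them at 16 would cover 16-18 and miss the requested
range entirely, so the sketch moves the mapping to offset 20.  With 6
blocks it slides the mapping to 18 so that it ends at block 23, and
with 12 blocks the aligned offset 16 is kept.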

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/iomap.c               |    2 +-
 fs/xfs/libxfs/xfs_bmap.c |   21 +++++++++++++++++++--
 2 files changed, 20 insertions(+), 3 deletions(-)


diff --git a/fs/iomap.c b/fs/iomap.c
index e5de772..aec35a0 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -63,7 +63,7 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
 	ret = ops->iomap_begin(inode, pos, length, flags, &iomap);
 	if (ret)
 		return ret;
-	if (WARN_ON(iomap.offset > pos))
+	if (WARN_ON(iomap.offset > pos) || WARN_ON(iomap.length == 0))
 		return -EIO;
 
 	/*
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 93ce2c6..4ec1fdc5 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3480,8 +3480,20 @@ xfs_bmap_btalloc_filestreams(
 static void
 xfs_bmap_btalloc_cow(
 	struct xfs_bmalloca	*ap,
-	struct xfs_alloc_arg	*args)
+	struct xfs_alloc_arg	*args,
+	xfs_fileoff_t		orig_offset,
+	xfs_extlen_t		orig_length)
 {
+	/*
+	 * If we didn't get enough blocks to satisfy the cowextsize
+	 * aligned request, break the alignment and return whatever we
+	 * got; it's the best we can do.
+	 */
+	if (ap->length <= orig_length)
+		ap->offset = orig_offset;
+	else if (ap->offset + ap->length < orig_offset + orig_length)
+		ap->offset = orig_offset + orig_length - ap->length;
+
 	/* Filling a previously reserved extent; nothing to do here. */
 	if (ap->wasdel)
 		return;
@@ -3520,6 +3532,8 @@ xfs_bmap_btalloc(
 	xfs_agnumber_t	fb_agno;	/* ag number of ap->firstblock */
 	xfs_agnumber_t	ag;
 	xfs_alloc_arg_t	args;
+	xfs_fileoff_t	orig_offset;
+	xfs_extlen_t	orig_length;
 	xfs_extlen_t	blen;
 	xfs_extlen_t	nextminlen = 0;
 	int		nullfb;		/* true if ap->firstblock isn't set */
@@ -3529,6 +3543,8 @@ xfs_bmap_btalloc(
 	int		stripe_align;
 
 	ASSERT(ap->length);
+	orig_offset = ap->offset;
+	orig_length = ap->length;
 
 	mp = ap->ip->i_mount;
 
@@ -3745,7 +3761,8 @@ xfs_bmap_btalloc(
 		ASSERT(nullfb || fb_agno <= args.agno);
 		ap->length = args.len;
 		if (ap->flags & XFS_BMAPI_COWFORK) {
-			xfs_bmap_btalloc_cow(ap, &args);
+			xfs_bmap_btalloc_cow(ap, &args, orig_offset,
+					orig_length);
 		} else {
 			ap->ip->i_d.di_nblocks += args.len;
 			xfs_trans_log_inode(ap->tp, ap->ip, XFS_ILOG_CORE);



* [PATCH 07/11] xfs: always zero di_flags2 when we free the inode
  2018-01-24  2:17 [PATCH 00/11] xfs: reflink/scrub/quota fixes Darrick J. Wong
                   ` (5 preceding siblings ...)
  2018-01-24  2:18 ` [PATCH 06/11] xfs: fix up cowextsz allocation shortfalls Darrick J. Wong
@ 2018-01-24  2:18 ` Darrick J. Wong
  2018-01-25 17:31   ` Brian Foster
  2018-01-26  9:08   ` Christoph Hellwig
  2018-01-24  2:18 ` [PATCH 08/11] xfs: fix tracepoint %p formats Darrick J. Wong
                   ` (4 subsequent siblings)
  11 siblings, 2 replies; 60+ messages in thread
From: Darrick J. Wong @ 2018-01-24  2:18 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Always zero the di_flags2 field when we free the inode so that we never
write reflinked non-file inode records to disk.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_inode.c |    1 +
 1 file changed, 1 insertion(+)


diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index a208825..fc118dd 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2465,6 +2465,7 @@ xfs_ifree(
 
 	VFS_I(ip)->i_mode = 0;		/* mark incore inode as free */
 	ip->i_d.di_flags = 0;
+	ip->i_d.di_flags2 = 0;
 	ip->i_d.di_dmevmask = 0;
 	ip->i_d.di_forkoff = 0;		/* mark the attr fork not in use */
 	ip->i_d.di_format = XFS_DINODE_FMT_EXTENTS;



* [PATCH 08/11] xfs: fix tracepoint %p formats
  2018-01-24  2:17 [PATCH 00/11] xfs: reflink/scrub/quota fixes Darrick J. Wong
                   ` (6 preceding siblings ...)
  2018-01-24  2:18 ` [PATCH 07/11] xfs: always zero di_flags2 when we free the inode Darrick J. Wong
@ 2018-01-24  2:18 ` Darrick J. Wong
  2018-01-25 17:31   ` Brian Foster
  2018-01-24  2:18 ` [PATCH 09/11] xfs: make tracepoint inode number format consistent Darrick J. Wong
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 60+ messages in thread
From: Darrick J. Wong @ 2018-01-24  2:18 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Tracepoint printk doesn't support any of the %p suffixes (such as %pS),
so use plain %p.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/scrub/trace.h |   20 ++++++++++----------
 fs/xfs/xfs_trace.h   |   24 ++++++++++++------------
 2 files changed, 22 insertions(+), 22 deletions(-)


diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index a0a6d3c..732775f 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -90,7 +90,7 @@ TRACE_EVENT(xfs_scrub_op_error,
 		__entry->error = error;
 		__entry->ret_ip = ret_ip;
 	),
-	TP_printk("dev %d:%d type %u agno %u agbno %u error %d ret_ip %pS",
+	TP_printk("dev %d:%d type %u agno %u agbno %u error %d ret_ip %p",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->type,
 		  __entry->agno,
@@ -121,7 +121,7 @@ TRACE_EVENT(xfs_scrub_file_op_error,
 		__entry->error = error;
 		__entry->ret_ip = ret_ip;
 	),
-	TP_printk("dev %d:%d ino %llu fork %d type %u offset %llu error %d ret_ip %pS",
+	TP_printk("dev %d:%d ino %llu fork %d type %u offset %llu error %d ret_ip %p",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->ino,
 		  __entry->whichfork,
@@ -156,7 +156,7 @@ DECLARE_EVENT_CLASS(xfs_scrub_block_error_class,
 		__entry->bno = bno;
 		__entry->ret_ip = ret_ip;
 	),
-	TP_printk("dev %d:%d type %u agno %u agbno %u ret_ip %pS",
+	TP_printk("dev %d:%d type %u agno %u agbno %u ret_ip %p",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->type,
 		  __entry->agno,
@@ -207,7 +207,7 @@ DECLARE_EVENT_CLASS(xfs_scrub_ino_error_class,
 		__entry->bno = bno;
 		__entry->ret_ip = ret_ip;
 	),
-	TP_printk("dev %d:%d ino %llu type %u agno %u agbno %u ret_ip %pS",
+	TP_printk("dev %d:%d ino %llu type %u agno %u agbno %u ret_ip %p",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->ino,
 		  __entry->type,
@@ -246,7 +246,7 @@ DECLARE_EVENT_CLASS(xfs_scrub_fblock_error_class,
 		__entry->offset = offset;
 		__entry->ret_ip = ret_ip;
 	),
-	TP_printk("dev %d:%d ino %llu fork %d type %u offset %llu ret_ip %pS",
+	TP_printk("dev %d:%d ino %llu fork %d type %u offset %llu ret_ip %p",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->ino,
 		  __entry->whichfork,
@@ -277,7 +277,7 @@ TRACE_EVENT(xfs_scrub_incomplete,
 		__entry->type = sc->sm->sm_type;
 		__entry->ret_ip = ret_ip;
 	),
-	TP_printk("dev %d:%d type %u ret_ip %pS",
+	TP_printk("dev %d:%d type %u ret_ip %p",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->type,
 		  __entry->ret_ip)
@@ -311,7 +311,7 @@ TRACE_EVENT(xfs_scrub_btree_op_error,
 		__entry->error = error;
 		__entry->ret_ip = ret_ip;
 	),
-	TP_printk("dev %d:%d type %u btnum %d level %d ptr %d agno %u agbno %u error %d ret_ip %pS",
+	TP_printk("dev %d:%d type %u btnum %d level %d ptr %d agno %u agbno %u error %d ret_ip %p",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->type,
 		  __entry->btnum,
@@ -354,7 +354,7 @@ TRACE_EVENT(xfs_scrub_ifork_btree_op_error,
 		__entry->error = error;
 		__entry->ret_ip = ret_ip;
 	),
-	TP_printk("dev %d:%d ino %llu fork %d type %u btnum %d level %d ptr %d agno %u agbno %u error %d ret_ip %pS",
+	TP_printk("dev %d:%d ino %llu fork %d type %u btnum %d level %d ptr %d agno %u agbno %u error %d ret_ip %p",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->ino,
 		  __entry->whichfork,
@@ -393,7 +393,7 @@ TRACE_EVENT(xfs_scrub_btree_error,
 		__entry->ptr = cur->bc_ptrs[level];
 		__entry->ret_ip = ret_ip;
 	),
-	TP_printk("dev %d:%d type %u btnum %d level %d ptr %d agno %u agbno %u ret_ip %pS",
+	TP_printk("dev %d:%d type %u btnum %d level %d ptr %d agno %u agbno %u ret_ip %p",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->type,
 		  __entry->btnum,
@@ -433,7 +433,7 @@ TRACE_EVENT(xfs_scrub_ifork_btree_error,
 		__entry->ptr = cur->bc_ptrs[level];
 		__entry->ret_ip = ret_ip;
 	),
-	TP_printk("dev %d:%d ino %llu fork %d type %u btnum %d level %d ptr %d agno %u agbno %u ret_ip %pS",
+	TP_printk("dev %d:%d ino %llu fork %d type %u btnum %d level %d ptr %d agno %u agbno %u ret_ip %p",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->ino,
 		  __entry->whichfork,
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 945de08..893081e 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -119,7 +119,7 @@ DECLARE_EVENT_CLASS(xfs_perag_class,
 		__entry->refcount = refcount;
 		__entry->caller_ip = caller_ip;
 	),
-	TP_printk("dev %d:%d agno %u refcount %d caller %pS",
+	TP_printk("dev %d:%d agno %u refcount %d caller %p",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->agno,
 		  __entry->refcount,
@@ -252,7 +252,7 @@ DECLARE_EVENT_CLASS(xfs_bmap_class,
 		__entry->caller_ip = caller_ip;
 	),
 	TP_printk("dev %d:%d ino 0x%llx state %s cur %p/%d "
-		  "offset %lld block %lld count %lld flag %d caller %pS",
+		  "offset %lld block %lld count %lld flag %d caller %p",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->ino,
 		  __print_flags(__entry->bmap_state, "|", XFS_BMAP_EXT_FLAGS),
@@ -301,7 +301,7 @@ DECLARE_EVENT_CLASS(xfs_buf_class,
 		__entry->caller_ip = caller_ip;
 	),
 	TP_printk("dev %d:%d bno 0x%llx nblks 0x%x hold %d pincount %d "
-		  "lock %d flags %s caller %pS",
+		  "lock %d flags %s caller %p",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  (unsigned long long)__entry->bno,
 		  __entry->nblks,
@@ -370,7 +370,7 @@ DECLARE_EVENT_CLASS(xfs_buf_flags_class,
 		__entry->caller_ip = caller_ip;
 	),
 	TP_printk("dev %d:%d bno 0x%llx len 0x%zx hold %d pincount %d "
-		  "lock %d flags %s caller %pS",
+		  "lock %d flags %s caller %p",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  (unsigned long long)__entry->bno,
 		  __entry->buffer_length,
@@ -415,7 +415,7 @@ TRACE_EVENT(xfs_buf_ioerror,
 		__entry->caller_ip = caller_ip;
 	),
 	TP_printk("dev %d:%d bno 0x%llx len 0x%zx hold %d pincount %d "
-		  "lock %d error %d flags %s caller %pS",
+		  "lock %d error %d flags %s caller %p",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  (unsigned long long)__entry->bno,
 		  __entry->buffer_length,
@@ -579,7 +579,7 @@ DECLARE_EVENT_CLASS(xfs_lock_class,
 		__entry->lock_flags = lock_flags;
 		__entry->caller_ip = caller_ip;
 	),
-	TP_printk("dev %d:%d ino 0x%llx flags %s caller %pS",
+	TP_printk("dev %d:%d ino 0x%llx flags %s caller %p",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->ino,
 		  __print_flags(__entry->lock_flags, "|", XFS_LOCK_FLAGS),
@@ -697,7 +697,7 @@ DECLARE_EVENT_CLASS(xfs_iref_class,
 		__entry->pincount = atomic_read(&ip->i_pincount);
 		__entry->caller_ip = caller_ip;
 	),
-	TP_printk("dev %d:%d ino 0x%llx count %d pincount %d caller %pS",
+	TP_printk("dev %d:%d ino 0x%llx count %d pincount %d caller %p",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->ino,
 		  __entry->count,
@@ -1049,7 +1049,7 @@ TRACE_EVENT(xfs_log_force,
 		__entry->lsn = lsn;
 		__entry->caller_ip = caller_ip;
 	),
-	TP_printk("dev %d:%d lsn 0x%llx caller %pS",
+	TP_printk("dev %d:%d lsn 0x%llx caller %p",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->lsn, (void *)__entry->caller_ip)
 )
@@ -1403,7 +1403,7 @@ TRACE_EVENT(xfs_bunmap,
 		__entry->flags = flags;
 	),
 	TP_printk("dev %d:%d ino 0x%llx size 0x%llx bno 0x%llx len 0x%llx"
-		  "flags %s caller %pS",
+		  "flags %s caller %p",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->ino,
 		  __entry->size,
@@ -1517,7 +1517,7 @@ TRACE_EVENT(xfs_agf,
 	),
 	TP_printk("dev %d:%d agno %u flags %s length %u roots b %u c %u "
 		  "levels b %u c %u flfirst %u fllast %u flcount %u "
-		  "freeblks %u longest %u caller %pS",
+		  "freeblks %u longest %u caller %p",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->agno,
 		  __print_flags(__entry->flags, "|", XFS_AGF_FLAGS),
@@ -2486,7 +2486,7 @@ DECLARE_EVENT_CLASS(xfs_ag_error_class,
 		__entry->error = error;
 		__entry->caller_ip = caller_ip;
 	),
-	TP_printk("dev %d:%d agno %u error %d caller %pS",
+	TP_printk("dev %d:%d agno %u error %d caller %p",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->agno,
 		  __entry->error,
@@ -2977,7 +2977,7 @@ DECLARE_EVENT_CLASS(xfs_inode_error_class,
 		__entry->error = error;
 		__entry->caller_ip = caller_ip;
 	),
-	TP_printk("dev %d:%d ino %llx error %d caller %pS",
+	TP_printk("dev %d:%d ino %llx error %d caller %p",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->ino,
 		  __entry->error,



* [PATCH 09/11] xfs: make tracepoint inode number format consistent
  2018-01-24  2:17 [PATCH 00/11] xfs: reflink/scrub/quota fixes Darrick J. Wong
                   ` (7 preceding siblings ...)
  2018-01-24  2:18 ` [PATCH 08/11] xfs: fix tracepoint %p formats Darrick J. Wong
@ 2018-01-24  2:18 ` Darrick J. Wong
  2018-01-25 17:31   ` Brian Foster
  2018-01-26  9:09   ` Christoph Hellwig
  2018-01-24  2:19 ` [PATCH 10/11] xfs: refactor inode verifier corruption error printing Darrick J. Wong
                   ` (2 subsequent siblings)
  11 siblings, 2 replies; 60+ messages in thread
From: Darrick J. Wong @ 2018-01-24  2:18 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Make the inode number format consistent (0x%llx) across all tracepoint
definitions.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/scrub/trace.h |   12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)


diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 732775f..eb420a41 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -50,7 +50,7 @@ DECLARE_EVENT_CLASS(xfs_scrub_class,
 		__entry->flags = sm->sm_flags;
 		__entry->error = error;
 	),
-	TP_printk("dev %d:%d ino %llu type %u agno %u inum %llu gen %u flags 0x%x error %d",
+	TP_printk("dev %d:%d ino 0x%llx type %u agno %u inum %llu gen %u flags 0x%x error %d",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->ino,
 		  __entry->type,
@@ -121,7 +121,7 @@ TRACE_EVENT(xfs_scrub_file_op_error,
 		__entry->error = error;
 		__entry->ret_ip = ret_ip;
 	),
-	TP_printk("dev %d:%d ino %llu fork %d type %u offset %llu error %d ret_ip %p",
+	TP_printk("dev %d:%d ino 0x%llx fork %d type %u offset %llu error %d ret_ip %p",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->ino,
 		  __entry->whichfork,
@@ -207,7 +207,7 @@ DECLARE_EVENT_CLASS(xfs_scrub_ino_error_class,
 		__entry->bno = bno;
 		__entry->ret_ip = ret_ip;
 	),
-	TP_printk("dev %d:%d ino %llu type %u agno %u agbno %u ret_ip %p",
+	TP_printk("dev %d:%d ino 0x%llx type %u agno %u agbno %u ret_ip %p",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->ino,
 		  __entry->type,
@@ -246,7 +246,7 @@ DECLARE_EVENT_CLASS(xfs_scrub_fblock_error_class,
 		__entry->offset = offset;
 		__entry->ret_ip = ret_ip;
 	),
-	TP_printk("dev %d:%d ino %llu fork %d type %u offset %llu ret_ip %p",
+	TP_printk("dev %d:%d ino 0x%llx fork %d type %u offset %llu ret_ip %p",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->ino,
 		  __entry->whichfork,
@@ -354,7 +354,7 @@ TRACE_EVENT(xfs_scrub_ifork_btree_op_error,
 		__entry->error = error;
 		__entry->ret_ip = ret_ip;
 	),
-	TP_printk("dev %d:%d ino %llu fork %d type %u btnum %d level %d ptr %d agno %u agbno %u error %d ret_ip %p",
+	TP_printk("dev %d:%d ino 0x%llx fork %d type %u btnum %d level %d ptr %d agno %u agbno %u error %d ret_ip %p",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->ino,
 		  __entry->whichfork,
@@ -433,7 +433,7 @@ TRACE_EVENT(xfs_scrub_ifork_btree_error,
 		__entry->ptr = cur->bc_ptrs[level];
 		__entry->ret_ip = ret_ip;
 	),
-	TP_printk("dev %d:%d ino %llu fork %d type %u btnum %d level %d ptr %d agno %u agbno %u ret_ip %p",
+	TP_printk("dev %d:%d ino 0x%llx fork %d type %u btnum %d level %d ptr %d agno %u agbno %u ret_ip %p",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->ino,
 		  __entry->whichfork,



* [PATCH 10/11] xfs: refactor inode verifier corruption error printing
  2018-01-24  2:17 [PATCH 00/11] xfs: reflink/scrub/quota fixes Darrick J. Wong
                   ` (8 preceding siblings ...)
  2018-01-24  2:18 ` [PATCH 09/11] xfs: make tracepoint inode number format consistent Darrick J. Wong
@ 2018-01-24  2:19 ` Darrick J. Wong
  2018-01-25 17:31   ` Brian Foster
  2018-01-26  9:10   ` Christoph Hellwig
  2018-01-24  2:19 ` [PATCH 11/11] xfs: don't clobber inobt/finobt cursors when xref with rmap Darrick J. Wong
  2018-01-25  5:26 ` [PATCH 12/11] xfs: refactor quota code in xfs_bmap_btalloc Darrick J. Wong
  11 siblings, 2 replies; 60+ messages in thread
From: Darrick J. Wong @ 2018-01-24  2:19 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Refactor inode verifier error reporting into a non-libxfs function so
that we aren't encoding the message format in libxfs.  This also
changes the kernel dmesg output to resemble buffer verifier errors
more closely.
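
To make the new report shape concrete, here is a small userspace
sketch of the two pieces of logic in the helper: choosing the message
wording from the error code, and clamping the hex dump length to the
size of the buffer.  Everything named toy_* or TOY_* is hypothetical;
the error values and the 128-byte dump limit are assumptions made for
illustration, and the "at %pS" part of the kernel message is left out
because it has no userspace equivalent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins; the real values live in the kernel headers. */
enum {
	TOY_EFSCORRUPTED	= 1,
	TOY_EFSBADCRC		= 2,
};
#define TOY_DUMP_LEN	128	/* assumed limit on dumped bytes */

/* Never dump more bytes than the buffer holds, nor more than the limit. */
static size_t
toy_dump_len(size_t bufsz)
{
	return bufsz < TOY_DUMP_LEN ? bufsz : TOY_DUMP_LEN;
}

/*
 * Format the first alert line.  Returns what snprintf() returns: the
 * length the full message needs, excluding the trailing NUL.
 */
static int
toy_format_verifier_error(char *out, size_t outsz, int error,
			  unsigned long long ino, const char *name)
{
	assert(out != NULL && name != NULL);

	return snprintf(out, outsz, "Metadata %s detected, inode 0x%llx %s",
			error == -TOY_EFSBADCRC ? "CRC error" : "corruption",
			ino, name);
}
```

For example, calling toy_format_verifier_error() with -TOY_EFSCORRUPTED,
inode 0x85 and the name "data fork" yields "Metadata corruption
detected, inode 0x85 data fork", and a 16-byte inline fork would have
all 16 bytes dumped rather than the full limit.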

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_inode_buf.c |    6 ++----
 fs/xfs/xfs_error.c            |   37 +++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_error.h            |    3 +++
 fs/xfs/xfs_inode.c            |   14 ++++++++------
 4 files changed, 50 insertions(+), 10 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index 6e9dcdb..6d05ba6 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -578,10 +578,8 @@ xfs_iread(
 	/* even unallocated inodes are verified */
 	fa = xfs_dinode_verify(mp, ip->i_ino, dip);
 	if (fa) {
-		xfs_alert(mp, "%s: validation failed for inode %lld at %pS",
-				__func__, ip->i_ino, fa);
-
-		XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, dip);
+		xfs_inode_verifier_error(ip, -EFSCORRUPTED, "record", dip,
+				sizeof(*dip), fa);
 		error = -EFSCORRUPTED;
 		goto out_brelse;
 	}
diff --git a/fs/xfs/xfs_error.c b/fs/xfs/xfs_error.c
index 980d5f0..ccf520f 100644
--- a/fs/xfs/xfs_error.c
+++ b/fs/xfs/xfs_error.c
@@ -24,6 +24,7 @@
 #include "xfs_errortag.h"
 #include "xfs_error.h"
 #include "xfs_sysfs.h"
+#include "xfs_inode.h"
 
 #ifdef DEBUG
 
@@ -372,3 +373,39 @@ xfs_verifier_error(
 	if (xfs_error_level >= XFS_ERRLEVEL_HIGH)
 		xfs_stack_trace();
 }
+
+/*
+ * Warnings for inode corruption problems.  Don't bother with the stack
+ * trace unless the error level is turned up high.
+ */
+void
+xfs_inode_verifier_error(
+	struct xfs_inode	*ip,
+	int			error,
+	const char		*name,
+	void			*buf,
+	size_t			bufsz,
+	xfs_failaddr_t		failaddr)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	xfs_failaddr_t		fa;
+	int			sz;
+
+	fa = failaddr ? failaddr : __return_address;
+
+	xfs_alert(mp, "Metadata %s detected at %pS, inode 0x%llx %s",
+		  error == -EFSBADCRC ? "CRC error" : "corruption",
+		  fa, ip->i_ino, name);
+
+	xfs_alert(mp, "Unmount and run xfs_repair");
+
+	if (buf && xfs_error_level >= XFS_ERRLEVEL_LOW) {
+		sz = min_t(size_t, XFS_CORRUPTION_DUMP_LEN, bufsz);
+		xfs_alert(mp, "First %d bytes of corrupted metadata buffer:",
+				sz);
+		xfs_hex_dump(buf, sz);
+	}
+
+	if (xfs_error_level >= XFS_ERRLEVEL_HIGH)
+		xfs_stack_trace();
+}
diff --git a/fs/xfs/xfs_error.h b/fs/xfs/xfs_error.h
index a3ba05b..7e728c5 100644
--- a/fs/xfs/xfs_error.h
+++ b/fs/xfs/xfs_error.h
@@ -28,6 +28,9 @@ extern void xfs_corruption_error(const char *tag, int level,
 			int linenum, xfs_failaddr_t failaddr);
 extern void xfs_verifier_error(struct xfs_buf *bp, int error,
 			xfs_failaddr_t failaddr);
+extern void xfs_inode_verifier_error(struct xfs_inode *ip, int error,
+			const char *name, void *buf, size_t bufsz,
+			xfs_failaddr_t failaddr);
 
 #define	XFS_ERROR_REPORT(e, lvl, mp)	\
 	xfs_error_report(e, lvl, mp, __FILE__, __LINE__, __return_address)
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index fc118dd..c60efec 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3502,21 +3502,23 @@ bool
 xfs_inode_verify_forks(
 	struct xfs_inode	*ip)
 {
+	struct xfs_ifork	*ifp;
 	xfs_failaddr_t		fa;
 
 	fa = xfs_ifork_verify_data(ip, &xfs_default_ifork_ops);
 	if (fa) {
-		xfs_alert(ip->i_mount,
-				"%s: bad inode %llu inline data fork at %pS",
-				__func__, ip->i_ino, fa);
+		ifp = XFS_IFORK_PTR(ip, XFS_DATA_FORK);
+		xfs_inode_verifier_error(ip, -EFSCORRUPTED, "data fork",
+				ifp->if_u1.if_data, ifp->if_bytes, fa);
 		return false;
 	}
 
 	fa = xfs_ifork_verify_attr(ip, &xfs_default_ifork_ops);
 	if (fa) {
-		xfs_alert(ip->i_mount,
-				"%s: bad inode %llu inline attr fork at %pS",
-				__func__, ip->i_ino, fa);
+		ifp = XFS_IFORK_PTR(ip, XFS_ATTR_FORK);
+		xfs_inode_verifier_error(ip, -EFSCORRUPTED, "attr fork",
+				ifp ? ifp->if_u1.if_data : NULL,
+				ifp ? ifp->if_bytes : 0, fa);
 		return false;
 	}
 	return true;



* [PATCH 11/11] xfs: don't clobber inobt/finobt cursors when xref with rmap
  2018-01-24  2:17 [PATCH 00/11] xfs: reflink/scrub/quota fixes Darrick J. Wong
                   ` (9 preceding siblings ...)
  2018-01-24  2:19 ` [PATCH 10/11] xfs: refactor inode verifier corruption error printing Darrick J. Wong
@ 2018-01-24  2:19 ` Darrick J. Wong
  2018-01-26  9:10   ` Christoph Hellwig
  2018-01-25  5:26 ` [PATCH 12/11] xfs: refactor quota code in xfs_bmap_btalloc Darrick J. Wong
  11 siblings, 1 reply; 60+ messages in thread
From: Darrick J. Wong @ 2018-01-24  2:19 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Even if we can't use the inobt/finobt cursors to count the number of
inode btree blocks, we must never clobber the cursor of the btree being
checked, so stop passing those cursors to xfs_scrub_should_check_xref.
Found by fuzzing level = ones in xfs/364.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/scrub/ialloc.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/scrub/ialloc.c b/fs/xfs/scrub/ialloc.c
index 21c850a..63ab3f9 100644
--- a/fs/xfs/scrub/ialloc.c
+++ b/fs/xfs/scrub/ialloc.c
@@ -391,12 +391,12 @@ xfs_scrub_iallocbt_xref_rmap_btreeblks(
 
 	/* Check that we saw as many inobt blocks as the rmap says. */
 	error = xfs_btree_count_blocks(sc->sa.ino_cur, &inobt_blocks);
-	if (!xfs_scrub_should_check_xref(sc, &error, &sc->sa.ino_cur))
+	if (!xfs_scrub_process_error(sc, 0, 0, &error))
 		return;
 
 	if (sc->sa.fino_cur) {
 		error = xfs_btree_count_blocks(sc->sa.fino_cur, &finobt_blocks);
-		if (!xfs_scrub_should_check_xref(sc, &error, &sc->sa.fino_cur))
+		if (!xfs_scrub_process_error(sc, 0, 0, &error))
 			return;
 	}
 



* Re: [PATCH 01/11] xfs: reflink should break pnfs leases before sharing blocks
  2018-01-24  2:18 ` [PATCH 01/11] xfs: reflink should break pnfs leases before sharing blocks Darrick J. Wong
@ 2018-01-24 14:16   ` Brian Foster
  2018-01-26  9:06   ` Christoph Hellwig
  1 sibling, 0 replies; 60+ messages in thread
From: Brian Foster @ 2018-01-24 14:16 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Jan 23, 2018 at 06:18:03PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Before we share blocks between files, we need to break the pnfs leases
> on the layout before we start slicing and dicing the block map.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_reflink.c |   48 +++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 47 insertions(+), 1 deletion(-)
> 
> 
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index 47aea2e..f89a725 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -1245,6 +1245,50 @@ xfs_reflink_remap_blocks(
>  }
>  
>  /*
> + * Grab the exclusive iolock for a data copy from src to dest, making
> + * sure to abide vfs locking order (lowest pointer value goes first) and
> + * breaking the pnfs layout leases on dest before proceeding.  The loop
> + * is needed because we cannot call the blocking break_layout() with the
> + * src iolock held, and therefore have to back out both locks.
> + */
> +static int
> +xfs_iolock_two_inodes_and_break_layout(
> +	struct inode		*src,
> +	struct inode		*dest)
> +{
> +	bool			src_first = src < dest;
> +	bool			src_last = src > dest;
> +	int			error;
> +
> +retry:
> +	if (src_first) {
> +		inode_lock(src);
> +		inode_lock_nested(dest, I_MUTEX_NONDIR2);
> +	} else {
> +		inode_lock(dest);
> +	}
> +
> +	error = break_layout(dest, false);
> +	if (error == -EWOULDBLOCK) {
> +		inode_unlock(dest);
> +		if (src_first)
> +			inode_unlock(src);
> +		error = break_layout(dest, true);
> +		if (error)
> +			return error;
> +		goto retry;
> +	} else if (error) {
> +		inode_unlock(dest);
> +		if (src_first)
> +			inode_unlock(src);
> +		return error;
> +	}
> +	if (src_last)
> +		inode_lock_nested(src, I_MUTEX_NONDIR2);
> +	return 0;
> +}
> +
> +/*
>   * Link a range of blocks from one file to another.
>   */
>  int
> @@ -1274,7 +1318,9 @@ xfs_reflink_remap_range(
>  		return -EIO;
>  
>  	/* Lock both files against IO */
> -	lock_two_nondirectories(inode_in, inode_out);
> +	ret = xfs_iolock_two_inodes_and_break_layout(inode_in, inode_out);
> +	if (ret)
> +		return ret;
>  	if (same_inode)
>  		xfs_ilock(src, XFS_MMAPLOCK_EXCL);
>  	else
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
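
For readers following the locking logic in this patch, here is a minimal userspace sketch of the same back-out-and-retry pattern. It is not the kernel code: struct toy_inode, the held flag, the leases counter and toy_break_layout() are hypothetical stand-ins, and pthread mutexes replace the inode rwsem. The point is only that the lower-addressed inode is locked first, and that both locks are dropped before the blocking lease break.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for an inode; not the kernel's struct inode. */
struct toy_inode {
	pthread_mutex_t	lock;
	bool		held;	/* demo bookkeeping: true while lock is held */
	int		leases;	/* hypothetical count of outstanding leases */
};

static void toy_lock(struct toy_inode *ip)
{
	pthread_mutex_lock(&ip->lock);
	ip->held = true;
}

static void toy_unlock(struct toy_inode *ip)
{
	assert(ip->held);
	ip->held = false;
	pthread_mutex_unlock(&ip->lock);
}

/*
 * Hypothetical stand-in for break_layout(): the non-blocking form fails
 * with -EWOULDBLOCK while leases are outstanding; the blocking form
 * pretends to wait until they have all been recalled.
 */
static int toy_break_layout(struct toy_inode *ip, bool wait)
{
	if (ip->leases == 0)
		return 0;
	if (!wait)
		return -EWOULDBLOCK;
	ip->leases = 0;
	return 0;
}

/* Lock src and dest in address order, breaking leases on dest. */
static int toy_lock_two_and_break_layout(struct toy_inode *src,
		struct toy_inode *dest, int *retries)
{
	bool	src_first = (uintptr_t)src < (uintptr_t)dest;
	bool	src_last = (uintptr_t)src > (uintptr_t)dest;
	int	error;

	*retries = 0;
retry:
	if (src_first)
		toy_lock(src);
	toy_lock(dest);

	error = toy_break_layout(dest, false);
	if (error == -EWOULDBLOCK) {
		/* Must not block with locks held: back out, wait, start over. */
		toy_unlock(dest);
		if (src_first)
			toy_unlock(src);
		error = toy_break_layout(dest, true);
		if (error)
			return error;
		(*retries)++;
		goto retry;
	} else if (error) {
		toy_unlock(dest);
		if (src_first)
			toy_unlock(src);
		return error;
	}
	if (src_last)
		toy_lock(src);
	return 0;
}
```

When src sorts after dest, src is only taken once the leases on dest are known to be broken, so the back-out path never has to release it.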


* Re: [PATCH 02/11] xfs: only grab shared inode locks for source file during reflink
  2018-01-24  2:18 ` [PATCH 02/11] xfs: only grab shared inode locks for source file during reflink Darrick J. Wong
@ 2018-01-24 14:18   ` Brian Foster
  2018-01-24 18:40     ` Darrick J. Wong
  2018-01-26 12:07   ` Christoph Hellwig
  1 sibling, 1 reply; 60+ messages in thread
From: Brian Foster @ 2018-01-24 14:18 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Jan 23, 2018 at 06:18:09PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Reflink and dedupe operations remap blocks from a source file into a
> destination file.  The destination file needs exclusive locks on all
> levels because we're updating its block map, but the source file isn't
> undergoing any block map changes so we can use a shared lock.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/xfs_inode.c   |   49 ++++++++++++++++++++++++++++++++-----------------
>  fs/xfs/xfs_inode.h   |   12 +++++++++++-
>  fs/xfs/xfs_reflink.c |   26 ++++++++++++++++----------
>  3 files changed, 59 insertions(+), 28 deletions(-)
> 
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index c9e40d4..4a38cfc 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -545,24 +545,37 @@ xfs_lock_inodes(
>  }
>  
>  /*
> - * xfs_lock_two_inodes() can only be used to lock one type of lock at a time -
> - * the iolock, the mmaplock or the ilock, but not more than one at a time. If we
> - * lock more than one at a time, lockdep will report false positives saying we
> - * have violated locking orders.
> + * xfs_lock_two_inodes_separately() can only be used to lock one type of lock
> + * at a time - the mmaplock or the ilock, but not more than one type at a
> + * time. If we lock more than one at a time, lockdep will report false
> + * positives saying we have violated locking orders.  The iolock must be
> + * double-locked separately since we use i_rwsem for that.  We now support
> + * taking one lock EXCL and the other SHARED.
>   */
>  void
> -xfs_lock_two_inodes(
> -	xfs_inode_t		*ip0,
> -	xfs_inode_t		*ip1,
> -	uint			lock_mode)
> +xfs_lock_two_inodes_separately(
> +	struct xfs_inode	*ip0,
> +	uint			ip0_mode,
> +	struct xfs_inode	*ip1,
> +	uint			ip1_mode)
>  {

Nit.. but "separately" doesn't really convey meaning to me. I guess
something like xfs_lock_two_inodes_mode() is more clear to me, even
though the original version still accepted a mode. Eh, I guess even
__xfs_lock_two_inodes() might be fine (and perhaps more common
practice). Code looks fine either way:

Reviewed-by: Brian Foster <bfoster@redhat.com>

> -	xfs_inode_t		*temp;
> +	struct xfs_inode	*temp;
> +	uint			mode_temp;
>  	int			attempts = 0;
>  	xfs_log_item_t		*lp;
>  
> -	ASSERT(!(lock_mode & (XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL)));
> -	if (lock_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL))
> -		ASSERT(!(lock_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL)));
> +	ASSERT(hweight32(ip0_mode) == 1);
> +	ASSERT(hweight32(ip1_mode) == 1);
> +	ASSERT(!(ip0_mode & (XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL)));
> +	ASSERT(!(ip1_mode & (XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL)));
> +	ASSERT(!(ip0_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL)) ||
> +	       !(ip0_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL)));
> +	ASSERT(!(ip1_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL)) ||
> +	       !(ip1_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL)));
> +	ASSERT(!(ip1_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL)) ||
> +	       !(ip0_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL)));
> +	ASSERT(!(ip0_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL)) ||
> +	       !(ip1_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL)));
>  
>  	ASSERT(ip0->i_ino != ip1->i_ino);
>  
> @@ -570,10 +583,13 @@ xfs_lock_two_inodes(
>  		temp = ip0;
>  		ip0 = ip1;
>  		ip1 = temp;
> +		mode_temp = ip0_mode;
> +		ip0_mode = ip1_mode;
> +		ip1_mode = mode_temp;
>  	}
>  
>   again:
> -	xfs_ilock(ip0, xfs_lock_inumorder(lock_mode, 0));
> +	xfs_ilock(ip0, xfs_lock_inumorder(ip0_mode, 0));
>  
>  	/*
>  	 * If the first lock we have locked is in the AIL, we must TRY to get
> @@ -582,18 +598,17 @@ xfs_lock_two_inodes(
>  	 */
>  	lp = (xfs_log_item_t *)ip0->i_itemp;
>  	if (lp && (lp->li_flags & XFS_LI_IN_AIL)) {
> -		if (!xfs_ilock_nowait(ip1, xfs_lock_inumorder(lock_mode, 1))) {
> -			xfs_iunlock(ip0, lock_mode);
> +		if (!xfs_ilock_nowait(ip1, xfs_lock_inumorder(ip1_mode, 1))) {
> +			xfs_iunlock(ip0, ip0_mode);
>  			if ((++attempts % 5) == 0)
>  				delay(1); /* Don't just spin the CPU */
>  			goto again;
>  		}
>  	} else {
> -		xfs_ilock(ip1, xfs_lock_inumorder(lock_mode, 1));
> +		xfs_ilock(ip1, xfs_lock_inumorder(ip1_mode, 1));
>  	}
>  }
>  
> -
>  void
>  __xfs_iflock(
>  	struct xfs_inode	*ip)
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index 386b0bb..ff56486 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -423,7 +423,17 @@ void		xfs_iunpin_wait(xfs_inode_t *);
>  #define xfs_ipincount(ip)	((unsigned int) atomic_read(&ip->i_pincount))
>  
>  int		xfs_iflush(struct xfs_inode *, struct xfs_buf **);
> -void		xfs_lock_two_inodes(xfs_inode_t *, xfs_inode_t *, uint);
> +void		xfs_lock_two_inodes_separately(struct xfs_inode *ip0,
> +				uint ip0_mode, struct xfs_inode *ip1,
> +				uint ip1_mode);
> +static inline void
> +xfs_lock_two_inodes(
> +	struct xfs_inode	*ip0,
> +	struct xfs_inode	*ip1,
> +	uint			lock_mode)
> +{
> +	xfs_lock_two_inodes_separately(ip0, lock_mode, ip1, lock_mode);
> +}
>  
>  xfs_extlen_t	xfs_get_extsz_hint(struct xfs_inode *ip);
>  xfs_extlen_t	xfs_get_cowextsz_hint(struct xfs_inode *ip);
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index f89a725..f5a43b2 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -1202,13 +1202,16 @@ xfs_reflink_remap_blocks(
>  
>  	/* drange = (destoff, destoff + len); srange = (srcoff, srcoff + len) */
>  	while (len) {
> +		uint		lock_mode;
> +
>  		trace_xfs_reflink_remap_blocks_loop(src, srcoff, len,
>  				dest, destoff);
> +
>  		/* Read extent from the source file */
>  		nimaps = 1;
> -		xfs_ilock(src, XFS_ILOCK_EXCL);
> +		lock_mode = xfs_ilock_data_map_shared(src);
>  		error = xfs_bmapi_read(src, srcoff, len, &imap, &nimaps, 0);
> -		xfs_iunlock(src, XFS_ILOCK_EXCL);
> +		xfs_iunlock(src, lock_mode);
>  		if (error)
>  			goto err;
>  		ASSERT(nimaps == 1);
> @@ -1262,7 +1265,7 @@ xfs_iolock_two_inodes_and_break_layout(
>  
>  retry:
>  	if (src_first) {
> -		inode_lock(src);
> +		inode_lock_shared(src);
>  		inode_lock_nested(dest, I_MUTEX_NONDIR2);
>  	} else {
>  		inode_lock(dest);
> @@ -1272,7 +1275,7 @@ xfs_iolock_two_inodes_and_break_layout(
>  	if (error == -EWOULDBLOCK) {
>  		inode_unlock(dest);
>  		if (src_first)
> -			inode_unlock(src);
> +			inode_unlock_shared(src);
>  		error = break_layout(dest, true);
>  		if (error)
>  			return error;
> @@ -1280,11 +1283,11 @@ xfs_iolock_two_inodes_and_break_layout(
>  	} else if (error) {
>  		inode_unlock(dest);
>  		if (src_first)
> -			inode_unlock(src);
> +			inode_unlock_shared(src);
>  		return error;
>  	}
>  	if (src_last)
> -		inode_lock_nested(src, I_MUTEX_NONDIR2);
> +		down_read_nested(&src->i_rwsem, I_MUTEX_NONDIR2);
>  	return 0;
>  }
>  
> @@ -1324,7 +1327,8 @@ xfs_reflink_remap_range(
>  	if (same_inode)
>  		xfs_ilock(src, XFS_MMAPLOCK_EXCL);
>  	else
> -		xfs_lock_two_inodes(src, dest, XFS_MMAPLOCK_EXCL);
> +		xfs_lock_two_inodes_separately(src, XFS_MMAPLOCK_SHARED,
> +				dest, XFS_MMAPLOCK_EXCL);
>  
>  	/* Check file eligibility and prepare for block sharing. */
>  	ret = -EINVAL;
> @@ -1387,10 +1391,12 @@ xfs_reflink_remap_range(
>  			is_dedupe);
>  
>  out_unlock:
> -	xfs_iunlock(src, XFS_MMAPLOCK_EXCL);
> +	xfs_iunlock(dest, XFS_MMAPLOCK_EXCL);
> +	if (!same_inode)
> +		xfs_iunlock(src, XFS_MMAPLOCK_SHARED);
> +	inode_unlock(inode_out);
>  	if (!same_inode)
> -		xfs_iunlock(dest, XFS_MMAPLOCK_EXCL);
> -	unlock_two_nondirectories(inode_in, inode_out);
> +		inode_unlock_shared(inode_in);
>  	if (ret)
>  		trace_xfs_reflink_remap_range_error(dest, ret, _RET_IP_);
>  	return ret;
> 
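
The key detail in the xfs_inode.c hunk is that each lock mode has to travel with its inode when the pair is reordered. A small standalone sketch of that, plus the mode sanity checks, might look like the following. The TOY_* flag values, struct toy_lock_req and the helper names are all made up for illustration, and the sketch assumes the lower inode number is locked first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag values; the real XFS_*LOCK_* bits differ. */
#define TOY_IOLOCK_EXCL		(1u << 0)
#define TOY_IOLOCK_SHARED	(1u << 1)
#define TOY_ILOCK_EXCL		(1u << 2)
#define TOY_ILOCK_SHARED	(1u << 3)
#define TOY_MMAPLOCK_EXCL	(1u << 4)
#define TOY_MMAPLOCK_SHARED	(1u << 5)

#define TOY_IOLOCK	(TOY_IOLOCK_EXCL | TOY_IOLOCK_SHARED)
#define TOY_MMAPLOCK	(TOY_MMAPLOCK_EXCL | TOY_MMAPLOCK_SHARED)

/* One half of a two-inode lock request (hypothetical type). */
struct toy_lock_req {
	uint64_t	ino;
	uint32_t	mode;
};

/* Portable population count, standing in for the kernel's hweight32(). */
static unsigned int toy_hweight32(uint32_t v)
{
	unsigned int	n = 0;

	for (; v; v &= v - 1)
		n++;
	return n;
}

/*
 * Each mode must name exactly one lock, neither may be the iolock, and
 * both must be the same lock type (both mmaplock or both ilock).
 */
static bool toy_modes_valid(uint32_t mode0, uint32_t mode1)
{
	if (toy_hweight32(mode0) != 1 || toy_hweight32(mode1) != 1)
		return false;
	if ((mode0 | mode1) & TOY_IOLOCK)
		return false;
	return !!(mode0 & TOY_MMAPLOCK) == !!(mode1 & TOY_MMAPLOCK);
}

/*
 * Sort the pair so the lower inode number comes first.  Swapping whole
 * requests keeps every mode attached to the inode it was given for.
 */
static void toy_order_lock_pair(struct toy_lock_req *first,
		struct toy_lock_req *second)
{
	assert(first->ino != second->ino);
	assert(toy_modes_valid(first->mode, second->mode));

	if (first->ino > second->ino) {
		struct toy_lock_req	tmp = *first;

		*first = *second;
		*second = tmp;
	}
}
```

Swapping only the inode pointers and not the modes would silently take the source EXCL and the destination SHARED whenever the source has the higher inode number.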


* Re: [PATCH 03/11] xfs: call xfs_qm_dqattach before performing reflink operations
  2018-01-24  2:18 ` [PATCH 03/11] xfs: call xfs_qm_dqattach before performing reflink operations Darrick J. Wong
@ 2018-01-24 14:18   ` Brian Foster
  2018-01-26  9:07   ` Christoph Hellwig
  1 sibling, 0 replies; 60+ messages in thread
From: Brian Foster @ 2018-01-24 14:18 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Jan 23, 2018 at 06:18:17PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Ensure that we've attached all the necessary dquots before performing
> reflink operations so that quota accounting is accurate.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_reflink.c |    5 +++++
>  1 file changed, 5 insertions(+)
> 
> 
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index f5a43b2..82abff6 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -1345,6 +1345,11 @@ xfs_reflink_remap_range(
>  	if (ret <= 0)
>  		goto out_unlock;
>  
> +	/* Attach dquots to dest inode before changing block map */
> +	ret = xfs_qm_dqattach(dest, 0);
> +	if (ret)
> +		goto out_unlock;
> +
>  	trace_xfs_reflink_remap_range(src, pos_in, len, dest, pos_out);
>  
>  	/*
> 


* Re: [PATCH 04/11] xfs: CoW fork operations should only update quota reservations
  2018-01-24  2:18 ` [PATCH 04/11] xfs: CoW fork operations should only update quota reservations Darrick J. Wong
@ 2018-01-24 14:22   ` Brian Foster
  2018-01-24 19:14     ` Darrick J. Wong
  2018-01-25  1:20   ` [PATCH v2 " Darrick J. Wong
  1 sibling, 1 reply; 60+ messages in thread
From: Brian Foster @ 2018-01-24 14:22 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Jan 23, 2018 at 06:18:23PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Since the CoW fork only exists in memory, it is incorrect to update the
> on-disk quota block counts when we modify the CoW fork.  Unlike the data
> fork, even real extents in the CoW fork are only reservations (on-disk
> they're owned by the refcountbt) so they must not be tracked in the on
> disk quota info.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_bmap.c |  203 ++++++++++++++++++++++++++++++++++++++++++++--
>  fs/xfs/xfs_reflink.c     |    8 +-
>  2 files changed, 196 insertions(+), 15 deletions(-)
> 
> 

Mostly comments on the comment so far... I still have to grok the code,
but working through to this point made my brain hurt a bit so I'm
sending what I have so far. ;P

> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 6e6f3cb..e3e8f7c 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -52,6 +52,145 @@
>  #include "xfs_refcount.h"
>  #include "xfs_icache.h"
>  
> +/*
> + * Data/Attr Fork Mapping Lifecycle
> + *
> + * The data fork contains the block mappings between logical blocks in a file
> + * and physical blocks on the disk.  The XFS notions of delayed allocation
> + * reservations, unwritten extents, and real extents follow well known
> + * conventions in the filesystem world.
> + *
> + * As a side note, the attribute fork does the same for extended attribute
> + * blocks, though the logical block offsets are not available to userspace and
> + * the only valid states are HOLE and REAL.
> + *
> + * Metadata involved outside of the block mapping itself are as follows:
> + *
> + * - i_delayed_blks: Number of blocks that are reserved for delayed allocation.
> + * - i_cow_blocks: Number of blocks reserved for copy on write staging.
> + *

I know it's implied by some of the field names, but I think it would be
useful to point out what data structures these are associated with.
Also, i_cow_blocks technically doesn't exist yet.. right?

> + * - di_nblocks: Number of blocks (on-disk) assigned to the inode.
> + *

Hmm, some of these are self-explanatory or already documented where they
are defined. I'm wondering if we really need to repeat descriptions for
all of these (as opposed to ensuring they are all sufficiently described
where they are defined). That also saves us from having to keep field
names and whatnot synced up in an unexpected source location.

> + * - d_bcount: Number of quota blocks accounted for by on-disk metadata.
> + * - q_res_bcount: Number of quota blocks reserved in-core for future writes +
> + *           blocks mentioned by on-disk metadata.
> + *
> + * - qt_blk_res: Number of quota blocks reserved in-core for this transaction.
> + *           Unused reservation is given back to q_res_bcount on commit.
> + * - qt_bcount: Number of quota blocks used by this transaction from
> + *           qt_blk_res.  d_bcount is increased by this on commit.
> + * - qt_delbcount: Number of quota blocks used by this transaction from
> + *           q_res_bcount but not q_res_bcount.  d_bcount is increased by this
> + *           on commit.
> + *

"from q_res_bcount but not q_res_bcount" ?

Also: qt_bcount_delta, qt_delbcount_delta?

> + * - sb_fdblocks: Number of free blocks recorded in the superblock on disk.
> + * - fdblocks: Number of free blocks recorded in the superblock minus any
> + *           in-core reservations made in anticipation of future writes.
> + *
> + * - t_blk_res: Number of blocks reserved out of fdblocks for a transaction.
> + *           When the transaction commits, t_blk_res - t_blk_res_used is given
> + *           back to fdblocks.
> + * - t_blk_res_used: Number of blocks used by this transaction that were
> + *           reserved for this transaction.
> + * - t_fdblocks_del: Number of blocks by which fdblocks and sb_fdblocks will
> + *           have to decrease at commit.
> + * - t_res_fdblocks_delta: Number of blocks by which sb_fdblocks will have to
> + *           decrease at commit.  We assume that fdblocks was decreased
> + *           prior to the transaction.
> + *
> + * Data fork block mappings have four logical states:
> + *
> + *    +--------> UNWRITTEN <------+
> + *    |              ^            |
> + *    |              v            v
> + * DELALLOC <----> HOLE <------> REAL
> + *    |                           ^
> + *    |                           |
> + *    +---------------------------+
> + *

I'm not sure we need a graphic for the extent states. Converting a
non-hole extent to delalloc is the only transition that doesn't make
sense.

> + * The state transitions and required metadata updates are as follows:
> + *
> + * - HOLE to DELALLOC: Increase i_delayed_blks and q_res_bcount, and decrease
> + *           fdblocks.
> + * - HOLE to REAL: Increase di_nblocks and qt_bcount, and decrease fdblocks.
> + * - HOLE to UNWRITTEN: Same as above.
> + *
> + * - DELALLOC to UNWRITTEN: Increase di_nblocks and qt_delbcount, and decrease
> + *           i_delayed_blks.
> + * - DELALLOC to REAL: Same as above.
> + * - DELALLOC to HOLE: Increase fdblocks, and decrease i_delayed_blks and
> + *           q_res_bcount.
> + *
> + * - UNWRITTEN to HOLE: Decrease di_nblocks and q_bcount, and increase fdblocks.
> + * - UNWRITTEN to REAL: No change.
> + *
> + * - REAL to UNWRITTEN: No change.
> + * - REAL to HOLE: Decrease di_nblocks and q_bcount, and increase fdblocks.
> + *
> + * Note in particular that delalloc reservations have "transaction-less"
> + * quota reservations via q_res_bcount.  If the reservation is allocated,
> + * qt_delbcount is used to increment d_bcount without touching q_res_bcount.
> + * Filling a hole with an allocated extent, by contrast, uses qt_blk_res
> + * to make a reservation in q_res_bcount, qt_bcount to record the number
> + * of allocated blocks; at commit qt_bcount is added to d_bcount and
> + * qt_blk_res - qt_bcount is added back to q_res_bcount.
> + *

While I agree with the intent/usefulness of a big comment around how the
quota accounting works, this all kind of reads like a braindump so far.
E.g., this state changes these fields, this field represents how much to
change some other field, etc. While I'm sure that is useful information
to work out the problem being resolved here, it doesn't explain much how
everything works. IOW, I still feel like I need to go trace through the
code to understand the comment, as opposed to the comment helping me
trace through the code.

I think it would be better to have a high level description about how
the quota accounting works with respect to the transaction/extent
states, what high-level data structures are involved and how they
relate, etc., rather than what seems essentially to be a chart that maps
states to fields. 

I'm also wondering how much of the whole picture really needs to be
described here to cover quota accounting with respect to block mapping
(sufficiently to differentiate data fork from cow fork, which I take is
the purpose of this section). For example, do we really need the
internal transaction details for how quota deltas are carried?

Instead, I think it might be sufficient to explain that the quota system
works in two "levels" (for lack of a better term :/), one for
reservation and another for real block usage. The reservation is
associated with transaction reservation and/or delayed block allocation
(no tx). In either case, quota reservation is converted to real quota
block usage when a transaction commits that maps real/physical blocks.
If the transaction held extra reservation that went unused, that quota
reservation is released. The primary difference is that transactions to
convert delalloc -> real do not reserve quota blocks in the first place,
since that has already occurred, so they just need to make sure to
convert/persist the quota res for the blocks that were converted to
real.
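
To make the two "levels" concrete, here is a toy model in plain C. The field names are borrowed from the comment under review, but the structs, the helpers and the exact commit arithmetic are a simplification for illustration and are not the kernel's dquot code.

```c
#include <assert.h>

/* Toy dquot: real usage plus everything reserved against the quota. */
struct toy_dquot {
	long	d_bcount;	/* blocks charged as real usage */
	long	q_res_bcount;	/* real usage + outstanding reservations */
};

/* Toy per-transaction quota deltas. */
struct toy_qtrx {
	long	qt_blk_res;	/* blocks reserved for this transaction */
	long	qt_bcount;	/* blocks used out of qt_blk_res */
	long	qt_delbcount;	/* delalloc blocks converted to real */
};

/* Delalloc: reserve quota with no transaction involved. */
static void toy_delalloc_reserve(struct toy_dquot *dq, long nblks)
{
	dq->q_res_bcount += nblks;
}

/* Transaction reservation: reserve up front, settle at commit. */
static void toy_trans_reserve(struct toy_dquot *dq, struct toy_qtrx *tx,
		long nblks)
{
	dq->q_res_bcount += nblks;
	tx->qt_blk_res += nblks;
}

/*
 * Commit: blocks that became real are charged to d_bcount, and the part
 * of the transaction reservation that went unused is released.  Blocks
 * converted from delalloc were reserved earlier, so q_res_bcount is left
 * alone for them.
 */
static void toy_trans_commit(struct toy_dquot *dq, struct toy_qtrx *tx)
{
	assert(tx->qt_bcount <= tx->qt_blk_res);

	dq->d_bcount += tx->qt_bcount + tx->qt_delbcount;
	dq->q_res_bcount -= tx->qt_blk_res - tx->qt_bcount;
	*tx = (struct toy_qtrx){ 0 };
}
```

In this model, filling a hole reserves in the transaction and then bumps qt_bcount, while a delalloc conversion only bumps qt_delbcount. Either way the two counters agree once nothing is left outstanding.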

> + * Copy on Write Fork Mapping Lifecycle
> + *
> + * The CoW fork handles things differently from the data fork because its
> + * mappings only exist in memory-- the refcount btree is the on-disk owner of
> + * the extents until they're remapped into the data fork.  Therefore,
> + * unwritten and real extents in the CoW fork are treated the same way as
> + * delayed allocation extents.  Quota and fdblock changes only exist in
> + * memory, which requires some twists in the bmap functions.
> + *

Ok, but perhaps this should point out what happens when cow blocks are
reserved with respect to quotas..? IIUC, a delalloc quota reservation
occurs just the same as above, the blocks simply reside in another fork.

> + * The CoW fork extent state diagram looks like this:
> + *
> + *    +--------> UNWRITTEN -------+
> + *    |              ^            |
> + *    |              v            v
> + * DELALLOC <----> HOLE <------- REAL
> + *
> + * Holes are still holes.  Delayed allocation extents reserve blocks for
> + * landing future writes, just like they do in the data fork.  However, unlike
> + * the data fork, unwritten extents signal an extent that has been allocated
> + * but is not currently undergoing writeback.  Real extents are undergoing
> + * writeback, and when that writeback finishes the corresponding data fork
> + * extent will be punched out and the CoW fork counterpart moved to the new
> + * hole in the data fork.
> + *

Ok, so the difference is that for the COW fork, the _extent state_
conversion is not the appropriate trigger event to convert quota
reservation to real quota usage. Instead, that should occur when the
blocks are remapped from the COW fork to the data fork.

> + * The state transitions and required metadata updates are as follows:
> + *
> + * - HOLE to DELALLOC: Increase i_cow_blocks and q_res_bcount, and decrease
> + *           fdblocks.
> + * - HOLE to UNWRITTEN: Same as above, but since we reserved quota via
> + *           qt_blk_res (which increased q_res_bcount) when we allocate the
> + *           extent we have to decrease qt_blk_res so that the commit doesn't
> + *           give the allocated CoW blocks back.
> + *

Hmm, this is a little confusing. Looking at the code change and comment
below, I think I get what this is trying to do, which is essentially
make a real block cow fork alloc behave like a delalloc reservation
(with respect to quota). FWIW, I think what confuses me is the assertion
that the blocks would be "given back" otherwise. The only reference I
have to compare is data fork alloc behavior, which implies that used
reservation would not be given back, but rather converted to real quota
usage on block allocation (and excess reservation would still be given
back, which afaict we still want to happen). So the trickery is required
to prevent conversion of quota reservation for the allocated cow blocks,
let that res sit around until the cow blocks are remapped, and release
unused reservation from the tx as normal. Am I following that correctly?
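
A toy model of that reading, for what it's worth. This is a sketch under simplifying assumptions, not the kernel implementation: the structs and helpers are hypothetical, and the commit rule is reduced to "charge qt_bcount, release qt_blk_res - qt_bcount".

```c
/* Toy dquot: real usage plus everything reserved against the quota. */
struct toy_dquot {
	long	d_bcount;	/* blocks charged as real usage */
	long	q_res_bcount;	/* real usage + outstanding reservations */
};

/* Toy per-transaction quota deltas. */
struct toy_qtrx {
	long	qt_blk_res;	/* blocks reserved for this transaction */
	long	qt_bcount;	/* blocks charged as real usage at commit */
};

static void toy_trans_reserve(struct toy_dquot *dq, struct toy_qtrx *tx,
		long nblks)
{
	dq->q_res_bcount += nblks;
	tx->qt_blk_res += nblks;
}

/*
 * Real-block allocation into the CoW fork: shrink the transaction's
 * reservation without touching q_res_bcount, so that commit releases
 * only the remainder and len blocks stay reserved in-core.
 */
static void toy_cow_alloc(struct toy_qtrx *tx, long len)
{
	tx->qt_blk_res -= len;
}

/* Commit: charge real usage, release whatever reservation is left over. */
static void toy_trans_commit(struct toy_dquot *dq, struct toy_qtrx *tx)
{
	dq->d_bcount += tx->qt_bcount;
	dq->q_res_bcount -= tx->qt_blk_res - tx->qt_bcount;
	tx->qt_blk_res = 0;
	tx->qt_bcount = 0;
}

/*
 * Remap after writeback: the data fork mapping is charged as real usage
 * in the transaction, and the in-core CoW reservation goes away along
 * with the CoW fork mapping.
 */
static void toy_cow_remap(struct toy_dquot *dq, struct toy_qtrx *tx,
		long len)
{
	tx->qt_bcount += len;
	dq->q_res_bcount -= len;
}

/* Cancelled CoW extent: only the in-core reservation needs dropping. */
static void toy_cow_cancel(struct toy_dquot *dq, long len)
{
	dq->q_res_bcount -= len;
}
```

In this model d_bcount never moves until toy_cow_remap() has been committed, which matches the idea that the remap, not the extent state change, is the trigger for real quota usage.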

> + * - DELALLOC to UNWRITTEN: No change.
> + * - DELALLOC to HOLE: Decrease i_cow_blocks and q_res_bcount, and increase
> + *           fdblocks.
> + *
> + * - UNWRITTEN to HOLE: Same as DELALLOC to HOLE.
> + * - UNWRITTEN to REAL: No change.
> + *
> + * - REAL to HOLE: This transition happens when we've finished a write
> + *           operation and need to move the mapping to the data fork.  We
> + *           punch the corresponding data fork mappings, which decreases
> + *           qt_bcount.  Then we map the CoW fork mapping into the hole we
> + *           just cleared out of the data fork, which increases qt_bcount.
> + *           There's a subtlety here -- if we promoted a write over a hole to
> + *           CoW, there will be a net increase in qt_bcount, which is fine
> + *           because we already reserved the quota when we filled the CoW
> + *           fork.  Finally, we punch the CoW fork mapping, which decreases
> + *           q_res_bcount.
> + *
> + * Notice how all CoW fork extents use transactionless quota reservations and
> + * the in-core fdblocks to maintain state, and we avoid updating any on-disk
> + * metadata.  This is essential to maintain metadata correctness if the system
> + * goes down.
> + */
>  
>  kmem_zone_t		*xfs_bmap_free_item_zone;
>  
> @@ -3337,6 +3476,39 @@ xfs_bmap_btalloc_filestreams(
>  	return 0;
>  }
>  
> +/* Deal with CoW fork accounting when we allocate a block. */
> +static void
> +xfs_bmap_btalloc_cow(
> +	struct xfs_bmalloca	*ap,
> +	struct xfs_alloc_arg	*args)
> +{
> +	/* Filling a previously reserved extent; nothing to do here. */
> +	if (ap->wasdel)
> +		return;
> +
> +	/*
> +	 * The CoW fork only exists in memory, so the on-disk quota accounting
> +	 * must not include any CoW fork extents.  Therefore, CoW blocks are
> +	 * only tracked in the in-core dquot block count (q_res_bcount).
> +	 *
> +	 * If we get here, we're filling a CoW hole with a real (non-delalloc)
> +	 * CoW extent having reserved enough blocks from both q_res_bcount and
> +	 * qt_blk_res to guarantee that we won't run out of space.  The unused
> +	 * qt_blk_res is given back to q_res_bcount when the transaction
> +	 * commits.
> +	 *
> +	 * We don't want the quota accounting for our newly allocated blocks
> +	 * to be given back, so we must decrease qt_blk_res without decreasing
> +	 * q_res_bcount.
> +	 *
> +	 * Note: If we're allocating a delalloc extent, we already reserved
> +	 * the q_res_bcount blocks, so no quota accounting update is needed
> +	 * here.
> +	 */
> +	xfs_trans_mod_dquot_byino(ap->tp, ap->ip, XFS_TRANS_DQ_RES_BLKS,
> +			-(long)args->len);
> +}

Factoring nit.. if we're going to refactor bits of xfs_bmap_btalloc()
out, it might be cleaner to factor out all of the quota logic rather
than just the cow bits (which is basically just a simple check and
function call). E.g., refactor into an xfs_bmap_btalloc_quota() helper
that does the right thing based on the fork, with comments as to why,
etc. (and perhaps just leave the unrelated di_nblocks change behind).

Brian

> +
>  STATIC int
>  xfs_bmap_btalloc(
>  	struct xfs_bmalloca	*ap)	/* bmap alloc argument struct */
> @@ -3571,19 +3743,22 @@ xfs_bmap_btalloc(
>  			*ap->firstblock = args.fsbno;
>  		ASSERT(nullfb || fb_agno <= args.agno);
>  		ap->length = args.len;
> -		if (!(ap->flags & XFS_BMAPI_COWFORK))
> -			ap->ip->i_d.di_nblocks += args.len;
> -		xfs_trans_log_inode(ap->tp, ap->ip, XFS_ILOG_CORE);
>  		if (ap->wasdel)
>  			ap->ip->i_delayed_blks -= args.len;
> -		/*
> -		 * Adjust the disk quota also. This was reserved
> -		 * earlier.
> -		 */
> -		xfs_trans_mod_dquot_byino(ap->tp, ap->ip,
> -			ap->wasdel ? XFS_TRANS_DQ_DELBCOUNT :
> -					XFS_TRANS_DQ_BCOUNT,
> -			(long) args.len);
> +		if (ap->flags & XFS_BMAPI_COWFORK) {
> +			xfs_bmap_btalloc_cow(ap, &args);
> +		} else {
> +			ap->ip->i_d.di_nblocks += args.len;
> +			xfs_trans_log_inode(ap->tp, ap->ip, XFS_ILOG_CORE);
> +			/*
> +			 * Adjust the disk quota also. This was reserved
> +			 * earlier.
> +			 */
> +			xfs_trans_mod_dquot_byino(ap->tp, ap->ip,
> +				ap->wasdel ? XFS_TRANS_DQ_DELBCOUNT :
> +						XFS_TRANS_DQ_BCOUNT,
> +				(long) args.len);
> +		}
>  	} else {
>  		ap->blkno = NULLFSBLOCK;
>  		ap->length = 0;
> @@ -4776,6 +4951,7 @@ xfs_bmap_del_extent_cow(
>  	struct xfs_bmbt_irec	new;
>  	xfs_fileoff_t		del_endoff, got_endoff;
>  	int			state = BMAP_COWFORK;
> +	int			error;
>  
>  	XFS_STATS_INC(mp, xs_del_exlist);
>  
> @@ -4832,6 +5008,11 @@ xfs_bmap_del_extent_cow(
>  		xfs_iext_insert(ip, icur, &new, state);
>  		break;
>  	}
> +
> +	/* Remove the quota reservation */
> +	error = xfs_trans_reserve_quota_nblks(NULL, ip,
> +			-(long)del->br_blockcount, 0, XFS_QMOPT_RES_REGBLKS);
> +	ASSERT(error == 0);
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index 82abff6..e367351 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -599,10 +599,6 @@ xfs_reflink_cancel_cow_blocks(
>  					del.br_startblock, del.br_blockcount,
>  					NULL);
>  
> -			/* Update quota accounting */
> -			xfs_trans_mod_dquot_byino(*tpp, ip, XFS_TRANS_DQ_BCOUNT,
> -					-(long)del.br_blockcount);
> -
>  			/* Roll the transaction */
>  			xfs_defer_ijoin(&dfops, ip);
>  			error = xfs_defer_finish(tpp, &dfops);
> @@ -795,6 +791,10 @@ xfs_reflink_end_cow(
>  		if (error)
>  			goto out_defer;
>  
> +		/* Charge this new data fork mapping to the on-disk quota. */
> +		xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT,
> +				(long)del.br_blockcount);
> +
>  		/* Remove the mapping from the CoW fork. */
>  		xfs_bmap_del_extent_cow(ip, &icur, &got, &del);
>  
> 


* Re: [PATCH 02/11] xfs: only grab shared inode locks for source file during reflink
  2018-01-24 14:18   ` Brian Foster
@ 2018-01-24 18:40     ` Darrick J. Wong
  0 siblings, 0 replies; 60+ messages in thread
From: Darrick J. Wong @ 2018-01-24 18:40 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Jan 24, 2018 at 09:18:21AM -0500, Brian Foster wrote:
> On Tue, Jan 23, 2018 at 06:18:09PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Reflink and dedupe operations remap blocks from a source file into a
> > destination file.  The destination file needs exclusive locks on all
> > levels because we're updating its block map, but the source file isn't
> > undergoing any block map changes so we can use a shared lock.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/xfs_inode.c   |   49 ++++++++++++++++++++++++++++++++-----------------
> >  fs/xfs/xfs_inode.h   |   12 +++++++++++-
> >  fs/xfs/xfs_reflink.c |   26 ++++++++++++++++----------
> >  3 files changed, 59 insertions(+), 28 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index c9e40d4..4a38cfc 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -545,24 +545,37 @@ xfs_lock_inodes(
> >  }
> >  
> >  /*
> > - * xfs_lock_two_inodes() can only be used to lock one type of lock at a time -
> > - * the iolock, the mmaplock or the ilock, but not more than one at a time. If we
> > - * lock more than one at a time, lockdep will report false positives saying we
> > - * have violated locking orders.
> > + * xfs_lock_two_inodes_separately() can only be used to lock one type of lock
> > + * at a time - the mmaplock or the ilock, but not more than one type at a
> > + * time. If we lock more than one at a time, lockdep will report false
> > + * positives saying we have violated locking orders.  The iolock must be
> > + * double-locked separately since we use i_rwsem for that.  We now support
> > + * taking one lock EXCL and the other SHARED.
> >   */
> >  void
> > -xfs_lock_two_inodes(
> > -	xfs_inode_t		*ip0,
> > -	xfs_inode_t		*ip1,
> > -	uint			lock_mode)
> > +xfs_lock_two_inodes_separately(
> > +	struct xfs_inode	*ip0,
> > +	uint			ip0_mode,
> > +	struct xfs_inode	*ip1,
> > +	uint			ip1_mode)
> >  {
> 
> Nit.. but "separately" doesn't really convey meaning to me. I guess
> something like xfs_lock_two_inodes_mode() is more clear to me, even
> though the original version still accepted a mode. Eh, I guess even
> __xfs_lock_two_inodes() might be fine (and perhaps more common
> practice). Code looks fine either way:

TBH, "_mode" was my close second choice, so I'll just change the name
before I commit it for reals.  Thanks for the review!

--D

> 
> Reviewed-by: Brian Foster <bfoster@redhat.com>
> 
> > -	xfs_inode_t		*temp;
> > +	struct xfs_inode	*temp;
> > +	uint			mode_temp;
> >  	int			attempts = 0;
> >  	xfs_log_item_t		*lp;
> >  
> > -	ASSERT(!(lock_mode & (XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL)));
> > -	if (lock_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL))
> > -		ASSERT(!(lock_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL)));
> > +	ASSERT(hweight32(ip0_mode) == 1);
> > +	ASSERT(hweight32(ip1_mode) == 1);
> > +	ASSERT(!(ip0_mode & (XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL)));
> > +	ASSERT(!(ip1_mode & (XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL)));
> > +	ASSERT(!(ip0_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL)) ||
> > +	       !(ip0_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL)));
> > +	ASSERT(!(ip1_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL)) ||
> > +	       !(ip1_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL)));
> > +	ASSERT(!(ip1_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL)) ||
> > +	       !(ip0_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL)));
> > +	ASSERT(!(ip0_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL)) ||
> > +	       !(ip1_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL)));
> >  
> >  	ASSERT(ip0->i_ino != ip1->i_ino);
> >  
> > @@ -570,10 +583,13 @@ xfs_lock_two_inodes(
> >  		temp = ip0;
> >  		ip0 = ip1;
> >  		ip1 = temp;
> > +		mode_temp = ip0_mode;
> > +		ip0_mode = ip1_mode;
> > +		ip1_mode = mode_temp;
> >  	}
> >  
> >   again:
> > -	xfs_ilock(ip0, xfs_lock_inumorder(lock_mode, 0));
> > +	xfs_ilock(ip0, xfs_lock_inumorder(ip0_mode, 0));
> >  
> >  	/*
> >  	 * If the first lock we have locked is in the AIL, we must TRY to get
> > @@ -582,18 +598,17 @@ xfs_lock_two_inodes(
> >  	 */
> >  	lp = (xfs_log_item_t *)ip0->i_itemp;
> >  	if (lp && (lp->li_flags & XFS_LI_IN_AIL)) {
> > -		if (!xfs_ilock_nowait(ip1, xfs_lock_inumorder(lock_mode, 1))) {
> > -			xfs_iunlock(ip0, lock_mode);
> > +		if (!xfs_ilock_nowait(ip1, xfs_lock_inumorder(ip1_mode, 1))) {
> > +			xfs_iunlock(ip0, ip0_mode);
> >  			if ((++attempts % 5) == 0)
> >  				delay(1); /* Don't just spin the CPU */
> >  			goto again;
> >  		}
> >  	} else {
> > -		xfs_ilock(ip1, xfs_lock_inumorder(lock_mode, 1));
> > +		xfs_ilock(ip1, xfs_lock_inumorder(ip1_mode, 1));
> >  	}
> >  }
> >  
> > -
> >  void
> >  __xfs_iflock(
> >  	struct xfs_inode	*ip)
> > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> > index 386b0bb..ff56486 100644
> > --- a/fs/xfs/xfs_inode.h
> > +++ b/fs/xfs/xfs_inode.h
> > @@ -423,7 +423,17 @@ void		xfs_iunpin_wait(xfs_inode_t *);
> >  #define xfs_ipincount(ip)	((unsigned int) atomic_read(&ip->i_pincount))
> >  
> >  int		xfs_iflush(struct xfs_inode *, struct xfs_buf **);
> > -void		xfs_lock_two_inodes(xfs_inode_t *, xfs_inode_t *, uint);
> > +void		xfs_lock_two_inodes_separately(struct xfs_inode *ip0,
> > +				uint ip0_mode, struct xfs_inode *ip1,
> > +				uint ip1_mode);
> > +static inline void
> > +xfs_lock_two_inodes(
> > +	struct xfs_inode	*ip0,
> > +	struct xfs_inode	*ip1,
> > +	uint			lock_mode)
> > +{
> > +	xfs_lock_two_inodes_separately(ip0, lock_mode, ip1, lock_mode);
> > +}
> >  
> >  xfs_extlen_t	xfs_get_extsz_hint(struct xfs_inode *ip);
> >  xfs_extlen_t	xfs_get_cowextsz_hint(struct xfs_inode *ip);
> > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > index f89a725..f5a43b2 100644
> > --- a/fs/xfs/xfs_reflink.c
> > +++ b/fs/xfs/xfs_reflink.c
> > @@ -1202,13 +1202,16 @@ xfs_reflink_remap_blocks(
> >  
> >  	/* drange = (destoff, destoff + len); srange = (srcoff, srcoff + len) */
> >  	while (len) {
> > +		uint		lock_mode;
> > +
> >  		trace_xfs_reflink_remap_blocks_loop(src, srcoff, len,
> >  				dest, destoff);
> > +
> >  		/* Read extent from the source file */
> >  		nimaps = 1;
> > -		xfs_ilock(src, XFS_ILOCK_EXCL);
> > +		lock_mode = xfs_ilock_data_map_shared(src);
> >  		error = xfs_bmapi_read(src, srcoff, len, &imap, &nimaps, 0);
> > -		xfs_iunlock(src, XFS_ILOCK_EXCL);
> > +		xfs_iunlock(src, lock_mode);
> >  		if (error)
> >  			goto err;
> >  		ASSERT(nimaps == 1);
> > @@ -1262,7 +1265,7 @@ xfs_iolock_two_inodes_and_break_layout(
> >  
> >  retry:
> >  	if (src_first) {
> > -		inode_lock(src);
> > +		inode_lock_shared(src);
> >  		inode_lock_nested(dest, I_MUTEX_NONDIR2);
> >  	} else {
> >  		inode_lock(dest);
> > @@ -1272,7 +1275,7 @@ xfs_iolock_two_inodes_and_break_layout(
> >  	if (error == -EWOULDBLOCK) {
> >  		inode_unlock(dest);
> >  		if (src_first)
> > -			inode_unlock(src);
> > +			inode_unlock_shared(src);
> >  		error = break_layout(dest, true);
> >  		if (error)
> >  			return error;
> > @@ -1280,11 +1283,11 @@ xfs_iolock_two_inodes_and_break_layout(
> >  	} else if (error) {
> >  		inode_unlock(dest);
> >  		if (src_first)
> > -			inode_unlock(src);
> > +			inode_unlock_shared(src);
> >  		return error;
> >  	}
> >  	if (src_last)
> > -		inode_lock_nested(src, I_MUTEX_NONDIR2);
> > +		down_read_nested(&src->i_rwsem, I_MUTEX_NONDIR2);
> >  	return 0;
> >  }
> >  
> > @@ -1324,7 +1327,8 @@ xfs_reflink_remap_range(
> >  	if (same_inode)
> >  		xfs_ilock(src, XFS_MMAPLOCK_EXCL);
> >  	else
> > -		xfs_lock_two_inodes(src, dest, XFS_MMAPLOCK_EXCL);
> > +		xfs_lock_two_inodes_separately(src, XFS_MMAPLOCK_SHARED,
> > +				dest, XFS_MMAPLOCK_EXCL);
> >  
> >  	/* Check file eligibility and prepare for block sharing. */
> >  	ret = -EINVAL;
> > @@ -1387,10 +1391,12 @@ xfs_reflink_remap_range(
> >  			is_dedupe);
> >  
> >  out_unlock:
> > -	xfs_iunlock(src, XFS_MMAPLOCK_EXCL);
> > +	xfs_iunlock(dest, XFS_MMAPLOCK_EXCL);
> > +	if (!same_inode)
> > +		xfs_iunlock(src, XFS_MMAPLOCK_SHARED);
> > +	inode_unlock(inode_out);
> >  	if (!same_inode)
> > -		xfs_iunlock(dest, XFS_MMAPLOCK_EXCL);
> > -	unlock_two_nondirectories(inode_in, inode_out);
> > +		inode_unlock_shared(inode_in);
> >  	if (ret)
> >  		trace_xfs_reflink_remap_range_error(dest, ret, _RET_IP_);
> >  	return ret;
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH 04/11] xfs: CoW fork operations should only update quota reservations
  2018-01-24 14:22   ` Brian Foster
@ 2018-01-24 19:14     ` Darrick J. Wong
  2018-01-25 13:01       ` Brian Foster
  0 siblings, 1 reply; 60+ messages in thread
From: Darrick J. Wong @ 2018-01-24 19:14 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Jan 24, 2018 at 09:22:16AM -0500, Brian Foster wrote:
> On Tue, Jan 23, 2018 at 06:18:23PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Since the CoW fork only exists in memory, it is incorrect to update the
> > on-disk quota block counts when we modify the CoW fork.  Unlike the data
> > fork, even real extents in the CoW fork are only reservations (on-disk
> > they're owned by the refcountbt) so they must not be tracked in the on
> > disk quota info.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/libxfs/xfs_bmap.c |  203 ++++++++++++++++++++++++++++++++++++++++++++--
> >  fs/xfs/xfs_reflink.c     |    8 +-
> >  2 files changed, 196 insertions(+), 15 deletions(-)
> > 
> > 
> 
> Mostly comments on the comment so far... I still have to grok the code,
> but working through to this point made my brain hurt a bit so I'm
> sending what I have so far. ;P

Grokking all the pieces (especially the quota handling stuff) made my
brain hurt too.

> > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > index 6e6f3cb..e3e8f7c 100644
> > --- a/fs/xfs/libxfs/xfs_bmap.c
> > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > @@ -52,6 +52,145 @@
> >  #include "xfs_refcount.h"
> >  #include "xfs_icache.h"
> >  
> > +/*
> > + * Data/Attr Fork Mapping Lifecycle
> > + *
> > + * The data fork contains the block mappings between logical blocks in a file
> > + * and physical blocks on the disk.  The XFS notions of delayed allocation
> > + * reservations, unwritten extents, and real extents follow well known
> > + * conventions in the filesystem world.
> > + *
> > + * As a side note, the attribute fork does the same for extended attribute
> > + * blocks, though the logical block offsets are not available to userspace and
> > + * the only valid states are HOLE and REAL.
> > + *
> > + * Metadata involved outside of the block mapping itself are as follows:
> > + *
> > + * - i_delayed_blks: Number of blocks that are reserved for delayed allocation.
> > + * - i_cow_blocks: Number of blocks reserved for copy on write staging.
> > + *
> 
> I know it's implied by some of the field names, but I think it would be
> useful to point out what data structures these are associated with.
> Also, i_cow_blocks technically doesn't exist yet.. right?

Yeah, it's added by the next patch.  I was being lazy and put the
comment ahead of that commit rather than rewriting the block comment
in subsequent patches.

> > + * - di_nblocks: Number of blocks (on-disk) assigned to the inode.
> > + *
> 
> Hmm, some of these are self-explanatory or already documented where they
> are defined. I'm wondering if we really need to repeat descriptions for
> all of these (as opposed to ensuring they are all sufficiently described
> where they are defined). That also saves us from having to keep field
> names and whatnot synced up in an unexpected source location.
> 
> > + * - d_bcount: Number of quota blocks accounted for by on-disk metadata.
> > + * - q_res_bcount: Number of quota blocks reserved in-core for future writes +
> > + *           blocks mentioned by on-disk metadata.
> > + *
> > + * - qt_blk_res: Number of quota blocks reserved in-core for this transaction.
> > + *           Unused reservation is given back to q_res_bcount on commit.
> > + * - qt_bcount: Number of quota blocks used by this transaction from
> > + *           qt_blk_res.  d_bcount is increased by this on commit.
> > + * - qt_delbcount: Number of quota blocks used by this transaction from
> > + *           q_res_bcount but not q_res_bcount.  d_bcount is increased by this
> > + *           on commit.
> > + *
> 
> "from q_res_bcount but not q_res_bcount" ?

"...from q_res_bcount but not qt_blk_res."

> Also: qt_bcount_delta, qt_delbcount_delta?

Oops, yes, those names should have _delta after them.
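
With the corrected names, the data fork accounting described in the quoted
comment can be written down as a toy model.  This is a simplified sketch of
what the comment says, not the real dquot code: the demo_* structures and
helpers are hypothetical, and only the three per-transaction fields discussed
above are modeled.

```c
#include <assert.h>

/* Hypothetical, simplified dquot: one on-disk and one in-core counter. */
struct demo_dquot {
	long	d_bcount;	/* blocks accounted for by on-disk metadata */
	long	q_res_bcount;	/* d_bcount plus in-core reservations */
};

/* Hypothetical, simplified per-transaction quota state. */
struct demo_dqtrx {
	long	qt_blk_res;		/* reserved for this transaction */
	long	qt_bcount_delta;	/* used out of qt_blk_res */
	long	qt_delbcnt_delta;	/* used out of a delalloc reservation */
};

/* HOLE to DELALLOC: a transaction-less reservation in q_res_bcount. */
static void
demo_delalloc_reserve(struct demo_dquot *dq, long nblks)
{
	dq->q_res_bcount += nblks;
}

/* A transaction reserves quota blocks up front. */
static void
demo_trans_reserve(struct demo_dquot *dq, struct demo_dqtrx *qt, long nblks)
{
	dq->q_res_bcount += nblks;
	qt->qt_blk_res += nblks;
}

/*
 * Commit: both deltas are added to d_bcount, and the unused part of
 * qt_blk_res is given back to q_res_bcount.  Blocks that came from a
 * delalloc reservation do not touch q_res_bcount at all.
 */
static void
demo_trans_commit(struct demo_dquot *dq, struct demo_dqtrx *qt)
{
	dq->d_bcount += qt->qt_bcount_delta + qt->qt_delbcnt_delta;
	dq->q_res_bcount -= qt->qt_blk_res - qt->qt_bcount_delta;
	*qt = (struct demo_dqtrx){ 0 };
}
```

Converting an 8 block delalloc extent leaves q_res_bcount at 8 and raises
d_bcount to 8.  Reserving 10 blocks to fill a hole and using 6 raises both
counters by 6 once the unused 4 are returned.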

> > + * - sb_fdblocks: Number of free blocks recorded in the superblock on disk.
> > + * - fdblocks: Number of free blocks recorded in the superblock minus any
> > + *           in-core reservations made in anticipation of future writes.
> > + *
> > + * - t_blk_res: Number of blocks reserved out of fdblocks for a transaction.
> > + *           When the transaction commits, t_blk_res - t_blk_res_used is given
> > + *           back to fdblocks.
> > + * - t_blk_res_used: Number of blocks used by this transaction that were
> > + *           reserved for this transaction.
> > + * - t_fdblocks_del: Number of blocks by which fdblocks and sb_fdblocks will
> > + *           have to decrease at commit.
> > + * - t_res_fdblocks_delta: Number of blocks by which sb_fdblocks will have to
> > + *           decrease at commit.  We assume that fdblocks was decreased
> > + *           prior to the transaction.
> > + *
> > + * Data fork block mappings have four logical states:
> > + *
> > + *    +--------> UNWRITTEN <------+
> > + *    |              ^            |
> > + *    |              v            v
> > + * DELALLOC <----> HOLE <------> REAL
> > + *    |                           ^
> > + *    |                           |
> > + *    +---------------------------+
> > + *
> 
> I'm not sure we need a graphic for the extent states. Non-hole
> conversions to delalloc is the only transition that doesn't make any
> sense.

First of all, Dave keeps asking for ASCII art 'when appropriate'. :)

But on a more serious note, I thought the state diagram would be useful
for anyone who isn't so familiar with how blocks get mapped into files
in xfs, particularly to compare the data fork diagram against the
corresponding picture for the cow fork.
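
To make that comparison concrete, the two diagrams in this patch can be
encoded as transition tables.  This is only an illustrative sketch; the
demo_* names are hypothetical and nothing like this exists in the patch.
The edges are copied from the data fork and CoW fork pictures and their
transition lists.

```c
#include <stdbool.h>

enum demo_ext_state {
	DEMO_HOLE,
	DEMO_DELALLOC,
	DEMO_UNWRITTEN,
	DEMO_REAL,
	DEMO_NSTATES,
};

/* Data fork: rows are the "from" state, columns the "to" state. */
static const bool demo_data_fork[DEMO_NSTATES][DEMO_NSTATES] = {
	[DEMO_HOLE]	 = { [DEMO_DELALLOC] = true, [DEMO_UNWRITTEN] = true,
			     [DEMO_REAL] = true },
	[DEMO_DELALLOC]	 = { [DEMO_HOLE] = true, [DEMO_UNWRITTEN] = true,
			     [DEMO_REAL] = true },
	[DEMO_UNWRITTEN] = { [DEMO_HOLE] = true, [DEMO_REAL] = true },
	[DEMO_REAL]	 = { [DEMO_HOLE] = true, [DEMO_UNWRITTEN] = true },
};

/* CoW fork: no DELALLOC to REAL, and REAL can only go back to HOLE. */
static const bool demo_cow_fork[DEMO_NSTATES][DEMO_NSTATES] = {
	[DEMO_HOLE]	 = { [DEMO_DELALLOC] = true, [DEMO_UNWRITTEN] = true },
	[DEMO_DELALLOC]	 = { [DEMO_HOLE] = true, [DEMO_UNWRITTEN] = true },
	[DEMO_UNWRITTEN] = { [DEMO_HOLE] = true, [DEMO_REAL] = true },
	[DEMO_REAL]	 = { [DEMO_HOLE] = true },
};

/* Is the transition drawn in the diagram for the given fork? */
static bool
demo_transition_ok(bool cow_fork, enum demo_ext_state from,
		enum demo_ext_state to)
{
	return cow_fork ? demo_cow_fork[from][to] : demo_data_fork[from][to];
}

/* Number of edges in the diagram for the given fork. */
static int
demo_count_transitions(bool cow_fork)
{
	int	from, to, n = 0;

	for (from = 0; from < DEMO_NSTATES; from++)
		for (to = 0; to < DEMO_NSTATES; to++)
			if (demo_transition_ok(cow_fork, from, to))
				n++;
	return n;
}
```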

> > + * The state transitions and required metadata updates are as follows:
> > + *
> > + * - HOLE to DELALLOC: Increase i_delayed_blks and q_res_bcount, and decrease
> > + *           fdblocks.
> > + * - HOLE to REAL: Increase di_nblocks and qt_bcount, and decrease fdblocks.
> > + * - HOLE to UNWRITTEN: Same as above.
> > + *
> > + * - DELALLOC to UNWRITTEN: Increase di_nblocks and qt_delbcount, and decrease
> > + *           i_delayed_blks.
> > + * - DELALLOC to REAL: Same as above.
> > + * - DELALLOC to HOLE: Increase fdblocks, and decrease i_delayed_blks and
> > + *           q_res_bcount.
> > + *
> > + * - UNWRITTEN to HOLE: Decrease di_nblocks and qt_bcount, and increase fdblocks.
> > + * - UNWRITTEN to REAL: No change.
> > + *
> > + * - REAL to UNWRITTEN: No change.
> > + * - REAL to HOLE: Decrease di_nblocks and qt_bcount, and increase fdblocks.
> > + *
> > + * Note in particular that delalloc reservations have "transaction-less"
> > + * quota reservations via q_res_bcount.  If the reservation is allocated,
> > + * qt_delbcount is used to increment d_bcount without touching q_res_bcount.
> > + * Filling a hole with an allocated extent, by contrast, uses qt_blk_res
> > + * to make a reservation in q_res_bcount, qt_bcount to record the number
> > + * of allocated blocks; at commit qt_bcount is added to d_bcount and
> > + * qt_blk_res - qt_bcount is added back to q_res_bcount.
> > + *
> 
> While I agree with the intent/usefulness of a big comment around how the
> quota accounting works, this all kind of reads like a braindump so far.

Guilty as charged!  This /is/ pulled straight from my notes. :/

> E.g., this state changes these fields, this field represents how much to
> change some other field, etc. While I'm sure that is useful information
> to work out the problem being resolved here, it doesn't explain much how
> everything works. IOW, I still feel like I need to go trace through the
> code to understand the comment, as opposed to the comment helping me
> trace through the code.
> 
> I think it would be better to have a high level description about how
> the quota accounting works with respect to the transaction/extent
> states, what high-level data structures are involved and how they
> relate, etc., rather than what seems essentially to be a chart that maps
> states to fields. 

Hm, ok.  I'll work on that for the next revision.

> I'm also wondering how much of the whole picture really needs to be
> described here to cover quota accounting with respect to block mapping
> (sufficiently to differentiate data fork from cow fork, which I take is
> the purpose of this section). For example, do we really need the
> internal transaction details for how quota deltas are carried?
> 
> Instead, I think it might be sufficient to explain that the quota system
> works in two "levels" (for lack of a better term :/), one for
> reservation and another for real block usage. The reservation is
> associated with transaction reservation and/or delayed block allocation
> (no tx). In either case, quota reservation is converted to real quota
> block usage when a transaction commits that maps real/physical blocks.
> If the transaction held extra reservation that went unused, that quota
> reservation is released. The primary difference is that transactions to
> convert delalloc -> real do not reserve quota blocks in the first place,
> since that has already occurred, so they just need to make sure to
> convert/persist the quota res for the blocks that were converted to
> real.

I thought about cutting this whole comment down to a simple sentence
about how quota accounting is different between the data & cow forks (as
you figured out, we use delalloc quota reservations for everything in
the cow fork and only turn them into real ones when we go to remap) but
then worried that doing so would presuppose the reader knew anything
about how the extent lifecycles work... and that's how I end up with a
gigantic manual.

> > + * Copy on Write Fork Mapping Lifecycle
> > + *
> > + * The CoW fork handles things differently from the data fork because its
> > + * mappings only exist in memory-- the refcount btree is the on-disk owner of
> > + * the extents until they're remapped into the data fork.  Therefore,
> > + * unwritten and real extents in the CoW fork are treated the same way as
> > + * delayed allocation extents.  Quota and fdblock changes only exist in
> > + * memory, which requires some twists in the bmap functions.
> > + *
> 
> Ok, but perhaps this should point out what happens when cow blocks are
> reserved with respect to quotas..? IIUC, a delalloc quota reservation
> occurs just the same as above, the blocks simply reside in another fork.

Yes.

> > + * The CoW fork extent state diagram looks like this:
> > + *
> > + *    +--------> UNWRITTEN -------+
> > + *    |              ^            |
> > + *    |              v            v
> > + * DELALLOC <----> HOLE <------- REAL
> > + *
> > + * Holes are still holes.  Delayed allocation extents reserve blocks for
> > + * landing future writes, just like they do in the data fork.  However, unlike
> > + * the data fork, unwritten extents signal an extent that has been allocated
> > + * but is not currently undergoing writeback.  Real extents are undergoing
> > + * writeback, and when that writeback finishes the corresponding data fork
> > + * extent will be punched out and the CoW fork counterpart moved to the new
> > + * hole in the data fork.
> > + *
> 
> Ok, so the difference is that for the COW fork, the _extent state_
> conversion is not the appropriate trigger event to convert quota
> reservation to real quota usage. Instead, the blocks being remapped from
> the COW fork to the data fork is when that should occur.

Yes.

> > + * The state transitions and required metadata updates are as follows:
> > + *
> > + * - HOLE to DELALLOC: Increase i_cow_blocks and q_res_bcount, and decrease
> > + *           fdblocks.
> > + * - HOLE to UNWRITTEN: Same as above, but since we reserved quota via
> > + *           qt_blk_res (which increased q_res_bcount) when we allocate the
> > + *           extent we have to decrease qt_blk_res so that the commit doesn't
> > + *           give the allocated CoW blocks back.
> > + *
> 
> Hmm, this is a little confusing. Looking at the code change and comment
> below, I think I get what this is trying to do, which is essentially
> make a real block cow fork alloc behave like a delalloc reservation
> (with respect to quota). FWIW, I think what confuses me is the assertion
> that the blocks would be "given back" otherwise. The only reference I
> have to compare is data fork alloc behavior, which implies that used
> reservation would not be given back, but rather converted to real quota
> usage on block allocation (and excess reservation would still be given
> back, which afaict we still want to happen). So the trickery is required
> to prevent conversion of quota reservation for the allocated cow blocks,
> let that res sit around until the cow blocks are remapped, and release
> unused reservation from the tx as normal. Am I following that correctly?

Yes.
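
Spelled out as a toy model, the sequence above looks roughly like this.  It
is a simplified sketch of the accounting as described in this thread, not the
real dquot code; every cow_demo_* name is hypothetical.  The commit rule is
the same as for the data fork: qt_bcount_delta goes to d_bcount and the
difference between qt_blk_res and qt_bcount_delta goes back to q_res_bcount.

```c
#include <assert.h>

/* Hypothetical, simplified dquot and per-transaction quota state. */
struct cow_demo_dquot {
	long	d_bcount;	/* on-disk block count */
	long	q_res_bcount;	/* d_bcount plus in-core reservations */
};

struct cow_demo_dqtrx {
	long	qt_blk_res;		/* reserved for this transaction */
	long	qt_bcount_delta;	/* on-disk count change at commit */
};

static void
cow_demo_trans_reserve(struct cow_demo_dquot *dq, struct cow_demo_dqtrx *qt,
		long nblks)
{
	dq->q_res_bcount += nblks;
	qt->qt_blk_res += nblks;
}

/*
 * Allocating a real CoW extent into a hole: shrink qt_blk_res so that
 * commit does not hand these blocks back.  q_res_bcount keeps them.
 */
static void
cow_demo_alloc_real_extent(struct cow_demo_dqtrx *qt, long len)
{
	qt->qt_blk_res -= len;
}

/* Remapping into the data fork charges the on-disk quota. */
static void
cow_demo_remap_to_data_fork(struct cow_demo_dqtrx *qt, long len)
{
	qt->qt_bcount_delta += len;
}

/* Deleting the CoW mapping drops the transaction-less reservation. */
static void
cow_demo_del_extent(struct cow_demo_dquot *dq, long len)
{
	dq->q_res_bcount -= len;
}

static void
cow_demo_trans_commit(struct cow_demo_dquot *dq, struct cow_demo_dqtrx *qt)
{
	dq->d_bcount += qt->qt_bcount_delta;
	dq->q_res_bcount -= qt->qt_blk_res - qt->qt_bcount_delta;
	*qt = (struct cow_demo_dqtrx){ 0 };
}
```

Reserving 10 blocks and allocating 6 into the CoW fork leaves d_bcount at 0
and q_res_bcount at 6 after commit.  Remapping those 6 blocks later moves
d_bcount to 6 while q_res_bcount stays at 6.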

> > + * - DELALLOC to UNWRITTEN: No change.
> > + * - DELALLOC to HOLE: Decrease i_cow_blocks and q_res_bcount, and increase
> > + *           fdblocks.
> > + *
> > + * - UNWRITTEN to HOLE: Same as DELALLOC to HOLE.
> > + * - UNWRITTEN to REAL: No change.
> > + *
> > + * - REAL to HOLE: This transition happens when we've finished a write
> > + *           operation and need to move the mapping to the data fork.  We
> > + *           punch the corresponding data fork mappings, which decreases
> > + *           qt_bcount.  Then we map the CoW fork mapping into the hole we
> > + *           just cleared out of the data fork, which increases qt_bcount.
> > + *           There's a subtlety here -- if we promoted a write over a hole to
> > + *           CoW, there will be a net increase in qt_bcount, which is fine
> > + *           because we already reserved the quota when we filled the CoW
> > + *           fork.  Finally, we punch the CoW fork mapping, which decreases
> > + *           q_res_bcount.
> > + *
> > + * Notice how all CoW fork extents use transactionless quota reservations and
> > + * the in-core fdblocks to maintain state, and we avoid updating any on-disk
> > + * metadata.  This is essential to maintain metadata correctness if the system
> > + * goes down.
> > + */
> >  
> >  kmem_zone_t		*xfs_bmap_free_item_zone;
> >  
> > @@ -3337,6 +3476,39 @@ xfs_bmap_btalloc_filestreams(
> >  	return 0;
> >  }
> >  
> > +/* Deal with CoW fork accounting when we allocate a block. */
> > +static void
> > +xfs_bmap_btalloc_cow(
> > +	struct xfs_bmalloca	*ap,
> > +	struct xfs_alloc_arg	*args)
> > +{
> > +	/* Filling a previously reserved extent; nothing to do here. */
> > +	if (ap->wasdel)
> > +		return;
> > +
> > +	/*
> > +	 * The CoW fork only exists in memory, so the on-disk quota accounting
> > +	 * must not include any CoW fork extents.  Therefore, CoW blocks are
> > +	 * only tracked in the in-core dquot block count (q_res_bcount).
> > +	 *
> > +	 * If we get here, we're filling a CoW hole with a real (non-delalloc)
> > +	 * CoW extent having reserved enough blocks from both q_res_bcount and
> > +	 * qt_blk_res to guarantee that we won't run out of space.  The unused
> > +	 * qt_blk_res is given back to q_res_bcount when the transaction
> > +	 * commits.
> > +	 *
> > +	 * We don't want the quota accounting for our newly allocated blocks
> > +	 * to be given back, so we must decrease qt_blk_res without decreasing
> > +	 * q_res_bcount.
> > +	 *
> > +	 * Note: If we're allocating a delalloc extent, we already reserved
> > +	 * the q_res_bcount blocks, so no quota accounting update is needed
> > +	 * here.
> > +	 */
> > +	xfs_trans_mod_dquot_byino(ap->tp, ap->ip, XFS_TRANS_DQ_RES_BLKS,
> > +			-(long)args->len);
> > +}
> 
> Factoring nit.. if we're going to refactor bits of xfs_bmap_btalloc()
> out, it might be cleaner to factor out all of the quota logic rather
> than just the cow bits (which is basically just a simple check and
> function call). E.g., refactor into an xfs_bmap_btalloc_quota() helper
> that does the right thing based on the fork, with comments as to why,
> etc. (and perhaps just leave the unrelated di_nblocks change behind).

I thought about factoring the data/attr fork stuff into its own
xfs_bmap_btalloc_quota() function too, since this function is already
eyewateringly long.  I think I'll do that as a separate refactor at the
end of the series, though...
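
For the record, the decision such a helper would have to make is small.  The
sketch below is not the eventual refactor; demo_btalloc_quota() and the
DEMO_DQ_* names are hypothetical, and it only returns which quota field the
caller would adjust and by how much, following the four cases in the hunk
quoted below.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the XFS_TRANS_DQ_* field selectors. */
enum demo_dq_field {
	DEMO_DQ_NONE,		/* no quota update needed */
	DEMO_DQ_BCOUNT,		/* blocks taken from the tx reservation */
	DEMO_DQ_DELBCOUNT,	/* blocks taken from a delalloc reservation */
	DEMO_DQ_RES_BLKS,	/* adjust the tx reservation itself */
};

struct demo_dq_update {
	enum demo_dq_field	field;
	long			delta;
};

/* Pick the quota update for a successful allocation of len blocks. */
static struct demo_dq_update
demo_btalloc_quota(bool cow_fork, bool wasdel, long len)
{
	struct demo_dq_update	u = { DEMO_DQ_NONE, 0 };

	if (cow_fork) {
		/* Filling a previously reserved extent: nothing to do. */
		if (wasdel)
			return u;
		/* Keep the blocks reserved past commit. */
		u.field = DEMO_DQ_RES_BLKS;
		u.delta = -len;
		return u;
	}

	/* Data/attr fork: charge the on-disk quota. */
	u.field = wasdel ? DEMO_DQ_DELBCOUNT : DEMO_DQ_BCOUNT;
	u.delta = len;
	return u;
}
```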

--D

> Brian
> 
> > +
> >  STATIC int
> >  xfs_bmap_btalloc(
> >  	struct xfs_bmalloca	*ap)	/* bmap alloc argument struct */
> > @@ -3571,19 +3743,22 @@ xfs_bmap_btalloc(
> >  			*ap->firstblock = args.fsbno;
> >  		ASSERT(nullfb || fb_agno <= args.agno);
> >  		ap->length = args.len;
> > -		if (!(ap->flags & XFS_BMAPI_COWFORK))
> > -			ap->ip->i_d.di_nblocks += args.len;
> > -		xfs_trans_log_inode(ap->tp, ap->ip, XFS_ILOG_CORE);
> >  		if (ap->wasdel)
> >  			ap->ip->i_delayed_blks -= args.len;
> > -		/*
> > -		 * Adjust the disk quota also. This was reserved
> > -		 * earlier.
> > -		 */
> > -		xfs_trans_mod_dquot_byino(ap->tp, ap->ip,
> > -			ap->wasdel ? XFS_TRANS_DQ_DELBCOUNT :
> > -					XFS_TRANS_DQ_BCOUNT,
> > -			(long) args.len);
> > +		if (ap->flags & XFS_BMAPI_COWFORK) {
> > +			xfs_bmap_btalloc_cow(ap, &args);
> > +		} else {
> > +			ap->ip->i_d.di_nblocks += args.len;
> > +			xfs_trans_log_inode(ap->tp, ap->ip, XFS_ILOG_CORE);
> > +			/*
> > +			 * Adjust the disk quota also. This was reserved
> > +			 * earlier.
> > +			 */
> > +			xfs_trans_mod_dquot_byino(ap->tp, ap->ip,
> > +				ap->wasdel ? XFS_TRANS_DQ_DELBCOUNT :
> > +						XFS_TRANS_DQ_BCOUNT,
> > +				(long) args.len);
> > +		}
> >  	} else {
> >  		ap->blkno = NULLFSBLOCK;
> >  		ap->length = 0;
> > @@ -4776,6 +4951,7 @@ xfs_bmap_del_extent_cow(
> >  	struct xfs_bmbt_irec	new;
> >  	xfs_fileoff_t		del_endoff, got_endoff;
> >  	int			state = BMAP_COWFORK;
> > +	int			error;
> >  
> >  	XFS_STATS_INC(mp, xs_del_exlist);
> >  
> > @@ -4832,6 +5008,11 @@ xfs_bmap_del_extent_cow(
> >  		xfs_iext_insert(ip, icur, &new, state);
> >  		break;
> >  	}
> > +
> > +	/* Remove the quota reservation */
> > +	error = xfs_trans_reserve_quota_nblks(NULL, ip,
> > +			-(long)del->br_blockcount, 0, XFS_QMOPT_RES_REGBLKS);
> > +	ASSERT(error == 0);
> >  }
> >  
> >  /*
> > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > index 82abff6..e367351 100644
> > --- a/fs/xfs/xfs_reflink.c
> > +++ b/fs/xfs/xfs_reflink.c
> > @@ -599,10 +599,6 @@ xfs_reflink_cancel_cow_blocks(
> >  					del.br_startblock, del.br_blockcount,
> >  					NULL);
> >  
> > -			/* Update quota accounting */
> > -			xfs_trans_mod_dquot_byino(*tpp, ip, XFS_TRANS_DQ_BCOUNT,
> > -					-(long)del.br_blockcount);
> > -
> >  			/* Roll the transaction */
> >  			xfs_defer_ijoin(&dfops, ip);
> >  			error = xfs_defer_finish(tpp, &dfops);
> > @@ -795,6 +791,10 @@ xfs_reflink_end_cow(
> >  		if (error)
> >  			goto out_defer;
> >  
> > +		/* Charge this new data fork mapping to the on-disk quota. */
> > +		xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT,
> > +				(long)del.br_blockcount);
> > +
> >  		/* Remove the mapping from the CoW fork. */
> >  		xfs_bmap_del_extent_cow(ip, &icur, &got, &del);
> >  
> > 


* [PATCH v2 04/11] xfs: CoW fork operations should only update quota reservations
  2018-01-24  2:18 ` [PATCH 04/11] xfs: CoW fork operations should only update quota reservations Darrick J. Wong
  2018-01-24 14:22   ` Brian Foster
@ 2018-01-25  1:20   ` Darrick J. Wong
  2018-01-25 13:03     ` Brian Foster
  2018-01-26 12:12     ` Christoph Hellwig
  1 sibling, 2 replies; 60+ messages in thread
From: Darrick J. Wong @ 2018-01-25  1:20 UTC (permalink / raw)
  To: linux-xfs; +Cc: Brian Foster

From: Darrick J. Wong <darrick.wong@oracle.com>

Since the CoW fork only exists in memory, it is incorrect to update the
on-disk quota block counts when we modify the CoW fork.  Unlike the data
fork, even real extents in the CoW fork are only reservations (on-disk
they're owned by the refcountbt) so they must not be tracked in the on
disk quota info.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
v2: make documentation more crisp and to the point
---
 fs/xfs/libxfs/xfs_bmap.c |  118 ++++++++++++++++++++++++++++++++++++++++++----
 fs/xfs/xfs_quota.h       |   14 ++++-
 fs/xfs/xfs_reflink.c     |    8 ++-
 3 files changed, 122 insertions(+), 18 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 0c9c9cd..7f0ac40 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -52,6 +52,71 @@
 #include "xfs_refcount.h"
 #include "xfs_icache.h"
 
+/*
+ * Data/Attribute Fork Mapping Lifecycle
+ *
+ * The data fork contains the block mappings between logical blocks in a file
+ * and physical blocks on the disk.  The XFS notions of delayed allocation
+ * reservations, unwritten extents, and real extents follow well known
+ * conventions in the filesystem world.
+ *
+ * Data fork extent states follow these transitions:
+ *
+ *    +--------> UNWRITTEN <------+
+ *    |              ^            |
+ *    |              v            v
+ * DELALLOC <----> HOLE <------> REAL
+ *    |                           ^
+ *    |                           |
+ *    +---------------------------+
+ *
+ * Every delayed allocation reserves in-memory quota blocks (q_res_bcount) and
+ * in-memory fs free blocks (fdblocks), and increases the in-memory per-inode
+ * i_delayed_blks.  The reservation includes potentially required bmbt blocks.
+ *
+ * Every transaction reserves quota blocks (qt_blk_res) from the in-memory
+ * quota blocks and free blocks (t_blk_res) from the in-memory fs free blocks.
+ * The transaction tracks both the number of blocks used from its own
+ * reservation as well as the number of blocks used that came from a delayed
+ * allocation.  When the transaction commits, it gives back the unused parts
+ * of its own block reservations.  Next, it adds any block usage that came
+ * from a delayed allocation to the on-disk counters without changing the
+ * in-memory reservations (q_res_bcount and fdblocks).
+ *
+ * To convert a delayed allocation to a real or unwritten extent, we use a
+ * transaction to allocate the blocks.  At commit time, the block reservations
+ * are given back or added to the on-disk counters as noted above.
+ * i_delayed_blks is decreased while the on-disk per-inode di_nblocks is
+ * increased.
+ *
+ * The attribute fork works in the same way as the data fork except that the
+ * only valid states are REAL and HOLE.
+ *
+ * Copy on Write Fork Mapping Lifecycle
+ *
+ * The CoW fork exists only in memory and stages copy-on-write updates to file
+ * data.  It has fewer transitions than the data fork:
+ *
+ *    +--------> UNWRITTEN -------+
+ *    |              ^            |
+ *    |              v            v
+ * DELALLOC <----> HOLE <------- REAL
+ *
+ * Delayed allocation extents here are treated the same as in the data fork
+ * except that they are counted by the per-inode i_cow_blocks instead of
+ * i_delayed_blks.
+ *
+ * Unwritten and real extents are counted by the quota code as block
+ * reservations (q_res_bcount) and not on-disk quota blocks (d_bcount), and
+ * are counted by the free block counters as in-memory reservations (fdblocks)
+ * and not on-disk free blocks (sb_fdblocks).  These blocks are also counted
+ * by i_cow_blocks and not the on-disk di_nblocks.
+ *
+ * When a CoW fork extent is remapped to the data fork, the reservations are
+ * converted into on-disk counts in the same manner as a delayed allocation
+ * conversion in the data fork.  The number of blocks being remapped is
+ * subtracted from i_cow_blocks and added to di_nblocks.
+ */
 
 kmem_zone_t		*xfs_bmap_free_item_zone;
 
@@ -3337,6 +3402,28 @@ xfs_bmap_btalloc_filestreams(
 	return 0;
 }
 
+/* Deal with CoW fork accounting when we allocate a block. */
+static void
+xfs_bmap_btalloc_quota_cow(
+	struct xfs_bmalloca	*ap,
+	struct xfs_alloc_arg	*args)
+{
+	/* Filling a previously reserved extent; nothing to do here. */
+	if (ap->wasdel)
+		return;
+
+	/*
+	 * If we get here, we're filling a CoW hole with a real (non-delalloc)
+	 * CoW extent having reserved enough blocks from both q_res_bcount and
+	 * qt_blk_res to guarantee that we won't run out of space.  The unused
+	 * qt_blk_res is given back to q_res_bcount when the transaction
+	 * commits, so we must decrease qt_blk_res without decreasing
+	 * q_res_bcount.
+	 */
+	xfs_trans_mod_dquot_byino(ap->tp, ap->ip, XFS_TRANS_DQ_RES_BLKS,
+			-(long)args->len);
+}
+
 STATIC int
 xfs_bmap_btalloc(
 	struct xfs_bmalloca	*ap)	/* bmap alloc argument struct */
@@ -3571,19 +3658,22 @@ xfs_bmap_btalloc(
 			*ap->firstblock = args.fsbno;
 		ASSERT(nullfb || fb_agno <= args.agno);
 		ap->length = args.len;
-		if (!(ap->flags & XFS_BMAPI_COWFORK))
-			ap->ip->i_d.di_nblocks += args.len;
-		xfs_trans_log_inode(ap->tp, ap->ip, XFS_ILOG_CORE);
 		if (ap->wasdel)
 			ap->ip->i_delayed_blks -= args.len;
-		/*
-		 * Adjust the disk quota also. This was reserved
-		 * earlier.
-		 */
-		xfs_trans_mod_dquot_byino(ap->tp, ap->ip,
-			ap->wasdel ? XFS_TRANS_DQ_DELBCOUNT :
-					XFS_TRANS_DQ_BCOUNT,
-			(long) args.len);
+		if (ap->flags & XFS_BMAPI_COWFORK) {
+			xfs_bmap_btalloc_quota_cow(ap, &args);
+		} else {
+			ap->ip->i_d.di_nblocks += args.len;
+			xfs_trans_log_inode(ap->tp, ap->ip, XFS_ILOG_CORE);
+			/*
+			 * Adjust the disk quota also. This was reserved
+			 * earlier.
+			 */
+			xfs_trans_mod_dquot_byino(ap->tp, ap->ip,
+				ap->wasdel ? XFS_TRANS_DQ_DELBCOUNT :
+						XFS_TRANS_DQ_BCOUNT,
+				(long) args.len);
+		}
 	} else {
 		ap->blkno = NULLFSBLOCK;
 		ap->length = 0;
@@ -4760,6 +4850,7 @@ xfs_bmap_del_extent_cow(
 	struct xfs_bmbt_irec	new;
 	xfs_fileoff_t		del_endoff, got_endoff;
 	int			state = BMAP_COWFORK;
+	int			error;
 
 	XFS_STATS_INC(mp, xs_del_exlist);
 
@@ -4816,6 +4907,11 @@ xfs_bmap_del_extent_cow(
 		xfs_iext_insert(ip, icur, &new, state);
 		break;
 	}
+
+	/* Remove the quota reservation */
+	error = xfs_trans_reserve_quota_nblks(NULL, ip,
+			-(long)del->br_blockcount, 0, XFS_QMOPT_RES_REGBLKS);
+	ASSERT(error == 0);
 }
 
 /*
diff --git a/fs/xfs/xfs_quota.h b/fs/xfs/xfs_quota.h
index ce6506a..34b4ec2 100644
--- a/fs/xfs/xfs_quota.h
+++ b/fs/xfs/xfs_quota.h
@@ -54,11 +54,19 @@ struct xfs_trans;
  */
 typedef struct xfs_dqtrx {
 	struct xfs_dquot *qt_dquot;	  /* the dquot this refers to */
-	ulong		qt_blk_res;	  /* blks reserved on a dquot */
+
+	/* dquot bcount blks reserved for this transaction */
+	ulong		qt_blk_res;
+
 	ulong		qt_ino_res;	  /* inode reserved on a dquot */
 	ulong		qt_ino_res_used;  /* inodes used from the reservation */
-	long		qt_bcount_delta;  /* dquot blk count changes */
-	long		qt_delbcnt_delta; /* delayed dquot blk count changes */
+
+	/* dquot block count changes taken from qt_blk_res */
+	long		qt_bcount_delta;
+
+	/* dquot block count changes taken from delalloc reservation */
+	long		qt_delbcnt_delta;
+
 	long		qt_icount_delta;  /* dquot inode count changes */
 	ulong		qt_rtblk_res;	  /* # blks reserved on a dquot */
 	ulong		qt_rtblk_res_used;/* # blks used from reservation */
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 82abff6..e367351 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -599,10 +599,6 @@ xfs_reflink_cancel_cow_blocks(
 					del.br_startblock, del.br_blockcount,
 					NULL);
 
-			/* Update quota accounting */
-			xfs_trans_mod_dquot_byino(*tpp, ip, XFS_TRANS_DQ_BCOUNT,
-					-(long)del.br_blockcount);
-
 			/* Roll the transaction */
 			xfs_defer_ijoin(&dfops, ip);
 			error = xfs_defer_finish(tpp, &dfops);
@@ -795,6 +791,10 @@ xfs_reflink_end_cow(
 		if (error)
 			goto out_defer;
 
+		/* Charge this new data fork mapping to the on-disk quota. */
+		xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT,
+				(long)del.br_blockcount);
+
 		/* Remove the mapping from the CoW fork. */
 		xfs_bmap_del_extent_cow(ip, &icur, &got, &del);
 


* [PATCH 12/11] xfs: refactor quota code in xfs_bmap_btalloc
  2018-01-24  2:17 [PATCH 00/11] xfs: reflink/scrub/quota fixes Darrick J. Wong
                   ` (10 preceding siblings ...)
  2018-01-24  2:19 ` [PATCH 11/11] xfs: don't clobber inobt/finobt cursors when xref with rmap Darrick J. Wong
@ 2018-01-25  5:26 ` Darrick J. Wong
  2018-01-26 12:17   ` Christoph Hellwig
  11 siblings, 1 reply; 60+ messages in thread
From: Darrick J. Wong @ 2018-01-25  5:26 UTC (permalink / raw)
  To: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Since we now have a dedicated function for dealing with CoW allocation
related quota updates in xfs_bmap_btalloc, we might as well refactor the
data/attr fork quota update into its own function too.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c |   32 ++++++++++++++++++++------------
 1 file changed, 20 insertions(+), 12 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 4144487..2d99b7a 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3402,6 +3402,25 @@ xfs_bmap_btalloc_filestreams(
 	return 0;
 }
 
+/* Deal with data/attr fork accounting when we allocate a block. */
+static void
+xfs_bmap_btalloc_quota(
+	struct xfs_bmalloca	*ap,
+	struct xfs_alloc_arg	*args)
+{
+	ap->ip->i_d.di_nblocks += args->len;
+	xfs_trans_log_inode(ap->tp, ap->ip, XFS_ILOG_CORE);
+	if (ap->wasdel)
+		ap->ip->i_delayed_blks -= args->len;
+	/*
+	 * Adjust the disk quota also. This was reserved
+	 * earlier.
+	 */
+	xfs_trans_mod_dquot_byino(ap->tp, ap->ip,
+		ap->wasdel ? XFS_TRANS_DQ_DELBCOUNT : XFS_TRANS_DQ_BCOUNT,
+		(long) args->len);
+}
+
 /* Deal with CoW fork accounting when we allocate a block. */
 static void
 xfs_bmap_btalloc_quota_cow(
@@ -3679,18 +3698,7 @@ xfs_bmap_btalloc(
 			xfs_bmap_btalloc_quota_cow(ap, &args, orig_offset,
 					orig_length);
 		} else {
-			ap->ip->i_d.di_nblocks += args.len;
-			xfs_trans_log_inode(ap->tp, ap->ip, XFS_ILOG_CORE);
-			if (ap->wasdel)
-				ap->ip->i_delayed_blks -= args.len;
-			/*
-			 * Adjust the disk quota also. This was reserved
-			 * earlier.
-			 */
-			xfs_trans_mod_dquot_byino(ap->tp, ap->ip,
-				ap->wasdel ? XFS_TRANS_DQ_DELBCOUNT :
-						XFS_TRANS_DQ_BCOUNT,
-				(long) args.len);
+			xfs_bmap_btalloc_quota(ap, &args);
 		}
 	} else {
 		ap->blkno = NULLFSBLOCK;


* Re: [PATCH 04/11] xfs: CoW fork operations should only update quota reservations
  2018-01-24 19:14     ` Darrick J. Wong
@ 2018-01-25 13:01       ` Brian Foster
  2018-01-25 17:52         ` Darrick J. Wong
  0 siblings, 1 reply; 60+ messages in thread
From: Brian Foster @ 2018-01-25 13:01 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Jan 24, 2018 at 11:14:25AM -0800, Darrick J. Wong wrote:
> On Wed, Jan 24, 2018 at 09:22:16AM -0500, Brian Foster wrote:
> > On Tue, Jan 23, 2018 at 06:18:23PM -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > Since the CoW fork only exists in memory, it is incorrect to update the
> > > on-disk quota block counts when we modify the CoW fork.  Unlike the data
> > > fork, even real extents in the CoW fork are only reservations (on-disk
> > > they're owned by the refcountbt) so they must not be tracked in the on
> > > disk quota info.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_bmap.c |  203 ++++++++++++++++++++++++++++++++++++++++++++--
> > >  fs/xfs/xfs_reflink.c     |    8 +-
> > >  2 files changed, 196 insertions(+), 15 deletions(-)
> > > 
> > > 
...
> > > + * - sb_fdblocks: Number of free blocks recorded in the superblock on disk.
> > > + * - fdblocks: Number of free blocks recorded in the superblock minus any
> > > + *           in-core reservations made in anticipation of future writes.
> > > + *
> > > + * - t_blk_res: Number of blocks reserved out of fdblocks for a transaction.
> > > + *           When the transaction commits, t_blk_res - t_blk_res_used is given
> > > + *           back to fdblocks.
> > > + * - t_blk_res_used: Number of blocks used by this transaction that were
> > > + *           reserved for this transaction.
> > > + * - t_fdblocks_del: Number of blocks by which fdblocks and sb_fdblocks will
> > > + *           have to decrease at commit.
> > > + * - t_res_fdblocks_delta: Number of blocks by which sb_fdblocks will have to
> > > + *           decrease at commit.  We assume that fdblocks was decreased
> > > + *           prior to the transaction.
> > > + *
> > > + * Data fork block mappings have four logical states:
> > > + *
> > > + *    +--------> UNWRITTEN <------+
> > > + *    |              ^            |
> > > + *    |              v            v
> > > + * DELALLOC <----> HOLE <------> REAL
> > > + *    |                           ^
> > > + *    |                           |
> > > + *    +---------------------------+
> > > + *
> > 
> > I'm not sure we need a graphic for the extent states. Non-hole
> > conversion to delalloc is the only transition that doesn't make any
> > sense.
> 
> First of all, Dave keeps asking for ASCII art 'when appropriate'. :)
> 

I'm not against ASCII art in general...

> But on a more serious note, I thought the state diagram would be useful
> for anyone who isn't so familiar with how blocks get mapped into files
> in xfs, particularly to compare the data fork diagram against the
> corresponding picture for the cow fork.
> 

... but TBH I didn't really notice they were different until you pointed
it out. :/ It still seems like a rather verbose means to point out that
the COW fork apparently doesn't have the DELALLOC -> REAL transition.

...
> 
> > I'm also wondering how much of the whole picture really needs to be
> > described here to cover quota accounting with respect to block mapping
> > (sufficiently to differentiate data fork from cow fork, which I take is
> > the purpose of this section). For example, do we really need the
> > internal transaction details for how quota deltas are carried?
> > 
> > Instead, I think it might be sufficient to explain that the quota system
> > works in two "levels" (for lack of a better term :/), one for
> > reservation and another for real block usage. The reservation is
> > associated with transaction reservation and/or delayed block allocation
> > (no tx). In either case, quota reservation is converted to real quota
> > block usage when a transaction commits that maps real/physical blocks.
> > If the transaction held extra reservation that went unused, that quota
> > reservation is released. The primary difference is that transactions to
> > convert delalloc -> real do not reserve quota blocks in the first place,
> > since that has already occurred, so they just need to make sure to
> > convert/persist the quota res for the blocks that were converted to
> > real.
> 
> I thought about cutting this whole comment down to a simple sentence
> about how quota accounting is different between the data & cow forks (as
> you figured out, we use delalloc quota reservations for everything in
> the cow fork and only turn them into real ones when we go to remap) but
> then worried that doing so would presuppose the reader knew anything
> about how the extent lifecycles work... and that's how I end up with a
> gigantic manual.
> 

Which is probably fine for xfs-docs or something..

The more I think about it, the more I think this whole thing is better
off focused on explaining the unique quota management of COW fork
blocks. We can add additional comments about extent lifecycles, how
quota is tracked in general, etc., but perhaps it's best to make those a
separate patch rather than attempt to document a ground-up "COW fork
quotas for dummies" in a single comment. :P Anyways, I'll reserve
further judgement for the new patch..

Brian

> > > + * Copy on Write Fork Mapping Lifecycle
> > > + *
> > > + * The CoW fork handles things differently from the data fork because its
> > > + * mappings only exist in memory-- the refcount btree is the on-disk owner of
> > > + * the extents until they're remapped into the data fork.  Therefore,
> > > + * unwritten and real extents in the CoW fork are treated the same way as
> > > + * delayed allocation extents.  Quota and fdblock changes only exist in
> > > + * memory, which requires some twists in the bmap functions.
> > > + *
> > 
> > Ok, but perhaps this should point out what happens when cow blocks are
> > reserved with respect to quotas..? IIUC, a delalloc quota reservation
> > occurs just the same as above, the blocks simply reside in another fork.
> 
> Yes.
> 
> > > + * The CoW fork extent state diagram looks like this:
> > > + *
> > > + *    +--------> UNWRITTEN -------+
> > > + *    |              ^            |
> > > + *    |              v            v
> > > + * DELALLOC <----> HOLE <------- REAL
> > > + *
> > > + * Holes are still holes.  Delayed allocation extents reserve blocks for
> > > + * landing future writes, just like they do in the data fork.  However, unlike
> > > + * the data fork, unwritten extents signal an extent that has been allocated
> > > + * but is not currently undergoing writeback.  Real extents are undergoing
> > > + * writeback, and when that writeback finishes the corresponding data fork
> > > + * extent will be punched out and the CoW fork counterpart moved to the new
> > > + * hole in the data fork.
> > > + *
> > 
> > Ok, so the difference is that for the COW fork, the _extent state_
> > conversion is not the appropriate trigger event to convert quota
> > reservation to real quota usage. Instead, the blocks being remapped from
> > the COW fork to the data fork is when that should occur.
> 
> Yes.
> 
> > > + * The state transitions and required metadata updates are as follows:
> > > + *
> > > + * - HOLE to DELALLOC: Increase i_cow_blocks and q_res_bcount, and decrease
> > > + *           fdblocks.
> > > + * - HOLE to UNWRITTEN: Same as above, but since we reserved quota via
> > > + *           qt_blk_res (which increased q_res_bcount) when we allocate the
> > > + *           extent we have to decrease qt_blk_res so that the commit doesn't
> > > + *           give the allocated CoW blocks back.
> > > + *
> > 
> > Hmm, this is a little confusing. Looking at the code change and comment
> > below, I think I get what this is trying to do, which is essentially
> > make a real block cow fork alloc behave like a delalloc reservation
> > (with respect to quota). FWIW, I think what confuses me is the assertion
> > that the blocks would be "given back" otherwise. The only reference I
> > have to compare is data fork alloc behavior, which implies that used
> > reservation would not be given back, but rather converted to real quota
> > usage on block allocation (and excess reservation would still be given
> > back, which afaict we still want to happen). So the trickery is required
> > to prevent conversion of quota reservation for the allocated cow blocks,
> > let that res sit around until the cow blocks are remapped, and release
> > unused reservation from the tx as normal. Am I following that correctly?
> 
> Yes.
> 
> > > + * - DELALLOC to UNWRITTEN: No change.
> > > + * - DELALLOC to HOLE: Decrease i_cow_blocks and q_res_bcount, and increase
> > > + *           fdblocks.
> > > + *
> > > + * - UNWRITTEN to HOLE: Same as DELALLOC to HOLE.
> > > + * - UNWRITTEN to REAL: No change.
> > > + *
> > > + * - REAL to HOLE: This transition happens when we've finished a write
> > > + *           operation and need to move the mapping to the data fork.  We
> > > + *           punch the corresponding data fork mappings, which decreases
> > > + *           qt_bcount.  Then we map the CoW fork mapping into the hole we
> > > + *           just cleared out of the data fork, which increases qt_bcount.
> > > + *           There's a subtlety here -- if we promoted a write over a hole to
> > > + *           CoW, there will be a net increase in qt_bcount, which is fine
> > > + *           because we already reserved the quota when we filled the CoW
> > > + *           fork.  Finally, we punch the CoW fork mapping, which decreases
> > > + *           q_res_bcount.
> > > + *
> > > + * Notice how all CoW fork extents use transactionless quota reservations and
> > > + * the in-core fdblocks to maintain state, and we avoid updating any on-disk
> > > + * metadata.  This is essential to maintain metadata correctness if the system
> > > + * goes down.
> > > + */
> > >  
> > >  kmem_zone_t		*xfs_bmap_free_item_zone;
> > >  
> > > @@ -3337,6 +3476,39 @@ xfs_bmap_btalloc_filestreams(
> > >  	return 0;
> > >  }
> > >  
> > > +/* Deal with CoW fork accounting when we allocate a block. */
> > > +static void
> > > +xfs_bmap_btalloc_cow(
> > > +	struct xfs_bmalloca	*ap,
> > > +	struct xfs_alloc_arg	*args)
> > > +{
> > > +	/* Filling a previously reserved extent; nothing to do here. */
> > > +	if (ap->wasdel)
> > > +		return;
> > > +
> > > +	/*
> > > +	 * The CoW fork only exists in memory, so the on-disk quota accounting
> > > +	 * must not include any CoW fork extents.  Therefore, CoW blocks are
> > > +	 * only tracked in the in-core dquot block count (q_res_bcount).
> > > +	 *
> > > +	 * If we get here, we're filling a CoW hole with a real (non-delalloc)
> > > +	 * CoW extent having reserved enough blocks from both q_res_bcount and
> > > +	 * qt_blk_res to guarantee that we won't run out of space.  The unused
> > > +	 * qt_blk_res is given back to q_res_bcount when the transaction
> > > +	 * commits.
> > > +	 *
> > > +	 * We don't want the quota accounting for our newly allocated blocks
> > > +	 * to be given back, so we must decrease qt_blk_res without decreasing
> > > +	 * q_res_bcount.
> > > +	 *
> > > +	 * Note: If we're allocating a delalloc extent, we already reserved
> > > +	 * the q_res_bcount blocks, so no quota accounting update is needed
> > > +	 * here.
> > > +	 */
> > > +	xfs_trans_mod_dquot_byino(ap->tp, ap->ip, XFS_TRANS_DQ_RES_BLKS,
> > > +			-(long)args->len);
> > > +}
> > 
> > Factoring nit.. if we're going to refactor bits of xfs_bmap_btalloc()
> > out, it might be cleaner to factor out all of the quota logic rather
> > than just the cow bits (which is basically just a simple check and
> > function call). E.g., refactor into an xfs_bmap_btalloc_quota() helper
> > that does the right thing based on the fork, with comments as to why,
> > etc. (and perhaps just leave the unrelated di_nblocks change behind).
> 
> I thought about factoring the data/attr fork stuff into its own
> xfs_bmap_btalloc_quota() function too, since this function is already
> eyewateringly long.  I think I'll do that as a separate refactor at the
> end of the series, though...
> 
> --D
> 
> > Brian
> > 
> > > +
> > >  STATIC int
> > >  xfs_bmap_btalloc(
> > >  	struct xfs_bmalloca	*ap)	/* bmap alloc argument struct */
> > > @@ -3571,19 +3743,22 @@ xfs_bmap_btalloc(
> > >  			*ap->firstblock = args.fsbno;
> > >  		ASSERT(nullfb || fb_agno <= args.agno);
> > >  		ap->length = args.len;
> > > -		if (!(ap->flags & XFS_BMAPI_COWFORK))
> > > -			ap->ip->i_d.di_nblocks += args.len;
> > > -		xfs_trans_log_inode(ap->tp, ap->ip, XFS_ILOG_CORE);
> > >  		if (ap->wasdel)
> > >  			ap->ip->i_delayed_blks -= args.len;
> > > -		/*
> > > -		 * Adjust the disk quota also. This was reserved
> > > -		 * earlier.
> > > -		 */
> > > -		xfs_trans_mod_dquot_byino(ap->tp, ap->ip,
> > > -			ap->wasdel ? XFS_TRANS_DQ_DELBCOUNT :
> > > -					XFS_TRANS_DQ_BCOUNT,
> > > -			(long) args.len);
> > > +		if (ap->flags & XFS_BMAPI_COWFORK) {
> > > +			xfs_bmap_btalloc_cow(ap, &args);
> > > +		} else {
> > > +			ap->ip->i_d.di_nblocks += args.len;
> > > +			xfs_trans_log_inode(ap->tp, ap->ip, XFS_ILOG_CORE);
> > > +			/*
> > > +			 * Adjust the disk quota also. This was reserved
> > > +			 * earlier.
> > > +			 */
> > > +			xfs_trans_mod_dquot_byino(ap->tp, ap->ip,
> > > +				ap->wasdel ? XFS_TRANS_DQ_DELBCOUNT :
> > > +						XFS_TRANS_DQ_BCOUNT,
> > > +				(long) args.len);
> > > +		}
> > >  	} else {
> > >  		ap->blkno = NULLFSBLOCK;
> > >  		ap->length = 0;
> > > @@ -4776,6 +4951,7 @@ xfs_bmap_del_extent_cow(
> > >  	struct xfs_bmbt_irec	new;
> > >  	xfs_fileoff_t		del_endoff, got_endoff;
> > >  	int			state = BMAP_COWFORK;
> > > +	int			error;
> > >  
> > >  	XFS_STATS_INC(mp, xs_del_exlist);
> > >  
> > > @@ -4832,6 +5008,11 @@ xfs_bmap_del_extent_cow(
> > >  		xfs_iext_insert(ip, icur, &new, state);
> > >  		break;
> > >  	}
> > > +
> > > +	/* Remove the quota reservation */
> > > +	error = xfs_trans_reserve_quota_nblks(NULL, ip,
> > > +			-(long)del->br_blockcount, 0, XFS_QMOPT_RES_REGBLKS);
> > > +	ASSERT(error == 0);
> > >  }
> > >  
> > >  /*
> > > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > > index 82abff6..e367351 100644
> > > --- a/fs/xfs/xfs_reflink.c
> > > +++ b/fs/xfs/xfs_reflink.c
> > > @@ -599,10 +599,6 @@ xfs_reflink_cancel_cow_blocks(
> > >  					del.br_startblock, del.br_blockcount,
> > >  					NULL);
> > >  
> > > -			/* Update quota accounting */
> > > -			xfs_trans_mod_dquot_byino(*tpp, ip, XFS_TRANS_DQ_BCOUNT,
> > > -					-(long)del.br_blockcount);
> > > -
> > >  			/* Roll the transaction */
> > >  			xfs_defer_ijoin(&dfops, ip);
> > >  			error = xfs_defer_finish(tpp, &dfops);
> > > @@ -795,6 +791,10 @@ xfs_reflink_end_cow(
> > >  		if (error)
> > >  			goto out_defer;
> > >  
> > > +		/* Charge this new data fork mapping to the on-disk quota. */
> > > +		xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT,
> > > +				(long)del.br_blockcount);
> > > +
> > >  		/* Remove the mapping from the CoW fork. */
> > >  		xfs_bmap_del_extent_cow(ip, &icur, &got, &del);
> > >  
> > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH v2 04/11] xfs: CoW fork operations should only update quota reservations
  2018-01-25  1:20   ` [PATCH v2 " Darrick J. Wong
@ 2018-01-25 13:03     ` Brian Foster
  2018-01-25 18:20       ` Darrick J. Wong
  2018-01-26 12:12     ` Christoph Hellwig
  1 sibling, 1 reply; 60+ messages in thread
From: Brian Foster @ 2018-01-25 13:03 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Jan 24, 2018 at 05:20:35PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Since the CoW fork only exists in memory, it is incorrect to update the
> on-disk quota block counts when we modify the CoW fork.  Unlike the data
> fork, even real extents in the CoW fork are only reservations (on-disk
> they're owned by the refcountbt) so they must not be tracked in the on
> disk quota info.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
> v2: make documentation more crisp and to the point
> ---
>  fs/xfs/libxfs/xfs_bmap.c |  118 ++++++++++++++++++++++++++++++++++++++++++----
>  fs/xfs/xfs_quota.h       |   14 ++++-
>  fs/xfs/xfs_reflink.c     |    8 ++-
>  3 files changed, 122 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 0c9c9cd..7f0ac40 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -52,6 +52,71 @@
>  #include "xfs_refcount.h"
>  #include "xfs_icache.h"
>  
> +/*
> + * Data/Attribute Fork Mapping Lifecycle
> + *
> + * The data fork contains the block mappings between logical blocks in a file
> + * and physical blocks on the disk.  The XFS notions of delayed allocation
> + * reservations, unwritten extents, and real extents follow well known
> + * conventions in the filesystem world.
> + *
> + * Data fork extent states follow these transitions:
> + *
> + *    +--------> UNWRITTEN <------+
> + *    |              ^            |
> + *    |              v            v
> + * DELALLOC <----> HOLE <------> REAL
> + *    |                           ^
> + *    |                           |
> + *    +---------------------------+
> + *
> + * Every delayed allocation reserves in-memory quota blocks (q_res_bcount) and
> + * in-memory fs free blocks (fdblocks), and increases the in-memory per-inode
> + * i_delayed_blks.  The reservation includes potentially required bmbt blocks.
> + *

So we have some bits about extent states..

> + * Every transaction reserves quota blocks (qt_blk_res) from the in-memory
> + * quota blocks and free blocks (t_blk_res) from the in-memory fs free blocks.
> + * The transaction tracks both the number of blocks used from its own
> + * reservation as well as the number of blocks used that came from a delayed
> + * allocation.  When the transaction commits, it gives back the unused parts
> + * of its own block reservations.  Next, it adds any block usage that came
> + * from a delayed allocation to the on-disk counters without changing the
> + * in-memory reservations (q_res_bcount and fdblocks).
> + *

Some bits about quota, transactions..

> + * To convert a delayed allocation to a real or unwritten extent, we use a
> + * transaction to allocate the blocks.  At commit time, the block reservations
> + * are given back or added to the on-disk counters as noted above.
> + * i_delayed_blks is decreased while the on-disk per-inode di_nblocks is
> + * increased.
> + *
> + * The attribute fork works in the same way as the data fork except that the
> + * only valid states are REAL and HOLE.
> + *
> + * Copy on Write Fork Mapping Lifecycle
> + *
> + * The CoW fork exists only in memory and is used to stage copy writes for
> + * file data and has fewer transitions:
> + *
> + *    +--------> UNWRITTEN -------+
> + *    |              ^            |
> + *    |              v            v
> + * DELALLOC <----> HOLE <------- REAL
> + *
> + * Delayed allocation extents here are treated the same as in the data fork
> + * except that they are counted by the per-inode i_cow_blocks instead of
> + * i_delayed_blks.
> + *

COW fork extent state..

> + * Unwritten and real extents are counted by the quota code as block
> + * reservations (q_res_bcount) and not on-disk quota blocks (d_bcount), and
> + * are counted by the free block counters as in-memory reservations (fdblocks)
> + * and not on-disk free blocks (sb_fdblocks).  These blocks are also counted
> + * by i_cow_blocks and not the on-disk di_nblocks.
> + *
> + * When a CoW fork extent is remapped to the data fork, the reservations are
> + * converted into on-disk counts in the same manner as a delayed allocation
> + * conversion in the data fork.  The number of blocks being remapped is
> + * subtracted from i_cow_blocks and added to di_nblocks.
> + */
>  

COW fork quota quirkiness..

In all, I'd probably suggest splitting this whole thing up into maybe 3
independent comments. If we don't have anything that explains extent
states properly, perhaps do that in the headers where we define the
extent state definitions (and then explain how they are used differently
between forks). If we don't have sufficient explanation of how quota
reservation lifecycle works, do the same in the headers that define the
data structures that map transaction deltas to actual quota accounting.

Finally, this patch already does some refactoring below to deal with the
COW fork quota accounting quirkiness, so we can use the comment in that
function to explain exactly what/why is going on here. That's just my
.02 on the whole comment.

>  kmem_zone_t		*xfs_bmap_free_item_zone;
>  
> @@ -3337,6 +3402,28 @@ xfs_bmap_btalloc_filestreams(
>  	return 0;
>  }
>  
> +/* Deal with CoW fork accounting when we allocate a block. */
> +static void
> +xfs_bmap_btalloc_quota_cow(
> +	struct xfs_bmalloca	*ap,
> +	struct xfs_alloc_arg	*args)
> +{
> +	/* Filling a previously reserved extent; nothing to do here. */
> +	if (ap->wasdel)
> +		return;
> +
> +	/*
> +	 * If we get here, we're filling a CoW hole with a real (non-delalloc)
> +	 * CoW extent having reserved enough blocks from both q_res_bcount and
> +	 * qt_blk_res to guarantee that we won't run out of space.  The unused
> +	 * qt_blk_res is given back to q_res_bcount when the transaction
> +	 * commits, so we must decrease qt_blk_res without decreasing
> +	 * q_res_bcount.
> +	 */
> +	xfs_trans_mod_dquot_byino(ap->tp, ap->ip, XFS_TRANS_DQ_RES_BLKS,
> +			-(long)args->len);
> +}
> +

Not sure we were on the same page wrt my previous comment here.. I
think this should look something like:

/*
 * Update quota accounting for physical allocation. Depends on extent
 * state, target fork, etc.
 */
static void
xfs_bmap_btalloc_dquot(...)
{
	/*
	 * COW fork requires quota accounting magic words words words ..
	 */
	if (COWFORK) {
		/* do not account wasdel extents as used because ... */
		if (ap->wasdel)
			return;
		/*
		 * quota reservation exists in transaction, do magic to
		 * cause tx to leave a delalloc-like reservation after
		 * the transaction commits because cow fork words words
		 * words ...
		 */
		xfs_trans_mod_dquot_byino(..)
	} else {
		/* adjust disk quota ... */
		xfs_trans_mod_dquot_byino(...)
	}
}

So we don't have to scroll up and down the source file to understand
how/why we process quotas differently between the cow and other forks.
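
For reference, the shape of the helper outlined above can be sketched as
a small standalone C model.  This is only an illustration under stated
assumptions: struct toy_dqtrx, enum toy_fork and toy_btalloc_dquot() are
hypothetical stand-ins invented here, the three fields merely mirror
qt_blk_res, qt_bcount_delta and qt_delbcnt_delta, and none of it is
kernel code or the final form of the patch.

```c
#include <stdbool.h>

/* Toy stand-in for the per-transaction dquot deltas; not a kernel type. */
struct toy_dqtrx {
	long	blk_res;	/* models qt_blk_res */
	long	bcount_delta;	/* models qt_bcount_delta */
	long	delbcnt_delta;	/* models qt_delbcnt_delta */
};

/* Hypothetical fork selector; the kernel tests XFS_BMAPI_COWFORK instead. */
enum toy_fork { TOY_DATA_FORK, TOY_COW_FORK };

/* Update the toy quota deltas for a physical allocation of len blocks. */
static void
toy_btalloc_dquot(
	struct toy_dqtrx	*qtrx,
	enum toy_fork		fork,
	bool			wasdel,
	long			len)
{
	if (fork == TOY_COW_FORK) {
		/* Filling a delalloc extent: the reservation already stands. */
		if (wasdel)
			return;
		/*
		 * Shrink the transaction's reservation so that commit does
		 * not hand these blocks back.  They stay reserved in memory
		 * until the extent is remapped into the data fork.
		 */
		qtrx->blk_res -= len;
		return;
	}

	/* Data/attr fork: record usage to be charged to the on-disk count. */
	if (wasdel)
		qtrx->delbcnt_delta += len;
	else
		qtrx->bcount_delta += len;
}
```

Keeping both branches in one function puts the reason for the CoW
special case right next to the ordinary data/attr fork path.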

>  STATIC int
>  xfs_bmap_btalloc(
>  	struct xfs_bmalloca	*ap)	/* bmap alloc argument struct */
> @@ -3571,19 +3658,22 @@ xfs_bmap_btalloc(
>  			*ap->firstblock = args.fsbno;
>  		ASSERT(nullfb || fb_agno <= args.agno);
>  		ap->length = args.len;
> -		if (!(ap->flags & XFS_BMAPI_COWFORK))
> -			ap->ip->i_d.di_nblocks += args.len;
> -		xfs_trans_log_inode(ap->tp, ap->ip, XFS_ILOG_CORE);

... and I'd just leave this here.

(Or alternately rename the helper function to something like
xfs_bmap_btalloc_fork() and put the entire if/else logic in there.)

>  		if (ap->wasdel)
>  			ap->ip->i_delayed_blks -= args.len;
> -		/*
> -		 * Adjust the disk quota also. This was reserved
> -		 * earlier.
> -		 */
> -		xfs_trans_mod_dquot_byino(ap->tp, ap->ip,
> -			ap->wasdel ? XFS_TRANS_DQ_DELBCOUNT :
> -					XFS_TRANS_DQ_BCOUNT,
> -			(long) args.len);
> +		if (ap->flags & XFS_BMAPI_COWFORK) {
> +			xfs_bmap_btalloc_quota_cow(ap, &args);
> +		} else {
> +			ap->ip->i_d.di_nblocks += args.len;
> +			xfs_trans_log_inode(ap->tp, ap->ip, XFS_ILOG_CORE);
> +			/*
> +			 * Adjust the disk quota also. This was reserved
> +			 * earlier.
> +			 */
> +			xfs_trans_mod_dquot_byino(ap->tp, ap->ip,
> +				ap->wasdel ? XFS_TRANS_DQ_DELBCOUNT :
> +						XFS_TRANS_DQ_BCOUNT,
> +				(long) args.len);
> +		}

And this just calls the helper above.

>  	} else {
>  		ap->blkno = NULLFSBLOCK;
>  		ap->length = 0;
> @@ -4760,6 +4850,7 @@ xfs_bmap_del_extent_cow(
>  	struct xfs_bmbt_irec	new;
>  	xfs_fileoff_t		del_endoff, got_endoff;
>  	int			state = BMAP_COWFORK;
> +	int			error;
>  
>  	XFS_STATS_INC(mp, xs_del_exlist);
>  
> @@ -4816,6 +4907,11 @@ xfs_bmap_del_extent_cow(
>  		xfs_iext_insert(ip, icur, &new, state);
>  		break;
>  	}
> +
> +	/* Remove the quota reservation */
> +	error = xfs_trans_reserve_quota_nblks(NULL, ip,
> +			-(long)del->br_blockcount, 0, XFS_QMOPT_RES_REGBLKS);
> +	ASSERT(error == 0);
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_quota.h b/fs/xfs/xfs_quota.h
> index ce6506a..34b4ec2 100644
> --- a/fs/xfs/xfs_quota.h
> +++ b/fs/xfs/xfs_quota.h
> @@ -54,11 +54,19 @@ struct xfs_trans;
>   */
>  typedef struct xfs_dqtrx {
>  	struct xfs_dquot *qt_dquot;	  /* the dquot this refers to */
> -	ulong		qt_blk_res;	  /* blks reserved on a dquot */
> +
> +	/* dquot bcount blks reserved for this transaction */
> +	ulong		qt_blk_res;
> +
>  	ulong		qt_ino_res;	  /* inode reserved on a dquot */
>  	ulong		qt_ino_res_used;  /* inodes used from the reservation */
> -	long		qt_bcount_delta;  /* dquot blk count changes */
> -	long		qt_delbcnt_delta; /* delayed dquot blk count changes */
> +
> +	/* dquot block count changes taken from qt_blk_res */
> +	long		qt_bcount_delta;
> +
> +	/* dquot block count changes taken from delalloc reservation */
> +	long		qt_delbcnt_delta;
> +

Separate patch.

>  	long		qt_icount_delta;  /* dquot inode count changes */
>  	ulong		qt_rtblk_res;	  /* # blks reserved on a dquot */
>  	ulong		qt_rtblk_res_used;/* # blks used from reservation */
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index 82abff6..e367351 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -599,10 +599,6 @@ xfs_reflink_cancel_cow_blocks(
>  					del.br_startblock, del.br_blockcount,
>  					NULL);
>  
> -			/* Update quota accounting */
> -			xfs_trans_mod_dquot_byino(*tpp, ip, XFS_TRANS_DQ_BCOUNT,
> -					-(long)del.br_blockcount);
> -
>  			/* Roll the transaction */
>  			xfs_defer_ijoin(&dfops, ip);
>  			error = xfs_defer_finish(tpp, &dfops);
> @@ -795,6 +791,10 @@ xfs_reflink_end_cow(
>  		if (error)
>  			goto out_defer;
>  
> +		/* Charge this new data fork mapping to the on-disk quota. */
> +		xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT,
> +				(long)del.br_blockcount);
> +

Should this technically be XFS_TRANS_DQ_DELBCOUNT? The blocks obviously
aren't delalloc and this transaction doesn't make a quota reservation so
I don't think it screws up accounting. But if the transaction did make a
quota reservation, it seems like this would account the extent against
the tx reservation where it instead should recognize that cow blocks
have already been reserved (which is essentially what DELBCOUNT means,
IIUC).
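
For illustration only, the distinction asked about above can be modelled
in a few lines of userspace C. This is a toy sketch based solely on the
field comments in the quoted xfs_dqtrx hunk (qt_bcount_delta is "taken
from qt_blk_res", qt_delbcnt_delta is "taken from delalloc
reservation"). All toy_* names are hypothetical, and the real
commit-time accounting in the kernel is more involved than this.

```c
#include <assert.h>

/* Toy stand-in for the per-transaction dquot deltas (names hypothetical). */
struct toy_dqtrx {
	unsigned long	blk_res;	/* blocks reserved by this transaction */
	long		bcount_delta;	/* changes drawn from blk_res */
	long		delbcnt_delta;	/* changes drawn from an earlier reservation */
};

enum toy_field { TOY_DQ_BCOUNT, TOY_DQ_DELBCOUNT };

/* Record a block count change against one of the two delta fields. */
static void
toy_mod_dquot(struct toy_dqtrx *qtrx, enum toy_field field, long delta)
{
	if (field == TOY_DQ_BCOUNT)
		qtrx->bcount_delta += delta;
	else
		qtrx->delbcnt_delta += delta;
}

/* Blocks that would be charged to this transaction's own reservation. */
static long
toy_tx_res_used(const struct toy_dqtrx *qtrx)
{
	return qtrx->bcount_delta > 0 ? qtrx->bcount_delta : 0;
}

/* Net change to the dquot block count, whichever field carried it. */
static long
toy_total_delta(const struct toy_dqtrx *qtrx)
{
	return qtrx->bcount_delta + qtrx->delbcnt_delta;
}
```

In this model either field moves the block count by the same amount;
only the BCOUNT path is counted against a reservation made by the
transaction itself, which is the case the question is concerned with.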

Other than that the code seems Ok to me.

Brian

>  		/* Remove the mapping from the CoW fork. */
>  		xfs_bmap_del_extent_cow(ip, &icur, &got, &del);
>  
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

* Re: [PATCH 05/11] xfs: track CoW blocks separately in the inode
  2018-01-24  2:18 ` [PATCH 05/11] xfs: track CoW blocks separately in the inode Darrick J. Wong
@ 2018-01-25 13:06   ` Brian Foster
  2018-01-25 19:21     ` Darrick J. Wong
  2018-01-26 12:15   ` Christoph Hellwig
  1 sibling, 1 reply; 60+ messages in thread
From: Brian Foster @ 2018-01-25 13:06 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Jan 23, 2018 at 06:18:29PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Track the number of blocks reserved in the CoW fork so that we can
> move the quota reservations whenever we chown, and don't account for
> CoW fork delalloc reservations in i_delayed_blks.  This should make
> chown work properly for quota reservations, enables us to fully
> account for real extents in the cow fork in the file stat info, and
> improves the post-eof scanning decisions because we're no longer
> confusing data fork delalloc extents with cow fork delalloc extents.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_bmap.c      |   16 ++++++++++++----
>  fs/xfs/libxfs/xfs_inode_buf.c |    1 +
>  fs/xfs/xfs_bmap_util.c        |    5 +++++
>  fs/xfs/xfs_icache.c           |    3 ++-
>  fs/xfs/xfs_inode.c            |   11 +++++------
>  fs/xfs/xfs_inode.h            |    1 +
>  fs/xfs/xfs_iops.c             |    3 ++-
>  fs/xfs/xfs_itable.c           |    3 ++-
>  fs/xfs/xfs_qm.c               |    2 +-
>  fs/xfs/xfs_reflink.c          |    4 ++--
>  fs/xfs/xfs_super.c            |    1 +
>  11 files changed, 34 insertions(+), 16 deletions(-)
> 
> 
...
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 4a38cfc..a208825 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
...
> @@ -1669,7 +1667,7 @@ xfs_release(
>  		truncated = xfs_iflags_test_and_clear(ip, XFS_ITRUNCATED);
>  		if (truncated) {
>  			xfs_iflags_clear(ip, XFS_IDIRTY_RELEASE);
> -			if (ip->i_delayed_blks > 0) {
> +			if (ip->i_delayed_blks > 0 || ip->i_cow_blocks > 0) {
>  				error = filemap_flush(VFS_I(ip)->i_mapping);
>  				if (error)
>  					return error;

Is having cowblocks really relevant to this hunk? I thought this was
purely a delalloc vs. file size thing, but I could be wrong. 

Brian

> @@ -1909,7 +1907,8 @@ xfs_inactive(
>  
>  	if (S_ISREG(VFS_I(ip)->i_mode) &&
>  	    (ip->i_d.di_size != 0 || XFS_ISIZE(ip) != 0 ||
> -	     ip->i_d.di_nextents > 0 || ip->i_delayed_blks > 0))
> +	     ip->i_d.di_nextents > 0 || ip->i_delayed_blks > 0 ||
> +	     ip->i_cow_blocks > 0))
>  		truncate = 1;
>  
>  	error = xfs_qm_dqattach(ip, 0);
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index ff56486..6feee8a 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -62,6 +62,7 @@ typedef struct xfs_inode {
>  	/* Miscellaneous state. */
>  	unsigned long		i_flags;	/* see defined flags below */
>  	unsigned int		i_delayed_blks;	/* count of delay alloc blks */
> +	unsigned int		i_cow_blocks;	/* count of cow fork blocks */
>  
>  	struct xfs_icdinode	i_d;		/* most of ondisk inode */
>  
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 56475fc..6c3381c 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -513,7 +513,8 @@ xfs_vn_getattr(
>  	stat->mtime = inode->i_mtime;
>  	stat->ctime = inode->i_ctime;
>  	stat->blocks =
> -		XFS_FSB_TO_BB(mp, ip->i_d.di_nblocks + ip->i_delayed_blks);
> +		XFS_FSB_TO_BB(mp, ip->i_d.di_nblocks + ip->i_delayed_blks +
> +				  ip->i_cow_blocks);
>  
>  	if (ip->i_d.di_version == 3) {
>  		if (request_mask & STATX_BTIME) {
> diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
> index d583105..412d7eb 100644
> --- a/fs/xfs/xfs_itable.c
> +++ b/fs/xfs/xfs_itable.c
> @@ -122,7 +122,8 @@ xfs_bulkstat_one_int(
>  	case XFS_DINODE_FMT_BTREE:
>  		buf->bs_rdev = 0;
>  		buf->bs_blksize = mp->m_sb.sb_blocksize;
> -		buf->bs_blocks = dic->di_nblocks + ip->i_delayed_blks;
> +		buf->bs_blocks = dic->di_nblocks + ip->i_delayed_blks +
> +				 ip->i_cow_blocks;
>  		break;
>  	}
>  	xfs_iunlock(ip, XFS_ILOCK_SHARED);
> diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
> index 5b848f4..28f12f8 100644
> --- a/fs/xfs/xfs_qm.c
> +++ b/fs/xfs/xfs_qm.c
> @@ -1847,7 +1847,7 @@ xfs_qm_vop_chown_reserve(
>  	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL|XFS_ILOCK_SHARED));
>  	ASSERT(XFS_IS_QUOTA_RUNNING(mp));
>  
> -	delblks = ip->i_delayed_blks;
> +	delblks = ip->i_delayed_blks + ip->i_cow_blocks;
>  	blkflags = XFS_IS_REALTIME_INODE(ip) ?
>  			XFS_QMOPT_RES_RTBLKS : XFS_QMOPT_RES_REGBLKS;
>  
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index e367351..f875ea7 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -619,7 +619,7 @@ xfs_reflink_cancel_cow_blocks(
>  	}
>  
>  	/* clear tag if cow fork is emptied */
> -	if (!ifp->if_bytes)
> +	if (ip->i_cow_blocks == 0)
>  		xfs_inode_clear_cowblocks_tag(ip);
>  
>  	return error;
> @@ -704,7 +704,7 @@ xfs_reflink_end_cow(
>  	trace_xfs_reflink_end_cow(ip, offset, count);
>  
>  	/* No COW extents?  That's easy! */
> -	if (ifp->if_bytes == 0)
> +	if (ip->i_cow_blocks == 0)
>  		return 0;
>  
>  	offset_fsb = XFS_B_TO_FSBT(ip->i_mount, offset);
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index f3e0001..9d04cfb 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -989,6 +989,7 @@ xfs_fs_destroy_inode(
>  	xfs_inactive(ip);
>  
>  	ASSERT(XFS_FORCED_SHUTDOWN(ip->i_mount) || ip->i_delayed_blks == 0);
> +	ASSERT(XFS_FORCED_SHUTDOWN(ip->i_mount) || ip->i_cow_blocks == 0);
>  	XFS_STATS_INC(ip->i_mount, vn_reclaim);
>  
>  	/*
> 

* Re: [PATCH 06/11] xfs: fix up cowextsz allocation shortfalls
  2018-01-24  2:18 ` [PATCH 06/11] xfs: fix up cowextsz allocation shortfalls Darrick J. Wong
@ 2018-01-25 17:31   ` Brian Foster
  2018-01-25 20:20     ` Darrick J. Wong
  2018-01-26  9:11   ` Christoph Hellwig
  1 sibling, 1 reply; 60+ messages in thread
From: Brian Foster @ 2018-01-25 17:31 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Jan 23, 2018 at 06:18:35PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> In xfs_bmap_btalloc, we try using the CoW extent size hint to force
> allocations to align (offset-wise) to cowextsz granularity to reduce CoW
> fragmentation.  This works fine until we cannot satisfy the allocation
> with enough blocks to cover the requested range and the alignment hints.
> If this happens, return an unaligned region because if we don't the
> extent trim functions cause us to return a zero-length extent to iomap,
> which iomap doesn't catch and thus blows up.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

Hmm.. is this a direct I/O thing? The description of the problem had me
wondering how we handle this with regard to dio and traditional extent
size hints. It looks like we just return -ENOSPC if xfs_bmapi_write()
doesn't return a mapping that covers the target range of the write (even
if it apparently attempts to allocate part of the associated extent size
hint range). E.g., see the nimaps == 0 check in
xfs_iomap_write_direct() after we commit the transaction.

In fact, it looks like just repeating the failed write could eventually
succeed if the issue is that there is actually enough free space
available to allocate the hint range up to where the write is targeted,
but no extent long enough to fill the extent size hint range
in a single bmapi_write call. That behavior is a bit strange, I admit,
but I'm wondering if we could do the same thing for the cow hint. Would
a similar nimaps check in the xfs_bmapi_write() caller resolve the bug
described here?
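
As a rough illustration of the caller-side check being suggested, here
is a toy userspace model. It is a sketch under stated assumptions, not
kernel code: toy_map, toy_bmapi_write() and toy_write_alloc() are
hypothetical stand-ins, and whether xfs_bmapi_write() reports nimaps
this way for the CoW fork is exactly the open question.

```c
#include <errno.h>

/* Toy mapping result (hypothetical type, not the kernel's extent record). */
struct toy_map {
	unsigned long long	startoff;
	unsigned long long	blockcount;
};

/*
 * Stand-in allocator: pretends the allocation landed at
 * [alloc_off, alloc_off + alloc_len) and reports a mapping only if that
 * range covers the block the write actually targets.
 */
static int
toy_bmapi_write(unsigned long long target_off, unsigned long long alloc_off,
		unsigned long long alloc_len, struct toy_map *map)
{
	if (target_off < alloc_off || target_off >= alloc_off + alloc_len)
		return 0;	/* nimaps == 0: nothing covers the target */
	map->startoff = alloc_off;
	map->blockcount = alloc_len;
	return 1;
}

/* Caller-side check: treat "no mapping for the target" as -ENOSPC. */
static int
toy_write_alloc(unsigned long long target_off, unsigned long long alloc_off,
		unsigned long long alloc_len, struct toy_map *map)
{
	int nimaps = toy_bmapi_write(target_off, alloc_off, alloc_len, map);

	if (nimaps == 0)
		return -ENOSPC;
	return 0;
}
```

With this shape the caller never hands a zero-length extent onward; a
short allocation that misses the target block surfaces as an error that
the writer can retry.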

If so and if we still care to actually change/fix the allocation
behavior with regard to the hints, perhaps we could do that in a
separate patch more generically for both hints..?

Brian

>  fs/iomap.c               |    2 +-
>  fs/xfs/libxfs/xfs_bmap.c |   21 +++++++++++++++++++--
>  2 files changed, 20 insertions(+), 3 deletions(-)
> 
> 
> diff --git a/fs/iomap.c b/fs/iomap.c
> index e5de772..aec35a0 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -63,7 +63,7 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
>  	ret = ops->iomap_begin(inode, pos, length, flags, &iomap);
>  	if (ret)
>  		return ret;
> -	if (WARN_ON(iomap.offset > pos))
> +	if (WARN_ON(iomap.offset > pos) || WARN_ON(iomap.length == 0))
>  		return -EIO;
>  
>  	/*
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 93ce2c6..4ec1fdc5 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -3480,8 +3480,20 @@ xfs_bmap_btalloc_filestreams(
>  static void
>  xfs_bmap_btalloc_cow(
>  	struct xfs_bmalloca	*ap,
> -	struct xfs_alloc_arg	*args)
> +	struct xfs_alloc_arg	*args,
> +	xfs_fileoff_t		orig_offset,
> +	xfs_extlen_t		orig_length)
>  {
> +	/*
> +	 * If we didn't get enough blocks to satisfy the cowextsize
> +	 * aligned request, break the alignment and return whatever we
> +	 * got; it's the best we can do.
> +	 */
> +	if (ap->length <= orig_length)
> +		ap->offset = orig_offset;
> +	else if (ap->offset + ap->length < orig_offset + orig_length)
> +		ap->offset = orig_offset + orig_length - ap->length;
> +
>  	/* Filling a previously reserved extent; nothing to do here. */
>  	if (ap->wasdel)
>  		return;
> @@ -3520,6 +3532,8 @@ xfs_bmap_btalloc(
>  	xfs_agnumber_t	fb_agno;	/* ag number of ap->firstblock */
>  	xfs_agnumber_t	ag;
>  	xfs_alloc_arg_t	args;
> +	xfs_fileoff_t	orig_offset;
> +	xfs_extlen_t	orig_length;
>  	xfs_extlen_t	blen;
>  	xfs_extlen_t	nextminlen = 0;
>  	int		nullfb;		/* true if ap->firstblock isn't set */
> @@ -3529,6 +3543,8 @@ xfs_bmap_btalloc(
>  	int		stripe_align;
>  
>  	ASSERT(ap->length);
> +	orig_offset = ap->offset;
> +	orig_length = ap->length;
>  
>  	mp = ap->ip->i_mount;
>  
> @@ -3745,7 +3761,8 @@ xfs_bmap_btalloc(
>  		ASSERT(nullfb || fb_agno <= args.agno);
>  		ap->length = args.len;
>  		if (ap->flags & XFS_BMAPI_COWFORK) {
> -			xfs_bmap_btalloc_cow(ap, &args);
> +			xfs_bmap_btalloc_cow(ap, &args, orig_offset,
> +					orig_length);
>  		} else {
>  			ap->ip->i_d.di_nblocks += args.len;
>  			xfs_trans_log_inode(ap->tp, ap->ip, XFS_ILOG_CORE);
> 

* Re: [PATCH 07/11] xfs: always zero di_flags2 when we free the inode
  2018-01-24  2:18 ` [PATCH 07/11] xfs: always zero di_flags2 when we free the inode Darrick J. Wong
@ 2018-01-25 17:31   ` Brian Foster
  2018-01-25 18:36     ` Darrick J. Wong
  2018-01-26  9:08   ` Christoph Hellwig
  1 sibling, 1 reply; 60+ messages in thread
From: Brian Foster @ 2018-01-25 17:31 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Jan 23, 2018 at 06:18:41PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Always zero the di_flags2 field when we free the inode so that we never
> write reflinked non-file inode records to disk.
> 

By "non-file", do you mean "invalid" or "unallocated"?

Otherwise looks fine:

Reviewed-by: Brian Foster <bfoster@redhat.com>

> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/xfs_inode.c |    1 +
>  1 file changed, 1 insertion(+)
> 
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index a208825..fc118dd 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2465,6 +2465,7 @@ xfs_ifree(
>  
>  	VFS_I(ip)->i_mode = 0;		/* mark incore inode as free */
>  	ip->i_d.di_flags = 0;
> +	ip->i_d.di_flags2 = 0;
>  	ip->i_d.di_dmevmask = 0;
>  	ip->i_d.di_forkoff = 0;		/* mark the attr fork not in use */
>  	ip->i_d.di_format = XFS_DINODE_FMT_EXTENTS;
> 

* Re: [PATCH 08/11] xfs: fix tracepoint %p formats
  2018-01-24  2:18 ` [PATCH 08/11] xfs: fix tracepoint %p formats Darrick J. Wong
@ 2018-01-25 17:31   ` Brian Foster
  2018-01-25 18:47     ` Darrick J. Wong
  0 siblings, 1 reply; 60+ messages in thread
From: Brian Foster @ 2018-01-25 17:31 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Jan 23, 2018 at 06:18:47PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Tracepoint printk doesn't have any of the %p suffixes, so use %p.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

I see different behavior with this. E.g.,

  umount-1130  [003] ...1  1995.947789: xfs_log_force: dev 253:3 lsn 0x0 caller xfs_log_quiesce+0x3c/0x4b0 [xfs]

vs.

  umount-1272  [002] ...1  2089.445135: xfs_log_force: dev 253:3 lsn 0x0 caller 00000000937cbc85

Expected?

Brian

>  fs/xfs/scrub/trace.h |   20 ++++++++++----------
>  fs/xfs/xfs_trace.h   |   24 ++++++++++++------------
>  2 files changed, 22 insertions(+), 22 deletions(-)
> 
> 
> diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
> index a0a6d3c..732775f 100644
> --- a/fs/xfs/scrub/trace.h
> +++ b/fs/xfs/scrub/trace.h
> @@ -90,7 +90,7 @@ TRACE_EVENT(xfs_scrub_op_error,
>  		__entry->error = error;
>  		__entry->ret_ip = ret_ip;
>  	),
> -	TP_printk("dev %d:%d type %u agno %u agbno %u error %d ret_ip %pS",
> +	TP_printk("dev %d:%d type %u agno %u agbno %u error %d ret_ip %p",
>  		  MAJOR(__entry->dev), MINOR(__entry->dev),
>  		  __entry->type,
>  		  __entry->agno,
> @@ -121,7 +121,7 @@ TRACE_EVENT(xfs_scrub_file_op_error,
>  		__entry->error = error;
>  		__entry->ret_ip = ret_ip;
>  	),
> -	TP_printk("dev %d:%d ino %llu fork %d type %u offset %llu error %d ret_ip %pS",
> +	TP_printk("dev %d:%d ino %llu fork %d type %u offset %llu error %d ret_ip %p",
>  		  MAJOR(__entry->dev), MINOR(__entry->dev),
>  		  __entry->ino,
>  		  __entry->whichfork,
> @@ -156,7 +156,7 @@ DECLARE_EVENT_CLASS(xfs_scrub_block_error_class,
>  		__entry->bno = bno;
>  		__entry->ret_ip = ret_ip;
>  	),
> -	TP_printk("dev %d:%d type %u agno %u agbno %u ret_ip %pS",
> +	TP_printk("dev %d:%d type %u agno %u agbno %u ret_ip %p",
>  		  MAJOR(__entry->dev), MINOR(__entry->dev),
>  		  __entry->type,
>  		  __entry->agno,
> @@ -207,7 +207,7 @@ DECLARE_EVENT_CLASS(xfs_scrub_ino_error_class,
>  		__entry->bno = bno;
>  		__entry->ret_ip = ret_ip;
>  	),
> -	TP_printk("dev %d:%d ino %llu type %u agno %u agbno %u ret_ip %pS",
> +	TP_printk("dev %d:%d ino %llu type %u agno %u agbno %u ret_ip %p",
>  		  MAJOR(__entry->dev), MINOR(__entry->dev),
>  		  __entry->ino,
>  		  __entry->type,
> @@ -246,7 +246,7 @@ DECLARE_EVENT_CLASS(xfs_scrub_fblock_error_class,
>  		__entry->offset = offset;
>  		__entry->ret_ip = ret_ip;
>  	),
> -	TP_printk("dev %d:%d ino %llu fork %d type %u offset %llu ret_ip %pS",
> +	TP_printk("dev %d:%d ino %llu fork %d type %u offset %llu ret_ip %p",
>  		  MAJOR(__entry->dev), MINOR(__entry->dev),
>  		  __entry->ino,
>  		  __entry->whichfork,
> @@ -277,7 +277,7 @@ TRACE_EVENT(xfs_scrub_incomplete,
>  		__entry->type = sc->sm->sm_type;
>  		__entry->ret_ip = ret_ip;
>  	),
> -	TP_printk("dev %d:%d type %u ret_ip %pS",
> +	TP_printk("dev %d:%d type %u ret_ip %p",
>  		  MAJOR(__entry->dev), MINOR(__entry->dev),
>  		  __entry->type,
>  		  __entry->ret_ip)
> @@ -311,7 +311,7 @@ TRACE_EVENT(xfs_scrub_btree_op_error,
>  		__entry->error = error;
>  		__entry->ret_ip = ret_ip;
>  	),
> -	TP_printk("dev %d:%d type %u btnum %d level %d ptr %d agno %u agbno %u error %d ret_ip %pS",
> +	TP_printk("dev %d:%d type %u btnum %d level %d ptr %d agno %u agbno %u error %d ret_ip %p",
>  		  MAJOR(__entry->dev), MINOR(__entry->dev),
>  		  __entry->type,
>  		  __entry->btnum,
> @@ -354,7 +354,7 @@ TRACE_EVENT(xfs_scrub_ifork_btree_op_error,
>  		__entry->error = error;
>  		__entry->ret_ip = ret_ip;
>  	),
> -	TP_printk("dev %d:%d ino %llu fork %d type %u btnum %d level %d ptr %d agno %u agbno %u error %d ret_ip %pS",
> +	TP_printk("dev %d:%d ino %llu fork %d type %u btnum %d level %d ptr %d agno %u agbno %u error %d ret_ip %p",
>  		  MAJOR(__entry->dev), MINOR(__entry->dev),
>  		  __entry->ino,
>  		  __entry->whichfork,
> @@ -393,7 +393,7 @@ TRACE_EVENT(xfs_scrub_btree_error,
>  		__entry->ptr = cur->bc_ptrs[level];
>  		__entry->ret_ip = ret_ip;
>  	),
> -	TP_printk("dev %d:%d type %u btnum %d level %d ptr %d agno %u agbno %u ret_ip %pS",
> +	TP_printk("dev %d:%d type %u btnum %d level %d ptr %d agno %u agbno %u ret_ip %p",
>  		  MAJOR(__entry->dev), MINOR(__entry->dev),
>  		  __entry->type,
>  		  __entry->btnum,
> @@ -433,7 +433,7 @@ TRACE_EVENT(xfs_scrub_ifork_btree_error,
>  		__entry->ptr = cur->bc_ptrs[level];
>  		__entry->ret_ip = ret_ip;
>  	),
> -	TP_printk("dev %d:%d ino %llu fork %d type %u btnum %d level %d ptr %d agno %u agbno %u ret_ip %pS",
> +	TP_printk("dev %d:%d ino %llu fork %d type %u btnum %d level %d ptr %d agno %u agbno %u ret_ip %p",
>  		  MAJOR(__entry->dev), MINOR(__entry->dev),
>  		  __entry->ino,
>  		  __entry->whichfork,
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index 945de08..893081e 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -119,7 +119,7 @@ DECLARE_EVENT_CLASS(xfs_perag_class,
>  		__entry->refcount = refcount;
>  		__entry->caller_ip = caller_ip;
>  	),
> -	TP_printk("dev %d:%d agno %u refcount %d caller %pS",
> +	TP_printk("dev %d:%d agno %u refcount %d caller %p",
>  		  MAJOR(__entry->dev), MINOR(__entry->dev),
>  		  __entry->agno,
>  		  __entry->refcount,
> @@ -252,7 +252,7 @@ DECLARE_EVENT_CLASS(xfs_bmap_class,
>  		__entry->caller_ip = caller_ip;
>  	),
>  	TP_printk("dev %d:%d ino 0x%llx state %s cur %p/%d "
> -		  "offset %lld block %lld count %lld flag %d caller %pS",
> +		  "offset %lld block %lld count %lld flag %d caller %p",
>  		  MAJOR(__entry->dev), MINOR(__entry->dev),
>  		  __entry->ino,
>  		  __print_flags(__entry->bmap_state, "|", XFS_BMAP_EXT_FLAGS),
> @@ -301,7 +301,7 @@ DECLARE_EVENT_CLASS(xfs_buf_class,
>  		__entry->caller_ip = caller_ip;
>  	),
>  	TP_printk("dev %d:%d bno 0x%llx nblks 0x%x hold %d pincount %d "
> -		  "lock %d flags %s caller %pS",
> +		  "lock %d flags %s caller %p",
>  		  MAJOR(__entry->dev), MINOR(__entry->dev),
>  		  (unsigned long long)__entry->bno,
>  		  __entry->nblks,
> @@ -370,7 +370,7 @@ DECLARE_EVENT_CLASS(xfs_buf_flags_class,
>  		__entry->caller_ip = caller_ip;
>  	),
>  	TP_printk("dev %d:%d bno 0x%llx len 0x%zx hold %d pincount %d "
> -		  "lock %d flags %s caller %pS",
> +		  "lock %d flags %s caller %p",
>  		  MAJOR(__entry->dev), MINOR(__entry->dev),
>  		  (unsigned long long)__entry->bno,
>  		  __entry->buffer_length,
> @@ -415,7 +415,7 @@ TRACE_EVENT(xfs_buf_ioerror,
>  		__entry->caller_ip = caller_ip;
>  	),
>  	TP_printk("dev %d:%d bno 0x%llx len 0x%zx hold %d pincount %d "
> -		  "lock %d error %d flags %s caller %pS",
> +		  "lock %d error %d flags %s caller %p",
>  		  MAJOR(__entry->dev), MINOR(__entry->dev),
>  		  (unsigned long long)__entry->bno,
>  		  __entry->buffer_length,
> @@ -579,7 +579,7 @@ DECLARE_EVENT_CLASS(xfs_lock_class,
>  		__entry->lock_flags = lock_flags;
>  		__entry->caller_ip = caller_ip;
>  	),
> -	TP_printk("dev %d:%d ino 0x%llx flags %s caller %pS",
> +	TP_printk("dev %d:%d ino 0x%llx flags %s caller %p",
>  		  MAJOR(__entry->dev), MINOR(__entry->dev),
>  		  __entry->ino,
>  		  __print_flags(__entry->lock_flags, "|", XFS_LOCK_FLAGS),
> @@ -697,7 +697,7 @@ DECLARE_EVENT_CLASS(xfs_iref_class,
>  		__entry->pincount = atomic_read(&ip->i_pincount);
>  		__entry->caller_ip = caller_ip;
>  	),
> -	TP_printk("dev %d:%d ino 0x%llx count %d pincount %d caller %pS",
> +	TP_printk("dev %d:%d ino 0x%llx count %d pincount %d caller %p",
>  		  MAJOR(__entry->dev), MINOR(__entry->dev),
>  		  __entry->ino,
>  		  __entry->count,
> @@ -1049,7 +1049,7 @@ TRACE_EVENT(xfs_log_force,
>  		__entry->lsn = lsn;
>  		__entry->caller_ip = caller_ip;
>  	),
> -	TP_printk("dev %d:%d lsn 0x%llx caller %pS",
> +	TP_printk("dev %d:%d lsn 0x%llx caller %p",
>  		  MAJOR(__entry->dev), MINOR(__entry->dev),
>  		  __entry->lsn, (void *)__entry->caller_ip)
>  )
> @@ -1403,7 +1403,7 @@ TRACE_EVENT(xfs_bunmap,
>  		__entry->flags = flags;
>  	),
>  	TP_printk("dev %d:%d ino 0x%llx size 0x%llx bno 0x%llx len 0x%llx"
> -		  "flags %s caller %pS",
> +		  "flags %s caller %p",
>  		  MAJOR(__entry->dev), MINOR(__entry->dev),
>  		  __entry->ino,
>  		  __entry->size,
> @@ -1517,7 +1517,7 @@ TRACE_EVENT(xfs_agf,
>  	),
>  	TP_printk("dev %d:%d agno %u flags %s length %u roots b %u c %u "
>  		  "levels b %u c %u flfirst %u fllast %u flcount %u "
> -		  "freeblks %u longest %u caller %pS",
> +		  "freeblks %u longest %u caller %p",
>  		  MAJOR(__entry->dev), MINOR(__entry->dev),
>  		  __entry->agno,
>  		  __print_flags(__entry->flags, "|", XFS_AGF_FLAGS),
> @@ -2486,7 +2486,7 @@ DECLARE_EVENT_CLASS(xfs_ag_error_class,
>  		__entry->error = error;
>  		__entry->caller_ip = caller_ip;
>  	),
> -	TP_printk("dev %d:%d agno %u error %d caller %pS",
> +	TP_printk("dev %d:%d agno %u error %d caller %p",
>  		  MAJOR(__entry->dev), MINOR(__entry->dev),
>  		  __entry->agno,
>  		  __entry->error,
> @@ -2977,7 +2977,7 @@ DECLARE_EVENT_CLASS(xfs_inode_error_class,
>  		__entry->error = error;
>  		__entry->caller_ip = caller_ip;
>  	),
> -	TP_printk("dev %d:%d ino %llx error %d caller %pS",
> +	TP_printk("dev %d:%d ino %llx error %d caller %p",
>  		  MAJOR(__entry->dev), MINOR(__entry->dev),
>  		  __entry->ino,
>  		  __entry->error,
> 

* Re: [PATCH 09/11] xfs: make tracepoint inode number format consistent
  2018-01-24  2:18 ` [PATCH 09/11] xfs: make tracepoint inode number format consistent Darrick J. Wong
@ 2018-01-25 17:31   ` Brian Foster
  2018-01-26  9:09   ` Christoph Hellwig
  1 sibling, 0 replies; 60+ messages in thread
From: Brian Foster @ 2018-01-25 17:31 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Jan 23, 2018 at 06:18:59PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Fix all the inode number formats to be consistently (0x%llx) in all
> trace point definitions.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/scrub/trace.h |   12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> 
> diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
> index 732775f..eb420a41 100644
> --- a/fs/xfs/scrub/trace.h
> +++ b/fs/xfs/scrub/trace.h
> @@ -50,7 +50,7 @@ DECLARE_EVENT_CLASS(xfs_scrub_class,
>  		__entry->flags = sm->sm_flags;
>  		__entry->error = error;
>  	),
> -	TP_printk("dev %d:%d ino %llu type %u agno %u inum %llu gen %u flags 0x%x error %d",
> +	TP_printk("dev %d:%d ino 0x%llx type %u agno %u inum %llu gen %u flags 0x%x error %d",
>  		  MAJOR(__entry->dev), MINOR(__entry->dev),
>  		  __entry->ino,
>  		  __entry->type,
> @@ -121,7 +121,7 @@ TRACE_EVENT(xfs_scrub_file_op_error,
>  		__entry->error = error;
>  		__entry->ret_ip = ret_ip;
>  	),
> -	TP_printk("dev %d:%d ino %llu fork %d type %u offset %llu error %d ret_ip %p",
> +	TP_printk("dev %d:%d ino 0x%llx fork %d type %u offset %llu error %d ret_ip %p",
>  		  MAJOR(__entry->dev), MINOR(__entry->dev),
>  		  __entry->ino,
>  		  __entry->whichfork,
> @@ -207,7 +207,7 @@ DECLARE_EVENT_CLASS(xfs_scrub_ino_error_class,
>  		__entry->bno = bno;
>  		__entry->ret_ip = ret_ip;
>  	),
> -	TP_printk("dev %d:%d ino %llu type %u agno %u agbno %u ret_ip %p",
> +	TP_printk("dev %d:%d ino 0x%llx type %u agno %u agbno %u ret_ip %p",
>  		  MAJOR(__entry->dev), MINOR(__entry->dev),
>  		  __entry->ino,
>  		  __entry->type,
> @@ -246,7 +246,7 @@ DECLARE_EVENT_CLASS(xfs_scrub_fblock_error_class,
>  		__entry->offset = offset;
>  		__entry->ret_ip = ret_ip;
>  	),
> -	TP_printk("dev %d:%d ino %llu fork %d type %u offset %llu ret_ip %p",
> +	TP_printk("dev %d:%d ino 0x%llx fork %d type %u offset %llu ret_ip %p",
>  		  MAJOR(__entry->dev), MINOR(__entry->dev),
>  		  __entry->ino,
>  		  __entry->whichfork,
> @@ -354,7 +354,7 @@ TRACE_EVENT(xfs_scrub_ifork_btree_op_error,
>  		__entry->error = error;
>  		__entry->ret_ip = ret_ip;
>  	),
> -	TP_printk("dev %d:%d ino %llu fork %d type %u btnum %d level %d ptr %d agno %u agbno %u error %d ret_ip %p",
> +	TP_printk("dev %d:%d ino 0x%llx fork %d type %u btnum %d level %d ptr %d agno %u agbno %u error %d ret_ip %p",
>  		  MAJOR(__entry->dev), MINOR(__entry->dev),
>  		  __entry->ino,
>  		  __entry->whichfork,
> @@ -433,7 +433,7 @@ TRACE_EVENT(xfs_scrub_ifork_btree_error,
>  		__entry->ptr = cur->bc_ptrs[level];
>  		__entry->ret_ip = ret_ip;
>  	),
> -	TP_printk("dev %d:%d ino %llu fork %d type %u btnum %d level %d ptr %d agno %u agbno %u ret_ip %p",
> +	TP_printk("dev %d:%d ino 0x%llx fork %d type %u btnum %d level %d ptr %d agno %u agbno %u ret_ip %p",
>  		  MAJOR(__entry->dev), MINOR(__entry->dev),
>  		  __entry->ino,
>  		  __entry->whichfork,
> 

* Re: [PATCH 10/11] xfs: refactor inode verifier corruption error printing
  2018-01-24  2:19 ` [PATCH 10/11] xfs: refactor inode verifier corruption error printing Darrick J. Wong
@ 2018-01-25 17:31   ` Brian Foster
  2018-01-25 18:23     ` Darrick J. Wong
  2018-01-26  9:10   ` Christoph Hellwig
  1 sibling, 1 reply; 60+ messages in thread
From: Brian Foster @ 2018-01-25 17:31 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Jan 23, 2018 at 06:19:06PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Refactor inode verifier error reporting into a non-libxfs function so
> that we aren't encoding the message format in libxfs.  This also
> changes the kernel dmesg output to resemble buffer verifier errors
> more closely.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_inode_buf.c |    6 ++----
>  fs/xfs/xfs_error.c            |   37 +++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_error.h            |    3 +++
>  fs/xfs/xfs_inode.c            |   14 ++++++++------
>  4 files changed, 50 insertions(+), 10 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> index 6e9dcdb..6d05ba6 100644
> --- a/fs/xfs/libxfs/xfs_inode_buf.c
> +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> @@ -578,10 +578,8 @@ xfs_iread(
>  	/* even unallocated inodes are verified */
>  	fa = xfs_dinode_verify(mp, ip->i_ino, dip);
>  	if (fa) {
> -		xfs_alert(mp, "%s: validation failed for inode %lld at %pS",
> -				__func__, ip->i_ino, fa);
> -
> -		XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, dip);
> +		xfs_inode_verifier_error(ip, -EFSCORRUPTED, "record", dip,
> +				sizeof(*dip), fa);

What does "record" mean? "dinode" perhaps? Otherwise looks fine:

Reviewed-by: Brian Foster <bfoster@redhat.com>
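
For reference, the alert built by the new helper quoted further down
uses the format "Metadata %s detected at %pS, inode 0x%llx %s", with the
name argument landing at the very end. A minimal userspace sketch of
what that produces for "record" versus "dinode" follows. The function
name toy_inode_verifier_msg() is hypothetical, and because %pS is a
kernel-only format the failure address is passed in as a plain string
here.

```c
#include <stdio.h>

/*
 * Build the first alert line in the same shape as the quoted
 * xfs_inode_verifier_error() format string.  "where" stands in for the
 * symbolized failure address that %pS would print in the kernel.
 */
static int
toy_inode_verifier_msg(char *out, size_t outsz, int is_crc_error,
		       const char *where, unsigned long long ino,
		       const char *name)
{
	return snprintf(out, outsz,
			"Metadata %s detected at %s, inode 0x%llx %s",
			is_crc_error ? "CRC error" : "corruption",
			where, ino, name);
}
```

Calling it with ino 133 and name "record" yields "Metadata corruption
detected at <failaddr>, inode 0x85 record" (given "<failaddr>" as the
placeholder address); passing "dinode" changes only the trailing word.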

>  		error = -EFSCORRUPTED;
>  		goto out_brelse;
>  	}
> diff --git a/fs/xfs/xfs_error.c b/fs/xfs/xfs_error.c
> index 980d5f0..ccf520f 100644
> --- a/fs/xfs/xfs_error.c
> +++ b/fs/xfs/xfs_error.c
> @@ -24,6 +24,7 @@
>  #include "xfs_errortag.h"
>  #include "xfs_error.h"
>  #include "xfs_sysfs.h"
> +#include "xfs_inode.h"
>  
>  #ifdef DEBUG
>  
> @@ -372,3 +373,39 @@ xfs_verifier_error(
>  	if (xfs_error_level >= XFS_ERRLEVEL_HIGH)
>  		xfs_stack_trace();
>  }
> +
> +/*
> + * Warnings for inode corruption problems.  Don't bother with the stack
> + * trace unless the error level is turned up high.
> + */
> +void
> +xfs_inode_verifier_error(
> +	struct xfs_inode	*ip,
> +	int			error,
> +	const char		*name,
> +	void			*buf,
> +	size_t			bufsz,
> +	xfs_failaddr_t		failaddr)
> +{
> +	struct xfs_mount	*mp = ip->i_mount;
> +	xfs_failaddr_t		fa;
> +	int			sz;
> +
> +	fa = failaddr ? failaddr : __return_address;
> +
> +	xfs_alert(mp, "Metadata %s detected at %pS, inode 0x%llx %s",
> +		  error == -EFSBADCRC ? "CRC error" : "corruption",
> +		  fa, ip->i_ino, name);
> +
> +	xfs_alert(mp, "Unmount and run xfs_repair");
> +
> +	if (buf && xfs_error_level >= XFS_ERRLEVEL_LOW) {
> +		sz = min_t(size_t, XFS_CORRUPTION_DUMP_LEN, bufsz);
> +		xfs_alert(mp, "First %d bytes of corrupted metadata buffer:",
> +				sz);
> +		xfs_hex_dump(buf, sz);
> +	}
> +
> +	if (xfs_error_level >= XFS_ERRLEVEL_HIGH)
> +		xfs_stack_trace();
> +}
> diff --git a/fs/xfs/xfs_error.h b/fs/xfs/xfs_error.h
> index a3ba05b..7e728c5 100644
> --- a/fs/xfs/xfs_error.h
> +++ b/fs/xfs/xfs_error.h
> @@ -28,6 +28,9 @@ extern void xfs_corruption_error(const char *tag, int level,
>  			int linenum, xfs_failaddr_t failaddr);
>  extern void xfs_verifier_error(struct xfs_buf *bp, int error,
>  			xfs_failaddr_t failaddr);
> +extern void xfs_inode_verifier_error(struct xfs_inode *ip, int error,
> +			const char *name, void *buf, size_t bufsz,
> +			xfs_failaddr_t failaddr);
>  
>  #define	XFS_ERROR_REPORT(e, lvl, mp)	\
>  	xfs_error_report(e, lvl, mp, __FILE__, __LINE__, __return_address)
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index fc118dd..c60efec 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -3502,21 +3502,23 @@ bool
>  xfs_inode_verify_forks(
>  	struct xfs_inode	*ip)
>  {
> +	struct xfs_ifork	*ifp;
>  	xfs_failaddr_t		fa;
>  
>  	fa = xfs_ifork_verify_data(ip, &xfs_default_ifork_ops);
>  	if (fa) {
> -		xfs_alert(ip->i_mount,
> -				"%s: bad inode %llu inline data fork at %pS",
> -				__func__, ip->i_ino, fa);
> +		ifp = XFS_IFORK_PTR(ip, XFS_DATA_FORK);
> +		xfs_inode_verifier_error(ip, -EFSCORRUPTED, "data fork",
> +				ifp->if_u1.if_data, ifp->if_bytes, fa);
>  		return false;
>  	}
>  
>  	fa = xfs_ifork_verify_attr(ip, &xfs_default_ifork_ops);
>  	if (fa) {
> -		xfs_alert(ip->i_mount,
> -				"%s: bad inode %llu inline attr fork at %pS",
> -				__func__, ip->i_ino, fa);
> +		ifp = XFS_IFORK_PTR(ip, XFS_ATTR_FORK);
> +		xfs_inode_verifier_error(ip, -EFSCORRUPTED, "attr fork",
> +				ifp ? ifp->if_u1.if_data : NULL,
> +				ifp ? ifp->if_bytes : 0, fa);
>  		return false;
>  	}
>  	return true;
> 

* Re: [PATCH 04/11] xfs: CoW fork operations should only update quota reservations
  2018-01-25 13:01       ` Brian Foster
@ 2018-01-25 17:52         ` Darrick J. Wong
  0 siblings, 0 replies; 60+ messages in thread
From: Darrick J. Wong @ 2018-01-25 17:52 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Jan 25, 2018 at 08:01:24AM -0500, Brian Foster wrote:
> On Wed, Jan 24, 2018 at 11:14:25AM -0800, Darrick J. Wong wrote:
> > On Wed, Jan 24, 2018 at 09:22:16AM -0500, Brian Foster wrote:
> > > On Tue, Jan 23, 2018 at 06:18:23PM -0800, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > 
> > > > Since the CoW fork only exists in memory, it is incorrect to update the
> > > > on-disk quota block counts when we modify the CoW fork.  Unlike the data
> > > > fork, even real extents in the CoW fork are only reservations (on-disk
> > > > they're owned by the refcountbt) so they must not be tracked in the on
> > > > disk quota info.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > ---
> > > >  fs/xfs/libxfs/xfs_bmap.c |  203 ++++++++++++++++++++++++++++++++++++++++++++--
> > > >  fs/xfs/xfs_reflink.c     |    8 +-
> > > >  2 files changed, 196 insertions(+), 15 deletions(-)
> > > > 
> > > > 
> ...
> > > > + * - sb_fdblocks: Number of free blocks recorded in the superblock on disk.
> > > > + * - fdblocks: Number of free blocks recorded in the superblock minus any
> > > > + *           in-core reservations made in anticipation of future writes.
> > > > + *
> > > > + * - t_blk_res: Number of blocks reserved out of fdblocks for a transaction.
> > > > + *           When the transaction commits, t_blk_res - t_blk_res_used is given
> > > > + *           back to fdblocks.
> > > > + * - t_blk_res_used: Number of blocks used by this transaction that were
> > > > + *           reserved for this transaction.
> > > > + * - t_fdblocks_del: Number of blocks by which fdblocks and sb_fdblocks will
> > > > + *           have to decrease at commit.
> > > > + * - t_res_fdblocks_delta: Number of blocks by which sb_fdblocks will have to
> > > > + *           decrease at commit.  We assume that fdblocks was decreased
> > > > + *           prior to the transaction.
> > > > + *
> > > > + * Data fork block mappings have four logical states:
> > > > + *
> > > > + *    +--------> UNWRITTEN <------+
> > > > + *    |              ^            |
> > > > + *    |              v            v
> > > > + * DELALLOC <----> HOLE <------> REAL
> > > > + *    |                           ^
> > > > + *    |                           |
> > > > + *    +---------------------------+
> > > > + *
> > > 
> > > I'm not sure we need a graphic for the extent states. Non-hole
> > > conversions to delalloc is the only transition that doesn't make any
> > > sense.
> > 
> > First of all, Dave keeps asking for ASCII art 'when appropriate'. :)
> > 
> 
> I'm not against ASCII art in general...
> 
> > But on a more serious note, I thought the state diagram would be useful
> > for anyone who isn't so familiar with how blocks get mapped into files
> > in xfs, particularly to compare the data fork diagram against the
> > corresponding picture for the cow fork.
> > 
> 
> ... but TBH I didn't really notice they were different until you pointed
> it out. :/ It still seems like a rather verbose means to point out that
> COW fork apparently doesn't have the DELALLOC -> REAL transition.

It doesn't have REAL -> UNWRITTEN transitions either.  Clearly the
diagrams aren't helping, point made. ;)

Heh, it also has HOLE -> REAL transitions even though the diagram does
not so indicate, so ... yeah.
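
For comparison, the two lifecycles can also be written as transition
tables.  This is only a minimal userspace sketch, not kernel code: the
enum, the ST_BIT macro, and both tables are hypothetical names made up
for illustration.  The edges are transcribed from the two diagrams in
the patch, with the HOLE -> REAL edge mentioned above added to the CoW
table even though the diagram leaves it out.

```c
#include <stdbool.h>

/* Hypothetical names; none of these exist in the XFS source. */
enum ext_state { ST_HOLE, ST_DELALLOC, ST_UNWRITTEN, ST_REAL, ST_NR };

#define ST_BIT(s)	(1u << (s))

/* table[from] is a bitmask of the states reachable from 'from'. */
static const unsigned int data_fork[ST_NR] = {
	[ST_HOLE]	= ST_BIT(ST_DELALLOC) | ST_BIT(ST_UNWRITTEN) |
			  ST_BIT(ST_REAL),
	[ST_DELALLOC]	= ST_BIT(ST_HOLE) | ST_BIT(ST_UNWRITTEN) |
			  ST_BIT(ST_REAL),
	[ST_UNWRITTEN]	= ST_BIT(ST_HOLE) | ST_BIT(ST_REAL),
	[ST_REAL]	= ST_BIT(ST_HOLE) | ST_BIT(ST_UNWRITTEN),
};

/* CoW fork: no DELALLOC -> REAL and no REAL -> UNWRITTEN. */
static const unsigned int cow_fork[ST_NR] = {
	[ST_HOLE]	= ST_BIT(ST_DELALLOC) | ST_BIT(ST_UNWRITTEN) |
			  ST_BIT(ST_REAL),	/* HOLE -> REAL per the note above */
	[ST_DELALLOC]	= ST_BIT(ST_HOLE) | ST_BIT(ST_UNWRITTEN),
	[ST_UNWRITTEN]	= ST_BIT(ST_HOLE) | ST_BIT(ST_REAL),
	[ST_REAL]	= ST_BIT(ST_HOLE),
};

/* Return true if 'fork' permits moving an extent from 'from' to 'to'. */
static bool
can_transition(const unsigned int *fork, enum ext_state from,
	       enum ext_state to)
{
	return (fork[from] & ST_BIT(to)) != 0;
}
```

With these tables, can_transition(data_fork, ST_DELALLOC, ST_REAL) is
true while the same query against cow_fork is false, which is the
difference the diagrams were trying to convey.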

> ...
> > 
> > > I'm also wondering how much of the whole picture really needs to be
> > > described here to cover quota accounting with respect to block mapping
> > > (sufficiently to differentiate data fork from cow fork, which I take is
> > > the purpose of this section). For example, do we really need the
> > > internal transaction details for how quota deltas are carried?
> > > 
> > > Instead, I think it might be sufficient to explain that the quota system
> > > works in two "levels" (for lack of a better term :/), one for
> > > reservation and another for real block usage. The reservation is
> > > associated with transaction reservation and/or delayed block allocation
> > > (no tx). In either case, quota reservation is converted to real quota
> > > block usage when a transaction commits that maps real/physical blocks.
> > > If the transaction held extra reservation that went unused, that quota
> > > reservation is released. The primary difference is that transactions to
> > > convert delalloc -> real do not reserve quota blocks in the first place,
> > > since that has already occurred, so they just need to make sure to
> > > convert/persist the quota res for the blocks that were converted to
> > > real.
> > 
> > I thought about cutting this whole comment down to a simple sentence
> > about how quota accounting is different between the data & cow forks (as
> > you figured out, we use delalloc quota reservations for everything in
> > the cow fork and only turn them into real ones when we go to remap) but
> > then worried that doing so would presuppose the reader knew anything
> > about how the extent lifecycles work... and that's how I end up with a
> > gigantic manual.
> > 
> 
> Which is probably fine for xfs-docs or something..
> 
> The more I think about it, the more I think this whole thing is better
> off focused on explaining the unique quota management of COW fork
> blocks. We can add additional comments about extent lifecycles, how
> quota is tracked in general, etc., but perhaps it's best to make those a
> separate patch rather than attempt to document a ground-up "COW fork
> quotas for dummies" in a single comment. :P Anyways, I'll reserve
> further judgement for the new patch..

Yeah.  I think I'll go update ch3 ("Sharing Data Blocks") in the design
document instead of dumping everything into a code comment that seems
somewhat misplaced.

Onto the next reply! :)

--D

> Brian
> 
> > > > + * Copy on Write Fork Mapping Lifecycle
> > > > + *
> > > > + * The CoW fork handles things differently from the data fork because its
> > > > + * mappings only exist in memory-- the refcount btree is the on-disk owner of
> > > > + * the extents until they're remapped into the data fork.  Therefore,
> > > > + * unwritten and real extents in the CoW fork are treated the same way as
> > > > + * delayed allocation extents.  Quota and fdblock changes only exist in
> > > > + * memory, which requires some twists in the bmap functions.
> > > > + *
> > > 
> > > Ok, but perhaps this should point out what happens when cow blocks are
> > > reserved with respect to quotas..? IIUC, a delalloc quota reservation
> > > occurs just the same as above, the blocks simply reside in another fork.
> > 
> > Yes.
> > 
> > > > + * The CoW fork extent state diagram looks like this:
> > > > + *
> > > > + *    +--------> UNWRITTEN -------+
> > > > + *    |              ^            |
> > > > + *    |              v            v
> > > > + * DELALLOC <----> HOLE <------- REAL
> > > > + *
> > > > + * Holes are still holes.  Delayed allocation extents reserve blocks for
> > > > + * landing future writes, just like they do in the data fork.  However, unlike
> > > > + * the data fork, unwritten extents signal an extent that has been allocated
> > > > + * but is not currently undergoing writeback.  Real extents are undergoing
> > > > + * writeback, and when that writeback finishes the corresponding data fork
> > > > + * extent will be punched out and the CoW fork counterpart moved to the new
> > > > + * hole in the data fork.
> > > > + *
> > > 
> > > Ok, so the difference is that for the COW fork, the _extent state_
> > > conversion is not the appropriate trigger event to convert quota
> > > reservation to real quota usage. Instead, the blocks being remapped from
> > > the COW fork to the data fork is when that should occur.
> > 
> > Yes.
> > 
> > > > + * The state transitions and required metadata updates are as follows:
> > > > + *
> > > > + * - HOLE to DELALLOC: Increase i_cow_blocks and q_res_bcount, and decrease
> > > > + *           fdblocks.
> > > > + * - HOLE to UNWRITTEN: Same as above, but since we reserved quota via
> > > > + *           qt_blk_res (which increased q_res_bcount) when we allocate the
> > > > + *           extent we have to decrease qt_blk_res so that the commit doesn't
> > > > + *           give the allocated CoW blocks back.
> > > > + *
> > > 
> > > Hmm, this is a little confusing. Looking at the code change and comment
> > > below, I think I get what this is trying to do, which is essentially
> > > make a real block cow fork alloc behave like a delalloc reservation
> > > (with respect to quota). FWIW, I think what confuses me is the assertion
> > > that the blocks would be "given back" otherwise. The only reference I
> > > have to compare is data fork alloc behavior, which implies that used
> > > reservation would not be given back, but rather converted to real quota
> > > usage on block allocation (and excess reservation would still be given
> > > back, which afaict we still want to happen). So the trickery is required
> > > to prevent conversion of quota reservation for the allocated cow blocks,
> > > let that res sit around until the cow blocks are remapped, and release
> > > unused reservation from the tx as normal. Am I following that correctly?
> > 
> > Yes.
> > 
> > > > + * - DELALLOC to UNWRITTEN: No change.
> > > > + * - DELALLOC to HOLE: Decrease i_cow_blocks and q_res_bcount, and increase
> > > > + *           fdblocks.
> > > > + *
> > > > + * - UNWRITTEN to HOLE: Same as DELALLOC to HOLE.
> > > > + * - UNWRITTEN to REAL: No change.
> > > > + *
> > > > + * - REAL to HOLE: This transition happens when we've finished a write
> > > > + *           operation and need to move the mapping to the data fork.  We
> > > > > + *           punch the corresponding data fork mappings, which decreases
> > > > + *           qt_bcount.  Then we map the CoW fork mapping into the hole we
> > > > + *           just cleared out of the data fork, which increases qt_bcount.
> > > > + *           There's a subtlety here -- if we promoted a write over a hole to
> > > > + *           CoW, there will be a net increase in qt_bcount, which is fine
> > > > + *           because we already reserved the quota when we filled the CoW
> > > > + *           fork.  Finally, we punch the CoW fork mapping, which decreases
> > > > + *           q_res_bcount.
> > > > + *
> > > > + * Notice how all CoW fork extents use transactionless quota reservations and
> > > > + * the in-core fdblocks to maintain state, and we avoid updating any on-disk
> > > > + * metadata.  This is essential to maintain metadata correctness if the system
> > > > + * goes down.
> > > > + */
> > > >  
> > > >  kmem_zone_t		*xfs_bmap_free_item_zone;
> > > >  
> > > > @@ -3337,6 +3476,39 @@ xfs_bmap_btalloc_filestreams(
> > > >  	return 0;
> > > >  }
> > > >  
> > > > +/* Deal with CoW fork accounting when we allocate a block. */
> > > > +static void
> > > > +xfs_bmap_btalloc_cow(
> > > > +	struct xfs_bmalloca	*ap,
> > > > +	struct xfs_alloc_arg	*args)
> > > > +{
> > > > +	/* Filling a previously reserved extent; nothing to do here. */
> > > > +	if (ap->wasdel)
> > > > +		return;
> > > > +
> > > > +	/*
> > > > +	 * The CoW fork only exists in memory, so the on-disk quota accounting
> > > > +	 * must not incude any CoW fork extents.  Therefore, CoW blocks are
> > > > +	 * only tracked in the in-core dquot block count (q_res_bcount).
> > > > +	 *
> > > > +	 * If we get here, we're filling a CoW hole with a real (non-delalloc)
> > > > +	 * CoW extent having reserved enough blocks from both q_res_bcount and
> > > > +	 * qt_blk_res to guarantee that we won't run out of space.  The unused
> > > > +	 * qt_blk_res is given back to q_res_bcount when the transaction
> > > > +	 * commits.
> > > > +	 *
> > > > +	 * We don't want the quota accounting for our newly allocated blocks
> > > > +	 * to be given back, so we must decrease qt_blk_res without decreasing
> > > > +	 * q_res_bcount.
> > > > +	 *
> > > > +	 * Note: If we're allocating a delalloc extent, we already reserved
> > > > +	 * the q_res_bcount blocks, so no quota accounting update is needed
> > > > +	 * here.
> > > > +	 */
> > > > +	xfs_trans_mod_dquot_byino(ap->tp, ap->ip, XFS_TRANS_DQ_RES_BLKS,
> > > > +			-(long)args->len);
> > > > +}
> > > 
> > > Factoring nit.. if we're going to refactor bits of xfs_bmap_btalloc()
> > > out, it might be cleaner to factor out all of the quota logic rather
> > > than just the cow bits (which is basically just a simple check and
> > > function call). E.g., refactor into an xfs_bmap_btalloc_quota() helper
> > > that does the right thing based on the fork, with comments as to why,
> > > etc. (and perhaps just leave the unrelated di_nblocks change behind).
> > 
> > I thought about factoring the data/attr fork stuff into its own
> > xfs_bmap_btalloc_quota() function too, since this function is already
> > eyewateringly long.  I think I'll do that as a separate refactor at the
> > end of the series, though...
> > 
> > --D
> > 
> > > Brian
> > > 
> > > > +
> > > >  STATIC int
> > > >  xfs_bmap_btalloc(
> > > >  	struct xfs_bmalloca	*ap)	/* bmap alloc argument struct */
> > > > @@ -3571,19 +3743,22 @@ xfs_bmap_btalloc(
> > > >  			*ap->firstblock = args.fsbno;
> > > >  		ASSERT(nullfb || fb_agno <= args.agno);
> > > >  		ap->length = args.len;
> > > > -		if (!(ap->flags & XFS_BMAPI_COWFORK))
> > > > -			ap->ip->i_d.di_nblocks += args.len;
> > > > -		xfs_trans_log_inode(ap->tp, ap->ip, XFS_ILOG_CORE);
> > > >  		if (ap->wasdel)
> > > >  			ap->ip->i_delayed_blks -= args.len;
> > > > -		/*
> > > > -		 * Adjust the disk quota also. This was reserved
> > > > -		 * earlier.
> > > > -		 */
> > > > -		xfs_trans_mod_dquot_byino(ap->tp, ap->ip,
> > > > -			ap->wasdel ? XFS_TRANS_DQ_DELBCOUNT :
> > > > -					XFS_TRANS_DQ_BCOUNT,
> > > > -			(long) args.len);
> > > > +		if (ap->flags & XFS_BMAPI_COWFORK) {
> > > > +			xfs_bmap_btalloc_cow(ap, &args);
> > > > +		} else {
> > > > +			ap->ip->i_d.di_nblocks += args.len;
> > > > +			xfs_trans_log_inode(ap->tp, ap->ip, XFS_ILOG_CORE);
> > > > +			/*
> > > > +			 * Adjust the disk quota also. This was reserved
> > > > +			 * earlier.
> > > > +			 */
> > > > +			xfs_trans_mod_dquot_byino(ap->tp, ap->ip,
> > > > +				ap->wasdel ? XFS_TRANS_DQ_DELBCOUNT :
> > > > +						XFS_TRANS_DQ_BCOUNT,
> > > > +				(long) args.len);
> > > > +		}
> > > >  	} else {
> > > >  		ap->blkno = NULLFSBLOCK;
> > > >  		ap->length = 0;
> > > > @@ -4776,6 +4951,7 @@ xfs_bmap_del_extent_cow(
> > > >  	struct xfs_bmbt_irec	new;
> > > >  	xfs_fileoff_t		del_endoff, got_endoff;
> > > >  	int			state = BMAP_COWFORK;
> > > > +	int			error;
> > > >  
> > > >  	XFS_STATS_INC(mp, xs_del_exlist);
> > > >  
> > > > @@ -4832,6 +5008,11 @@ xfs_bmap_del_extent_cow(
> > > >  		xfs_iext_insert(ip, icur, &new, state);
> > > >  		break;
> > > >  	}
> > > > +
> > > > +	/* Remove the quota reservation */
> > > > +	error = xfs_trans_reserve_quota_nblks(NULL, ip,
> > > > +			-(long)del->br_blockcount, 0, XFS_QMOPT_RES_REGBLKS);
> > > > +	ASSERT(error == 0);
> > > >  }
> > > >  
> > > >  /*
> > > > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > > > index 82abff6..e367351 100644
> > > > --- a/fs/xfs/xfs_reflink.c
> > > > +++ b/fs/xfs/xfs_reflink.c
> > > > @@ -599,10 +599,6 @@ xfs_reflink_cancel_cow_blocks(
> > > >  					del.br_startblock, del.br_blockcount,
> > > >  					NULL);
> > > >  
> > > > -			/* Update quota accounting */
> > > > -			xfs_trans_mod_dquot_byino(*tpp, ip, XFS_TRANS_DQ_BCOUNT,
> > > > -					-(long)del.br_blockcount);
> > > > -
> > > >  			/* Roll the transaction */
> > > >  			xfs_defer_ijoin(&dfops, ip);
> > > >  			error = xfs_defer_finish(tpp, &dfops);
> > > > @@ -795,6 +791,10 @@ xfs_reflink_end_cow(
> > > >  		if (error)
> > > >  			goto out_defer;
> > > >  
> > > > +		/* Charge this new data fork mapping to the on-disk quota. */
> > > > +		xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT,
> > > > +				(long)del.br_blockcount);
> > > > +
> > > >  		/* Remove the mapping from the CoW fork. */
> > > >  		xfs_bmap_del_extent_cow(ip, &icur, &got, &del);
> > > >  
> > > > 
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > the body of a message to majordomo@vger.kernel.org
> > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 04/11] xfs: CoW fork operations should only update quota reservations
  2018-01-25 13:03     ` Brian Foster
@ 2018-01-25 18:20       ` Darrick J. Wong
  2018-01-26 13:02         ` Brian Foster
  0 siblings, 1 reply; 60+ messages in thread
From: Darrick J. Wong @ 2018-01-25 18:20 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Jan 25, 2018 at 08:03:53AM -0500, Brian Foster wrote:
> On Wed, Jan 24, 2018 at 05:20:35PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Since the CoW fork only exists in memory, it is incorrect to update the
> > on-disk quota block counts when we modify the CoW fork.  Unlike the data
> > fork, even real extents in the CoW fork are only reservations (on-disk
> > they're owned by the refcountbt) so they must not be tracked in the on
> > disk quota info.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> > v2: make documentation more crisp and to the point
> > ---
> >  fs/xfs/libxfs/xfs_bmap.c |  118 ++++++++++++++++++++++++++++++++++++++++++----
> >  fs/xfs/xfs_quota.h       |   14 ++++-
> >  fs/xfs/xfs_reflink.c     |    8 ++-
> >  3 files changed, 122 insertions(+), 18 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > index 0c9c9cd..7f0ac40 100644
> > --- a/fs/xfs/libxfs/xfs_bmap.c
> > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > @@ -52,6 +52,71 @@
> >  #include "xfs_refcount.h"
> >  #include "xfs_icache.h"
> >  
> > +/*
> > + * Data/Attribute Fork Mapping Lifecycle
> > + *
> > + * The data fork contains the block mappings between logical blocks in a file
> > + * and physical blocks on the disk.  The XFS notions of delayed allocation
> > + * reservations, unwritten extents, and real extents follow well known
> > + * conventions in the filesystem world.
> > + *
> > + * Data fork extent states follow these transitions:
> > + *
> > + *    +--------> UNWRITTEN <------+
> > + *    |              ^            |
> > + *    |              v            v
> > + * DELALLOC <----> HOLE <------> REAL
> > + *    |                           ^
> > + *    |                           |
> > + *    +---------------------------+
> > + *
> > + * Every delayed allocation reserves in-memory quota blocks (q_res_bcount) and
> > + * in-memory fs free blocks (fdblocks), and increases the in-memory per-inode
> > + * i_delayed_blks.  The reservation includes potentially required bmbt blocks.
> > + *
> 
> So we have some bits about extent states..
> 
> > + * Every transaction reserves quota blocks (qt_blk_res) from the in-memory
> > + * quota blocks and free blocks (t_blk_res) from the in-memory fs free blocks.
> > + * The transaction tracks both the number of blocks used from its own
> > + * reservation as well as the number of blocks used that came from a delayed
> > + * allocation.  When the transaction commits, it gives back the unused parts
> > + * of its own block reservations.  Next, it adds any block usage that came
> > + * from a delayed allocation to the on-disk counters without changing the
> > + * in-memory reservations (q_res_bcount and fdblocks).
> > + *
> 
> Some bits about quota, transactions..
> 
> > + * To convert a delayed allocation to a real or unwritten extent, we use a
> > + * transaction to allocate the blocks.  At commit time, the block reservations
> > + * are given back or added to the on-disk counters as noted above.
> > + * i_delayed_blks is decreased while the on-disk per-inode di_nblocks is
> > + * increased.
> > + *
> > + * The attribute fork works in the same way as the data fork except that the
> > + * only valid states are REAL and HOLE.
> > + *
> > + * Copy on Write Fork Mapping Lifecycle
> > + *
> > + * The CoW fork exists only in memory and is used to stage copy writes for
> > + * file data and has fewer transitions:
> > + *
> > + *    +--------> UNWRITTEN -------+
> > + *    |              ^            |
> > + *    |              v            v
> > + * DELALLOC <----> HOLE <------- REAL
> > + *
> > + * Delayed allocation extents here are treated the same as in the data fork
> > + * except that they are counted by the per-inode i_cow_blocks instead of
> > + * i_delayed_blks.
> > + *
> 
> COW fork extent state..
> 
> > + * Unwritten and real extents are counted by the quota code as block
> > + * reservations (q_res_bcount) and not on-disk quota blocks (d_bcount), and
> > + * are counted by the free block counters as in-memory reservations (fdblocks)
> > + * and not on-disk free blocks (sb_fdblocks).  These blocks are also counted
> > + * by i_cow_blocks and not the on-disk di_nblocks.
> > + *
> > + * When a CoW fork extent is remapped to the data fork, the reservations are
> > + * converted into on-disk counts in the same manner as a delayed allocation
> > + * conversion in the data fork.  The number of blocks being remapped is
> > + * subtracted from i_cow_blocks and added to di_nblocks.
> > + */
> >  
> 
> COW fork quota quirkiness..
> 
> In all, I'd probably suggest to split this whole thing up into maybe 3
> independent comments. If we don't have anything that explains extent
> states properly, perhaps do that in the headers where we define the
> extent state definitions (and then explain how they are used differently
> between forks). If we don't have sufficient explanation of how quota
> reservation lifecycle works, do the same in the headers that define the
> data structures that map transaction deltas to actual quota accounting.
> 
> Finally, this patch already does some refactoring below to deal with the
> COW fork quota accounting quirkiness, so we can use the comment in that
> function to explain exactly what/why is going on here. That's just my
> .02 on the whole comment.
> 
> >  kmem_zone_t		*xfs_bmap_free_item_zone;
> >  
> > @@ -3337,6 +3402,28 @@ xfs_bmap_btalloc_filestreams(
> >  	return 0;
> >  }
> >  
> > +/* Deal with CoW fork accounting when we allocate a block. */
> > +static void
> > +xfs_bmap_btalloc_quota_cow(
> > +	struct xfs_bmalloca	*ap,
> > +	struct xfs_alloc_arg	*args)
> > +{
> > +	/* Filling a previously reserved extent; nothing to do here. */
> > +	if (ap->wasdel)
> > +		return;
> > +
> > +	/*
> > +	 * If we get here, we're filling a CoW hole with a real (non-delalloc)
> > +	 * CoW extent having reserved enough blocks from both q_res_bcount and
> > +	 * qt_blk_res to guarantee that we won't run out of space.  The unused
> > +	 * qt_blk_res is given back to q_res_bcount when the transaction
> > +	 * commits, so we must decrease qt_blk_res without decreasing
> > +	 * q_res_bcount.
> > +	 */
> > +	xfs_trans_mod_dquot_byino(ap->tp, ap->ip, XFS_TRANS_DQ_RES_BLKS,
> > +			-(long)args->len);
> > +}
> > +
> 
> Not sure we were on the same page wrt my previous comment here.. I
> think this should look something like:
> 
> /*
>  * Update quota accounting for physical allocation. Depends on extent
>  * state, target fork, etc.
>  */
> static void
> xfs_bmap_btalloc_dquot(...)
> {
> 	/*
> 	 * COW fork requires quota accounting magic words words words ..
> 	 */
> 	if (COWFORK) {
> 		/* do not account wasdel extents as used because ... */
> 		if (ap->wasdel)
> 			return;
> 		/*
> 		 * quota reservation exists in transaction, do magic to
> 		 * cause tx to leave a delalloc-like reservation after
> 		 * the transaction commits because cow fork words words
> 		 * words ...
> 		 */
> 		xfs_trans_mod_dquot_byino(..)
> 	} else {
> 		/* adjust disk quota ... */
> 		xfs_trans_mod_dquot_byino(...)
> 	}
> }
> 
> So we don't have to scroll up and down the source file to understand
> how/why we process quotas differently between the cow and other forks.
> 
> >  STATIC int
> >  xfs_bmap_btalloc(
> >  	struct xfs_bmalloca	*ap)	/* bmap alloc argument struct */
> > @@ -3571,19 +3658,22 @@ xfs_bmap_btalloc(
> >  			*ap->firstblock = args.fsbno;
> >  		ASSERT(nullfb || fb_agno <= args.agno);
> >  		ap->length = args.len;
> > -		if (!(ap->flags & XFS_BMAPI_COWFORK))
> > -			ap->ip->i_d.di_nblocks += args.len;
> > -		xfs_trans_log_inode(ap->tp, ap->ip, XFS_ILOG_CORE);
> 
> ... and I'd just leave this here.
> 
> (Or alternately rename the helper function to something like
> xfs_bmap_btalloc_fork() and put the entire if/else logic in there.)

Yeah, I like the idea of xfs_bmap_btalloc_accounting() taking care of
all the per-inode counters and the quota at the same time.
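
Roughly, such a helper might look like the sketch below.  This is a
userspace toy under stated assumptions, not the eventual patch: the
toy_inode and toy_dqtrx structs and the function name are hypothetical
stand-ins, and the xfs_trans_mod_dquot_byino() calls from the v2 diff
are modelled as plain additions to the matching transaction deltas.

```c
#include <stdbool.h>

/* Toy stand-ins; field names echo the thread, the types are made up. */
struct toy_inode {
	long	di_nblocks;		/* on-disk per-inode block count */
	long	i_delayed_blks;		/* in-core delalloc block count */
};

struct toy_dqtrx {
	long	qt_blk_res;		/* quota blocks reserved by this tx */
	long	qt_bcount_delta;	/* usage taken from qt_blk_res */
	long	qt_delbcnt_delta;	/* usage taken from a delalloc res */
};

/* Per-inode and quota accounting for a freshly allocated extent. */
static void
toy_btalloc_accounting(struct toy_inode *ip, struct toy_dqtrx *qtrx,
		       bool cowfork, bool wasdel, long len)
{
	/* As in the v2 diff, this happens for either fork. */
	if (wasdel)
		ip->i_delayed_blks -= len;

	if (cowfork) {
		/* Filling a previously reserved extent; nothing to do. */
		if (wasdel)
			return;
		/*
		 * Shrink the tx reservation so that commit does not hand
		 * these blocks back; they stay behind as a delalloc-like
		 * reservation until the extent is remapped.
		 */
		qtrx->qt_blk_res -= len;
		return;
	}

	/* Data/attr fork: the blocks become on-disk usage. */
	ip->di_nblocks += len;
	if (wasdel)
		qtrx->qt_delbcnt_delta += len;
	else
		qtrx->qt_bcount_delta += len;
}
```

Keeping both branches in one function puts the reason for the CoW
special case next to the ordinary data fork path.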

> >  		if (ap->wasdel)
> >  			ap->ip->i_delayed_blks -= args.len;
> > -		/*
> > -		 * Adjust the disk quota also. This was reserved
> > -		 * earlier.
> > -		 */
> > -		xfs_trans_mod_dquot_byino(ap->tp, ap->ip,
> > -			ap->wasdel ? XFS_TRANS_DQ_DELBCOUNT :
> > -					XFS_TRANS_DQ_BCOUNT,
> > -			(long) args.len);
> > +		if (ap->flags & XFS_BMAPI_COWFORK) {
> > +			xfs_bmap_btalloc_quota_cow(ap, &args);
> > +		} else {
> > +			ap->ip->i_d.di_nblocks += args.len;
> > +			xfs_trans_log_inode(ap->tp, ap->ip, XFS_ILOG_CORE);
> > +			/*
> > +			 * Adjust the disk quota also. This was reserved
> > +			 * earlier.
> > +			 */
> > +			xfs_trans_mod_dquot_byino(ap->tp, ap->ip,
> > +				ap->wasdel ? XFS_TRANS_DQ_DELBCOUNT :
> > +						XFS_TRANS_DQ_BCOUNT,
> > +				(long) args.len);
> > +		}
> 
> And this just calls the helper above.
> 
> >  	} else {
> >  		ap->blkno = NULLFSBLOCK;
> >  		ap->length = 0;
> > @@ -4760,6 +4850,7 @@ xfs_bmap_del_extent_cow(
> >  	struct xfs_bmbt_irec	new;
> >  	xfs_fileoff_t		del_endoff, got_endoff;
> >  	int			state = BMAP_COWFORK;
> > +	int			error;
> >  
> >  	XFS_STATS_INC(mp, xs_del_exlist);
> >  
> > @@ -4816,6 +4907,11 @@ xfs_bmap_del_extent_cow(
> >  		xfs_iext_insert(ip, icur, &new, state);
> >  		break;
> >  	}
> > +
> > +	/* Remove the quota reservation */
> > +	error = xfs_trans_reserve_quota_nblks(NULL, ip,
> > +			-(long)del->br_blockcount, 0, XFS_QMOPT_RES_REGBLKS);
> > +	ASSERT(error == 0);
> >  }
> >  
> >  /*
> > diff --git a/fs/xfs/xfs_quota.h b/fs/xfs/xfs_quota.h
> > index ce6506a..34b4ec2 100644
> > --- a/fs/xfs/xfs_quota.h
> > +++ b/fs/xfs/xfs_quota.h
> > @@ -54,11 +54,19 @@ struct xfs_trans;
> >   */
> >  typedef struct xfs_dqtrx {
> >  	struct xfs_dquot *qt_dquot;	  /* the dquot this refers to */
> > -	ulong		qt_blk_res;	  /* blks reserved on a dquot */
> > +
> > +	/* dquot bcount blks reserved for this transaction */
> > +	ulong		qt_blk_res;
> > +
> >  	ulong		qt_ino_res;	  /* inode reserved on a dquot */
> >  	ulong		qt_ino_res_used;  /* inodes used from the reservation */
> > -	long		qt_bcount_delta;  /* dquot blk count changes */
> > -	long		qt_delbcnt_delta; /* delayed dquot blk count changes */
> > +
> > +	/* dquot block count changes taken from qt_blk_res */
> > +	long		qt_bcount_delta;
> > +
> > +	/* dquot block count changes taken from delalloc reservation */
> > +	long		qt_delbcnt_delta;
> > +
> 
> Separate patch.

Now that I've slept on it, I think I'll just drop this entirely since the
expanded comments aren't sufficiently distinct from the old ones.

> >  	long		qt_icount_delta;  /* dquot inode count changes */
> >  	ulong		qt_rtblk_res;	  /* # blks reserved on a dquot */
> >  	ulong		qt_rtblk_res_used;/* # blks used from reservation */
> > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > index 82abff6..e367351 100644
> > --- a/fs/xfs/xfs_reflink.c
> > +++ b/fs/xfs/xfs_reflink.c
> > @@ -599,10 +599,6 @@ xfs_reflink_cancel_cow_blocks(
> >  					del.br_startblock, del.br_blockcount,
> >  					NULL);
> >  
> > -			/* Update quota accounting */
> > -			xfs_trans_mod_dquot_byino(*tpp, ip, XFS_TRANS_DQ_BCOUNT,
> > -					-(long)del.br_blockcount);
> > -
> >  			/* Roll the transaction */
> >  			xfs_defer_ijoin(&dfops, ip);
> >  			error = xfs_defer_finish(tpp, &dfops);
> > @@ -795,6 +791,10 @@ xfs_reflink_end_cow(
> >  		if (error)
> >  			goto out_defer;
> >  
> > +		/* Charge this new data fork mapping to the on-disk quota. */
> > +		xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT,
> > +				(long)del.br_blockcount);
> > +
> 
> Should this technically be XFS_TRANS_DQ_DELBCOUNT? The blocks obviously
> aren't delalloc and this transaction doesn't make a quota reservation so
> I don't think it screws up accounting. But if the transaction did make a
> quota reservation, it seems like this would account the extent against
> the tx reservation where it instead should recognize that cow blocks
> have already been reserved (which is essentially what DELBCOUNT means,
> IIUC).

Hmmm, there's a subtlety here -- we're opencoding what DELBCOUNT does,
because the subsequent xfs_bmap_del_extent_cow unconditionally reduces
the in-core reservation after we've mapped in the extent as if it had
been accounted as a real extent all along.  But considering all the
blather about how cow fork blocks are treated as incore reservations, it
does look funny, doesn't it?

So perhaps the solution is to pass intent into xfs_bmap_del_extent_cow:
if we're calling it from _end_cow then we want to hang on to the
reservation so that delbcount can do its thing, but if we're calling
from _cancel_cow then we're dumping the extent and reservation.
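
To make the idea concrete, here is a toy userspace model of passing
intent.  Everything in it is hypothetical: the enum, the toy_dquot
struct, and the function names are invented for illustration, and it
assumes (per the comment in the patch) that a DELBCOUNT-style update
raises the on-disk count without touching the in-core reservation.  It
does not model punching out an old data fork mapping.

```c
/* Hypothetical intent flag; not an existing XFS type. */
enum toy_cow_del_intent {
	TOY_COW_DEL_REMAP,	/* caller is the end_cow path */
	TOY_COW_DEL_CANCEL,	/* caller is the cancel_cow path */
};

struct toy_dquot {
	long	q_res_bcount;	/* in-core reserved blocks */
	long	d_bcount;	/* on-disk block usage */
};

/* Drop a CoW fork mapping; only a cancel releases the reservation. */
static void
toy_del_extent_cow(struct toy_dquot *dq, long len,
		   enum toy_cow_del_intent intent)
{
	if (intent == TOY_COW_DEL_CANCEL)
		dq->q_res_bcount -= len;
}

/* Remap: convert the existing reservation into on-disk usage. */
static void
toy_end_cow(struct toy_dquot *dq, long len)
{
	dq->d_bcount += len;	/* DELBCOUNT-like, reservation unchanged */
	toy_del_extent_cow(dq, len, TOY_COW_DEL_REMAP);
}

/* Cancel: dump the extent and its reservation. */
static void
toy_cancel_cow(struct toy_dquot *dq, long len)
{
	toy_del_extent_cow(dq, len, TOY_COW_DEL_CANCEL);
}
```

Starting from ten reserved CoW blocks, remapping four and cancelling six
leaves both counters at four in this model, with no step where usage is
charged against a reservation that is then separately given back.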

--D

> 
> Other than that the code seems Ok to me.
> 
> Brian
> 
> >  		/* Remove the mapping from the CoW fork. */
> >  		xfs_bmap_del_extent_cow(ip, &icur, &got, &del);
> >  

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 10/11] xfs: refactor inode verifier corruption error printing
  2018-01-25 17:31   ` Brian Foster
@ 2018-01-25 18:23     ` Darrick J. Wong
  0 siblings, 0 replies; 60+ messages in thread
From: Darrick J. Wong @ 2018-01-25 18:23 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Jan 25, 2018 at 12:31:40PM -0500, Brian Foster wrote:
> On Tue, Jan 23, 2018 at 06:19:06PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Refactor inode verifier error reporting into a non-libxfs function so
> > that we aren't encoding the message format in libxfs.  This also
> > changes the kernel dmesg output to resemble buffer verifier errors
> > more closely.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/libxfs/xfs_inode_buf.c |    6 ++----
> >  fs/xfs/xfs_error.c            |   37 +++++++++++++++++++++++++++++++++++++
> >  fs/xfs/xfs_error.h            |    3 +++
> >  fs/xfs/xfs_inode.c            |   14 ++++++++------
> >  4 files changed, 50 insertions(+), 10 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> > index 6e9dcdb..6d05ba6 100644
> > --- a/fs/xfs/libxfs/xfs_inode_buf.c
> > +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> > @@ -578,10 +578,8 @@ xfs_iread(
> >  	/* even unallocated inodes are verified */
> >  	fa = xfs_dinode_verify(mp, ip->i_ino, dip);
> >  	if (fa) {
> > -		xfs_alert(mp, "%s: validation failed for inode %lld at %pS",
> > -				__func__, ip->i_ino, fa);
> > -
> > -		XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, dip);
> > +		xfs_inode_verifier_error(ip, -EFSCORRUPTED, "record", dip,
> > +				sizeof(*dip), fa);
> 
> What does "record" mean? "dinode" perhaps? Otherwise looks fine:

Sure, 'dinode' works.  Ultimately this comes out as:

"Metadata corruption detected at xfs_fubar+0x0, inode 0x8008 record"

but I guess

"Metadata corruption detected at xfs_fubar+0x0, inode 0x8008 dinode"

works just as well.
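
For reference, the message layout can be approximated in userspace as
below.  This is only a sketch: the function name is made up, and because
%pS is a kernel-only format, the caller passes an already-resolved
symbol string instead of a failure address.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Userspace approximation of the message layout.  The kernel prints the
 * failure address with %pS; here the caller supplies the symbol text,
 * which is an assumption made purely for this sketch.
 */
int
toy_format_inode_error(
	char			*out,
	size_t			outsz,
	int			is_crc_error,
	const char		*where,
	unsigned long long	ino,
	const char		*name)
{
	assert(out && where && name);
	return snprintf(out, outsz,
			"Metadata %s detected at %s, inode 0x%llx %s",
			is_crc_error ? "CRC error" : "corruption",
			where, ino, name);
}
```

Calling it with "xfs_fubar+0x0", inode 0x8008, and "dinode" reproduces
the second line quoted above.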

> Reviewed-by: Brian Foster <bfoster@redhat.com>

Thank you for reviewing this series!

--D

> 
> >  		error = -EFSCORRUPTED;
> >  		goto out_brelse;
> >  	}
> > diff --git a/fs/xfs/xfs_error.c b/fs/xfs/xfs_error.c
> > index 980d5f0..ccf520f 100644
> > --- a/fs/xfs/xfs_error.c
> > +++ b/fs/xfs/xfs_error.c
> > @@ -24,6 +24,7 @@
> >  #include "xfs_errortag.h"
> >  #include "xfs_error.h"
> >  #include "xfs_sysfs.h"
> > +#include "xfs_inode.h"
> >  
> >  #ifdef DEBUG
> >  
> > @@ -372,3 +373,39 @@ xfs_verifier_error(
> >  	if (xfs_error_level >= XFS_ERRLEVEL_HIGH)
> >  		xfs_stack_trace();
> >  }
> > +
> > +/*
> > + * Warnings for inode corruption problems.  Don't bother with the stack
> > + * trace unless the error level is turned up high.
> > + */
> > +void
> > +xfs_inode_verifier_error(
> > +	struct xfs_inode	*ip,
> > +	int			error,
> > +	const char		*name,
> > +	void			*buf,
> > +	size_t			bufsz,
> > +	xfs_failaddr_t		failaddr)
> > +{
> > +	struct xfs_mount	*mp = ip->i_mount;
> > +	xfs_failaddr_t		fa;
> > +	int			sz;
> > +
> > +	fa = failaddr ? failaddr : __return_address;
> > +
> > +	xfs_alert(mp, "Metadata %s detected at %pS, inode 0x%llx %s",
> > +		  error == -EFSBADCRC ? "CRC error" : "corruption",
> > +		  fa, ip->i_ino, name);
> > +
> > +	xfs_alert(mp, "Unmount and run xfs_repair");
> > +
> > +	if (buf && xfs_error_level >= XFS_ERRLEVEL_LOW) {
> > +		sz = min_t(size_t, XFS_CORRUPTION_DUMP_LEN, bufsz);
> > +		xfs_alert(mp, "First %d bytes of corrupted metadata buffer:",
> > +				sz);
> > +		xfs_hex_dump(buf, sz);
> > +	}
> > +
> > +	if (xfs_error_level >= XFS_ERRLEVEL_HIGH)
> > +		xfs_stack_trace();
> > +}
> > diff --git a/fs/xfs/xfs_error.h b/fs/xfs/xfs_error.h
> > index a3ba05b..7e728c5 100644
> > --- a/fs/xfs/xfs_error.h
> > +++ b/fs/xfs/xfs_error.h
> > @@ -28,6 +28,9 @@ extern void xfs_corruption_error(const char *tag, int level,
> >  			int linenum, xfs_failaddr_t failaddr);
> >  extern void xfs_verifier_error(struct xfs_buf *bp, int error,
> >  			xfs_failaddr_t failaddr);
> > +extern void xfs_inode_verifier_error(struct xfs_inode *ip, int error,
> > +			const char *name, void *buf, size_t bufsz,
> > +			xfs_failaddr_t failaddr);
> >  
> >  #define	XFS_ERROR_REPORT(e, lvl, mp)	\
> >  	xfs_error_report(e, lvl, mp, __FILE__, __LINE__, __return_address)
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index fc118dd..c60efec 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -3502,21 +3502,23 @@ bool
> >  xfs_inode_verify_forks(
> >  	struct xfs_inode	*ip)
> >  {
> > +	struct xfs_ifork	*ifp;
> >  	xfs_failaddr_t		fa;
> >  
> >  	fa = xfs_ifork_verify_data(ip, &xfs_default_ifork_ops);
> >  	if (fa) {
> > -		xfs_alert(ip->i_mount,
> > -				"%s: bad inode %llu inline data fork at %pS",
> > -				__func__, ip->i_ino, fa);
> > +		ifp = XFS_IFORK_PTR(ip, XFS_DATA_FORK);
> > +		xfs_inode_verifier_error(ip, -EFSCORRUPTED, "data fork",
> > +				ifp->if_u1.if_data, ifp->if_bytes, fa);
> >  		return false;
> >  	}
> >  
> >  	fa = xfs_ifork_verify_attr(ip, &xfs_default_ifork_ops);
> >  	if (fa) {
> > -		xfs_alert(ip->i_mount,
> > -				"%s: bad inode %llu inline attr fork at %pS",
> > -				__func__, ip->i_ino, fa);
> > +		ifp = XFS_IFORK_PTR(ip, XFS_ATTR_FORK);
> > +		xfs_inode_verifier_error(ip, -EFSCORRUPTED, "attr fork",
> > +				ifp ? ifp->if_u1.if_data : NULL,
> > +				ifp ? ifp->if_bytes : 0, fa);
> >  		return false;
> >  	}
> >  	return true;
> > 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 07/11] xfs: always zero di_flags2 when we free the inode
  2018-01-25 17:31   ` Brian Foster
@ 2018-01-25 18:36     ` Darrick J. Wong
  0 siblings, 0 replies; 60+ messages in thread
From: Darrick J. Wong @ 2018-01-25 18:36 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Jan 25, 2018 at 12:31:21PM -0500, Brian Foster wrote:
> On Tue, Jan 23, 2018 at 06:18:41PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Always zero the di_flags2 field when we free the inode so that we never
> > write reflinked non-file inode records to disk.
> > 
> 
> By "non-file," do you mean "invalid" or "unallocated?"

In this particular function we're preventing unallocated inodes from
having the reflink flag set in order to uphold the general policy that only
files can have reflink set.

"Always zero the di_flags2 field when we free the inode so that we never
end up with an on-disk record for an unallocated inode that also has the
reflink iflag set.  This is in keeping with the general principle that
only files can have the reflink iflag set, even though we'll zero out
di_flags2 if we ever reallocate the inode."

--D

> Otherwise looks fine:
> 
> Reviewed-by: Brian Foster <bfoster@redhat.com>
> 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/xfs_inode.c |    1 +
> >  1 file changed, 1 insertion(+)
> > 
> > 
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index a208825..fc118dd 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -2465,6 +2465,7 @@ xfs_ifree(
> >  
> >  	VFS_I(ip)->i_mode = 0;		/* mark incore inode as free */
> >  	ip->i_d.di_flags = 0;
> > +	ip->i_d.di_flags2 = 0;
> >  	ip->i_d.di_dmevmask = 0;
> >  	ip->i_d.di_forkoff = 0;		/* mark the attr fork not in use */
> >  	ip->i_d.di_format = XFS_DINODE_FMT_EXTENTS;
> > 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 08/11] xfs: fix tracepoint %p formats
  2018-01-25 17:31   ` Brian Foster
@ 2018-01-25 18:47     ` Darrick J. Wong
  2018-01-26  0:19       ` Darrick J. Wong
  0 siblings, 1 reply; 60+ messages in thread
From: Darrick J. Wong @ 2018-01-25 18:47 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Jan 25, 2018 at 12:31:28PM -0500, Brian Foster wrote:
> On Tue, Jan 23, 2018 at 06:18:47PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Tracepoint printk doesn't have any of the %p suffixes, so use %p.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> 
> I see different behavior with this. E.g.,
> 
>   umount-1130  [003] ...1  1995.947789: xfs_log_force: dev 253:3 lsn 0x0 caller xfs_log_quiesce+0x3c/0x4b0 [xfs]
> 
> vs.
> 
>   umount-1272  [002] ...1  2089.445135: xfs_log_force: dev 253:3 lsn 0x0 caller 00000000937cbc85

Hmm, on my system all I get is:

mount-3125  [000]  1634.386726: xfs_buf_submit_wait:  dev 8:0 bno
	0x4b0020 nblks 0x8 hold 1 pincount 0 lock 0 flags READ|PAGES caller
	0xffffffffa06cca52S

...which is odd since they all map to the vsnprintf implementation, so
this ought to work.

# trace-cmd record -e 'xfs_buf*' -F mount /dev/sda
# trace-cmd report

(I don't see anything in the trace-cmd-report manpage about "resolve
symbolic addresses" but maybe I just have an old version... or maybe
we're just using different tools?)

> Expected?

No, not at all.  But since it clearly works on your system, I'll call
you fortunate and drop this patch. :)

--D

> 
> Brian
> 
> >  fs/xfs/scrub/trace.h |   20 ++++++++++----------
> >  fs/xfs/xfs_trace.h   |   24 ++++++++++++------------
> >  2 files changed, 22 insertions(+), 22 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
> > index a0a6d3c..732775f 100644
> > --- a/fs/xfs/scrub/trace.h
> > +++ b/fs/xfs/scrub/trace.h
> > @@ -90,7 +90,7 @@ TRACE_EVENT(xfs_scrub_op_error,
> >  		__entry->error = error;
> >  		__entry->ret_ip = ret_ip;
> >  	),
> > -	TP_printk("dev %d:%d type %u agno %u agbno %u error %d ret_ip %pS",
> > +	TP_printk("dev %d:%d type %u agno %u agbno %u error %d ret_ip %p",
> >  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> >  		  __entry->type,
> >  		  __entry->agno,
> > @@ -121,7 +121,7 @@ TRACE_EVENT(xfs_scrub_file_op_error,
> >  		__entry->error = error;
> >  		__entry->ret_ip = ret_ip;
> >  	),
> > -	TP_printk("dev %d:%d ino %llu fork %d type %u offset %llu error %d ret_ip %pS",
> > +	TP_printk("dev %d:%d ino %llu fork %d type %u offset %llu error %d ret_ip %p",
> >  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> >  		  __entry->ino,
> >  		  __entry->whichfork,
> > @@ -156,7 +156,7 @@ DECLARE_EVENT_CLASS(xfs_scrub_block_error_class,
> >  		__entry->bno = bno;
> >  		__entry->ret_ip = ret_ip;
> >  	),
> > -	TP_printk("dev %d:%d type %u agno %u agbno %u ret_ip %pS",
> > +	TP_printk("dev %d:%d type %u agno %u agbno %u ret_ip %p",
> >  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> >  		  __entry->type,
> >  		  __entry->agno,
> > @@ -207,7 +207,7 @@ DECLARE_EVENT_CLASS(xfs_scrub_ino_error_class,
> >  		__entry->bno = bno;
> >  		__entry->ret_ip = ret_ip;
> >  	),
> > -	TP_printk("dev %d:%d ino %llu type %u agno %u agbno %u ret_ip %pS",
> > +	TP_printk("dev %d:%d ino %llu type %u agno %u agbno %u ret_ip %p",
> >  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> >  		  __entry->ino,
> >  		  __entry->type,
> > @@ -246,7 +246,7 @@ DECLARE_EVENT_CLASS(xfs_scrub_fblock_error_class,
> >  		__entry->offset = offset;
> >  		__entry->ret_ip = ret_ip;
> >  	),
> > -	TP_printk("dev %d:%d ino %llu fork %d type %u offset %llu ret_ip %pS",
> > +	TP_printk("dev %d:%d ino %llu fork %d type %u offset %llu ret_ip %p",
> >  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> >  		  __entry->ino,
> >  		  __entry->whichfork,
> > @@ -277,7 +277,7 @@ TRACE_EVENT(xfs_scrub_incomplete,
> >  		__entry->type = sc->sm->sm_type;
> >  		__entry->ret_ip = ret_ip;
> >  	),
> > -	TP_printk("dev %d:%d type %u ret_ip %pS",
> > +	TP_printk("dev %d:%d type %u ret_ip %p",
> >  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> >  		  __entry->type,
> >  		  __entry->ret_ip)
> > @@ -311,7 +311,7 @@ TRACE_EVENT(xfs_scrub_btree_op_error,
> >  		__entry->error = error;
> >  		__entry->ret_ip = ret_ip;
> >  	),
> > -	TP_printk("dev %d:%d type %u btnum %d level %d ptr %d agno %u agbno %u error %d ret_ip %pS",
> > +	TP_printk("dev %d:%d type %u btnum %d level %d ptr %d agno %u agbno %u error %d ret_ip %p",
> >  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> >  		  __entry->type,
> >  		  __entry->btnum,
> > @@ -354,7 +354,7 @@ TRACE_EVENT(xfs_scrub_ifork_btree_op_error,
> >  		__entry->error = error;
> >  		__entry->ret_ip = ret_ip;
> >  	),
> > -	TP_printk("dev %d:%d ino %llu fork %d type %u btnum %d level %d ptr %d agno %u agbno %u error %d ret_ip %pS",
> > +	TP_printk("dev %d:%d ino %llu fork %d type %u btnum %d level %d ptr %d agno %u agbno %u error %d ret_ip %p",
> >  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> >  		  __entry->ino,
> >  		  __entry->whichfork,
> > @@ -393,7 +393,7 @@ TRACE_EVENT(xfs_scrub_btree_error,
> >  		__entry->ptr = cur->bc_ptrs[level];
> >  		__entry->ret_ip = ret_ip;
> >  	),
> > -	TP_printk("dev %d:%d type %u btnum %d level %d ptr %d agno %u agbno %u ret_ip %pS",
> > +	TP_printk("dev %d:%d type %u btnum %d level %d ptr %d agno %u agbno %u ret_ip %p",
> >  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> >  		  __entry->type,
> >  		  __entry->btnum,
> > @@ -433,7 +433,7 @@ TRACE_EVENT(xfs_scrub_ifork_btree_error,
> >  		__entry->ptr = cur->bc_ptrs[level];
> >  		__entry->ret_ip = ret_ip;
> >  	),
> > -	TP_printk("dev %d:%d ino %llu fork %d type %u btnum %d level %d ptr %d agno %u agbno %u ret_ip %pS",
> > +	TP_printk("dev %d:%d ino %llu fork %d type %u btnum %d level %d ptr %d agno %u agbno %u ret_ip %p",
> >  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> >  		  __entry->ino,
> >  		  __entry->whichfork,
> > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > index 945de08..893081e 100644
> > --- a/fs/xfs/xfs_trace.h
> > +++ b/fs/xfs/xfs_trace.h
> > @@ -119,7 +119,7 @@ DECLARE_EVENT_CLASS(xfs_perag_class,
> >  		__entry->refcount = refcount;
> >  		__entry->caller_ip = caller_ip;
> >  	),
> > -	TP_printk("dev %d:%d agno %u refcount %d caller %pS",
> > +	TP_printk("dev %d:%d agno %u refcount %d caller %p",
> >  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> >  		  __entry->agno,
> >  		  __entry->refcount,
> > @@ -252,7 +252,7 @@ DECLARE_EVENT_CLASS(xfs_bmap_class,
> >  		__entry->caller_ip = caller_ip;
> >  	),
> >  	TP_printk("dev %d:%d ino 0x%llx state %s cur %p/%d "
> > -		  "offset %lld block %lld count %lld flag %d caller %pS",
> > +		  "offset %lld block %lld count %lld flag %d caller %p",
> >  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> >  		  __entry->ino,
> >  		  __print_flags(__entry->bmap_state, "|", XFS_BMAP_EXT_FLAGS),
> > @@ -301,7 +301,7 @@ DECLARE_EVENT_CLASS(xfs_buf_class,
> >  		__entry->caller_ip = caller_ip;
> >  	),
> >  	TP_printk("dev %d:%d bno 0x%llx nblks 0x%x hold %d pincount %d "
> > -		  "lock %d flags %s caller %pS",
> > +		  "lock %d flags %s caller %p",
> >  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> >  		  (unsigned long long)__entry->bno,
> >  		  __entry->nblks,
> > @@ -370,7 +370,7 @@ DECLARE_EVENT_CLASS(xfs_buf_flags_class,
> >  		__entry->caller_ip = caller_ip;
> >  	),
> >  	TP_printk("dev %d:%d bno 0x%llx len 0x%zx hold %d pincount %d "
> > -		  "lock %d flags %s caller %pS",
> > +		  "lock %d flags %s caller %p",
> >  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> >  		  (unsigned long long)__entry->bno,
> >  		  __entry->buffer_length,
> > @@ -415,7 +415,7 @@ TRACE_EVENT(xfs_buf_ioerror,
> >  		__entry->caller_ip = caller_ip;
> >  	),
> >  	TP_printk("dev %d:%d bno 0x%llx len 0x%zx hold %d pincount %d "
> > -		  "lock %d error %d flags %s caller %pS",
> > +		  "lock %d error %d flags %s caller %p",
> >  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> >  		  (unsigned long long)__entry->bno,
> >  		  __entry->buffer_length,
> > @@ -579,7 +579,7 @@ DECLARE_EVENT_CLASS(xfs_lock_class,
> >  		__entry->lock_flags = lock_flags;
> >  		__entry->caller_ip = caller_ip;
> >  	),
> > -	TP_printk("dev %d:%d ino 0x%llx flags %s caller %pS",
> > +	TP_printk("dev %d:%d ino 0x%llx flags %s caller %p",
> >  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> >  		  __entry->ino,
> >  		  __print_flags(__entry->lock_flags, "|", XFS_LOCK_FLAGS),
> > @@ -697,7 +697,7 @@ DECLARE_EVENT_CLASS(xfs_iref_class,
> >  		__entry->pincount = atomic_read(&ip->i_pincount);
> >  		__entry->caller_ip = caller_ip;
> >  	),
> > -	TP_printk("dev %d:%d ino 0x%llx count %d pincount %d caller %pS",
> > +	TP_printk("dev %d:%d ino 0x%llx count %d pincount %d caller %p",
> >  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> >  		  __entry->ino,
> >  		  __entry->count,
> > @@ -1049,7 +1049,7 @@ TRACE_EVENT(xfs_log_force,
> >  		__entry->lsn = lsn;
> >  		__entry->caller_ip = caller_ip;
> >  	),
> > -	TP_printk("dev %d:%d lsn 0x%llx caller %pS",
> > +	TP_printk("dev %d:%d lsn 0x%llx caller %p",
> >  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> >  		  __entry->lsn, (void *)__entry->caller_ip)
> >  )
> > @@ -1403,7 +1403,7 @@ TRACE_EVENT(xfs_bunmap,
> >  		__entry->flags = flags;
> >  	),
> >  	TP_printk("dev %d:%d ino 0x%llx size 0x%llx bno 0x%llx len 0x%llx"
> > -		  "flags %s caller %pS",
> > +		  "flags %s caller %p",
> >  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> >  		  __entry->ino,
> >  		  __entry->size,
> > @@ -1517,7 +1517,7 @@ TRACE_EVENT(xfs_agf,
> >  	),
> >  	TP_printk("dev %d:%d agno %u flags %s length %u roots b %u c %u "
> >  		  "levels b %u c %u flfirst %u fllast %u flcount %u "
> > -		  "freeblks %u longest %u caller %pS",
> > +		  "freeblks %u longest %u caller %p",
> >  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> >  		  __entry->agno,
> >  		  __print_flags(__entry->flags, "|", XFS_AGF_FLAGS),
> > @@ -2486,7 +2486,7 @@ DECLARE_EVENT_CLASS(xfs_ag_error_class,
> >  		__entry->error = error;
> >  		__entry->caller_ip = caller_ip;
> >  	),
> > -	TP_printk("dev %d:%d agno %u error %d caller %pS",
> > +	TP_printk("dev %d:%d agno %u error %d caller %p",
> >  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> >  		  __entry->agno,
> >  		  __entry->error,
> > @@ -2977,7 +2977,7 @@ DECLARE_EVENT_CLASS(xfs_inode_error_class,
> >  		__entry->error = error;
> >  		__entry->caller_ip = caller_ip;
> >  	),
> > -	TP_printk("dev %d:%d ino %llx error %d caller %pS",
> > +	TP_printk("dev %d:%d ino %llx error %d caller %p",
> >  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> >  		  __entry->ino,
> >  		  __entry->error,
> > 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 05/11] xfs: track CoW blocks separately in the inode
  2018-01-25 13:06   ` Brian Foster
@ 2018-01-25 19:21     ` Darrick J. Wong
  2018-01-26 13:04       ` Brian Foster
  0 siblings, 1 reply; 60+ messages in thread
From: Darrick J. Wong @ 2018-01-25 19:21 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Jan 25, 2018 at 08:06:45AM -0500, Brian Foster wrote:
> On Tue, Jan 23, 2018 at 06:18:29PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Track the number of blocks reserved in the CoW fork so that we can
> > move the quota reservations whenever we chown, and don't account for
> > CoW fork delalloc reservations in i_delayed_blks.  This should make
> > chown work properly for quota reservations, enables us to fully
> > account for real extents in the cow fork in the file stat info, and
> > improves the post-eof scanning decisions because we're no longer
> > confusing data fork delalloc extents with cow fork delalloc extents.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/libxfs/xfs_bmap.c      |   16 ++++++++++++----
> >  fs/xfs/libxfs/xfs_inode_buf.c |    1 +
> >  fs/xfs/xfs_bmap_util.c        |    5 +++++
> >  fs/xfs/xfs_icache.c           |    3 ++-
> >  fs/xfs/xfs_inode.c            |   11 +++++------
> >  fs/xfs/xfs_inode.h            |    1 +
> >  fs/xfs/xfs_iops.c             |    3 ++-
> >  fs/xfs/xfs_itable.c           |    3 ++-
> >  fs/xfs/xfs_qm.c               |    2 +-
> >  fs/xfs/xfs_reflink.c          |    4 ++--
> >  fs/xfs/xfs_super.c            |    1 +
> >  11 files changed, 34 insertions(+), 16 deletions(-)
> > 
> > 
> ...
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index 4a38cfc..a208825 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> ...
> > @@ -1669,7 +1667,7 @@ xfs_release(
> >  		truncated = xfs_iflags_test_and_clear(ip, XFS_ITRUNCATED);
> >  		if (truncated) {
> >  			xfs_iflags_clear(ip, XFS_IDIRTY_RELEASE);
> > -			if (ip->i_delayed_blks > 0) {
> > +			if (ip->i_delayed_blks > 0 || ip->i_cow_blocks > 0) {
> >  				error = filemap_flush(VFS_I(ip)->i_mapping);
> >  				if (error)
> >  					return error;
> 
> Is having cowblocks really relevant to this hunk? I thought this was
> purely a delalloc vs. file size thing, but I could be wrong. 

AFAICT, if we (1) use truncate to reduce a file's size, (2) write
somewhere past eof, (3) make some delalloc reservations for the post-eof
write, and (4) close the file, then this chunk flushes the dirty data to
disk so that if we crash after the close() call returns, the file will
still have all the data that was written out.  IOWs, this provides for
flush-on-close after a file size reduction.

So I was thinking that if a write to a lower offset causes the creation
of a speculative cow extent of some kind that extends past eof, we'd
still want to flush the dirty data to disk on close even if there are no
delalloc reservations in the data fork.

Of course, now I see that xfs_file_iomap_begin_delay will create the data
fork delalloc reservation for a non-shared block even if a cow fork
extent already exists (the write is promoted to cow), so perhaps this
isn't strictly necessary... but adding a data fork delalloc extent when
there's already a cow fork extent seems like a (mostly harmless) bug to
me.

--D

> 
> Brian
> 
> > @@ -1909,7 +1907,8 @@ xfs_inactive(
> >  
> >  	if (S_ISREG(VFS_I(ip)->i_mode) &&
> >  	    (ip->i_d.di_size != 0 || XFS_ISIZE(ip) != 0 ||
> > -	     ip->i_d.di_nextents > 0 || ip->i_delayed_blks > 0))
> > +	     ip->i_d.di_nextents > 0 || ip->i_delayed_blks > 0 ||
> > +	     ip->i_cow_blocks > 0))
> >  		truncate = 1;
> >  
> >  	error = xfs_qm_dqattach(ip, 0);
> > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> > index ff56486..6feee8a 100644
> > --- a/fs/xfs/xfs_inode.h
> > +++ b/fs/xfs/xfs_inode.h
> > @@ -62,6 +62,7 @@ typedef struct xfs_inode {
> >  	/* Miscellaneous state. */
> >  	unsigned long		i_flags;	/* see defined flags below */
> >  	unsigned int		i_delayed_blks;	/* count of delay alloc blks */
> > +	unsigned int		i_cow_blocks;	/* count of cow fork blocks */
> >  
> >  	struct xfs_icdinode	i_d;		/* most of ondisk inode */
> >  
> > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > index 56475fc..6c3381c 100644
> > --- a/fs/xfs/xfs_iops.c
> > +++ b/fs/xfs/xfs_iops.c
> > @@ -513,7 +513,8 @@ xfs_vn_getattr(
> >  	stat->mtime = inode->i_mtime;
> >  	stat->ctime = inode->i_ctime;
> >  	stat->blocks =
> > -		XFS_FSB_TO_BB(mp, ip->i_d.di_nblocks + ip->i_delayed_blks);
> > +		XFS_FSB_TO_BB(mp, ip->i_d.di_nblocks + ip->i_delayed_blks +
> > +				  ip->i_cow_blocks);
> >  
> >  	if (ip->i_d.di_version == 3) {
> >  		if (request_mask & STATX_BTIME) {
> > diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
> > index d583105..412d7eb 100644
> > --- a/fs/xfs/xfs_itable.c
> > +++ b/fs/xfs/xfs_itable.c
> > @@ -122,7 +122,8 @@ xfs_bulkstat_one_int(
> >  	case XFS_DINODE_FMT_BTREE:
> >  		buf->bs_rdev = 0;
> >  		buf->bs_blksize = mp->m_sb.sb_blocksize;
> > -		buf->bs_blocks = dic->di_nblocks + ip->i_delayed_blks;
> > +		buf->bs_blocks = dic->di_nblocks + ip->i_delayed_blks +
> > +				 ip->i_cow_blocks;
> >  		break;
> >  	}
> >  	xfs_iunlock(ip, XFS_ILOCK_SHARED);
> > diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
> > index 5b848f4..28f12f8 100644
> > --- a/fs/xfs/xfs_qm.c
> > +++ b/fs/xfs/xfs_qm.c
> > @@ -1847,7 +1847,7 @@ xfs_qm_vop_chown_reserve(
> >  	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL|XFS_ILOCK_SHARED));
> >  	ASSERT(XFS_IS_QUOTA_RUNNING(mp));
> >  
> > -	delblks = ip->i_delayed_blks;
> > +	delblks = ip->i_delayed_blks + ip->i_cow_blocks;
> >  	blkflags = XFS_IS_REALTIME_INODE(ip) ?
> >  			XFS_QMOPT_RES_RTBLKS : XFS_QMOPT_RES_REGBLKS;
> >  
> > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > index e367351..f875ea7 100644
> > --- a/fs/xfs/xfs_reflink.c
> > +++ b/fs/xfs/xfs_reflink.c
> > @@ -619,7 +619,7 @@ xfs_reflink_cancel_cow_blocks(
> >  	}
> >  
> >  	/* clear tag if cow fork is emptied */
> > -	if (!ifp->if_bytes)
> > +	if (ip->i_cow_blocks == 0)
> >  		xfs_inode_clear_cowblocks_tag(ip);
> >  
> >  	return error;
> > @@ -704,7 +704,7 @@ xfs_reflink_end_cow(
> >  	trace_xfs_reflink_end_cow(ip, offset, count);
> >  
> >  	/* No COW extents?  That's easy! */
> > -	if (ifp->if_bytes == 0)
> > +	if (ip->i_cow_blocks == 0)
> >  		return 0;
> >  
> >  	offset_fsb = XFS_B_TO_FSBT(ip->i_mount, offset);
> > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > index f3e0001..9d04cfb 100644
> > --- a/fs/xfs/xfs_super.c
> > +++ b/fs/xfs/xfs_super.c
> > @@ -989,6 +989,7 @@ xfs_fs_destroy_inode(
> >  	xfs_inactive(ip);
> >  
> >  	ASSERT(XFS_FORCED_SHUTDOWN(ip->i_mount) || ip->i_delayed_blks == 0);
> > +	ASSERT(XFS_FORCED_SHUTDOWN(ip->i_mount) || ip->i_cow_blocks == 0);
> >  	XFS_STATS_INC(ip->i_mount, vn_reclaim);
> >  
> >  	/*
> > 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 06/11] xfs: fix up cowextsz allocation shortfalls
  2018-01-25 17:31   ` Brian Foster
@ 2018-01-25 20:20     ` Darrick J. Wong
  2018-01-26 13:06       ` Brian Foster
  0 siblings, 1 reply; 60+ messages in thread
From: Darrick J. Wong @ 2018-01-25 20:20 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Jan 25, 2018 at 12:31:12PM -0500, Brian Foster wrote:
> On Tue, Jan 23, 2018 at 06:18:35PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > In xfs_bmap_btalloc, we try using the CoW extent size hint to force
> > allocations to align (offset-wise) to cowextsz granularity to reduce CoW
> > fragmentation.  This works fine until we cannot satisfy the allocation
> > with enough blocks to cover the requested range and the alignment hints.
> > If this happens, return an unaligned region because if we don't the
> > extent trim functions cause us to return a zero-length extent to iomap,
> > which iomap doesn't catch and thus blows up.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> 
> Hmm.. is this a direct I/O thing? The description of the problem had me

Yes.

> wondering how we handle this with regard to dio and traditional extent
> size hints. It looks like we just return -ENOSPC if xfs_bmapi_write()
> doesn't return a mapping that covers the target range of the write (even
> if it apparently attempts to allocate part of the associated extent size
> hint range). E.g., see the nimaps == 0 check in
> xfs_iomap_write_direct() after we commit the transaction.

I did take a look at that, and didn't like it.

There's enough free space to fill the dio write, but the free space
itself is very fragmented so we can't honor the hint.  We did however
manage to allocate /some/ blocks, so we might as well return what we got
and let the next iteration of the iomap_apply loop try to fill the rest
of the write request.  We already reserved enough space, so the write
should succeed totally, not return to userspace with either a short
write or ENOSPC just because free space is fragmented.
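
To make the "return what we got" behaviour concrete, here is the offset
fixup from the quoted xfs_bmap_btalloc_cow hunk lifted into a standalone
function.  The typedefs and the function name are assumptions made for
this sketch; the two comparisons follow the patch.

```c
#include <assert.h>

typedef unsigned long long toy_fileoff_t;
typedef unsigned int toy_extlen_t;

/*
 * Given the hint-aligned request start (offset) and the length that was
 * actually allocated (length), pick the file offset to map so that the
 * allocation overlaps the caller's original [orig_offset, +orig_length).
 */
toy_fileoff_t
toy_fixup_cow_offset(
	toy_fileoff_t	offset,
	toy_extlen_t	length,
	toy_fileoff_t	orig_offset,
	toy_extlen_t	orig_length)
{
	assert(length > 0 && orig_length > 0);

	/* Got no more than was asked for: start at the original offset. */
	if (length <= orig_length)
		return orig_offset;

	/* Aligned range stops short: slide it to end with the original. */
	if (offset + length < orig_offset + orig_length)
		return orig_offset + orig_length - length;

	/* The aligned allocation already reaches the end of the original. */
	return offset;
}
```

For a write of 4 blocks at offset 10 whose aligned request started at
offset 0: getting 3 blocks maps them at 10, getting 8 blocks maps them
at 6 so they end at 14, and getting the full 32 leaves the offset at 0.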

(The other problem is that if we return ENOSPC out of iomap_begin, that
error code will bubble all the way back to userspace even if we /did/
write something, which means that even the programs that handle short
dio writes correctly will see that ENOSPC and bail out.  Goldwyn has
been trying to fix that braindamage for some time now.)
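
The retry idea can be sketched as a plain loop.  This is a userspace
model, not the iomap code: the callback type and both function names are
hypothetical, and the sketch makes no attempt to report a short write
when an error arrives partway through.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical callback: map up to len blocks at pos, return blocks mapped. */
typedef long (*toy_begin_fn)(unsigned long pos, unsigned long len, void *priv);

/*
 * Keep asking for mappings until the whole request is covered.  A short
 * mapping just means another trip around the loop; a zero-length
 * mapping is treated as an I/O error, matching the check added to
 * iomap_apply in the quoted patch.
 */
long
toy_write_loop(unsigned long pos, unsigned long len, toy_begin_fn begin,
	       void *priv)
{
	unsigned long	done = 0;

	assert(begin != NULL);
	while (done < len) {
		long	got = begin(pos + done, len - done, priv);

		if (got == 0)
			return -EIO;
		if (got < 0)
			return got;
		/* Never count more than was asked for. */
		if ((unsigned long)got > len - done)
			got = (long)(len - done);
		done += (unsigned long)got;
	}
	return (long)done;
}

/* Toy allocator that hands back at most *priv blocks per call. */
long
toy_fragmented_begin(unsigned long pos, unsigned long len, void *priv)
{
	unsigned long	max = *(unsigned long *)priv;

	(void)pos;
	return (long)(len < max ? len : max);
}
```

With a 10-block request and an allocator that never returns more than 3
blocks at a time, the loop finishes the whole write in four passes
instead of failing on the first short mapping.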

> In fact, it looks like just repeating the failed write could eventually
> succeed if the issue is that there is actually enough free space
> available to allocate the hint range up to where the write is targeted,
> just no single extent long enough to fill the extent size hint range
> in a single bmapi_write call. That behavior is a bit strange, I admit,
> but I'm wondering if we could do the same thing for the cow hint. Would
> a similar nimaps check in the xfs_bmapi_write() caller resolve the bug
> described here?
> 
> If so and if we still care to actually change/fix the allocation
> behavior with regard to the hints, perhaps we could do that in a
> separate patch more generically for both hints..?

I get the feeling we could apply this change to all the data fork
bmap_btalloc calls too.  I'll go study that in more depth.

--D

> 
> Brian
> 
> >  fs/iomap.c               |    2 +-
> >  fs/xfs/libxfs/xfs_bmap.c |   21 +++++++++++++++++++--
> >  2 files changed, 20 insertions(+), 3 deletions(-)
> > 
> > 
> > diff --git a/fs/iomap.c b/fs/iomap.c
> > index e5de772..aec35a0 100644
> > --- a/fs/iomap.c
> > +++ b/fs/iomap.c
> > @@ -63,7 +63,7 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
> >  	ret = ops->iomap_begin(inode, pos, length, flags, &iomap);
> >  	if (ret)
> >  		return ret;
> > -	if (WARN_ON(iomap.offset > pos))
> > +	if (WARN_ON(iomap.offset > pos) || WARN_ON(iomap.length == 0))
> >  		return -EIO;
> >  
> >  	/*
> > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > index 93ce2c6..4ec1fdc5 100644
> > --- a/fs/xfs/libxfs/xfs_bmap.c
> > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > @@ -3480,8 +3480,20 @@ xfs_bmap_btalloc_filestreams(
> >  static void
> >  xfs_bmap_btalloc_cow(
> >  	struct xfs_bmalloca	*ap,
> > -	struct xfs_alloc_arg	*args)
> > +	struct xfs_alloc_arg	*args,
> > +	xfs_fileoff_t		orig_offset,
> > +	xfs_extlen_t		orig_length)
> >  {
> > +	/*
> > +	 * If we didn't get enough blocks to satisfy the cowextsize
> > +	 * aligned request, break the alignment and return whatever we
> > +	 * got; it's the best we can do.
> > +	 */
> > +	if (ap->length <= orig_length)
> > +		ap->offset = orig_offset;
> > +	else if (ap->offset + ap->length < orig_offset + orig_length)
> > +		ap->offset = orig_offset + orig_length - ap->length;
> > +
> >  	/* Filling a previously reserved extent; nothing to do here. */
> >  	if (ap->wasdel)
> >  		return;
> > @@ -3520,6 +3532,8 @@ xfs_bmap_btalloc(
> >  	xfs_agnumber_t	fb_agno;	/* ag number of ap->firstblock */
> >  	xfs_agnumber_t	ag;
> >  	xfs_alloc_arg_t	args;
> > +	xfs_fileoff_t	orig_offset;
> > +	xfs_extlen_t	orig_length;
> >  	xfs_extlen_t	blen;
> >  	xfs_extlen_t	nextminlen = 0;
> >  	int		nullfb;		/* true if ap->firstblock isn't set */
> > @@ -3529,6 +3543,8 @@ xfs_bmap_btalloc(
> >  	int		stripe_align;
> >  
> >  	ASSERT(ap->length);
> > +	orig_offset = ap->offset;
> > +	orig_length = ap->length;
> >  
> >  	mp = ap->ip->i_mount;
> >  
> > @@ -3745,7 +3761,8 @@ xfs_bmap_btalloc(
> >  		ASSERT(nullfb || fb_agno <= args.agno);
> >  		ap->length = args.len;
> >  		if (ap->flags & XFS_BMAPI_COWFORK) {
> > -			xfs_bmap_btalloc_cow(ap, &args);
> > +			xfs_bmap_btalloc_cow(ap, &args, orig_offset,
> > +					orig_length);
> >  		} else {
> >  			ap->ip->i_d.di_nblocks += args.len;
> >  			xfs_trans_log_inode(ap->tp, ap->ip, XFS_ILOG_CORE);
> > 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 08/11] xfs: fix tracepoint %p formats
  2018-01-25 18:47     ` Darrick J. Wong
@ 2018-01-26  0:19       ` Darrick J. Wong
  2018-01-26  9:09         ` Christoph Hellwig
  0 siblings, 1 reply; 60+ messages in thread
From: Darrick J. Wong @ 2018-01-26  0:19 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Jan 25, 2018 at 10:47:04AM -0800, Darrick J. Wong wrote:
> On Thu, Jan 25, 2018 at 12:31:28PM -0500, Brian Foster wrote:
> > On Tue, Jan 23, 2018 at 06:18:47PM -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > Tracepoint printk doesn't have any of the %p suffixes, so use %p.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > 
> > I see different behavior with this. E.g.,
> > 
> >   umount-1130  [003] ...1  1995.947789: xfs_log_force: dev 253:3 lsn 0x0 caller xfs_log_quiesce+0x3c/0x4b0 [xfs]
> > 
> > vs.
> > 
> >   umount-1272  [002] ...1  2089.445135: xfs_log_force: dev 253:3 lsn 0x0 caller 00000000937cbc85
> 
> Hmm, on my system all I get is:
> 
> mount-3125  [000]  1634.386726: xfs_buf_submit_wait:  dev 8:0 bno
> 	0x4b0020 nblks 0x8 hold 1 pincount 0 lock 0 flags READ|PAGES caller
> 	0xffffffffa06cca52S
> 
> ...which is odd since they all map to the vsnprintf implementation, so
> this ought to work.
> 
> # trace-cmd record -e 'xfs_buf*' -F mount /dev/sda
> # trace-cmd report
> 
> (I don't see anything in the trace-cmd-report manpage about "resolve
> symbolic addresses" but maybe I just have an old version... or maybe
> we're just using different tools?)

For anyone following along at home, trace-cmd report does not resolve
instruction pointer addrs to name+offset tuples; for that you have to
drain the raw output:

# trace-cmd start <same arguments as record>
# cat /sys/kernel/debug/tracing/trace_pipe
# trace-cmd stop

--D
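
To illustrate what a "name+offset" tuple is, here is a minimal userspace
sketch of the lookup that turns an instruction pointer into the
name+offset/size form seen in Brian's xfs_log_force output. The symbol
table, its addresses, and format_caller() are hypothetical; the kernel
does this with its own symbol tables, and this sketch leaves out the
"[xfs]" module suffix.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct sym {
	uint64_t	start;	/* first address of the function */
	uint64_t	size;	/* size of the function in bytes */
	const char	*name;
};

/* Hypothetical symbol table; the addresses are made up. */
const struct sym symtab[] = {
	{ 0x1000, 0x200, "xfs_buf_read" },
	{ 0x2000, 0x4b0, "xfs_log_quiesce" },
};

/*
 * Format ip as "name+0xoff/0xsize" if it falls inside a known symbol,
 * otherwise fall back to the raw hex address.  Returns what snprintf()
 * returns.
 */
int format_caller(char *buf, size_t len, const struct sym *tab,
		  size_t nsyms, uint64_t ip)
{
	size_t i;

	assert(buf != NULL && len > 0);

	for (i = 0; i < nsyms; i++) {
		if (ip >= tab[i].start && ip - tab[i].start < tab[i].size)
			return snprintf(buf, len, "%s+0x%llx/0x%llx",
					tab[i].name,
					(unsigned long long)(ip - tab[i].start),
					(unsigned long long)tab[i].size);
	}
	return snprintf(buf, len, "0x%llx", (unsigned long long)ip);
}
```

The linear scan keeps the sketch short; a real symbolizer would search a
sorted table.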

> 
> > Expected?
> 
> No, not at all.  But since it clearly works on your system, I'll call
> you fortunate and drop this patch. :)
> 
> --D
> 
> > 
> > Brian
> > 
> > >  fs/xfs/scrub/trace.h |   20 ++++++++++----------
> > >  fs/xfs/xfs_trace.h   |   24 ++++++++++++------------
> > >  2 files changed, 22 insertions(+), 22 deletions(-)
> > > 
> > > 
> > > diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
> > > index a0a6d3c..732775f 100644
> > > --- a/fs/xfs/scrub/trace.h
> > > +++ b/fs/xfs/scrub/trace.h
> > > @@ -90,7 +90,7 @@ TRACE_EVENT(xfs_scrub_op_error,
> > >  		__entry->error = error;
> > >  		__entry->ret_ip = ret_ip;
> > >  	),
> > > -	TP_printk("dev %d:%d type %u agno %u agbno %u error %d ret_ip %pS",
> > > +	TP_printk("dev %d:%d type %u agno %u agbno %u error %d ret_ip %p",
> > >  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > >  		  __entry->type,
> > >  		  __entry->agno,
> > > @@ -121,7 +121,7 @@ TRACE_EVENT(xfs_scrub_file_op_error,
> > >  		__entry->error = error;
> > >  		__entry->ret_ip = ret_ip;
> > >  	),
> > > -	TP_printk("dev %d:%d ino %llu fork %d type %u offset %llu error %d ret_ip %pS",
> > > +	TP_printk("dev %d:%d ino %llu fork %d type %u offset %llu error %d ret_ip %p",
> > >  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > >  		  __entry->ino,
> > >  		  __entry->whichfork,
> > > @@ -156,7 +156,7 @@ DECLARE_EVENT_CLASS(xfs_scrub_block_error_class,
> > >  		__entry->bno = bno;
> > >  		__entry->ret_ip = ret_ip;
> > >  	),
> > > -	TP_printk("dev %d:%d type %u agno %u agbno %u ret_ip %pS",
> > > +	TP_printk("dev %d:%d type %u agno %u agbno %u ret_ip %p",
> > >  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > >  		  __entry->type,
> > >  		  __entry->agno,
> > > @@ -207,7 +207,7 @@ DECLARE_EVENT_CLASS(xfs_scrub_ino_error_class,
> > >  		__entry->bno = bno;
> > >  		__entry->ret_ip = ret_ip;
> > >  	),
> > > -	TP_printk("dev %d:%d ino %llu type %u agno %u agbno %u ret_ip %pS",
> > > +	TP_printk("dev %d:%d ino %llu type %u agno %u agbno %u ret_ip %p",
> > >  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > >  		  __entry->ino,
> > >  		  __entry->type,
> > > @@ -246,7 +246,7 @@ DECLARE_EVENT_CLASS(xfs_scrub_fblock_error_class,
> > >  		__entry->offset = offset;
> > >  		__entry->ret_ip = ret_ip;
> > >  	),
> > > -	TP_printk("dev %d:%d ino %llu fork %d type %u offset %llu ret_ip %pS",
> > > +	TP_printk("dev %d:%d ino %llu fork %d type %u offset %llu ret_ip %p",
> > >  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > >  		  __entry->ino,
> > >  		  __entry->whichfork,
> > > @@ -277,7 +277,7 @@ TRACE_EVENT(xfs_scrub_incomplete,
> > >  		__entry->type = sc->sm->sm_type;
> > >  		__entry->ret_ip = ret_ip;
> > >  	),
> > > -	TP_printk("dev %d:%d type %u ret_ip %pS",
> > > +	TP_printk("dev %d:%d type %u ret_ip %p",
> > >  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > >  		  __entry->type,
> > >  		  __entry->ret_ip)
> > > @@ -311,7 +311,7 @@ TRACE_EVENT(xfs_scrub_btree_op_error,
> > >  		__entry->error = error;
> > >  		__entry->ret_ip = ret_ip;
> > >  	),
> > > -	TP_printk("dev %d:%d type %u btnum %d level %d ptr %d agno %u agbno %u error %d ret_ip %pS",
> > > +	TP_printk("dev %d:%d type %u btnum %d level %d ptr %d agno %u agbno %u error %d ret_ip %p",
> > >  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > >  		  __entry->type,
> > >  		  __entry->btnum,
> > > @@ -354,7 +354,7 @@ TRACE_EVENT(xfs_scrub_ifork_btree_op_error,
> > >  		__entry->error = error;
> > >  		__entry->ret_ip = ret_ip;
> > >  	),
> > > -	TP_printk("dev %d:%d ino %llu fork %d type %u btnum %d level %d ptr %d agno %u agbno %u error %d ret_ip %pS",
> > > +	TP_printk("dev %d:%d ino %llu fork %d type %u btnum %d level %d ptr %d agno %u agbno %u error %d ret_ip %p",
> > >  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > >  		  __entry->ino,
> > >  		  __entry->whichfork,
> > > @@ -393,7 +393,7 @@ TRACE_EVENT(xfs_scrub_btree_error,
> > >  		__entry->ptr = cur->bc_ptrs[level];
> > >  		__entry->ret_ip = ret_ip;
> > >  	),
> > > -	TP_printk("dev %d:%d type %u btnum %d level %d ptr %d agno %u agbno %u ret_ip %pS",
> > > +	TP_printk("dev %d:%d type %u btnum %d level %d ptr %d agno %u agbno %u ret_ip %p",
> > >  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > >  		  __entry->type,
> > >  		  __entry->btnum,
> > > @@ -433,7 +433,7 @@ TRACE_EVENT(xfs_scrub_ifork_btree_error,
> > >  		__entry->ptr = cur->bc_ptrs[level];
> > >  		__entry->ret_ip = ret_ip;
> > >  	),
> > > -	TP_printk("dev %d:%d ino %llu fork %d type %u btnum %d level %d ptr %d agno %u agbno %u ret_ip %pS",
> > > +	TP_printk("dev %d:%d ino %llu fork %d type %u btnum %d level %d ptr %d agno %u agbno %u ret_ip %p",
> > >  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > >  		  __entry->ino,
> > >  		  __entry->whichfork,
> > > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > > index 945de08..893081e 100644
> > > --- a/fs/xfs/xfs_trace.h
> > > +++ b/fs/xfs/xfs_trace.h
> > > @@ -119,7 +119,7 @@ DECLARE_EVENT_CLASS(xfs_perag_class,
> > >  		__entry->refcount = refcount;
> > >  		__entry->caller_ip = caller_ip;
> > >  	),
> > > -	TP_printk("dev %d:%d agno %u refcount %d caller %pS",
> > > +	TP_printk("dev %d:%d agno %u refcount %d caller %p",
> > >  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > >  		  __entry->agno,
> > >  		  __entry->refcount,
> > > @@ -252,7 +252,7 @@ DECLARE_EVENT_CLASS(xfs_bmap_class,
> > >  		__entry->caller_ip = caller_ip;
> > >  	),
> > >  	TP_printk("dev %d:%d ino 0x%llx state %s cur %p/%d "
> > > -		  "offset %lld block %lld count %lld flag %d caller %pS",
> > > +		  "offset %lld block %lld count %lld flag %d caller %p",
> > >  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > >  		  __entry->ino,
> > >  		  __print_flags(__entry->bmap_state, "|", XFS_BMAP_EXT_FLAGS),
> > > @@ -301,7 +301,7 @@ DECLARE_EVENT_CLASS(xfs_buf_class,
> > >  		__entry->caller_ip = caller_ip;
> > >  	),
> > >  	TP_printk("dev %d:%d bno 0x%llx nblks 0x%x hold %d pincount %d "
> > > -		  "lock %d flags %s caller %pS",
> > > +		  "lock %d flags %s caller %p",
> > >  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > >  		  (unsigned long long)__entry->bno,
> > >  		  __entry->nblks,
> > > @@ -370,7 +370,7 @@ DECLARE_EVENT_CLASS(xfs_buf_flags_class,
> > >  		__entry->caller_ip = caller_ip;
> > >  	),
> > >  	TP_printk("dev %d:%d bno 0x%llx len 0x%zx hold %d pincount %d "
> > > -		  "lock %d flags %s caller %pS",
> > > +		  "lock %d flags %s caller %p",
> > >  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > >  		  (unsigned long long)__entry->bno,
> > >  		  __entry->buffer_length,
> > > @@ -415,7 +415,7 @@ TRACE_EVENT(xfs_buf_ioerror,
> > >  		__entry->caller_ip = caller_ip;
> > >  	),
> > >  	TP_printk("dev %d:%d bno 0x%llx len 0x%zx hold %d pincount %d "
> > > -		  "lock %d error %d flags %s caller %pS",
> > > +		  "lock %d error %d flags %s caller %p",
> > >  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > >  		  (unsigned long long)__entry->bno,
> > >  		  __entry->buffer_length,
> > > @@ -579,7 +579,7 @@ DECLARE_EVENT_CLASS(xfs_lock_class,
> > >  		__entry->lock_flags = lock_flags;
> > >  		__entry->caller_ip = caller_ip;
> > >  	),
> > > -	TP_printk("dev %d:%d ino 0x%llx flags %s caller %pS",
> > > +	TP_printk("dev %d:%d ino 0x%llx flags %s caller %p",
> > >  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > >  		  __entry->ino,
> > >  		  __print_flags(__entry->lock_flags, "|", XFS_LOCK_FLAGS),
> > > @@ -697,7 +697,7 @@ DECLARE_EVENT_CLASS(xfs_iref_class,
> > >  		__entry->pincount = atomic_read(&ip->i_pincount);
> > >  		__entry->caller_ip = caller_ip;
> > >  	),
> > > -	TP_printk("dev %d:%d ino 0x%llx count %d pincount %d caller %pS",
> > > +	TP_printk("dev %d:%d ino 0x%llx count %d pincount %d caller %p",
> > >  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > >  		  __entry->ino,
> > >  		  __entry->count,
> > > @@ -1049,7 +1049,7 @@ TRACE_EVENT(xfs_log_force,
> > >  		__entry->lsn = lsn;
> > >  		__entry->caller_ip = caller_ip;
> > >  	),
> > > -	TP_printk("dev %d:%d lsn 0x%llx caller %pS",
> > > +	TP_printk("dev %d:%d lsn 0x%llx caller %p",
> > >  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > >  		  __entry->lsn, (void *)__entry->caller_ip)
> > >  )
> > > @@ -1403,7 +1403,7 @@ TRACE_EVENT(xfs_bunmap,
> > >  		__entry->flags = flags;
> > >  	),
> > >  	TP_printk("dev %d:%d ino 0x%llx size 0x%llx bno 0x%llx len 0x%llx"
> > > -		  "flags %s caller %pS",
> > > +		  "flags %s caller %p",
> > >  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > >  		  __entry->ino,
> > >  		  __entry->size,
> > > @@ -1517,7 +1517,7 @@ TRACE_EVENT(xfs_agf,
> > >  	),
> > >  	TP_printk("dev %d:%d agno %u flags %s length %u roots b %u c %u "
> > >  		  "levels b %u c %u flfirst %u fllast %u flcount %u "
> > > -		  "freeblks %u longest %u caller %pS",
> > > +		  "freeblks %u longest %u caller %p",
> > >  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > >  		  __entry->agno,
> > >  		  __print_flags(__entry->flags, "|", XFS_AGF_FLAGS),
> > > @@ -2486,7 +2486,7 @@ DECLARE_EVENT_CLASS(xfs_ag_error_class,
> > >  		__entry->error = error;
> > >  		__entry->caller_ip = caller_ip;
> > >  	),
> > > -	TP_printk("dev %d:%d agno %u error %d caller %pS",
> > > +	TP_printk("dev %d:%d agno %u error %d caller %p",
> > >  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > >  		  __entry->agno,
> > >  		  __entry->error,
> > > @@ -2977,7 +2977,7 @@ DECLARE_EVENT_CLASS(xfs_inode_error_class,
> > >  		__entry->error = error;
> > >  		__entry->caller_ip = caller_ip;
> > >  	),
> > > -	TP_printk("dev %d:%d ino %llx error %d caller %pS",
> > > +	TP_printk("dev %d:%d ino %llx error %d caller %p",
> > >  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > >  		  __entry->ino,
> > >  		  __entry->error,
> > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 01/11] xfs: reflink should break pnfs leases before sharing blocks
  2018-01-24  2:18 ` [PATCH 01/11] xfs: reflink should break pnfs leases before sharing blocks Darrick J. Wong
  2018-01-24 14:16   ` Brian Foster
@ 2018-01-26  9:06   ` Christoph Hellwig
  2018-01-26 18:26     ` Darrick J. Wong
  1 sibling, 1 reply; 60+ messages in thread
From: Christoph Hellwig @ 2018-01-26  9:06 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Jan 23, 2018 at 06:18:03PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Before we share blocks between files, we need to break the pnfs leases
> on the layout before we start slicing and dicing the block map.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/xfs_reflink.c |   48 +++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 47 insertions(+), 1 deletion(-)
> 
> 
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index 47aea2e..f89a725 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -1245,6 +1245,50 @@ xfs_reflink_remap_blocks(
>  }
>  
>  /*
> + * Grab the exclusive iolock for a data copy from src to dest, making
> + * sure to abide vfs locking order (lowest pointer value goes first) and
> + * breaking the pnfs layout leases on dest before proceeding.  The loop
> + * is needed because we cannot call the blocking break_layout() with the
> + * src iolock held, and therefore have to back out both locks.
> + */
> +static int
> +xfs_iolock_two_inodes_and_break_layout(
> +	struct inode		*src,
> +	struct inode		*dest)
> +{
> +	bool			src_first = src < dest;
> +	bool			src_last = src > dest;

I find the double predicates here highly confusing.

Also the code doesn't seem to handle the src == dest case as
far as I can tell.

> +retry:
> +	if (src_first) {
> +		inode_lock(src);
> +		inode_lock_nested(dest, I_MUTEX_NONDIR2);
> +	} else {
> +		inode_lock(dest);
> +	}

Shouldn't this be replaced by a call to lock_two_nondirectories?
Even if that holds both locks over the non-blocking break_layout
it makes things a lot simpler and only does an additional roundtrip
for the layouts outstanding slow path.

> +	error = break_layout(dest, false);
> +	if (error == -EWOULDBLOCK) {
> +		inode_unlock(dest);
> +		if (src_first)
> +			inode_unlock(src);

unlock_two_nondirectories?

> +		error = break_layout(dest, true);
> +		if (error)
> +			return error;
> +		goto retry;
> +	} else if (error) {

no need for an else after a goto.

> +		inode_unlock(dest);
> +		if (src_first)
> +			inode_unlock(src);

unlock_two_nondirectories?

Also seems like this could be simplified to:

	if (error) {
		unlock_two_nondirectories(src, dest);
		if (error == -EWOULDBLOCK)
			goto retry;
		return error;
	}

So I guess the whole thing could simply become something like:

retry:
	lock_two_nondirectories(src, dest);
	error = break_layout(dest, false);
	if (error) {
		unlock_two_nondirectories(src, dest);
		if (error == -EWOULDBLOCK)
			goto retry;
		return error;
	}

and could probably just be inlined into the caller..
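
The lock/break/retry shape suggested above can be modelled in userspace.
This is a toy sketch only: struct toy_inode and every toy_* function are
hypothetical stand-ins, not the VFS helpers. It assumes a non-blocking
break attempt fails with -EWOULDBLOCK while leases are outstanding, and
it keeps the blocking break from the original patch between unlock and
retry. The lock helper takes the lowest address first and skips the
second lock when src == dest.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy inode: 'locked' models the inode lock, 'leases' outstanding layouts. */
struct toy_inode {
	bool	locked;
	int	leases;
};

void toy_lock_two(struct toy_inode *a, struct toy_inode *b)
{
	/* Lowest address first, so all callers agree on the lock order. */
	if ((uintptr_t)a > (uintptr_t)b) {
		struct toy_inode *tmp = a;

		a = b;
		b = tmp;
	}
	assert(!a->locked);
	a->locked = true;
	if (a != b) {
		assert(!b->locked);
		b->locked = true;
	}
}

void toy_unlock_two(struct toy_inode *a, struct toy_inode *b)
{
	a->locked = false;
	if (a != b)
		b->locked = false;
}

/* Non-blocking attempts fail while leases remain; blocking ones clear them. */
int toy_break_layout(struct toy_inode *ip, bool wait)
{
	if (ip->leases == 0)
		return 0;
	if (!wait)
		return -EWOULDBLOCK;
	assert(!ip->locked);	/* toy rule: never wait with the lock held */
	ip->leases = 0;
	return 0;
}

/* Returns 0 with both inodes locked and no leases left on dest. */
int toy_lock_and_break_layout(struct toy_inode *src, struct toy_inode *dest)
{
	int error;

retry:
	toy_lock_two(src, dest);
	error = toy_break_layout(dest, false);
	if (error) {
		toy_unlock_two(src, dest);
		if (error != -EWOULDBLOCK)
			return error;
		/* Slow path: wait for the leases with no locks held. */
		error = toy_break_layout(dest, true);
		if (error)
			return error;
		goto retry;
	}
	return 0;
}
```

The slow path costs one extra lock/unlock round trip, which is the
trade-off the review comment accepts in exchange for simpler code.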

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 03/11] xfs: call xfs_qm_dqattach before performing reflink operations
  2018-01-24  2:18 ` [PATCH 03/11] xfs: call xfs_qm_dqattach before performing reflink operations Darrick J. Wong
  2018-01-24 14:18   ` Brian Foster
@ 2018-01-26  9:07   ` Christoph Hellwig
  1 sibling, 0 replies; 60+ messages in thread
From: Christoph Hellwig @ 2018-01-26  9:07 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 07/11] xfs: always zero di_flags2 when we free the inode
  2018-01-24  2:18 ` [PATCH 07/11] xfs: always zero di_flags2 when we free the inode Darrick J. Wong
  2018-01-25 17:31   ` Brian Foster
@ 2018-01-26  9:08   ` Christoph Hellwig
  1 sibling, 0 replies; 60+ messages in thread
From: Christoph Hellwig @ 2018-01-26  9:08 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 08/11] xfs: fix tracepoint %p formats
  2018-01-26  0:19       ` Darrick J. Wong
@ 2018-01-26  9:09         ` Christoph Hellwig
  0 siblings, 0 replies; 60+ messages in thread
From: Christoph Hellwig @ 2018-01-26  9:09 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, linux-xfs

Please fix trace-cmd instead of losing the specifiers for the kernel
trace buffer.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 09/11] xfs: make tracepoint inode number format consistent
  2018-01-24  2:18 ` [PATCH 09/11] xfs: make tracepoint inode number format consistent Darrick J. Wong
  2018-01-25 17:31   ` Brian Foster
@ 2018-01-26  9:09   ` Christoph Hellwig
  1 sibling, 0 replies; 60+ messages in thread
From: Christoph Hellwig @ 2018-01-26  9:09 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 10/11] xfs: refactor inode verifier corruption error printing
  2018-01-24  2:19 ` [PATCH 10/11] xfs: refactor inode verifier corruption error printing Darrick J. Wong
  2018-01-25 17:31   ` Brian Foster
@ 2018-01-26  9:10   ` Christoph Hellwig
  1 sibling, 0 replies; 60+ messages in thread
From: Christoph Hellwig @ 2018-01-26  9:10 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 11/11] xfs: don't clobber inobt/finobt cursors when xref with rmap
  2018-01-24  2:19 ` [PATCH 11/11] xfs: don't clobber inobt/finobt cursors when xref with rmap Darrick J. Wong
@ 2018-01-26  9:10   ` Christoph Hellwig
  0 siblings, 0 replies; 60+ messages in thread
From: Christoph Hellwig @ 2018-01-26  9:10 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 06/11] xfs: fix up cowextsz allocation shortfalls
  2018-01-24  2:18 ` [PATCH 06/11] xfs: fix up cowextsz allocation shortfalls Darrick J. Wong
  2018-01-25 17:31   ` Brian Foster
@ 2018-01-26  9:11   ` Christoph Hellwig
  1 sibling, 0 replies; 60+ messages in thread
From: Christoph Hellwig @ 2018-01-26  9:11 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

> diff --git a/fs/iomap.c b/fs/iomap.c
> index e5de772..aec35a0 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -63,7 +63,7 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
>  	ret = ops->iomap_begin(inode, pos, length, flags, &iomap);
>  	if (ret)
>  		return ret;
> -	if (WARN_ON(iomap.offset > pos))
> +	if (WARN_ON(iomap.offset > pos) || WARN_ON(iomap.length == 0))
>  		return -EIO;
>  
>  	/*

Please split this into a separate patch.

Otherwise this looks fine:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 02/11] xfs: only grab shared inode locks for source file during reflink
  2018-01-24  2:18 ` [PATCH 02/11] xfs: only grab shared inode locks for source file during reflink Darrick J. Wong
  2018-01-24 14:18   ` Brian Foster
@ 2018-01-26 12:07   ` Christoph Hellwig
  2018-01-26 18:48     ` Darrick J. Wong
  2018-01-27  3:32     ` Dave Chinner
  1 sibling, 2 replies; 60+ messages in thread
From: Christoph Hellwig @ 2018-01-26 12:07 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

> +xfs_lock_two_inodes_separately(
> +	struct xfs_inode	*ip0,
> +	uint			ip0_mode,
> +	struct xfs_inode	*ip1,
> +	uint			ip1_mode)

We only have 6 calls to xfs_lock_two_inodes in total, so just update
the signature to take two modes and be done with it.

Also how about mode1 and mode2 for the argument names?

>  	lp = (xfs_log_item_t *)ip0->i_itemp;
>  	if (lp && (lp->li_flags & XFS_LI_IN_AIL)) {
> -		if (!xfs_ilock_nowait(ip1, xfs_lock_inumorder(lock_mode, 1))) {
> -			xfs_iunlock(ip0, lock_mode);
> +		if (!xfs_ilock_nowait(ip1, xfs_lock_inumorder(ip1_mode, 1))) {
> +			xfs_iunlock(ip0, ip0_mode);
>  			if ((++attempts % 5) == 0)
>  				delay(1); /* Don't just spin the CPU */
>  			goto again;
>  		}
>  	} else {
> -		xfs_ilock(ip1, xfs_lock_inumorder(lock_mode, 1));
> +		xfs_ilock(ip1, xfs_lock_inumorder(ip1_mode, 1));
>  	}
>  }

Not directly related to your patch, but the nowait + retry
mess must go away.

I think we need to move to the VFS locking conventions, that is
based on ancestors for directories (see lock_rename) and otherwise
based on the struct inode address as in lock_two_nondirectories.

>       if (src_last)
> -             inode_lock_nested(src, I_MUTEX_NONDIR2);
> +             down_read_nested(&src->i_rwsem, I_MUTEX_NONDIR2);

Why is this not using inode_lock_nested any more?

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 04/11] xfs: CoW fork operations should only update quota reservations
  2018-01-25  1:20   ` [PATCH v2 " Darrick J. Wong
  2018-01-25 13:03     ` Brian Foster
@ 2018-01-26 12:12     ` Christoph Hellwig
  1 sibling, 0 replies; 60+ messages in thread
From: Christoph Hellwig @ 2018-01-26 12:12 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, Brian Foster

Just getting up to speed on this patch, so only a few cosmetic
comments so far:

> +		if (ap->flags & XFS_BMAPI_COWFORK) {
> +			xfs_bmap_btalloc_quota_cow(ap, &args);
> +		} else {
> +			ap->ip->i_d.di_nblocks += args.len;
> +			xfs_trans_log_inode(ap->tp, ap->ip, XFS_ILOG_CORE);
> +			/*
> +			 * Adjust the disk quota also. This was reserved
> +			 * earlier.
> +			 */
> +			xfs_trans_mod_dquot_byino(ap->tp, ap->ip,
> +				ap->wasdel ? XFS_TRANS_DQ_DELBCOUNT :
> +						XFS_TRANS_DQ_BCOUNT,
> +				(long) args.len);
> +		}

Shouldn't we instead have a helper for the whole above section?

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 05/11] xfs: track CoW blocks separately in the inode
  2018-01-24  2:18 ` [PATCH 05/11] xfs: track CoW blocks separately in the inode Darrick J. Wong
  2018-01-25 13:06   ` Brian Foster
@ 2018-01-26 12:15   ` Christoph Hellwig
  2018-01-26 19:00     ` Darrick J. Wong
  1 sibling, 1 reply; 60+ messages in thread
From: Christoph Hellwig @ 2018-01-26 12:15 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Jan 23, 2018 at 06:18:29PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Track the number of blocks reserved in the CoW fork so that we can
> move the quota reservations whenever we chown, and don't account for
> CoW fork delalloc reservations in i_delayed_blks.  This should make
> chown work properly for quota reservations, enables us to fully
> account for real extents in the cow fork in the file stat info, and
> improves the post-eof scanning decisions because we're no longer
> confusing data fork delalloc extents with cow fork delalloc extents.

Just curious:  is there any good reason we can't just have an
i_extra_blocks field for the delayed and cow blocks?  Or is there
a place where we care about the difference between the two?

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 12/11] xfs: refactor quota code in xfs_bmap_btalloc
  2018-01-25  5:26 ` [PATCH 12/11] xfs: refactor quota code in xfs_bmap_btalloc Darrick J. Wong
@ 2018-01-26 12:17   ` Christoph Hellwig
  2018-01-26 21:46     ` Darrick J. Wong
  0 siblings, 1 reply; 60+ messages in thread
From: Christoph Hellwig @ 2018-01-26 12:17 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Jan 24, 2018 at 09:26:47PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Since we now have a dedicated function for dealing with CoW allocation
> related quota updates in xfs_bmap_btalloc, we might as well refactor the
> data/attr fork quota update into its own function too.

Any good reason not to have this merged with the cow fork side helper?

> +	/*
> +	 * Adjust the disk quota also. This was reserved
> +	 * earlier.
> +	 */

Please use the full available line length for comments.

> +	xfs_trans_mod_dquot_byino(ap->tp, ap->ip,
> +		ap->wasdel ? XFS_TRANS_DQ_DELBCOUNT : XFS_TRANS_DQ_BCOUNT,
> +		(long) args.len);

I don't think we need this cast.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 04/11] xfs: CoW fork operations should only update quota reservations
  2018-01-25 18:20       ` Darrick J. Wong
@ 2018-01-26 13:02         ` Brian Foster
  2018-01-26 18:40           ` Darrick J. Wong
  0 siblings, 1 reply; 60+ messages in thread
From: Brian Foster @ 2018-01-26 13:02 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Jan 25, 2018 at 10:20:03AM -0800, Darrick J. Wong wrote:
> On Thu, Jan 25, 2018 at 08:03:53AM -0500, Brian Foster wrote:
> > On Wed, Jan 24, 2018 at 05:20:35PM -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > Since the CoW fork only exists in memory, it is incorrect to update the
> > > on-disk quota block counts when we modify the CoW fork.  Unlike the data
> > > fork, even real extents in the CoW fork are only reservations (on-disk
> > > they're owned by the refcountbt) so they must not be tracked in the on
> > > disk quota info.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > > v2: make documentation more crisp and to the point
> > > ---
> > >  fs/xfs/libxfs/xfs_bmap.c |  118 ++++++++++++++++++++++++++++++++++++++++++----
> > >  fs/xfs/xfs_quota.h       |   14 ++++-
> > >  fs/xfs/xfs_reflink.c     |    8 ++-
> > >  3 files changed, 122 insertions(+), 18 deletions(-)
> > > 
...
> > > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > > index 82abff6..e367351 100644
> > > --- a/fs/xfs/xfs_reflink.c
> > > +++ b/fs/xfs/xfs_reflink.c
> > > @@ -599,10 +599,6 @@ xfs_reflink_cancel_cow_blocks(
> > >  					del.br_startblock, del.br_blockcount,
> > >  					NULL);
> > >  
> > > -			/* Update quota accounting */
> > > -			xfs_trans_mod_dquot_byino(*tpp, ip, XFS_TRANS_DQ_BCOUNT,
> > > -					-(long)del.br_blockcount);
> > > -
> > >  			/* Roll the transaction */
> > >  			xfs_defer_ijoin(&dfops, ip);
> > >  			error = xfs_defer_finish(tpp, &dfops);
> > > @@ -795,6 +791,10 @@ xfs_reflink_end_cow(
> > >  		if (error)
> > >  			goto out_defer;
> > >  
> > > +		/* Charge this new data fork mapping to the on-disk quota. */
> > > +		xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT,
> > > +				(long)del.br_blockcount);
> > > +
> > 
> > Should this technically be XFS_TRANS_DQ_DELBCOUNT? The blocks obviously
> > aren't delalloc and this transaction doesn't make a quota reservation so
> > I don't think it screws up accounting. But if the transaction did make a
> > quota reservation, it seems like this would account the extent against
> > the tx reservation where it instead should recognize that cow blocks
> > have already been reserved (which is essentially what DELBCOUNT means,
> > IIUC).
> 
> Hmmm, there's a subtlety here -- we're opencoding what DELBCOUNT does,
> because the subsequent xfs_bmap_del_extent_cow unconditionally reduces
> the in-core reservation after we've mapped in the extent as if it had
> been accounted as a real extent all along.  But considering all the
> blather about how cow fork blocks are treated as incore reservations, it
> does look funny, doesn't it?
> 

Ok.. I missed that the end/del cases were tied together, then reconfused
myself over the accounting in the end_cow() path (re: our irc chat
yesterday) when reassessing that bit. So to reset my brain, we have the
following with this current patch:

- cow reserve does a delalloc and in-core dquot reservation
- cow real alloc either skips dquot adjustment if wasdel, else reduces
  the quota res acquired by the transaction by the size of the alloc[1].
  Either way we leave around an in-core quota reservation as if the blocks
  remained delalloc.
- A cancel at this point simply kills the in-core dquot reservation
  along with the cow fork blocks.
- end_cow() unmaps the current data fork blocks and decrements
  associated real quota usage (tx), remaps the cow blocks and increments
  real quota usage (tx), then kills off the in-core dquot reservation.

[1] Would this even be necessary if we just acquired a delalloc-like
reservation in xfs_reflink_allocate_cow() rather than associate the
reservation with the transaction in the first place (assuming we have
enough information to cover error handling, extent manipulations and
whatnot)?

When the tx commits, this essentially has the effect of applying the
bcount delta to both the on-disk dquot and the in-core res. The former
reflects the change in the file on-disk and the latter is rectified
because the field accounts for the current real usage plus outstanding
reservation. The original cowblocks res has been dropped directly, so
the bcount delta reflects the change to the data fork.

If we instead use delbcount in end_cow(), we're telling the transaction
to drop bcount by whatever old data fork blocks were removed and that
we've converted N delalloc (cow fork, actually) blocks that already had
in-core reservation. Therefore, transaction commit updates the on-disk
dquot just the same (-dataforkblocks + delallocblocks), but delbcount
blocks have already updated the in-core dquot res so the transaction has
nothing else to do there (and so we must also not remove that
reservation in del_cow()). This approach does seem like it requires a
bit less mental gymnastics to follow because it more closely resembles
delalloc quota accounting. ;)

Another thing that I'm not sure has been considered here is whether
doing the bcount delta in the transaction and dropping the cowblocks res
from the dquot directly leaves a race window where the quota can overrun
a limit. E.g., since the transaction has to up the in-core res in the
original example at commit time, is there anything that locks out
further external reservation from the dquot between the time the in-core
res is dropped and the transaction commits?
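
The two end_cow() accounting variants described above can be compared
with a toy model. This is a sketch under stated assumptions, not the XFS
quota code: struct toy_dquot, struct toy_tx and all toy_* functions are
hypothetical. It assumes that at commit the on-disk count takes both
deltas while the in-core counter takes only the bcount delta, which is
how the behaviour is described in this mail. Each end_cow helper returns
the in-core value as seen just before commit.

```c
#include <assert.h>

/* 'bcount' models the on-disk count; 'res' is real usage + reservations. */
struct toy_dquot {
	long	bcount;
	long	res;
};

/* Deltas that only reach the dquot when the transaction commits. */
struct toy_tx {
	long	bcount_delta;		/* blocks charged or freed */
	long	delbcount_delta;	/* blocks that were already reserved */
};

void toy_cow_reserve(struct toy_dquot *dq, long nblocks)
{
	dq->res += nblocks;		/* in-core only, nothing on disk */
}

void toy_cow_cancel(struct toy_dquot *dq, long nblocks)
{
	assert(dq->res >= nblocks);
	dq->res -= nblocks;
}

void toy_tx_commit(struct toy_dquot *dq, const struct toy_tx *tx)
{
	dq->bcount += tx->bcount_delta + tx->delbcount_delta;
	dq->res += tx->bcount_delta;	/* delbcount is reserved already */
}

/* Variant 1: charge via bcount and drop the CoW reservation directly. */
long toy_end_cow_bcount(struct toy_dquot *dq, long old_blocks, long cow_blocks)
{
	struct toy_tx tx = { 0, 0 };
	long res_before_commit;

	tx.bcount_delta -= old_blocks;	/* unmap old data fork blocks */
	tx.bcount_delta += cow_blocks;	/* remap the CoW blocks */
	dq->res -= cow_blocks;		/* direct drop, outside the tx */
	res_before_commit = dq->res;
	toy_tx_commit(dq, &tx);
	return res_before_commit;
}

/* Variant 2: charge via delbcount and leave the reservation in place. */
long toy_end_cow_delbcount(struct toy_dquot *dq, long old_blocks,
			   long cow_blocks)
{
	struct toy_tx tx = { 0, 0 };
	long res_before_commit;

	tx.bcount_delta -= old_blocks;
	tx.delbcount_delta += cow_blocks;
	res_before_commit = dq->res;
	toy_tx_commit(dq, &tx);
	return res_before_commit;
}
```

In this model both variants end with the same bcount and res. They
differ only before commit: when fewer old blocks are unmapped than CoW
blocks are remapped, variant 1 briefly shows a res below its final
value, and variant 2 does not. Whether the real code has a matching
window is the question asked above; the model does not answer it.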

> So perhaps the solution is to pass intent into xfs_bmap_del_extent_cow:
> if we're calling it from _end_cow then we want to hang on to the
> reservation so that delbcount can do its thing, but if we're calling
> from _cancel_cow then we're dumping the extent and reservation.
> 

Indeed. But since those are the only callers and we'd already update
delbcount from end_cow(), could we not just lift the del_cow() decrement
into the cancel_cow() function? FWIW, some extra comments around quota
manipulation in the reflink functions would also be useful for future
reference.

Brian

> --D
> 
> > 
> > Other than that the code seems Ok to me.
> > 
> > Brian
> > 
> > >  		/* Remove the mapping from the CoW fork. */
> > >  		xfs_bmap_del_extent_cow(ip, &icur, &got, &del);
> > >  
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 05/11] xfs: track CoW blocks separately in the inode
  2018-01-25 19:21     ` Darrick J. Wong
@ 2018-01-26 13:04       ` Brian Foster
  2018-01-26 19:08         ` Darrick J. Wong
  0 siblings, 1 reply; 60+ messages in thread
From: Brian Foster @ 2018-01-26 13:04 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Jan 25, 2018 at 11:21:42AM -0800, Darrick J. Wong wrote:
> On Thu, Jan 25, 2018 at 08:06:45AM -0500, Brian Foster wrote:
> > On Tue, Jan 23, 2018 at 06:18:29PM -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > Track the number of blocks reserved in the CoW fork so that we can
> > > move the quota reservations whenever we chown, and don't account for
> > > CoW fork delalloc reservations in i_delayed_blks.  This should make
> > > chown work properly for quota reservations, enables us to fully
> > > account for real extents in the cow fork in the file stat info, and
> > > improves the post-eof scanning decisions because we're no longer
> > > confusing data fork delalloc extents with cow fork delalloc extents.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_bmap.c      |   16 ++++++++++++----
> > >  fs/xfs/libxfs/xfs_inode_buf.c |    1 +
> > >  fs/xfs/xfs_bmap_util.c        |    5 +++++
> > >  fs/xfs/xfs_icache.c           |    3 ++-
> > >  fs/xfs/xfs_inode.c            |   11 +++++------
> > >  fs/xfs/xfs_inode.h            |    1 +
> > >  fs/xfs/xfs_iops.c             |    3 ++-
> > >  fs/xfs/xfs_itable.c           |    3 ++-
> > >  fs/xfs/xfs_qm.c               |    2 +-
> > >  fs/xfs/xfs_reflink.c          |    4 ++--
> > >  fs/xfs/xfs_super.c            |    1 +
> > >  11 files changed, 34 insertions(+), 16 deletions(-)
> > > 
> > > 
> > ...
> > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > > index 4a38cfc..a208825 100644
> > > --- a/fs/xfs/xfs_inode.c
> > > +++ b/fs/xfs/xfs_inode.c
> > ...
> > > @@ -1669,7 +1667,7 @@ xfs_release(
> > >  		truncated = xfs_iflags_test_and_clear(ip, XFS_ITRUNCATED);
> > >  		if (truncated) {
> > >  			xfs_iflags_clear(ip, XFS_IDIRTY_RELEASE);
> > > -			if (ip->i_delayed_blks > 0) {
> > > +			if (ip->i_delayed_blks > 0 || ip->i_cow_blocks > 0) {
> > >  				error = filemap_flush(VFS_I(ip)->i_mapping);
> > >  				if (error)
> > >  					return error;
> > 
> > Is having cowblocks really relevant to this hunk? I thought this was
> > purely a delalloc vs. file size thing, but I could be wrong. 
> 
> AFAICT, if we (1) use truncate to reduce a file's size, (2) write
> somewhere past eof, (3) make some delalloc reservations for the post-eof
> write, and (4) close the file, then this chunk flushes the dirty data to
> disk so that if we crash after the close() call returns, the file will
> still have all the data that was written out.  IOWs, this provides for
> flush-on-close after a file size reduction.
> 

I think it goes back to problems where those subsequent buffered writes
increase the file size again and the fs crashes before all data is
written out. E.g., the problem described by commit ba87ea699e ("[XFS]
Fix to prevent the notorious 'NULL files' problem after a crash."). It's
not totally clear to me whether that fixed the problem and this
particular hack is still needed.

FWIW, the flush code looks like it goes back to commit 7d4fb40ad7
("[XFS] Start writeout earlier (on last close) ...").

> So I was thinking that if a write to a lower offset causes the creation
> of a speculative cow extent of some kind that extends past eof, we'd
> still want to flush the dirty data to disk on close even if there are no
> delalloc reservations in the data fork.
> 

This whole stanza still depends on a truncate in the first place
though..?

I guess I'm not necessarily against doing this, I just think we should
verify whether it's actually useful to prevent some kind of similar
crash-recovery problem it was intended to help mitigate. If not, then
we're subjecting ourselves to the tradeoff, which appears to be that
we'll initiate writeback of any file with cowblocks on close that has
been truncated.

Granted the truncate operation is probably infrequent with respect to
close() so it's probably not that big of a deal, but in the delalloc
case a flush is at least generally expected to clear the file of delayed
allocation. It's my understanding that the same is not necessarily true
for cowblocks.. cow prealloc means blocks can sit around in the cow fork
for a while in anticipation of future copy-on-writes, right?

Brian

> Ofc now I see that xfs_file_iomap_begin_delay will create the data fork
> da reservation for a non-shared block even if a cow fork extent already
> exists (the write is promoted to cow), so perhaps this isn't strictly
> necessary... but adding a data fork da extent when there's already a cow
> fork extent seems like a (mostly harmless) bug to me.
> 
> --D
> 
> > 
> > Brian
> > 
> > > @@ -1909,7 +1907,8 @@ xfs_inactive(
> > >  
> > >  	if (S_ISREG(VFS_I(ip)->i_mode) &&
> > >  	    (ip->i_d.di_size != 0 || XFS_ISIZE(ip) != 0 ||
> > > -	     ip->i_d.di_nextents > 0 || ip->i_delayed_blks > 0))
> > > +	     ip->i_d.di_nextents > 0 || ip->i_delayed_blks > 0 ||
> > > +	     ip->i_cow_blocks > 0))
> > >  		truncate = 1;
> > >  
> > >  	error = xfs_qm_dqattach(ip, 0);
> > > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> > > index ff56486..6feee8a 100644
> > > --- a/fs/xfs/xfs_inode.h
> > > +++ b/fs/xfs/xfs_inode.h
> > > @@ -62,6 +62,7 @@ typedef struct xfs_inode {
> > >  	/* Miscellaneous state. */
> > >  	unsigned long		i_flags;	/* see defined flags below */
> > >  	unsigned int		i_delayed_blks;	/* count of delay alloc blks */
> > > +	unsigned int		i_cow_blocks;	/* count of cow fork blocks */
> > >  
> > >  	struct xfs_icdinode	i_d;		/* most of ondisk inode */
> > >  
> > > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > > index 56475fc..6c3381c 100644
> > > --- a/fs/xfs/xfs_iops.c
> > > +++ b/fs/xfs/xfs_iops.c
> > > @@ -513,7 +513,8 @@ xfs_vn_getattr(
> > >  	stat->mtime = inode->i_mtime;
> > >  	stat->ctime = inode->i_ctime;
> > >  	stat->blocks =
> > > -		XFS_FSB_TO_BB(mp, ip->i_d.di_nblocks + ip->i_delayed_blks);
> > > +		XFS_FSB_TO_BB(mp, ip->i_d.di_nblocks + ip->i_delayed_blks +
> > > +				  ip->i_cow_blocks);
> > >  
> > >  	if (ip->i_d.di_version == 3) {
> > >  		if (request_mask & STATX_BTIME) {
> > > diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
> > > index d583105..412d7eb 100644
> > > --- a/fs/xfs/xfs_itable.c
> > > +++ b/fs/xfs/xfs_itable.c
> > > @@ -122,7 +122,8 @@ xfs_bulkstat_one_int(
> > >  	case XFS_DINODE_FMT_BTREE:
> > >  		buf->bs_rdev = 0;
> > >  		buf->bs_blksize = mp->m_sb.sb_blocksize;
> > > -		buf->bs_blocks = dic->di_nblocks + ip->i_delayed_blks;
> > > +		buf->bs_blocks = dic->di_nblocks + ip->i_delayed_blks +
> > > +				 ip->i_cow_blocks;
> > >  		break;
> > >  	}
> > >  	xfs_iunlock(ip, XFS_ILOCK_SHARED);
> > > diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
> > > index 5b848f4..28f12f8 100644
> > > --- a/fs/xfs/xfs_qm.c
> > > +++ b/fs/xfs/xfs_qm.c
> > > @@ -1847,7 +1847,7 @@ xfs_qm_vop_chown_reserve(
> > >  	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL|XFS_ILOCK_SHARED));
> > >  	ASSERT(XFS_IS_QUOTA_RUNNING(mp));
> > >  
> > > -	delblks = ip->i_delayed_blks;
> > > +	delblks = ip->i_delayed_blks + ip->i_cow_blocks;
> > >  	blkflags = XFS_IS_REALTIME_INODE(ip) ?
> > >  			XFS_QMOPT_RES_RTBLKS : XFS_QMOPT_RES_REGBLKS;
> > >  
> > > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > > index e367351..f875ea7 100644
> > > --- a/fs/xfs/xfs_reflink.c
> > > +++ b/fs/xfs/xfs_reflink.c
> > > @@ -619,7 +619,7 @@ xfs_reflink_cancel_cow_blocks(
> > >  	}
> > >  
> > >  	/* clear tag if cow fork is emptied */
> > > -	if (!ifp->if_bytes)
> > > +	if (ip->i_cow_blocks == 0)
> > >  		xfs_inode_clear_cowblocks_tag(ip);
> > >  
> > >  	return error;
> > > @@ -704,7 +704,7 @@ xfs_reflink_end_cow(
> > >  	trace_xfs_reflink_end_cow(ip, offset, count);
> > >  
> > >  	/* No COW extents?  That's easy! */
> > > -	if (ifp->if_bytes == 0)
> > > +	if (ip->i_cow_blocks == 0)
> > >  		return 0;
> > >  
> > >  	offset_fsb = XFS_B_TO_FSBT(ip->i_mount, offset);
> > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > > index f3e0001..9d04cfb 100644
> > > --- a/fs/xfs/xfs_super.c
> > > +++ b/fs/xfs/xfs_super.c
> > > @@ -989,6 +989,7 @@ xfs_fs_destroy_inode(
> > >  	xfs_inactive(ip);
> > >  
> > >  	ASSERT(XFS_FORCED_SHUTDOWN(ip->i_mount) || ip->i_delayed_blks == 0);
> > > +	ASSERT(XFS_FORCED_SHUTDOWN(ip->i_mount) || ip->i_cow_blocks == 0);
> > >  	XFS_STATS_INC(ip->i_mount, vn_reclaim);
> > >  
> > >  	/*
> > > 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 06/11] xfs: fix up cowextsz allocation shortfalls
  2018-01-25 20:20     ` Darrick J. Wong
@ 2018-01-26 13:06       ` Brian Foster
  2018-01-26 19:12         ` Darrick J. Wong
  0 siblings, 1 reply; 60+ messages in thread
From: Brian Foster @ 2018-01-26 13:06 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Jan 25, 2018 at 12:20:33PM -0800, Darrick J. Wong wrote:
> On Thu, Jan 25, 2018 at 12:31:12PM -0500, Brian Foster wrote:
> > On Tue, Jan 23, 2018 at 06:18:35PM -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > In xfs_bmap_btalloc, we try using the CoW extent size hint to force
> > > allocations to align (offset-wise) to cowextsz granularity to reduce CoW
> > > fragmentation.  This works fine until we cannot satisfy the allocation
> > > with enough blocks to cover the requested range and the alignment hints.
> > > If this happens, return an unaligned region because if we don't the
> > > extent trim functions cause us to return a zero-length extent to iomap,
> > > which iomap doesn't catch and thus blows up.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > 
> > Hmm.. is this a direct I/O thing? The description of the problem had me
> 
> Yes.
> 
> > wondering how we handle this with regard to dio and traditional extent
> > size hints. It looks like we just return -ENOSPC if xfs_bmapi_write()
> > doesn't return a mapping that covers the target range of the write (even
> > if it apparently attempts to allocate part of the associated extent size
> > hint range). E.g., see the nimaps == 0 check in
> > xfs_iomap_write_direct() after we commit the transaction.
> 
> I did take a look at that, and didn't like it.
> 
> There's enough free space to fill the dio write, but the free space
> itself is very fragmented so we can't honor the hint.  We did however
> manage to allocate /some/ blocks, so we might as well return what we got
> and let the next iteration of the iomap_apply loop try to fill the rest
> of the write request.  We already reserved enough space, so the write
> should succeed totally, not return to userspace with either a short
> write or ENOSPC just because free space is fragmented.
> 

I'm not following how a short write can necessarily be prevented, since
space reservation doesn't guarantee contiguity and afaict we only make a
single mapping call. I suppose the iomap level can loop, but that's
outside of the context where blocks are reserved. Hm?

But regardless, this behavior seems reasonable to me if we apply it
consistently between cow fork hint behavior and traditional extent size
hint behavior. They are both hints, after all. I do think an nimaps
check might still be appropriate in that reflink code path simply to
cover the case of unexpected behavior or a bug, rather than brace for
whatever is going to happen if we continue to shuffle a bogus imap
around.
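
For reference, the offset fix-up in the xfs_bmap_btalloc_cow() hunk quoted
further down can be modelled in userspace roughly as follows. This is only
a sketch: the toy_ types and names are invented for illustration and are
not kernel interfaces; only the two comparisons are taken from the quoted
patch.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-ins for xfs_fileoff_t / xfs_extlen_t; not the kernel types. */
typedef uint64_t toy_fileoff_t;
typedef uint32_t toy_extlen_t;

struct toy_alloc {
	toy_fileoff_t	offset;	/* file offset after cowextsize alignment */
	toy_extlen_t	length;	/* blocks the allocator actually returned */
};

/*
 * Same two comparisons as the quoted hunk: if the allocation came up
 * short, slide the mapping so that it overlaps the range the caller
 * originally asked for instead of leaving it at the aligned offset.
 */
static void
toy_fixup_cow_offset(
	struct toy_alloc	*ap,
	toy_fileoff_t		orig_offset,
	toy_extlen_t		orig_length)
{
	assert(ap->length > 0);

	if (ap->length <= orig_length)
		ap->offset = orig_offset;
	else if (ap->offset + ap->length < orig_offset + orig_length)
		ap->offset = orig_offset + orig_length - ap->length;
}
```

With a write at block 37 of length 4 and a hint-aligned offset of 32, a
3-block allocation ends up mapped at 37 and a 6-block allocation at 35, so
in both cases the mapping touches the requested range rather than trimming
to zero length.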

Brian

> (The other problem is that if we return ENOSPC out of iomap_begin, that
> error code will bubble all the way back to userspace even if we /did/
> write something, which means that even the programs that handle short
> dio writes correctly will see that ENOSPC and bail out.  Goldwyn has
> been trying to fix that braindamage for some time now.)
> 
> > In fact, it looks like just repeating the failed write could eventually
> > succeed if the issue is that there is actually enough free space
> > available to allocate the hint range up to where the write is targeted,
> > just no long enough extent available to fill the extent size hint range
> > in a single bmapi_write call. That behavior is a bit strange, I admit,
> > but I'm wondering if we could do the same thing for the cow hint. Would
> > a similar nimaps check in the xfs_bmapi_write() caller resolve the bug
> > described here?
> > 
> > If so and if we still care to actually change/fix the allocation
> > behavior with regard to the hints, perhaps we could do that in a
> > separate patch more generically for both hints..?
> 
> I get the feeling we could apply this change to all the data fork
> bmap_btalloc calls too.  I'll go study that in more depth.
> 
> --D
> 
> > 
> > Brian
> > 
> > >  fs/iomap.c               |    2 +-
> > >  fs/xfs/libxfs/xfs_bmap.c |   21 +++++++++++++++++++--
> > >  2 files changed, 20 insertions(+), 3 deletions(-)
> > > 
> > > 
> > > diff --git a/fs/iomap.c b/fs/iomap.c
> > > index e5de772..aec35a0 100644
> > > --- a/fs/iomap.c
> > > +++ b/fs/iomap.c
> > > @@ -63,7 +63,7 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
> > >  	ret = ops->iomap_begin(inode, pos, length, flags, &iomap);
> > >  	if (ret)
> > >  		return ret;
> > > -	if (WARN_ON(iomap.offset > pos))
> > > +	if (WARN_ON(iomap.offset > pos) || WARN_ON(iomap.length == 0))
> > >  		return -EIO;
> > >  
> > >  	/*
> > > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > > index 93ce2c6..4ec1fdc5 100644
> > > --- a/fs/xfs/libxfs/xfs_bmap.c
> > > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > > @@ -3480,8 +3480,20 @@ xfs_bmap_btalloc_filestreams(
> > >  static void
> > >  xfs_bmap_btalloc_cow(
> > >  	struct xfs_bmalloca	*ap,
> > > -	struct xfs_alloc_arg	*args)
> > > +	struct xfs_alloc_arg	*args,
> > > +	xfs_fileoff_t		orig_offset,
> > > +	xfs_extlen_t		orig_length)
> > >  {
> > > +	/*
> > > +	 * If we didn't get enough blocks to satisfy the cowextsize
> > > +	 * aligned request, break the alignment and return whatever we
> > > +	 * got; it's the best we can do.
> > > +	 */
> > > +	if (ap->length <= orig_length)
> > > +		ap->offset = orig_offset;
> > > +	else if (ap->offset + ap->length < orig_offset + orig_length)
> > > +		ap->offset = orig_offset + orig_length - ap->length;
> > > +
> > >  	/* Filling a previously reserved extent; nothing to do here. */
> > >  	if (ap->wasdel)
> > >  		return;
> > > @@ -3520,6 +3532,8 @@ xfs_bmap_btalloc(
> > >  	xfs_agnumber_t	fb_agno;	/* ag number of ap->firstblock */
> > >  	xfs_agnumber_t	ag;
> > >  	xfs_alloc_arg_t	args;
> > > +	xfs_fileoff_t	orig_offset;
> > > +	xfs_extlen_t	orig_length;
> > >  	xfs_extlen_t	blen;
> > >  	xfs_extlen_t	nextminlen = 0;
> > >  	int		nullfb;		/* true if ap->firstblock isn't set */
> > > @@ -3529,6 +3543,8 @@ xfs_bmap_btalloc(
> > >  	int		stripe_align;
> > >  
> > >  	ASSERT(ap->length);
> > > +	orig_offset = ap->offset;
> > > +	orig_length = ap->length;
> > >  
> > >  	mp = ap->ip->i_mount;
> > >  
> > > @@ -3745,7 +3761,8 @@ xfs_bmap_btalloc(
> > >  		ASSERT(nullfb || fb_agno <= args.agno);
> > >  		ap->length = args.len;
> > >  		if (ap->flags & XFS_BMAPI_COWFORK) {
> > > -			xfs_bmap_btalloc_cow(ap, &args);
> > > +			xfs_bmap_btalloc_cow(ap, &args, orig_offset,
> > > +					orig_length);
> > >  		} else {
> > >  			ap->ip->i_d.di_nblocks += args.len;
> > >  			xfs_trans_log_inode(ap->tp, ap->ip, XFS_ILOG_CORE);
> > > 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 01/11] xfs: reflink should break pnfs leases before sharing blocks
  2018-01-26  9:06   ` Christoph Hellwig
@ 2018-01-26 18:26     ` Darrick J. Wong
  0 siblings, 0 replies; 60+ messages in thread
From: Darrick J. Wong @ 2018-01-26 18:26 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Fri, Jan 26, 2018 at 01:06:52AM -0800, Christoph Hellwig wrote:
> On Tue, Jan 23, 2018 at 06:18:03PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Before we share blocks between files, we need to break the pnfs leases
> > on the layout before we start slicing and dicing the block map.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/xfs_reflink.c |   48 +++++++++++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 47 insertions(+), 1 deletion(-)
> > 
> > 
> > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > index 47aea2e..f89a725 100644
> > --- a/fs/xfs/xfs_reflink.c
> > +++ b/fs/xfs/xfs_reflink.c
> > @@ -1245,6 +1245,50 @@ xfs_reflink_remap_blocks(
> >  }
> >  
> >  /*
> > + * Grab the exclusive iolock for a data copy from src to dest, making
> > + * sure to abide vfs locking order (lowest pointer value goes first) and
> > + * breaking the pnfs layout leases on dest before proceeding.  The loop
> > + * is needed because we cannot call the blocking break_layout() with the
> > + * src iolock held, and therefore have to back out both locks.
> > + */
> > +static int
> > +xfs_iolock_two_inodes_and_break_layout(
> > +	struct inode		*src,
> > +	struct inode		*dest)
> > +{
> > +	bool			src_first = src < dest;
> > +	bool			src_last = src > dest;
> 
> I find the double predicates here highly confusing.
> 
> Also the code doesn't seem to handle the src == dest case as
> far as I can tell.

I guess they are confusing; when src == dest, src_first and src_last are
both false.

> > +retry:
> > +	if (src_first) {
> > +		inode_lock(src);
> > +		inode_lock_nested(dest, I_MUTEX_NONDIR2);
> > +	} else {
> > +		inode_lock(dest);
> > +	}
> 
> Shouldn't this be replaced by a call to lock_two_nondirectories?
> Even if that holds both locks over the non-blocking break_layout
> it makes things a lot simpler and only does an additional roundtrip
> for the layouts outstanding slow path.
> 
> > +	error = break_layout(dest, false);
> > +	if (error == -EWOULDBLOCK) {
> > +		inode_unlock(dest);
> > +		if (src_first)
> > +			inode_unlock(src);
> 
> unlock_two_nondirectories?
> 
> > +		error = break_layout(dest, true);
> > +		if (error)
> > +			return error;
> > +		goto retry;
> > +	} else if (error) {
> 
> no need for an else after a goto.
> 
> > +		inode_unlock(dest);
> > +		if (src_first)
> > +			inode_unlock(src);
> 
> unlock_two_nondirectories?
> 
> Also seems like this could be simplified to:
> 
> 	if (error) {
> 		unlock_two_nondirectories()
> 		if (error == -EWOULDBLOCK)
> 			goto retry;
> 		return error;
> 	}
> 
> So I guess the whole thing could simply become something like:
> 
> retry:
> 	lock_two_nondirectories(src, dest);
> 	error = break_layout(dest, false);
> 	if (error) {
> 		unlock_two_nondirectories(src, dest);
> 		if (error == -EWOULDBLOCK)
> 			goto retry;
> 		return error;
> 	}
> 
> and could probably just be inlined into the caller..

Yeah, that's simpler... though at this point I'll have to put all this
into a new series having already pushed to for-next.  :/
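
A rough userspace model of that shape, for whoever picks this up later. It
is a sketch under assumptions, not the eventual patch: it uses
address-ordered lock/unlock helpers in the spirit of
lock_two_nondirectories(), and it keeps the blocking break_layout(dest,
true) step from the original patch. Every toy_ name below is hypothetical
and the "locks" are plain flags.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy inode: a lock flag plus a count of outstanding layout leases. */
struct toy_inode {
	bool	locked;
	int	layouts;
};

/* Take both locks in address order; only one lock if a == b. */
static void toy_lock_two(struct toy_inode *a, struct toy_inode *b)
{
	if ((uintptr_t)a > (uintptr_t)b) {
		struct toy_inode *tmp = a;

		a = b;
		b = tmp;
	}
	assert(!a->locked);
	a->locked = true;
	if (b != a) {
		assert(!b->locked);
		b->locked = true;
	}
}

static void toy_unlock_two(struct toy_inode *a, struct toy_inode *b)
{
	a->locked = false;
	if (b != a)
		b->locked = false;
}

/* Non-blocking callers get -EWOULDBLOCK while leases are outstanding. */
static int toy_break_layout(struct toy_inode *inode, bool wait)
{
	if (inode->layouts == 0)
		return 0;
	if (!wait)
		return -EWOULDBLOCK;
	inode->layouts = 0;	/* pretend the lease holders returned them */
	return 0;
}

/* On success both inodes are locked and dest has no layouts left. */
static int
toy_lock_two_and_break_layout(struct toy_inode *src, struct toy_inode *dest)
{
	int	error;

retry:
	toy_lock_two(src, dest);
	error = toy_break_layout(dest, false);
	if (error == -EWOULDBLOCK) {
		/* Drop both locks before waiting, then start over. */
		toy_unlock_two(src, dest);
		error = toy_break_layout(dest, true);
		if (error)
			return error;
		goto retry;
	}
	if (error)
		toy_unlock_two(src, dest);
	return error;
}
```

Because the helpers compare addresses and skip the second lock when both
pointers match, the src == dest case needs no separate predicate.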

(Sorry, kinda overburdened this week)

--D


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 04/11] xfs: CoW fork operations should only update quota reservations
  2018-01-26 13:02         ` Brian Foster
@ 2018-01-26 18:40           ` Darrick J. Wong
  0 siblings, 0 replies; 60+ messages in thread
From: Darrick J. Wong @ 2018-01-26 18:40 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Fri, Jan 26, 2018 at 08:02:16AM -0500, Brian Foster wrote:
> On Thu, Jan 25, 2018 at 10:20:03AM -0800, Darrick J. Wong wrote:
> > On Thu, Jan 25, 2018 at 08:03:53AM -0500, Brian Foster wrote:
> > > On Wed, Jan 24, 2018 at 05:20:35PM -0800, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > 
> > > > Since the CoW fork only exists in memory, it is incorrect to update the
> > > > on-disk quota block counts when we modify the CoW fork.  Unlike the data
> > > > fork, even real extents in the CoW fork are only reservations (on-disk
> > > > they're owned by the refcountbt) so they must not be tracked in the on
> > > > disk quota info.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > ---
> > > > v2: make documentation more crisp and to the point
> > > > ---
> > > >  fs/xfs/libxfs/xfs_bmap.c |  118 ++++++++++++++++++++++++++++++++++++++++++----
> > > >  fs/xfs/xfs_quota.h       |   14 ++++-
> > > >  fs/xfs/xfs_reflink.c     |    8 ++-
> > > >  3 files changed, 122 insertions(+), 18 deletions(-)
> > > > 
> ...
> > > > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > > > index 82abff6..e367351 100644
> > > > --- a/fs/xfs/xfs_reflink.c
> > > > +++ b/fs/xfs/xfs_reflink.c
> > > > @@ -599,10 +599,6 @@ xfs_reflink_cancel_cow_blocks(
> > > >  					del.br_startblock, del.br_blockcount,
> > > >  					NULL);
> > > >  
> > > > -			/* Update quota accounting */
> > > > -			xfs_trans_mod_dquot_byino(*tpp, ip, XFS_TRANS_DQ_BCOUNT,
> > > > -					-(long)del.br_blockcount);
> > > > -
> > > >  			/* Roll the transaction */
> > > >  			xfs_defer_ijoin(&dfops, ip);
> > > >  			error = xfs_defer_finish(tpp, &dfops);
> > > > @@ -795,6 +791,10 @@ xfs_reflink_end_cow(
> > > >  		if (error)
> > > >  			goto out_defer;
> > > >  
> > > > +		/* Charge this new data fork mapping to the on-disk quota. */
> > > > +		xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT,
> > > > +				(long)del.br_blockcount);
> > > > +
> > > 
> > > Should this technically be XFS_TRANS_DQ_DELBCOUNT? The blocks obviously
> > > aren't delalloc and this transaction doesn't make a quota reservation so
> > > I don't think it screws up accounting. But if the transaction did make a
> > > quota reservation, it seems like this would account the extent against
> > > the tx reservation where it instead should recognize that cow blocks
> > > have already been reserved (which is essentially what DELBCOUNT means,
> > > IIUC).
> > 
> > Hmmm, there's a subtlety here -- we're opencoding what DELBCOUNT does,
> > because the subsequent xfs_bmap_del_extent_cow unconditionally reduces
> > the in-core reservation after we've mapped in the extent as if it had
> > been accounted as a real extent all along.  But considering all the
> > blather about how cow fork blocks are treated as incore reservations, it
> > does look funny, doesn't it?
> > 
> 
> Ok.. I missed that the end/del cases were tied together, then reconfused
> myself over the accounting in the end_cow() path (re: our irc chat
> yesterday) when reassessing that bit. So to reset my brain, we have the
> following with this current patch:
> 
> - cow reserve does a delalloc and in-core dquot reservation
> - cow real alloc either skips dquot adjustment if wasdel, else reduces
>   the quota res acquired by the transaction by the size of the alloc[1].
>   Either way we leave around an in-core quota reservation as if the blocks
>   remained delalloc.
> - A cancel at this point simply kills the in-core dquot reservation
>   along with the cow fork blocks.
> - end_cow() unmaps the current data fork blocks and decrements
>   associated real quota usage (tx), remaps the cow blocks and increments
>   real quota usage (tx), then kills off the in-core dquot reservation.

Correct.

> [1] Would this even be necessary if we just acquired a delalloc like
> reservation in xfs_reflink_allocate_cow() rather than associate the
> reservation with the transaction in the first place (assuming we have
> enough information to cover error handling, extent manipulations and
> whatnot)?

Originally cow did make da reservations even for direct writes, but
Christoph thought that we could avoid the overhead of running through
the cow fork an extra time by mapping directly to the cow fork.

> When the tx commits, this essentially has the effect of applying the
> bcount delta to both the on-disk dquot and the in-core res. The former
> reflects the change in the file on-disk and the latter is rectified
> because the field accounts for the current real usage plus outstanding
> reservation. The original cowblocks res has been dropped directly, so
> the bcount delta reflects the change to the data fork.

<nod>

> If we instead use delbcount in end_cow(), we're telling the transaction
> to drop bcount by whatever old data fork blocks were removed and that
> we've converted N delalloc (cow fork, actually) blocks that already had
> in-core reservation. Therefore, transaction commit updates the on-disk
> dquot just the same (-dataforkblocks + delallocblocks), but delbcount
> blocks have already updated the in-core dquot res so the transaction has
> nothing else to do there (and so we must also not remove that
> reservation in del_cow()). This approach does seem like it requires a
> bit less mental gymnastics to follow because it more closely resembles
> delalloc quota accounting. ;)

Yes, that's less brain muddling; last night's patchpile incorporates
that.

> Another thing that I'm not sure has been considered here is whether
> doing the bcount delta in the transaction and dropping the cowblocks res
> from the dquot directly leaves a race window where the quota can overrun
> a limit. E.g., since the transaction has to up the in-core res in the
> original example at commit time, is there anything that locks out
> further external reservation from the dquot between the time the in-core
> res is dropped and the transaction commits?

Yes, that's a theoretical race (as in I've never seen it happen) that
is fixed by using delbcount in end_cow.
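
To make the window concrete, here is a toy model of the two accounting
schemes. It is a deliberately simplified sketch: struct toy_dquot and the
toy_ helpers are invented for illustration, the old data fork blocks being
unmapped are ignored, and the commit-time behaviour is reduced to what was
described above (a bcount delta moves both the on-disk count and the
in-core reservation, a delbcount delta moves only the on-disk count).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy dquot; not the real struct xfs_dquot. */
struct toy_dquot {
	long	bcount;		/* on-disk block usage */
	long	res_bcount;	/* in-core: usage plus outstanding reservations */
	long	hard_limit;
};

/* Would an unrelated reservation of nblks be admitted right now? */
static bool toy_can_reserve(const struct toy_dquot *dq, long nblks)
{
	return dq->res_bcount + nblks <= dq->hard_limit;
}

/* In-core reservation dropped directly, outside any transaction. */
static void toy_unreserve(struct toy_dquot *dq, long nblks)
{
	assert(nblks >= 0 && nblks <= dq->res_bcount);
	dq->res_bcount -= nblks;
}

/* Commit of a plain bcount delta: usage and in-core res both move. */
static void toy_commit_bcount(struct toy_dquot *dq, long delta)
{
	dq->bcount += delta;
	dq->res_bcount += delta;
}

/* Commit of a delbcount delta: the blocks were already reserved in-core. */
static void toy_commit_delbcount(struct toy_dquot *dq, long delta)
{
	dq->bcount += delta;
}
```

Take a limit of 100 blocks, 90 in use and 8 reserved for the cow fork
(res_bcount = 98). With the bcount scheme the reservation drops to 90 before
the commit, a 5-block reservation from elsewhere is admitted in that gap, and
the commit then pushes res_bcount to 103. With delbcount the reservation
never dips, so the same 5-block request is refused.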

> > So perhaps the solution is to pass intent into xfs_bmap_del_extent_cow:
> > if we're calling it from _end_cow then we want to hang on to the
> > reservation so that delbcount can do its thing, but if we're calling
> > from _cancel_cow then we're dumping the extent and reservation.
> > 
> 
> Indeed. But since those are the only callers and we'd already update
> delbcount from end_cow(), could we not just lift the del_cow() decrement
> into the cancel_cow() function? FWIW, some extra comments around quota
> manipulation in the reflink functions would also be useful for future
> reference.

Hm, yes, could do that too.

TBH I had the moment of "doh, just call the quota unreserve in
cancel_cow directly instead of at the end of del_extent_cow" right after
I hit send. :(

--D

> Brian
> 
> > --D
> > 
> > > 
> > > Other than that the code seems Ok to me.
> > > 
> > > Brian
> > > 
> > > >  		/* Remove the mapping from the CoW fork. */
> > > >  		xfs_bmap_del_extent_cow(ip, &icur, &got, &del);
> > > >  

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 02/11] xfs: only grab shared inode locks for source file during reflink
  2018-01-26 12:07   ` Christoph Hellwig
@ 2018-01-26 18:48     ` Darrick J. Wong
  2018-01-27  3:32     ` Dave Chinner
  1 sibling, 0 replies; 60+ messages in thread
From: Darrick J. Wong @ 2018-01-26 18:48 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Fri, Jan 26, 2018 at 04:07:41AM -0800, Christoph Hellwig wrote:
> > +xfs_lock_two_inodes_separately(
> > +	struct xfs_inode	*ip0,
> > +	uint			ip0_mode,
> > +	struct xfs_inode	*ip1,
> > +	uint			ip1_mode)
> 
> We only have 6 calls to xfs_lock_two_inodes in total, so just update
> the signature to take two modes and be done with it.
> 
> Also how about mode1 and mode2 for the argument names?
> 
> >  	lp = (xfs_log_item_t *)ip0->i_itemp;
> >  	if (lp && (lp->li_flags & XFS_LI_IN_AIL)) {
> > -		if (!xfs_ilock_nowait(ip1, xfs_lock_inumorder(lock_mode, 1))) {
> > -			xfs_iunlock(ip0, lock_mode);
> > +		if (!xfs_ilock_nowait(ip1, xfs_lock_inumorder(ip1_mode, 1))) {
> > +			xfs_iunlock(ip0, ip0_mode);
> >  			if ((++attempts % 5) == 0)
> >  				delay(1); /* Don't just spin the CPU */
> >  			goto again;
> >  		}
> >  	} else {
> > -		xfs_ilock(ip1, xfs_lock_inumorder(lock_mode, 1));
> > +		xfs_ilock(ip1, xfs_lock_inumorder(ip1_mode, 1));
> >  	}
> >  }
> 
> Not directly related to your patch, but the nowait + retry
> mess must go away.

Agree with the yuckiness; how often do we encounter that situation?

> I think we need to move to the VFS locking conventions, that is
> based on ancestors for directories (see lock_rename) and otherwise
> based on the struct inode address as in lock_two_nondirectories.

I'd been wondering if the difference in locking conventions would ever
come to bite us...

> >       if (src_last)
> > -             inode_lock_nested(src, I_MUTEX_NONDIR2);
> > +             down_read_nested(&src->i_rwsem, I_MUTEX_NONDIR2);
> 
> Why is this not using inode_lock_nested any more?

Because inode_lock_nested calls down_write_nested, but I suppose I could
have simply added an inode_lock_shared_nested helper to make it more
consistent.

--D


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 05/11] xfs: track CoW blocks separately in the inode
  2018-01-26 12:15   ` Christoph Hellwig
@ 2018-01-26 19:00     ` Darrick J. Wong
  2018-01-26 23:51       ` Darrick J. Wong
  0 siblings, 1 reply; 60+ messages in thread
From: Darrick J. Wong @ 2018-01-26 19:00 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Fri, Jan 26, 2018 at 04:15:46AM -0800, Christoph Hellwig wrote:
> On Tue, Jan 23, 2018 at 06:18:29PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Track the number of blocks reserved in the CoW fork so that we can
> > move the quota reservations whenever we chown, and don't account for
> > CoW fork delalloc reservations in i_delayed_blks.  This should make
> > chown work properly for quota reservations, enables us to fully
> > account for real extents in the cow fork in the file stat info, and
> > improves the post-eof scanning decisions because we're no longer
> > confusing data fork delalloc extents with cow fork delalloc extents.
> 
> Just curious:  is there any good reason we can't just have an
> i_extra_blocks field for the delayed and cow blocks?  Or is there
> a place where we care about the difference between the two?

"cow blocks" now includes real and unwritten extents sitting around in
the cow fork in addition to delalloc extents in the cow fork, and I
didn't want the field to have overlapping meanings.  On a practical
level, it also means we avoid eofblocks scans on inodes that have cow
blocks but no da blocks.
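
For anyone following along, here is a minimal userspace sketch of how the
reported block count is summed with a separate CoW counter, modelled on the
xfs_vn_getattr hunk in this patch.  The toy_* names and TOY_BBSHIFT are
made up for illustration and are not kernel symbols; the shift assumes
512-byte basic blocks.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_BBSHIFT	9	/* assumption: a basic block is 512 bytes */

/* Hypothetical stand-in for the three per-inode counters in the patch. */
struct toy_inode {
	uint64_t	nblocks;	/* blocks mapped on disk */
	uint32_t	delayed_blks;	/* data fork delalloc blocks */
	uint32_t	cow_blocks;	/* everything sitting in the cow fork */
};

/* Convert filesystem blocks to 512-byte basic blocks. */
static uint64_t
toy_fsb_to_bb(uint64_t fsb, unsigned int blocklog)
{
	assert(blocklog >= TOY_BBSHIFT);
	return fsb << (blocklog - TOY_BBSHIFT);
}

/* Sum all three counters, the way the getattr hunk does. */
static uint64_t
toy_stat_blocks(const struct toy_inode *ip, unsigned int blocklog)
{
	return toy_fsb_to_bb(ip->nblocks + ip->delayed_blks + ip->cow_blocks,
			     blocklog);
}
```

With 4k blocks (blocklog 12), an inode holding 10 on-disk, 2 delalloc and
4 cow fork blocks would report (10 + 2 + 4) << 3 basic blocks.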

--D


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 05/11] xfs: track CoW blocks separately in the inode
  2018-01-26 13:04       ` Brian Foster
@ 2018-01-26 19:08         ` Darrick J. Wong
  0 siblings, 0 replies; 60+ messages in thread
From: Darrick J. Wong @ 2018-01-26 19:08 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Fri, Jan 26, 2018 at 08:04:29AM -0500, Brian Foster wrote:
> On Thu, Jan 25, 2018 at 11:21:42AM -0800, Darrick J. Wong wrote:
> > On Thu, Jan 25, 2018 at 08:06:45AM -0500, Brian Foster wrote:
> > > On Tue, Jan 23, 2018 at 06:18:29PM -0800, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > 
> > > > Track the number of blocks reserved in the CoW fork so that we can
> > > > move the quota reservations whenever we chown, and don't account for
> > > > CoW fork delalloc reservations in i_delayed_blks.  This should make
> > > > chown work properly for quota reservations, enables us to fully
> > > > account for real extents in the cow fork in the file stat info, and
> > > > improves the post-eof scanning decisions because we're no longer
> > > > confusing data fork delalloc extents with cow fork delalloc extents.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > ---
> > > >  fs/xfs/libxfs/xfs_bmap.c      |   16 ++++++++++++----
> > > >  fs/xfs/libxfs/xfs_inode_buf.c |    1 +
> > > >  fs/xfs/xfs_bmap_util.c        |    5 +++++
> > > >  fs/xfs/xfs_icache.c           |    3 ++-
> > > >  fs/xfs/xfs_inode.c            |   11 +++++------
> > > >  fs/xfs/xfs_inode.h            |    1 +
> > > >  fs/xfs/xfs_iops.c             |    3 ++-
> > > >  fs/xfs/xfs_itable.c           |    3 ++-
> > > >  fs/xfs/xfs_qm.c               |    2 +-
> > > >  fs/xfs/xfs_reflink.c          |    4 ++--
> > > >  fs/xfs/xfs_super.c            |    1 +
> > > >  11 files changed, 34 insertions(+), 16 deletions(-)
> > > > 
> > > > 
> > > ...
> > > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > > > index 4a38cfc..a208825 100644
> > > > --- a/fs/xfs/xfs_inode.c
> > > > +++ b/fs/xfs/xfs_inode.c
> > > ...
> > > > @@ -1669,7 +1667,7 @@ xfs_release(
> > > >  		truncated = xfs_iflags_test_and_clear(ip, XFS_ITRUNCATED);
> > > >  		if (truncated) {
> > > >  			xfs_iflags_clear(ip, XFS_IDIRTY_RELEASE);
> > > > -			if (ip->i_delayed_blks > 0) {
> > > > +			if (ip->i_delayed_blks > 0 || ip->i_cow_blocks > 0) {
> > > >  				error = filemap_flush(VFS_I(ip)->i_mapping);
> > > >  				if (error)
> > > >  					return error;
> > > 
> > > Is having cowblocks really relevant to this hunk? I thought this was
> > > purely a delalloc vs. file size thing, but I could be wrong. 
> > 
> > AFAICT, if we (1) use truncate to reduce a file's size, (2) write
> > somewhere past eof, (3) make some delalloc reservations for the post-eof
> > write, and (4) close the file, then this chunk flushes the dirty data to
> > disk so that if we crash after the close() call returns, the file will
> > still have all the data that was written out.  IOWs, this provides for
> > flush-on-close after a file size reduction.
> > 
> 
> I think it goes back to problems where those subsequent buffered writes
> increase the file size again and the fs crashes before all data is
> written out. E.g., the problem described by commit ba87ea699e ("[XFS]
> Fix to prevent the notorious 'NULL files' problem after a crash."). It's
> not totally clear to me whether that fixed the problem and this
> particular hack is still needed.

Me neither.  It looks like deferring the size update until the write
end_io would have closed this bug... but on the other hand maybe its
function is more to avoid disappointing the people who expect flush on
close behavior...

> FWIW, the flush code looks like it goes back to commit 7d4fb40ad7
> ("[XFS] Start writeout earlier (on last close) ...").
> 
> > So I was thinking that if a write to a lower offset causes the creation
> > of a speculative cow extent of some kind that extends past eof, we'd
> > still want to flush the dirty data to disk on close even if there are no
> > delalloc reservations in the data fork.
> > 
> 
> This whole stanza still depends on a truncate in the first place
> though..?
> 
> I guess I'm not necessarily against doing this, I just think we should
> verify whether it's actually useful to prevent some kind of similar
> crash-recovery problem it was intended to help mitigate. If not, then
> we're subjecting ourselves to the tradeoff, which appears to be that
> we'll initiate writeback of any file with cowblocks on close that has
> been truncated.
> 
> Granted the truncate operation is probably infrequent with respect to
> close() so it's probably not that big of a deal, but in the delalloc

It's probably infrequent wrt cow-and-close, but "echo foo > existingfile"
would trigger this for the regular da case.  I don't really mind
dropping it either, aside from my sense of paranoia. :P
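
As a rough illustration, the syscall sequence behind that shell redirect
looks something like the sketch below: open with O_TRUNC, write, close.
The helper names are hypothetical, and whether the close() actually starts
writeback depends on the filesystem; this only shows the userspace side.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Roughly what "echo foo > existingfile" does: truncate the file on
 * open, write the new contents, then close.  Returns 0 on success.
 */
static int
truncate_write_close(const char *path, const char *data)
{
	size_t	len;
	ssize_t	written;
	int	fd, rc;

	assert(path != NULL && data != NULL);

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;

	len = strlen(data);
	written = write(fd, data, len);
	rc = close(fd);		/* the release path under discussion */

	return (written == (ssize_t)len && rc == 0) ? 0 : -1;
}

/* Size of the file in bytes, or -1 if it cannot be stat'ed. */
static long
file_size(const char *path)
{
	struct stat	st;

	return stat(path, &st) == 0 ? (long)st.st_size : -1;
}
```

Calling truncate_write_close() twice on the same path, with a shorter
payload the second time, gives the "size reduction followed by a write and
a close" pattern that this hunk reacts to.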

> case a flush is at least generally expected to clear the file of delayed
> allocation. It's my understanding that the same is not necessarily true
> for cowblocks.. cow prealloc means blocks can sit around in the cow fork
> for a while in anticipation of future copy-on-writes, right?

Yes.

--D

> 
> Brian
> 
> > Ofc now I see that xfs_file_iomap_begin_delay will create the data fork
> > da reservation for a non-shared block even if a cow fork extent already
> > exists (the write is promoted to cow), so perhaps this isn't strictly
> > necessary... but adding a data fork da extent when there's already a cow
> > fork extent seems like a (mostly harmless) bug to me.
> > 
> > --D
> > 
> > > 
> > > Brian
> > > 
> > > > @@ -1909,7 +1907,8 @@ xfs_inactive(
> > > >  
> > > >  	if (S_ISREG(VFS_I(ip)->i_mode) &&
> > > >  	    (ip->i_d.di_size != 0 || XFS_ISIZE(ip) != 0 ||
> > > > -	     ip->i_d.di_nextents > 0 || ip->i_delayed_blks > 0))
> > > > +	     ip->i_d.di_nextents > 0 || ip->i_delayed_blks > 0 ||
> > > > +	     ip->i_cow_blocks > 0))
> > > >  		truncate = 1;
> > > >  
> > > >  	error = xfs_qm_dqattach(ip, 0);
> > > > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> > > > index ff56486..6feee8a 100644
> > > > --- a/fs/xfs/xfs_inode.h
> > > > +++ b/fs/xfs/xfs_inode.h
> > > > @@ -62,6 +62,7 @@ typedef struct xfs_inode {
> > > >  	/* Miscellaneous state. */
> > > >  	unsigned long		i_flags;	/* see defined flags below */
> > > >  	unsigned int		i_delayed_blks;	/* count of delay alloc blks */
> > > > +	unsigned int		i_cow_blocks;	/* count of cow fork blocks */
> > > >  
> > > >  	struct xfs_icdinode	i_d;		/* most of ondisk inode */
> > > >  
> > > > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > > > index 56475fc..6c3381c 100644
> > > > --- a/fs/xfs/xfs_iops.c
> > > > +++ b/fs/xfs/xfs_iops.c
> > > > @@ -513,7 +513,8 @@ xfs_vn_getattr(
> > > >  	stat->mtime = inode->i_mtime;
> > > >  	stat->ctime = inode->i_ctime;
> > > >  	stat->blocks =
> > > > -		XFS_FSB_TO_BB(mp, ip->i_d.di_nblocks + ip->i_delayed_blks);
> > > > +		XFS_FSB_TO_BB(mp, ip->i_d.di_nblocks + ip->i_delayed_blks +
> > > > +				  ip->i_cow_blocks);
> > > >  
> > > >  	if (ip->i_d.di_version == 3) {
> > > >  		if (request_mask & STATX_BTIME) {
> > > > diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
> > > > index d583105..412d7eb 100644
> > > > --- a/fs/xfs/xfs_itable.c
> > > > +++ b/fs/xfs/xfs_itable.c
> > > > @@ -122,7 +122,8 @@ xfs_bulkstat_one_int(
> > > >  	case XFS_DINODE_FMT_BTREE:
> > > >  		buf->bs_rdev = 0;
> > > >  		buf->bs_blksize = mp->m_sb.sb_blocksize;
> > > > -		buf->bs_blocks = dic->di_nblocks + ip->i_delayed_blks;
> > > > +		buf->bs_blocks = dic->di_nblocks + ip->i_delayed_blks +
> > > > +				 ip->i_cow_blocks;
> > > >  		break;
> > > >  	}
> > > >  	xfs_iunlock(ip, XFS_ILOCK_SHARED);
> > > > diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
> > > > index 5b848f4..28f12f8 100644
> > > > --- a/fs/xfs/xfs_qm.c
> > > > +++ b/fs/xfs/xfs_qm.c
> > > > @@ -1847,7 +1847,7 @@ xfs_qm_vop_chown_reserve(
> > > >  	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL|XFS_ILOCK_SHARED));
> > > >  	ASSERT(XFS_IS_QUOTA_RUNNING(mp));
> > > >  
> > > > -	delblks = ip->i_delayed_blks;
> > > > +	delblks = ip->i_delayed_blks + ip->i_cow_blocks;
> > > >  	blkflags = XFS_IS_REALTIME_INODE(ip) ?
> > > >  			XFS_QMOPT_RES_RTBLKS : XFS_QMOPT_RES_REGBLKS;
> > > >  
> > > > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > > > index e367351..f875ea7 100644
> > > > --- a/fs/xfs/xfs_reflink.c
> > > > +++ b/fs/xfs/xfs_reflink.c
> > > > @@ -619,7 +619,7 @@ xfs_reflink_cancel_cow_blocks(
> > > >  	}
> > > >  
> > > >  	/* clear tag if cow fork is emptied */
> > > > -	if (!ifp->if_bytes)
> > > > +	if (ip->i_cow_blocks == 0)
> > > >  		xfs_inode_clear_cowblocks_tag(ip);
> > > >  
> > > >  	return error;
> > > > @@ -704,7 +704,7 @@ xfs_reflink_end_cow(
> > > >  	trace_xfs_reflink_end_cow(ip, offset, count);
> > > >  
> > > >  	/* No COW extents?  That's easy! */
> > > > -	if (ifp->if_bytes == 0)
> > > > +	if (ip->i_cow_blocks == 0)
> > > >  		return 0;
> > > >  
> > > >  	offset_fsb = XFS_B_TO_FSBT(ip->i_mount, offset);
> > > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > > > index f3e0001..9d04cfb 100644
> > > > --- a/fs/xfs/xfs_super.c
> > > > +++ b/fs/xfs/xfs_super.c
> > > > @@ -989,6 +989,7 @@ xfs_fs_destroy_inode(
> > > >  	xfs_inactive(ip);
> > > >  
> > > >  	ASSERT(XFS_FORCED_SHUTDOWN(ip->i_mount) || ip->i_delayed_blks == 0);
> > > > +	ASSERT(XFS_FORCED_SHUTDOWN(ip->i_mount) || ip->i_cow_blocks == 0);
> > > >  	XFS_STATS_INC(ip->i_mount, vn_reclaim);
> > > >  
> > > >  	/*
> > > > 
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > the body of a message to majordomo@vger.kernel.org
> > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 06/11] xfs: fix up cowextsz allocation shortfalls
  2018-01-26 13:06       ` Brian Foster
@ 2018-01-26 19:12         ` Darrick J. Wong
  0 siblings, 0 replies; 60+ messages in thread
From: Darrick J. Wong @ 2018-01-26 19:12 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Fri, Jan 26, 2018 at 08:06:25AM -0500, Brian Foster wrote:
> On Thu, Jan 25, 2018 at 12:20:33PM -0800, Darrick J. Wong wrote:
> > On Thu, Jan 25, 2018 at 12:31:12PM -0500, Brian Foster wrote:
> > > On Tue, Jan 23, 2018 at 06:18:35PM -0800, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > 
> > > > In xfs_bmap_btalloc, we try using the CoW extent size hint to force
> > > > allocations to align (offset-wise) to cowextsz granularity to reduce CoW
> > > > fragmentation.  This works fine until we cannot satisfy the allocation
> > > > with enough blocks to cover the requested range and the alignment hints.
> > > > If this happens, return an unaligned region because if we don't the
> > > > extent trim functions cause us to return a zero-length extent to iomap,
> > > > which iomap doesn't catch and thus blows up.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > ---
> > > 
> > > Hmm.. is this a direct I/O thing? The description of the problem had me
> > 
> > Yes.
> > 
> > > wondering how we handle this with regard to dio and traditional extent
> > > size hints. It looks like we just return -ENOSPC if xfs_bmapi_write()
> > > doesn't return a mapping that covers the target range of the write (even
> > > if it apparently attempts to allocate part of the associated extent size
> > > hint range). E.g., see the nimaps == 0 check in
> > > xfs_iomap_write_direct() after we commit the transaction.
> > 
> > I did take a look at that, and didn't like it.
> > 
> > There's enough free space to fill the dio write, but the free space
> > itself is very fragmented so we can't honor the hint.  We did however
> > manage to allocate /some/ blocks, so we might as well return what we got
> > and let the next iteration of the iomap_apply loop try to fill the rest
> > of the write request.  We already reserved enough space, so the write
> > should succeed totally, not return to userspace with either a short
> > write or ENOSPC just because free space is fragmented.
> > 
> 
> I'm not following how a short write can necessarily be prevented, since
> space reservation doesn't guarantee contiguity and afaict we only make a
> single mapping call. I suppose the iomap level can loop, but that's
> outside of the context where blocks are reserved. Hm?
> 
> But regardless, this behavior seems reasonable to me if we apply it
> consistently between cow fork hint behavior and traditional extent size
> hint behavior. They are both hints, after all. I do think an nimaps
> check might still be appropriate in that reflink code path simply to
> cover the case of unexpected behavior or a bug, rather than brace for
> whatever is going to happen if we continue to shuffle a bogus imap
> around.

Yes, the new version makes the behavior consistent for both hints, and
adds the "got no blocks, so sad" check to _reflink_allocate_cow.
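
For reference, the offset fix-up from the hunk quoted below can be modelled
in userspace roughly as follows.  This is only a sketch of the v1 hunk in
this thread, not the new version; cow_fixup_offset and its parameter names
are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Given the hint-aligned offset the allocator worked with, the number
 * of blocks it actually returned, and the caller's original request,
 * pick the file offset at which to map the allocated blocks.
 */
static uint64_t
cow_fixup_offset(uint64_t aligned_off, uint64_t got_len,
		 uint64_t orig_off, uint64_t orig_len)
{
	assert(got_len > 0 && orig_len > 0);

	/* Too few blocks for even the original request: drop alignment. */
	if (got_len <= orig_len)
		return orig_off;

	/* Aligned extent would stop short of the request: slide it up. */
	if (aligned_off + got_len < orig_off + orig_len)
		return orig_off + orig_len - got_len;

	/* The aligned extent already covers the request. */
	return aligned_off;
}
```

For a request of 4 blocks at offset 10 that was widened to start at offset
0, getting only 3 blocks maps them at offset 10, getting 8 blocks maps them
at offset 6 so that they end where the request ends, and getting 32 blocks
leaves the aligned offset of 0 alone.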

--D

> Brian
> 
> > (The other problem is that if we return ENOSPC out of iomap_begin, that
> > error code will bubble all the way back to userspace even if we /did/
> > write something, which means that even the programs that handle short
> > dio writes correctly will see that ENOSPC and bail out.  Goldwyn has
> > been trying to fix that braindamage for some time now.)
> > 
> > > In fact, it looks like just repeating the failed write could eventually
> > > succeed if the issue is that there is actually enough free space
> > > available to allocate the hint range up to where the write is targeted,
> > > just no long enough extent available to fill the extent size hint range
> > > in a single bmapi_write call. That behavior is a bit strange, I admit,
> > > but I'm wondering if we could do the same thing for the cow hint. Would
> > > a similar nimaps check in the xfs_bmapi_write() caller resolve the bug
> > > described here?
> > > 
> > > If so and if we still care to actually change/fix the allocation
> > > behavior with regard to the hints, perhaps we could do that in a
> > > separate patch more generically for both hints..?
> > 
> > I get the feeling we could apply this change to all the data fork
> > bmap_btalloc calls too.  I'll go study that in more depth.
> > 
> > --D
> > 
> > > 
> > > Brian
> > > 
> > > >  fs/iomap.c               |    2 +-
> > > >  fs/xfs/libxfs/xfs_bmap.c |   21 +++++++++++++++++++--
> > > >  2 files changed, 20 insertions(+), 3 deletions(-)
> > > > 
> > > > 
> > > > diff --git a/fs/iomap.c b/fs/iomap.c
> > > > index e5de772..aec35a0 100644
> > > > --- a/fs/iomap.c
> > > > +++ b/fs/iomap.c
> > > > @@ -63,7 +63,7 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
> > > >  	ret = ops->iomap_begin(inode, pos, length, flags, &iomap);
> > > >  	if (ret)
> > > >  		return ret;
> > > > -	if (WARN_ON(iomap.offset > pos))
> > > > +	if (WARN_ON(iomap.offset > pos) || WARN_ON(iomap.length == 0))
> > > >  		return -EIO;
> > > >  
> > > >  	/*
> > > > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > > > index 93ce2c6..4ec1fdc5 100644
> > > > --- a/fs/xfs/libxfs/xfs_bmap.c
> > > > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > > > @@ -3480,8 +3480,20 @@ xfs_bmap_btalloc_filestreams(
> > > >  static void
> > > >  xfs_bmap_btalloc_cow(
> > > >  	struct xfs_bmalloca	*ap,
> > > > -	struct xfs_alloc_arg	*args)
> > > > +	struct xfs_alloc_arg	*args,
> > > > +	xfs_fileoff_t		orig_offset,
> > > > +	xfs_extlen_t		orig_length)
> > > >  {
> > > > +	/*
> > > > +	 * If we didn't get enough blocks to satisfy the cowextsize
> > > > +	 * aligned request, break the alignment and return whatever we
> > > > +	 * got; it's the best we can do.
> > > > +	 */
> > > > +	if (ap->length <= orig_length)
> > > > +		ap->offset = orig_offset;
> > > > +	else if (ap->offset + ap->length < orig_offset + orig_length)
> > > > +		ap->offset = orig_offset + orig_length - ap->length;
> > > > +
> > > >  	/* Filling a previously reserved extent; nothing to do here. */
> > > >  	if (ap->wasdel)
> > > >  		return;
> > > > @@ -3520,6 +3532,8 @@ xfs_bmap_btalloc(
> > > >  	xfs_agnumber_t	fb_agno;	/* ag number of ap->firstblock */
> > > >  	xfs_agnumber_t	ag;
> > > >  	xfs_alloc_arg_t	args;
> > > > +	xfs_fileoff_t	orig_offset;
> > > > +	xfs_extlen_t	orig_length;
> > > >  	xfs_extlen_t	blen;
> > > >  	xfs_extlen_t	nextminlen = 0;
> > > >  	int		nullfb;		/* true if ap->firstblock isn't set */
> > > > @@ -3529,6 +3543,8 @@ xfs_bmap_btalloc(
> > > >  	int		stripe_align;
> > > >  
> > > >  	ASSERT(ap->length);
> > > > +	orig_offset = ap->offset;
> > > > +	orig_length = ap->length;
> > > >  
> > > >  	mp = ap->ip->i_mount;
> > > >  
> > > > @@ -3745,7 +3761,8 @@ xfs_bmap_btalloc(
> > > >  		ASSERT(nullfb || fb_agno <= args.agno);
> > > >  		ap->length = args.len;
> > > >  		if (ap->flags & XFS_BMAPI_COWFORK) {
> > > > -			xfs_bmap_btalloc_cow(ap, &args);
> > > > +			xfs_bmap_btalloc_cow(ap, &args, orig_offset,
> > > > +					orig_length);
> > > >  		} else {
> > > >  			ap->ip->i_d.di_nblocks += args.len;
> > > >  			xfs_trans_log_inode(ap->tp, ap->ip, XFS_ILOG_CORE);
> > > > 
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > the body of a message to majordomo@vger.kernel.org
> > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 12/11] xfs: refactor quota code in xfs_bmap_btalloc
  2018-01-26 12:17   ` Christoph Hellwig
@ 2018-01-26 21:46     ` Darrick J. Wong
  0 siblings, 0 replies; 60+ messages in thread
From: Darrick J. Wong @ 2018-01-26 21:46 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Fri, Jan 26, 2018 at 04:17:16AM -0800, Christoph Hellwig wrote:
> On Wed, Jan 24, 2018 at 09:26:47PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Since we now have a dedicated function for dealing with CoW allocation
> > related quota updates in xfs_bmap_btalloc, we might as well refactor the
> > data/attr fork quota update into its own function too.
> 
> Any good reason not to have this merged with the cow fork side helper?
> 
> > +	/*
> > +	 * Adjust the disk quota also. This was reserved
> > +	 * earlier.
> > +	 */
> 
> Please use the full available line length for comments.
> 
> > +	xfs_trans_mod_dquot_byino(ap->tp, ap->ip,
> > +		ap->wasdel ? XFS_TRANS_DQ_DELBCOUNT : XFS_TRANS_DQ_BCOUNT,
> > +		(long) args.len);
> 
> I don't think we need this cast.

I fixed all of these; will have a new series out soon.

--D


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 05/11] xfs: track CoW blocks separately in the inode
  2018-01-26 19:00     ` Darrick J. Wong
@ 2018-01-26 23:51       ` Darrick J. Wong
  0 siblings, 0 replies; 60+ messages in thread
From: Darrick J. Wong @ 2018-01-26 23:51 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Fri, Jan 26, 2018 at 11:00:58AM -0800, Darrick J. Wong wrote:
> On Fri, Jan 26, 2018 at 04:15:46AM -0800, Christoph Hellwig wrote:
> > On Tue, Jan 23, 2018 at 06:18:29PM -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > Track the number of blocks reserved in the CoW fork so that we can
> > > move the quota reservations whenever we chown, and don't account for
> > > CoW fork delalloc reservations in i_delayed_blks.  This should make
> > > chown work properly for quota reservations, enables us to fully
> > > account for real extents in the cow fork in the file stat info, and
> > > improves the post-eof scanning decisions because we're no longer
> > > confusing data fork delalloc extents with cow fork delalloc extents.
> > 
> > Just curious:  is there any good reason we can't just have an
> > i_extra_blocks field for the delayed and cow blocks?  Or is there
> > a place where we care about the difference between the two?
> 
> "cow blocks" now includes real and unwritten extents sitting around in
> the cow fork in addition to delalloc extents in the cow fork, and I
> didn't want the field to have overlapping meanings.  On a practical
> level, it also means we avoid eofblocks scans on inodes that have cow
> blocks but no da blocks.

Oh. Duh, we have the inode tags for that.  Ok, dropping this patch;
will integrate the two i_delayed_blks twiddles we need into the one that
fixes the quota accounting.

--D

> --D
> 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 02/11] xfs: only grab shared inode locks for source file during reflink
  2018-01-26 12:07   ` Christoph Hellwig
  2018-01-26 18:48     ` Darrick J. Wong
@ 2018-01-27  3:32     ` Dave Chinner
  1 sibling, 0 replies; 60+ messages in thread
From: Dave Chinner @ 2018-01-27  3:32 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Darrick J. Wong, linux-xfs

On Fri, Jan 26, 2018 at 04:07:41AM -0800, Christoph Hellwig wrote:
> > +xfs_lock_two_inodes_separately(
> > +	struct xfs_inode	*ip0,
> > +	uint			ip0_mode,
> > +	struct xfs_inode	*ip1,
> > +	uint			ip1_mode)
> 
> We only have 6 calls to xfs_lock_two_inodes in total, so just update
> the signature to take two modes and be done with it.
> 
> Also how about mode1 and mode2 for the argument names?
> 
> >  	lp = (xfs_log_item_t *)ip0->i_itemp;
> >  	if (lp && (lp->li_flags & XFS_LI_IN_AIL)) {
> > -		if (!xfs_ilock_nowait(ip1, xfs_lock_inumorder(lock_mode, 1))) {
> > -			xfs_iunlock(ip0, lock_mode);
> > +		if (!xfs_ilock_nowait(ip1, xfs_lock_inumorder(ip1_mode, 1))) {
> > +			xfs_iunlock(ip0, ip0_mode);
> >  			if ((++attempts % 5) == 0)
> >  				delay(1); /* Don't just spin the CPU */
> >  			goto again;
> >  		}
> >  	} else {
> > -		xfs_ilock(ip1, xfs_lock_inumorder(lock_mode, 1));
> > +		xfs_ilock(ip1, xfs_lock_inumorder(ip1_mode, 1));
> >  	}
> >  }
> 
> Not directly related to your patch, but the nowait + retry
> mess must go away.
> 
> I think we need to move to the VFS locking conventions, that is
> based on ancestors for directories (see lock_rename) and otherwise
> based on the struct inode address as in lock_two_nondirectories.

I'm pretty sure this has nothing to do with directory lock order.
It's preventing deadlocks with xfs_iflush_cluster() where we lock
other inodes in the cluster and flush them to the backing buffer
while we still hold the ILOCK for the original inode we are pushing.

That's why this is all conditional on the XFS_LI_IN_AIL flag - the
inode writeback code won't be holding or attempting to lock the
inode if it is not in the AIL (i.e. it is clean).

Hence I don't think this trylock+backoff can ever go away here,
because there's no guaranteed lock ordering relationship between
directory structure and location in the inode cluster....
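
To make the pattern concrete, here is a small userspace model of the
trylock + backoff loop, assuming a toy lock built on C11 atomic_flag.
None of these names are kernel APIs.  The max_attempts bound and the
backoffs counter exist only so the sketch can be exercised; the real code
retries indefinitely, calls delay(1) on every fifth failure, and only
takes the trylock path when the first inode is in the AIL.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy lock: set means held.  A stand-in for the ILOCK, not a real rwsem. */
struct toy_lock {
	atomic_flag	held;
};

static bool
toy_trylock(struct toy_lock *l)
{
	return !atomic_flag_test_and_set(&l->held);
}

static void
toy_unlock(struct toy_lock *l)
{
	atomic_flag_clear(&l->held);
}

static void
toy_lock(struct toy_lock *l)
{
	while (!toy_trylock(l))
		;	/* spin; a real lock would sleep */
}

/* Counts how often we would have slept instead of spinning. */
static int backoffs;

/*
 * Lock 'first', then try 'second'.  If 'second' is busy, drop 'first'
 * so its holder can make progress, and start over.  Returns the number
 * of attempts used, or -1 after max_attempts failures.
 */
static int
lock_two_with_backoff(struct toy_lock *first, struct toy_lock *second,
		      int max_attempts)
{
	int	attempts;

	assert(first != second && max_attempts > 0);

	for (attempts = 1; attempts <= max_attempts; attempts++) {
		toy_lock(first);
		if (toy_trylock(second))
			return attempts;
		toy_unlock(first);
		if (attempts % 5 == 0)
			backoffs++;	/* stands in for delay(1) */
	}
	return -1;
}
```

The point of dropping the first lock on failure is that whoever holds the
second lock may itself be waiting for the first one; holding on and
blocking would deadlock, whatever order the two were taken in.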

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 60+ messages in thread

end of thread, other threads:[~2018-01-27  3:52 UTC | newest]

Thread overview: 60+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-24  2:17 [PATCH 00/11] xfs: reflink/scrub/quota fixes Darrick J. Wong
2018-01-24  2:18 ` [PATCH 01/11] xfs: reflink should break pnfs leases before sharing blocks Darrick J. Wong
2018-01-24 14:16   ` Brian Foster
2018-01-26  9:06   ` Christoph Hellwig
2018-01-26 18:26     ` Darrick J. Wong
2018-01-24  2:18 ` [PATCH 02/11] xfs: only grab shared inode locks for source file during reflink Darrick J. Wong
2018-01-24 14:18   ` Brian Foster
2018-01-24 18:40     ` Darrick J. Wong
2018-01-26 12:07   ` Christoph Hellwig
2018-01-26 18:48     ` Darrick J. Wong
2018-01-27  3:32     ` Dave Chinner
2018-01-24  2:18 ` [PATCH 03/11] xfs: call xfs_qm_dqattach before performing reflink operations Darrick J. Wong
2018-01-24 14:18   ` Brian Foster
2018-01-26  9:07   ` Christoph Hellwig
2018-01-24  2:18 ` [PATCH 04/11] xfs: CoW fork operations should only update quota reservations Darrick J. Wong
2018-01-24 14:22   ` Brian Foster
2018-01-24 19:14     ` Darrick J. Wong
2018-01-25 13:01       ` Brian Foster
2018-01-25 17:52         ` Darrick J. Wong
2018-01-25  1:20   ` [PATCH v2 " Darrick J. Wong
2018-01-25 13:03     ` Brian Foster
2018-01-25 18:20       ` Darrick J. Wong
2018-01-26 13:02         ` Brian Foster
2018-01-26 18:40           ` Darrick J. Wong
2018-01-26 12:12     ` Christoph Hellwig
2018-01-24  2:18 ` [PATCH 05/11] xfs: track CoW blocks separately in the inode Darrick J. Wong
2018-01-25 13:06   ` Brian Foster
2018-01-25 19:21     ` Darrick J. Wong
2018-01-26 13:04       ` Brian Foster
2018-01-26 19:08         ` Darrick J. Wong
2018-01-26 12:15   ` Christoph Hellwig
2018-01-26 19:00     ` Darrick J. Wong
2018-01-26 23:51       ` Darrick J. Wong
2018-01-24  2:18 ` [PATCH 06/11] xfs: fix up cowextsz allocation shortfalls Darrick J. Wong
2018-01-25 17:31   ` Brian Foster
2018-01-25 20:20     ` Darrick J. Wong
2018-01-26 13:06       ` Brian Foster
2018-01-26 19:12         ` Darrick J. Wong
2018-01-26  9:11   ` Christoph Hellwig
2018-01-24  2:18 ` [PATCH 07/11] xfs: always zero di_flags2 when we free the inode Darrick J. Wong
2018-01-25 17:31   ` Brian Foster
2018-01-25 18:36     ` Darrick J. Wong
2018-01-26  9:08   ` Christoph Hellwig
2018-01-24  2:18 ` [PATCH 08/11] xfs: fix tracepoint %p formats Darrick J. Wong
2018-01-25 17:31   ` Brian Foster
2018-01-25 18:47     ` Darrick J. Wong
2018-01-26  0:19       ` Darrick J. Wong
2018-01-26  9:09         ` Christoph Hellwig
2018-01-24  2:18 ` [PATCH 09/11] xfs: make tracepoint inode number format consistent Darrick J. Wong
2018-01-25 17:31   ` Brian Foster
2018-01-26  9:09   ` Christoph Hellwig
2018-01-24  2:19 ` [PATCH 10/11] xfs: refactor inode verifier corruption error printing Darrick J. Wong
2018-01-25 17:31   ` Brian Foster
2018-01-25 18:23     ` Darrick J. Wong
2018-01-26  9:10   ` Christoph Hellwig
2018-01-24  2:19 ` [PATCH 11/11] xfs: don't clobber inobt/finobt cursors when xref with rmap Darrick J. Wong
2018-01-26  9:10   ` Christoph Hellwig
2018-01-25  5:26 ` [PATCH 12/11] xfs: refactor quota code in xfs_bmap_btalloc Darrick J. Wong
2018-01-26 12:17   ` Christoph Hellwig
2018-01-26 21:46     ` Darrick J. Wong
