Linux-XFS Archive on lore.kernel.org
 help / color / Atom feed
* [RFC PATCH 0/2] xfs: more unlinked inode list optimization v1
@ 2020-07-07 13:57 Gao Xiang
  2020-07-07 13:57 ` [RFC PATCH 1/2] xfs: arrange all unlinked inodes into one list Gao Xiang
                   ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Gao Xiang @ 2020-07-07 13:57 UTC (permalink / raw)
  To: linux-xfs; +Cc: Dave Chinner, Darrick J. Wong, Brian Foster, Gao Xiang

Hi forks,

This RFC patchset mainly addresses the thoughts [*] and [**] from Dave's
original patchset,
https://lore.kernel.org/r/20200623095015.1934171-1-david@fromorbit.com

In short, it focues on the following ideas mentioned by Dave:
 - use bucket 0 instead of multiple buckets since in-memory double
   linked list finally works;

 - avoid taking AGI buffer and unnecessary AGI update if possible, so
   1) add a new lock and keep proper locking order to avoid deadlock;
   2) insert a new unlinked inode from the tail instead of head;

In addition, it's worth noticing 3 things:
 - xfs_iunlink_remove() should support old multiple buckets in order
   to keep old inode unlinked list (old image) working when recovering.

 - (but) OTOH, the old kernel recovery _shouldn't_ work with new image
   since the bucket_index from old xfs_iunlink_remove() is generated
   by the old formula (rather than keep in xfs_inode), which is now
   fixed as 0. So this feature is not forward compatible without some
   extra backport patches;

 - a tail xfs_inode pointer is also added in the perag, which keeps 
   track of the tail of bucket 0 since it's mainly used for xfs_iunlink().

 - the old kernel recovery shouldn't work with new image since
     its bucket_index from old kernel is 

The git tree is also available at
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/linux.git tags/xfs/iunlink_opt_v1

Some limited test for debugging only is done (mainly fsstress and
manual power-cut tests), so it could not work as expected just like
my limited broken xfs knowledge. But I will go on improving this
patchset recently. 

Any comments and directions are welcome. :)

Thanks,
Gao Xiang

Gao Xiang (2):
  xfs: arrange all unlinked inodes into one list
  xfs: don't access AGI on unlinked inodes if it can

 fs/xfs/xfs_inode.c       | 283 ++++++++++++++++++++-------------------
 fs/xfs/xfs_log_recover.c |   6 +
 fs/xfs/xfs_mount.c       |   3 +
 fs/xfs/xfs_mount.h       |   3 +
 4 files changed, 160 insertions(+), 135 deletions(-)

-- 
2.18.1


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [RFC PATCH 1/2] xfs: arrange all unlinked inodes into one list
  2020-07-07 13:57 [RFC PATCH 0/2] xfs: more unlinked inode list optimization v1 Gao Xiang
@ 2020-07-07 13:57 ` Gao Xiang
  2020-07-08 22:33   ` Dave Chinner
  2020-07-07 13:57 ` [RFC PATCH 2/2] xfs: don't access AGI on unlinked inodes if it can Gao Xiang
  2020-07-24  6:12 ` [RFC PATCH v2 0/3] xfs: more unlinked inode list optimization v2 Gao Xiang
  2 siblings, 1 reply; 29+ messages in thread
From: Gao Xiang @ 2020-07-07 13:57 UTC (permalink / raw)
  To: linux-xfs; +Cc: Dave Chinner, Darrick J. Wong, Brian Foster, Gao Xiang

There is no need to keep old multiple short unlink inode buckets
since we have an in-memory double linked list for all unlinked
inodes.

Apart from the perspective of the necessity, the main advantage
is that the log and AGI update can be reduced since each AG has
the only one head now, which is implemented in the following patch.

Therefore, this patch applies the new way in xfs_iunlink() and
keep the old approach in xfs_iunlink_remove_inode() path as well
so inode eviction can still work properly in recovery.

Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
---
 fs/xfs/xfs_inode.c | 40 ++++++++++++++++++++--------------------
 1 file changed, 20 insertions(+), 20 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index ab288424764c..10565fa5ace4 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -33,6 +33,7 @@
 #include "xfs_symlink.h"
 #include "xfs_trans_priv.h"
 #include "xfs_log.h"
+#include "xfs_log_priv.h"
 #include "xfs_bmap_btree.h"
 #include "xfs_reflink.h"
 #include "xfs_iunlink_item.h"
@@ -1955,25 +1956,32 @@ xfs_iunlink_update_bucket(
 	struct xfs_trans	*tp,
 	xfs_agnumber_t		agno,
 	struct xfs_buf		*agibp,
-	unsigned int		bucket_index,
+	xfs_agino_t		old_agino,
 	xfs_agino_t		new_agino)
 {
+	struct xlog		*log = tp->t_mountp->m_log;
 	struct xfs_agi		*agi = agibp->b_addr;
 	xfs_agino_t		old_value;
-	int			offset;
+	unsigned int		bucket_index;
+	int                     offset;
 
 	ASSERT(xfs_verify_agino_or_null(tp->t_mountp, agno, new_agino));
 
+	bucket_index = 0;
+	/* During recovery, the old multiple bucket index can be applied */
+	if (!log || log->l_flags & XLOG_RECOVERY_NEEDED) {
+		ASSERT(old_agino != NULLAGINO);
+
+		if (be32_to_cpu(agi->agi_unlinked[0]) != old_agino)
+			bucket_index = old_agino % XFS_AGI_UNLINKED_BUCKETS;
+	}
+
 	old_value = be32_to_cpu(agi->agi_unlinked[bucket_index]);
 	trace_xfs_iunlink_update_bucket(tp->t_mountp, agno, bucket_index,
 			old_value, new_agino);
 
-	/*
-	 * We should never find the head of the list already set to the value
-	 * passed in because either we're adding or removing ourselves from the
-	 * head of the list.
-	 */
-	if (old_value == new_agino) {
+	/* check if the old agi_unlinked head is as expected */
+	if (old_value != old_agino) {
 		xfs_buf_mark_corrupt(agibp);
 		return -EFSCORRUPTED;
 	}
@@ -2001,14 +2009,13 @@ xfs_iunlink_insert_inode(
 	xfs_agino_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
 	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
 	xfs_agino_t		next_agino;
-	short			bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
 
 	/*
 	 * Get the index into the agi hash table for the list this inode will
 	 * go on.  Make sure the pointer isn't garbage and that this inode
 	 * isn't already on the list.
 	 */
-	next_agino = be32_to_cpu(agi->agi_unlinked[bucket_index]);
+	next_agino = be32_to_cpu(agi->agi_unlinked[0]);
 	if (next_agino == agino ||
 	    !xfs_verify_agino_or_null(mp, agno, next_agino)) {
 		xfs_buf_mark_corrupt(agibp);
@@ -2036,7 +2043,7 @@ xfs_iunlink_insert_inode(
 	}
 
 	/* Point the head of the list to point to this inode. */
-	return xfs_iunlink_update_bucket(tp, agno, agibp, bucket_index, agino);
+	return xfs_iunlink_update_bucket(tp, agno, agibp, next_agino, agino);
 }
 
 /*
@@ -2051,27 +2058,20 @@ xfs_iunlink_remove_inode(
 	struct xfs_inode	*ip)
 {
 	struct xfs_mount	*mp = tp->t_mountp;
-	struct xfs_agi		*agi = agibp->b_addr;
 	xfs_agino_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
 	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
 	xfs_agino_t		next_agino = ip->i_next_unlinked;
-	short			bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
 	int			error;
 
 	if (ip->i_prev_unlinked == NULLAGINO) {
 		/* remove from head of list */
-		if (be32_to_cpu(agi->agi_unlinked[bucket_index]) != agino) {
-			xfs_buf_mark_corrupt(agibp);
-			return -EFSCORRUPTED;
-		}
 		if (next_agino == agino ||
 		    !xfs_verify_agino_or_null(mp, agno, next_agino))
 			return -EFSCORRUPTED;
 
-		error = xfs_iunlink_update_bucket(tp, agno, agibp,
-					bucket_index, next_agino);
+		error = xfs_iunlink_update_bucket(tp, agno, agibp, agino, next_agino);
 		if (error)
-			return -EFSCORRUPTED;
+			return error;
 	} else {
 		/* lookup previous inode and update to point at next */
 		struct xfs_inode	*pip;
-- 
2.18.1


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [RFC PATCH 2/2] xfs: don't access AGI on unlinked inodes if it can
  2020-07-07 13:57 [RFC PATCH 0/2] xfs: more unlinked inode list optimization v1 Gao Xiang
  2020-07-07 13:57 ` [RFC PATCH 1/2] xfs: arrange all unlinked inodes into one list Gao Xiang
@ 2020-07-07 13:57 ` Gao Xiang
  2020-07-08 17:03   ` Darrick J. Wong
  2020-07-08 23:33   ` Dave Chinner
  2020-07-24  6:12 ` [RFC PATCH v2 0/3] xfs: more unlinked inode list optimization v2 Gao Xiang
  2 siblings, 2 replies; 29+ messages in thread
From: Gao Xiang @ 2020-07-07 13:57 UTC (permalink / raw)
  To: linux-xfs; +Cc: Dave Chinner, Darrick J. Wong, Brian Foster, Gao Xiang

Currently, we use AGI buffer lock to protect in-memory linked list for
unlinked inodes but since it's not necessary to modify AGI unless the
head of the unlinked list is modified. So let's removing the AGI buffer
modification dependency if possible, including 1) adding another per-AG
dedicated lock to protect the whole list and 2) inserting unlinked
inodes from tail.

For 2), the tail of bucket 0 is now recorded in perag for xfs_iunlink()
to use. xfs_iunlink_remove() still support old multiple short bucket
lists for recovery code.

Note that some paths take AGI lock in its transaction in advance,
so the proper locking order is only AGI lock -> unlinked list lock.

Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
---
 fs/xfs/xfs_inode.c       | 251 ++++++++++++++++++++-------------------
 fs/xfs/xfs_log_recover.c |   6 +
 fs/xfs/xfs_mount.c       |   3 +
 fs/xfs/xfs_mount.h       |   3 +
 4 files changed, 144 insertions(+), 119 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 10565fa5ace4..d33e5b198534 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1994,182 +1994,195 @@ xfs_iunlink_update_bucket(
 }
 
 /*
- * Always insert at the head, so we only have to do a next inode lookup to
- * update it's prev pointer. The AGI bucket will point at the one we are
- * inserting.
+ * This is called when the inode's link count has gone to 0 or we are creating
+ * a tmpfile via O_TMPFILE.  The inode @ip must have nlink == 0.
+ *
+ * We place the on-disk inode on a list in the AGI.  It will be pulled from this
+ * list when the inode is freed.
  */
-static int
-xfs_iunlink_insert_inode(
+STATIC int
+xfs_iunlink(
 	struct xfs_trans	*tp,
-	struct xfs_buf		*agibp,
 	struct xfs_inode	*ip)
 {
 	struct xfs_mount	*mp = tp->t_mountp;
-	struct xfs_agi		*agi = agibp->b_addr;
-	xfs_agino_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
+	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
+	struct xfs_perag	*pag;
+	struct xfs_inode	*pip;
 	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
-	xfs_agino_t		next_agino;
+	int			error;
+
+	ASSERT(VFS_I(ip)->i_nlink == 0);
+	ASSERT(VFS_I(ip)->i_mode != 0);
+	trace_xfs_iunlink(ip);
+
+	pag = xfs_perag_get(mp, agno);
 
 	/*
-	 * Get the index into the agi hash table for the list this inode will
-	 * go on.  Make sure the pointer isn't garbage and that this inode
-	 * isn't already on the list.
+	 * some paths (e.g. xfs_create_tmpfile) could take AGI lock
+	 * in this transaction in advance and the only locking order
+	 * AGI buf lock -> pag_unlinked_mutex is safe.
 	 */
-	next_agino = be32_to_cpu(agi->agi_unlinked[0]);
-	if (next_agino == agino ||
-	    !xfs_verify_agino_or_null(mp, agno, next_agino)) {
-		xfs_buf_mark_corrupt(agibp);
-		return -EFSCORRUPTED;
-	}
-
-	ip->i_prev_unlinked = NULLAGINO;
-	ip->i_next_unlinked = next_agino;
-	if (ip->i_next_unlinked != NULLAGINO) {
-		struct xfs_inode	*nip;
+	mutex_lock(&pag->pag_unlinked_mutex);
+	pip = pag->pag_unlinked_tail;
+	if (!pip) {
+		struct xfs_buf	*agibp;
 
-		nip = xfs_iunlink_lookup_next(agibp->b_pag, ip);
-		if (IS_ERR_OR_NULL(nip))
-			return -EFSCORRUPTED;
+		mutex_unlock(&pag->pag_unlinked_mutex);
 
-		if (nip->i_prev_unlinked != NULLAGINO) {
-			xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__,
-						NULL, 0, __this_address);
-			return -EFSCORRUPTED;
+		/*
+		 * there could be some race, but it doesn't matter though
+		 * since !pip doesn't happen frequently.
+		 */
+		error = xfs_read_agi(mp, tp, agno, &agibp);
+		if (error) {
+			xfs_perag_put(pag);
+			return error;
 		}
-		nip->i_prev_unlinked = agino;
 
-		/* update the on disk inode now */
-		xfs_iunlink_log(tp, ip);
-	}
+		mutex_lock(&pag->pag_unlinked_mutex);
+		pip = pag->pag_unlinked_tail;
+		if (!pip) {
+			struct xfs_agi	*agi;
+
+			ip->i_prev_unlinked = NULLAGINO;
 
-	/* Point the head of the list to point to this inode. */
-	return xfs_iunlink_update_bucket(tp, agno, agibp, next_agino, agino);
+			agi = agibp->b_addr;
+			ASSERT(be32_to_cpu(agi->agi_unlinked[0]) == NULLAGINO);
+
+			/* Point the head of the list to point to this inode. */
+			error = xfs_iunlink_update_bucket(tp, agno,
+					agibp, NULLAGINO, agino);
+			goto out;
+		}
+	}
+	ip->i_prev_unlinked = XFS_INO_TO_AGINO(mp, pip->i_ino);
+	ASSERT(pip->i_next_unlinked == NULLAGINO);
+	pip->i_next_unlinked = agino;
+	xfs_iunlink_log(tp, pip);
+out:
+	ip->i_next_unlinked = NULLAGINO;
+	pag->pag_unlinked_tail = ip;
+	mutex_unlock(&pag->pag_unlinked_mutex);
+	xfs_perag_put(pag);
+	return 0;
 }
 
 /*
+ * Pull the on-disk inode from the AGI unlinked list.
+ *
  * Remove can be from anywhere in the list, so we have to do two adjacent inode
  * lookups here so we can update list pointers. We may be at the head or the
  * tail of the list, so we have to handle those cases as well.
  */
-static int
-xfs_iunlink_remove_inode(
+STATIC int
+xfs_iunlink_remove(
 	struct xfs_trans	*tp,
-	struct xfs_buf		*agibp,
 	struct xfs_inode	*ip)
 {
 	struct xfs_mount	*mp = tp->t_mountp;
-	xfs_agino_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
+	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
+	struct xfs_perag	*pag;
+	struct xfs_inode	*pip = NULL;
 	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
-	xfs_agino_t		next_agino = ip->i_next_unlinked;
+	xfs_agino_t		next_agino;
 	int			error;
 
+	trace_xfs_iunlink_remove(ip);
+
+	pag = xfs_perag_get(mp, agno);
+	mutex_lock(&pag->pag_unlinked_mutex);
+
+	/* see the comment in xfs_iunlink() on the only proper locking order */
 	if (ip->i_prev_unlinked == NULLAGINO) {
-		/* remove from head of list */
-		if (next_agino == agino ||
-		    !xfs_verify_agino_or_null(mp, agno, next_agino))
-			return -EFSCORRUPTED;
+		struct xfs_buf	*agibp;
 
-		error = xfs_iunlink_update_bucket(tp, agno, agibp, agino, next_agino);
-		if (error)
+		mutex_unlock(&pag->pag_unlinked_mutex);
+
+		error = xfs_read_agi(mp, tp, agno, &agibp);
+		if (error) {
+			xfs_perag_put(pag);
 			return error;
-	} else {
-		/* lookup previous inode and update to point at next */
-		struct xfs_inode	*pip;
+		}
 
-		pip = xfs_iunlink_lookup_prev(agibp->b_pag, ip);
-		if (IS_ERR_OR_NULL(pip))
-			return -EFSCORRUPTED;
+		mutex_lock(&pag->pag_unlinked_mutex);
+		if (ip->i_prev_unlinked == NULLAGINO) {
+			struct xfs_agi	*agi;
 
-		if (pip->i_next_unlinked != agino) {
-			xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__,
-						NULL, 0, __this_address);
-			return -EFSCORRUPTED;
+			next_agino = ip->i_next_unlinked;
+			agi = agibp->b_addr;
+
+			/* remove from head of list */
+			if (next_agino == agino ||
+			    !xfs_verify_agino_or_null(mp, agno, next_agino))
+				goto err_fscorrupted;
+
+			error = xfs_iunlink_update_bucket(tp, agno,
+				agibp, agino, next_agino);
+			if (error)
+				goto err_out;
+
+			goto next_fixup;
 		}
+	}
 
-		/* update the on disk inode now */
-		pip->i_next_unlinked = next_agino;
-		xfs_iunlink_log(tp, pip);
+	next_agino = ip->i_next_unlinked;
+
+	/* lookup previous inode and update to point at next */
+	pip = xfs_iunlink_lookup_prev(pag, ip);
+	if (IS_ERR_OR_NULL(pip))
+		goto err_fscorrupted;
+
+	if (pip->i_next_unlinked != agino) {
+		xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__,
+					NULL, 0, __this_address);
+		goto err_fscorrupted;
 	}
+	pip->i_next_unlinked = next_agino;
+	xfs_iunlink_log(tp, pip);
 
-	/* lookup the next inode and update to point at prev */
-	if (ip->i_next_unlinked != NULLAGINO) {
+next_fixup:
+	if (next_agino == NULLAGINO) {
+		/* so iunlink recovery can work here */
+		if (pag->pag_unlinked_tail == ip)
+			pag->pag_unlinked_tail = pip;
+	} else {
+		/* lookup the next inode and update to point at prev */
 		struct xfs_inode	*nip;
 
-		nip = xfs_iunlink_lookup_next(agibp->b_pag, ip);
+		nip = xfs_iunlink_lookup_next(pag, ip);
 		if (IS_ERR_OR_NULL(nip))
-			return -EFSCORRUPTED;
+			goto err_fscorrupted;
 
 		if (nip->i_prev_unlinked != agino) {
 			xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__,
 						NULL, 0, __this_address);
-			return -EFSCORRUPTED;
+			goto err_fscorrupted;
 		}
 		/* in memory update only */
 		nip->i_prev_unlinked = ip->i_prev_unlinked;
 	}
+	mutex_unlock(&pag->pag_unlinked_mutex);
+	xfs_perag_put(pag);
 
+	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
 	/*
 	 * Now clear prev/next from this inode and update on disk if we
 	 * need to clear the on-disk link.
 	 */
 	ip->i_prev_unlinked = NULLAGINO;
-	ip->i_next_unlinked = NULLAGINO;
-	if (next_agino != NULLAGINO)
+	if (next_agino != NULLAGINO) {
+		ip->i_next_unlinked = NULLAGINO;
 		xfs_iunlink_log(tp, ip);
+	}
 	return 0;
-}
 
-/*
- * This is called when the inode's link count has gone to 0 or we are creating
- * a tmpfile via O_TMPFILE.  The inode @ip must have nlink == 0.
- *
- * We place the on-disk inode on a list in the AGI.  It will be pulled from this
- * list when the inode is freed.
- */
-STATIC int
-xfs_iunlink(
-	struct xfs_trans	*tp,
-	struct xfs_inode	*ip)
-{
-	struct xfs_mount	*mp = tp->t_mountp;
-	struct xfs_buf		*agibp;
-	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
-	int			error;
-
-	ASSERT(ip->i_next_unlinked == NULLAGINO);
-	ASSERT(VFS_I(ip)->i_nlink == 0);
-	ASSERT(VFS_I(ip)->i_mode != 0);
-	trace_xfs_iunlink(ip);
-
-	/* Get the agi buffer first.  It ensures lock ordering on the list. */
-	error = xfs_read_agi(mp, tp, agno, &agibp);
-	if (error)
-		return error;
-
-	return xfs_iunlink_insert_inode(tp, agibp, ip);
-}
-
-/*
- * Pull the on-disk inode from the AGI unlinked list.
- */
-STATIC int
-xfs_iunlink_remove(
-	struct xfs_trans	*tp,
-	struct xfs_inode	*ip)
-{
-	struct xfs_mount	*mp = tp->t_mountp;
-	struct xfs_buf		*agibp;
-	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
-	int			error;
-
-	trace_xfs_iunlink_remove(ip);
-
-	/* Get the agi buffer first.  It ensures lock ordering on the list. */
-	error = xfs_read_agi(mp, tp, agno, &agibp);
-	if (error)
-		return error;
-
-	return xfs_iunlink_remove_inode(tp, agibp, ip);
+err_fscorrupted:
+	error = -EFSCORRUPTED;
+err_out:
+	mutex_unlock(&pag->pag_unlinked_mutex);
+	xfs_perag_put(pag);
+	return error;
 }
 
 /*
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index d47eea31c165..3f2739316424 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -2801,6 +2801,11 @@ xlog_recover_iunlink_ag(
 					prev_ip->i_next_unlinked = NULLAGINO;
 				break;
 			}
+
+			/* XXX: take pag_unlinked_mutex across the loop? */
+			if (!bucket)
+				agibp->b_pag->pag_unlinked_tail = ip;
+
 			if (prev_ip) {
 				ip->i_prev_unlinked = prev_agino;
 				xfs_irele(prev_ip);
@@ -2812,6 +2817,7 @@ xlog_recover_iunlink_ag(
 		}
 		if (prev_ip)
 			xfs_irele(prev_ip);
+
 		if (error) {
 			/*
 			 * We can't read an inode this bucket points to, or an
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 031e96ff022d..a9701f863e84 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -223,6 +223,9 @@ xfs_initialize_perag(
 		if (first_initialised == NULLAGNUMBER)
 			first_initialised = index;
 		spin_lock_init(&pag->pag_state_lock);
+
+		mutex_init(&pag->pag_unlinked_mutex);
+		pag->pag_unlinked_tail = NULL;
 	}
 
 	index = xfs_set_inode_alloc(mp, agcount);
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index a72cfcaa4ad1..107a634a34f7 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -372,6 +372,9 @@ typedef struct xfs_perag {
 	/* reference count */
 	uint8_t			pagf_refcount_level;
 
+	struct mutex		pag_unlinked_mutex;
+	struct xfs_inode	*pag_unlinked_tail;
+
 	/*
 	 * Unlinked inode information.  This incore information reflects
 	 * data stored in the AGI, so callers must hold the AGI buffer lock
-- 
2.18.1


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 2/2] xfs: don't access AGI on unlinked inodes if it can
  2020-07-07 13:57 ` [RFC PATCH 2/2] xfs: don't access AGI on unlinked inodes if it can Gao Xiang
@ 2020-07-08 17:03   ` Darrick J. Wong
  2020-07-08 23:40     ` Gao Xiang
  2020-07-08 23:33   ` Dave Chinner
  1 sibling, 1 reply; 29+ messages in thread
From: Darrick J. Wong @ 2020-07-08 17:03 UTC (permalink / raw)
  To: Gao Xiang; +Cc: linux-xfs, Dave Chinner, Brian Foster

On Tue, Jul 07, 2020 at 09:57:41PM +0800, Gao Xiang wrote:
> Currently, we use AGI buffer lock to protect in-memory linked list for
> unlinked inodes but since it's not necessary to modify AGI unless the
> head of the unlinked list is modified. So let's removing the AGI buffer
> modification dependency if possible, including 1) adding another per-AG
> dedicated lock to protect the whole list and 2) inserting unlinked
> inodes from tail.
> 
> For 2), the tail of bucket 0 is now recorded in perag for xfs_iunlink()
> to use. xfs_iunlink_remove() still support old multiple short bucket
> lists for recovery code.
> 
> Note that some paths take AGI lock in its transaction in advance,
> so the proper locking order is only AGI lock -> unlinked list lock.

How much agi buffer lock contention does this eliminate?

> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
> ---
>  fs/xfs/xfs_inode.c       | 251 ++++++++++++++++++++-------------------
>  fs/xfs/xfs_log_recover.c |   6 +
>  fs/xfs/xfs_mount.c       |   3 +
>  fs/xfs/xfs_mount.h       |   3 +
>  4 files changed, 144 insertions(+), 119 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 10565fa5ace4..d33e5b198534 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -1994,182 +1994,195 @@ xfs_iunlink_update_bucket(
>  }
>  
>  /*
> - * Always insert at the head, so we only have to do a next inode lookup to
> - * update it's prev pointer. The AGI bucket will point at the one we are
> - * inserting.
> + * This is called when the inode's link count has gone to 0 or we are creating
> + * a tmpfile via O_TMPFILE.  The inode @ip must have nlink == 0.
> + *
> + * We place the on-disk inode on a list in the AGI.  It will be pulled from this
> + * list when the inode is freed.
>   */
> -static int
> -xfs_iunlink_insert_inode(
> +STATIC int
> +xfs_iunlink(
>  	struct xfs_trans	*tp,
> -	struct xfs_buf		*agibp,
>  	struct xfs_inode	*ip)
>  {
>  	struct xfs_mount	*mp = tp->t_mountp;
> -	struct xfs_agi		*agi = agibp->b_addr;
> -	xfs_agino_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
> +	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
> +	struct xfs_perag	*pag;
> +	struct xfs_inode	*pip;
>  	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
> -	xfs_agino_t		next_agino;
> +	int			error;
> +
> +	ASSERT(VFS_I(ip)->i_nlink == 0);
> +	ASSERT(VFS_I(ip)->i_mode != 0);
> +	trace_xfs_iunlink(ip);
> +
> +	pag = xfs_perag_get(mp, agno);
>  
>  	/*
> -	 * Get the index into the agi hash table for the list this inode will
> -	 * go on.  Make sure the pointer isn't garbage and that this inode
> -	 * isn't already on the list.
> +	 * some paths (e.g. xfs_create_tmpfile) could take AGI lock
> +	 * in this transaction in advance and the only locking order
> +	 * AGI buf lock -> pag_unlinked_mutex is safe.
>  	 */
> -	next_agino = be32_to_cpu(agi->agi_unlinked[0]);
> -	if (next_agino == agino ||
> -	    !xfs_verify_agino_or_null(mp, agno, next_agino)) {
> -		xfs_buf_mark_corrupt(agibp);
> -		return -EFSCORRUPTED;
> -	}
> -
> -	ip->i_prev_unlinked = NULLAGINO;
> -	ip->i_next_unlinked = next_agino;
> -	if (ip->i_next_unlinked != NULLAGINO) {
> -		struct xfs_inode	*nip;
> +	mutex_lock(&pag->pag_unlinked_mutex);
> +	pip = pag->pag_unlinked_tail;
> +	if (!pip) {
> +		struct xfs_buf	*agibp;
>  
> -		nip = xfs_iunlink_lookup_next(agibp->b_pag, ip);
> -		if (IS_ERR_OR_NULL(nip))
> -			return -EFSCORRUPTED;
> +		mutex_unlock(&pag->pag_unlinked_mutex);
>  
> -		if (nip->i_prev_unlinked != NULLAGINO) {
> -			xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__,
> -						NULL, 0, __this_address);
> -			return -EFSCORRUPTED;
> +		/*
> +		 * there could be some race, but it doesn't matter though
> +		 * since !pip doesn't happen frequently.
> +		 */
> +		error = xfs_read_agi(mp, tp, agno, &agibp);
> +		if (error) {
> +			xfs_perag_put(pag);
> +			return error;
>  		}
> -		nip->i_prev_unlinked = agino;
>  
> -		/* update the on disk inode now */
> -		xfs_iunlink_log(tp, ip);
> -	}
> +		mutex_lock(&pag->pag_unlinked_mutex);
> +		pip = pag->pag_unlinked_tail;
> +		if (!pip) {
> +			struct xfs_agi	*agi;
> +
> +			ip->i_prev_unlinked = NULLAGINO;
>  
> -	/* Point the head of the list to point to this inode. */
> -	return xfs_iunlink_update_bucket(tp, agno, agibp, next_agino, agino);
> +			agi = agibp->b_addr;
> +			ASSERT(be32_to_cpu(agi->agi_unlinked[0]) == NULLAGINO);
> +
> +			/* Point the head of the list to point to this inode. */
> +			error = xfs_iunlink_update_bucket(tp, agno,
> +					agibp, NULLAGINO, agino);
> +			goto out;
> +		}
> +	}
> +	ip->i_prev_unlinked = XFS_INO_TO_AGINO(mp, pip->i_ino);
> +	ASSERT(pip->i_next_unlinked == NULLAGINO);
> +	pip->i_next_unlinked = agino;
> +	xfs_iunlink_log(tp, pip);
> +out:
> +	ip->i_next_unlinked = NULLAGINO;
> +	pag->pag_unlinked_tail = ip;
> +	mutex_unlock(&pag->pag_unlinked_mutex);
> +	xfs_perag_put(pag);
> +	return 0;
>  }
>  
>  /*
> + * Pull the on-disk inode from the AGI unlinked list.
> + *
>   * Remove can be from anywhere in the list, so we have to do two adjacent inode
>   * lookups here so we can update list pointers. We may be at the head or the
>   * tail of the list, so we have to handle those cases as well.
>   */
> -static int
> -xfs_iunlink_remove_inode(
> +STATIC int
> +xfs_iunlink_remove(
>  	struct xfs_trans	*tp,
> -	struct xfs_buf		*agibp,
>  	struct xfs_inode	*ip)
>  {
>  	struct xfs_mount	*mp = tp->t_mountp;
> -	xfs_agino_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
> +	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
> +	struct xfs_perag	*pag;
> +	struct xfs_inode	*pip = NULL;
>  	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
> -	xfs_agino_t		next_agino = ip->i_next_unlinked;
> +	xfs_agino_t		next_agino;
>  	int			error;
>  
> +	trace_xfs_iunlink_remove(ip);
> +
> +	pag = xfs_perag_get(mp, agno);
> +	mutex_lock(&pag->pag_unlinked_mutex);
> +
> +	/* see the comment in xfs_iunlink() on the only proper locking order */
>  	if (ip->i_prev_unlinked == NULLAGINO) {
> -		/* remove from head of list */
> -		if (next_agino == agino ||
> -		    !xfs_verify_agino_or_null(mp, agno, next_agino))
> -			return -EFSCORRUPTED;
> +		struct xfs_buf	*agibp;
>  
> -		error = xfs_iunlink_update_bucket(tp, agno, agibp, agino, next_agino);
> -		if (error)
> +		mutex_unlock(&pag->pag_unlinked_mutex);
> +
> +		error = xfs_read_agi(mp, tp, agno, &agibp);
> +		if (error) {
> +			xfs_perag_put(pag);
>  			return error;
> -	} else {
> -		/* lookup previous inode and update to point at next */
> -		struct xfs_inode	*pip;
> +		}
>  
> -		pip = xfs_iunlink_lookup_prev(agibp->b_pag, ip);
> -		if (IS_ERR_OR_NULL(pip))
> -			return -EFSCORRUPTED;
> +		mutex_lock(&pag->pag_unlinked_mutex);
> +		if (ip->i_prev_unlinked == NULLAGINO) {
> +			struct xfs_agi	*agi;
>  
> -		if (pip->i_next_unlinked != agino) {
> -			xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__,
> -						NULL, 0, __this_address);
> -			return -EFSCORRUPTED;
> +			next_agino = ip->i_next_unlinked;
> +			agi = agibp->b_addr;
> +
> +			/* remove from head of list */
> +			if (next_agino == agino ||
> +			    !xfs_verify_agino_or_null(mp, agno, next_agino))
> +				goto err_fscorrupted;
> +
> +			error = xfs_iunlink_update_bucket(tp, agno,
> +				agibp, agino, next_agino);
> +			if (error)
> +				goto err_out;
> +
> +			goto next_fixup;
>  		}
> +	}
>  
> -		/* update the on disk inode now */
> -		pip->i_next_unlinked = next_agino;
> -		xfs_iunlink_log(tp, pip);
> +	next_agino = ip->i_next_unlinked;
> +
> +	/* lookup previous inode and update to point at next */
> +	pip = xfs_iunlink_lookup_prev(pag, ip);
> +	if (IS_ERR_OR_NULL(pip))
> +		goto err_fscorrupted;
> +
> +	if (pip->i_next_unlinked != agino) {
> +		xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__,
> +					NULL, 0, __this_address);
> +		goto err_fscorrupted;
>  	}
> +	pip->i_next_unlinked = next_agino;
> +	xfs_iunlink_log(tp, pip);
>  
> -	/* lookup the next inode and update to point at prev */
> -	if (ip->i_next_unlinked != NULLAGINO) {
> +next_fixup:
> +	if (next_agino == NULLAGINO) {
> +		/* so iunlink recovery can work here */
> +		if (pag->pag_unlinked_tail == ip)
> +			pag->pag_unlinked_tail = pip;
> +	} else {
> +		/* lookup the next inode and update to point at prev */
>  		struct xfs_inode	*nip;
>  
> -		nip = xfs_iunlink_lookup_next(agibp->b_pag, ip);
> +		nip = xfs_iunlink_lookup_next(pag, ip);
>  		if (IS_ERR_OR_NULL(nip))
> -			return -EFSCORRUPTED;
> +			goto err_fscorrupted;
>  
>  		if (nip->i_prev_unlinked != agino) {
>  			xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__,
>  						NULL, 0, __this_address);
> -			return -EFSCORRUPTED;
> +			goto err_fscorrupted;
>  		}
>  		/* in memory update only */
>  		nip->i_prev_unlinked = ip->i_prev_unlinked;
>  	}
> +	mutex_unlock(&pag->pag_unlinked_mutex);
> +	xfs_perag_put(pag);
>  
> +	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
>  	/*
>  	 * Now clear prev/next from this inode and update on disk if we
>  	 * need to clear the on-disk link.
>  	 */
>  	ip->i_prev_unlinked = NULLAGINO;
> -	ip->i_next_unlinked = NULLAGINO;
> -	if (next_agino != NULLAGINO)
> +	if (next_agino != NULLAGINO) {
> +		ip->i_next_unlinked = NULLAGINO;
>  		xfs_iunlink_log(tp, ip);
> +	}
>  	return 0;
> -}
>  
> -/*
> - * This is called when the inode's link count has gone to 0 or we are creating
> - * a tmpfile via O_TMPFILE.  The inode @ip must have nlink == 0.
> - *
> - * We place the on-disk inode on a list in the AGI.  It will be pulled from this
> - * list when the inode is freed.
> - */
> -STATIC int
> -xfs_iunlink(
> -	struct xfs_trans	*tp,
> -	struct xfs_inode	*ip)
> -{
> -	struct xfs_mount	*mp = tp->t_mountp;
> -	struct xfs_buf		*agibp;
> -	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
> -	int			error;
> -
> -	ASSERT(ip->i_next_unlinked == NULLAGINO);
> -	ASSERT(VFS_I(ip)->i_nlink == 0);
> -	ASSERT(VFS_I(ip)->i_mode != 0);
> -	trace_xfs_iunlink(ip);
> -
> -	/* Get the agi buffer first.  It ensures lock ordering on the list. */
> -	error = xfs_read_agi(mp, tp, agno, &agibp);
> -	if (error)
> -		return error;
> -
> -	return xfs_iunlink_insert_inode(tp, agibp, ip);
> -}
> -
> -/*
> - * Pull the on-disk inode from the AGI unlinked list.
> - */
> -STATIC int
> -xfs_iunlink_remove(
> -	struct xfs_trans	*tp,
> -	struct xfs_inode	*ip)
> -{
> -	struct xfs_mount	*mp = tp->t_mountp;
> -	struct xfs_buf		*agibp;
> -	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
> -	int			error;
> -
> -	trace_xfs_iunlink_remove(ip);
> -
> -	/* Get the agi buffer first.  It ensures lock ordering on the list. */
> -	error = xfs_read_agi(mp, tp, agno, &agibp);
> -	if (error)
> -		return error;
> -
> -	return xfs_iunlink_remove_inode(tp, agibp, ip);
> +err_fscorrupted:
> +	error = -EFSCORRUPTED;
> +err_out:
> +	mutex_unlock(&pag->pag_unlinked_mutex);
> +	xfs_perag_put(pag);
> +	return error;
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index d47eea31c165..3f2739316424 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -2801,6 +2801,11 @@ xlog_recover_iunlink_ag(
>  					prev_ip->i_next_unlinked = NULLAGINO;
>  				break;
>  			}
> +
> +			/* XXX: take pag_unlinked_mutex across the loop? */
> +			if (!bucket)
> +				agibp->b_pag->pag_unlinked_tail = ip;
> +
>  			if (prev_ip) {
>  				ip->i_prev_unlinked = prev_agino;
>  				xfs_irele(prev_ip);
> @@ -2812,6 +2817,7 @@ xlog_recover_iunlink_ag(
>  		}
>  		if (prev_ip)
>  			xfs_irele(prev_ip);
> +
>  		if (error) {
>  			/*
>  			 * We can't read an inode this bucket points to, or an
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index 031e96ff022d..a9701f863e84 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -223,6 +223,9 @@ xfs_initialize_perag(
>  		if (first_initialised == NULLAGNUMBER)
>  			first_initialised = index;
>  		spin_lock_init(&pag->pag_state_lock);
> +
> +		mutex_init(&pag->pag_unlinked_mutex);
> +		pag->pag_unlinked_tail = NULL;
>  	}
>  
>  	index = xfs_set_inode_alloc(mp, agcount);
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index a72cfcaa4ad1..107a634a34f7 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -372,6 +372,9 @@ typedef struct xfs_perag {
>  	/* reference count */
>  	uint8_t			pagf_refcount_level;
>  
> +	struct mutex		pag_unlinked_mutex;
> +	struct xfs_inode	*pag_unlinked_tail;

What do these fields do?  There aren't any comments....

--D

> +
>  	/*
>  	 * Unlinked inode information.  This incore information reflects
>  	 * data stored in the AGI, so callers must hold the AGI buffer lock
> -- 
> 2.18.1
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 1/2] xfs: arrange all unlinked inodes into one list
  2020-07-07 13:57 ` [RFC PATCH 1/2] xfs: arrange all unlinked inodes into one list Gao Xiang
@ 2020-07-08 22:33   ` Dave Chinner
  2020-07-09  0:17     ` Gao Xiang
  0 siblings, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2020-07-08 22:33 UTC (permalink / raw)
  To: Gao Xiang; +Cc: linux-xfs, Darrick J. Wong, Brian Foster

On Tue, Jul 07, 2020 at 09:57:40PM +0800, Gao Xiang wrote:
> There is no need to keep old multiple short unlink inode buckets
> since we have an in-memory double linked list for all unlinked
> inodes.
> 
> Apart from the perspective of the necessity, the main advantage
> is that the log and AGI update can be reduced since each AG has
> the only one head now, which is implemented in the following patch.
> 
> Therefore, this patch applies the new way in xfs_iunlink() and
> keep the old approach in xfs_iunlink_remove_inode() path as well
> so inode eviction can still work properly in recovery.
> 
> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
> ---
>  fs/xfs/xfs_inode.c | 40 ++++++++++++++++++++--------------------
>  1 file changed, 20 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index ab288424764c..10565fa5ace4 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -33,6 +33,7 @@
>  #include "xfs_symlink.h"
>  #include "xfs_trans_priv.h"
>  #include "xfs_log.h"
> +#include "xfs_log_priv.h"
>  #include "xfs_bmap_btree.h"
>  #include "xfs_reflink.h"
>  #include "xfs_iunlink_item.h"
> @@ -1955,25 +1956,32 @@ xfs_iunlink_update_bucket(
>  	struct xfs_trans	*tp,
>  	xfs_agnumber_t		agno,
>  	struct xfs_buf		*agibp,
> -	unsigned int		bucket_index,
> +	xfs_agino_t		old_agino,
>  	xfs_agino_t		new_agino)
>  {
> +	struct xlog		*log = tp->t_mountp->m_log;
>  	struct xfs_agi		*agi = agibp->b_addr;
>  	xfs_agino_t		old_value;
> -	int			offset;
> +	unsigned int		bucket_index;
> +	int                     offset;
>  
>  	ASSERT(xfs_verify_agino_or_null(tp->t_mountp, agno, new_agino));
>  
> +	bucket_index = 0;
> +	/* During recovery, the old multiple bucket index can be applied */
> +	if (!log || log->l_flags & XLOG_RECOVERY_NEEDED) {
> +		ASSERT(old_agino != NULLAGINO);
> +
> +		if (be32_to_cpu(agi->agi_unlinked[0]) != old_agino)
> +			bucket_index = old_agino % XFS_AGI_UNLINKED_BUCKETS;
> +	}

Ok, so you are doing this because you changed the function to pass
in an agino rather than a bucket index from the caller context. So
now you have to look up a structure to determine what the caller
context was to determine what the bucket index we need to use is.

Seems like we probably should have kept passing in the bucket index
from the caller because that's where the knowledge of what bucket we
need to update comes from?

And in that case, the higher level code should be checking for the
log recovery case when selecting the bucket, not hiding it deep in
the guts of the code here....

> +
>  	old_value = be32_to_cpu(agi->agi_unlinked[bucket_index]);
>  	trace_xfs_iunlink_update_bucket(tp->t_mountp, agno, bucket_index,
>  			old_value, new_agino);
>  
> -	/*
> -	 * We should never find the head of the list already set to the value
> -	 * passed in because either we're adding or removing ourselves from the
> -	 * head of the list.
> -	 */
> -	if (old_value == new_agino) {
> +	/* check if the old agi_unlinked head is as expected */
> +	if (old_value != old_agino) {
>  		xfs_buf_mark_corrupt(agibp);
>  		return -EFSCORRUPTED;
>  	}

This looks like a change of behaviour - it no longer checks against
the inode we are about to add/remove from the list, but instead
checks that old inode is what we found on the list. We're not
concerned that what we found on the list matches what the caller
found on the list and passed us - we're concerned about doing a
double add/remove of the current inode...

> @@ -2001,14 +2009,13 @@ xfs_iunlink_insert_inode(
>  	xfs_agino_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
>  	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
>  	xfs_agino_t		next_agino;
> -	short			bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
>  
>  	/*
>  	 * Get the index into the agi hash table for the list this inode will
>  	 * go on.  Make sure the pointer isn't garbage and that this inode
>  	 * isn't already on the list.
>  	 */
> -	next_agino = be32_to_cpu(agi->agi_unlinked[bucket_index]);
> +	next_agino = be32_to_cpu(agi->agi_unlinked[0]);
>  	if (next_agino == agino ||
>  	    !xfs_verify_agino_or_null(mp, agno, next_agino)) {
>  		xfs_buf_mark_corrupt(agibp);
> @@ -2036,7 +2043,7 @@ xfs_iunlink_insert_inode(
>  	}
>  
>  	/* Point the head of the list to point to this inode. */
> -	return xfs_iunlink_update_bucket(tp, agno, agibp, bucket_index, agino);
> +	return xfs_iunlink_update_bucket(tp, agno, agibp, next_agino, agino);
>  }
>  
>  /*
> @@ -2051,27 +2058,20 @@ xfs_iunlink_remove_inode(
>  	struct xfs_inode	*ip)
>  {
>  	struct xfs_mount	*mp = tp->t_mountp;
> -	struct xfs_agi		*agi = agibp->b_addr;
>  	xfs_agino_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
>  	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
>  	xfs_agino_t		next_agino = ip->i_next_unlinked;
> -	short			bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
>  	int			error;
>  
>  	if (ip->i_prev_unlinked == NULLAGINO) {
>  		/* remove from head of list */
> -		if (be32_to_cpu(agi->agi_unlinked[bucket_index]) != agino) {
> -			xfs_buf_mark_corrupt(agibp);
> -			return -EFSCORRUPTED;
> -		}
>  		if (next_agino == agino ||
>  		    !xfs_verify_agino_or_null(mp, agno, next_agino))
>  			return -EFSCORRUPTED;
>  
> -		error = xfs_iunlink_update_bucket(tp, agno, agibp,
> -					bucket_index, next_agino);
> +		error = xfs_iunlink_update_bucket(tp, agno, agibp, agino, next_agino);
>  		if (error)
> -			return -EFSCORRUPTED;
> +			return error;

i.e. this is the point where we know we need to remove from the head
of the unlinked list and all the bucket selection and verification
should probably remain here...

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 2/2] xfs: don't access AGI on unlinked inodes if it can
  2020-07-07 13:57 ` [RFC PATCH 2/2] xfs: don't access AGI on unlinked inodes if it can Gao Xiang
  2020-07-08 17:03   ` Darrick J. Wong
@ 2020-07-08 23:33   ` Dave Chinner
  2020-07-09  0:55     ` Gao Xiang
  1 sibling, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2020-07-08 23:33 UTC (permalink / raw)
  To: Gao Xiang; +Cc: linux-xfs, Darrick J. Wong, Brian Foster

On Tue, Jul 07, 2020 at 09:57:41PM +0800, Gao Xiang wrote:
> Currently, we use AGI buffer lock to protect in-memory linked list for
> unlinked inodes but since it's not necessary to modify AGI unless the
> head of the unlinked list is modified. So let's removing the AGI buffer
> modification dependency if possible, including 1) adding another per-AG
> dedicated lock to protect the whole list and 2) inserting unlinked
> inodes from tail.
> 
> For 2), the tail of bucket 0 is now recorded in perag for xfs_iunlink()
> to use. xfs_iunlink_remove() still support old multiple short bucket
> lists for recovery code.

I would split this into two separate patches. One to move to a perag
based locking strategy, another to change from head to tail
addition as they are largely independent algorithmic changes.

> Note that some paths take AGI lock in its transaction in advance,
> so the proper locking order is only AGI lock -> unlinked list lock.

These paths should be documented in the commit message as well
as in code comments so the reviewer is aware of those code paths
and can verify that your assumptions about locking order are
correct.

> 
> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
> ---
>  fs/xfs/xfs_inode.c       | 251 ++++++++++++++++++++-------------------
>  fs/xfs/xfs_log_recover.c |   6 +
>  fs/xfs/xfs_mount.c       |   3 +
>  fs/xfs/xfs_mount.h       |   3 +
>  4 files changed, 144 insertions(+), 119 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 10565fa5ace4..d33e5b198534 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -1994,182 +1994,195 @@ xfs_iunlink_update_bucket(
>  }
>  
>  /*
> - * Always insert at the head, so we only have to do a next inode lookup to
> - * update it's prev pointer. The AGI bucket will point at the one we are
> - * inserting.
> + * This is called when the inode's link count has gone to 0 or we are creating
> + * a tmpfile via O_TMPFILE.  The inode @ip must have nlink == 0.
> + *
> + * We place the on-disk inode on a list in the AGI.  It will be pulled from this
> + * list when the inode is freed.
>   */
> -static int
> -xfs_iunlink_insert_inode(

Hmmm. Flattening the code also make the patch harder to follow as it
combines code movement/rearrangement with algorithmic changes.  We
try to separate out code movement/rearrangement into their own
patches so that the movement is easy to verify by itself.

Also, the helper functions help document the separation of the
unlinked list manipulations from the from setup and locking
requirements for the list manipulations, and this largely undoes all
that. I added these helpers because it completely untangled the mess
that was present before the RFC patchset I posted. THat is, I
couldn't easily modify the existing code because it interleaved
the locking, the backref hash manipulations and
the on-disk list manipulations in ways I found difficult to
understand and manage. Short, simple, clear functions are much
better than long, multiple operation functions...

i.e. this:

xfs_iunlink()
{

	get locks
	do list insert
	drop locks
}

Is better for understanding, maintenance and future modification
than:

xfs_iunlink()
{

	get perag
	lock perag
	look at tail of list
	if (empty) {
		unlock perag
		read/lock AGI
		lock perag
		look at tail of list
		if (empty)
			do head insert
			goto out
	}
	do tail insert
out:
	update inode/pag tails
	unlock
	drop perag
}

It's trivial for a reader to understand what the first version of
xfs_iunlink() is going to do without needing to understand the
intraccies of the locking strategies. However, it takes time and
effort to undestand exactly waht the second one is doing because
it's not clear where lock ends and list modifications start, nor
what the locking rules are for the different modifications that are
being made. Essentially, it goes back to the complex
locking-intertwined-with-modification-algorithm problem the current
TOT code has.

I'd much prefer to see something like this:

/*
 * Inode allocation in the O_TMPFILE path defines the AGI/unlinked
 * list lock order as being AGI->perag unlinked list lock. We are
 * inverting it here as the fast path tail addition does not need to
 * modify the AGI at all. Hence we only need the AGI lock if the
 * tail is empty, but if we fail to get it without blocking then we
 * need to fall back to the slower, correct lock order.
 */
xfs_iunlink_insert_lock()
{
	get perag;
	lock_perag();
	if (!tail empty)
		return;
	if (trylock AGI)
		return;

	/*
	 * Slow path, need to lock AGI first. Don't even bother
	 * rechecking tail pointers or trying to optimise for
	 * minimal AGI lock hold time as racing unlink list mods
	 * will all block on the perag lock once we take that. They
	 * will then hit the !tail empty fast path and not require
	 * the AGI lock at all.
	 */
	lock AGI
	lock_perag()
	return;
}

The non-AGI locking fast path is slightly different in the remove
case, so we'll have a slightly different helper function in that
case which checks where the inode being removed is in the list.

In both cases, though, the unlock should be the same:

xfs_iunlink_unlock()
{
	/* Does not unlock AGI, ever. commit does that. */
	unlock perag
	put perag
}

This keeps the list locking completely separate from the list
manipulations and allows us to document the locking constraints and
reasons for why it is or isn't optimised for specific conditions
without cluttering up the list manipulations code.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 2/2] xfs: don't access AGI on unlinked inodes if it can
  2020-07-08 17:03   ` Darrick J. Wong
@ 2020-07-08 23:40     ` Gao Xiang
  0 siblings, 0 replies; 29+ messages in thread
From: Gao Xiang @ 2020-07-08 23:40 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, Dave Chinner, Brian Foster

Hi Darrick,

On Wed, Jul 08, 2020 at 10:03:07AM -0700, Darrick J. Wong wrote:
> On Tue, Jul 07, 2020 at 09:57:41PM +0800, Gao Xiang wrote:
> > Currently, we use AGI buffer lock to protect in-memory linked list for
> > unlinked inodes but since it's not necessary to modify AGI unless the
> > head of the unlinked list is modified. So let's removing the AGI buffer
> > modification dependency if possible, including 1) adding another per-AG
> > dedicated lock to protect the whole list and 2) inserting unlinked
> > inodes from tail.
> > 
> > For 2), the tail of bucket 0 is now recorded in perag for xfs_iunlink()
> > to use. xfs_iunlink_remove() still support old multiple short bucket
> > lists for recovery code.
> > 
> > Note that some paths take AGI lock in its transaction in advance,
> > so the proper locking order is only AGI lock -> unlinked list lock.
> 
> How much agi buffer lock contention does this eliminate?

Sorry, I haven't got the point.

Did you mean I should show some number in the commit message from some workload?
I'm not sure if you mean that, it depends on the specific workload indeed. I
could do that if needed (it's of much help if some more hints here...)

Or I'm not sure if you mean the deadlock. There are some exist path I've found
when running fsstress, for example, xfs_create_tmpfile() will reading/locking
AGI when preparing an orphan inode in this transaction. So if I reading AGI
again under a new lock unconditionally, it should cause ABBA deadlock (since
another path takes mutex first and locks AGI buffer again).

> 
> > @@ -372,6 +372,9 @@ typedef struct xfs_perag {
> >  	/* reference count */
> >  	uint8_t			pagf_refcount_level;
> >  
> > +	struct mutex		pag_unlinked_mutex;
> > +	struct xfs_inode	*pag_unlinked_tail;
> 
> What do these fields do?  There aren't any comments....

Will add some comments in the next version, Thanks for the advice!

Thanks,
Gao Xiang

> 
> --D
> 
> > +
> >  	/*
> >  	 * Unlinked inode information.  This incore information reflects
> >  	 * data stored in the AGI, so callers must hold the AGI buffer lock
> > -- 
> > 2.18.1
> > 
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 1/2] xfs: arrange all unlinked inodes into one list
  2020-07-08 22:33   ` Dave Chinner
@ 2020-07-09  0:17     ` Gao Xiang
  0 siblings, 0 replies; 29+ messages in thread
From: Gao Xiang @ 2020-07-09  0:17 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, Darrick J. Wong, Brian Foster

Hi Dave,

On Thu, Jul 09, 2020 at 08:33:26AM +1000, Dave Chinner wrote:
> >  
> > +	bucket_index = 0;
> > +	/* During recovery, the old multiple bucket index can be applied */
> > +	if (!log || log->l_flags & XLOG_RECOVERY_NEEDED) {
> > +		ASSERT(old_agino != NULLAGINO);
> > +
> > +		if (be32_to_cpu(agi->agi_unlinked[0]) != old_agino)
> > +			bucket_index = old_agino % XFS_AGI_UNLINKED_BUCKETS;
> > +	}
> 
> Ok, so you are doing this because you changed the function to pass
> in an agino rather than a bucket index from the caller context. So
> now you have to look up a structure to determine what the caller
> context was to determine what the bucket index we need to use is.
> 
> Seems like we probably should have kept passing in the bucket index
> from the caller because that's where the knowledge of what bucket we
> need to update comes from?

My thought is since bucket_index is now fixed as 0 except for the
determinated recovery path, so I think there is no need for any
exist callers or future callers to specify some bucket number.

The old formula is only used for xfs_iunlink_remove_inode() when
recovering, so old_agino won't be NULLAGINO but indicate the old
bucket_index as well (since it removes the head unlinked inode.)
That is the only determinated path.

> 
> And in that case, the higher level code should be checking for the
> log recovery case when selecting the bucket, not hiding it deep in
> the guts of the code here....

I could do that instead, it means callers should obey more rules
to call this function. e.g bucket_index should obey "old_agino % XFS_AGI_UNLINKED_BUCKETS"
rule for the old way rather than passing any possible value.

but if we specify all callers do that, I think "if (old_value != old_agino)"
isn't needed here as well but instead add an ASSERT here since all callers
should know what they are doing now, and leave only some integral check here.

> 
> > +
> >  	old_value = be32_to_cpu(agi->agi_unlinked[bucket_index]);
> >  	trace_xfs_iunlink_update_bucket(tp->t_mountp, agno, bucket_index,
> >  			old_value, new_agino);
> >  
> > -	/*
> > -	 * We should never find the head of the list already set to the value
> > -	 * passed in because either we're adding or removing ourselves from the
> > -	 * head of the list.
> > -	 */
> > -	if (old_value == new_agino) {
> > +	/* check if the old agi_unlinked head is as expected */
> > +	if (old_value != old_agino) {
> >  		xfs_buf_mark_corrupt(agibp);
> >  		return -EFSCORRUPTED;
> >  	}
> 
> This looks like a change of behaviour - it no longer checks against
> the inode we are about to add/remove from the list, but instead
> checks that old inode is what we found on the list. We're not
> concerned that what we found on the list matches what the caller
> found on the list and passed us - we're concerned about doing a
> double add/remove of the current inode...

Just as I said above, this checks the bucket head validity instead
(since we get a bucket_index and the bucket head should be as
what we expect). It doesn't mean to check that old inode is what
we found on the list. So we could kill the original check in
xfs_iunlink_remove_inode() as well.

Anyway, I prefer this way since all callers won't know too much about
the bucket index. Since bucket index is much related to agino, it cannot
be specified without some rule, e.g. agino is 0 (just for a example,
should not possible), but a caller passes a bucket index 12. It doesn't
work so if we have a bucket_index argument, we have also to consider
bucket_index check in xfs_iunlink_update_bucket() just as the old
"if (old_value == new_agino) {" integral check as well. From this
perspective, xfs_iunlink_update_bucket() should also know the rule
as the proposed code does...

I could keep the old argument instead, that isn't too much for me.
I listed all my thoughts above.

Thanks,
Gao Xiang


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 2/2] xfs: don't access AGI on unlinked inodes if it can
  2020-07-08 23:33   ` Dave Chinner
@ 2020-07-09  0:55     ` Gao Xiang
  2020-07-09  2:32       ` Dave Chinner
  0 siblings, 1 reply; 29+ messages in thread
From: Gao Xiang @ 2020-07-09  0:55 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, Darrick J. Wong, Brian Foster

On Thu, Jul 09, 2020 at 09:33:11AM +1000, Dave Chinner wrote:
> On Tue, Jul 07, 2020 at 09:57:41PM +0800, Gao Xiang wrote:
> > Currently, we use AGI buffer lock to protect in-memory linked list for
> > unlinked inodes but since it's not necessary to modify AGI unless the
> > head of the unlinked list is modified. So let's removing the AGI buffer
> > modification dependency if possible, including 1) adding another per-AG
> > dedicated lock to protect the whole list and 2) inserting unlinked
> > inodes from tail.
> > 
> > For 2), the tail of bucket 0 is now recorded in perag for xfs_iunlink()
> > to use. xfs_iunlink_remove() still support old multiple short bucket
> > lists for recovery code.
> 
> I would split this into two separate patches. One to move to a perag
> based locking strategy, another to change from head to tail
> addition as they are largely independent algorithmic changes.

Yes, that is much better from the perspective of spilting patches and
I thought that before. It seems that is like 2 steps but the proposed
target solution is as a whole (in other words, 2 steps are code-related)
and I'm not sure how large these code is sharable or can be inherited
but rather than introduce some code in patch 2 and then remove immediately
and turn into a new code in patch 3). I'm not sure how large logic could
be sharable between these 2 dependent steps so I didn't do that.

I will spilt patches in the next RFC version to make a try.

> 
> > Note that some paths take AGI lock in its transaction in advance,
> > so the proper locking order is only AGI lock -> unlinked list lock.
> 
> These paths should be documented in the commit message as well
> as in code comments so the reviewer is aware of those code paths
> and can verify that your assumptions about locking order are
> correct.

Yeah, It has been documented in the following.

+xfs_iunlink(
 	struct xfs_trans	*tp,
-	struct xfs_buf		*agibp,
 	struct xfs_inode	*ip)
 {
 	struct xfs_mount	*mp = tp->t_mountp;
-	struct xfs_agi		*agi = agibp->b_addr;
-	xfs_agino_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
+	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
+	struct xfs_perag	*pag;
+	struct xfs_inode	*pip;
 	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
-	xfs_agino_t		next_agino;
+	int			error;
+
+	ASSERT(VFS_I(ip)->i_nlink == 0);
+	ASSERT(VFS_I(ip)->i_mode != 0);
+	trace_xfs_iunlink(ip);
+
+	pag = xfs_perag_get(mp, agno);
 
 	/*
-	 * Get the index into the agi hash table for the list this inode will
-	 * go on.  Make sure the pointer isn't garbage and that this inode
-	 * isn't already on the list.
+	 * some paths (e.g. xfs_create_tmpfile) could take AGI lock
+	 * in this transaction in advance and the only locking order
+	 * AGI buf lock -> pag_unlinked_mutex is safe.

> 
> > 
> > Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
> > ---
> >  fs/xfs/xfs_inode.c       | 251 ++++++++++++++++++++-------------------
> >  fs/xfs/xfs_log_recover.c |   6 +
> >  fs/xfs/xfs_mount.c       |   3 +
> >  fs/xfs/xfs_mount.h       |   3 +
> >  4 files changed, 144 insertions(+), 119 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index 10565fa5ace4..d33e5b198534 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -1994,182 +1994,195 @@ xfs_iunlink_update_bucket(
> >  }
> >  
> >  /*
> > - * Always insert at the head, so we only have to do a next inode lookup to
> > - * update it's prev pointer. The AGI bucket will point at the one we are
> > - * inserting.
> > + * This is called when the inode's link count has gone to 0 or we are creating
> > + * a tmpfile via O_TMPFILE.  The inode @ip must have nlink == 0.
> > + *
> > + * We place the on-disk inode on a list in the AGI.  It will be pulled from this
> > + * list when the inode is freed.
> >   */
> > -static int
> > -xfs_iunlink_insert_inode(
> 
> Hmmm. Flattening the code also make the patch harder to follow as it
> combines code movement/rearrangement with algorithmic changes.  We
> try to separate out code movement/rearrangement into their own
> patches so that the movement is easy to verify by itself.
> 
> Also, the helper functions help document the separation of the
> unlinked list manipulations from the from setup and locking
> requirements for the list manipulations, and this largely undoes all
> that. I added these helpers because it completely untangled the mess
> that was present before the RFC patchset I posted. THat is, I
> couldn't easily modify the existing code because it interleaved
> the locking, the backref hash manipulations and
> the on-disk list manipulations in ways I found difficult to
> understand and manage. Short, simple, clear functions are much
> better than long, multiple operation functions...
> 
> i.e. this:
> 
> xfs_iunlink()
> {
> 
> 	get locks
> 	do list insert
> 	drop locks
> }
> 
> Is better for understanding, maintenance and future modification
> than:
> 
> xfs_iunlink()
> {
> 
> 	get perag
> 	lock perag
> 	look at tail of list
> 	if (empty) {
> 		unlock perag
> 		read/lock AGI
> 		lock perag
> 		look at tail of list
> 		if (empty)
> 			do head insert
> 			goto out
> 	}
> 	do tail insert
> out:
> 	update inode/pag tails
> 	unlock
> 	drop perag
> }
> 
> It's trivial for a reader to understand what the first version of
> xfs_iunlink() is going to do without needing to understand the
> intraccies of the locking strategies. However, it takes time and
> effort to undestand exactly waht the second one is doing because
> it's not clear where lock ends and list modifications start, nor
> what the locking rules are for the different modifications that are
> being made. Essentially, it goes back to the complex
> locking-intertwined-with-modification-algorithm problem the current
> TOT code has.
> 
> I'd much prefer to see something like this:
> 
> /*
>  * Inode allocation in the O_TMPFILE path defines the AGI/unlinked
>  * list lock order as being AGI->perag unlinked list lock. We are
>  * inverting it here as the fast path tail addition does not need to
>  * modify the AGI at all. Hence we only need the AGI lock if the
>  * tail is empty, but if we fail to get it without blocking then we
>  * need to fall back to the slower, correct lock order.
>  */
> xfs_iunlink_insert_lock()
> {
> 	get perag;
> 	lock_perag();
> 	if (!tail empty)
> 		return;
> 	if (trylock AGI)
> 		return;

(adding some notes here, this patch doesn't use try lock here
 finally but unlock perag and take AGI and relock and recheck tail_empty....
 since the tail non-empty is rare...)

> 
> 	/*
> 	 * Slow path, need to lock AGI first. Don't even bother
> 	 * rechecking tail pointers or trying to optimise for
> 	 * minimal AGI lock hold time as racing unlink list mods
> 	 * will all block on the perag lock once we take that. They
> 	 * will then hit the !tail empty fast path and not require
> 	 * the AGI lock at all.
> 	 */
> 	lock AGI
> 	lock_perag()
> 	return;
> }
> 
> The non-AGI locking fast path is slightly different in the remove
> case, so we'll have a slightly different helper function in that
> case which checks where the inode being removed is in the list.
> 
> In both cases, though, the unlock should be the same:
> 
> xfs_iunlink_unlock()
> {
> 	/* Does not unlock AGI, ever. commit does that. */
> 	unlock perag
> 	put perag
> }
> 
> This keeps the list locking completely separate from the list
> manipulations and allows us to document the locking constraints and
> reasons for why it is or isn't optimised for specific conditions
> without cluttering up the list manipulations code.

Yeah, the _lock and _unlock pair is helpful, I will go for this direction.

Thanks,
Gao Xiang

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 2/2] xfs: don't access AGI on unlinked inodes if it can
  2020-07-09  0:55     ` Gao Xiang
@ 2020-07-09  2:32       ` Dave Chinner
  2020-07-09 10:36         ` Gao Xiang
  0 siblings, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2020-07-09  2:32 UTC (permalink / raw)
  To: Gao Xiang; +Cc: linux-xfs, Darrick J. Wong, Brian Foster

On Thu, Jul 09, 2020 at 08:55:26AM +0800, Gao Xiang wrote:
> On Thu, Jul 09, 2020 at 09:33:11AM +1000, Dave Chinner wrote:
> > On Tue, Jul 07, 2020 at 09:57:41PM +0800, Gao Xiang wrote:
> > > Currently, we use AGI buffer lock to protect in-memory linked list for
> > > unlinked inodes but since it's not necessary to modify AGI unless the
> > > head of the unlinked list is modified. So let's removing the AGI buffer
> > > modification dependency if possible, including 1) adding another per-AG
> > > dedicated lock to protect the whole list and 2) inserting unlinked
> > > inodes from tail.
> > > 
> > > For 2), the tail of bucket 0 is now recorded in perag for xfs_iunlink()
> > > to use. xfs_iunlink_remove() still support old multiple short bucket
> > > lists for recovery code.
> > 
> > I would split this into two separate patches. One to move to a perag
> > based locking strategy, another to change from head to tail
> > addition as they are largely independent algorithmic changes.
> 
> Yes, that is much better from the perspective of spilting patches and
> I thought that before. It seems that is like 2 steps but the proposed
> target solution is as a whole (in other words, 2 steps are code-related)
> and I'm not sure how large these code is sharable or can be inherited
> but rather than introduce some code in patch 2 and then remove immediately
> and turn into a new code in patch 3). I'm not sure how large logic could
> be sharable between these 2 dependent steps so I didn't do that.
> 
> I will spilt patches in the next RFC version to make a try.

Thanks!

[snip]


> > i.e. this:
> > 
> > xfs_iunlink()
> > {
> > 
> > 	get locks
> > 	do list insert
> > 	drop locks
> > }
> > 
> > Is better for understanding, maintenance and future modification
> > than:
> > 
> > xfs_iunlink()
> > {
> > 
> > 	get perag
> > 	lock perag
> > 	look at tail of list
> > 	if (empty) {
> > 		unlock perag
> > 		read/lock AGI
> > 		lock perag
> > 		look at tail of list
> > 		if (empty)
> > 			do head insert
> > 			goto out
> > 	}
> > 	do tail insert
> > out:
> > 	update inode/pag tails
> > 	unlock
> > 	drop perag
> > }
> > 
> > It's trivial for a reader to understand what the first version of
> > xfs_iunlink() is going to do without needing to understand the
> > intraccies of the locking strategies. However, it takes time and
> > effort to undestand exactly waht the second one is doing because
> > it's not clear where lock ends and list modifications start, nor
> > what the locking rules are for the different modifications that are
> > being made. Essentially, it goes back to the complex
> > locking-intertwined-with-modification-algorithm problem the current
> > TOT code has.
> > 
> > I'd much prefer to see something like this:
> > 
> > /*
> >  * Inode allocation in the O_TMPFILE path defines the AGI/unlinked
> >  * list lock order as being AGI->perag unlinked list lock. We are
> >  * inverting it here as the fast path tail addition does not need to
> >  * modify the AGI at all. Hence we only need the AGI lock if the
> >  * tail is empty, but if we fail to get it without blocking then we
> >  * need to fall back to the slower, correct lock order.
> >  */
> > xfs_iunlink_insert_lock()
> > {
> > 	get perag;
> > 	lock_perag();
> > 	if (!tail empty)
> > 		return;
> > 	if (trylock AGI)
> > 		return;
> 
> (adding some notes here, this patch doesn't use try lock here
>  finally but unlock perag and take AGI and relock and recheck tail_empty....
>  since the tail non-empty is rare...)

*nod*

My point was largely that this sort of thing is really obvious and
easy to optimise once the locking is cleanly separated. Adding a
trylock rather than drop/relock is another patch for the series :P

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 2/2] xfs: don't access AGI on unlinked inodes if it can
  2020-07-09  2:32       ` Dave Chinner
@ 2020-07-09 10:36         ` Gao Xiang
  2020-07-09 10:47           ` Gao Xiang
  2020-07-09 22:36           ` Dave Chinner
  0 siblings, 2 replies; 29+ messages in thread
From: Gao Xiang @ 2020-07-09 10:36 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, Darrick J. Wong, Brian Foster

Hi Dave,

On Thu, Jul 09, 2020 at 12:32:46PM +1000, Dave Chinner wrote:

...

> > > 	if (trylock AGI)
> > > 		return;
> > 
> > (adding some notes here, this patch doesn't use try lock here
> >  finally but unlock perag and take AGI and relock and recheck tail_empty....
> >  since the tail non-empty is rare...)
> 
> *nod*
> 
> My point was largely that this sort of thing is really obvious and
> easy to optimise once the locking is cleanly separated. Adding a
> trylock rather than drop/relock is another patch for the series :P

I thought trylock way yesterday as well...
Apart from that we don't have the AGI trylock method yet, it seems
not work as well...

Thinking about one thread is holding a AGI lock and wants to lock perag.
And another thread holds a perag, but the trylock will fail and the second
wait-lock will cause deadlock... like below:

Thread 1			Thread 2
 lock AGI
                                 lock perag
 lock perag
                                 trylock AGI
                                 lock AGI           <- dead lock here

And trylock perag instead seems not work well as well, since it fails
more frequently so we need lock AGI and then lock perag as a fallback.

So currently, I think it's better to use unlock, do something, regrab,
recheck method for now...

And yeah, that can be done in another patch if some better way. (e.g.
moving modifing AGI into pre-commit, so we can only take one per-ag
lock here...)

Thanks,
Gao Xiang

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 2/2] xfs: don't access AGI on unlinked inodes if it can
  2020-07-09 10:36         ` Gao Xiang
@ 2020-07-09 10:47           ` Gao Xiang
  2020-07-09 22:36           ` Dave Chinner
  1 sibling, 0 replies; 29+ messages in thread
From: Gao Xiang @ 2020-07-09 10:47 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, Darrick J. Wong, Brian Foster

On Thu, Jul 09, 2020 at 06:36:21PM +0800, Gao Xiang wrote:
> Hi Dave,
> 
> On Thu, Jul 09, 2020 at 12:32:46PM +1000, Dave Chinner wrote:
> 
> ...
> 
> > > > 	if (trylock AGI)
> > > > 		return;
> > > 
> > > (adding some notes here, this patch doesn't use try lock here
> > >  finally but unlock perag and take AGI and relock and recheck tail_empty....
> > >  since the tail non-empty is rare...)
> > 
> > *nod*
> > 
> > My point was largely that this sort of thing is really obvious and
> > easy to optimise once the locking is cleanly separated. Adding a
> > trylock rather than drop/relock is another patch for the series :P
> 
> I thought trylock way yesterday as well...
> Apart from that we don't have the AGI trylock method yet, it seems
> not work as well...
> 
> Thinking about one thread is holding a AGI lock and wants to lock perag.
> And another thread holds a perag, but the trylock will fail and the second
> wait-lock will cause deadlock... like below:
> 
> Thread 1			Thread 2
>  lock AGI
>                                  lock perag
>  lock perag
>                                  trylock AGI
>                                  lock AGI           <- dead lock here
> 
> And trylock perag instead seems not work well as well, since it fails
> more frequently so we need lock AGI and then lock perag as a fallback.
> 
> So currently, I think it's better to use unlock, do something, regrab,
> recheck method for now...
> 
> And yeah, that can be done in another patch if some better way. (e.g.
> moving modifing AGI into pre-commit, so we can only take one per-ag
> lock here...)

Oh, sorry, I didn't notice the words

> 	/*
> 	 * Slow path, need to lock AGI first. Don't even bother
> 	 * rechecking tail pointers or trying to optimise for
> 	 * minimal AGI lock hold time as racing unlink list mods
> 	 * will all block on the perag lock once we take that. They
> 	 * will then hit the !tail empty fast path and not require
> 	 * the AGI lock at all.
> 	 */

If we'd only try to trylock AGI and release perag and relock again,
and that's fine. Sorry about the noise :)

btw, also add some other notes here. For AGI pre-commits, I'm not
quite sure if some users take another locked buffer first (but
without AGI), so it would have some potential locking order concern
as well... But let us think about them later :)

add agi to precommit
lock other buffers

precommit --- lock AGI


Thanks,
Gao Xiang

> 
> Thanks,
> Gao Xiang
> 
> > 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> > 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 2/2] xfs: don't access AGI on unlinked inodes if it can
  2020-07-09 10:36         ` Gao Xiang
  2020-07-09 10:47           ` Gao Xiang
@ 2020-07-09 22:36           ` Dave Chinner
  1 sibling, 0 replies; 29+ messages in thread
From: Dave Chinner @ 2020-07-09 22:36 UTC (permalink / raw)
  To: Gao Xiang; +Cc: linux-xfs, Darrick J. Wong, Brian Foster

On Thu, Jul 09, 2020 at 06:36:21PM +0800, Gao Xiang wrote:
> Hi Dave,
> 
> On Thu, Jul 09, 2020 at 12:32:46PM +1000, Dave Chinner wrote:
> 
> ...
> 
> > > > 	if (trylock AGI)
> > > > 		return;
> > > 
> > > (adding some notes here, this patch doesn't use try lock here
> > >  finally but unlock perag and take AGI and relock and recheck tail_empty....
> > >  since the tail non-empty is rare...)
> > 
> > *nod*
> > 
> > My point was largely that this sort of thing is really obvious and
> > easy to optimise once the locking is cleanly separated. Adding a
> > trylock rather than drop/relock is another patch for the series :P
> 
> I thought trylock way yesterday as well...
> Apart from that we don't have the AGI trylock method yet, it seems
> not work as well...
> 
> Thinking about one thread is holding a AGI lock and wants to lock perag.
> And another thread holds a perag, but the trylock will fail and the second
> wait-lock will cause deadlock... like below:
> 
> Thread 1			Thread 2
>  lock AGI
>                                  lock perag
>  lock perag
>                                  trylock AGI
>                                  lock AGI           <- dead lock here

You forgot to unlock the perag before locking the AGI, so of course
it will deadlock.

Yes, I know my psuedocode example didn't explicitly unlock the perag
because I simply forgot to write it in as I was going. It's pretty
common that quickly written psuedo code to explain a concept is not
perfect, so you still need to think about whether it is correct,
complete, etc.  i.e. if you do:

Thread 1			Thread 2
 lock AGI
                                 lock perag
 lock perag
                                 trylock AGI
				 unlock perag
                                 lock AGI
				 <blocks>

There is no deadlock in this algorithm.

> And trylock perag instead seems not work well as well, since it fails
> more frequently so we need lock AGI and then lock perag as a fallback.

That's kind of what I'd expect - if there is contention on the AGI,
the trylock will fail and we know we are about to sleep. The point
is that "trylock fails" now gives us a point at which we can measure
contention that puts us into the slow path. And because we are now
in the slow path, the overhead of backing out just doesn't matter
because we are going to sleep anyway.

We use this pattern all over XFS to attempt to get locks in reverse
order with minimal overhead in fast paths...

> So currently, I think it's better to use unlock, do something, regrab,
> recheck method for now...

I think that's a mistake - it's smells of trying to optimise the
implementation around immediately observed behaviour instead of
designing a clean, well structured locking scheme that will not need
to be re-optimised over and over again as the algorithm it protects
changes over time.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [RFC PATCH v2 0/3] xfs: more unlinked inode list optimization v2
  2020-07-07 13:57 [RFC PATCH 0/2] xfs: more unlinked inode list optimization v1 Gao Xiang
  2020-07-07 13:57 ` [RFC PATCH 1/2] xfs: arrange all unlinked inodes into one list Gao Xiang
  2020-07-07 13:57 ` [RFC PATCH 2/2] xfs: don't access AGI on unlinked inodes if it can Gao Xiang
@ 2020-07-24  6:12 ` Gao Xiang
  2020-07-24  6:12   ` [RFC PATCH v2 1/3] xfs: arrange all unlinked inodes into one list Gao Xiang
                     ` (3 more replies)
  2 siblings, 4 replies; 29+ messages in thread
From: Gao Xiang @ 2020-07-24  6:12 UTC (permalink / raw)
  To: linux-xfs; +Cc: Dave Chinner, Darrick J. Wong, Brian Foster, Gao Xiang

Hi forks,

Sorry for long delay these days due to my work ineffective, this is RFC
v2 of this patchset addressing previous constructive comments...

This RFC patchset mainly addresses the thoughts [*] and [**] from Dave's
original patchset,
https://lore.kernel.org/r/20200623095015.1934171-1-david@fromorbit.com

In short, it focues on the following ideas mentioned by Dave:
 - use bucket 0 instead of multiple buckets since in-memory double
   linked list finally works;

 - avoid taking AGI buffer and unnecessary AGI update if possible, so
   1) add a new lock and keep proper locking order to avoid deadlock;
   2) insert a new unlinked inode from the tail instead of head;

In addition, it's worth noticing 3 things:
 - xfs_iunlink_remove() should support old multiple buckets in order
   to keep old inode unlinked list (old image) working when recovering.

 - (but) OTOH, the old kernel recovery _shouldn't_ work with new image
   since the bucket_index from old xfs_iunlink_remove() is generated
   by the old formula (rather than keep in xfs_inode), which is now
   fixed as 0. So this feature is not forward compatible without some
   extra backport patches;

 - a tail xfs_inode pointer is also added in the perag, which keeps 
   track of the tail of bucket 0 since it's mainly used for xfs_iunlink().

 - the old kernel recovery shouldn't work with new image since
     its bucket_index from old kernel is 

The git tree is also available at
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/linux.git tags/xfs/iunlink_opt_v2

Gitweb:
https://git.kernel.org/pub/scm/linux/kernel/git/xiang/linux.git/log/?h=xfs/iunlink_opt_v2

Some limited test for debugging only is done (mainly fsstress and manual
power-cut tests) and it's worth noticing that when I tested fsstress for
much longer time, it showed the following kernel message twice and it
haven't been reproduced again for about 3 days, and I have no idea what's
wrong over the current logic, so it could be better to send out the patchset
first and go on reproducing with extra debugging prints added...

[ 7884.930106] XFS (sda): bad inode magic/vsn daddr 111232 #0 (magic=5050)
[ 7884.930988] XFS (sda): Metadata corruption detected at xfs_buf_ioend+0x11a/0x190, xfs_inode block 0x1b280 xfs_inode_buf_verify
[ 7884.932418] XFS (sda): Unmount and run xfs_repair
[ 7884.933154] XFS (sda): First 128 bytes of corrupted metadata buffer:
[ 7884.934253] 00000000: 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50  PPPPPPPPPPPPPPPP
[ 7884.935495] 00000010: 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50  PPPPPPPPPPPPPPPP
[ 7884.936752] 00000020: 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50  PPPPPPPPPPPPPPPP
[ 7884.937761] 00000030: 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50  PPPPPPPPPPPPPPPP
[ 7884.939336] 00000040: 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50  PPPPPPPPPPPPPPPP
[ 7884.940914] 00000050: 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50  PPPPPPPPPPPPPPPP
[ 7884.942382] 00000060: 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50  PPPPPPPPPPPPPPPP
[ 7884.943513] 00000070: 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50  PPPPPPPPPPPPPPPP
[ 7884.945006] XFS (sda): metadata I/O error in "xfs_imap_to_bp+0x61/0xd0" at daddr 0x1b280 len 32 error 117
[ 7884.952066] XFS (sda): xfs_do_force_shutdown(0x1) called from line 312 of file fs/xfs/xfs_trans_buf.c. Return address = ffffffff983ea7e0
[ 7884.952494] sda: writeback error on inode 20695, offset 770048, sector 1970864
[ 7884.954353] XFS (sda): I/O Error Detected. Shutting down filesystem
[ 7884.956566] sda: writeback error on inode 117946, offset 2347008, sector 612424
[ 7884.956802] XFS (sda): Please unmount the filesystem and rectify the problem(s)
[ 7884.958943] XFS (sda): xfs_inactive_ifree: xfs_trans_commit returned error -117
fsstress: check_cwd failure
[ 7884.963475] XFS (sda): xfs_imap_lookup: xfs_ialloc_read_agi() returned error -5, agno 0

Comments and directions are welcomed. :)

Thanks,
Gao Xiang


Changes since v1:
 - show some number of reduced AGI buffer modification (Darrick);
 - add comments about pag_unlinked_mutex, pag_unlinked_tail (Darrick);
 - avoid touching xfs_iunlink_update_bucket() logic (Dave);
 - spilt "xfs: don't access AGI on unlinked inodes if it can" into 2 patches (Dave);
 - add xfs_iunlink_insert_lock/xfs_iunlink_remove_lock/xfs_iunlink_unlock to
   make the code easy to follow (Dave).

Gao Xiang (3):
  xfs: arrange all unlinked inodes into one list
  xfs: introduce perag iunlink lock
  xfs: insert unlinked inodes from tail

 fs/xfs/xfs_inode.c       | 202 ++++++++++++++++++++++++++++-----------
 fs/xfs/xfs_log_recover.c |   5 +
 fs/xfs/xfs_mount.c       |   2 +
 fs/xfs/xfs_mount.h       |   6 ++
 4 files changed, 159 insertions(+), 56 deletions(-)

-- 
2.18.1


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [RFC PATCH v2 1/3] xfs: arrange all unlinked inodes into one list
  2020-07-24  6:12 ` [RFC PATCH v2 0/3] xfs: more unlinked inode list optimization v2 Gao Xiang
@ 2020-07-24  6:12   ` Gao Xiang
  2020-07-24  6:12   ` [RFC PATCH v2 2/3] xfs: introduce perag iunlink lock Gao Xiang
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 29+ messages in thread
From: Gao Xiang @ 2020-07-24  6:12 UTC (permalink / raw)
  To: linux-xfs; +Cc: Dave Chinner, Darrick J. Wong, Brian Foster, Gao Xiang

There is no need to keep old multiple short unlink inode buckets
since we have an in-memory double linked list for all unlinked
inodes.

Apart from the perspective of the necessity, the main advantage
is the log and AGI update can be reduced since each AG has the
only one head now, which is implemented in the following patches.

Therefore, this patch applies the new way in xfs_iunlink() and
keep the old approach in xfs_iunlink_remove_inode() path as
well so inode eviction can still work properly in recovery.

Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
---
 fs/xfs/xfs_inode.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index ab288424764c..4994398373df 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -33,6 +33,7 @@
 #include "xfs_symlink.h"
 #include "xfs_trans_priv.h"
 #include "xfs_log.h"
+#include "xfs_log_priv.h"
 #include "xfs_bmap_btree.h"
 #include "xfs_reflink.h"
 #include "xfs_iunlink_item.h"
@@ -2001,14 +2002,13 @@ xfs_iunlink_insert_inode(
 	xfs_agino_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
 	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
 	xfs_agino_t		next_agino;
-	short			bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
 
 	/*
 	 * Get the index into the agi hash table for the list this inode will
 	 * go on.  Make sure the pointer isn't garbage and that this inode
 	 * isn't already on the list.
 	 */
-	next_agino = be32_to_cpu(agi->agi_unlinked[bucket_index]);
+	next_agino = be32_to_cpu(agi->agi_unlinked[0]);
 	if (next_agino == agino ||
 	    !xfs_verify_agino_or_null(mp, agno, next_agino)) {
 		xfs_buf_mark_corrupt(agibp);
@@ -2036,7 +2036,7 @@ xfs_iunlink_insert_inode(
 	}
 
 	/* Point the head of the list to point to this inode. */
-	return xfs_iunlink_update_bucket(tp, agno, agibp, bucket_index, agino);
+	return xfs_iunlink_update_bucket(tp, agno, agibp, 0, agino);
 }
 
 /*
@@ -2055,10 +2055,17 @@ xfs_iunlink_remove_inode(
 	xfs_agino_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
 	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
 	xfs_agino_t		next_agino = ip->i_next_unlinked;
-	short			bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
 	int			error;
 
 	if (ip->i_prev_unlinked == NULLAGINO) {
+		struct xlog	*log = mp->m_log;
+		short		bucket_index = 0;
+
+		/* During recovery, the old multiple bucket can be applied */
+		if ((!log || log->l_flags & XLOG_RECOVERY_NEEDED) &&
+		    be32_to_cpu(agi->agi_unlinked[0]) != agino)
+			bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
+
 		/* remove from head of list */
 		if (be32_to_cpu(agi->agi_unlinked[bucket_index]) != agino) {
 			xfs_buf_mark_corrupt(agibp);
-- 
2.18.1


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [RFC PATCH v2 2/3] xfs: introduce perag iunlink lock
  2020-07-24  6:12 ` [RFC PATCH v2 0/3] xfs: more unlinked inode list optimization v2 Gao Xiang
  2020-07-24  6:12   ` [RFC PATCH v2 1/3] xfs: arrange all unlinked inodes into one list Gao Xiang
@ 2020-07-24  6:12   ` Gao Xiang
  2020-07-24  6:12   ` [RFC PATCH v2 3/3] xfs: insert unlinked inodes from tail Gao Xiang
  2020-08-18 13:30   ` [RFC PATCH v4 0/3] xfs: more unlinked inode list optimization v4 Gao Xiang
  3 siblings, 0 replies; 29+ messages in thread
From: Gao Xiang @ 2020-07-24  6:12 UTC (permalink / raw)
  To: linux-xfs; +Cc: Dave Chinner, Darrick J. Wong, Brian Foster, Gao Xiang

Currently, we use AGI buffer lock to protect in-memory linked list
for unlinked inodes but since it's not necessary to modify AGI
unless the head of the unlinked list is modified.

Therefore, let's add another per-AG dedicated lock to protect the
whole list, and removing AGI buffer lock in xfs_iunlink_remove()
if the head is untouched.

Note that some paths (e.g. xfs_create_tmpfile()) take AGI lock in
its transaction in advance, so the proper locking order is only
AGI lock -> unlinked list lock.

Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
---
 fs/xfs/xfs_inode.c | 79 ++++++++++++++++++++++++++++++++++++++--------
 fs/xfs/xfs_mount.c |  1 +
 fs/xfs/xfs_mount.h |  3 ++
 3 files changed, 70 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 4994398373df..d78aaa8ce252 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2039,6 +2039,46 @@ xfs_iunlink_insert_inode(
 	return xfs_iunlink_update_bucket(tp, agno, agibp, 0, agino);
 }
 
+/*
+ * Lock the perag and take AGI lock if agi_unlinked is touched as well for
+ * xfs_iunlink_remove_inode().
+ *
+ * Inode allocation in the O_TMPFILE path defines the AGI/unlinked list lock
+ * order as being AGI->perag unlinked list lock. We are inverting it here as
+ * the fast path tail addition does not need to modify the AGI at all. Hence
+ * we only need the AGI lock if the tail is empty, correct lock order.
+ */
+static struct xfs_perag *
+xfs_iunlink_remove_lock(
+	xfs_agino_t		agno,
+	struct xfs_trans        *tp,
+	struct xfs_inode	*ip,
+	struct xfs_buf		**agibpp)
+{
+	struct xfs_mount        *mp = tp->t_mountp;
+	struct xfs_perag	*pag;
+	int			error;
+
+	pag = xfs_perag_get(mp, agno);
+	mutex_lock(&pag->pag_unlinked_mutex);
+
+	if (ip->i_prev_unlinked == NULLAGINO) {
+		mutex_unlock(&pag->pag_unlinked_mutex);
+		/*
+		 * some paths (e.g. xfs_create_tmpfile) could take AGI lock
+		 * in this transaction in advance and the only locking order
+		 * AGI buf lock -> pag_unlinked_mutex is safe.
+		 */
+		error = xfs_read_agi(mp, tp, agno, agibpp);
+		if (error) {
+			xfs_perag_put(pag);
+			return ERR_PTR(error);
+		}
+		mutex_lock(&pag->pag_unlinked_mutex);
+	}
+	return pag;
+}
+
 /*
  * Remove can be from anywhere in the list, so we have to do two adjacent inode
  * lookups here so we can update list pointers. We may be at the head or the
@@ -2046,12 +2086,12 @@ xfs_iunlink_insert_inode(
  */
 static int
 xfs_iunlink_remove_inode(
+	struct xfs_perag	*pag,
 	struct xfs_trans	*tp,
 	struct xfs_buf		*agibp,
 	struct xfs_inode	*ip)
 {
 	struct xfs_mount	*mp = tp->t_mountp;
-	struct xfs_agi		*agi = agibp->b_addr;
 	xfs_agino_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
 	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
 	xfs_agino_t		next_agino = ip->i_next_unlinked;
@@ -2059,6 +2099,7 @@ xfs_iunlink_remove_inode(
 
 	if (ip->i_prev_unlinked == NULLAGINO) {
 		struct xlog	*log = mp->m_log;
+		struct xfs_agi	*agi = agibp->b_addr;
 		short		bucket_index = 0;
 
 		/* During recovery, the old multiple bucket can be applied */
@@ -2083,7 +2124,7 @@ xfs_iunlink_remove_inode(
 		/* lookup previous inode and update to point at next */
 		struct xfs_inode	*pip;
 
-		pip = xfs_iunlink_lookup_prev(agibp->b_pag, ip);
+		pip = xfs_iunlink_lookup_prev(pag, ip);
 		if (IS_ERR_OR_NULL(pip))
 			return -EFSCORRUPTED;
 
@@ -2102,7 +2143,7 @@ xfs_iunlink_remove_inode(
 	if (ip->i_next_unlinked != NULLAGINO) {
 		struct xfs_inode	*nip;
 
-		nip = xfs_iunlink_lookup_next(agibp->b_pag, ip);
+		nip = xfs_iunlink_lookup_next(pag, ip);
 		if (IS_ERR_OR_NULL(nip))
 			return -EFSCORRUPTED;
 
@@ -2126,6 +2167,15 @@ xfs_iunlink_remove_inode(
 	return 0;
 }
 
+static void
+xfs_iunlink_unlock(
+	struct xfs_perag	*pag)
+{
+	/* Does not unlock AGI, ever. xfs_trans_commit() does that. */
+	mutex_unlock(&pag->pag_unlinked_mutex);
+	xfs_perag_put(pag);
+}
+
 /*
  * This is called when the inode's link count has gone to 0 or we are creating
  * a tmpfile via O_TMPFILE.  The inode @ip must have nlink == 0.
@@ -2143,7 +2193,6 @@ xfs_iunlink(
 	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
 	int			error;
 
-	ASSERT(ip->i_next_unlinked == NULLAGINO);
 	ASSERT(VFS_I(ip)->i_nlink == 0);
 	ASSERT(VFS_I(ip)->i_mode != 0);
 	trace_xfs_iunlink(ip);
@@ -2153,7 +2202,10 @@ xfs_iunlink(
 	if (error)
 		return error;
 
-	return xfs_iunlink_insert_inode(tp, agibp, ip);
+	mutex_lock(&agibp->b_pag->pag_unlinked_mutex);
+	error = xfs_iunlink_insert_inode(tp, agibp, ip);
+	mutex_unlock(&agibp->b_pag->pag_unlinked_mutex);
+	return error;
 }
 
 /*
@@ -2164,19 +2216,20 @@ xfs_iunlink_remove(
 	struct xfs_trans	*tp,
 	struct xfs_inode	*ip)
 {
-	struct xfs_mount	*mp = tp->t_mountp;
-	struct xfs_buf		*agibp;
-	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
+	struct xfs_buf		*agibp = NULL;
+	struct xfs_perag	*pag;
 	int			error;
 
 	trace_xfs_iunlink_remove(ip);
 
-	/* Get the agi buffer first.  It ensures lock ordering on the list. */
-	error = xfs_read_agi(mp, tp, agno, &agibp);
-	if (error)
-		return error;
+	pag = xfs_iunlink_remove_lock(XFS_INO_TO_AGNO(tp->t_mountp, ip->i_ino),
+		tp, ip, &agibp);
+	if (IS_ERR(pag))
+		return PTR_ERR(pag);
 
-	return xfs_iunlink_remove_inode(tp, agibp, ip);
+	error = xfs_iunlink_remove_inode(pag, tp, agibp, ip);
+	xfs_iunlink_unlock(pag);
+	return error;
 }
 
 /*
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 031e96ff022d..f94e14059e61 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -223,6 +223,7 @@ xfs_initialize_perag(
 		if (first_initialised == NULLAGNUMBER)
 			first_initialised = index;
 		spin_lock_init(&pag->pag_state_lock);
+		mutex_init(&pag->pag_unlinked_mutex);
 	}
 
 	index = xfs_set_inode_alloc(mp, agcount);
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index a72cfcaa4ad1..9a1d0f239fa4 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -372,6 +372,9 @@ typedef struct xfs_perag {
 	/* reference count */
 	uint8_t			pagf_refcount_level;
 
+	/* lock to protect unlinked inode list */
+	struct mutex            pag_unlinked_mutex;
+
 	/*
 	 * Unlinked inode information.  This incore information reflects
 	 * data stored in the AGI, so callers must hold the AGI buffer lock
-- 
2.18.1


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [RFC PATCH v2 3/3] xfs: insert unlinked inodes from tail
  2020-07-24  6:12 ` [RFC PATCH v2 0/3] xfs: more unlinked inode list optimization v2 Gao Xiang
  2020-07-24  6:12   ` [RFC PATCH v2 1/3] xfs: arrange all unlinked inodes into one list Gao Xiang
  2020-07-24  6:12   ` [RFC PATCH v2 2/3] xfs: introduce perag iunlink lock Gao Xiang
@ 2020-07-24  6:12   ` Gao Xiang
  2020-08-18 13:30   ` [RFC PATCH v4 0/3] xfs: more unlinked inode list optimization v4 Gao Xiang
  3 siblings, 0 replies; 29+ messages in thread
From: Gao Xiang @ 2020-07-24  6:12 UTC (permalink / raw)
  To: linux-xfs; +Cc: Dave Chinner, Darrick J. Wong, Brian Foster, Gao Xiang

Currently, AGI buffer is always touched since xfs_iunlink()
adds unlinked inodes from head unconditionally, but since we
now have the only one unlinked list and if we insert unlinked
inodes from tail instead and there're more than 1 inodes, AGI
buffer modification could be avoided.

In order to do that, let's keep track of the tail of unlinked
inode into the perag all the time in order for xfs_iunlink()
to use.

With this change, it shows that only 938 of 10000 operations
modifies the head of unlinked list with the following workload:
 seq 1 10000 | xargs touch
 find . | xargs -n3 -P100 rm

Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
---
 fs/xfs/xfs_inode.c       | 120 ++++++++++++++++++++++++---------------
 fs/xfs/xfs_log_recover.c |   5 ++
 fs/xfs/xfs_mount.c       |   1 +
 fs/xfs/xfs_mount.h       |   3 +
 4 files changed, 84 insertions(+), 45 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index d78aaa8ce252..3cfd84b76955 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1986,6 +1986,38 @@ xfs_iunlink_update_bucket(
 	return 0;
 }
 
+/*
+ * Lock the perag and take AGI lock if agi_unlinked is touched as well
+ * for xfs_iunlink_insert_inode(). As for the details of locking order,
+ * refer to the comments of xfs_iunlink_remove_lock().
+ */
+static struct xfs_perag *
+xfs_iunlink_insert_lock(
+	xfs_agino_t		agno,
+	struct xfs_trans        *tp,
+	struct xfs_inode	*ip,
+	struct xfs_buf		**agibpp)
+{
+	struct xfs_mount        *mp = tp->t_mountp;
+	struct xfs_perag	*pag;
+	int			error;
+
+	pag = xfs_perag_get(mp, agno);
+	mutex_lock(&pag->pag_unlinked_mutex);
+
+	if (!pag->pag_unlinked_tail) {
+		mutex_unlock(&pag->pag_unlinked_mutex);
+
+		error = xfs_read_agi(mp, tp, agno, agibpp);
+		if (error) {
+			xfs_perag_put(pag);
+			return ERR_PTR(error);
+		}
+		mutex_lock(&pag->pag_unlinked_mutex);
+	}
+	return pag;
+}
+
 /*
  * Always insert at the head, so we only have to do a next inode lookup to
  * update it's prev pointer. The AGI bucket will point at the one we are
@@ -1993,50 +2025,47 @@ xfs_iunlink_update_bucket(
  */
 static int
 xfs_iunlink_insert_inode(
+	struct xfs_perag	*pag,
 	struct xfs_trans	*tp,
 	struct xfs_buf		*agibp,
 	struct xfs_inode	*ip)
 {
 	struct xfs_mount	*mp = tp->t_mountp;
-	struct xfs_agi		*agi = agibp->b_addr;
-	xfs_agino_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
+	struct xfs_inode	*pip = pag->pag_unlinked_tail;
 	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
 	xfs_agino_t		next_agino;
 
-	/*
-	 * Get the index into the agi hash table for the list this inode will
-	 * go on.  Make sure the pointer isn't garbage and that this inode
-	 * isn't already on the list.
-	 */
-	next_agino = be32_to_cpu(agi->agi_unlinked[0]);
-	if (next_agino == agino ||
-	    !xfs_verify_agino_or_null(mp, agno, next_agino)) {
-		xfs_buf_mark_corrupt(agibp);
-		return -EFSCORRUPTED;
-	}
-
-	ip->i_prev_unlinked = NULLAGINO;
-	ip->i_next_unlinked = next_agino;
-	if (ip->i_next_unlinked != NULLAGINO) {
-		struct xfs_inode	*nip;
-
-		nip = xfs_iunlink_lookup_next(agibp->b_pag, ip);
-		if (IS_ERR_OR_NULL(nip))
-			return -EFSCORRUPTED;
+	if (!pip) {
+		xfs_agino_t	agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
+		struct xfs_agi	*agi = agibp->b_addr;
+		int		error;
 
-		if (nip->i_prev_unlinked != NULLAGINO) {
-			xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__,
-						NULL, 0, __this_address);
+		ip->i_prev_unlinked = NULLAGINO;
+		/*
+		 * Get the index into the agi hash table for the list this
+		 * inode will go on.  Make sure the pointer isn't garbage
+		 * and that this inode isn't already on the list.
+		 */
+		next_agino = be32_to_cpu(agi->agi_unlinked[0]);
+		if (next_agino == agino ||
+		    !xfs_verify_agino_or_null(mp, agno, next_agino)) {
+			xfs_buf_mark_corrupt(agibp);
 			return -EFSCORRUPTED;
 		}
-		nip->i_prev_unlinked = agino;
 
-		/* update the on disk inode now */
-		xfs_iunlink_log(tp, ip);
+		/* Point the head of the list to point to this inode. */
+		error = xfs_iunlink_update_bucket(tp, agno, agibp, 0, agino);
+		if (error)
+			return error;
+	} else {
+		ip->i_prev_unlinked = XFS_INO_TO_AGINO(mp, pip->i_ino);
+		ASSERT(pip->i_next_unlinked == NULLAGINO);
+		pip->i_next_unlinked = agino;
+		xfs_iunlink_log(tp, pip);
 	}
-
-	/* Point the head of the list to point to this inode. */
-	return xfs_iunlink_update_bucket(tp, agno, agibp, 0, agino);
+	ip->i_next_unlinked = NULLAGINO;
+	pag->pag_unlinked_tail = ip;
+	return 0;
 }
 
 /*
@@ -2095,6 +2124,7 @@ xfs_iunlink_remove_inode(
 	xfs_agino_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
 	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
 	xfs_agino_t		next_agino = ip->i_next_unlinked;
+	struct xfs_inode	*pip = NULL;
 	int			error;
 
 	if (ip->i_prev_unlinked == NULLAGINO) {
@@ -2122,8 +2152,6 @@ xfs_iunlink_remove_inode(
 			return -EFSCORRUPTED;
 	} else {
 		/* lookup previous inode and update to point at next */
-		struct xfs_inode	*pip;
-
 		pip = xfs_iunlink_lookup_prev(pag, ip);
 		if (IS_ERR_OR_NULL(pip))
 			return -EFSCORRUPTED;
@@ -2139,8 +2167,12 @@ xfs_iunlink_remove_inode(
 		xfs_iunlink_log(tp, pip);
 	}
 
-	/* lookup the next inode and update to point at prev */
-	if (ip->i_next_unlinked != NULLAGINO) {
+	if (next_agino == NULLAGINO) {
+		/* only care about the tail of bucket 0 for xfs_iunlink() */
+		if (pag->pag_unlinked_tail == ip)
+			pag->pag_unlinked_tail = pip;
+	} else {
+		/* lookup the next inode and update to point at prev */
 		struct xfs_inode	*nip;
 
 		nip = xfs_iunlink_lookup_next(pag, ip);
@@ -2188,23 +2220,21 @@ xfs_iunlink(
 	struct xfs_trans	*tp,
 	struct xfs_inode	*ip)
 {
-	struct xfs_mount	*mp = tp->t_mountp;
-	struct xfs_buf		*agibp;
-	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
+	struct xfs_buf		*agibp = NULL;
+	struct xfs_perag	*pag;
 	int			error;
 
 	ASSERT(VFS_I(ip)->i_nlink == 0);
 	ASSERT(VFS_I(ip)->i_mode != 0);
 	trace_xfs_iunlink(ip);
 
-	/* Get the agi buffer first.  It ensures lock ordering on the list. */
-	error = xfs_read_agi(mp, tp, agno, &agibp);
-	if (error)
-		return error;
+	pag = xfs_iunlink_insert_lock(XFS_INO_TO_AGNO(tp->t_mountp, ip->i_ino),
+		tp, ip, &agibp);
+	if (IS_ERR(pag))
+		return PTR_ERR(pag);
 
-	mutex_lock(&agibp->b_pag->pag_unlinked_mutex);
-	error = xfs_iunlink_insert_inode(tp, agibp, ip);
-	mutex_unlock(&agibp->b_pag->pag_unlinked_mutex);
+	error = xfs_iunlink_insert_inode(pag, tp, agibp, ip);
+	xfs_iunlink_unlock(pag);
 	return error;
 }
 
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index d47eea31c165..30198aea7335 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -2801,6 +2801,11 @@ xlog_recover_iunlink_ag(
 					prev_ip->i_next_unlinked = NULLAGINO;
 				break;
 			}
+
+			/* XXX: take pag_unlinked_mutex across the loop? */
+			if (!bucket)
+				agibp->b_pag->pag_unlinked_tail = ip;
+
 			if (prev_ip) {
 				ip->i_prev_unlinked = prev_agino;
 				xfs_irele(prev_ip);
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index f94e14059e61..f1cd3e9c4da5 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -224,6 +224,7 @@ xfs_initialize_perag(
 			first_initialised = index;
 		spin_lock_init(&pag->pag_state_lock);
 		mutex_init(&pag->pag_unlinked_mutex);
+		pag->pag_unlinked_tail = NULL;
 	}
 
 	index = xfs_set_inode_alloc(mp, agcount);
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 9a1d0f239fa4..934e7c373042 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -375,6 +375,9 @@ typedef struct xfs_perag {
 	/* lock to protect unlinked inode list */
 	struct mutex            pag_unlinked_mutex;
 
+	/* point to the inode tail of AGI unlinked bucket 0 */
+	struct xfs_inode	*pag_unlinked_tail;
+
 	/*
 	 * Unlinked inode information.  This incore information reflects
 	 * data stored in the AGI, so callers must hold the AGI buffer lock
-- 
2.18.1


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [RFC PATCH v4 0/3] xfs: more unlinked inode list optimization v4
  2020-07-24  6:12 ` [RFC PATCH v2 0/3] xfs: more unlinked inode list optimization v2 Gao Xiang
                     ` (2 preceding siblings ...)
  2020-07-24  6:12   ` [RFC PATCH v2 3/3] xfs: insert unlinked inodes from tail Gao Xiang
@ 2020-08-18 13:30   ` Gao Xiang
  2020-08-18 13:30     ` [RFC PATCH v4 1/3] xfs: get rid of unused pagi_unlinked_hash Gao Xiang
                       ` (4 more replies)
  3 siblings, 5 replies; 29+ messages in thread
From: Gao Xiang @ 2020-08-18 13:30 UTC (permalink / raw)
  To: linux-xfs; +Cc: Dave Chinner, Darrick J. Wong, Brian Foster, Gao Xiang

Hi forks,

This is RFC v4 version which is based on Dave's latest patchset:
 https://lore.kernel.org/r/20200812092556.2567285-1-david@fromorbit.com

I didn't send out v3 because it was based on Dave's previous RFC
patchset, but I'm still not quite sure to drop RFC tag since this
version is different from the previous versions...

Changes since v2:
 - rebase on new patchset, and omit the original first patch
   "xfs: arrange all unlinked inodes into one list" since it now
   has better form in the base patchset;

 - a tail xfs_inode pointer is no longer needed since the original
   patchset introduced list_head iunlink infrastructure and it can
   be used to get the tail inode;

 - take pag_iunlink_mutex lock until all iunlink log items are
   committed. Otherwise, xfs_iunlink_log() order would not be equal
   to the trans commit order so it can mis-reorder and cause metadata
   corruption I mentioned in v2.

   In order to archive that, some recursive count is introduced since
   there could be several iunlink operations in one transaction,
   and introduce some per-AG fields as well since these operations
   in the transaction may not operate inodes in the same AG. we may
   also need to take AGI buffer lock in advance (e.g. whiteout rename
   path) due to iunlink operations and locking order constraint.
   For more details, see related inlined comments as well...

 - "xfs: get rid of unused pagi_unlinked_hash" would be better folded
   into original patchset since pagi_unlinked_hash is no longer needed.

============

[Original text]

This RFC patchset mainly addresses the thoughts [*] and [**] from Dave's
original patchset,
https://lore.kernel.org/r/20200623095015.1934171-1-david@fromorbit.com

In short, it focues on the following ideas mentioned by Dave:
 - use bucket 0 instead of multiple buckets since in-memory double
   linked list finally works;

 - avoid taking AGI buffer and unnecessary AGI update if possible, so
   1) add a new lock and keep proper locking order to avoid deadlock;
   2) insert a new unlinked inode from the tail instead of head;

In addition, it's worth noticing 3 things:
 - xfs_iunlink_remove() should support old multiple buckets in order
   to keep old inode unlinked list (old image) working when recovering.

 - (but) OTOH, the old kernel recovery _shouldn't_ work with new image
   since the bucket_index from old xfs_iunlink_remove() is generated
   by the old formula (rather than keep in xfs_inode), which is now
   fixed as 0. So this feature is not forward compatible without some
   extra backport patches;

 - a tail xfs_inode pointer is also added in the perag, which keeps 
   track of the tail of bucket 0 since it's mainly used for xfs_iunlink().


The git tree is also available at
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/linux.git tags/xfs/iunlink_opt_v4

Gitweb:
https://git.kernel.org/pub/scm/linux/kernel/git/xiang/linux.git/log/?h=xfs/iunlink_opt_v4


Some preliminary tests are done (including fstests, but there seems
some pre-exist failures and I haven't looked into yet). And I confirmed
there was no previous metadata corruption mentioned in RFC v2 anymore.

To confirm that I'm in the right direction, I post the latest version
now since it haven't been updated for a while.

Comments and directions are welcomed. :)

Thanks,
Gao Xiang

Gao Xiang (3):
  xfs: get rid of unused pagi_unlinked_hash
  xfs: introduce perag iunlink lock
  xfs: insert unlinked inodes from tail

 fs/xfs/xfs_inode.c        | 194 ++++++++++++++++++++++++++++++++------
 fs/xfs/xfs_inode.h        |   1 +
 fs/xfs/xfs_iunlink_item.c |  16 ++++
 fs/xfs/xfs_mount.c        |   4 +
 fs/xfs/xfs_mount.h        |  14 +--
 5 files changed, 193 insertions(+), 36 deletions(-)

-- 
2.18.1


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [RFC PATCH v4 1/3] xfs: get rid of unused pagi_unlinked_hash
  2020-08-18 13:30   ` [RFC PATCH v4 0/3] xfs: more unlinked inode list optimization v4 Gao Xiang
@ 2020-08-18 13:30     ` Gao Xiang
  2020-08-19  0:54       ` Darrick J. Wong
  2020-08-18 13:30     ` [RFC PATCH v4 2/3] xfs: introduce perag iunlink lock Gao Xiang
                       ` (3 subsequent siblings)
  4 siblings, 1 reply; 29+ messages in thread
From: Gao Xiang @ 2020-08-18 13:30 UTC (permalink / raw)
  To: linux-xfs; +Cc: Dave Chinner, Darrick J. Wong, Brian Foster, Gao Xiang

pagi_unlinked_hash is unused since no backref infrastructure now.
(it's better to fold it into original patchset.)

Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
---
 fs/xfs/xfs_mount.h | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index c35a6c463529..98109801a995 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -372,13 +372,6 @@ typedef struct xfs_perag {
 
 	/* reference count */
 	uint8_t			pagf_refcount_level;
-
-	/*
-	 * Unlinked inode information.  This incore information reflects
-	 * data stored in the AGI, so callers must hold the AGI buffer lock
-	 * or have some other means to control concurrency.
-	 */
-	struct rhashtable	pagi_unlinked_hash;
 } xfs_perag_t;
 
 static inline struct xfs_ag_resv *
-- 
2.18.1


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [RFC PATCH v4 2/3] xfs: introduce perag iunlink lock
  2020-08-18 13:30   ` [RFC PATCH v4 0/3] xfs: more unlinked inode list optimization v4 Gao Xiang
  2020-08-18 13:30     ` [RFC PATCH v4 1/3] xfs: get rid of unused pagi_unlinked_hash Gao Xiang
@ 2020-08-18 13:30     ` Gao Xiang
  2020-08-19  1:06       ` Darrick J. Wong
  2020-08-18 13:30     ` [RFC PATCH v4 3/3] xfs: insert unlinked inodes from tail Gao Xiang
                       ` (2 subsequent siblings)
  4 siblings, 1 reply; 29+ messages in thread
From: Gao Xiang @ 2020-08-18 13:30 UTC (permalink / raw)
  To: linux-xfs; +Cc: Dave Chinner, Darrick J. Wong, Brian Foster, Gao Xiang

Currently, we use AGI buffer lock to protect in-memory linked
list for unlinked inodes but since it's not necessary to modify
AGI unless the head of the unlinked list is modified.

Therefore, let's add another per-AG dedicated lock to protect
the whole list, and get rid of taking AGI buffer lock in
xfs_iunlink_remove() if the head is untouched.

Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
---
 fs/xfs/xfs_inode.c        | 111 +++++++++++++++++++++++++++++++++-----
 fs/xfs/xfs_inode.h        |   1 +
 fs/xfs/xfs_iunlink_item.c |  16 ++++++
 fs/xfs/xfs_mount.c        |   4 ++
 fs/xfs/xfs_mount.h        |   9 ++++
 5 files changed, 128 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 7ee778bcde06..f32a1172b5cd 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -47,7 +47,7 @@ kmem_zone_t *xfs_inode_zone;
 #define	XFS_ITRUNC_MAX_EXTENTS	2
 
 STATIC int xfs_iunlink(struct xfs_trans *, struct xfs_inode *);
-STATIC int xfs_iunlink_remove(struct xfs_trans *, struct xfs_inode *);
+STATIC int xfs_iunlink_remove(struct xfs_trans *, struct xfs_inode *, bool);
 
 /*
  * helper function to extract extent size hint from inode
@@ -1398,7 +1398,7 @@ xfs_link(
 	 * Handle initial link state of O_TMPFILE inode
 	 */
 	if (VFS_I(sip)->i_nlink == 0) {
-		error = xfs_iunlink_remove(tp, sip);
+		error = xfs_iunlink_remove(tp, sip, false);
 		if (error)
 			goto error_return;
 	}
@@ -2001,6 +2001,18 @@ xfs_iunlink_insert_inode(
 	return xfs_iunlink_update_bucket(tp, agno, agibp, next_agino, agino);
 }
 
+void
+xfs_iunlink_unlock(
+	struct xfs_perag	*pag)
+{
+	/* Does not unlock AGI, ever. xfs_trans_commit() does that. */
+	if (!--pag->pag_iunlink_refcnt) {
+		smp_store_release(&pag->pag_iunlink_trans, NULL);
+		mutex_unlock(&pag->pag_iunlink_mutex);
+	}
+	xfs_perag_put(pag);
+}
+
 /*
  * This is called when the inode's link count has gone to 0 or we are creating
  * a tmpfile via O_TMPFILE.  The inode @ip must have nlink == 0.
@@ -2017,6 +2029,7 @@ xfs_iunlink(
 	struct xfs_buf		*agibp;
 	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
 	int			error;
+	struct xfs_perag	*pag;
 
 	ASSERT(VFS_I(ip)->i_nlink == 0);
 	ASSERT(VFS_I(ip)->i_mode != 0);
@@ -2027,6 +2040,14 @@ xfs_iunlink(
 	if (error)
 		return error;
 
+	/* XXX: will be shortly removed instead in the next commit. */
+	pag = xfs_perag_get(mp, agno);
+	/* paired with smp_store_release() in xfs_iunlink_unlock() */
+	if (smp_load_acquire(&pag->pag_iunlink_trans) != tp)
+		mutex_lock(&pag->pag_iunlink_mutex);
+	WRITE_ONCE(pag->pag_iunlink_trans, tp);
+	++pag->pag_iunlink_refcnt;
+
 	/*
 	 * Insert the inode into the on disk unlinked list, and if that
 	 * succeeds, then insert it into the in memory list. We do it in this
@@ -2037,11 +2058,72 @@ xfs_iunlink(
 	if (!error)
 		list_add(&ip->i_unlink, &agibp->b_pag->pag_ici_unlink_list);
 
+	xfs_iunlink_unlock(pag);
 	return error;
 }
 
+/*
+ * Lock the perag and take AGI lock if agi_unlinked is touched as well for
+ * xfs_iunlink_remove_inode().
+ *
+ * Inode allocation in the O_TMPFILE path defines the AGI/unlinked list lock
+ * order as being AGI->perag unlinked list lock. We are inverting it here as
+ * the fast path tail addition does not need to modify the AGI at all. Hence
+ * AGI lock is only needed if the head is modified, correct locking order.
+ */
+static struct xfs_perag *
+xfs_iunlink_remove_lock(
+	xfs_agino_t		agno,
+	struct xfs_trans        *tp,
+	struct xfs_inode	*ip,
+	struct xfs_buf		**agibpp,
+	bool			force_agi)
+{
+	struct xfs_mount        *mp = tp->t_mountp;
+	struct xfs_perag	*pag;
+	int			error;
+
+	pag = xfs_perag_get(mp, agno);
+	/* paired with smp_store_release() in xfs_iunlink_unlock() */
+	if (smp_load_acquire(&pag->pag_iunlink_trans) == tp) {
+		/*
+		 * if pag_iunlink_trans is the current trans, we're
+		 * in the current process context, so it's safe here.
+		 */
+		ASSERT(mutex_is_locked(&pag->pag_iunlink_mutex));
+		goto out;
+	}
+
+	if (!force_agi) {
+		mutex_lock(&pag->pag_iunlink_mutex);
+		if (ip != list_first_entry(&pag->pag_ici_unlink_list,
+				struct xfs_inode, i_unlink))
+			goto out;
+		mutex_unlock(&pag->pag_iunlink_mutex);
+	}
+
+	/*
+	 * some paths (e.g. xfs_create_tmpfile) could take AGI lock in
+	 * this transaction in advance and we should keep the proper
+	 * locking order: AGI buf lock and then pag_iunlink_mutex.
+	 */
+	error = xfs_read_agi(mp, tp, agno, agibpp);
+	if (error) {
+		xfs_perag_put(pag);
+		return ERR_PTR(error);
+	}
+
+	mutex_lock(&pag->pag_iunlink_mutex);
+	ASSERT(!pag->pag_iunlink_refcnt);
+out:
+	WRITE_ONCE(pag->pag_iunlink_trans, tp);
+	++pag->pag_iunlink_refcnt;
+	return pag;
+}
+
 static int
 xfs_iunlink_remove_inode(
+	struct xfs_perag	*pag,
 	struct xfs_trans	*tp,
 	struct xfs_buf		*agibp,
 	struct xfs_inode	*ip)
@@ -2055,7 +2137,7 @@ xfs_iunlink_remove_inode(
 	 * Get the next agino in the list. If we are at the end of the list,
 	 * then the previous inode's i_next_unlinked filed will get cleared.
 	 */
-	if (ip != list_last_entry(&agibp->b_pag->pag_ici_unlink_list,
+	if (ip != list_last_entry(&pag->pag_ici_unlink_list,
 					struct xfs_inode, i_unlink)) {
 		struct xfs_inode *nip = list_next_entry(ip, i_unlink);
 
@@ -2065,7 +2147,7 @@ xfs_iunlink_remove_inode(
 	/* Clear the on disk next unlinked pointer for this inode. */
 	xfs_iunlink_log(tp, ip, next_agino, NULLAGINO);
 
-	if (ip != list_first_entry(&agibp->b_pag->pag_ici_unlink_list,
+	if (ip != list_first_entry(&pag->pag_ici_unlink_list,
 					struct xfs_inode, i_unlink)) {
 		struct xfs_inode *pip = list_prev_entry(ip, i_unlink);
 
@@ -2073,6 +2155,7 @@ xfs_iunlink_remove_inode(
 		return 0;
 	}
 
+	ASSERT(agibp);
 	/* Point the head of the list to the next unlinked inode. */
 	return xfs_iunlink_update_bucket(tp, agno, agibp, agino, next_agino);
 }
@@ -2083,27 +2166,29 @@ xfs_iunlink_remove_inode(
 STATIC int
 xfs_iunlink_remove(
 	struct xfs_trans	*tp,
-	struct xfs_inode	*ip)
+	struct xfs_inode	*ip,
+	bool			force_agi)
 {
 	struct xfs_mount	*mp = tp->t_mountp;
-	struct xfs_buf		*agibp;
+	struct xfs_buf		*agibp = NULL;
 	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
+	struct xfs_perag	*pag;
 	int			error;
 
 	trace_xfs_iunlink_remove(ip);
 
-	/* Get the agi buffer first.  It ensures lock ordering on the list. */
-	error = xfs_read_agi(mp, tp, agno, &agibp);
-	if (error)
-		return error;
+	pag = xfs_iunlink_remove_lock(agno, tp, ip, &agibp, force_agi);
+	if (IS_ERR(pag))
+		return PTR_ERR(pag);
 
 	/*
 	 * Remove the inode from the on-disk list and then remove it from the
 	 * in-memory list. This order of operations ensures we can look up both
 	 * next and previous inode in the on-disk list via the in-memory list.
 	 */
-	error = xfs_iunlink_remove_inode(tp, agibp, ip);
+	error = xfs_iunlink_remove_inode(pag, tp, agibp, ip);
 	list_del(&ip->i_unlink);
+	xfs_iunlink_unlock(pag);
 	return error;
 }
 
@@ -2316,7 +2401,7 @@ xfs_ifree(
 	if (error)
 		return error;
 
-	error = xfs_iunlink_remove(tp, ip);
+	error = xfs_iunlink_remove(tp, ip, false);
 	if (error)
 		return error;
 
@@ -2895,7 +2980,7 @@ xfs_rename(
 	 */
 	if (wip) {
 		ASSERT(VFS_I(wip)->i_nlink == 0);
-		error = xfs_iunlink_remove(tp, wip);
+		error = xfs_iunlink_remove(tp, wip, true);
 		if (error)
 			goto out_trans_cancel;
 
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 7f8fbb7c8594..d0a221af71db 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -468,5 +468,6 @@ void xfs_end_io(struct work_struct *work);
 
 int xfs_ilock2_io_mmap(struct xfs_inode *ip1, struct xfs_inode *ip2);
 void xfs_iunlock2_io_mmap(struct xfs_inode *ip1, struct xfs_inode *ip2);
+void xfs_iunlink_unlock(struct xfs_perag *pag);
 
 #endif	/* __XFS_INODE_H__ */
diff --git a/fs/xfs/xfs_iunlink_item.c b/fs/xfs/xfs_iunlink_item.c
index 2ee05f98aa97..ae1e73e465b2 100644
--- a/fs/xfs/xfs_iunlink_item.c
+++ b/fs/xfs/xfs_iunlink_item.c
@@ -16,6 +16,7 @@
 #include "xfs_iunlink_item.h"
 #include "xfs_trace.h"
 #include "xfs_error.h"
+#include "xfs_sb.h"
 
 struct kmem_cache	*xfs_iunlink_zone;
 
@@ -28,6 +29,13 @@ static void
 xfs_iunlink_item_release(
 	struct xfs_log_item	*lip)
 {
+	struct xfs_mount	*mp = lip->li_mountp;
+	struct xfs_inode        *ip = IUL_ITEM(lip)->iu_ip;
+	struct xfs_perag	*pag;
+
+	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
+	ASSERT(mutex_is_locked(&pag->pag_iunlink_mutex));
+	xfs_iunlink_unlock(pag);
 	kmem_cache_free(xfs_iunlink_zone, IUL_ITEM(lip));
 }
 
@@ -150,7 +158,9 @@ xfs_iunlink_log(
 	xfs_agino_t		old_agino,
 	xfs_agino_t		next_agino)
 {
+	struct xfs_mount        *mp = tp->t_mountp;
 	struct xfs_iunlink_item	*iup;
+	struct xfs_perag	*pag;
 
 	iup = kmem_cache_zalloc(xfs_iunlink_zone, GFP_KERNEL | __GFP_NOFAIL);
 
@@ -164,5 +174,11 @@ xfs_iunlink_log(
 	xfs_trans_add_item(tp, &iup->iu_item);
 	tp->t_flags |= XFS_TRANS_DIRTY;
 	set_bit(XFS_LI_DIRTY, &iup->iu_item.li_flags);
+
+	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
+	ASSERT(mutex_is_locked(&pag->pag_iunlink_mutex));
+	ASSERT(pag->pag_iunlink_trans == tp);
+	++pag->pag_iunlink_refcnt;
+	xfs_perag_put(pag);
 }
 
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index f28c969af272..82d264a3350d 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -224,6 +224,10 @@ xfs_initialize_perag(
 		if (first_initialised == NULLAGNUMBER)
 			first_initialised = index;
 		spin_lock_init(&pag->pag_state_lock);
+
+		mutex_init(&pag->pag_iunlink_mutex);
+		pag->pag_iunlink_refcnt = 0;
+		pag->pag_iunlink_trans = NULL;
 	}
 
 	index = xfs_set_inode_alloc(mp, agcount);
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 98109801a995..fca4c1d28d8e 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -372,6 +372,15 @@ typedef struct xfs_perag {
 
 	/* reference count */
 	uint8_t			pagf_refcount_level;
+
+	/* lock to protect unlinked inode list and refcnt */
+	struct mutex            pag_iunlink_mutex;
+
+	/* recursive count of pag_iunlink_mutex */
+	unsigned int		pag_iunlink_refcnt;
+
+	/* (lockless) by which pag_iunlink_mutex is taken */
+	struct xfs_trans	*pag_iunlink_trans;
 } xfs_perag_t;
 
 static inline struct xfs_ag_resv *
-- 
2.18.1


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [RFC PATCH v4 3/3] xfs: insert unlinked inodes from tail
  2020-08-18 13:30   ` [RFC PATCH v4 0/3] xfs: more unlinked inode list optimization v4 Gao Xiang
  2020-08-18 13:30     ` [RFC PATCH v4 1/3] xfs: get rid of unused pagi_unlinked_hash Gao Xiang
  2020-08-18 13:30     ` [RFC PATCH v4 2/3] xfs: introduce perag iunlink lock Gao Xiang
@ 2020-08-18 13:30     ` Gao Xiang
  2020-08-19  0:53     ` [RFC PATCH v4 0/3] xfs: more unlinked inode list optimization v4 Darrick J. Wong
  2020-08-20  2:46     ` Darrick J. Wong
  4 siblings, 0 replies; 29+ messages in thread
From: Gao Xiang @ 2020-08-18 13:30 UTC (permalink / raw)
  To: linux-xfs; +Cc: Dave Chinner, Darrick J. Wong, Brian Foster, Gao Xiang

Currently, AGI buffer is always touched since xfs_iunlink()
adds unlinked inodes from head unconditionally, but since we
have the only one unlinked list now and if we insert unlinked
inodes from tail instead and there're more than 1 inode, AGI
buffer could be untouched.

With this change, it shows that only 938 of 10000 operations
modifies the head of unlinked list with the following workload:
 seq 1 10000 | xargs touch
 find . | xargs -n3 -P100 rm

Note that xfs_iunlink_insert_lock() is slightly different from
xfs_iunlink_remove_lock() due to whiteout path, refer to inlined
comments for more details.

Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
---
 fs/xfs/xfs_inode.c | 99 ++++++++++++++++++++++++++++++++++------------
 1 file changed, 74 insertions(+), 25 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index f32a1172b5cd..0add263d21a8 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1975,30 +1975,88 @@ xfs_iunlink_update_bucket(
 
 static int
 xfs_iunlink_insert_inode(
+	struct xfs_perag	*pag,
 	struct xfs_trans	*tp,
 	struct xfs_buf		*agibp,
 	struct xfs_inode	*ip)
 {
 	struct xfs_mount	*mp = tp->t_mountp;
-	struct xfs_inode	*nip;
-	xfs_agino_t		next_agino = NULLAGINO;
 	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
 	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
 
-	nip = list_first_entry_or_null(&agibp->b_pag->pag_ici_unlink_list,
-					struct xfs_inode, i_unlink);
-	if (nip) {
+	if (!list_empty(&pag->pag_ici_unlink_list)) {
+		struct xfs_inode	*pip;
 
 		/*
-		 * There is already another inode in the bucket, so point this
-		 * inode to the current head of the list.
+		 * There is already another inode in the bucket, so point
+		 * the last inode to this inode.
 		 */
-		next_agino = XFS_INO_TO_AGINO(mp, nip->i_ino);
-		xfs_iunlink_log(tp, ip, NULLAGINO, next_agino);
+		pip = list_last_entry(&pag->pag_ici_unlink_list,
+				struct xfs_inode, i_unlink);
+		xfs_iunlink_log(tp, pip, NULLAGINO, agino);
+		return 0;
 	}
 
+	ASSERT(agibp);
 	/* Point the head of the list to point to this inode. */
-	return xfs_iunlink_update_bucket(tp, agno, agibp, next_agino, agino);
+	return xfs_iunlink_update_bucket(tp, agno, agibp, NULLAGINO, agino);
+}
+
+/*
+ * Lock the perag and take AGI lock if agi_unlinked is touched as well
+ * for xfs_iunlink_insert_inode(). As for the details of locking order,
+ * refer to the comments of xfs_iunlink_remove_lock().
+ */
+static struct xfs_perag *
+xfs_iunlink_insert_lock(
+	xfs_agino_t		agno,
+	struct xfs_trans        *tp,
+	struct xfs_inode	*ip,
+	struct xfs_buf		**agibpp)
+{
+	struct xfs_mount        *mp = tp->t_mountp;
+	bool			locked = true;
+	struct xfs_perag	*pag;
+	int			error;
+
+	pag = xfs_perag_get(mp, agno);
+	/* paired with smp_store_release() in xfs_iunlink_unlock() */
+	if (smp_load_acquire(&pag->pag_iunlink_trans) == tp) {
+		/*
+		 * if pag_iunlink_trans is the current trans, we're
+		 * in the current process context, so it's safe here.
+		 */
+		ASSERT(mutex_is_locked(&pag->pag_iunlink_mutex));
+		/*
+		 * slightly different from xfs_iunlink_remove_lock(),
+		 * the path of xfs_iunlink_remove() and then xfs_iunlink()
+		 * on the same AG needs to be considered (e.g. whiteout
+		 * rename), so lock AGI first in xfs_iunlink_remove(),
+		 * and recursively get AGI safely below.
+		 */
+		if (!list_empty(&pag->pag_ici_unlink_list))
+			goto out;
+	} else {
+		mutex_lock(&pag->pag_iunlink_mutex);
+		if (!list_empty(&pag->pag_ici_unlink_list))
+			goto out;
+		mutex_unlock(&pag->pag_iunlink_mutex);
+		locked = false;
+	}
+
+	/* and see comments in xfs_iunlink_remove_lock() on locking order */
+	error = xfs_read_agi(mp, tp, agno, agibpp);
+	if (error) {
+		xfs_perag_put(pag);
+		return ERR_PTR(error);
+	}
+
+	if (!locked)
+		mutex_lock(&pag->pag_iunlink_mutex);
+out:
+	WRITE_ONCE(pag->pag_iunlink_trans, tp);
+	++pag->pag_iunlink_refcnt;
+	return pag;
 }
 
 void
@@ -2026,7 +2084,7 @@ xfs_iunlink(
 	struct xfs_inode	*ip)
 {
 	struct xfs_mount	*mp = tp->t_mountp;
-	struct xfs_buf		*agibp;
+	struct xfs_buf		*agibp = NULL;
 	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
 	int			error;
 	struct xfs_perag	*pag;
@@ -2035,18 +2093,9 @@ xfs_iunlink(
 	ASSERT(VFS_I(ip)->i_mode != 0);
 	trace_xfs_iunlink(ip);
 
-	/* Get the agi buffer first.  It ensures lock ordering on the list. */
-	error = xfs_read_agi(mp, tp, agno, &agibp);
-	if (error)
-		return error;
-
-	/* XXX: will be shortly removed instead in the next commit. */
-	pag = xfs_perag_get(mp, agno);
-	/* paired with smp_store_release() in xfs_iunlink_unlock() */
-	if (smp_load_acquire(&pag->pag_iunlink_trans) != tp)
-		mutex_lock(&pag->pag_iunlink_mutex);
-	WRITE_ONCE(pag->pag_iunlink_trans, tp);
-	++pag->pag_iunlink_refcnt;
+	pag = xfs_iunlink_insert_lock(agno, tp, ip, &agibp);
+	if (IS_ERR(pag))
+		return PTR_ERR(pag);
 
 	/*
 	 * Insert the inode into the on disk unlinked list, and if that
@@ -2054,9 +2103,9 @@ xfs_iunlink(
 	 * order so that the modifications required to the on disk list are not
 	 * impacted by already having this inode in the list.
 	 */
-	error = xfs_iunlink_insert_inode(tp, agibp, ip);
+	error = xfs_iunlink_insert_inode(pag, tp, agibp, ip);
 	if (!error)
-		list_add(&ip->i_unlink, &agibp->b_pag->pag_ici_unlink_list);
+		list_add_tail(&ip->i_unlink, &pag->pag_ici_unlink_list);
 
 	xfs_iunlink_unlock(pag);
 	return error;
-- 
2.18.1


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v4 0/3] xfs: more unlinked inode list optimization v4
  2020-08-18 13:30   ` [RFC PATCH v4 0/3] xfs: more unlinked inode list optimization v4 Gao Xiang
                       ` (2 preceding siblings ...)
  2020-08-18 13:30     ` [RFC PATCH v4 3/3] xfs: insert unlinked inodes from tail Gao Xiang
@ 2020-08-19  0:53     ` Darrick J. Wong
  2020-08-19  1:14       ` Gao Xiang
  2020-08-20  2:46     ` Darrick J. Wong
  4 siblings, 1 reply; 29+ messages in thread
From: Darrick J. Wong @ 2020-08-19  0:53 UTC (permalink / raw)
  To: Gao Xiang; +Cc: linux-xfs, Dave Chinner, Brian Foster

On Tue, Aug 18, 2020 at 09:30:12PM +0800, Gao Xiang wrote:
> Hi forks,
> 
> This is RFC v4 version which is based on Dave's latest patchset:
>  https://lore.kernel.org/r/20200812092556.2567285-1-david@fromorbit.com

As we already discussed on IRC, please send new revisions of patchsets
as a separate thread from the old submission.

> I didn't send out v3 because it was based on Dave's previous RFC
> patchset, but I'm still not quite sure to drop RFC tag since this
> version is different from the previous versions...

Hm, this cover letter could use some tidying up, since it took me a bit
of digging to figure out that yes, this is the successor of the old
series that tried to get the AGI buffer lock out of the way if we're
adding a newly unlinked inode to the end of the unlinked list.

> Changes since v2:
>  - rebase on new patchset, and omit the original first patch
>    "xfs: arrange all unlinked inodes into one list" since it now
>    has better form in the base patchset;
> 
>  - a tail xfs_inode pointer is no longer needed since the original
>    patchset introduced list_head iunlink infrastructure and it can
>    be used to get the tail inode;
> 
>  - take pag_iunlink_mutex lock until all iunlink log items are
>    committed. Otherwise, xfs_iunlink_log() order would not be equal
>    to the trans commit order so it can mis-reorder and cause metadata
>    corruption I mentioned in v2.
> 
>    In order to archive that, some recursive count is introduced since
>    there could be several iunlink operations in one transaction,
>    and introduce some per-AG fields as well since these operations
>    in the transaction may not operate inodes in the same AG. we may
>    also need to take AGI buffer lock in advance (e.g. whiteout rename
>    path) due to iunlink operations and locking order constraint.
>    For more details, see related inlined comments as well...
> 
>  - "xfs: get rid of unused pagi_unlinked_hash" would be better folded
>    into original patchset since pagi_unlinked_hash is no longer needed.
> 
> ============
> 
> [Original text]
> 
> This RFC patchset mainly addresses the thoughts [*] and [**] from Dave's
> original patchset,
> https://lore.kernel.org/r/20200623095015.1934171-1-david@fromorbit.com
> 
> In short, it focues on the following ideas mentioned by Dave:
>  - use bucket 0 instead of multiple buckets since in-memory double
>    linked list finally works;
> 
>  - avoid taking AGI buffer and unnecessary AGI update if possible, so
>    1) add a new lock and keep proper locking order to avoid deadlock;
>    2) insert a new unlinked inode from the tail instead of head;
> 
> In addition, it's worth noticing 3 things:
>  - xfs_iunlink_remove() should support old multiple buckets in order
>    to keep old inode unlinked list (old image) working when recovering.
> 
>  - (but) OTOH, the old kernel recovery _shouldn't_ work with new image
>    since the bucket_index from old xfs_iunlink_remove() is generated
>    by the old formula (rather than keep in xfs_inode), which is now
>    fixed as 0. So this feature is not forward compatible without some
>    extra backport patches;

Oh?  These seem like serious limitations, are they still true?

--D

>  - a tail xfs_inode pointer is also added in the perag, which keeps 
>    track of the tail of bucket 0 since it's mainly used for xfs_iunlink().
> 
> 
> The git tree is also available at
> git://git.kernel.org/pub/scm/linux/kernel/git/xiang/linux.git tags/xfs/iunlink_opt_v4
> 
> Gitweb:
> https://git.kernel.org/pub/scm/linux/kernel/git/xiang/linux.git/log/?h=xfs/iunlink_opt_v4
> 
> 
> Some preliminary tests are done (including fstests, but there seems
> some pre-exist failures and I haven't looked into yet). And I confirmed
> there was no previous metadata corruption mentioned in RFC v2 anymore.
> 
> To confirm that I'm in the right direction, I post the latest version
> now since it haven't been updated for a while.
> 
> Comments and directions are welcomed. :)
> 
> Thanks,
> Gao Xiang
> 
> Gao Xiang (3):
>   xfs: get rid of unused pagi_unlinked_hash
>   xfs: introduce perag iunlink lock
>   xfs: insert unlinked inodes from tail
> 
>  fs/xfs/xfs_inode.c        | 194 ++++++++++++++++++++++++++++++++------
>  fs/xfs/xfs_inode.h        |   1 +
>  fs/xfs/xfs_iunlink_item.c |  16 ++++
>  fs/xfs/xfs_mount.c        |   4 +
>  fs/xfs/xfs_mount.h        |  14 +--
>  5 files changed, 193 insertions(+), 36 deletions(-)
> 
> -- 
> 2.18.1
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v4 1/3] xfs: get rid of unused pagi_unlinked_hash
  2020-08-18 13:30     ` [RFC PATCH v4 1/3] xfs: get rid of unused pagi_unlinked_hash Gao Xiang
@ 2020-08-19  0:54       ` Darrick J. Wong
  2020-08-21  1:09         ` Dave Chinner
  0 siblings, 1 reply; 29+ messages in thread
From: Darrick J. Wong @ 2020-08-19  0:54 UTC (permalink / raw)
  To: Gao Xiang; +Cc: linux-xfs, Dave Chinner, Brian Foster

On Tue, Aug 18, 2020 at 09:30:13PM +0800, Gao Xiang wrote:
> pagi_unlinked_hash is unused since no backref infrastructure now.
> (it's better to fold it into original patchset.)
> 
> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>

Yes, this should be folded into the other patch that gets rid of the
rest of the rhash code.  Dave?

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/xfs_mount.h | 7 -------
>  1 file changed, 7 deletions(-)
> 
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index c35a6c463529..98109801a995 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -372,13 +372,6 @@ typedef struct xfs_perag {
>  
>  	/* reference count */
>  	uint8_t			pagf_refcount_level;
> -
> -	/*
> -	 * Unlinked inode information.  This incore information reflects
> -	 * data stored in the AGI, so callers must hold the AGI buffer lock
> -	 * or have some other means to control concurrency.
> -	 */
> -	struct rhashtable	pagi_unlinked_hash;
>  } xfs_perag_t;
>  
>  static inline struct xfs_ag_resv *
> -- 
> 2.18.1
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v4 2/3] xfs: introduce perag iunlink lock
  2020-08-18 13:30     ` [RFC PATCH v4 2/3] xfs: introduce perag iunlink lock Gao Xiang
@ 2020-08-19  1:06       ` Darrick J. Wong
  2020-08-19  1:23         ` Gao Xiang
  0 siblings, 1 reply; 29+ messages in thread
From: Darrick J. Wong @ 2020-08-19  1:06 UTC (permalink / raw)
  To: Gao Xiang; +Cc: linux-xfs, Dave Chinner, Brian Foster

On Tue, Aug 18, 2020 at 09:30:14PM +0800, Gao Xiang wrote:
> Currently, we use AGI buffer lock to protect in-memory linked
> list for unlinked inodes but since it's not necessary to modify
> AGI unless the head of the unlinked list is modified.
> 
> Therefore, let's add another per-AG dedicated lock to protect
> the whole list, and get rid of taking AGI buffer lock in
> xfs_iunlink_remove() if the head is untouched.
> 
> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
> ---
>  fs/xfs/xfs_inode.c        | 111 +++++++++++++++++++++++++++++++++-----
>  fs/xfs/xfs_inode.h        |   1 +
>  fs/xfs/xfs_iunlink_item.c |  16 ++++++
>  fs/xfs/xfs_mount.c        |   4 ++
>  fs/xfs/xfs_mount.h        |   9 ++++
>  5 files changed, 128 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 7ee778bcde06..f32a1172b5cd 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -47,7 +47,7 @@ kmem_zone_t *xfs_inode_zone;
>  #define	XFS_ITRUNC_MAX_EXTENTS	2
>  
>  STATIC int xfs_iunlink(struct xfs_trans *, struct xfs_inode *);
> -STATIC int xfs_iunlink_remove(struct xfs_trans *, struct xfs_inode *);
> +STATIC int xfs_iunlink_remove(struct xfs_trans *, struct xfs_inode *, bool);
>  
>  /*
>   * helper function to extract extent size hint from inode
> @@ -1398,7 +1398,7 @@ xfs_link(
>  	 * Handle initial link state of O_TMPFILE inode
>  	 */
>  	if (VFS_I(sip)->i_nlink == 0) {
> -		error = xfs_iunlink_remove(tp, sip);
> +		error = xfs_iunlink_remove(tp, sip, false);
>  		if (error)
>  			goto error_return;
>  	}
> @@ -2001,6 +2001,18 @@ xfs_iunlink_insert_inode(
>  	return xfs_iunlink_update_bucket(tp, agno, agibp, next_agino, agino);
>  }
>  
> +void
> +xfs_iunlink_unlock(
> +	struct xfs_perag	*pag)
> +{
> +	/* Does not unlock AGI, ever. xfs_trans_commit() does that. */
> +	if (!--pag->pag_iunlink_refcnt) {
> +		smp_store_release(&pag->pag_iunlink_trans, NULL);
> +		mutex_unlock(&pag->pag_iunlink_mutex);
> +	}
> +	xfs_perag_put(pag);

An unlink function drops the perag refcount??

> +}
> +
>  /*
>   * This is called when the inode's link count has gone to 0 or we are creating
>   * a tmpfile via O_TMPFILE.  The inode @ip must have nlink == 0.
> @@ -2017,6 +2029,7 @@ xfs_iunlink(
>  	struct xfs_buf		*agibp;
>  	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
>  	int			error;
> +	struct xfs_perag	*pag;
>  
>  	ASSERT(VFS_I(ip)->i_nlink == 0);
>  	ASSERT(VFS_I(ip)->i_mode != 0);
> @@ -2027,6 +2040,14 @@ xfs_iunlink(
>  	if (error)
>  		return error;
>  
> +	/* XXX: will be shortly removed instead in the next commit. */
> +	pag = xfs_perag_get(mp, agno);
> +	/* paired with smp_store_release() in xfs_iunlink_unlock() */
> +	if (smp_load_acquire(&pag->pag_iunlink_trans) != tp)
> +		mutex_lock(&pag->pag_iunlink_mutex);
> +	WRITE_ONCE(pag->pag_iunlink_trans, tp);
> +	++pag->pag_iunlink_refcnt;

Errrgh, I detest reading lockless code.  Why not (a) introduce
xfs_iunlink_insert_lock here, and (b) introduce the per-AG locks with
standard locking devices (mutex?) so that you can (c) add a patch at the
end to change the mutexes into whatever lockless strategy you really
want to use?

It would be very helpful to keep (b) and (c) separate in case we have to
bisect through here; that way, we can use this patch as a testing point
to make sure that transition from always using the AGI lock to using the
perag lock was done correctly; and then later we can separately test
that the lockless conversion was done correctly.

Hm, I guess on second reading, you're using the smp_load_acquire here to
figure out if you even want to take the mutex?

<confused>

> +
>  	/*
>  	 * Insert the inode into the on disk unlinked list, and if that
>  	 * succeeds, then insert it into the in memory list. We do it in this
> @@ -2037,11 +2058,72 @@ xfs_iunlink(
>  	if (!error)
>  		list_add(&ip->i_unlink, &agibp->b_pag->pag_ici_unlink_list);
>  
> +	xfs_iunlink_unlock(pag);
>  	return error;
>  }
>  
> +/*
> + * Lock the perag and take AGI lock if agi_unlinked is touched as well for
> + * xfs_iunlink_remove_inode().
> + *
> + * Inode allocation in the O_TMPFILE path defines the AGI/unlinked list lock
> + * order as being AGI->perag unlinked list lock. We are inverting it here as
> + * the fast path tail addition does not need to modify the AGI at all. Hence
> + * AGI lock is only needed if the head is modified, correct locking order.

Er, I don't get it, what is the expected locking state of
pag_iunlink_mutex before and after this function?

--D

> + */
> +static struct xfs_perag *
> +xfs_iunlink_remove_lock(
> +	xfs_agino_t		agno,
> +	struct xfs_trans        *tp,
> +	struct xfs_inode	*ip,
> +	struct xfs_buf		**agibpp,
> +	bool			force_agi)
> +{
> +	struct xfs_mount        *mp = tp->t_mountp;
> +	struct xfs_perag	*pag;
> +	int			error;
> +
> +	pag = xfs_perag_get(mp, agno);
> +	/* paired with smp_store_release() in xfs_iunlink_unlock() */
> +	if (smp_load_acquire(&pag->pag_iunlink_trans) == tp) {
> +		/*
> +		 * if pag_iunlink_trans is the current trans, we're
> +		 * in the current process context, so it's safe here.
> +		 */
> +		ASSERT(mutex_is_locked(&pag->pag_iunlink_mutex));
> +		goto out;
> +	}
> +
> +	if (!force_agi) {
> +		mutex_lock(&pag->pag_iunlink_mutex);
> +		if (ip != list_first_entry(&pag->pag_ici_unlink_list,
> +				struct xfs_inode, i_unlink))
> +			goto out;
> +		mutex_unlock(&pag->pag_iunlink_mutex);
> +	}
> +
> +	/*
> +	 * some paths (e.g. xfs_create_tmpfile) could take AGI lock in
> +	 * this transaction in advance and we should keep the proper
> +	 * locking order: AGI buf lock and then pag_iunlink_mutex.
> +	 */
> +	error = xfs_read_agi(mp, tp, agno, agibpp);
> +	if (error) {
> +		xfs_perag_put(pag);
> +		return ERR_PTR(error);
> +	}
> +
> +	mutex_lock(&pag->pag_iunlink_mutex);
> +	ASSERT(!pag->pag_iunlink_refcnt);
> +out:
> +	WRITE_ONCE(pag->pag_iunlink_trans, tp);
> +	++pag->pag_iunlink_refcnt;
> +	return pag;
> +}
> +
>  static int
>  xfs_iunlink_remove_inode(
> +	struct xfs_perag	*pag,
>  	struct xfs_trans	*tp,
>  	struct xfs_buf		*agibp,
>  	struct xfs_inode	*ip)
> @@ -2055,7 +2137,7 @@ xfs_iunlink_remove_inode(
>  	 * Get the next agino in the list. If we are at the end of the list,
>  	 * then the previous inode's i_next_unlinked filed will get cleared.
>  	 */
> -	if (ip != list_last_entry(&agibp->b_pag->pag_ici_unlink_list,
> +	if (ip != list_last_entry(&pag->pag_ici_unlink_list,
>  					struct xfs_inode, i_unlink)) {
>  		struct xfs_inode *nip = list_next_entry(ip, i_unlink);
>  
> @@ -2065,7 +2147,7 @@ xfs_iunlink_remove_inode(
>  	/* Clear the on disk next unlinked pointer for this inode. */
>  	xfs_iunlink_log(tp, ip, next_agino, NULLAGINO);
>  
> -	if (ip != list_first_entry(&agibp->b_pag->pag_ici_unlink_list,
> +	if (ip != list_first_entry(&pag->pag_ici_unlink_list,
>  					struct xfs_inode, i_unlink)) {
>  		struct xfs_inode *pip = list_prev_entry(ip, i_unlink);
>  
> @@ -2073,6 +2155,7 @@ xfs_iunlink_remove_inode(
>  		return 0;
>  	}
>  
> +	ASSERT(agibp);
>  	/* Point the head of the list to the next unlinked inode. */
>  	return xfs_iunlink_update_bucket(tp, agno, agibp, agino, next_agino);
>  }
> @@ -2083,27 +2166,29 @@ xfs_iunlink_remove_inode(
>  STATIC int
>  xfs_iunlink_remove(
>  	struct xfs_trans	*tp,
> -	struct xfs_inode	*ip)
> +	struct xfs_inode	*ip,
> +	bool			force_agi)
>  {
>  	struct xfs_mount	*mp = tp->t_mountp;
> -	struct xfs_buf		*agibp;
> +	struct xfs_buf		*agibp = NULL;
>  	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
> +	struct xfs_perag	*pag;
>  	int			error;
>  
>  	trace_xfs_iunlink_remove(ip);
>  
> -	/* Get the agi buffer first.  It ensures lock ordering on the list. */
> -	error = xfs_read_agi(mp, tp, agno, &agibp);
> -	if (error)
> -		return error;
> +	pag = xfs_iunlink_remove_lock(agno, tp, ip, &agibp, force_agi);
> +	if (IS_ERR(pag))
> +		return PTR_ERR(pag);
>  
>  	/*
>  	 * Remove the inode from the on-disk list and then remove it from the
>  	 * in-memory list. This order of operations ensures we can look up both
>  	 * next and previous inode in the on-disk list via the in-memory list.
>  	 */
> -	error = xfs_iunlink_remove_inode(tp, agibp, ip);
> +	error = xfs_iunlink_remove_inode(pag, tp, agibp, ip);
>  	list_del(&ip->i_unlink);
> +	xfs_iunlink_unlock(pag);
>  	return error;
>  }
>  
> @@ -2316,7 +2401,7 @@ xfs_ifree(
>  	if (error)
>  		return error;
>  
> -	error = xfs_iunlink_remove(tp, ip);
> +	error = xfs_iunlink_remove(tp, ip, false);
>  	if (error)
>  		return error;
>  
> @@ -2895,7 +2980,7 @@ xfs_rename(
>  	 */
>  	if (wip) {
>  		ASSERT(VFS_I(wip)->i_nlink == 0);
> -		error = xfs_iunlink_remove(tp, wip);
> +		error = xfs_iunlink_remove(tp, wip, true);
>  		if (error)
>  			goto out_trans_cancel;
>  
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index 7f8fbb7c8594..d0a221af71db 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -468,5 +468,6 @@ void xfs_end_io(struct work_struct *work);
>  
>  int xfs_ilock2_io_mmap(struct xfs_inode *ip1, struct xfs_inode *ip2);
>  void xfs_iunlock2_io_mmap(struct xfs_inode *ip1, struct xfs_inode *ip2);
> +void xfs_iunlink_unlock(struct xfs_perag *pag);
>  
>  #endif	/* __XFS_INODE_H__ */
> diff --git a/fs/xfs/xfs_iunlink_item.c b/fs/xfs/xfs_iunlink_item.c
> index 2ee05f98aa97..ae1e73e465b2 100644
> --- a/fs/xfs/xfs_iunlink_item.c
> +++ b/fs/xfs/xfs_iunlink_item.c
> @@ -16,6 +16,7 @@
>  #include "xfs_iunlink_item.h"
>  #include "xfs_trace.h"
>  #include "xfs_error.h"
> +#include "xfs_sb.h"
>  
>  struct kmem_cache	*xfs_iunlink_zone;
>  
> @@ -28,6 +29,13 @@ static void
>  xfs_iunlink_item_release(
>  	struct xfs_log_item	*lip)
>  {
> +	struct xfs_mount	*mp = lip->li_mountp;
> +	struct xfs_inode        *ip = IUL_ITEM(lip)->iu_ip;
> +	struct xfs_perag	*pag;
> +
> +	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
> +	ASSERT(mutex_is_locked(&pag->pag_iunlink_mutex));
> +	xfs_iunlink_unlock(pag);
>  	kmem_cache_free(xfs_iunlink_zone, IUL_ITEM(lip));
>  }
>  
> @@ -150,7 +158,9 @@ xfs_iunlink_log(
>  	xfs_agino_t		old_agino,
>  	xfs_agino_t		next_agino)
>  {
> +	struct xfs_mount        *mp = tp->t_mountp;
>  	struct xfs_iunlink_item	*iup;
> +	struct xfs_perag	*pag;
>  
>  	iup = kmem_cache_zalloc(xfs_iunlink_zone, GFP_KERNEL | __GFP_NOFAIL);
>  
> @@ -164,5 +174,11 @@ xfs_iunlink_log(
>  	xfs_trans_add_item(tp, &iup->iu_item);
>  	tp->t_flags |= XFS_TRANS_DIRTY;
>  	set_bit(XFS_LI_DIRTY, &iup->iu_item.li_flags);
> +
> +	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
> +	ASSERT(mutex_is_locked(&pag->pag_iunlink_mutex));
> +	ASSERT(pag->pag_iunlink_trans == tp);
> +	++pag->pag_iunlink_refcnt;
> +	xfs_perag_put(pag);
>  }
>  
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index f28c969af272..82d264a3350d 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -224,6 +224,10 @@ xfs_initialize_perag(
>  		if (first_initialised == NULLAGNUMBER)
>  			first_initialised = index;
>  		spin_lock_init(&pag->pag_state_lock);
> +
> +		mutex_init(&pag->pag_iunlink_mutex);
> +		pag->pag_iunlink_refcnt = 0;
> +		pag->pag_iunlink_trans = NULL;
>  	}
>  
>  	index = xfs_set_inode_alloc(mp, agcount);
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 98109801a995..fca4c1d28d8e 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -372,6 +372,15 @@ typedef struct xfs_perag {
>  
>  	/* reference count */
>  	uint8_t			pagf_refcount_level;
> +
> +	/* lock to protect unlinked inode list and refcnt */
> +	struct mutex            pag_iunlink_mutex;
> +
> +	/* recursive count of pag_iunlink_mutex */
> +	unsigned int		pag_iunlink_refcnt;
> +
> +	/* (lockless) by which pag_iunlink_mutex is taken */
> +	struct xfs_trans	*pag_iunlink_trans;
>  } xfs_perag_t;
>  
>  static inline struct xfs_ag_resv *
> -- 
> 2.18.1
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v4 0/3] xfs: more unlinked inode list optimization v4
  2020-08-19  0:53     ` [RFC PATCH v4 0/3] xfs: more unlinked inode list optimization v4 Darrick J. Wong
@ 2020-08-19  1:14       ` Gao Xiang
  0 siblings, 0 replies; 29+ messages in thread
From: Gao Xiang @ 2020-08-19  1:14 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, Dave Chinner, Brian Foster

Hi Darrick,

On Tue, Aug 18, 2020 at 05:53:34PM -0700, Darrick J. Wong wrote:
> On Tue, Aug 18, 2020 at 09:30:12PM +0800, Gao Xiang wrote:
> > Hi forks,
> > 
> > This is RFC v4 version which is based on Dave's latest patchset:
> >  https://lore.kernel.org/r/20200812092556.2567285-1-david@fromorbit.com
> 
> As we already discussed on IRC, please send new revisions of patchsets
> as a separate thread from the old submission.

Okay, will definitely do later.

> 
> > I didn't send out v3 because it was based on Dave's previous RFC
> > patchset, but I'm still not quite sure to drop RFC tag since this
> > version is different from the previous versions...
> 
> Hm, this cover letter could use some tidying up, since it took me a bit
> of digging to figure out that yes, this is the successor of the old
> series that tried to get the AGI buffer lock out of the way if we're
> adding a newly unlinked inode to the end of the unlinked list.

I'm trying to sort out in the next revision if something shows unclear
in the code.

I talked with Dave person-to-person weeks ago about these constraints
on IRC, but I'm not sure I can write out some fluent formal words...

I think there are 2 independent things:
 1) avoiding taking AGI buffer lock if AGI buffer is untouched;
 2) adding a newly unlinked inode to the end of the unlinked list.

So, 2) can be achieved as well without 1) since AGI buffer lock is a
powerful lock and can be recursively taken, but if we'd like to add a
new per-AG iunlink lock, there are some new constraints (locking order
and deadlock concerns) than the current approach.

In summary, due to many exist paths (e.g. tmpfile path), we need to take
lock in the following order:
  AGI buffer lock -> per-AG iunlink lock.

Otherwise it could cause dead lock. And we cannot release per-AG iunlink
lock before all iunlink operations in this trans are commited, or it could
cause iunlink fs corruption...

> 
> > Changes since v2:
> >  - rebase on new patchset, and omit the original first patch
> >    "xfs: arrange all unlinked inodes into one list" since it now
> >    has better form in the base patchset;
> > 
> >  - a tail xfs_inode pointer is no longer needed since the original
> >    patchset introduced list_head iunlink infrastructure and it can
> >    be used to get the tail inode;
> > 
> >  - take pag_iunlink_mutex lock until all iunlink log items are
> >    committed. Otherwise, xfs_iunlink_log() order would not be equal
> >    to the trans commit order so it can mis-reorder and cause metadata
> >    corruption I mentioned in v2.
> > 
> >    In order to archive that, some recursive count is introduced since
> >    there could be several iunlink operations in one transaction,
> >    and introduce some per-AG fields as well since these operations
> >    in the transaction may not operate inodes in the same AG. we may
> >    also need to take AGI buffer lock in advance (e.g. whiteout rename
> >    path) due to iunlink operations and locking order constraint.
> >    For more details, see related inlined comments as well...
> > 
> >  - "xfs: get rid of unused pagi_unlinked_hash" would be better folded
> >    into original patchset since pagi_unlinked_hash is no longer needed.
> > 
> > ============
> > 
> > [Original text]
> > 
> > This RFC patchset mainly addresses the thoughts [*] and [**] from Dave's
> > original patchset,
> > https://lore.kernel.org/r/20200623095015.1934171-1-david@fromorbit.com
> > 
> > In short, it focues on the following ideas mentioned by Dave:
> >  - use bucket 0 instead of multiple buckets since in-memory double
> >    linked list finally works;
> > 
> >  - avoid taking AGI buffer and unnecessary AGI update if possible, so
> >    1) add a new lock and keep proper locking order to avoid deadlock;
> >    2) insert a new unlinked inode from the tail instead of head;
> > 
> > In addition, it's worth noticing 3 things:
> >  - xfs_iunlink_remove() should support old multiple buckets in order
> >    to keep old inode unlinked list (old image) working when recovering.
> > 
> >  - (but) OTOH, the old kernel recovery _shouldn't_ work with new image
> >    since the bucket_index from old xfs_iunlink_remove() is generated
> >    by the old formula (rather than keep in xfs_inode), which is now
> >    fixed as 0. So this feature is not forward compatible without some
> >    extra backport patches;
> 
> Oh?  These seem like serious limitations, are they still true?

Yeah, I think that's still true (I tested on my VM before).

Thanks,
Gao Xiang

> 
> --D
> 
> >  - a tail xfs_inode pointer is also added in the perag, which keeps 
> >    track of the tail of bucket 0 since it's mainly used for xfs_iunlink().
> > 
> > 
> > The git tree is also available at
> > git://git.kernel.org/pub/scm/linux/kernel/git/xiang/linux.git tags/xfs/iunlink_opt_v4
> > 
> > Gitweb:
> > https://git.kernel.org/pub/scm/linux/kernel/git/xiang/linux.git/log/?h=xfs/iunlink_opt_v4
> > 
> > 
> > Some preliminary tests are done (including fstests, but there seems
> > some pre-exist failures and I haven't looked into yet). And I confirmed
> > there was no previous metadata corruption mentioned in RFC v2 anymore.
> > 
> > To confirm that I'm in the right direction, I post the latest version
> > now since it haven't been updated for a while.
> > 
> > Comments and directions are welcomed. :)
> > 
> > Thanks,
> > Gao Xiang
> > 
> > Gao Xiang (3):
> >   xfs: get rid of unused pagi_unlinked_hash
> >   xfs: introduce perag iunlink lock
> >   xfs: insert unlinked inodes from tail
> > 
> >  fs/xfs/xfs_inode.c        | 194 ++++++++++++++++++++++++++++++++------
> >  fs/xfs/xfs_inode.h        |   1 +
> >  fs/xfs/xfs_iunlink_item.c |  16 ++++
> >  fs/xfs/xfs_mount.c        |   4 +
> >  fs/xfs/xfs_mount.h        |  14 +--
> >  5 files changed, 193 insertions(+), 36 deletions(-)
> > 
> > -- 
> > 2.18.1
> > 
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v4 2/3] xfs: introduce perag iunlink lock
  2020-08-19  1:06       ` Darrick J. Wong
@ 2020-08-19  1:23         ` Gao Xiang
  0 siblings, 0 replies; 29+ messages in thread
From: Gao Xiang @ 2020-08-19  1:23 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, Dave Chinner, Brian Foster

On Tue, Aug 18, 2020 at 06:06:15PM -0700, Darrick J. Wong wrote:
> On Tue, Aug 18, 2020 at 09:30:14PM +0800, Gao Xiang wrote:
> > Currently, we use AGI buffer lock to protect in-memory linked
> > list for unlinked inodes but since it's not necessary to modify
> > AGI unless the head of the unlinked list is modified.
> > 
> > Therefore, let's add another per-AG dedicated lock to protect
> > the whole list, and get rid of taking AGI buffer lock in
> > xfs_iunlink_remove() if the head is untouched.
> > 
> > Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
> > ---
> >  fs/xfs/xfs_inode.c        | 111 +++++++++++++++++++++++++++++++++-----
> >  fs/xfs/xfs_inode.h        |   1 +
> >  fs/xfs/xfs_iunlink_item.c |  16 ++++++
> >  fs/xfs/xfs_mount.c        |   4 ++
> >  fs/xfs/xfs_mount.h        |   9 ++++
> >  5 files changed, 128 insertions(+), 13 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index 7ee778bcde06..f32a1172b5cd 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -47,7 +47,7 @@ kmem_zone_t *xfs_inode_zone;
> >  #define	XFS_ITRUNC_MAX_EXTENTS	2
> >  
> >  STATIC int xfs_iunlink(struct xfs_trans *, struct xfs_inode *);
> > -STATIC int xfs_iunlink_remove(struct xfs_trans *, struct xfs_inode *);
> > +STATIC int xfs_iunlink_remove(struct xfs_trans *, struct xfs_inode *, bool);
> >  
> >  /*
> >   * helper function to extract extent size hint from inode
> > @@ -1398,7 +1398,7 @@ xfs_link(
> >  	 * Handle initial link state of O_TMPFILE inode
> >  	 */
> >  	if (VFS_I(sip)->i_nlink == 0) {
> > -		error = xfs_iunlink_remove(tp, sip);
> > +		error = xfs_iunlink_remove(tp, sip, false);
> >  		if (error)
> >  			goto error_return;
> >  	}
> > @@ -2001,6 +2001,18 @@ xfs_iunlink_insert_inode(
> >  	return xfs_iunlink_update_bucket(tp, agno, agibp, next_agino, agino);
> >  }
> >  
> > +void
> > +xfs_iunlink_unlock(
> > +	struct xfs_perag	*pag)
> > +{
> > +	/* Does not unlock AGI, ever. xfs_trans_commit() does that. */
> > +	if (!--pag->pag_iunlink_refcnt) {
> > +		smp_store_release(&pag->pag_iunlink_trans, NULL);
> > +		mutex_unlock(&pag->pag_iunlink_mutex);
> > +	}
> > +	xfs_perag_put(pag);
> 
> An unlink function drops the perag refcount??

Er... It may need a better naming though. Maybe some
xfs_iunlink_release() ?

> 
> > +}
> > +
> >  /*
> >   * This is called when the inode's link count has gone to 0 or we are creating
> >   * a tmpfile via O_TMPFILE.  The inode @ip must have nlink == 0.
> > @@ -2017,6 +2029,7 @@ xfs_iunlink(
> >  	struct xfs_buf		*agibp;
> >  	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
> >  	int			error;
> > +	struct xfs_perag	*pag;
> >  
> >  	ASSERT(VFS_I(ip)->i_nlink == 0);
> >  	ASSERT(VFS_I(ip)->i_mode != 0);
> > @@ -2027,6 +2040,14 @@ xfs_iunlink(
> >  	if (error)
> >  		return error;
> >  
> > +	/* XXX: will be shortly removed instead in the next commit. */
> > +	pag = xfs_perag_get(mp, agno);
> > +	/* paired with smp_store_release() in xfs_iunlink_unlock() */
> > +	if (smp_load_acquire(&pag->pag_iunlink_trans) != tp)
> > +		mutex_lock(&pag->pag_iunlink_mutex);
> > +	WRITE_ONCE(pag->pag_iunlink_trans, tp);
> > +	++pag->pag_iunlink_refcnt;
> 
> Errrgh, I detest reading lockless code.  Why not (a) introduce
> xfs_iunlink_insert_lock here, and (b) introduce the per-AG locks with
> standard locking devices (mutex?) so that you can (c) add a patch at the
> end to change the mutexes into whatever lockless strategy you really
> want to use?
> 
> It would be very helpful to keep (b) and (c) separate in case we have to
> bisect through here; that way, we can use this patch as a testing point
> to make sure that transition from always using the AGI lock to using the
> perag lock was done correctly; and then later we can separately test
> that the lockless conversion was done correctly.

Yeah, it's better to introduce xfs_iunlink_insert_lock in this series,
will update in the next version.

Unfortunately, it's not easy to seperate (b) and (c) and it's not a real
lockless code strictly. It's just to deal with recursive lock of
pag_iunlink_mutex.

Due to some control dependency, I think it's safe to use smp_load_acquire
rather than pure READ_ONCE.

> 
> Hm, I guess on second reading, you're using the smp_load_acquire here to
> figure out if you even want to take the mutex?

Yes.

> 
> <confused>
> 
> > +
> >  	/*
> >  	 * Insert the inode into the on disk unlinked list, and if that
> >  	 * succeeds, then insert it into the in memory list. We do it in this
> > @@ -2037,11 +2058,72 @@ xfs_iunlink(
> >  	if (!error)
> >  		list_add(&ip->i_unlink, &agibp->b_pag->pag_ici_unlink_list);
> >  
> > +	xfs_iunlink_unlock(pag);
> >  	return error;
> >  }
> >  
> > +/*
> > + * Lock the perag and take AGI lock if agi_unlinked is touched as well for
> > + * xfs_iunlink_remove_inode().
> > + *
> > + * Inode allocation in the O_TMPFILE path defines the AGI/unlinked list lock
> > + * order as being AGI->perag unlinked list lock. We are inverting it here as
> > + * the fast path tail addition does not need to modify the AGI at all. Hence
> > + * AGI lock is only needed if the head is modified, correct locking order.
> 
> Er, I don't get it, what is the expected locking state of
> pag_iunlink_mutex before and after this function?

pag_iunlink_mutex can be taken before this function, but it should be taken
after this function with recursive count added.

Thanks,
Gao Xiang

> 
> --D
> 
> > + */
> > +static struct xfs_perag *
> > +xfs_iunlink_remove_lock(
> > +	xfs_agino_t		agno,
> > +	struct xfs_trans        *tp,
> > +	struct xfs_inode	*ip,
> > +	struct xfs_buf		**agibpp,
> > +	bool			force_agi)
> > +{
> > +	struct xfs_mount        *mp = tp->t_mountp;
> > +	struct xfs_perag	*pag;
> > +	int			error;
> > +
> > +	pag = xfs_perag_get(mp, agno);
> > +	/* paired with smp_store_release() in xfs_iunlink_unlock() */
> > +	if (smp_load_acquire(&pag->pag_iunlink_trans) == tp) {
> > +		/*
> > +		 * if pag_iunlink_trans is the current trans, we're
> > +		 * in the current process context, so it's safe here.
> > +		 */
> > +		ASSERT(mutex_is_locked(&pag->pag_iunlink_mutex));
> > +		goto out;
> > +	}
> > +
> > +	if (!force_agi) {
> > +		mutex_lock(&pag->pag_iunlink_mutex);
> > +		if (ip != list_first_entry(&pag->pag_ici_unlink_list,
> > +				struct xfs_inode, i_unlink))
> > +			goto out;
> > +		mutex_unlock(&pag->pag_iunlink_mutex);
> > +	}
> > +
> > +	/*
> > +	 * some paths (e.g. xfs_create_tmpfile) could take AGI lock in
> > +	 * this transaction in advance and we should keep the proper
> > +	 * locking order: AGI buf lock and then pag_iunlink_mutex.
> > +	 */
> > +	error = xfs_read_agi(mp, tp, agno, agibpp);
> > +	if (error) {
> > +		xfs_perag_put(pag);
> > +		return ERR_PTR(error);
> > +	}
> > +
> > +	mutex_lock(&pag->pag_iunlink_mutex);
> > +	ASSERT(!pag->pag_iunlink_refcnt);
> > +out:
> > +	WRITE_ONCE(pag->pag_iunlink_trans, tp);
> > +	++pag->pag_iunlink_refcnt;
> > +	return pag;
> > +}
> > +
> >  static int
> >  xfs_iunlink_remove_inode(
> > +	struct xfs_perag	*pag,
> >  	struct xfs_trans	*tp,
> >  	struct xfs_buf		*agibp,
> >  	struct xfs_inode	*ip)
> > @@ -2055,7 +2137,7 @@ xfs_iunlink_remove_inode(
> >  	 * Get the next agino in the list. If we are at the end of the list,
> >  	 * then the previous inode's i_next_unlinked filed will get cleared.
> >  	 */
> > -	if (ip != list_last_entry(&agibp->b_pag->pag_ici_unlink_list,
> > +	if (ip != list_last_entry(&pag->pag_ici_unlink_list,
> >  					struct xfs_inode, i_unlink)) {
> >  		struct xfs_inode *nip = list_next_entry(ip, i_unlink);
> >  
> > @@ -2065,7 +2147,7 @@ xfs_iunlink_remove_inode(
> >  	/* Clear the on disk next unlinked pointer for this inode. */
> >  	xfs_iunlink_log(tp, ip, next_agino, NULLAGINO);
> >  
> > -	if (ip != list_first_entry(&agibp->b_pag->pag_ici_unlink_list,
> > +	if (ip != list_first_entry(&pag->pag_ici_unlink_list,
> >  					struct xfs_inode, i_unlink)) {
> >  		struct xfs_inode *pip = list_prev_entry(ip, i_unlink);
> >  
> > @@ -2073,6 +2155,7 @@ xfs_iunlink_remove_inode(
> >  		return 0;
> >  	}
> >  
> > +	ASSERT(agibp);
> >  	/* Point the head of the list to the next unlinked inode. */
> >  	return xfs_iunlink_update_bucket(tp, agno, agibp, agino, next_agino);
> >  }
> > @@ -2083,27 +2166,29 @@ xfs_iunlink_remove_inode(
> >  STATIC int
> >  xfs_iunlink_remove(
> >  	struct xfs_trans	*tp,
> > -	struct xfs_inode	*ip)
> > +	struct xfs_inode	*ip,
> > +	bool			force_agi)
> >  {
> >  	struct xfs_mount	*mp = tp->t_mountp;
> > -	struct xfs_buf		*agibp;
> > +	struct xfs_buf		*agibp = NULL;
> >  	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
> > +	struct xfs_perag	*pag;
> >  	int			error;
> >  
> >  	trace_xfs_iunlink_remove(ip);
> >  
> > -	/* Get the agi buffer first.  It ensures lock ordering on the list. */
> > -	error = xfs_read_agi(mp, tp, agno, &agibp);
> > -	if (error)
> > -		return error;
> > +	pag = xfs_iunlink_remove_lock(agno, tp, ip, &agibp, force_agi);
> > +	if (IS_ERR(pag))
> > +		return PTR_ERR(pag);
> >  
> >  	/*
> >  	 * Remove the inode from the on-disk list and then remove it from the
> >  	 * in-memory list. This order of operations ensures we can look up both
> >  	 * next and previous inode in the on-disk list via the in-memory list.
> >  	 */
> > -	error = xfs_iunlink_remove_inode(tp, agibp, ip);
> > +	error = xfs_iunlink_remove_inode(pag, tp, agibp, ip);
> >  	list_del(&ip->i_unlink);
> > +	xfs_iunlink_unlock(pag);
> >  	return error;
> >  }
> >  
> > @@ -2316,7 +2401,7 @@ xfs_ifree(
> >  	if (error)
> >  		return error;
> >  
> > -	error = xfs_iunlink_remove(tp, ip);
> > +	error = xfs_iunlink_remove(tp, ip, false);
> >  	if (error)
> >  		return error;
> >  
> > @@ -2895,7 +2980,7 @@ xfs_rename(
> >  	 */
> >  	if (wip) {
> >  		ASSERT(VFS_I(wip)->i_nlink == 0);
> > -		error = xfs_iunlink_remove(tp, wip);
> > +		error = xfs_iunlink_remove(tp, wip, true);
> >  		if (error)
> >  			goto out_trans_cancel;
> >  
> > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> > index 7f8fbb7c8594..d0a221af71db 100644
> > --- a/fs/xfs/xfs_inode.h
> > +++ b/fs/xfs/xfs_inode.h
> > @@ -468,5 +468,6 @@ void xfs_end_io(struct work_struct *work);
> >  
> >  int xfs_ilock2_io_mmap(struct xfs_inode *ip1, struct xfs_inode *ip2);
> >  void xfs_iunlock2_io_mmap(struct xfs_inode *ip1, struct xfs_inode *ip2);
> > +void xfs_iunlink_unlock(struct xfs_perag *pag);
> >  
> >  #endif	/* __XFS_INODE_H__ */
> > diff --git a/fs/xfs/xfs_iunlink_item.c b/fs/xfs/xfs_iunlink_item.c
> > index 2ee05f98aa97..ae1e73e465b2 100644
> > --- a/fs/xfs/xfs_iunlink_item.c
> > +++ b/fs/xfs/xfs_iunlink_item.c
> > @@ -16,6 +16,7 @@
> >  #include "xfs_iunlink_item.h"
> >  #include "xfs_trace.h"
> >  #include "xfs_error.h"
> > +#include "xfs_sb.h"
> >  
> >  struct kmem_cache	*xfs_iunlink_zone;
> >  
> > @@ -28,6 +29,13 @@ static void
> >  xfs_iunlink_item_release(
> >  	struct xfs_log_item	*lip)
> >  {
> > +	struct xfs_mount	*mp = lip->li_mountp;
> > +	struct xfs_inode        *ip = IUL_ITEM(lip)->iu_ip;
> > +	struct xfs_perag	*pag;
> > +
> > +	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
> > +	ASSERT(mutex_is_locked(&pag->pag_iunlink_mutex));
> > +	xfs_iunlink_unlock(pag);
> >  	kmem_cache_free(xfs_iunlink_zone, IUL_ITEM(lip));
> >  }
> >  
> > @@ -150,7 +158,9 @@ xfs_iunlink_log(
> >  	xfs_agino_t		old_agino,
> >  	xfs_agino_t		next_agino)
> >  {
> > +	struct xfs_mount        *mp = tp->t_mountp;
> >  	struct xfs_iunlink_item	*iup;
> > +	struct xfs_perag	*pag;
> >  
> >  	iup = kmem_cache_zalloc(xfs_iunlink_zone, GFP_KERNEL | __GFP_NOFAIL);
> >  
> > @@ -164,5 +174,11 @@ xfs_iunlink_log(
> >  	xfs_trans_add_item(tp, &iup->iu_item);
> >  	tp->t_flags |= XFS_TRANS_DIRTY;
> >  	set_bit(XFS_LI_DIRTY, &iup->iu_item.li_flags);
> > +
> > +	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
> > +	ASSERT(mutex_is_locked(&pag->pag_iunlink_mutex));
> > +	ASSERT(pag->pag_iunlink_trans == tp);
> > +	++pag->pag_iunlink_refcnt;
> > +	xfs_perag_put(pag);
> >  }
> >  
> > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> > index f28c969af272..82d264a3350d 100644
> > --- a/fs/xfs/xfs_mount.c
> > +++ b/fs/xfs/xfs_mount.c
> > @@ -224,6 +224,10 @@ xfs_initialize_perag(
> >  		if (first_initialised == NULLAGNUMBER)
> >  			first_initialised = index;
> >  		spin_lock_init(&pag->pag_state_lock);
> > +
> > +		mutex_init(&pag->pag_iunlink_mutex);
> > +		pag->pag_iunlink_refcnt = 0;
> > +		pag->pag_iunlink_trans = NULL;
> >  	}
> >  
> >  	index = xfs_set_inode_alloc(mp, agcount);
> > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> > index 98109801a995..fca4c1d28d8e 100644
> > --- a/fs/xfs/xfs_mount.h
> > +++ b/fs/xfs/xfs_mount.h
> > @@ -372,6 +372,15 @@ typedef struct xfs_perag {
> >  
> >  	/* reference count */
> >  	uint8_t			pagf_refcount_level;
> > +
> > +	/* lock to protect unlinked inode list and refcnt */
> > +	struct mutex            pag_iunlink_mutex;
> > +
> > +	/* recursive count of pag_iunlink_mutex */
> > +	unsigned int		pag_iunlink_refcnt;
> > +
> > +	/* (lockless) by which pag_iunlink_mutex is taken */
> > +	struct xfs_trans	*pag_iunlink_trans;
> >  } xfs_perag_t;
> >  
> >  static inline struct xfs_ag_resv *
> > -- 
> > 2.18.1
> > 
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v4 0/3] xfs: more unlinked inode list optimization v4
  2020-08-18 13:30   ` [RFC PATCH v4 0/3] xfs: more unlinked inode list optimization v4 Gao Xiang
                       ` (3 preceding siblings ...)
  2020-08-19  0:53     ` [RFC PATCH v4 0/3] xfs: more unlinked inode list optimization v4 Darrick J. Wong
@ 2020-08-20  2:46     ` Darrick J. Wong
  2020-08-20  4:01       ` Gao Xiang
  4 siblings, 1 reply; 29+ messages in thread
From: Darrick J. Wong @ 2020-08-20  2:46 UTC (permalink / raw)
  To: Gao Xiang; +Cc: linux-xfs, Dave Chinner, Brian Foster

Hm.  I saw the following warning from lockdep when running generic/078 with:

MKFS_OPTIONS --  -m reflink=1,rmapbt=1 -i sparse=1
MOUNT_OPTIONS --  -o usrquota,grpquota,prjquota,
CHECK_OPTIONS -- -g auto
XFS_MKFS_OPTIONS -- -bsize=4096

The test kernel is 5.9-rc1 + inobtcount + y2038 + dave's iunlink series + yours

+[  516.166575] run fstests generic/078 at 2020-08-19 15:35:28
+[  516.659584] XFS (sda): Mounting V5 Filesystem
+[  516.846982] XFS (sda): Ending clean mount
+[  516.849669] xfs filesystem being mounted at /mnt supports timestamps until 2038 (0x7fffffff)
+
+[  517.341920] ============================================
+[  517.342849] WARNING: possible recursive locking detected
+[  517.343832] 5.9.0-rc1-djw #rc1 Tainted: G        W        
+[  517.344862] --------------------------------------------
+[  517.345886] renameat2/107505 is trying to acquire lock:
+[  517.346830] ffff88803ca48468 (&pag->pag_iunlink_mutex){+.+.}-{3:3}, at: xfs_iunlink+0xb8/0x3f0 [xfs]
+[  517.348460] 
+               but task is already holding lock:
+[  517.349107] ffff88803ca4bc68 (&pag->pag_iunlink_mutex){+.+.}-{3:3}, at: xfs_iunlink_remove+0x239/0x3f0 [xfs]
+[  517.350240] 
+               other info that might help us debug this:
+[  517.351166]  Possible unsafe locking scenario:
+
+[  517.351902]        CPU0
+[  517.352399]        ----
+[  517.352895]   lock(&pag->pag_iunlink_mutex);
+[  517.353572]   lock(&pag->pag_iunlink_mutex);
+[  517.354387] 
+                *** DEADLOCK ***
+
+[  517.355394]  May be due to missing lock nesting notation
+
+[  517.356132] 9 locks held by renameat2/107505:
+[  517.356615]  #0: ffff888032b78468 (sb_writers#12){.+.+}-{0:0}, at: mnt_want_write+0x24/0x60
+[  517.357561]  #1: ffff88803d00bfe8 (&inode->i_sb->s_type->i_mutex_dir_key/1){+.+.}-{3:3}, at: lock_rename+0xf5/0x100
+[  517.358912]  #2: ffff88803d00fbe8 (&inode->i_sb->s_type->i_mutex_dir_key){++++}-{3:3}, at: vfs_rename+0x17d/0x950
+[  517.360011]  #3: ffff888032b78688 (sb_internal){.+.+}-{0:0}, at: xfs_trans_alloc+0x18b/0x250 [xfs]
+[  517.361018]  #4: ffff88803d00f8f0 (&xfs_dir_ilock_class){++++}-{3:3}, at: xfs_ilock+0xcf/0x2a0 [xfs]
+[  517.362063]  #5: ffff88803d00bcf0 (&xfs_dir_ilock_class){++++}-{3:3}, at: xfs_ilock_nowait+0xcf/0x340 [xfs]
+[  517.363173]  #6: ffff88803bf82df0 (&xfs_nondir_ilock_class){++++}-{3:3}, at: xfs_ilock_nowait+0xcf/0x340 [xfs]
+[  517.364685]  #7: ffff88803d00cbf0 (&xfs_dir_ilock_class){++++}-{3:3}, at: xfs_ilock_nowait+0xcf/0x340 [xfs]
+[  517.366414]  #8: ffff88803ca4bc68 (&pag->pag_iunlink_mutex){+.+.}-{3:3}, at: xfs_iunlink_remove+0x239/0x3f0 [xfs]
+[  517.367787] 
+               stack backtrace:
+[  517.368285] CPU: 1 PID: 107505 Comm: renameat2 Tainted: G        W         5.9.0-rc1-djw #rc1
+[  517.369196] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1 04/01/2014
+[  517.370401] Call Trace:
+[  517.370703]  dump_stack+0x7c/0xac
+[  517.371091]  __lock_acquire.cold+0x168/0x2ab
+[  517.371571]  lock_acquire+0xa2/0x370
+[  517.372025]  ? xfs_iunlink+0xb8/0x3f0 [xfs]
+[  517.372505]  __mutex_lock+0xa1/0xa00
+[  517.372955]  ? xfs_iunlink+0xb8/0x3f0 [xfs]
+[  517.373455]  ? kvm_clock_read+0x14/0x30
+[  517.373898]  ? kvm_sched_clock_read+0x9/0x20
+[  517.374379]  ? sched_clock_cpu+0x14/0xe0
+[  517.374897]  ? xfs_iunlink+0xb8/0x3f0 [xfs]
+[  517.375405]  ? xfs_iunlink+0x94/0x3f0 [xfs]
+[  517.376185]  ? rcu_read_lock_sched_held+0x56/0x80
+[  517.377055]  ? xfs_iunlink+0xb8/0x3f0 [xfs]
+[  517.377896]  xfs_iunlink+0xb8/0x3f0 [xfs]
+[  517.378703]  xfs_rename+0xdb8/0x1030 [xfs]
+[  517.379261]  xfs_vn_rename+0xd5/0x140 [xfs]
+[  517.379740]  vfs_rename+0x1bc/0x950
+[  517.380146]  ? lookup_dcache+0x18/0x60
+[  517.380572]  ? do_renameat2+0x343/0x4d0
+[  517.381013]  do_renameat2+0x343/0x4d0
+[  517.381672]  __x64_sys_renameat2+0x25/0x30
+[  517.382145]  do_syscall_64+0x31/0x40
+[  517.382554]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
+[  517.383330] RIP: 0033:0x7fb9be244083
+[  517.384005] Code: 64 89 02 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 f3 0f 1e fa 49 89 ca 45 85 c0 74 44 b8 3c 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3d 41 89 c0 83 f8 ff 74 0d 44 89 c0 c3 66 0f
+[  517.387257] RSP: 002b:00007ffc56416ed8 EFLAGS: 00000202 ORIG_RAX: 000000000000013c
+[  517.388148] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fb9be244083
+[  517.389183] RDX: 00000000ffffff9c RSI: 00007ffc5641968c RDI: 00000000ffffff9c
+[  517.390190] RBP: 0000000000000000 R08: 0000000000000004 R09: 00007ffc56416fd8
+[  517.391440] R10: 00007ffc5641969c R11: 0000000000000202 R12: 000056494b1d6220
+[  517.392747] R13: 00007ffc56416fd0 R14: 0000000000000000 R15: 0000000000000000
+[  517.831548] XFS (sda): Unmounting Filesystem
+[  518.026539] XFS (sda): Mounting V5 Filesystem
+[  518.158574] XFS (sda): Ending clean mount
+[  518.160484] xfs filesystem being mounted at /mnt supports timestamps until 2038 (0x7fffffff)

Dunno what this is about, but I'll have a look in the morning...

--D

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v4 0/3] xfs: more unlinked inode list optimization v4
  2020-08-20  2:46     ` Darrick J. Wong
@ 2020-08-20  4:01       ` Gao Xiang
  0 siblings, 0 replies; 29+ messages in thread
From: Gao Xiang @ 2020-08-20  4:01 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, Dave Chinner, Brian Foster

Hi Darrick,

On Wed, Aug 19, 2020 at 07:46:18PM -0700, Darrick J. Wong wrote:
> Hm.  I saw the following warning from lockdep when running generic/078 with:
> 
> MKFS_OPTIONS --  -m reflink=1,rmapbt=1 -i sparse=1
> MOUNT_OPTIONS --  -o usrquota,grpquota,prjquota,
> CHECK_OPTIONS -- -g auto
> XFS_MKFS_OPTIONS -- -bsize=4096
> 
> The test kernel is 5.9-rc1 + inobtcount + y2038 + dave's iunlink series + yours

Thanks for the report. I didn't enable lockdep before, will confirm it soon.

Thanks!
Gao Xiang

> 
> +[  516.166575] run fstests generic/078 at 2020-08-19 15:35:28
> +[  516.659584] XFS (sda): Mounting V5 Filesystem
> +[  516.846982] XFS (sda): Ending clean mount
> +[  516.849669] xfs filesystem being mounted at /mnt supports timestamps until 2038 (0x7fffffff)
> +
> +[  517.341920] ============================================
> +[  517.342849] WARNING: possible recursive locking detected
> +[  517.343832] 5.9.0-rc1-djw #rc1 Tainted: G        W        
> +[  517.344862] --------------------------------------------
> +[  517.345886] renameat2/107505 is trying to acquire lock:
> +[  517.346830] ffff88803ca48468 (&pag->pag_iunlink_mutex){+.+.}-{3:3}, at: xfs_iunlink+0xb8/0x3f0 [xfs]
> +[  517.348460] 
> +               but task is already holding lock:
> +[  517.349107] ffff88803ca4bc68 (&pag->pag_iunlink_mutex){+.+.}-{3:3}, at: xfs_iunlink_remove+0x239/0x3f0 [xfs]
> +[  517.350240] 
> +               other info that might help us debug this:
> +[  517.351166]  Possible unsafe locking scenario:
> +
> +[  517.351902]        CPU0
> +[  517.352399]        ----
> +[  517.352895]   lock(&pag->pag_iunlink_mutex);
> +[  517.353572]   lock(&pag->pag_iunlink_mutex);
> +[  517.354387] 
> +                *** DEADLOCK ***
> +
> +[  517.355394]  May be due to missing lock nesting notation
> +
> +[  517.356132] 9 locks held by renameat2/107505:
> +[  517.356615]  #0: ffff888032b78468 (sb_writers#12){.+.+}-{0:0}, at: mnt_want_write+0x24/0x60
> +[  517.357561]  #1: ffff88803d00bfe8 (&inode->i_sb->s_type->i_mutex_dir_key/1){+.+.}-{3:3}, at: lock_rename+0xf5/0x100
> +[  517.358912]  #2: ffff88803d00fbe8 (&inode->i_sb->s_type->i_mutex_dir_key){++++}-{3:3}, at: vfs_rename+0x17d/0x950
> +[  517.360011]  #3: ffff888032b78688 (sb_internal){.+.+}-{0:0}, at: xfs_trans_alloc+0x18b/0x250 [xfs]
> +[  517.361018]  #4: ffff88803d00f8f0 (&xfs_dir_ilock_class){++++}-{3:3}, at: xfs_ilock+0xcf/0x2a0 [xfs]
> +[  517.362063]  #5: ffff88803d00bcf0 (&xfs_dir_ilock_class){++++}-{3:3}, at: xfs_ilock_nowait+0xcf/0x340 [xfs]
> +[  517.363173]  #6: ffff88803bf82df0 (&xfs_nondir_ilock_class){++++}-{3:3}, at: xfs_ilock_nowait+0xcf/0x340 [xfs]
> +[  517.364685]  #7: ffff88803d00cbf0 (&xfs_dir_ilock_class){++++}-{3:3}, at: xfs_ilock_nowait+0xcf/0x340 [xfs]
> +[  517.366414]  #8: ffff88803ca4bc68 (&pag->pag_iunlink_mutex){+.+.}-{3:3}, at: xfs_iunlink_remove+0x239/0x3f0 [xfs]
> +[  517.367787] 
> +               stack backtrace:
> +[  517.368285] CPU: 1 PID: 107505 Comm: renameat2 Tainted: G        W         5.9.0-rc1-djw #rc1
> +[  517.369196] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1 04/01/2014
> +[  517.370401] Call Trace:
> +[  517.370703]  dump_stack+0x7c/0xac
> +[  517.371091]  __lock_acquire.cold+0x168/0x2ab
> +[  517.371571]  lock_acquire+0xa2/0x370
> +[  517.372025]  ? xfs_iunlink+0xb8/0x3f0 [xfs]
> +[  517.372505]  __mutex_lock+0xa1/0xa00
> +[  517.372955]  ? xfs_iunlink+0xb8/0x3f0 [xfs]
> +[  517.373455]  ? kvm_clock_read+0x14/0x30
> +[  517.373898]  ? kvm_sched_clock_read+0x9/0x20
> +[  517.374379]  ? sched_clock_cpu+0x14/0xe0
> +[  517.374897]  ? xfs_iunlink+0xb8/0x3f0 [xfs]
> +[  517.375405]  ? xfs_iunlink+0x94/0x3f0 [xfs]
> +[  517.376185]  ? rcu_read_lock_sched_held+0x56/0x80
> +[  517.377055]  ? xfs_iunlink+0xb8/0x3f0 [xfs]
> +[  517.377896]  xfs_iunlink+0xb8/0x3f0 [xfs]
> +[  517.378703]  xfs_rename+0xdb8/0x1030 [xfs]
> +[  517.379261]  xfs_vn_rename+0xd5/0x140 [xfs]
> +[  517.379740]  vfs_rename+0x1bc/0x950
> +[  517.380146]  ? lookup_dcache+0x18/0x60
> +[  517.380572]  ? do_renameat2+0x343/0x4d0
> +[  517.381013]  do_renameat2+0x343/0x4d0
> +[  517.381672]  __x64_sys_renameat2+0x25/0x30
> +[  517.382145]  do_syscall_64+0x31/0x40
> +[  517.382554]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> +[  517.383330] RIP: 0033:0x7fb9be244083
> +[  517.384005] Code: 64 89 02 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 f3 0f 1e fa 49 89 ca 45 85 c0 74 44 b8 3c 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3d 41 89 c0 83 f8 ff 74 0d 44 89 c0 c3 66 0f
> +[  517.387257] RSP: 002b:00007ffc56416ed8 EFLAGS: 00000202 ORIG_RAX: 000000000000013c
> +[  517.388148] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fb9be244083
> +[  517.389183] RDX: 00000000ffffff9c RSI: 00007ffc5641968c RDI: 00000000ffffff9c
> +[  517.390190] RBP: 0000000000000000 R08: 0000000000000004 R09: 00007ffc56416fd8
> +[  517.391440] R10: 00007ffc5641969c R11: 0000000000000202 R12: 000056494b1d6220
> +[  517.392747] R13: 00007ffc56416fd0 R14: 0000000000000000 R15: 0000000000000000
> +[  517.831548] XFS (sda): Unmounting Filesystem
> +[  518.026539] XFS (sda): Mounting V5 Filesystem
> +[  518.158574] XFS (sda): Ending clean mount
> +[  518.160484] xfs filesystem being mounted at /mnt supports timestamps until 2038 (0x7fffffff)
> 
> Dunno what this is about, but I'll have a look in the morning...
> 
> --D
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v4 1/3] xfs: get rid of unused pagi_unlinked_hash
  2020-08-19  0:54       ` Darrick J. Wong
@ 2020-08-21  1:09         ` Dave Chinner
  0 siblings, 0 replies; 29+ messages in thread
From: Dave Chinner @ 2020-08-21  1:09 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Gao Xiang, linux-xfs, Brian Foster

On Tue, Aug 18, 2020 at 05:54:03PM -0700, Darrick J. Wong wrote:
> On Tue, Aug 18, 2020 at 09:30:13PM +0800, Gao Xiang wrote:
> > pagi_unlinked_hash is unused since no backref infrastructure now.
> > (it's better to fold it into original patchset.)
> > 
> > Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
> 
> Yes, this should be folded into the other patch that gets rid of the
> rest of the rhash code.  Dave?

Oh, I must have missed that in a rebase conflict resolution...

I'll fold it into the appropriate patch....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, back to index

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-07 13:57 [RFC PATCH 0/2] xfs: more unlinked inode list optimization v1 Gao Xiang
2020-07-07 13:57 ` [RFC PATCH 1/2] xfs: arrange all unlinked inodes into one list Gao Xiang
2020-07-08 22:33   ` Dave Chinner
2020-07-09  0:17     ` Gao Xiang
2020-07-07 13:57 ` [RFC PATCH 2/2] xfs: don't access AGI on unlinked inodes if it can Gao Xiang
2020-07-08 17:03   ` Darrick J. Wong
2020-07-08 23:40     ` Gao Xiang
2020-07-08 23:33   ` Dave Chinner
2020-07-09  0:55     ` Gao Xiang
2020-07-09  2:32       ` Dave Chinner
2020-07-09 10:36         ` Gao Xiang
2020-07-09 10:47           ` Gao Xiang
2020-07-09 22:36           ` Dave Chinner
2020-07-24  6:12 ` [RFC PATCH v2 0/3] xfs: more unlinked inode list optimization v2 Gao Xiang
2020-07-24  6:12   ` [RFC PATCH v2 1/3] xfs: arrange all unlinked inodes into one list Gao Xiang
2020-07-24  6:12   ` [RFC PATCH v2 2/3] xfs: introduce perag iunlink lock Gao Xiang
2020-07-24  6:12   ` [RFC PATCH v2 3/3] xfs: insert unlinked inodes from tail Gao Xiang
2020-08-18 13:30   ` [RFC PATCH v4 0/3] xfs: more unlinked inode list optimization v4 Gao Xiang
2020-08-18 13:30     ` [RFC PATCH v4 1/3] xfs: get rid of unused pagi_unlinked_hash Gao Xiang
2020-08-19  0:54       ` Darrick J. Wong
2020-08-21  1:09         ` Dave Chinner
2020-08-18 13:30     ` [RFC PATCH v4 2/3] xfs: introduce perag iunlink lock Gao Xiang
2020-08-19  1:06       ` Darrick J. Wong
2020-08-19  1:23         ` Gao Xiang
2020-08-18 13:30     ` [RFC PATCH v4 3/3] xfs: insert unlinked inodes from tail Gao Xiang
2020-08-19  0:53     ` [RFC PATCH v4 0/3] xfs: more unlinked inode list optimization v4 Darrick J. Wong
2020-08-19  1:14       ` Gao Xiang
2020-08-20  2:46     ` Darrick J. Wong
2020-08-20  4:01       ` Gao Xiang

Linux-XFS Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-xfs/0 linux-xfs/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-xfs linux-xfs/ https://lore.kernel.org/linux-xfs \
		linux-xfs@vger.kernel.org
	public-inbox-index linux-xfs

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-xfs


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git