Linux-XFS Archive on lore.kernel.org
 help / color / Atom feed
* [RFC PATCH 0/2] xfs: more unlinked inode list optimization v1
@ 2020-07-07 13:57 Gao Xiang
  2020-07-07 13:57 ` [RFC PATCH 1/2] xfs: arrange all unlinked inodes into one list Gao Xiang
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Gao Xiang @ 2020-07-07 13:57 UTC (permalink / raw)
  To: linux-xfs; +Cc: Dave Chinner, Darrick J. Wong, Brian Foster, Gao Xiang

Hi forks,

This RFC patchset mainly addresses the thoughts [*] and [**] from Dave's
original patchset,
https://lore.kernel.org/r/20200623095015.1934171-1-david@fromorbit.com

In short, it focues on the following ideas mentioned by Dave:
 - use bucket 0 instead of multiple buckets since in-memory double
   linked list finally works;

 - avoid taking AGI buffer and unnecessary AGI update if possible, so
   1) add a new lock and keep proper locking order to avoid deadlock;
   2) insert a new unlinked inode from the tail instead of head;

In addition, it's worth noticing 3 things:
 - xfs_iunlink_remove() should support old multiple buckets in order
   to keep old inode unlinked list (old image) working when recovering.

 - (but) OTOH, the old kernel recovery _shouldn't_ work with new image
   since the bucket_index from old xfs_iunlink_remove() is generated
   by the old formula (rather than keep in xfs_inode), which is now
   fixed as 0. So this feature is not forward compatible without some
   extra backport patches;

 - a tail xfs_inode pointer is also added in the perag, which keeps 
   track of the tail of bucket 0 since it's mainly used for xfs_iunlink().

 - the old kernel recovery shouldn't work with new image since
     its bucket_index from old kernel is 

The git tree is also available at
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/linux.git tags/xfs/iunlink_opt_v1

Some limited test for debugging only is done (mainly fsstress and
manual power-cut tests), so it could not work as expected just like
my limited broken xfs knowledge. But I will go on improving this
patchset recently. 

Any comments and directions are welcome. :)

Thanks,
Gao Xiang

Gao Xiang (2):
  xfs: arrange all unlinked inodes into one list
  xfs: don't access AGI on unlinked inodes if it can

 fs/xfs/xfs_inode.c       | 283 ++++++++++++++++++++-------------------
 fs/xfs/xfs_log_recover.c |   6 +
 fs/xfs/xfs_mount.c       |   3 +
 fs/xfs/xfs_mount.h       |   3 +
 4 files changed, 160 insertions(+), 135 deletions(-)

-- 
2.18.1


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [RFC PATCH 1/2] xfs: arrange all unlinked inodes into one list
  2020-07-07 13:57 [RFC PATCH 0/2] xfs: more unlinked inode list optimization v1 Gao Xiang
@ 2020-07-07 13:57 ` Gao Xiang
  2020-07-08 22:33   ` Dave Chinner
  2020-07-07 13:57 ` [RFC PATCH 2/2] xfs: don't access AGI on unlinked inodes if it can Gao Xiang
  2020-07-24  6:12 ` [RFC PATCH v2 0/3] xfs: more unlinked inode list optimization v2 Gao Xiang
  2 siblings, 1 reply; 17+ messages in thread
From: Gao Xiang @ 2020-07-07 13:57 UTC (permalink / raw)
  To: linux-xfs; +Cc: Dave Chinner, Darrick J. Wong, Brian Foster, Gao Xiang

There is no need to keep old multiple short unlink inode buckets
since we have an in-memory double linked list for all unlinked
inodes.

Apart from the perspective of the necessity, the main advantage
is that the log and AGI update can be reduced since each AG has
the only one head now, which is implemented in the following patch.

Therefore, this patch applies the new way in xfs_iunlink() and
keep the old approach in xfs_iunlink_remove_inode() path as well
so inode eviction can still work properly in recovery.

Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
---
 fs/xfs/xfs_inode.c | 40 ++++++++++++++++++++--------------------
 1 file changed, 20 insertions(+), 20 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index ab288424764c..10565fa5ace4 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -33,6 +33,7 @@
 #include "xfs_symlink.h"
 #include "xfs_trans_priv.h"
 #include "xfs_log.h"
+#include "xfs_log_priv.h"
 #include "xfs_bmap_btree.h"
 #include "xfs_reflink.h"
 #include "xfs_iunlink_item.h"
@@ -1955,25 +1956,32 @@ xfs_iunlink_update_bucket(
 	struct xfs_trans	*tp,
 	xfs_agnumber_t		agno,
 	struct xfs_buf		*agibp,
-	unsigned int		bucket_index,
+	xfs_agino_t		old_agino,
 	xfs_agino_t		new_agino)
 {
+	struct xlog		*log = tp->t_mountp->m_log;
 	struct xfs_agi		*agi = agibp->b_addr;
 	xfs_agino_t		old_value;
-	int			offset;
+	unsigned int		bucket_index;
+	int                     offset;
 
 	ASSERT(xfs_verify_agino_or_null(tp->t_mountp, agno, new_agino));
 
+	bucket_index = 0;
+	/* During recovery, the old multiple bucket index can be applied */
+	if (!log || log->l_flags & XLOG_RECOVERY_NEEDED) {
+		ASSERT(old_agino != NULLAGINO);
+
+		if (be32_to_cpu(agi->agi_unlinked[0]) != old_agino)
+			bucket_index = old_agino % XFS_AGI_UNLINKED_BUCKETS;
+	}
+
 	old_value = be32_to_cpu(agi->agi_unlinked[bucket_index]);
 	trace_xfs_iunlink_update_bucket(tp->t_mountp, agno, bucket_index,
 			old_value, new_agino);
 
-	/*
-	 * We should never find the head of the list already set to the value
-	 * passed in because either we're adding or removing ourselves from the
-	 * head of the list.
-	 */
-	if (old_value == new_agino) {
+	/* check if the old agi_unlinked head is as expected */
+	if (old_value != old_agino) {
 		xfs_buf_mark_corrupt(agibp);
 		return -EFSCORRUPTED;
 	}
@@ -2001,14 +2009,13 @@ xfs_iunlink_insert_inode(
 	xfs_agino_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
 	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
 	xfs_agino_t		next_agino;
-	short			bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
 
 	/*
 	 * Get the index into the agi hash table for the list this inode will
 	 * go on.  Make sure the pointer isn't garbage and that this inode
 	 * isn't already on the list.
 	 */
-	next_agino = be32_to_cpu(agi->agi_unlinked[bucket_index]);
+	next_agino = be32_to_cpu(agi->agi_unlinked[0]);
 	if (next_agino == agino ||
 	    !xfs_verify_agino_or_null(mp, agno, next_agino)) {
 		xfs_buf_mark_corrupt(agibp);
@@ -2036,7 +2043,7 @@ xfs_iunlink_insert_inode(
 	}
 
 	/* Point the head of the list to point to this inode. */
-	return xfs_iunlink_update_bucket(tp, agno, agibp, bucket_index, agino);
+	return xfs_iunlink_update_bucket(tp, agno, agibp, next_agino, agino);
 }
 
 /*
@@ -2051,27 +2058,20 @@ xfs_iunlink_remove_inode(
 	struct xfs_inode	*ip)
 {
 	struct xfs_mount	*mp = tp->t_mountp;
-	struct xfs_agi		*agi = agibp->b_addr;
 	xfs_agino_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
 	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
 	xfs_agino_t		next_agino = ip->i_next_unlinked;
-	short			bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
 	int			error;
 
 	if (ip->i_prev_unlinked == NULLAGINO) {
 		/* remove from head of list */
-		if (be32_to_cpu(agi->agi_unlinked[bucket_index]) != agino) {
-			xfs_buf_mark_corrupt(agibp);
-			return -EFSCORRUPTED;
-		}
 		if (next_agino == agino ||
 		    !xfs_verify_agino_or_null(mp, agno, next_agino))
 			return -EFSCORRUPTED;
 
-		error = xfs_iunlink_update_bucket(tp, agno, agibp,
-					bucket_index, next_agino);
+		error = xfs_iunlink_update_bucket(tp, agno, agibp, agino, next_agino);
 		if (error)
-			return -EFSCORRUPTED;
+			return error;
 	} else {
 		/* lookup previous inode and update to point at next */
 		struct xfs_inode	*pip;
-- 
2.18.1


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [RFC PATCH 2/2] xfs: don't access AGI on unlinked inodes if it can
  2020-07-07 13:57 [RFC PATCH 0/2] xfs: more unlinked inode list optimization v1 Gao Xiang
  2020-07-07 13:57 ` [RFC PATCH 1/2] xfs: arrange all unlinked inodes into one list Gao Xiang
@ 2020-07-07 13:57 ` Gao Xiang
  2020-07-08 17:03   ` Darrick J. Wong
  2020-07-08 23:33   ` Dave Chinner
  2020-07-24  6:12 ` [RFC PATCH v2 0/3] xfs: more unlinked inode list optimization v2 Gao Xiang
  2 siblings, 2 replies; 17+ messages in thread
From: Gao Xiang @ 2020-07-07 13:57 UTC (permalink / raw)
  To: linux-xfs; +Cc: Dave Chinner, Darrick J. Wong, Brian Foster, Gao Xiang

Currently, we use AGI buffer lock to protect in-memory linked list for
unlinked inodes but since it's not necessary to modify AGI unless the
head of the unlinked list is modified. So let's removing the AGI buffer
modification dependency if possible, including 1) adding another per-AG
dedicated lock to protect the whole list and 2) inserting unlinked
inodes from tail.

For 2), the tail of bucket 0 is now recorded in perag for xfs_iunlink()
to use. xfs_iunlink_remove() still support old multiple short bucket
lists for recovery code.

Note that some paths take AGI lock in its transaction in advance,
so the proper locking order is only AGI lock -> unlinked list lock.

Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
---
 fs/xfs/xfs_inode.c       | 251 ++++++++++++++++++++-------------------
 fs/xfs/xfs_log_recover.c |   6 +
 fs/xfs/xfs_mount.c       |   3 +
 fs/xfs/xfs_mount.h       |   3 +
 4 files changed, 144 insertions(+), 119 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 10565fa5ace4..d33e5b198534 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1994,182 +1994,195 @@ xfs_iunlink_update_bucket(
 }
 
 /*
- * Always insert at the head, so we only have to do a next inode lookup to
- * update it's prev pointer. The AGI bucket will point at the one we are
- * inserting.
+ * This is called when the inode's link count has gone to 0 or we are creating
+ * a tmpfile via O_TMPFILE.  The inode @ip must have nlink == 0.
+ *
+ * We place the on-disk inode on a list in the AGI.  It will be pulled from this
+ * list when the inode is freed.
  */
-static int
-xfs_iunlink_insert_inode(
+STATIC int
+xfs_iunlink(
 	struct xfs_trans	*tp,
-	struct xfs_buf		*agibp,
 	struct xfs_inode	*ip)
 {
 	struct xfs_mount	*mp = tp->t_mountp;
-	struct xfs_agi		*agi = agibp->b_addr;
-	xfs_agino_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
+	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
+	struct xfs_perag	*pag;
+	struct xfs_inode	*pip;
 	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
-	xfs_agino_t		next_agino;
+	int			error;
+
+	ASSERT(VFS_I(ip)->i_nlink == 0);
+	ASSERT(VFS_I(ip)->i_mode != 0);
+	trace_xfs_iunlink(ip);
+
+	pag = xfs_perag_get(mp, agno);
 
 	/*
-	 * Get the index into the agi hash table for the list this inode will
-	 * go on.  Make sure the pointer isn't garbage and that this inode
-	 * isn't already on the list.
+	 * some paths (e.g. xfs_create_tmpfile) could take AGI lock
+	 * in this transaction in advance and the only locking order
+	 * AGI buf lock -> pag_unlinked_mutex is safe.
 	 */
-	next_agino = be32_to_cpu(agi->agi_unlinked[0]);
-	if (next_agino == agino ||
-	    !xfs_verify_agino_or_null(mp, agno, next_agino)) {
-		xfs_buf_mark_corrupt(agibp);
-		return -EFSCORRUPTED;
-	}
-
-	ip->i_prev_unlinked = NULLAGINO;
-	ip->i_next_unlinked = next_agino;
-	if (ip->i_next_unlinked != NULLAGINO) {
-		struct xfs_inode	*nip;
+	mutex_lock(&pag->pag_unlinked_mutex);
+	pip = pag->pag_unlinked_tail;
+	if (!pip) {
+		struct xfs_buf	*agibp;
 
-		nip = xfs_iunlink_lookup_next(agibp->b_pag, ip);
-		if (IS_ERR_OR_NULL(nip))
-			return -EFSCORRUPTED;
+		mutex_unlock(&pag->pag_unlinked_mutex);
 
-		if (nip->i_prev_unlinked != NULLAGINO) {
-			xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__,
-						NULL, 0, __this_address);
-			return -EFSCORRUPTED;
+		/*
+		 * there could be some race, but it doesn't matter though
+		 * since !pip doesn't happen frequently.
+		 */
+		error = xfs_read_agi(mp, tp, agno, &agibp);
+		if (error) {
+			xfs_perag_put(pag);
+			return error;
 		}
-		nip->i_prev_unlinked = agino;
 
-		/* update the on disk inode now */
-		xfs_iunlink_log(tp, ip);
-	}
+		mutex_lock(&pag->pag_unlinked_mutex);
+		pip = pag->pag_unlinked_tail;
+		if (!pip) {
+			struct xfs_agi	*agi;
+
+			ip->i_prev_unlinked = NULLAGINO;
 
-	/* Point the head of the list to point to this inode. */
-	return xfs_iunlink_update_bucket(tp, agno, agibp, next_agino, agino);
+			agi = agibp->b_addr;
+			ASSERT(be32_to_cpu(agi->agi_unlinked[0]) == NULLAGINO);
+
+			/* Point the head of the list to point to this inode. */
+			error = xfs_iunlink_update_bucket(tp, agno,
+					agibp, NULLAGINO, agino);
+			goto out;
+		}
+	}
+	ip->i_prev_unlinked = XFS_INO_TO_AGINO(mp, pip->i_ino);
+	ASSERT(pip->i_next_unlinked == NULLAGINO);
+	pip->i_next_unlinked = agino;
+	xfs_iunlink_log(tp, pip);
+out:
+	ip->i_next_unlinked = NULLAGINO;
+	pag->pag_unlinked_tail = ip;
+	mutex_unlock(&pag->pag_unlinked_mutex);
+	xfs_perag_put(pag);
+	return 0;
 }
 
 /*
+ * Pull the on-disk inode from the AGI unlinked list.
+ *
  * Remove can be from anywhere in the list, so we have to do two adjacent inode
  * lookups here so we can update list pointers. We may be at the head or the
  * tail of the list, so we have to handle those cases as well.
  */
-static int
-xfs_iunlink_remove_inode(
+STATIC int
+xfs_iunlink_remove(
 	struct xfs_trans	*tp,
-	struct xfs_buf		*agibp,
 	struct xfs_inode	*ip)
 {
 	struct xfs_mount	*mp = tp->t_mountp;
-	xfs_agino_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
+	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
+	struct xfs_perag	*pag;
+	struct xfs_inode	*pip = NULL;
 	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
-	xfs_agino_t		next_agino = ip->i_next_unlinked;
+	xfs_agino_t		next_agino;
 	int			error;
 
+	trace_xfs_iunlink_remove(ip);
+
+	pag = xfs_perag_get(mp, agno);
+	mutex_lock(&pag->pag_unlinked_mutex);
+
+	/* see the comment in xfs_iunlink() on the only proper locking order */
 	if (ip->i_prev_unlinked == NULLAGINO) {
-		/* remove from head of list */
-		if (next_agino == agino ||
-		    !xfs_verify_agino_or_null(mp, agno, next_agino))
-			return -EFSCORRUPTED;
+		struct xfs_buf	*agibp;
 
-		error = xfs_iunlink_update_bucket(tp, agno, agibp, agino, next_agino);
-		if (error)
+		mutex_unlock(&pag->pag_unlinked_mutex);
+
+		error = xfs_read_agi(mp, tp, agno, &agibp);
+		if (error) {
+			xfs_perag_put(pag);
 			return error;
-	} else {
-		/* lookup previous inode and update to point at next */
-		struct xfs_inode	*pip;
+		}
 
-		pip = xfs_iunlink_lookup_prev(agibp->b_pag, ip);
-		if (IS_ERR_OR_NULL(pip))
-			return -EFSCORRUPTED;
+		mutex_lock(&pag->pag_unlinked_mutex);
+		if (ip->i_prev_unlinked == NULLAGINO) {
+			struct xfs_agi	*agi;
 
-		if (pip->i_next_unlinked != agino) {
-			xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__,
-						NULL, 0, __this_address);
-			return -EFSCORRUPTED;
+			next_agino = ip->i_next_unlinked;
+			agi = agibp->b_addr;
+
+			/* remove from head of list */
+			if (next_agino == agino ||
+			    !xfs_verify_agino_or_null(mp, agno, next_agino))
+				goto err_fscorrupted;
+
+			error = xfs_iunlink_update_bucket(tp, agno,
+				agibp, agino, next_agino);
+			if (error)
+				goto err_out;
+
+			goto next_fixup;
 		}
+	}
 
-		/* update the on disk inode now */
-		pip->i_next_unlinked = next_agino;
-		xfs_iunlink_log(tp, pip);
+	next_agino = ip->i_next_unlinked;
+
+	/* lookup previous inode and update to point at next */
+	pip = xfs_iunlink_lookup_prev(pag, ip);
+	if (IS_ERR_OR_NULL(pip))
+		goto err_fscorrupted;
+
+	if (pip->i_next_unlinked != agino) {
+		xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__,
+					NULL, 0, __this_address);
+		goto err_fscorrupted;
 	}
+	pip->i_next_unlinked = next_agino;
+	xfs_iunlink_log(tp, pip);
 
-	/* lookup the next inode and update to point at prev */
-	if (ip->i_next_unlinked != NULLAGINO) {
+next_fixup:
+	if (next_agino == NULLAGINO) {
+		/* so iunlink recovery can work here */
+		if (pag->pag_unlinked_tail == ip)
+			pag->pag_unlinked_tail = pip;
+	} else {
+		/* lookup the next inode and update to point at prev */
 		struct xfs_inode	*nip;
 
-		nip = xfs_iunlink_lookup_next(agibp->b_pag, ip);
+		nip = xfs_iunlink_lookup_next(pag, ip);
 		if (IS_ERR_OR_NULL(nip))
-			return -EFSCORRUPTED;
+			goto err_fscorrupted;
 
 		if (nip->i_prev_unlinked != agino) {
 			xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__,
 						NULL, 0, __this_address);
-			return -EFSCORRUPTED;
+			goto err_fscorrupted;
 		}
 		/* in memory update only */
 		nip->i_prev_unlinked = ip->i_prev_unlinked;
 	}
+	mutex_unlock(&pag->pag_unlinked_mutex);
+	xfs_perag_put(pag);
 
+	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
 	/*
 	 * Now clear prev/next from this inode and update on disk if we
 	 * need to clear the on-disk link.
 	 */
 	ip->i_prev_unlinked = NULLAGINO;
-	ip->i_next_unlinked = NULLAGINO;
-	if (next_agino != NULLAGINO)
+	if (next_agino != NULLAGINO) {
+		ip->i_next_unlinked = NULLAGINO;
 		xfs_iunlink_log(tp, ip);
+	}
 	return 0;
-}
 
-/*
- * This is called when the inode's link count has gone to 0 or we are creating
- * a tmpfile via O_TMPFILE.  The inode @ip must have nlink == 0.
- *
- * We place the on-disk inode on a list in the AGI.  It will be pulled from this
- * list when the inode is freed.
- */
-STATIC int
-xfs_iunlink(
-	struct xfs_trans	*tp,
-	struct xfs_inode	*ip)
-{
-	struct xfs_mount	*mp = tp->t_mountp;
-	struct xfs_buf		*agibp;
-	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
-	int			error;
-
-	ASSERT(ip->i_next_unlinked == NULLAGINO);
-	ASSERT(VFS_I(ip)->i_nlink == 0);
-	ASSERT(VFS_I(ip)->i_mode != 0);
-	trace_xfs_iunlink(ip);
-
-	/* Get the agi buffer first.  It ensures lock ordering on the list. */
-	error = xfs_read_agi(mp, tp, agno, &agibp);
-	if (error)
-		return error;
-
-	return xfs_iunlink_insert_inode(tp, agibp, ip);
-}
-
-/*
- * Pull the on-disk inode from the AGI unlinked list.
- */
-STATIC int
-xfs_iunlink_remove(
-	struct xfs_trans	*tp,
-	struct xfs_inode	*ip)
-{
-	struct xfs_mount	*mp = tp->t_mountp;
-	struct xfs_buf		*agibp;
-	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
-	int			error;
-
-	trace_xfs_iunlink_remove(ip);
-
-	/* Get the agi buffer first.  It ensures lock ordering on the list. */
-	error = xfs_read_agi(mp, tp, agno, &agibp);
-	if (error)
-		return error;
-
-	return xfs_iunlink_remove_inode(tp, agibp, ip);
+err_fscorrupted:
+	error = -EFSCORRUPTED;
+err_out:
+	mutex_unlock(&pag->pag_unlinked_mutex);
+	xfs_perag_put(pag);
+	return error;
 }
 
 /*
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index d47eea31c165..3f2739316424 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -2801,6 +2801,11 @@ xlog_recover_iunlink_ag(
 					prev_ip->i_next_unlinked = NULLAGINO;
 				break;
 			}
+
+			/* XXX: take pag_unlinked_mutex across the loop? */
+			if (!bucket)
+				agibp->b_pag->pag_unlinked_tail = ip;
+
 			if (prev_ip) {
 				ip->i_prev_unlinked = prev_agino;
 				xfs_irele(prev_ip);
@@ -2812,6 +2817,7 @@ xlog_recover_iunlink_ag(
 		}
 		if (prev_ip)
 			xfs_irele(prev_ip);
+
 		if (error) {
 			/*
 			 * We can't read an inode this bucket points to, or an
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 031e96ff022d..a9701f863e84 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -223,6 +223,9 @@ xfs_initialize_perag(
 		if (first_initialised == NULLAGNUMBER)
 			first_initialised = index;
 		spin_lock_init(&pag->pag_state_lock);
+
+		mutex_init(&pag->pag_unlinked_mutex);
+		pag->pag_unlinked_tail = NULL;
 	}
 
 	index = xfs_set_inode_alloc(mp, agcount);
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index a72cfcaa4ad1..107a634a34f7 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -372,6 +372,9 @@ typedef struct xfs_perag {
 	/* reference count */
 	uint8_t			pagf_refcount_level;
 
+	struct mutex		pag_unlinked_mutex;
+	struct xfs_inode	*pag_unlinked_tail;
+
 	/*
 	 * Unlinked inode information.  This incore information reflects
 	 * data stored in the AGI, so callers must hold the AGI buffer lock
-- 
2.18.1


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 2/2] xfs: don't access AGI on unlinked inodes if it can
  2020-07-07 13:57 ` [RFC PATCH 2/2] xfs: don't access AGI on unlinked inodes if it can Gao Xiang
@ 2020-07-08 17:03   ` Darrick J. Wong
  2020-07-08 23:40     ` Gao Xiang
  2020-07-08 23:33   ` Dave Chinner
  1 sibling, 1 reply; 17+ messages in thread
From: Darrick J. Wong @ 2020-07-08 17:03 UTC (permalink / raw)
  To: Gao Xiang; +Cc: linux-xfs, Dave Chinner, Brian Foster

On Tue, Jul 07, 2020 at 09:57:41PM +0800, Gao Xiang wrote:
> Currently, we use AGI buffer lock to protect in-memory linked list for
> unlinked inodes but since it's not necessary to modify AGI unless the
> head of the unlinked list is modified. So let's removing the AGI buffer
> modification dependency if possible, including 1) adding another per-AG
> dedicated lock to protect the whole list and 2) inserting unlinked
> inodes from tail.
> 
> For 2), the tail of bucket 0 is now recorded in perag for xfs_iunlink()
> to use. xfs_iunlink_remove() still support old multiple short bucket
> lists for recovery code.
> 
> Note that some paths take AGI lock in its transaction in advance,
> so the proper locking order is only AGI lock -> unlinked list lock.

How much agi buffer lock contention does this eliminate?

> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
> ---
>  fs/xfs/xfs_inode.c       | 251 ++++++++++++++++++++-------------------
>  fs/xfs/xfs_log_recover.c |   6 +
>  fs/xfs/xfs_mount.c       |   3 +
>  fs/xfs/xfs_mount.h       |   3 +
>  4 files changed, 144 insertions(+), 119 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 10565fa5ace4..d33e5b198534 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -1994,182 +1994,195 @@ xfs_iunlink_update_bucket(
>  }
>  
>  /*
> - * Always insert at the head, so we only have to do a next inode lookup to
> - * update it's prev pointer. The AGI bucket will point at the one we are
> - * inserting.
> + * This is called when the inode's link count has gone to 0 or we are creating
> + * a tmpfile via O_TMPFILE.  The inode @ip must have nlink == 0.
> + *
> + * We place the on-disk inode on a list in the AGI.  It will be pulled from this
> + * list when the inode is freed.
>   */
> -static int
> -xfs_iunlink_insert_inode(
> +STATIC int
> +xfs_iunlink(
>  	struct xfs_trans	*tp,
> -	struct xfs_buf		*agibp,
>  	struct xfs_inode	*ip)
>  {
>  	struct xfs_mount	*mp = tp->t_mountp;
> -	struct xfs_agi		*agi = agibp->b_addr;
> -	xfs_agino_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
> +	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
> +	struct xfs_perag	*pag;
> +	struct xfs_inode	*pip;
>  	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
> -	xfs_agino_t		next_agino;
> +	int			error;
> +
> +	ASSERT(VFS_I(ip)->i_nlink == 0);
> +	ASSERT(VFS_I(ip)->i_mode != 0);
> +	trace_xfs_iunlink(ip);
> +
> +	pag = xfs_perag_get(mp, agno);
>  
>  	/*
> -	 * Get the index into the agi hash table for the list this inode will
> -	 * go on.  Make sure the pointer isn't garbage and that this inode
> -	 * isn't already on the list.
> +	 * some paths (e.g. xfs_create_tmpfile) could take AGI lock
> +	 * in this transaction in advance and the only locking order
> +	 * AGI buf lock -> pag_unlinked_mutex is safe.
>  	 */
> -	next_agino = be32_to_cpu(agi->agi_unlinked[0]);
> -	if (next_agino == agino ||
> -	    !xfs_verify_agino_or_null(mp, agno, next_agino)) {
> -		xfs_buf_mark_corrupt(agibp);
> -		return -EFSCORRUPTED;
> -	}
> -
> -	ip->i_prev_unlinked = NULLAGINO;
> -	ip->i_next_unlinked = next_agino;
> -	if (ip->i_next_unlinked != NULLAGINO) {
> -		struct xfs_inode	*nip;
> +	mutex_lock(&pag->pag_unlinked_mutex);
> +	pip = pag->pag_unlinked_tail;
> +	if (!pip) {
> +		struct xfs_buf	*agibp;
>  
> -		nip = xfs_iunlink_lookup_next(agibp->b_pag, ip);
> -		if (IS_ERR_OR_NULL(nip))
> -			return -EFSCORRUPTED;
> +		mutex_unlock(&pag->pag_unlinked_mutex);
>  
> -		if (nip->i_prev_unlinked != NULLAGINO) {
> -			xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__,
> -						NULL, 0, __this_address);
> -			return -EFSCORRUPTED;
> +		/*
> +		 * there could be some race, but it doesn't matter though
> +		 * since !pip doesn't happen frequently.
> +		 */
> +		error = xfs_read_agi(mp, tp, agno, &agibp);
> +		if (error) {
> +			xfs_perag_put(pag);
> +			return error;
>  		}
> -		nip->i_prev_unlinked = agino;
>  
> -		/* update the on disk inode now */
> -		xfs_iunlink_log(tp, ip);
> -	}
> +		mutex_lock(&pag->pag_unlinked_mutex);
> +		pip = pag->pag_unlinked_tail;
> +		if (!pip) {
> +			struct xfs_agi	*agi;
> +
> +			ip->i_prev_unlinked = NULLAGINO;
>  
> -	/* Point the head of the list to point to this inode. */
> -	return xfs_iunlink_update_bucket(tp, agno, agibp, next_agino, agino);
> +			agi = agibp->b_addr;
> +			ASSERT(be32_to_cpu(agi->agi_unlinked[0]) == NULLAGINO);
> +
> +			/* Point the head of the list to point to this inode. */
> +			error = xfs_iunlink_update_bucket(tp, agno,
> +					agibp, NULLAGINO, agino);
> +			goto out;
> +		}
> +	}
> +	ip->i_prev_unlinked = XFS_INO_TO_AGINO(mp, pip->i_ino);
> +	ASSERT(pip->i_next_unlinked == NULLAGINO);
> +	pip->i_next_unlinked = agino;
> +	xfs_iunlink_log(tp, pip);
> +out:
> +	ip->i_next_unlinked = NULLAGINO;
> +	pag->pag_unlinked_tail = ip;
> +	mutex_unlock(&pag->pag_unlinked_mutex);
> +	xfs_perag_put(pag);
> +	return 0;
>  }
>  
>  /*
> + * Pull the on-disk inode from the AGI unlinked list.
> + *
>   * Remove can be from anywhere in the list, so we have to do two adjacent inode
>   * lookups here so we can update list pointers. We may be at the head or the
>   * tail of the list, so we have to handle those cases as well.
>   */
> -static int
> -xfs_iunlink_remove_inode(
> +STATIC int
> +xfs_iunlink_remove(
>  	struct xfs_trans	*tp,
> -	struct xfs_buf		*agibp,
>  	struct xfs_inode	*ip)
>  {
>  	struct xfs_mount	*mp = tp->t_mountp;
> -	xfs_agino_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
> +	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
> +	struct xfs_perag	*pag;
> +	struct xfs_inode	*pip = NULL;
>  	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
> -	xfs_agino_t		next_agino = ip->i_next_unlinked;
> +	xfs_agino_t		next_agino;
>  	int			error;
>  
> +	trace_xfs_iunlink_remove(ip);
> +
> +	pag = xfs_perag_get(mp, agno);
> +	mutex_lock(&pag->pag_unlinked_mutex);
> +
> +	/* see the comment in xfs_iunlink() on the only proper locking order */
>  	if (ip->i_prev_unlinked == NULLAGINO) {
> -		/* remove from head of list */
> -		if (next_agino == agino ||
> -		    !xfs_verify_agino_or_null(mp, agno, next_agino))
> -			return -EFSCORRUPTED;
> +		struct xfs_buf	*agibp;
>  
> -		error = xfs_iunlink_update_bucket(tp, agno, agibp, agino, next_agino);
> -		if (error)
> +		mutex_unlock(&pag->pag_unlinked_mutex);
> +
> +		error = xfs_read_agi(mp, tp, agno, &agibp);
> +		if (error) {
> +			xfs_perag_put(pag);
>  			return error;
> -	} else {
> -		/* lookup previous inode and update to point at next */
> -		struct xfs_inode	*pip;
> +		}
>  
> -		pip = xfs_iunlink_lookup_prev(agibp->b_pag, ip);
> -		if (IS_ERR_OR_NULL(pip))
> -			return -EFSCORRUPTED;
> +		mutex_lock(&pag->pag_unlinked_mutex);
> +		if (ip->i_prev_unlinked == NULLAGINO) {
> +			struct xfs_agi	*agi;
>  
> -		if (pip->i_next_unlinked != agino) {
> -			xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__,
> -						NULL, 0, __this_address);
> -			return -EFSCORRUPTED;
> +			next_agino = ip->i_next_unlinked;
> +			agi = agibp->b_addr;
> +
> +			/* remove from head of list */
> +			if (next_agino == agino ||
> +			    !xfs_verify_agino_or_null(mp, agno, next_agino))
> +				goto err_fscorrupted;
> +
> +			error = xfs_iunlink_update_bucket(tp, agno,
> +				agibp, agino, next_agino);
> +			if (error)
> +				goto err_out;
> +
> +			goto next_fixup;
>  		}
> +	}
>  
> -		/* update the on disk inode now */
> -		pip->i_next_unlinked = next_agino;
> -		xfs_iunlink_log(tp, pip);
> +	next_agino = ip->i_next_unlinked;
> +
> +	/* lookup previous inode and update to point at next */
> +	pip = xfs_iunlink_lookup_prev(pag, ip);
> +	if (IS_ERR_OR_NULL(pip))
> +		goto err_fscorrupted;
> +
> +	if (pip->i_next_unlinked != agino) {
> +		xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__,
> +					NULL, 0, __this_address);
> +		goto err_fscorrupted;
>  	}
> +	pip->i_next_unlinked = next_agino;
> +	xfs_iunlink_log(tp, pip);
>  
> -	/* lookup the next inode and update to point at prev */
> -	if (ip->i_next_unlinked != NULLAGINO) {
> +next_fixup:
> +	if (next_agino == NULLAGINO) {
> +		/* so iunlink recovery can work here */
> +		if (pag->pag_unlinked_tail == ip)
> +			pag->pag_unlinked_tail = pip;
> +	} else {
> +		/* lookup the next inode and update to point at prev */
>  		struct xfs_inode	*nip;
>  
> -		nip = xfs_iunlink_lookup_next(agibp->b_pag, ip);
> +		nip = xfs_iunlink_lookup_next(pag, ip);
>  		if (IS_ERR_OR_NULL(nip))
> -			return -EFSCORRUPTED;
> +			goto err_fscorrupted;
>  
>  		if (nip->i_prev_unlinked != agino) {
>  			xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__,
>  						NULL, 0, __this_address);
> -			return -EFSCORRUPTED;
> +			goto err_fscorrupted;
>  		}
>  		/* in memory update only */
>  		nip->i_prev_unlinked = ip->i_prev_unlinked;
>  	}
> +	mutex_unlock(&pag->pag_unlinked_mutex);
> +	xfs_perag_put(pag);
>  
> +	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
>  	/*
>  	 * Now clear prev/next from this inode and update on disk if we
>  	 * need to clear the on-disk link.
>  	 */
>  	ip->i_prev_unlinked = NULLAGINO;
> -	ip->i_next_unlinked = NULLAGINO;
> -	if (next_agino != NULLAGINO)
> +	if (next_agino != NULLAGINO) {
> +		ip->i_next_unlinked = NULLAGINO;
>  		xfs_iunlink_log(tp, ip);
> +	}
>  	return 0;
> -}
>  
> -/*
> - * This is called when the inode's link count has gone to 0 or we are creating
> - * a tmpfile via O_TMPFILE.  The inode @ip must have nlink == 0.
> - *
> - * We place the on-disk inode on a list in the AGI.  It will be pulled from this
> - * list when the inode is freed.
> - */
> -STATIC int
> -xfs_iunlink(
> -	struct xfs_trans	*tp,
> -	struct xfs_inode	*ip)
> -{
> -	struct xfs_mount	*mp = tp->t_mountp;
> -	struct xfs_buf		*agibp;
> -	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
> -	int			error;
> -
> -	ASSERT(ip->i_next_unlinked == NULLAGINO);
> -	ASSERT(VFS_I(ip)->i_nlink == 0);
> -	ASSERT(VFS_I(ip)->i_mode != 0);
> -	trace_xfs_iunlink(ip);
> -
> -	/* Get the agi buffer first.  It ensures lock ordering on the list. */
> -	error = xfs_read_agi(mp, tp, agno, &agibp);
> -	if (error)
> -		return error;
> -
> -	return xfs_iunlink_insert_inode(tp, agibp, ip);
> -}
> -
> -/*
> - * Pull the on-disk inode from the AGI unlinked list.
> - */
> -STATIC int
> -xfs_iunlink_remove(
> -	struct xfs_trans	*tp,
> -	struct xfs_inode	*ip)
> -{
> -	struct xfs_mount	*mp = tp->t_mountp;
> -	struct xfs_buf		*agibp;
> -	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
> -	int			error;
> -
> -	trace_xfs_iunlink_remove(ip);
> -
> -	/* Get the agi buffer first.  It ensures lock ordering on the list. */
> -	error = xfs_read_agi(mp, tp, agno, &agibp);
> -	if (error)
> -		return error;
> -
> -	return xfs_iunlink_remove_inode(tp, agibp, ip);
> +err_fscorrupted:
> +	error = -EFSCORRUPTED;
> +err_out:
> +	mutex_unlock(&pag->pag_unlinked_mutex);
> +	xfs_perag_put(pag);
> +	return error;
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index d47eea31c165..3f2739316424 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -2801,6 +2801,11 @@ xlog_recover_iunlink_ag(
>  					prev_ip->i_next_unlinked = NULLAGINO;
>  				break;
>  			}
> +
> +			/* XXX: take pag_unlinked_mutex across the loop? */
> +			if (!bucket)
> +				agibp->b_pag->pag_unlinked_tail = ip;
> +
>  			if (prev_ip) {
>  				ip->i_prev_unlinked = prev_agino;
>  				xfs_irele(prev_ip);
> @@ -2812,6 +2817,7 @@ xlog_recover_iunlink_ag(
>  		}
>  		if (prev_ip)
>  			xfs_irele(prev_ip);
> +
>  		if (error) {
>  			/*
>  			 * We can't read an inode this bucket points to, or an
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index 031e96ff022d..a9701f863e84 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -223,6 +223,9 @@ xfs_initialize_perag(
>  		if (first_initialised == NULLAGNUMBER)
>  			first_initialised = index;
>  		spin_lock_init(&pag->pag_state_lock);
> +
> +		mutex_init(&pag->pag_unlinked_mutex);
> +		pag->pag_unlinked_tail = NULL;
>  	}
>  
>  	index = xfs_set_inode_alloc(mp, agcount);
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index a72cfcaa4ad1..107a634a34f7 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -372,6 +372,9 @@ typedef struct xfs_perag {
>  	/* reference count */
>  	uint8_t			pagf_refcount_level;
>  
> +	struct mutex		pag_unlinked_mutex;
> +	struct xfs_inode	*pag_unlinked_tail;

What do these fields do?  There aren't any comments....

--D

> +
>  	/*
>  	 * Unlinked inode information.  This incore information reflects
>  	 * data stored in the AGI, so callers must hold the AGI buffer lock
> -- 
> 2.18.1
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 1/2] xfs: arrange all unlinked inodes into one list
  2020-07-07 13:57 ` [RFC PATCH 1/2] xfs: arrange all unlinked inodes into one list Gao Xiang
@ 2020-07-08 22:33   ` Dave Chinner
  2020-07-09  0:17     ` Gao Xiang
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2020-07-08 22:33 UTC (permalink / raw)
  To: Gao Xiang; +Cc: linux-xfs, Darrick J. Wong, Brian Foster

On Tue, Jul 07, 2020 at 09:57:40PM +0800, Gao Xiang wrote:
> There is no need to keep old multiple short unlink inode buckets
> since we have an in-memory double linked list for all unlinked
> inodes.
> 
> Apart from the perspective of the necessity, the main advantage
> is that the log and AGI update can be reduced since each AG has
> the only one head now, which is implemented in the following patch.
> 
> Therefore, this patch applies the new way in xfs_iunlink() and
> keep the old approach in xfs_iunlink_remove_inode() path as well
> so inode eviction can still work properly in recovery.
> 
> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
> ---
>  fs/xfs/xfs_inode.c | 40 ++++++++++++++++++++--------------------
>  1 file changed, 20 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index ab288424764c..10565fa5ace4 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -33,6 +33,7 @@
>  #include "xfs_symlink.h"
>  #include "xfs_trans_priv.h"
>  #include "xfs_log.h"
> +#include "xfs_log_priv.h"
>  #include "xfs_bmap_btree.h"
>  #include "xfs_reflink.h"
>  #include "xfs_iunlink_item.h"
> @@ -1955,25 +1956,32 @@ xfs_iunlink_update_bucket(
>  	struct xfs_trans	*tp,
>  	xfs_agnumber_t		agno,
>  	struct xfs_buf		*agibp,
> -	unsigned int		bucket_index,
> +	xfs_agino_t		old_agino,
>  	xfs_agino_t		new_agino)
>  {
> +	struct xlog		*log = tp->t_mountp->m_log;
>  	struct xfs_agi		*agi = agibp->b_addr;
>  	xfs_agino_t		old_value;
> -	int			offset;
> +	unsigned int		bucket_index;
> +	int                     offset;
>  
>  	ASSERT(xfs_verify_agino_or_null(tp->t_mountp, agno, new_agino));
>  
> +	bucket_index = 0;
> +	/* During recovery, the old multiple bucket index can be applied */
> +	if (!log || log->l_flags & XLOG_RECOVERY_NEEDED) {
> +		ASSERT(old_agino != NULLAGINO);
> +
> +		if (be32_to_cpu(agi->agi_unlinked[0]) != old_agino)
> +			bucket_index = old_agino % XFS_AGI_UNLINKED_BUCKETS;
> +	}

Ok, so you are doing this because you changed the function to pass
in an agino rather than a bucket index from the caller context. So
now you have to look up a structure to determine what the caller
context was to determine what the bucket index we need to use is.

Seems like we probably should have kept passing in the bucket index
from the caller because that's where the knowledge of what bucket we
need to update comes from?

And in that case, the higher level code should be checking for the
log recovery case when selecting the bucket, not hiding it deep in
the guts of the code here....

> +
>  	old_value = be32_to_cpu(agi->agi_unlinked[bucket_index]);
>  	trace_xfs_iunlink_update_bucket(tp->t_mountp, agno, bucket_index,
>  			old_value, new_agino);
>  
> -	/*
> -	 * We should never find the head of the list already set to the value
> -	 * passed in because either we're adding or removing ourselves from the
> -	 * head of the list.
> -	 */
> -	if (old_value == new_agino) {
> +	/* check if the old agi_unlinked head is as expected */
> +	if (old_value != old_agino) {
>  		xfs_buf_mark_corrupt(agibp);
>  		return -EFSCORRUPTED;
>  	}

This looks like a change of behaviour - it no longer checks against
the inode we are about to add/remove from the list, but instead
checks that old inode is what we found on the list. We're not
concerned that what we found on the list matches what the caller
found on the list and passed us - we're concerned about doing a
double add/remove of the current inode...

> @@ -2001,14 +2009,13 @@ xfs_iunlink_insert_inode(
>  	xfs_agino_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
>  	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
>  	xfs_agino_t		next_agino;
> -	short			bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
>  
>  	/*
>  	 * Get the index into the agi hash table for the list this inode will
>  	 * go on.  Make sure the pointer isn't garbage and that this inode
>  	 * isn't already on the list.
>  	 */
> -	next_agino = be32_to_cpu(agi->agi_unlinked[bucket_index]);
> +	next_agino = be32_to_cpu(agi->agi_unlinked[0]);
>  	if (next_agino == agino ||
>  	    !xfs_verify_agino_or_null(mp, agno, next_agino)) {
>  		xfs_buf_mark_corrupt(agibp);
> @@ -2036,7 +2043,7 @@ xfs_iunlink_insert_inode(
>  	}
>  
>  	/* Point the head of the list to point to this inode. */
> -	return xfs_iunlink_update_bucket(tp, agno, agibp, bucket_index, agino);
> +	return xfs_iunlink_update_bucket(tp, agno, agibp, next_agino, agino);
>  }
>  
>  /*
> @@ -2051,27 +2058,20 @@ xfs_iunlink_remove_inode(
>  	struct xfs_inode	*ip)
>  {
>  	struct xfs_mount	*mp = tp->t_mountp;
> -	struct xfs_agi		*agi = agibp->b_addr;
>  	xfs_agino_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
>  	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
>  	xfs_agino_t		next_agino = ip->i_next_unlinked;
> -	short			bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
>  	int			error;
>  
>  	if (ip->i_prev_unlinked == NULLAGINO) {
>  		/* remove from head of list */
> -		if (be32_to_cpu(agi->agi_unlinked[bucket_index]) != agino) {
> -			xfs_buf_mark_corrupt(agibp);
> -			return -EFSCORRUPTED;
> -		}
>  		if (next_agino == agino ||
>  		    !xfs_verify_agino_or_null(mp, agno, next_agino))
>  			return -EFSCORRUPTED;
>  
> -		error = xfs_iunlink_update_bucket(tp, agno, agibp,
> -					bucket_index, next_agino);
> +		error = xfs_iunlink_update_bucket(tp, agno, agibp, agino, next_agino);
>  		if (error)
> -			return -EFSCORRUPTED;
> +			return error;

i.e. this is the point where we know we need to remove from the head
of the unlinked list and all the bucket selection and verification
should probably remain here...

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 2/2] xfs: don't access AGI on unlinked inodes if it can
  2020-07-07 13:57 ` [RFC PATCH 2/2] xfs: don't access AGI on unlinked inodes if it can Gao Xiang
  2020-07-08 17:03   ` Darrick J. Wong
@ 2020-07-08 23:33   ` Dave Chinner
  2020-07-09  0:55     ` Gao Xiang
  1 sibling, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2020-07-08 23:33 UTC (permalink / raw)
  To: Gao Xiang; +Cc: linux-xfs, Darrick J. Wong, Brian Foster

On Tue, Jul 07, 2020 at 09:57:41PM +0800, Gao Xiang wrote:
> Currently, we use AGI buffer lock to protect in-memory linked list for
> unlinked inodes but since it's not necessary to modify AGI unless the
> head of the unlinked list is modified. So let's removing the AGI buffer
> modification dependency if possible, including 1) adding another per-AG
> dedicated lock to protect the whole list and 2) inserting unlinked
> inodes from tail.
> 
> For 2), the tail of bucket 0 is now recorded in perag for xfs_iunlink()
> to use. xfs_iunlink_remove() still support old multiple short bucket
> lists for recovery code.

I would split this into two separate patches. One to move to a perag
based locking strategy, another to change from head to tail
addition as they are largely independent algorithmic changes.

> Note that some paths take AGI lock in its transaction in advance,
> so the proper locking order is only AGI lock -> unlinked list lock.

These paths should be documented in the commit message as well
as in code comments so the reviewer is aware of those code paths
and can verify that your assumptions about locking order are
correct.

> 
> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
> ---
>  fs/xfs/xfs_inode.c       | 251 ++++++++++++++++++++-------------------
>  fs/xfs/xfs_log_recover.c |   6 +
>  fs/xfs/xfs_mount.c       |   3 +
>  fs/xfs/xfs_mount.h       |   3 +
>  4 files changed, 144 insertions(+), 119 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 10565fa5ace4..d33e5b198534 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -1994,182 +1994,195 @@ xfs_iunlink_update_bucket(
>  }
>  
>  /*
> - * Always insert at the head, so we only have to do a next inode lookup to
> - * update it's prev pointer. The AGI bucket will point at the one we are
> - * inserting.
> + * This is called when the inode's link count has gone to 0 or we are creating
> + * a tmpfile via O_TMPFILE.  The inode @ip must have nlink == 0.
> + *
> + * We place the on-disk inode on a list in the AGI.  It will be pulled from this
> + * list when the inode is freed.
>   */
> -static int
> -xfs_iunlink_insert_inode(

Hmmm. Flattening the code also make the patch harder to follow as it
combines code movement/rearrangement with algorithmic changes.  We
try to separate out code movement/rearrangement into their own
patches so that the movement is easy to verify by itself.

Also, the helper functions help document the separation of the
unlinked list manipulations from the from setup and locking
requirements for the list manipulations, and this largely undoes all
that. I added these helpers because it completely untangled the mess
that was present before the RFC patchset I posted. THat is, I
couldn't easily modify the existing code because it interleaved
the locking, the backref hash manipulations and
the on-disk list manipulations in ways I found difficult to
understand and manage. Short, simple, clear functions are much
better than long, multiple operation functions...

i.e. this:

xfs_iunlink()
{

	get locks
	do list insert
	drop locks
}

Is better for understanding, maintenance and future modification
than:

xfs_iunlink()
{

	get perag
	lock perag
	look at tail of list
	if (empty) {
		unlock perag
		read/lock AGI
		lock perag
		look at tail of list
		if (empty)
			do head insert
			goto out
	}
	do tail insert
out:
	update inode/pag tails
	unlock
	drop perag
}

It's trivial for a reader to understand what the first version of
xfs_iunlink() is going to do without needing to understand the
intraccies of the locking strategies. However, it takes time and
effort to undestand exactly waht the second one is doing because
it's not clear where lock ends and list modifications start, nor
what the locking rules are for the different modifications that are
being made. Essentially, it goes back to the complex
locking-intertwined-with-modification-algorithm problem the current
TOT code has.

I'd much prefer to see something like this:

/*
 * Inode allocation in the O_TMPFILE path defines the AGI/unlinked
 * list lock order as being AGI->perag unlinked list lock. We are
 * inverting it here as the fast path tail addition does not need to
 * modify the AGI at all. Hence we only need the AGI lock if the
 * tail is empty, but if we fail to get it without blocking then we
 * need to fall back to the slower, correct lock order.
 */
xfs_iunlink_insert_lock()
{
	get perag;
	lock_perag();
	if (!tail empty)
		return;
	if (trylock AGI)
		return;

	/*
	 * Slow path, need to lock AGI first. Don't even bother
	 * rechecking tail pointers or trying to optimise for
	 * minimal AGI lock hold time as racing unlink list mods
	 * will all block on the perag lock once we take that. They
	 * will then hit the !tail empty fast path and not require
	 * the AGI lock at all.
	 */
	lock AGI
	lock_perag()
	return;
}

The non-AGI locking fast path is slightly different in the remove
case, so we'll have a slightly different helper function in that
case which checks where the inode being removed is in the list.

In both cases, though, the unlock should be the same:

xfs_iunlink_unlock()
{
	/* Does not unlock AGI, ever. commit does that. */
	unlock perag
	put perag
}

This keeps the list locking completely separate from the list
manipulations and allows us to document the locking constraints and
reasons for why it is or isn't optimised for specific conditions
without cluttering up the list manipulations code.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 2/2] xfs: don't access AGI on unlinked inodes if it can
  2020-07-08 17:03   ` Darrick J. Wong
@ 2020-07-08 23:40     ` Gao Xiang
  0 siblings, 0 replies; 17+ messages in thread
From: Gao Xiang @ 2020-07-08 23:40 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, Dave Chinner, Brian Foster

Hi Darrick,

On Wed, Jul 08, 2020 at 10:03:07AM -0700, Darrick J. Wong wrote:
> On Tue, Jul 07, 2020 at 09:57:41PM +0800, Gao Xiang wrote:
> > Currently, we use AGI buffer lock to protect in-memory linked list for
> > unlinked inodes but since it's not necessary to modify AGI unless the
> > head of the unlinked list is modified. So let's removing the AGI buffer
> > modification dependency if possible, including 1) adding another per-AG
> > dedicated lock to protect the whole list and 2) inserting unlinked
> > inodes from tail.
> > 
> > For 2), the tail of bucket 0 is now recorded in perag for xfs_iunlink()
> > to use. xfs_iunlink_remove() still support old multiple short bucket
> > lists for recovery code.
> > 
> > Note that some paths take AGI lock in its transaction in advance,
> > so the proper locking order is only AGI lock -> unlinked list lock.
> 
> How much agi buffer lock contention does this eliminate?

Sorry, I haven't got the point.

Did you mean I should show some number in the commit message from some workload?
I'm not sure if you mean that, it depends on the specific workload indeed. I
could do that if needed (it's of much help if some more hints here...)

Or I'm not sure if you mean the deadlock. There are some exist path I've found
when running fsstress, for example, xfs_create_tmpfile() will reading/locking
AGI when preparing an orphan inode in this transaction. So if I reading AGI
again under a new lock unconditionally, it should cause ABBA deadlock (since
another path takes mutex first and locks AGI buffer again).

> 
> > @@ -372,6 +372,9 @@ typedef struct xfs_perag {
> >  	/* reference count */
> >  	uint8_t			pagf_refcount_level;
> >  
> > +	struct mutex		pag_unlinked_mutex;
> > +	struct xfs_inode	*pag_unlinked_tail;
> 
> What do these fields do?  There aren't any comments....

Will add some comments in the next version, Thanks for the advice!

Thanks,
Gao Xiang

> 
> --D
> 
> > +
> >  	/*
> >  	 * Unlinked inode information.  This incore information reflects
> >  	 * data stored in the AGI, so callers must hold the AGI buffer lock
> > -- 
> > 2.18.1
> > 
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 1/2] xfs: arrange all unlinked inodes into one list
  2020-07-08 22:33   ` Dave Chinner
@ 2020-07-09  0:17     ` Gao Xiang
  0 siblings, 0 replies; 17+ messages in thread
From: Gao Xiang @ 2020-07-09  0:17 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, Darrick J. Wong, Brian Foster

Hi Dave,

On Thu, Jul 09, 2020 at 08:33:26AM +1000, Dave Chinner wrote:
> >  
> > +	bucket_index = 0;
> > +	/* During recovery, the old multiple bucket index can be applied */
> > +	if (!log || log->l_flags & XLOG_RECOVERY_NEEDED) {
> > +		ASSERT(old_agino != NULLAGINO);
> > +
> > +		if (be32_to_cpu(agi->agi_unlinked[0]) != old_agino)
> > +			bucket_index = old_agino % XFS_AGI_UNLINKED_BUCKETS;
> > +	}
> 
> Ok, so you are doing this because you changed the function to pass
> in an agino rather than a bucket index from the caller context. So
> now you have to look up a structure to determine what the caller
> context was to determine what the bucket index we need to use is.
> 
> Seems like we probably should have kept passing in the bucket index
> from the caller because that's where the knowledge of what bucket we
> need to update comes from?

My thought is since bucket_index is now fixed as 0 except for the
determinated recovery path, so I think there is no need for any
exist callers or future callers to specify some bucket number.

The old formula is only used for xfs_iunlink_remove_inode() when
recovering, so old_agino won't be NULLAGINO but indicate the old
bucket_index as well (since it removes the head unlinked inode.)
That is the only determinated path.

> 
> And in that case, the higher level code should be checking for the
> log recovery case when selecting the bucket, not hiding it deep in
> the guts of the code here....

I could do that instead, it means callers should obey more rules
to call this function. e.g bucket_index should obey "old_agino % XFS_AGI_UNLINKED_BUCKETS"
rule for the old way rather than passing any possible value.

but if we specify all callers do that, I think "if (old_value != old_agino)"
isn't needed here as well but instead add an ASSERT here since all callers
should know what they are doing now, and leave only some integral check here.

> 
> > +
> >  	old_value = be32_to_cpu(agi->agi_unlinked[bucket_index]);
> >  	trace_xfs_iunlink_update_bucket(tp->t_mountp, agno, bucket_index,
> >  			old_value, new_agino);
> >  
> > -	/*
> > -	 * We should never find the head of the list already set to the value
> > -	 * passed in because either we're adding or removing ourselves from the
> > -	 * head of the list.
> > -	 */
> > -	if (old_value == new_agino) {
> > +	/* check if the old agi_unlinked head is as expected */
> > +	if (old_value != old_agino) {
> >  		xfs_buf_mark_corrupt(agibp);
> >  		return -EFSCORRUPTED;
> >  	}
> 
> This looks like a change of behaviour - it no longer checks against
> the inode we are about to add/remove from the list, but instead
> checks that old inode is what we found on the list. We're not
> concerned that what we found on the list matches what the caller
> found on the list and passed us - we're concerned about doing a
> double add/remove of the current inode...

Just as I said above, this checks the bucket head validity instead
(since we get a bucket_index and the bucket head should be as
what we expect). It doesn't mean to check that old inode is what
we found on the list. So we could kill the original check in
xfs_iunlink_remove_inode() as well.

Anyway, I prefer this way since all callers won't know too much about
the bucket index. Since bucket index is much related to agino, it cannot
be specified without some rule, e.g. agino is 0 (just for a example,
should not possible), but a caller passes a bucket index 12. It doesn't
work so if we have a bucket_index argument, we have also to consider
bucket_index check in xfs_iunlink_update_bucket() just as the old
"if (old_value == new_agino) {" integral check as well. From this
perspective, xfs_iunlink_update_bucket() should also know the rule
as the proposed code does...

I could keep the old argument instead, that isn't too much for me.
I listed all my thoughts above.

Thanks,
Gao Xiang


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 2/2] xfs: don't access AGI on unlinked inodes if it can
  2020-07-08 23:33   ` Dave Chinner
@ 2020-07-09  0:55     ` Gao Xiang
  2020-07-09  2:32       ` Dave Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Gao Xiang @ 2020-07-09  0:55 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, Darrick J. Wong, Brian Foster

On Thu, Jul 09, 2020 at 09:33:11AM +1000, Dave Chinner wrote:
> On Tue, Jul 07, 2020 at 09:57:41PM +0800, Gao Xiang wrote:
> > Currently, we use AGI buffer lock to protect in-memory linked list for
> > unlinked inodes but since it's not necessary to modify AGI unless the
> > head of the unlinked list is modified. So let's removing the AGI buffer
> > modification dependency if possible, including 1) adding another per-AG
> > dedicated lock to protect the whole list and 2) inserting unlinked
> > inodes from tail.
> > 
> > For 2), the tail of bucket 0 is now recorded in perag for xfs_iunlink()
> > to use. xfs_iunlink_remove() still support old multiple short bucket
> > lists for recovery code.
> 
> I would split this into two separate patches. One to move to a perag
> based locking strategy, another to change from head to tail
> addition as they are largely independent algorithmic changes.

Yes, that is much better from the perspective of spilting patches and
I thought that before. It seems that is like 2 steps but the proposed
target solution is as a whole (in other words, 2 steps are code-related)
and I'm not sure how large these code is sharable or can be inherited
but rather than introduce some code in patch 2 and then remove immediately
and turn into a new code in patch 3). I'm not sure how large logic could
be sharable between these 2 dependent steps so I didn't do that.

I will spilt patches in the next RFC version to make a try.

> 
> > Note that some paths take AGI lock in its transaction in advance,
> > so the proper locking order is only AGI lock -> unlinked list lock.
> 
> These paths should be documented in the commit message as well
> as in code comments so the reviewer is aware of those code paths
> and can verify that your assumptions about locking order are
> correct.

Yeah, It has been documented in the following.

+xfs_iunlink(
 	struct xfs_trans	*tp,
-	struct xfs_buf		*agibp,
 	struct xfs_inode	*ip)
 {
 	struct xfs_mount	*mp = tp->t_mountp;
-	struct xfs_agi		*agi = agibp->b_addr;
-	xfs_agino_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
+	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
+	struct xfs_perag	*pag;
+	struct xfs_inode	*pip;
 	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
-	xfs_agino_t		next_agino;
+	int			error;
+
+	ASSERT(VFS_I(ip)->i_nlink == 0);
+	ASSERT(VFS_I(ip)->i_mode != 0);
+	trace_xfs_iunlink(ip);
+
+	pag = xfs_perag_get(mp, agno);
 
 	/*
-	 * Get the index into the agi hash table for the list this inode will
-	 * go on.  Make sure the pointer isn't garbage and that this inode
-	 * isn't already on the list.
+	 * some paths (e.g. xfs_create_tmpfile) could take AGI lock
+	 * in this transaction in advance and the only locking order
+	 * AGI buf lock -> pag_unlinked_mutex is safe.

> 
> > 
> > Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
> > ---
> >  fs/xfs/xfs_inode.c       | 251 ++++++++++++++++++++-------------------
> >  fs/xfs/xfs_log_recover.c |   6 +
> >  fs/xfs/xfs_mount.c       |   3 +
> >  fs/xfs/xfs_mount.h       |   3 +
> >  4 files changed, 144 insertions(+), 119 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index 10565fa5ace4..d33e5b198534 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -1994,182 +1994,195 @@ xfs_iunlink_update_bucket(
> >  }
> >  
> >  /*
> > - * Always insert at the head, so we only have to do a next inode lookup to
> > - * update it's prev pointer. The AGI bucket will point at the one we are
> > - * inserting.
> > + * This is called when the inode's link count has gone to 0 or we are creating
> > + * a tmpfile via O_TMPFILE.  The inode @ip must have nlink == 0.
> > + *
> > + * We place the on-disk inode on a list in the AGI.  It will be pulled from this
> > + * list when the inode is freed.
> >   */
> > -static int
> > -xfs_iunlink_insert_inode(
> 
> Hmmm. Flattening the code also make the patch harder to follow as it
> combines code movement/rearrangement with algorithmic changes.  We
> try to separate out code movement/rearrangement into their own
> patches so that the movement is easy to verify by itself.
> 
> Also, the helper functions help document the separation of the
> unlinked list manipulations from the from setup and locking
> requirements for the list manipulations, and this largely undoes all
> that. I added these helpers because it completely untangled the mess
> that was present before the RFC patchset I posted. THat is, I
> couldn't easily modify the existing code because it interleaved
> the locking, the backref hash manipulations and
> the on-disk list manipulations in ways I found difficult to
> understand and manage. Short, simple, clear functions are much
> better than long, multiple operation functions...
> 
> i.e. this:
> 
> xfs_iunlink()
> {
> 
> 	get locks
> 	do list insert
> 	drop locks
> }
> 
> Is better for understanding, maintenance and future modification
> than:
> 
> xfs_iunlink()
> {
> 
> 	get perag
> 	lock perag
> 	look at tail of list
> 	if (empty) {
> 		unlock perag
> 		read/lock AGI
> 		lock perag
> 		look at tail of list
> 		if (empty)
> 			do head insert
> 			goto out
> 	}
> 	do tail insert
> out:
> 	update inode/pag tails
> 	unlock
> 	drop perag
> }
> 
> It's trivial for a reader to understand what the first version of
> xfs_iunlink() is going to do without needing to understand the
> intraccies of the locking strategies. However, it takes time and
> effort to undestand exactly waht the second one is doing because
> it's not clear where lock ends and list modifications start, nor
> what the locking rules are for the different modifications that are
> being made. Essentially, it goes back to the complex
> locking-intertwined-with-modification-algorithm problem the current
> TOT code has.
> 
> I'd much prefer to see something like this:
> 
> /*
>  * Inode allocation in the O_TMPFILE path defines the AGI/unlinked
>  * list lock order as being AGI->perag unlinked list lock. We are
>  * inverting it here as the fast path tail addition does not need to
>  * modify the AGI at all. Hence we only need the AGI lock if the
>  * tail is empty, but if we fail to get it without blocking then we
>  * need to fall back to the slower, correct lock order.
>  */
> xfs_iunlink_insert_lock()
> {
> 	get perag;
> 	lock_perag();
> 	if (!tail empty)
> 		return;
> 	if (trylock AGI)
> 		return;

(adding some notes here, this patch doesn't use try lock here
 finally but unlock perag and take AGI and relock and recheck tail_empty....
 since the tail non-empty is rare...)

> 
> 	/*
> 	 * Slow path, need to lock AGI first. Don't even bother
> 	 * rechecking tail pointers or trying to optimise for
> 	 * minimal AGI lock hold time as racing unlink list mods
> 	 * will all block on the perag lock once we take that. They
> 	 * will then hit the !tail empty fast path and not require
> 	 * the AGI lock at all.
> 	 */
> 	lock AGI
> 	lock_perag()
> 	return;
> }
> 
> The non-AGI locking fast path is slightly different in the remove
> case, so we'll have a slightly different helper function in that
> case which checks where the inode being removed is in the list.
> 
> In both cases, though, the unlock should be the same:
> 
> xfs_iunlink_unlock()
> {
> 	/* Does not unlock AGI, ever. commit does that. */
> 	unlock perag
> 	put perag
> }
> 
> This keeps the list locking completely separate from the list
> manipulations and allows us to document the locking constraints and
> reasons for why it is or isn't optimised for specific conditions
> without cluttering up the list manipulations code.

Yeah, the _lock and _unlock pair is helpful, I will go for this direction.

Thanks,
Gao Xiang

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 2/2] xfs: don't access AGI on unlinked inodes if it can
  2020-07-09  0:55     ` Gao Xiang
@ 2020-07-09  2:32       ` Dave Chinner
  2020-07-09 10:36         ` Gao Xiang
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2020-07-09  2:32 UTC (permalink / raw)
  To: Gao Xiang; +Cc: linux-xfs, Darrick J. Wong, Brian Foster

On Thu, Jul 09, 2020 at 08:55:26AM +0800, Gao Xiang wrote:
> On Thu, Jul 09, 2020 at 09:33:11AM +1000, Dave Chinner wrote:
> > On Tue, Jul 07, 2020 at 09:57:41PM +0800, Gao Xiang wrote:
> > > Currently, we use AGI buffer lock to protect in-memory linked list for
> > > unlinked inodes but since it's not necessary to modify AGI unless the
> > > head of the unlinked list is modified. So let's removing the AGI buffer
> > > modification dependency if possible, including 1) adding another per-AG
> > > dedicated lock to protect the whole list and 2) inserting unlinked
> > > inodes from tail.
> > > 
> > > For 2), the tail of bucket 0 is now recorded in perag for xfs_iunlink()
> > > to use. xfs_iunlink_remove() still support old multiple short bucket
> > > lists for recovery code.
> > 
> > I would split this into two separate patches. One to move to a perag
> > based locking strategy, another to change from head to tail
> > addition as they are largely independent algorithmic changes.
> 
> Yes, that is much better from the perspective of spilting patches and
> I thought that before. It seems that is like 2 steps but the proposed
> target solution is as a whole (in other words, 2 steps are code-related)
> and I'm not sure how large these code is sharable or can be inherited
> but rather than introduce some code in patch 2 and then remove immediately
> and turn into a new code in patch 3). I'm not sure how large logic could
> be sharable between these 2 dependent steps so I didn't do that.
> 
> I will spilt patches in the next RFC version to make a try.

Thanks!

[snip]


> > i.e. this:
> > 
> > xfs_iunlink()
> > {
> > 
> > 	get locks
> > 	do list insert
> > 	drop locks
> > }
> > 
> > Is better for understanding, maintenance and future modification
> > than:
> > 
> > xfs_iunlink()
> > {
> > 
> > 	get perag
> > 	lock perag
> > 	look at tail of list
> > 	if (empty) {
> > 		unlock perag
> > 		read/lock AGI
> > 		lock perag
> > 		look at tail of list
> > 		if (empty)
> > 			do head insert
> > 			goto out
> > 	}
> > 	do tail insert
> > out:
> > 	update inode/pag tails
> > 	unlock
> > 	drop perag
> > }
> > 
> > It's trivial for a reader to understand what the first version of
> > xfs_iunlink() is going to do without needing to understand the
> > intraccies of the locking strategies. However, it takes time and
> > effort to undestand exactly waht the second one is doing because
> > it's not clear where lock ends and list modifications start, nor
> > what the locking rules are for the different modifications that are
> > being made. Essentially, it goes back to the complex
> > locking-intertwined-with-modification-algorithm problem the current
> > TOT code has.
> > 
> > I'd much prefer to see something like this:
> > 
> > /*
> >  * Inode allocation in the O_TMPFILE path defines the AGI/unlinked
> >  * list lock order as being AGI->perag unlinked list lock. We are
> >  * inverting it here as the fast path tail addition does not need to
> >  * modify the AGI at all. Hence we only need the AGI lock if the
> >  * tail is empty, but if we fail to get it without blocking then we
> >  * need to fall back to the slower, correct lock order.
> >  */
> > xfs_iunlink_insert_lock()
> > {
> > 	get perag;
> > 	lock_perag();
> > 	if (!tail empty)
> > 		return;
> > 	if (trylock AGI)
> > 		return;
> 
> (adding some notes here, this patch doesn't use try lock here
>  finally but unlock perag and take AGI and relock and recheck tail_empty....
>  since the tail non-empty is rare...)

*nod*

My point was largely that this sort of thing is really obvious and
easy to optimise once the locking is cleanly separated. Adding a
trylock rather than drop/relock is another patch for the series :P

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 2/2] xfs: don't access AGI on unlinked inodes if it can
  2020-07-09  2:32       ` Dave Chinner
@ 2020-07-09 10:36         ` Gao Xiang
  2020-07-09 10:47           ` Gao Xiang
  2020-07-09 22:36           ` Dave Chinner
  0 siblings, 2 replies; 17+ messages in thread
From: Gao Xiang @ 2020-07-09 10:36 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, Darrick J. Wong, Brian Foster

Hi Dave,

On Thu, Jul 09, 2020 at 12:32:46PM +1000, Dave Chinner wrote:

...

> > > 	if (trylock AGI)
> > > 		return;
> > 
> > (adding some notes here, this patch doesn't use try lock here
> >  finally but unlock perag and take AGI and relock and recheck tail_empty....
> >  since the tail non-empty is rare...)
> 
> *nod*
> 
> My point was largely that this sort of thing is really obvious and
> easy to optimise once the locking is cleanly separated. Adding a
> trylock rather than drop/relock is another patch for the series :P

I thought trylock way yesterday as well...
Apart from that we don't have the AGI trylock method yet, it seems
not work as well...

Thinking about one thread is holding a AGI lock and wants to lock perag.
And another thread holds a perag, but the trylock will fail and the second
wait-lock will cause deadlock... like below:

Thread 1			Thread 2
 lock AGI
                                 lock perag
 lock perag
                                 trylock AGI
                                 lock AGI           <- dead lock here

And trylock perag instead seems not work well as well, since it fails
more frequently so we need lock AGI and then lock perag as a fallback.

So currently, I think it's better to use unlock, do something, regrab,
recheck method for now...

And yeah, that can be done in another patch if some better way. (e.g.
moving modifing AGI into pre-commit, so we can only take one per-ag
lock here...)

Thanks,
Gao Xiang

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 2/2] xfs: don't access AGI on unlinked inodes if it can
  2020-07-09 10:36         ` Gao Xiang
@ 2020-07-09 10:47           ` Gao Xiang
  2020-07-09 22:36           ` Dave Chinner
  1 sibling, 0 replies; 17+ messages in thread
From: Gao Xiang @ 2020-07-09 10:47 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, Darrick J. Wong, Brian Foster

On Thu, Jul 09, 2020 at 06:36:21PM +0800, Gao Xiang wrote:
> Hi Dave,
> 
> On Thu, Jul 09, 2020 at 12:32:46PM +1000, Dave Chinner wrote:
> 
> ...
> 
> > > > 	if (trylock AGI)
> > > > 		return;
> > > 
> > > (adding some notes here, this patch doesn't use try lock here
> > >  finally but unlock perag and take AGI and relock and recheck tail_empty....
> > >  since the tail non-empty is rare...)
> > 
> > *nod*
> > 
> > My point was largely that this sort of thing is really obvious and
> > easy to optimise once the locking is cleanly separated. Adding a
> > trylock rather than drop/relock is another patch for the series :P
> 
> I thought trylock way yesterday as well...
> Apart from that we don't have the AGI trylock method yet, it seems
> not work as well...
> 
> Thinking about one thread is holding a AGI lock and wants to lock perag.
> And another thread holds a perag, but the trylock will fail and the second
> wait-lock will cause deadlock... like below:
> 
> Thread 1			Thread 2
>  lock AGI
>                                  lock perag
>  lock perag
>                                  trylock AGI
>                                  lock AGI           <- dead lock here
> 
> And trylock perag instead seems not work well as well, since it fails
> more frequently so we need lock AGI and then lock perag as a fallback.
> 
> So currently, I think it's better to use unlock, do something, regrab,
> recheck method for now...
> 
> And yeah, that can be done in another patch if some better way. (e.g.
> moving modifing AGI into pre-commit, so we can only take one per-ag
> lock here...)

Oh, sorry, I didn't notice the words

> 	/*
> 	 * Slow path, need to lock AGI first. Don't even bother
> 	 * rechecking tail pointers or trying to optimise for
> 	 * minimal AGI lock hold time as racing unlink list mods
> 	 * will all block on the perag lock once we take that. They
> 	 * will then hit the !tail empty fast path and not require
> 	 * the AGI lock at all.
> 	 */

If we'd only try to trylock AGI and release perag and relock again,
and that's fine. Sorry about the noise :)

btw, also add some other notes here. For AGI pre-commits, I'm not
quite sure if some users take another locked buffer first (but
without AGI), so it would have some potential locking order concern
as well... But let us think about them later :)

add agi to precommit
lock other buffers

precommit --- lock AGI


Thanks,
Gao Xiang

> 
> Thanks,
> Gao Xiang
> 
> > 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> > 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 2/2] xfs: don't access AGI on unlinked inodes if it can
  2020-07-09 10:36         ` Gao Xiang
  2020-07-09 10:47           ` Gao Xiang
@ 2020-07-09 22:36           ` Dave Chinner
  1 sibling, 0 replies; 17+ messages in thread
From: Dave Chinner @ 2020-07-09 22:36 UTC (permalink / raw)
  To: Gao Xiang; +Cc: linux-xfs, Darrick J. Wong, Brian Foster

On Thu, Jul 09, 2020 at 06:36:21PM +0800, Gao Xiang wrote:
> Hi Dave,
> 
> On Thu, Jul 09, 2020 at 12:32:46PM +1000, Dave Chinner wrote:
> 
> ...
> 
> > > > 	if (trylock AGI)
> > > > 		return;
> > > 
> > > (adding some notes here, this patch doesn't use try lock here
> > >  finally but unlock perag and take AGI and relock and recheck tail_empty....
> > >  since the tail non-empty is rare...)
> > 
> > *nod*
> > 
> > My point was largely that this sort of thing is really obvious and
> > easy to optimise once the locking is cleanly separated. Adding a
> > trylock rather than drop/relock is another patch for the series :P
> 
> I thought trylock way yesterday as well...
> Apart from that we don't have the AGI trylock method yet, it seems
> not work as well...
> 
> Thinking about one thread is holding a AGI lock and wants to lock perag.
> And another thread holds a perag, but the trylock will fail and the second
> wait-lock will cause deadlock... like below:
> 
> Thread 1			Thread 2
>  lock AGI
>                                  lock perag
>  lock perag
>                                  trylock AGI
>                                  lock AGI           <- dead lock here

You forgot to unlock the perag before locking the AGI, so of course
it will deadlock.

Yes, I know my psuedocode example didn't explicitly unlock the perag
because I simply forgot to write it in as I was going. It's pretty
common that quickly written psuedo code to explain a concept is not
perfect, so you still need to think about whether it is correct,
complete, etc.  i.e. if you do:

Thread 1			Thread 2
 lock AGI
                                 lock perag
 lock perag
                                 trylock AGI
				 unlock perag
                                 lock AGI
				 <blocks>

There is no deadlock in this algorithm.

> And trylock perag instead seems not work well as well, since it fails
> more frequently so we need lock AGI and then lock perag as a fallback.

That's kind of what I'd expect - if there is contention on the AGI,
the trylock will fail and we know we are about to sleep. The point
is that "trylock fails" now gives us a point at which we can measure
contention that puts us into the slow path. And because we are now
in the slow path, the overhead of backing out just doesn't matter
because we are going to sleep anyway.

We use this pattern all over XFS to attempt to get locks in reverse
order with minimal overhead in fast paths...

> So currently, I think it's better to use unlock, do something, regrab,
> recheck method for now...

I think that's a mistake - it's smells of trying to optimise the
implementation around immediately observed behaviour instead of
designing a clean, well structured locking scheme that will not need
to be re-optimised over and over again as the algorithm it protects
changes over time.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [RFC PATCH v2 0/3] xfs: more unlinked inode list optimization v2
  2020-07-07 13:57 [RFC PATCH 0/2] xfs: more unlinked inode list optimization v1 Gao Xiang
  2020-07-07 13:57 ` [RFC PATCH 1/2] xfs: arrange all unlinked inodes into one list Gao Xiang
  2020-07-07 13:57 ` [RFC PATCH 2/2] xfs: don't access AGI on unlinked inodes if it can Gao Xiang
@ 2020-07-24  6:12 ` Gao Xiang
  2020-07-24  6:12   ` [RFC PATCH v2 1/3] xfs: arrange all unlinked inodes into one list Gao Xiang
                     ` (2 more replies)
  2 siblings, 3 replies; 17+ messages in thread
From: Gao Xiang @ 2020-07-24  6:12 UTC (permalink / raw)
  To: linux-xfs; +Cc: Dave Chinner, Darrick J. Wong, Brian Foster, Gao Xiang

Hi forks,

Sorry for long delay these days due to my work ineffective, this is RFC
v2 of this patchset addressing previous constructive comments...

This RFC patchset mainly addresses the thoughts [*] and [**] from Dave's
original patchset,
https://lore.kernel.org/r/20200623095015.1934171-1-david@fromorbit.com

In short, it focues on the following ideas mentioned by Dave:
 - use bucket 0 instead of multiple buckets since in-memory double
   linked list finally works;

 - avoid taking AGI buffer and unnecessary AGI update if possible, so
   1) add a new lock and keep proper locking order to avoid deadlock;
   2) insert a new unlinked inode from the tail instead of head;

In addition, it's worth noticing 3 things:
 - xfs_iunlink_remove() should support old multiple buckets in order
   to keep old inode unlinked list (old image) working when recovering.

 - (but) OTOH, the old kernel recovery _shouldn't_ work with new image
   since the bucket_index from old xfs_iunlink_remove() is generated
   by the old formula (rather than keep in xfs_inode), which is now
   fixed as 0. So this feature is not forward compatible without some
   extra backport patches;

 - a tail xfs_inode pointer is also added in the perag, which keeps 
   track of the tail of bucket 0 since it's mainly used for xfs_iunlink().

 - the old kernel recovery shouldn't work with new image since
     its bucket_index from old kernel is 

The git tree is also available at
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/linux.git tags/xfs/iunlink_opt_v2

Gitweb:
https://git.kernel.org/pub/scm/linux/kernel/git/xiang/linux.git/log/?h=xfs/iunlink_opt_v2

Some limited test for debugging only is done (mainly fsstress and manual
power-cut tests) and it's worth noticing that when I tested fsstress for
much longer time, it showed the following kernel message twice and it
haven't been reproduced again for about 3 days, and I have no idea what's
wrong over the current logic, so it could be better to send out the patchset
first and go on reproducing with extra debugging prints added...

[ 7884.930106] XFS (sda): bad inode magic/vsn daddr 111232 #0 (magic=5050)
[ 7884.930988] XFS (sda): Metadata corruption detected at xfs_buf_ioend+0x11a/0x190, xfs_inode block 0x1b280 xfs_inode_buf_verify
[ 7884.932418] XFS (sda): Unmount and run xfs_repair
[ 7884.933154] XFS (sda): First 128 bytes of corrupted metadata buffer:
[ 7884.934253] 00000000: 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50  PPPPPPPPPPPPPPPP
[ 7884.935495] 00000010: 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50  PPPPPPPPPPPPPPPP
[ 7884.936752] 00000020: 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50  PPPPPPPPPPPPPPPP
[ 7884.937761] 00000030: 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50  PPPPPPPPPPPPPPPP
[ 7884.939336] 00000040: 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50  PPPPPPPPPPPPPPPP
[ 7884.940914] 00000050: 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50  PPPPPPPPPPPPPPPP
[ 7884.942382] 00000060: 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50  PPPPPPPPPPPPPPPP
[ 7884.943513] 00000070: 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50  PPPPPPPPPPPPPPPP
[ 7884.945006] XFS (sda): metadata I/O error in "xfs_imap_to_bp+0x61/0xd0" at daddr 0x1b280 len 32 error 117
[ 7884.952066] XFS (sda): xfs_do_force_shutdown(0x1) called from line 312 of file fs/xfs/xfs_trans_buf.c. Return address = ffffffff983ea7e0
[ 7884.952494] sda: writeback error on inode 20695, offset 770048, sector 1970864
[ 7884.954353] XFS (sda): I/O Error Detected. Shutting down filesystem
[ 7884.956566] sda: writeback error on inode 117946, offset 2347008, sector 612424
[ 7884.956802] XFS (sda): Please unmount the filesystem and rectify the problem(s)
[ 7884.958943] XFS (sda): xfs_inactive_ifree: xfs_trans_commit returned error -117
fsstress: check_cwd failure
[ 7884.963475] XFS (sda): xfs_imap_lookup: xfs_ialloc_read_agi() returned error -5, agno 0

Comments and directions are welcomed. :)

Thanks,
Gao Xiang


Changes since v1:
 - show some number of reduced AGI buffer modification (Darrick);
 - add comments about pag_unlinked_mutex, pag_unlinked_tail (Darrick);
 - avoid touching xfs_iunlink_update_bucket() logic (Dave);
 - spilt "xfs: don't access AGI on unlinked inodes if it can" into 2 patches (Dave);
 - add xfs_iunlink_insert_lock/xfs_iunlink_remove_lock/xfs_iunlink_unlock to
   make the code easy to follow (Dave).

Gao Xiang (3):
  xfs: arrange all unlinked inodes into one list
  xfs: introduce perag iunlink lock
  xfs: insert unlinked inodes from tail

 fs/xfs/xfs_inode.c       | 202 ++++++++++++++++++++++++++++-----------
 fs/xfs/xfs_log_recover.c |   5 +
 fs/xfs/xfs_mount.c       |   2 +
 fs/xfs/xfs_mount.h       |   6 ++
 4 files changed, 159 insertions(+), 56 deletions(-)

-- 
2.18.1


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [RFC PATCH v2 1/3] xfs: arrange all unlinked inodes into one list
  2020-07-24  6:12 ` [RFC PATCH v2 0/3] xfs: more unlinked inode list optimization v2 Gao Xiang
@ 2020-07-24  6:12   ` Gao Xiang
  2020-07-24  6:12   ` [RFC PATCH v2 2/3] xfs: introduce perag iunlink lock Gao Xiang
  2020-07-24  6:12   ` [RFC PATCH v2 3/3] xfs: insert unlinked inodes from tail Gao Xiang
  2 siblings, 0 replies; 17+ messages in thread
From: Gao Xiang @ 2020-07-24  6:12 UTC (permalink / raw)
  To: linux-xfs; +Cc: Dave Chinner, Darrick J. Wong, Brian Foster, Gao Xiang

There is no need to keep old multiple short unlink inode buckets
since we have an in-memory double linked list for all unlinked
inodes.

Apart from the perspective of the necessity, the main advantage
is the log and AGI update can be reduced since each AG has the
only one head now, which is implemented in the following patches.

Therefore, this patch applies the new way in xfs_iunlink() and
keep the old approach in xfs_iunlink_remove_inode() path as
well so inode eviction can still work properly in recovery.

Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
---
 fs/xfs/xfs_inode.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index ab288424764c..4994398373df 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -33,6 +33,7 @@
 #include "xfs_symlink.h"
 #include "xfs_trans_priv.h"
 #include "xfs_log.h"
+#include "xfs_log_priv.h"
 #include "xfs_bmap_btree.h"
 #include "xfs_reflink.h"
 #include "xfs_iunlink_item.h"
@@ -2001,14 +2002,13 @@ xfs_iunlink_insert_inode(
 	xfs_agino_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
 	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
 	xfs_agino_t		next_agino;
-	short			bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
 
 	/*
 	 * Get the index into the agi hash table for the list this inode will
 	 * go on.  Make sure the pointer isn't garbage and that this inode
 	 * isn't already on the list.
 	 */
-	next_agino = be32_to_cpu(agi->agi_unlinked[bucket_index]);
+	next_agino = be32_to_cpu(agi->agi_unlinked[0]);
 	if (next_agino == agino ||
 	    !xfs_verify_agino_or_null(mp, agno, next_agino)) {
 		xfs_buf_mark_corrupt(agibp);
@@ -2036,7 +2036,7 @@ xfs_iunlink_insert_inode(
 	}
 
 	/* Point the head of the list to point to this inode. */
-	return xfs_iunlink_update_bucket(tp, agno, agibp, bucket_index, agino);
+	return xfs_iunlink_update_bucket(tp, agno, agibp, 0, agino);
 }
 
 /*
@@ -2055,10 +2055,17 @@ xfs_iunlink_remove_inode(
 	xfs_agino_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
 	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
 	xfs_agino_t		next_agino = ip->i_next_unlinked;
-	short			bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
 	int			error;
 
 	if (ip->i_prev_unlinked == NULLAGINO) {
+		struct xlog	*log = mp->m_log;
+		short		bucket_index = 0;
+
+		/* During recovery, the old multiple bucket can be applied */
+		if ((!log || log->l_flags & XLOG_RECOVERY_NEEDED) &&
+		    be32_to_cpu(agi->agi_unlinked[0]) != agino)
+			bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
+
 		/* remove from head of list */
 		if (be32_to_cpu(agi->agi_unlinked[bucket_index]) != agino) {
 			xfs_buf_mark_corrupt(agibp);
-- 
2.18.1


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [RFC PATCH v2 2/3] xfs: introduce perag iunlink lock
  2020-07-24  6:12 ` [RFC PATCH v2 0/3] xfs: more unlinked inode list optimization v2 Gao Xiang
  2020-07-24  6:12   ` [RFC PATCH v2 1/3] xfs: arrange all unlinked inodes into one list Gao Xiang
@ 2020-07-24  6:12   ` Gao Xiang
  2020-07-24  6:12   ` [RFC PATCH v2 3/3] xfs: insert unlinked inodes from tail Gao Xiang
  2 siblings, 0 replies; 17+ messages in thread
From: Gao Xiang @ 2020-07-24  6:12 UTC (permalink / raw)
  To: linux-xfs; +Cc: Dave Chinner, Darrick J. Wong, Brian Foster, Gao Xiang

Currently, we use AGI buffer lock to protect in-memory linked list
for unlinked inodes but since it's not necessary to modify AGI
unless the head of the unlinked list is modified.

Therefore, let's add another per-AG dedicated lock to protect the
whole list, and removing AGI buffer lock in xfs_iunlink_remove()
if the head is untouched.

Note that some paths (e.g. xfs_create_tmpfile()) take AGI lock in
its transaction in advance, so the proper locking order is only
AGI lock -> unlinked list lock.

Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
---
 fs/xfs/xfs_inode.c | 79 ++++++++++++++++++++++++++++++++++++++--------
 fs/xfs/xfs_mount.c |  1 +
 fs/xfs/xfs_mount.h |  3 ++
 3 files changed, 70 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 4994398373df..d78aaa8ce252 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2039,6 +2039,46 @@ xfs_iunlink_insert_inode(
 	return xfs_iunlink_update_bucket(tp, agno, agibp, 0, agino);
 }
 
+/*
+ * Lock the perag and take AGI lock if agi_unlinked is touched as well for
+ * xfs_iunlink_remove_inode().
+ *
+ * Inode allocation in the O_TMPFILE path defines the AGI/unlinked list lock
+ * order as being AGI->perag unlinked list lock. We are inverting it here as
+ * the fast path tail addition does not need to modify the AGI at all. Hence
+ * we only need the AGI lock if the tail is empty, correct lock order.
+ */
+static struct xfs_perag *
+xfs_iunlink_remove_lock(
+	xfs_agino_t		agno,
+	struct xfs_trans        *tp,
+	struct xfs_inode	*ip,
+	struct xfs_buf		**agibpp)
+{
+	struct xfs_mount        *mp = tp->t_mountp;
+	struct xfs_perag	*pag;
+	int			error;
+
+	pag = xfs_perag_get(mp, agno);
+	mutex_lock(&pag->pag_unlinked_mutex);
+
+	if (ip->i_prev_unlinked == NULLAGINO) {
+		mutex_unlock(&pag->pag_unlinked_mutex);
+		/*
+		 * some paths (e.g. xfs_create_tmpfile) could take AGI lock
+		 * in this transaction in advance and the only locking order
+		 * AGI buf lock -> pag_unlinked_mutex is safe.
+		 */
+		error = xfs_read_agi(mp, tp, agno, agibpp);
+		if (error) {
+			xfs_perag_put(pag);
+			return ERR_PTR(error);
+		}
+		mutex_lock(&pag->pag_unlinked_mutex);
+	}
+	return pag;
+}
+
 /*
  * Remove can be from anywhere in the list, so we have to do two adjacent inode
  * lookups here so we can update list pointers. We may be at the head or the
@@ -2046,12 +2086,12 @@ xfs_iunlink_insert_inode(
  */
 static int
 xfs_iunlink_remove_inode(
+	struct xfs_perag	*pag,
 	struct xfs_trans	*tp,
 	struct xfs_buf		*agibp,
 	struct xfs_inode	*ip)
 {
 	struct xfs_mount	*mp = tp->t_mountp;
-	struct xfs_agi		*agi = agibp->b_addr;
 	xfs_agino_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
 	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
 	xfs_agino_t		next_agino = ip->i_next_unlinked;
@@ -2059,6 +2099,7 @@ xfs_iunlink_remove_inode(
 
 	if (ip->i_prev_unlinked == NULLAGINO) {
 		struct xlog	*log = mp->m_log;
+		struct xfs_agi	*agi = agibp->b_addr;
 		short		bucket_index = 0;
 
 		/* During recovery, the old multiple bucket can be applied */
@@ -2083,7 +2124,7 @@ xfs_iunlink_remove_inode(
 		/* lookup previous inode and update to point at next */
 		struct xfs_inode	*pip;
 
-		pip = xfs_iunlink_lookup_prev(agibp->b_pag, ip);
+		pip = xfs_iunlink_lookup_prev(pag, ip);
 		if (IS_ERR_OR_NULL(pip))
 			return -EFSCORRUPTED;
 
@@ -2102,7 +2143,7 @@ xfs_iunlink_remove_inode(
 	if (ip->i_next_unlinked != NULLAGINO) {
 		struct xfs_inode	*nip;
 
-		nip = xfs_iunlink_lookup_next(agibp->b_pag, ip);
+		nip = xfs_iunlink_lookup_next(pag, ip);
 		if (IS_ERR_OR_NULL(nip))
 			return -EFSCORRUPTED;
 
@@ -2126,6 +2167,15 @@ xfs_iunlink_remove_inode(
 	return 0;
 }
 
+static void
+xfs_iunlink_unlock(
+	struct xfs_perag	*pag)
+{
+	/* Does not unlock AGI, ever. xfs_trans_commit() does that. */
+	mutex_unlock(&pag->pag_unlinked_mutex);
+	xfs_perag_put(pag);
+}
+
 /*
  * This is called when the inode's link count has gone to 0 or we are creating
  * a tmpfile via O_TMPFILE.  The inode @ip must have nlink == 0.
@@ -2143,7 +2193,6 @@ xfs_iunlink(
 	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
 	int			error;
 
-	ASSERT(ip->i_next_unlinked == NULLAGINO);
 	ASSERT(VFS_I(ip)->i_nlink == 0);
 	ASSERT(VFS_I(ip)->i_mode != 0);
 	trace_xfs_iunlink(ip);
@@ -2153,7 +2202,10 @@ xfs_iunlink(
 	if (error)
 		return error;
 
-	return xfs_iunlink_insert_inode(tp, agibp, ip);
+	mutex_lock(&agibp->b_pag->pag_unlinked_mutex);
+	error = xfs_iunlink_insert_inode(tp, agibp, ip);
+	mutex_unlock(&agibp->b_pag->pag_unlinked_mutex);
+	return error;
 }
 
 /*
@@ -2164,19 +2216,20 @@ xfs_iunlink_remove(
 	struct xfs_trans	*tp,
 	struct xfs_inode	*ip)
 {
-	struct xfs_mount	*mp = tp->t_mountp;
-	struct xfs_buf		*agibp;
-	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
+	struct xfs_buf		*agibp = NULL;
+	struct xfs_perag	*pag;
 	int			error;
 
 	trace_xfs_iunlink_remove(ip);
 
-	/* Get the agi buffer first.  It ensures lock ordering on the list. */
-	error = xfs_read_agi(mp, tp, agno, &agibp);
-	if (error)
-		return error;
+	pag = xfs_iunlink_remove_lock(XFS_INO_TO_AGNO(tp->t_mountp, ip->i_ino),
+		tp, ip, &agibp);
+	if (IS_ERR(pag))
+		return PTR_ERR(pag);
 
-	return xfs_iunlink_remove_inode(tp, agibp, ip);
+	error = xfs_iunlink_remove_inode(pag, tp, agibp, ip);
+	xfs_iunlink_unlock(pag);
+	return error;
 }
 
 /*
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 031e96ff022d..f94e14059e61 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -223,6 +223,7 @@ xfs_initialize_perag(
 		if (first_initialised == NULLAGNUMBER)
 			first_initialised = index;
 		spin_lock_init(&pag->pag_state_lock);
+		mutex_init(&pag->pag_unlinked_mutex);
 	}
 
 	index = xfs_set_inode_alloc(mp, agcount);
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index a72cfcaa4ad1..9a1d0f239fa4 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -372,6 +372,9 @@ typedef struct xfs_perag {
 	/* reference count */
 	uint8_t			pagf_refcount_level;
 
+	/* lock to protect unlinked inode list */
+	struct mutex            pag_unlinked_mutex;
+
 	/*
 	 * Unlinked inode information.  This incore information reflects
 	 * data stored in the AGI, so callers must hold the AGI buffer lock
-- 
2.18.1


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [RFC PATCH v2 3/3] xfs: insert unlinked inodes from tail
  2020-07-24  6:12 ` [RFC PATCH v2 0/3] xfs: more unlinked inode list optimization v2 Gao Xiang
  2020-07-24  6:12   ` [RFC PATCH v2 1/3] xfs: arrange all unlinked inodes into one list Gao Xiang
  2020-07-24  6:12   ` [RFC PATCH v2 2/3] xfs: introduce perag iunlink lock Gao Xiang
@ 2020-07-24  6:12   ` Gao Xiang
  2 siblings, 0 replies; 17+ messages in thread
From: Gao Xiang @ 2020-07-24  6:12 UTC (permalink / raw)
  To: linux-xfs; +Cc: Dave Chinner, Darrick J. Wong, Brian Foster, Gao Xiang

Currently, AGI buffer is always touched since xfs_iunlink()
adds unlinked inodes from head unconditionally, but since we
now have the only one unlinked list and if we insert unlinked
inodes from tail instead and there're more than 1 inodes, AGI
buffer modification could be avoided.

In order to do that, let's keep track of the tail of unlinked
inode into the perag all the time in order for xfs_iunlink()
to use.

With this change, it shows that only 938 of 10000 operations
modifies the head of unlinked list with the following workload:
 seq 1 10000 | xargs touch
 find . | xargs -n3 -P100 rm

Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
---
 fs/xfs/xfs_inode.c       | 120 ++++++++++++++++++++++++---------------
 fs/xfs/xfs_log_recover.c |   5 ++
 fs/xfs/xfs_mount.c       |   1 +
 fs/xfs/xfs_mount.h       |   3 +
 4 files changed, 84 insertions(+), 45 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index d78aaa8ce252..3cfd84b76955 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1986,6 +1986,38 @@ xfs_iunlink_update_bucket(
 	return 0;
 }
 
+/*
+ * Lock the perag and take AGI lock if agi_unlinked is touched as well
+ * for xfs_iunlink_insert_inode(). As for the details of locking order,
+ * refer to the comments of xfs_iunlink_remove_lock().
+ */
+static struct xfs_perag *
+xfs_iunlink_insert_lock(
+	xfs_agino_t		agno,
+	struct xfs_trans        *tp,
+	struct xfs_inode	*ip,
+	struct xfs_buf		**agibpp)
+{
+	struct xfs_mount        *mp = tp->t_mountp;
+	struct xfs_perag	*pag;
+	int			error;
+
+	pag = xfs_perag_get(mp, agno);
+	mutex_lock(&pag->pag_unlinked_mutex);
+
+	if (!pag->pag_unlinked_tail) {
+		mutex_unlock(&pag->pag_unlinked_mutex);
+
+		error = xfs_read_agi(mp, tp, agno, agibpp);
+		if (error) {
+			xfs_perag_put(pag);
+			return ERR_PTR(error);
+		}
+		mutex_lock(&pag->pag_unlinked_mutex);
+	}
+	return pag;
+}
+
 /*
  * Always insert at the head, so we only have to do a next inode lookup to
  * update it's prev pointer. The AGI bucket will point at the one we are
@@ -1993,50 +2025,47 @@ xfs_iunlink_update_bucket(
  */
 static int
 xfs_iunlink_insert_inode(
+	struct xfs_perag	*pag,
 	struct xfs_trans	*tp,
 	struct xfs_buf		*agibp,
 	struct xfs_inode	*ip)
 {
 	struct xfs_mount	*mp = tp->t_mountp;
-	struct xfs_agi		*agi = agibp->b_addr;
-	xfs_agino_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
+	struct xfs_inode	*pip = pag->pag_unlinked_tail;
 	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
 	xfs_agino_t		next_agino;
 
-	/*
-	 * Get the index into the agi hash table for the list this inode will
-	 * go on.  Make sure the pointer isn't garbage and that this inode
-	 * isn't already on the list.
-	 */
-	next_agino = be32_to_cpu(agi->agi_unlinked[0]);
-	if (next_agino == agino ||
-	    !xfs_verify_agino_or_null(mp, agno, next_agino)) {
-		xfs_buf_mark_corrupt(agibp);
-		return -EFSCORRUPTED;
-	}
-
-	ip->i_prev_unlinked = NULLAGINO;
-	ip->i_next_unlinked = next_agino;
-	if (ip->i_next_unlinked != NULLAGINO) {
-		struct xfs_inode	*nip;
-
-		nip = xfs_iunlink_lookup_next(agibp->b_pag, ip);
-		if (IS_ERR_OR_NULL(nip))
-			return -EFSCORRUPTED;
+	if (!pip) {
+		xfs_agino_t	agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
+		struct xfs_agi	*agi = agibp->b_addr;
+		int		error;
 
-		if (nip->i_prev_unlinked != NULLAGINO) {
-			xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__,
-						NULL, 0, __this_address);
+		ip->i_prev_unlinked = NULLAGINO;
+		/*
+		 * Get the index into the agi hash table for the list this
+		 * inode will go on.  Make sure the pointer isn't garbage
+		 * and that this inode isn't already on the list.
+		 */
+		next_agino = be32_to_cpu(agi->agi_unlinked[0]);
+		if (next_agino == agino ||
+		    !xfs_verify_agino_or_null(mp, agno, next_agino)) {
+			xfs_buf_mark_corrupt(agibp);
 			return -EFSCORRUPTED;
 		}
-		nip->i_prev_unlinked = agino;
 
-		/* update the on disk inode now */
-		xfs_iunlink_log(tp, ip);
+		/* Point the head of the list to point to this inode. */
+		error = xfs_iunlink_update_bucket(tp, agno, agibp, 0, agino);
+		if (error)
+			return error;
+	} else {
+		ip->i_prev_unlinked = XFS_INO_TO_AGINO(mp, pip->i_ino);
+		ASSERT(pip->i_next_unlinked == NULLAGINO);
+		pip->i_next_unlinked = agino;
+		xfs_iunlink_log(tp, pip);
 	}
-
-	/* Point the head of the list to point to this inode. */
-	return xfs_iunlink_update_bucket(tp, agno, agibp, 0, agino);
+	ip->i_next_unlinked = NULLAGINO;
+	pag->pag_unlinked_tail = ip;
+	return 0;
 }
 
 /*
@@ -2095,6 +2124,7 @@ xfs_iunlink_remove_inode(
 	xfs_agino_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
 	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
 	xfs_agino_t		next_agino = ip->i_next_unlinked;
+	struct xfs_inode	*pip = NULL;
 	int			error;
 
 	if (ip->i_prev_unlinked == NULLAGINO) {
@@ -2122,8 +2152,6 @@ xfs_iunlink_remove_inode(
 			return -EFSCORRUPTED;
 	} else {
 		/* lookup previous inode and update to point at next */
-		struct xfs_inode	*pip;
-
 		pip = xfs_iunlink_lookup_prev(pag, ip);
 		if (IS_ERR_OR_NULL(pip))
 			return -EFSCORRUPTED;
@@ -2139,8 +2167,12 @@ xfs_iunlink_remove_inode(
 		xfs_iunlink_log(tp, pip);
 	}
 
-	/* lookup the next inode and update to point at prev */
-	if (ip->i_next_unlinked != NULLAGINO) {
+	if (next_agino == NULLAGINO) {
+		/* only care about the tail of bucket 0 for xfs_iunlink() */
+		if (pag->pag_unlinked_tail == ip)
+			pag->pag_unlinked_tail = pip;
+	} else {
+		/* lookup the next inode and update to point at prev */
 		struct xfs_inode	*nip;
 
 		nip = xfs_iunlink_lookup_next(pag, ip);
@@ -2188,23 +2220,21 @@ xfs_iunlink(
 	struct xfs_trans	*tp,
 	struct xfs_inode	*ip)
 {
-	struct xfs_mount	*mp = tp->t_mountp;
-	struct xfs_buf		*agibp;
-	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
+	struct xfs_buf		*agibp = NULL;
+	struct xfs_perag	*pag;
 	int			error;
 
 	ASSERT(VFS_I(ip)->i_nlink == 0);
 	ASSERT(VFS_I(ip)->i_mode != 0);
 	trace_xfs_iunlink(ip);
 
-	/* Get the agi buffer first.  It ensures lock ordering on the list. */
-	error = xfs_read_agi(mp, tp, agno, &agibp);
-	if (error)
-		return error;
+	pag = xfs_iunlink_insert_lock(XFS_INO_TO_AGNO(tp->t_mountp, ip->i_ino),
+		tp, ip, &agibp);
+	if (IS_ERR(pag))
+		return PTR_ERR(pag);
 
-	mutex_lock(&agibp->b_pag->pag_unlinked_mutex);
-	error = xfs_iunlink_insert_inode(tp, agibp, ip);
-	mutex_unlock(&agibp->b_pag->pag_unlinked_mutex);
+	error = xfs_iunlink_insert_inode(pag, tp, agibp, ip);
+	xfs_iunlink_unlock(pag);
 	return error;
 }
 
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index d47eea31c165..30198aea7335 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -2801,6 +2801,11 @@ xlog_recover_iunlink_ag(
 					prev_ip->i_next_unlinked = NULLAGINO;
 				break;
 			}
+
+			/* XXX: take pag_unlinked_mutex across the loop? */
+			if (!bucket)
+				agibp->b_pag->pag_unlinked_tail = ip;
+
 			if (prev_ip) {
 				ip->i_prev_unlinked = prev_agino;
 				xfs_irele(prev_ip);
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index f94e14059e61..f1cd3e9c4da5 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -224,6 +224,7 @@ xfs_initialize_perag(
 			first_initialised = index;
 		spin_lock_init(&pag->pag_state_lock);
 		mutex_init(&pag->pag_unlinked_mutex);
+		pag->pag_unlinked_tail = NULL;
 	}
 
 	index = xfs_set_inode_alloc(mp, agcount);
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 9a1d0f239fa4..934e7c373042 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -375,6 +375,9 @@ typedef struct xfs_perag {
 	/* lock to protect unlinked inode list */
 	struct mutex            pag_unlinked_mutex;
 
+	/* point to the inode tail of AGI unlinked bucket 0 */
+	struct xfs_inode	*pag_unlinked_tail;
+
 	/*
 	 * Unlinked inode information.  This incore information reflects
 	 * data stored in the AGI, so callers must hold the AGI buffer lock
-- 
2.18.1


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, back to index

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-07 13:57 [RFC PATCH 0/2] xfs: more unlinked inode list optimization v1 Gao Xiang
2020-07-07 13:57 ` [RFC PATCH 1/2] xfs: arrange all unlinked inodes into one list Gao Xiang
2020-07-08 22:33   ` Dave Chinner
2020-07-09  0:17     ` Gao Xiang
2020-07-07 13:57 ` [RFC PATCH 2/2] xfs: don't access AGI on unlinked inodes if it can Gao Xiang
2020-07-08 17:03   ` Darrick J. Wong
2020-07-08 23:40     ` Gao Xiang
2020-07-08 23:33   ` Dave Chinner
2020-07-09  0:55     ` Gao Xiang
2020-07-09  2:32       ` Dave Chinner
2020-07-09 10:36         ` Gao Xiang
2020-07-09 10:47           ` Gao Xiang
2020-07-09 22:36           ` Dave Chinner
2020-07-24  6:12 ` [RFC PATCH v2 0/3] xfs: more unlinked inode list optimization v2 Gao Xiang
2020-07-24  6:12   ` [RFC PATCH v2 1/3] xfs: arrange all unlinked inodes into one list Gao Xiang
2020-07-24  6:12   ` [RFC PATCH v2 2/3] xfs: introduce perag iunlink lock Gao Xiang
2020-07-24  6:12   ` [RFC PATCH v2 3/3] xfs: insert unlinked inodes from tail Gao Xiang

Linux-XFS Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-xfs/0 linux-xfs/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-xfs linux-xfs/ https://lore.kernel.org/linux-xfs \
		linux-xfs@vger.kernel.org
	public-inbox-index linux-xfs

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-xfs


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git