All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3 0/9] xfs: incore unlinked list
@ 2019-02-06 16:54 Darrick J. Wong
  2019-02-06 16:54 ` [PATCH 1/9] xfs: clean up iunlink functions Darrick J. Wong
                   ` (8 more replies)
  0 siblings, 9 replies; 22+ messages in thread
From: Darrick J. Wong @ 2019-02-06 16:54 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

Hi all,

This new patch series refactors the existing code that handles metadata
updates to the unlinked list when adding or removing inodes from that
list.  It then adds an in-core hashtable to record which inode's
next_unlinked field points to a given inode.  This enables us to remove
any inode from the on-disk unlinked structure without having to actually
walk the entire unlinked list, which reduces overhead substantially.

Since v2 this series has been reworked so that the iunlink code is more
robust in the face of backref cache failures (since we can always fall
back to the old bucket walking code), removal of the unlinked inode
counts, standardization of the helper function parameter types, and
fixing other review comments.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=incore-unlinked-list

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=incore-unlinked-list

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=incore-unlinked-list

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 1/9] xfs: clean up iunlink functions
  2019-02-06 16:54 [PATCH v3 0/9] xfs: incore unlinked list Darrick J. Wong
@ 2019-02-06 16:54 ` Darrick J. Wong
  2019-02-06 16:54 ` [PATCH 2/9] xfs: add xfs_verify_agino_or_null helper Darrick J. Wong
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Darrick J. Wong @ 2019-02-06 16:54 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, Brian Foster, Christoph Hellwig

From: Darrick J. Wong <darrick.wong@oracle.com>

Fix some indentation issues with the iunlink functions and reorganize
the tops of the functions to be identical.  No functional changes.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_inode.c |   79 +++++++++++++++++++++-------------------------------
 1 file changed, 32 insertions(+), 47 deletions(-)


diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index ae667ba74a1c..8a39d80f4863 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1918,26 +1918,24 @@ xfs_inactive(
  */
 STATIC int
 xfs_iunlink(
-	struct xfs_trans *tp,
-	struct xfs_inode *ip)
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip)
 {
-	xfs_mount_t	*mp = tp->t_mountp;
-	xfs_agi_t	*agi;
-	xfs_dinode_t	*dip;
-	xfs_buf_t	*agibp;
-	xfs_buf_t	*ibp;
-	xfs_agino_t	agino;
-	short		bucket_index;
-	int		offset;
-	int		error;
+	struct xfs_mount	*mp = tp->t_mountp;
+	struct xfs_agi		*agi;
+	struct xfs_dinode	*dip;
+	struct xfs_buf		*agibp;
+	struct xfs_buf		*ibp;
+	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
+	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
+	short			bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
+	int			offset;
+	int			error;
 
 	ASSERT(VFS_I(ip)->i_mode != 0);
 
-	/*
-	 * Get the agi buffer first.  It ensures lock ordering
-	 * on the list.
-	 */
-	error = xfs_read_agi(mp, tp, XFS_INO_TO_AGNO(mp, ip->i_ino), &agibp);
+	/* Get the agi buffer first.  It ensures lock ordering on the list. */
+	error = xfs_read_agi(mp, tp, agno, &agibp);
 	if (error)
 		return error;
 	agi = XFS_BUF_TO_AGI(agibp);
@@ -1946,9 +1944,6 @@ xfs_iunlink(
 	 * Get the index into the agi hash table for the
 	 * list this inode will go on.
 	 */
-	agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
-	ASSERT(agino != 0);
-	bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
 	ASSERT(agi->agi_unlinked[bucket_index]);
 	ASSERT(be32_to_cpu(agi->agi_unlinked[bucket_index]) != agino);
 
@@ -1995,45 +1990,35 @@ xfs_iunlink(
  */
 STATIC int
 xfs_iunlink_remove(
-	xfs_trans_t	*tp,
-	xfs_inode_t	*ip)
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip)
 {
-	xfs_ino_t	next_ino;
-	xfs_mount_t	*mp;
-	xfs_agi_t	*agi;
-	xfs_dinode_t	*dip;
-	xfs_buf_t	*agibp;
-	xfs_buf_t	*ibp;
-	xfs_agnumber_t	agno;
-	xfs_agino_t	agino;
-	xfs_agino_t	next_agino;
-	xfs_buf_t	*last_ibp;
-	xfs_dinode_t	*last_dip = NULL;
-	short		bucket_index;
-	int		offset, last_offset = 0;
-	int		error;
-
-	mp = tp->t_mountp;
-	agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
+	struct xfs_mount	*mp = tp->t_mountp;
+	struct xfs_agi		*agi;
+	struct xfs_dinode	*dip;
+	struct xfs_buf		*agibp;
+	struct xfs_buf		*ibp;
+	struct xfs_buf		*last_ibp;
+	struct xfs_dinode	*last_dip = NULL;
+	xfs_ino_t		next_ino;
+	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
+	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
+	xfs_agino_t		next_agino;
+	short			bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
+	int			offset;
+	int			last_offset = 0;
+	int			error;
 
-	/*
-	 * Get the agi buffer first.  It ensures lock ordering
-	 * on the list.
-	 */
+	/* Get the agi buffer first.  It ensures lock ordering on the list. */
 	error = xfs_read_agi(mp, tp, agno, &agibp);
 	if (error)
 		return error;
-
 	agi = XFS_BUF_TO_AGI(agibp);
 
 	/*
 	 * Get the index into the agi hash table for the
 	 * list this inode will go on.
 	 */
-	agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
-	if (!xfs_verify_agino(mp, agno, agino))
-		return -EFSCORRUPTED;
-	bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
 	if (!xfs_verify_agino(mp, agno,
 			be32_to_cpu(agi->agi_unlinked[bucket_index]))) {
 		XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 2/9] xfs: add xfs_verify_agino_or_null helper
  2019-02-06 16:54 [PATCH v3 0/9] xfs: incore unlinked list Darrick J. Wong
  2019-02-06 16:54 ` [PATCH 1/9] xfs: clean up iunlink functions Darrick J. Wong
@ 2019-02-06 16:54 ` Darrick J. Wong
  2019-02-06 16:54 ` [PATCH 3/9] xfs: refactor AGI unlinked bucket updates Darrick J. Wong
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Darrick J. Wong @ 2019-02-06 16:54 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, Brian Foster, Christoph Hellwig

From: Darrick J. Wong <darrick.wong@oracle.com>

Add a new helper to check that a per-AG inode pointer is either null or
points somewhere valid within that AG.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/libxfs/xfs_inode_buf.c |    3 +--
 fs/xfs/libxfs/xfs_types.c     |   13 +++++++++++++
 fs/xfs/libxfs/xfs_types.h     |    2 ++
 fs/xfs/scrub/agheader.c       |    8 +++-----
 4 files changed, 19 insertions(+), 7 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index 09d9c8cfa4a0..82c4374acb4c 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -99,8 +99,7 @@ xfs_inode_buf_verify(
 		unlinked_ino = be32_to_cpu(dip->di_next_unlinked);
 		di_ok = dip->di_magic == cpu_to_be16(XFS_DINODE_MAGIC) &&
 			xfs_dinode_good_version(mp, dip->di_version) &&
-			(unlinked_ino == NULLAGINO ||
-			 xfs_verify_agino(mp, agno, unlinked_ino));
+			xfs_verify_agino_or_null(mp, agno, unlinked_ino);
 		if (unlikely(XFS_TEST_ERROR(!di_ok, mp,
 						XFS_ERRTAG_ITOBP_INOTOBP))) {
 			if (readahead) {
diff --git a/fs/xfs/libxfs/xfs_types.c b/fs/xfs/libxfs/xfs_types.c
index 451608b5a37b..de310712dd6d 100644
--- a/fs/xfs/libxfs/xfs_types.c
+++ b/fs/xfs/libxfs/xfs_types.c
@@ -115,6 +115,19 @@ xfs_verify_agino(
 	return agino >= first && agino <= last;
 }
 
+/*
+ * Verify that an AG inode number pointer neither points outside the AG
+ * nor points at static metadata, or is NULLAGINO.
+ */
+bool
+xfs_verify_agino_or_null(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno,
+	xfs_agino_t		agino)
+{
+	return agino == NULLAGINO || xfs_verify_agino(mp, agno, agino);
+}
+
 /*
  * Verify that an FS inode number pointer neither points outside the
  * filesystem nor points at static AG metadata.
diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
index 704b4f308780..c5a25403b4db 100644
--- a/fs/xfs/libxfs/xfs_types.h
+++ b/fs/xfs/libxfs/xfs_types.h
@@ -183,6 +183,8 @@ void xfs_agino_range(struct xfs_mount *mp, xfs_agnumber_t agno,
 		xfs_agino_t *first, xfs_agino_t *last);
 bool xfs_verify_agino(struct xfs_mount *mp, xfs_agnumber_t agno,
 		xfs_agino_t agino);
+bool xfs_verify_agino_or_null(struct xfs_mount *mp, xfs_agnumber_t agno,
+		xfs_agino_t agino);
 bool xfs_verify_ino(struct xfs_mount *mp, xfs_ino_t ino);
 bool xfs_internal_inum(struct xfs_mount *mp, xfs_ino_t ino);
 bool xfs_verify_dir_ino(struct xfs_mount *mp, xfs_ino_t ino);
diff --git a/fs/xfs/scrub/agheader.c b/fs/xfs/scrub/agheader.c
index 90955ab1e895..9d4e8293d37e 100644
--- a/fs/xfs/scrub/agheader.c
+++ b/fs/xfs/scrub/agheader.c
@@ -864,19 +864,17 @@ xchk_agi(
 
 	/* Check inode pointers */
 	agino = be32_to_cpu(agi->agi_newino);
-	if (agino != NULLAGINO && !xfs_verify_agino(mp, agno, agino))
+	if (!xfs_verify_agino_or_null(mp, agno, agino))
 		xchk_block_set_corrupt(sc, sc->sa.agi_bp);
 
 	agino = be32_to_cpu(agi->agi_dirino);
-	if (agino != NULLAGINO && !xfs_verify_agino(mp, agno, agino))
+	if (!xfs_verify_agino_or_null(mp, agno, agino))
 		xchk_block_set_corrupt(sc, sc->sa.agi_bp);
 
 	/* Check unlinked inode buckets */
 	for (i = 0; i < XFS_AGI_UNLINKED_BUCKETS; i++) {
 		agino = be32_to_cpu(agi->agi_unlinked[i]);
-		if (agino == NULLAGINO)
-			continue;
-		if (!xfs_verify_agino(mp, agno, agino))
+		if (!xfs_verify_agino_or_null(mp, agno, agino))
 			xchk_block_set_corrupt(sc, sc->sa.agi_bp);
 	}
 

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 3/9] xfs: refactor AGI unlinked bucket updates
  2019-02-06 16:54 [PATCH v3 0/9] xfs: incore unlinked list Darrick J. Wong
  2019-02-06 16:54 ` [PATCH 1/9] xfs: clean up iunlink functions Darrick J. Wong
  2019-02-06 16:54 ` [PATCH 2/9] xfs: add xfs_verify_agino_or_null helper Darrick J. Wong
@ 2019-02-06 16:54 ` Darrick J. Wong
  2019-02-06 16:55 ` [PATCH 4/9] xfs: strengthen AGI unlinked inode bucket pointer checks Darrick J. Wong
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Darrick J. Wong @ 2019-02-06 16:54 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, Brian Foster, Christoph Hellwig

From: Darrick J. Wong <darrick.wong@oracle.com>

Split the AGI unlinked bucket updates into a separate function.  No
functional changes.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_inode.c |   65 ++++++++++++++++++++++++++++++++++++----------------
 fs/xfs/xfs_trace.h |   26 +++++++++++++++++++++
 2 files changed, 71 insertions(+), 20 deletions(-)


diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 8a39d80f4863..f1de80e8b9a6 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1906,6 +1906,43 @@ xfs_inactive(
 	xfs_qm_dqdetach(ip);
 }
 
+/*
+ * Point the AGI unlinked bucket at an inode and log the results.  The caller
+ * is responsible for validating the old value.
+ */
+STATIC int
+xfs_iunlink_update_bucket(
+	struct xfs_trans	*tp,
+	xfs_agnumber_t		agno,
+	struct xfs_buf		*agibp,
+	unsigned int		bucket_index,
+	xfs_agino_t		new_agino)
+{
+	struct xfs_agi		*agi = XFS_BUF_TO_AGI(agibp);
+	xfs_agino_t		old_value;
+	int			offset;
+
+	ASSERT(xfs_verify_agino_or_null(tp->t_mountp, agno, new_agino));
+
+	old_value = be32_to_cpu(agi->agi_unlinked[bucket_index]);
+	trace_xfs_iunlink_update_bucket(tp->t_mountp, agno, bucket_index,
+			old_value, new_agino);
+
+	/*
+	 * We should never find the head of the list already set to the value
+	 * passed in because either we're adding or removing ourselves from the
+	 * head of the list.
+	 */
+	if (old_value == new_agino)
+		return -EFSCORRUPTED;
+
+	agi->agi_unlinked[bucket_index] = cpu_to_be32(new_agino);
+	offset = offsetof(struct xfs_agi, agi_unlinked) +
+			(sizeof(xfs_agino_t) * bucket_index);
+	xfs_trans_log_buf(tp, agibp, offset, offset + sizeof(xfs_agino_t) - 1);
+	return 0;
+}
+
 /*
  * This is called when the inode's link count goes to 0 or we are creating a
  * tmpfile via O_TMPFILE. In the case of a tmpfile, @ignore_linkcount will be
@@ -1973,16 +2010,8 @@ xfs_iunlink(
 		xfs_inobp_check(mp, ibp);
 	}
 
-	/*
-	 * Point the bucket head pointer at the inode being inserted.
-	 */
-	ASSERT(agino != 0);
-	agi->agi_unlinked[bucket_index] = cpu_to_be32(agino);
-	offset = offsetof(xfs_agi_t, agi_unlinked) +
-		(sizeof(xfs_agino_t) * bucket_index);
-	xfs_trans_log_buf(tp, agibp, offset,
-			  (offset + sizeof(xfs_agino_t) - 1));
-	return 0;
+	/* Point the head of the list to point to this inode. */
+	return xfs_iunlink_update_bucket(tp, agno, agibp, bucket_index, agino);
 }
 
 /*
@@ -2058,16 +2087,12 @@ xfs_iunlink_remove(
 		} else {
 			xfs_trans_brelse(tp, ibp);
 		}
-		/*
-		 * Point the bucket head pointer at the next inode.
-		 */
-		ASSERT(next_agino != 0);
-		ASSERT(next_agino != agino);
-		agi->agi_unlinked[bucket_index] = cpu_to_be32(next_agino);
-		offset = offsetof(xfs_agi_t, agi_unlinked) +
-			(sizeof(xfs_agino_t) * bucket_index);
-		xfs_trans_log_buf(tp, agibp, offset,
-				  (offset + sizeof(xfs_agino_t) - 1));
+
+		/* Point the head of the list to the next unlinked inode. */
+		error = xfs_iunlink_update_bucket(tp, agno, agibp, bucket_index,
+				next_agino);
+		if (error)
+			return error;
 	} else {
 		/*
 		 * We need to search the list for the inode being freed.
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 6fcc893dfc91..c10478e7e49a 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -3371,6 +3371,32 @@ DEFINE_TRANS_EVENT(xfs_trans_roll);
 DEFINE_TRANS_EVENT(xfs_trans_add_item);
 DEFINE_TRANS_EVENT(xfs_trans_free_items);
 
+TRACE_EVENT(xfs_iunlink_update_bucket,
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, unsigned int bucket,
+		 xfs_agino_t old_ptr, xfs_agino_t new_ptr),
+	TP_ARGS(mp, agno, bucket, old_ptr, new_ptr),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(unsigned int, bucket)
+		__field(xfs_agino_t, old_ptr)
+		__field(xfs_agino_t, new_ptr)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->agno = agno;
+		__entry->bucket = bucket;
+		__entry->old_ptr = old_ptr;
+		__entry->new_ptr = new_ptr;
+	),
+	TP_printk("dev %d:%d agno %u bucket %u old 0x%x new 0x%x",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->agno,
+		  __entry->bucket,
+		  __entry->old_ptr,
+		  __entry->new_ptr)
+);
+
 #endif /* _TRACE_XFS_H */
 
 #undef TRACE_INCLUDE_PATH

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 4/9] xfs: strengthen AGI unlinked inode bucket pointer checks
  2019-02-06 16:54 [PATCH v3 0/9] xfs: incore unlinked list Darrick J. Wong
                   ` (2 preceding siblings ...)
  2019-02-06 16:54 ` [PATCH 3/9] xfs: refactor AGI unlinked bucket updates Darrick J. Wong
@ 2019-02-06 16:55 ` Darrick J. Wong
  2019-02-06 16:55 ` [PATCH 5/9] xfs: refactor inode unlinked pointer update functions Darrick J. Wong
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Darrick J. Wong @ 2019-02-06 16:55 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, Brian Foster, Christoph Hellwig

From: Darrick J. Wong <darrick.wong@oracle.com>

Strengthen our checking of the AGI unlinked pointers when we start to
use them for updating the metadata.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_inode.c |   25 ++++++++++++++-----------
 1 file changed, 14 insertions(+), 11 deletions(-)


diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index f1de80e8b9a6..ba19cfd52b5e 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1963,6 +1963,7 @@ xfs_iunlink(
 	struct xfs_dinode	*dip;
 	struct xfs_buf		*agibp;
 	struct xfs_buf		*ibp;
+	xfs_agino_t		next_agino;
 	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
 	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
 	short			bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
@@ -1978,13 +1979,16 @@ xfs_iunlink(
 	agi = XFS_BUF_TO_AGI(agibp);
 
 	/*
-	 * Get the index into the agi hash table for the
-	 * list this inode will go on.
+	 * Get the index into the agi hash table for the list this inode will
+	 * go on.  Make sure the pointer isn't garbage and that this inode
+	 * isn't already on the list.
 	 */
-	ASSERT(agi->agi_unlinked[bucket_index]);
-	ASSERT(be32_to_cpu(agi->agi_unlinked[bucket_index]) != agino);
+	next_agino = be32_to_cpu(agi->agi_unlinked[bucket_index]);
+	if (next_agino == agino ||
+	    !xfs_verify_agino_or_null(mp, agno, next_agino))
+		return -EFSCORRUPTED;
 
-	if (agi->agi_unlinked[bucket_index] != cpu_to_be32(NULLAGINO)) {
+	if (next_agino != NULLAGINO) {
 		/*
 		 * There is already another inode in the bucket we need
 		 * to add ourselves to.  Add us at the front of the list.
@@ -2045,17 +2049,17 @@ xfs_iunlink_remove(
 	agi = XFS_BUF_TO_AGI(agibp);
 
 	/*
-	 * Get the index into the agi hash table for the
-	 * list this inode will go on.
+	 * Get the index into the agi hash table for the list this inode will
+	 * go on.  Make sure the head pointer isn't garbage.
 	 */
-	if (!xfs_verify_agino(mp, agno,
-			be32_to_cpu(agi->agi_unlinked[bucket_index]))) {
+	next_agino = be32_to_cpu(agi->agi_unlinked[bucket_index]);
+	if (!xfs_verify_agino(mp, agno, next_agino)) {
 		XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
 				agi, sizeof(*agi));
 		return -EFSCORRUPTED;
 	}
 
-	if (be32_to_cpu(agi->agi_unlinked[bucket_index]) == agino) {
+	if (next_agino == agino) {
 		/*
 		 * We're at the head of the list.  Get the inode's on-disk
 		 * buffer to see if there is anyone after us on the list.
@@ -2097,7 +2101,6 @@ xfs_iunlink_remove(
 		/*
 		 * We need to search the list for the inode being freed.
 		 */
-		next_agino = be32_to_cpu(agi->agi_unlinked[bucket_index]);
 		last_ibp = NULL;
 		while (next_agino != agino) {
 			struct xfs_imap	imap;

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 5/9] xfs: refactor inode unlinked pointer update functions
  2019-02-06 16:54 [PATCH v3 0/9] xfs: incore unlinked list Darrick J. Wong
                   ` (3 preceding siblings ...)
  2019-02-06 16:55 ` [PATCH 4/9] xfs: strengthen AGI unlinked inode bucket pointer checks Darrick J. Wong
@ 2019-02-06 16:55 ` Darrick J. Wong
  2019-02-06 16:55 ` [PATCH 6/9] xfs: refactor unlinked list search and mapping to a separate function Darrick J. Wong
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Darrick J. Wong @ 2019-02-06 16:55 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, Brian Foster, Christoph Hellwig

From: Darrick J. Wong <darrick.wong@oracle.com>

Hoist the functions that update an inode's unlinked pointer updates into
a helper.  No functional changes.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_inode.c |  191 +++++++++++++++++++++++++++-------------------------
 fs/xfs/xfs_trace.h |   26 +++++++
 2 files changed, 125 insertions(+), 92 deletions(-)


diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index ba19cfd52b5e..eb51fa33f91a 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1943,6 +1943,85 @@ xfs_iunlink_update_bucket(
 	return 0;
 }
 
+/* Set an on-disk inode's next_unlinked pointer. */
+STATIC void
+xfs_iunlink_update_dinode(
+	struct xfs_trans	*tp,
+	xfs_agnumber_t		agno,
+	xfs_agino_t		agino,
+	struct xfs_buf		*ibp,
+	struct xfs_dinode	*dip,
+	struct xfs_imap		*imap,
+	xfs_agino_t		next_agino)
+{
+	struct xfs_mount	*mp = tp->t_mountp;
+	int			offset;
+
+	ASSERT(xfs_verify_agino_or_null(mp, agno, next_agino));
+
+	trace_xfs_iunlink_update_dinode(mp, agno, agino,
+			be32_to_cpu(dip->di_next_unlinked), next_agino);
+
+	dip->di_next_unlinked = cpu_to_be32(next_agino);
+	offset = imap->im_boffset +
+			offsetof(struct xfs_dinode, di_next_unlinked);
+
+	/* need to recalc the inode CRC if appropriate */
+	xfs_dinode_calc_crc(mp, dip);
+	xfs_trans_inode_buf(tp, ibp);
+	xfs_trans_log_buf(tp, ibp, offset, offset + sizeof(xfs_agino_t) - 1);
+	xfs_inobp_check(mp, ibp);
+}
+
+/* Set an in-core inode's unlinked pointer and return the old value. */
+STATIC int
+xfs_iunlink_update_inode(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip,
+	xfs_agnumber_t		agno,
+	xfs_agino_t		next_agino,
+	xfs_agino_t		*old_next_agino)
+{
+	struct xfs_mount	*mp = tp->t_mountp;
+	struct xfs_dinode	*dip;
+	struct xfs_buf		*ibp;
+	xfs_agino_t		old_value;
+	int			error;
+
+	ASSERT(xfs_verify_agino_or_null(mp, agno, next_agino));
+
+	error = xfs_imap_to_bp(mp, tp, &ip->i_imap, &dip, &ibp, 0, 0);
+	if (error)
+		return error;
+
+	/* Make sure the old pointer isn't garbage. */
+	old_value = be32_to_cpu(dip->di_next_unlinked);
+	if (!xfs_verify_agino_or_null(mp, agno, old_value)) {
+		error = -EFSCORRUPTED;
+		goto out;
+	}
+
+	/*
+	 * Since we're updating a linked list, we should never find that the
+	 * current pointer is the same as the new value, unless we're
+	 * terminating the list.
+	 */
+	*old_next_agino = old_value;
+	if (old_value == next_agino) {
+		if (next_agino != NULLAGINO)
+			error = -EFSCORRUPTED;
+		goto out;
+	}
+
+	/* Ok, update the new pointer. */
+	xfs_iunlink_update_dinode(tp, agno, XFS_INO_TO_AGINO(mp, ip->i_ino),
+			ibp, dip, &ip->i_imap, next_agino);
+	return 0;
+out:
+	xfs_trans_brelse(tp, ibp);
+	return error;
+}
+
 /*
  * This is called when the inode's link count goes to 0 or we are creating a
  * tmpfile via O_TMPFILE. In the case of a tmpfile, @ignore_linkcount will be
@@ -1960,14 +2039,11 @@ xfs_iunlink(
 {
 	struct xfs_mount	*mp = tp->t_mountp;
 	struct xfs_agi		*agi;
-	struct xfs_dinode	*dip;
 	struct xfs_buf		*agibp;
-	struct xfs_buf		*ibp;
 	xfs_agino_t		next_agino;
 	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
 	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
 	short			bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
-	int			offset;
 	int			error;
 
 	ASSERT(VFS_I(ip)->i_mode != 0);
@@ -1989,29 +2065,17 @@ xfs_iunlink(
 		return -EFSCORRUPTED;
 
 	if (next_agino != NULLAGINO) {
+		xfs_agino_t	old_agino;
+
 		/*
-		 * There is already another inode in the bucket we need
-		 * to add ourselves to.  Add us at the front of the list.
-		 * Here we put the head pointer into our next pointer,
-		 * and then we fall through to point the head at us.
+		 * There is already another inode in the bucket, so point this
+		 * inode to the current head of the list.
 		 */
-		error = xfs_imap_to_bp(mp, tp, &ip->i_imap, &dip, &ibp,
-				       0, 0);
+		error = xfs_iunlink_update_inode(tp, ip, agno, next_agino,
+				&old_agino);
 		if (error)
 			return error;
-
-		ASSERT(dip->di_next_unlinked == cpu_to_be32(NULLAGINO));
-		dip->di_next_unlinked = agi->agi_unlinked[bucket_index];
-		offset = ip->i_imap.im_boffset +
-			offsetof(xfs_dinode_t, di_next_unlinked);
-
-		/* need to recalc the inode CRC if appropriate */
-		xfs_dinode_calc_crc(mp, dip);
-
-		xfs_trans_inode_buf(tp, ibp);
-		xfs_trans_log_buf(tp, ibp, offset,
-				  (offset + sizeof(xfs_agino_t) - 1));
-		xfs_inobp_check(mp, ibp);
+		ASSERT(old_agino == NULLAGINO);
 	}
 
 	/* Point the head of the list to point to this inode. */
@@ -2028,9 +2092,7 @@ xfs_iunlink_remove(
 {
 	struct xfs_mount	*mp = tp->t_mountp;
 	struct xfs_agi		*agi;
-	struct xfs_dinode	*dip;
 	struct xfs_buf		*agibp;
-	struct xfs_buf		*ibp;
 	struct xfs_buf		*last_ibp;
 	struct xfs_dinode	*last_dip = NULL;
 	xfs_ino_t		next_ino;
@@ -2038,8 +2100,6 @@ xfs_iunlink_remove(
 	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
 	xfs_agino_t		next_agino;
 	short			bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
-	int			offset;
-	int			last_offset = 0;
 	int			error;
 
 	/* Get the agi buffer first.  It ensures lock ordering on the list. */
@@ -2063,34 +2123,11 @@ xfs_iunlink_remove(
 		/*
 		 * We're at the head of the list.  Get the inode's on-disk
 		 * buffer to see if there is anyone after us on the list.
-		 * Only modify our next pointer if it is not already NULLAGINO.
-		 * This saves us the overhead of dealing with the buffer when
-		 * there is no need to change it.
 		 */
-		error = xfs_imap_to_bp(mp, tp, &ip->i_imap, &dip, &ibp,
-				       0, 0);
-		if (error) {
-			xfs_warn(mp, "%s: xfs_imap_to_bp returned error %d.",
-				__func__, error);
+		error = xfs_iunlink_update_inode(tp, ip, agno, NULLAGINO,
+				&next_agino);
+		if (error)
 			return error;
-		}
-		next_agino = be32_to_cpu(dip->di_next_unlinked);
-		ASSERT(next_agino != 0);
-		if (next_agino != NULLAGINO) {
-			dip->di_next_unlinked = cpu_to_be32(NULLAGINO);
-			offset = ip->i_imap.im_boffset +
-				offsetof(xfs_dinode_t, di_next_unlinked);
-
-			/* need to recalc the inode CRC if appropriate */
-			xfs_dinode_calc_crc(mp, dip);
-
-			xfs_trans_inode_buf(tp, ibp);
-			xfs_trans_log_buf(tp, ibp, offset,
-					  (offset + sizeof(xfs_agino_t) - 1));
-			xfs_inobp_check(mp, ibp);
-		} else {
-			xfs_trans_brelse(tp, ibp);
-		}
 
 		/* Point the head of the list to the next unlinked inode. */
 		error = xfs_iunlink_update_bucket(tp, agno, agibp, bucket_index,
@@ -2098,13 +2135,14 @@ xfs_iunlink_remove(
 		if (error)
 			return error;
 	} else {
+		struct xfs_imap	imap;
+		xfs_agino_t	prev_agino;
+
 		/*
 		 * We need to search the list for the inode being freed.
 		 */
 		last_ibp = NULL;
 		while (next_agino != agino) {
-			struct xfs_imap	imap;
-
 			if (last_ibp)
 				xfs_trans_brelse(tp, last_ibp);
 
@@ -2128,7 +2166,7 @@ xfs_iunlink_remove(
 				return error;
 			}
 
-			last_offset = imap.im_boffset;
+			prev_agino = next_agino;
 			next_agino = be32_to_cpu(last_dip->di_next_unlinked);
 			if (!xfs_verify_agino(mp, agno, next_agino)) {
 				XFS_CORRUPTION_ERROR(__func__,
@@ -2142,45 +2180,14 @@ xfs_iunlink_remove(
 		 * Now last_ibp points to the buffer previous to us on the
 		 * unlinked list.  Pull us from the list.
 		 */
-		error = xfs_imap_to_bp(mp, tp, &ip->i_imap, &dip, &ibp,
-				       0, 0);
-		if (error) {
-			xfs_warn(mp, "%s: xfs_imap_to_bp(2) returned error %d.",
-				__func__, error);
+		error = xfs_iunlink_update_inode(tp, ip, agno, NULLAGINO,
+				&next_agino);
+		if (error)
 			return error;
-		}
-		next_agino = be32_to_cpu(dip->di_next_unlinked);
-		ASSERT(next_agino != 0);
-		ASSERT(next_agino != agino);
-		if (next_agino != NULLAGINO) {
-			dip->di_next_unlinked = cpu_to_be32(NULLAGINO);
-			offset = ip->i_imap.im_boffset +
-				offsetof(xfs_dinode_t, di_next_unlinked);
-
-			/* need to recalc the inode CRC if appropriate */
-			xfs_dinode_calc_crc(mp, dip);
-
-			xfs_trans_inode_buf(tp, ibp);
-			xfs_trans_log_buf(tp, ibp, offset,
-					  (offset + sizeof(xfs_agino_t) - 1));
-			xfs_inobp_check(mp, ibp);
-		} else {
-			xfs_trans_brelse(tp, ibp);
-		}
-		/*
-		 * Point the previous inode on the list to the next inode.
-		 */
-		last_dip->di_next_unlinked = cpu_to_be32(next_agino);
-		ASSERT(next_agino != 0);
-		offset = last_offset + offsetof(xfs_dinode_t, di_next_unlinked);
-
-		/* need to recalc the inode CRC if appropriate */
-		xfs_dinode_calc_crc(mp, last_dip);
 
-		xfs_trans_inode_buf(tp, last_ibp);
-		xfs_trans_log_buf(tp, last_ibp, offset,
-				  (offset + sizeof(xfs_agino_t) - 1));
-		xfs_inobp_check(mp, last_ibp);
+		/* Point the previous inode on the list to the next inode. */
+		xfs_iunlink_update_dinode(tp, agno, prev_agino, last_ibp,
+				last_dip, &imap, next_agino);
 	}
 	return 0;
 }
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index c10478e7e49a..fbec8f0e1a9a 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -3397,6 +3397,32 @@ TRACE_EVENT(xfs_iunlink_update_bucket,
 		  __entry->new_ptr)
 );
 
+TRACE_EVENT(xfs_iunlink_update_dinode,
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, xfs_agino_t agino,
+		 xfs_agino_t old_ptr, xfs_agino_t new_ptr),
+	TP_ARGS(mp, agno, agino, old_ptr, new_ptr),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_agino_t, agino)
+		__field(xfs_agino_t, old_ptr)
+		__field(xfs_agino_t, new_ptr)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->agno = agno;
+		__entry->agino = agino;
+		__entry->old_ptr = old_ptr;
+		__entry->new_ptr = new_ptr;
+	),
+	TP_printk("dev %d:%d agno %u agino 0x%x old 0x%x new 0x%x",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->agno,
+		  __entry->agino,
+		  __entry->old_ptr,
+		  __entry->new_ptr)
+);
+
 #endif /* _TRACE_XFS_H */
 
 #undef TRACE_INCLUDE_PATH

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 6/9] xfs: refactor unlinked list search and mapping to a separate function
  2019-02-06 16:54 [PATCH v3 0/9] xfs: incore unlinked list Darrick J. Wong
                   ` (4 preceding siblings ...)
  2019-02-06 16:55 ` [PATCH 5/9] xfs: refactor inode unlinked pointer update functions Darrick J. Wong
@ 2019-02-06 16:55 ` Darrick J. Wong
  2019-02-07 14:28   ` Brian Foster
  2019-02-08  7:57   ` Christoph Hellwig
  2019-02-06 16:55 ` [PATCH 7/9] xfs: refactor inode update in iunlink_remove Darrick J. Wong
                   ` (2 subsequent siblings)
  8 siblings, 2 replies; 22+ messages in thread
From: Darrick J. Wong @ 2019-02-06 16:55 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

There's a loop that searches an unlinked bucket list to find the inode
that points to a given inode.  Hoist this into a separate function;
later we'll use our iunlink backref cache to bypass the slow list
operation.  No functional changes.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_inode.c |  135 +++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 96 insertions(+), 39 deletions(-)


diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index eb51fa33f91a..161d8da459aa 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2082,6 +2082,96 @@ xfs_iunlink(
 	return xfs_iunlink_update_bucket(tp, agno, agibp, bucket_index, agino);
 }
 
+/* Return the imap, dinode pointer, and buffer for an inode. */
+STATIC int
+xfs_iunlink_map_ino(
+	struct xfs_trans	*tp,
+	xfs_agnumber_t		agno,
+	xfs_agino_t		agino,
+	struct xfs_imap		*imap,
+	struct xfs_dinode	**dipp,
+	struct xfs_buf		**bpp)
+{
+	struct xfs_mount	*mp = tp->t_mountp;
+	int			error;
+
+	imap->im_blkno = 0;
+	error = xfs_imap(mp, tp, XFS_AGINO_TO_INO(mp, agno, agino), imap, 0);
+	if (error) {
+		xfs_warn(mp, "%s: xfs_imap returned error %d.",
+				__func__, error);
+		return error;
+	}
+
+	error = xfs_imap_to_bp(mp, tp, imap, dipp, bpp, 0, 0);
+	if (error) {
+		xfs_warn(mp, "%s: xfs_imap_to_bp returned error %d.",
+				__func__, error);
+		return error;
+	}
+
+	return 0;
+}
+
+/*
+ * Walk the unlinked chain from @head_agino until we find the inode that
+ * points to @target_agino.  Return the inode number, map, dinode pointer,
+ * and inode cluster buffer of that inode as @agino, @imap, @dipp, and @bpp.
+ *
+ * @tp, @pag, @head_agino, and @target_agino are input parameters.
+ * @agino, @imap, @dipp, and @bpp are all output parameters.
+ *
+ * Do not call this function if @target_agino is the head of the list.
+ */
+STATIC int
+xfs_iunlink_map_prev(
+	struct xfs_trans	*tp,
+	xfs_agnumber_t		agno,
+	xfs_agino_t		head_agino,
+	xfs_agino_t		target_agino,
+	xfs_agino_t		*agino,
+	struct xfs_imap		*imap,
+	struct xfs_dinode	**dipp,
+	struct xfs_buf		**bpp)
+{
+	struct xfs_mount	*mp = tp->t_mountp;
+	xfs_agino_t		next_agino;
+	int			error;
+
+	ASSERT(head_agino != target_agino);
+
+	next_agino = head_agino;
+	while (next_agino != target_agino) {
+		xfs_agino_t	unlinked_agino;
+
+		if (*bpp)
+			xfs_trans_brelse(tp, *bpp);
+
+		*agino = next_agino;
+		error = xfs_iunlink_map_ino(tp, agno, next_agino, imap, dipp,
+				bpp);
+		if (error)
+			return error;
+
+		unlinked_agino = be32_to_cpu((*dipp)->di_next_unlinked);
+		/*
+		 * Make sure this pointer is valid and isn't an obvious
+		 * infinite loop.
+		 */
+		if (!xfs_verify_agino(mp, agno, unlinked_agino) ||
+		    next_agino == unlinked_agino) {
+			XFS_CORRUPTION_ERROR(__func__,
+					XFS_ERRLEVEL_LOW, mp,
+					*dipp, sizeof(**dipp));
+			error = -EFSCORRUPTED;
+			return error;
+		}
+		next_agino = unlinked_agino;
+	}
+
+	return 0;
+}
+
 /*
  * Pull the on-disk inode from the AGI unlinked list.
  */
@@ -2093,9 +2183,8 @@ xfs_iunlink_remove(
 	struct xfs_mount	*mp = tp->t_mountp;
 	struct xfs_agi		*agi;
 	struct xfs_buf		*agibp;
-	struct xfs_buf		*last_ibp;
+	struct xfs_buf		*last_ibp = NULL;
 	struct xfs_dinode	*last_dip = NULL;
-	xfs_ino_t		next_ino;
 	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
 	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
 	xfs_agino_t		next_agino;
@@ -2138,43 +2227,11 @@ xfs_iunlink_remove(
 		struct xfs_imap	imap;
 		xfs_agino_t	prev_agino;
 
-		/*
-		 * We need to search the list for the inode being freed.
-		 */
-		last_ibp = NULL;
-		while (next_agino != agino) {
-			if (last_ibp)
-				xfs_trans_brelse(tp, last_ibp);
-
-			imap.im_blkno = 0;
-			next_ino = XFS_AGINO_TO_INO(mp, agno, next_agino);
-
-			error = xfs_imap(mp, tp, next_ino, &imap, 0);
-			if (error) {
-				xfs_warn(mp,
-	"%s: xfs_imap returned error %d.",
-					 __func__, error);
-				return error;
-			}
-
-			error = xfs_imap_to_bp(mp, tp, &imap, &last_dip,
-					       &last_ibp, 0, 0);
-			if (error) {
-				xfs_warn(mp,
-	"%s: xfs_imap_to_bp returned error %d.",
-					__func__, error);
-				return error;
-			}
-
-			prev_agino = next_agino;
-			next_agino = be32_to_cpu(last_dip->di_next_unlinked);
-			if (!xfs_verify_agino(mp, agno, next_agino)) {
-				XFS_CORRUPTION_ERROR(__func__,
-						XFS_ERRLEVEL_LOW, mp,
-						last_dip, sizeof(*last_dip));
-				return -EFSCORRUPTED;
-			}
-		}
+		/* We need to search the list for the inode being freed. */
+		error = xfs_iunlink_map_prev(tp, agno, next_agino, agino,
+				&prev_agino, &imap, &last_dip, &last_ibp);
+		if (error)
+			return error;
 
 		/*
 		 * Now last_ibp points to the buffer previous to us on the

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 7/9] xfs: refactor inode update in iunlink_remove
  2019-02-06 16:54 [PATCH v3 0/9] xfs: incore unlinked list Darrick J. Wong
                   ` (5 preceding siblings ...)
  2019-02-06 16:55 ` [PATCH 6/9] xfs: refactor unlinked list search and mapping to a separate function Darrick J. Wong
@ 2019-02-06 16:55 ` Darrick J. Wong
  2019-02-06 16:55 ` [PATCH 8/9] xfs: add tracepoints for high level iunlink operations Darrick J. Wong
  2019-02-06 16:55 ` [PATCH 9/9] xfs: cache unlinked pointers in an rhashtable Darrick J. Wong
  8 siblings, 0 replies; 22+ messages in thread
From: Darrick J. Wong @ 2019-02-06 16:55 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, Brian Foster, Christoph Hellwig

From: Darrick J. Wong <darrick.wong@oracle.com>

In xfs_iunlink_remove we have two identical calls to
xfs_iunlink_update_inode, so move it out of the if statement to simplify
the code some more.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_inode.c |   34 +++++++++++++---------------------
 1 file changed, 13 insertions(+), 21 deletions(-)


diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 161d8da459aa..aeca536e3e75 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2188,6 +2188,7 @@ xfs_iunlink_remove(
 	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
 	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
 	xfs_agino_t		next_agino;
+	xfs_agino_t		head_agino;
 	short			bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
 	int			error;
 
@@ -2201,23 +2202,23 @@ xfs_iunlink_remove(
 	 * Get the index into the agi hash table for the list this inode will
 	 * go on.  Make sure the head pointer isn't garbage.
 	 */
-	next_agino = be32_to_cpu(agi->agi_unlinked[bucket_index]);
-	if (!xfs_verify_agino(mp, agno, next_agino)) {
+	head_agino = be32_to_cpu(agi->agi_unlinked[bucket_index]);
+	if (!xfs_verify_agino(mp, agno, head_agino)) {
 		XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
 				agi, sizeof(*agi));
 		return -EFSCORRUPTED;
 	}
 
-	if (next_agino == agino) {
-		/*
-		 * We're at the head of the list.  Get the inode's on-disk
-		 * buffer to see if there is anyone after us on the list.
-		 */
-		error = xfs_iunlink_update_inode(tp, ip, agno, NULLAGINO,
-				&next_agino);
-		if (error)
-			return error;
+	/*
+	 * Set our inode's next_unlinked pointer to NULL and then return
+	 * the old pointer value so that we can update whatever was previous
+	 * to us in the list to point to whatever was next in the list.
+	 */
+	error = xfs_iunlink_update_inode(tp, ip, agno, NULLAGINO, &next_agino);
+	if (error)
+		return error;
 
+	if (head_agino == agino) {
 		/* Point the head of the list to the next unlinked inode. */
 		error = xfs_iunlink_update_bucket(tp, agno, agibp, bucket_index,
 				next_agino);
@@ -2228,20 +2229,11 @@ xfs_iunlink_remove(
 		xfs_agino_t	prev_agino;
 
 		/* We need to search the list for the inode being freed. */
-		error = xfs_iunlink_map_prev(tp, agno, next_agino, agino,
+		error = xfs_iunlink_map_prev(tp, agno, head_agino, agino,
 				&prev_agino, &imap, &last_dip, &last_ibp);
 		if (error)
 			return error;
 
-		/*
-		 * Now last_ibp points to the buffer previous to us on the
-		 * unlinked list.  Pull us from the list.
-		 */
-		error = xfs_iunlink_update_inode(tp, ip, agno, NULLAGINO,
-				&next_agino);
-		if (error)
-			return error;
-
 		/* Point the previous inode on the list to the next inode. */
 		xfs_iunlink_update_dinode(tp, agno, prev_agino, last_ibp,
 				last_dip, &imap, next_agino);

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 8/9] xfs: add tracepoints for high level iunlink operations
  2019-02-06 16:54 [PATCH v3 0/9] xfs: incore unlinked list Darrick J. Wong
                   ` (6 preceding siblings ...)
  2019-02-06 16:55 ` [PATCH 7/9] xfs: refactor inode update in iunlink_remove Darrick J. Wong
@ 2019-02-06 16:55 ` Darrick J. Wong
  2019-02-06 16:55 ` [PATCH 9/9] xfs: cache unlinked pointers in an rhashtable Darrick J. Wong
  8 siblings, 0 replies; 22+ messages in thread
From: Darrick J. Wong @ 2019-02-06 16:55 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, Brian Foster, Christoph Hellwig

From: Darrick J. Wong <darrick.wong@oracle.com>

Add tracepoints so we can associate high level operations with low level
updates.  No functional changes.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_inode.c |    3 +++
 fs/xfs/xfs_trace.h |   25 +++++++++++++++++++++++++
 2 files changed, 28 insertions(+)


diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index aeca536e3e75..53a9c8c26d18 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2047,6 +2047,7 @@ xfs_iunlink(
 	int			error;
 
 	ASSERT(VFS_I(ip)->i_mode != 0);
+	trace_xfs_iunlink(ip);
 
 	/* Get the agi buffer first.  It ensures lock ordering on the list. */
 	error = xfs_read_agi(mp, tp, agno, &agibp);
@@ -2192,6 +2193,8 @@ xfs_iunlink_remove(
 	short			bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
 	int			error;
 
+	trace_xfs_iunlink_remove(ip);
+
 	/* Get the agi buffer first.  It ensures lock ordering on the list. */
 	error = xfs_read_agi(mp, tp, agno, &agibp);
 	if (error)
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index fbec8f0e1a9a..a6e384a642b1 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -3423,6 +3423,31 @@ TRACE_EVENT(xfs_iunlink_update_dinode,
 		  __entry->new_ptr)
 );
 
+DECLARE_EVENT_CLASS(xfs_ag_inode_class,
+	TP_PROTO(struct xfs_inode *ip),
+	TP_ARGS(ip),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_agino_t, agino)
+	),
+	TP_fast_assign(
+		__entry->dev = VFS_I(ip)->i_sb->s_dev;
+		__entry->agno = XFS_INO_TO_AGNO(ip->i_mount, ip->i_ino);
+		__entry->agino = XFS_INO_TO_AGINO(ip->i_mount, ip->i_ino);
+	),
+	TP_printk("dev %d:%d agno %u agino %u",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->agno, __entry->agino)
+)
+
+#define DEFINE_AGINODE_EVENT(name) \
+DEFINE_EVENT(xfs_ag_inode_class, name, \
+	TP_PROTO(struct xfs_inode *ip), \
+	TP_ARGS(ip))
+DEFINE_AGINODE_EVENT(xfs_iunlink);
+DEFINE_AGINODE_EVENT(xfs_iunlink_remove);
+
 #endif /* _TRACE_XFS_H */
 
 #undef TRACE_INCLUDE_PATH

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 9/9] xfs: cache unlinked pointers in an rhashtable
  2019-02-06 16:54 [PATCH v3 0/9] xfs: incore unlinked list Darrick J. Wong
                   ` (7 preceding siblings ...)
  2019-02-06 16:55 ` [PATCH 8/9] xfs: add tracepoints for high level iunlink operations Darrick J. Wong
@ 2019-02-06 16:55 ` Darrick J. Wong
  2019-02-07 14:31   ` Brian Foster
  2019-02-07 18:24   ` [PATCH v2 " Darrick J. Wong
  8 siblings, 2 replies; 22+ messages in thread
From: Darrick J. Wong @ 2019-02-06 16:55 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Use a rhashtable to cache the unlinked list incore.  This should speed
up unlinked processing considerably when there are a lot of inodes on
the unlinked list because iunlink_remove no longer has to traverse an
entire bucket list to find which inode points to the one being removed.

The incore list structure records "X.next_unlinked = Y" relations, with
the rhashtable using Y to index the records.  This makes finding the
inode X that points to a inode Y very quick.  If our cache fails to find
anything we can always fall back on the old method.

FWIW this drastically reduces the amount of time it takes to remove
inodes from the unlinked list.  I wrote a program to open a lot of
O_TMPFILE files and then close them in the same order, which takes
a very long time if we have to traverse the unlinked lists.  With the
ptach, I see:

+ /d/t/tmpfile/tmpfile
Opened 193531 files in 6.33s.
Closed 193531 files in 5.86s

real    0m12.192s
user    0m0.064s
sys     0m11.619s
+ cd /
+ umount /mnt

real    0m0.050s
user    0m0.004s
sys     0m0.030s

And without the patch:

+ /d/t/tmpfile/tmpfile
Opened 193588 files in 6.35s.
Closed 193588 files in 751.61s

real    12m38.853s
user    0m0.084s
sys     12m34.470s
+ cd /
+ umount /mnt

real    0m0.086s
user    0m0.000s
sys     0m0.060s

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_errortag.h |    4 -
 fs/xfs/xfs_error.c           |    3 
 fs/xfs/xfs_inode.c           |  269 +++++++++++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_inode.h           |    3 
 fs/xfs/xfs_mount.c           |    5 +
 fs/xfs/xfs_mount.h           |    7 +
 fs/xfs/xfs_trace.h           |    1 
 7 files changed, 285 insertions(+), 7 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_errortag.h b/fs/xfs/libxfs/xfs_errortag.h
index 66077a105cbb..79e6c4fb1d8a 100644
--- a/fs/xfs/libxfs/xfs_errortag.h
+++ b/fs/xfs/libxfs/xfs_errortag.h
@@ -54,7 +54,8 @@
 #define XFS_ERRTAG_BUF_LRU_REF				31
 #define XFS_ERRTAG_FORCE_SCRUB_REPAIR			32
 #define XFS_ERRTAG_FORCE_SUMMARY_RECALC			33
-#define XFS_ERRTAG_MAX					34
+#define XFS_ERRTAG_IUNLINK_FALLBACK			34
+#define XFS_ERRTAG_MAX					35
 
 /*
  * Random factors for above tags, 1 means always, 2 means 1/2 time, etc.
@@ -93,5 +94,6 @@
 #define XFS_RANDOM_BUF_LRU_REF				2
 #define XFS_RANDOM_FORCE_SCRUB_REPAIR			1
 #define XFS_RANDOM_FORCE_SUMMARY_RECALC			1
+#define XFS_RANDOM_IUNLINK_FALLBACK			(XFS_RANDOM_DEFAULT/10)
 
 #endif /* __XFS_ERRORTAG_H_ */
diff --git a/fs/xfs/xfs_error.c b/fs/xfs/xfs_error.c
index 9866f542e77b..cc77c11bd0d4 100644
--- a/fs/xfs/xfs_error.c
+++ b/fs/xfs/xfs_error.c
@@ -51,6 +51,7 @@ static unsigned int xfs_errortag_random_default[] = {
 	XFS_RANDOM_BUF_LRU_REF,
 	XFS_RANDOM_FORCE_SCRUB_REPAIR,
 	XFS_RANDOM_FORCE_SUMMARY_RECALC,
+	XFS_RANDOM_IUNLINK_FALLBACK,
 };
 
 struct xfs_errortag_attr {
@@ -159,6 +160,7 @@ XFS_ERRORTAG_ATTR_RW(log_item_pin,	XFS_ERRTAG_LOG_ITEM_PIN);
 XFS_ERRORTAG_ATTR_RW(buf_lru_ref,	XFS_ERRTAG_BUF_LRU_REF);
 XFS_ERRORTAG_ATTR_RW(force_repair,	XFS_ERRTAG_FORCE_SCRUB_REPAIR);
 XFS_ERRORTAG_ATTR_RW(bad_summary,	XFS_ERRTAG_FORCE_SUMMARY_RECALC);
+XFS_ERRORTAG_ATTR_RW(iunlink_fallback,	XFS_ERRTAG_IUNLINK_FALLBACK);
 
 static struct attribute *xfs_errortag_attrs[] = {
 	XFS_ERRORTAG_ATTR_LIST(noerror),
@@ -195,6 +197,7 @@ static struct attribute *xfs_errortag_attrs[] = {
 	XFS_ERRORTAG_ATTR_LIST(buf_lru_ref),
 	XFS_ERRORTAG_ATTR_LIST(force_repair),
 	XFS_ERRORTAG_ATTR_LIST(bad_summary),
+	XFS_ERRORTAG_ATTR_LIST(iunlink_fallback),
 	NULL,
 };
 
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 53a9c8c26d18..fdfcb3a9bac7 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1906,6 +1906,194 @@ xfs_inactive(
 	xfs_qm_dqdetach(ip);
 }
 
+/*
+ * In-Core Unlinked List Lookups
+ * =============================
+ *
+ * Every inode is supposed to be reachable from some other piece of metadata
+ * with the exception of the root directory.  Inodes with a connection to a
+ * file descriptor but not linked from anywhere in the on-disk directory tree
+ * are collectively known as unlinked inodes, though the filesystem itself
+ * maintains links to these inodes so that on-disk metadata are consistent.
+ *
+ * XFS implements a per-AG on-disk hash table of unlinked inodes.  The AGI
+ * header contains a number of buckets that point to an inode, and each inode
+ * record has a pointer to the next inode in the hash chain.  This
+ * singly-linked list causes scaling problems in the iunlink remove function
+ * because we must walk that list to find the inode that points to the inode
+ * being removed from the unlinked hash bucket list.
+ *
+ * What if we modelled the unlinked list as a collection of records capturing
+ * "X.next_unlinked = Y" relations?  If we indexed those records on Y, we'd
+ * have a fast way to look up unlinked list predecessors, which avoids the
+ * slow list walk.  That's exactly what we do here (in-core) with a per-AG
+ * rhashtable.
+ *
+ * Because this is a backref cache, we ignore operational failures since the
+ * iunlink code can fall back to the slow bucket walk.  The only errors that
+ * should bubble out are for obviously incorrect situations.
+ *
+ * All users of the backref cache MUST hold the AGI buffer lock to serialize
+ * access or have otherwise provided for concurrency control.
+ */
+
+/* Capture a "X.next_unlinked = Y" relationship. */
+struct xfs_iunlink {
+	struct rhash_head	iu_rhash_head;
+	xfs_agino_t		iu_agino;		/* X */
+	xfs_agino_t		iu_next_unlinked;	/* Y */
+};
+
+/* Unlinked list predecessor lookup hashtable construction */
+static int
+xfs_iunlink_obj_cmpfn(
+	struct rhashtable_compare_arg	*arg,
+	const void			*obj)
+{
+	const xfs_agino_t		*key = arg->key;
+	const struct xfs_iunlink	*iu = obj;
+
+	if (iu->iu_next_unlinked != *key)
+		return 1;
+	return 0;
+}
+
+static const struct rhashtable_params xfs_iunlink_hash_params = {
+	.min_size		= XFS_AGI_UNLINKED_BUCKETS,
+	.key_len		= sizeof(xfs_agino_t),
+	.key_offset		= offsetof(struct xfs_iunlink,
+					   iu_next_unlinked),
+	.head_offset		= offsetof(struct xfs_iunlink, iu_rhash_head),
+	.automatic_shrinking	= true,
+	.obj_cmpfn		= xfs_iunlink_obj_cmpfn,
+};
+
+/*
+ * Return X, where X.next_unlinked == @agino.  Returns NULLAGINO if no such
+ * relation is found.
+ */
+xfs_agino_t
+xfs_iunlink_lookup_backref(
+	struct xfs_perag	*pag,
+	xfs_agino_t		agino)
+{
+	struct xfs_iunlink	*iu;
+
+	iu = rhashtable_lookup_fast(&pag->pagi_unlinked_hash, &agino,
+			xfs_iunlink_hash_params);
+	return iu ? iu->iu_agino : NULLAGINO;
+}
+
+/*
+ * Remember that @prev_agino.next_unlinked = @this_agino.  Fail loudly if there
+ * already was an entry, but absorb any other runtime errors.
+ */
+int
+xfs_iunlink_add_backref(
+	struct xfs_perag	*pag,
+	xfs_agino_t		prev_agino,
+	xfs_agino_t		this_agino)
+{
+	struct xfs_iunlink	*iu;
+	int			error;
+
+	if (XFS_TEST_ERROR(false, pag->pag_mount, XFS_ERRTAG_IUNLINK_FALLBACK))
+		return 0;
+
+	iu = kmem_zalloc(sizeof(*iu), KM_SLEEP | KM_NOFS);
+	iu->iu_agino = prev_agino;
+	iu->iu_next_unlinked = this_agino;
+
+	error = rhashtable_insert_fast(&pag->pagi_unlinked_hash,
+			&iu->iu_rhash_head, xfs_iunlink_hash_params);
+	if (error == 0 || error == -EEXIST)
+		return error;
+	return 0;
+}
+
+/*
+ * Replace X.next_unlinked = @agino with X.next_unlinked = @next_unlinked.
+ * If @next_unlinked is NULLAGINO, we drop the backref and exit.  If there
+ * wasn't any such entry then we don't bother.
+ */
+int
+xfs_iunlink_change_backref(
+	struct xfs_perag	*pag,
+	xfs_agino_t		agino,
+	xfs_agino_t		next_unlinked)
+{
+	struct xfs_iunlink	*iu;
+	int			error;
+
+	/* Look up the old entry; if there wasn't one then exit. */
+	iu = rhashtable_lookup_fast(&pag->pagi_unlinked_hash, &agino,
+			xfs_iunlink_hash_params);
+	if (!iu)
+		return 0;
+
+	/*
+	 * Remove the entry.  This shouldn't ever return an error, but if we
+	 * couldn't remove the old entry we don't want to add it again to the
+	 * hash table, and if the entry disappeared on us then someone's
+	 * violated the locking rules and we need to fail loudly.
+	 */
+	error = rhashtable_remove_fast(&pag->pagi_unlinked_hash,
+			&iu->iu_rhash_head, xfs_iunlink_hash_params);
+	if (error)
+		return error;
+
+	/* If there is no new next entry just free our item and return. */
+	if (next_unlinked == NULLAGINO) {
+		kmem_free(iu);
+		return 0;
+	}
+
+	/*
+	 * Update the hash table to the new entry, ignoring operational
+	 * errors.
+	 */
+	iu->iu_next_unlinked = next_unlinked;
+	error = rhashtable_insert_fast(&pag->pagi_unlinked_hash,
+			&iu->iu_rhash_head, xfs_iunlink_hash_params);
+	if (error == 0 || error == -EEXIST)
+		return error;
+	return 0;
+}
+
+/* Set up the in-core predecessor structures. */
+int
+xfs_iunlink_init(
+	struct xfs_perag	*pag)
+{
+	return rhashtable_init(&pag->pagi_unlinked_hash,
+			&xfs_iunlink_hash_params);
+}
+
+/* Free the in-core predecessor structures. */
+static void
+xfs_iunlink_free_item(
+	void			*ptr,
+	void			*arg)
+{
+	struct xfs_iunlink	*iu = ptr;
+	bool			*freed_anything = arg;
+
+	*freed_anything = true;
+	kmem_free(iu);
+}
+
+void
+xfs_iunlink_destroy(
+	struct xfs_perag	*pag)
+{
+	bool			freed_anything = false;
+
+	rhashtable_free_and_destroy(&pag->pagi_unlinked_hash,
+			xfs_iunlink_free_item, &freed_anything);
+
+	ASSERT(freed_anything == false || XFS_FORCED_SHUTDOWN(pag->pag_mount));
+}
+
 /*
  * Point the AGI unlinked bucket at an inode and log the results.  The caller
  * is responsible for validating the old value.
@@ -2066,7 +2254,8 @@ xfs_iunlink(
 		return -EFSCORRUPTED;
 
 	if (next_agino != NULLAGINO) {
-		xfs_agino_t	old_agino;
+		struct xfs_perag	*pag;
+		xfs_agino_t		old_agino;
 
 		/*
 		 * There is already another inode in the bucket, so point this
@@ -2077,6 +2266,16 @@ xfs_iunlink(
 		if (error)
 			return error;
 		ASSERT(old_agino == NULLAGINO);
+
+		/*
+		 * agino has been unlinked, add a backref from the next inode
+		 * back to agino.
+		 */
+		pag = xfs_perag_get(mp, agno);
+		error = xfs_iunlink_add_backref(pag, agino, next_agino);
+		xfs_perag_put(pag);
+		if (error)
+			return error;
 	}
 
 	/* Point the head of the list to point to this inode. */
@@ -2133,7 +2332,8 @@ xfs_iunlink_map_prev(
 	xfs_agino_t		*agino,
 	struct xfs_imap		*imap,
 	struct xfs_dinode	**dipp,
-	struct xfs_buf		**bpp)
+	struct xfs_buf		**bpp,
+	struct xfs_perag	*pag)
 {
 	struct xfs_mount	*mp = tp->t_mountp;
 	xfs_agino_t		next_agino;
@@ -2141,6 +2341,27 @@ xfs_iunlink_map_prev(
 
 	ASSERT(head_agino != target_agino);
 
+	/* See if our backref cache can find it faster. */
+	*agino = xfs_iunlink_lookup_backref(pag, target_agino);
+	if (*agino != NULLAGINO) {
+		error = xfs_iunlink_map_ino(tp, agno, *agino, imap, dipp, bpp);
+		if (error)
+			return error;
+
+		if (be32_to_cpu((*dipp)->di_next_unlinked) == target_agino)
+			return 0;
+
+		/*
+		 * If we get here the cache contents were corrupt, so drop the
+		 * buffer and fall back to walking the bucket list.
+		 */
+		xfs_trans_brelse(tp, *bpp);
+		*bpp = NULL;
+	}
+
+	trace_xfs_iunlink_map_prev_fallback(mp, agno);
+
+	/* Otherwise, walk the entire bucket until we find it. */
 	next_agino = head_agino;
 	while (next_agino != target_agino) {
 		xfs_agino_t	unlinked_agino;
@@ -2186,6 +2407,7 @@ xfs_iunlink_remove(
 	struct xfs_buf		*agibp;
 	struct xfs_buf		*last_ibp = NULL;
 	struct xfs_dinode	*last_dip = NULL;
+	struct xfs_perag	*pag = NULL;
 	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
 	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
 	xfs_agino_t		next_agino;
@@ -2221,27 +2443,62 @@ xfs_iunlink_remove(
 	if (error)
 		return error;
 
+	/*
+	 * If there was a backref pointing from the next inode back to this
+	 * one, remove it because we've removed this inode from the list.
+	 *
+	 * Later, if this inode was in the middle of the list we'll update
+	 * this inode's backref to point from the next inode.
+	 */
+	if (next_agino != NULLAGINO) {
+		pag = xfs_perag_get(mp, agno);
+		error = xfs_iunlink_change_backref(pag, next_agino,
+				NULLAGINO);
+		if (error)
+			goto out;
+	}
+
 	if (head_agino == agino) {
 		/* Point the head of the list to the next unlinked inode. */
 		error = xfs_iunlink_update_bucket(tp, agno, agibp, bucket_index,
 				next_agino);
 		if (error)
-			return error;
+			goto out;
 	} else {
 		struct xfs_imap	imap;
 		xfs_agino_t	prev_agino;
 
+		if (!pag)
+			pag = xfs_perag_get(mp, agno);
+
 		/* We need to search the list for the inode being freed. */
 		error = xfs_iunlink_map_prev(tp, agno, head_agino, agino,
-				&prev_agino, &imap, &last_dip, &last_ibp);
+				&prev_agino, &imap, &last_dip, &last_ibp,
+				pag);
 		if (error)
-			return error;
+			goto out;
 
 		/* Point the previous inode on the list to the next inode. */
 		xfs_iunlink_update_dinode(tp, agno, prev_agino, last_ibp,
 				last_dip, &imap, next_agino);
+
+		/*
+		 * Now we deal with the backref for this inode.  If this inode
+		 * pointed at a real inode, change the backref that pointed to
+		 * us to point to our old next.  If this inode was the end of
+		 * the list, delete the backref that pointed to us.  Note that
+		 * change_backref takes care of deleting the backref if
+		 * next_agino is NULLAGINO.
+		 */
+		error = xfs_iunlink_change_backref(pag, agino, next_agino);
+		if (error)
+			goto out;
 	}
-	return 0;
+
+out:
+	if (pag)
+		xfs_perag_put(pag);
+	return error;
 }
 
 /*
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index be2014520155..e62074a5257c 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -500,4 +500,7 @@ extern struct kmem_zone	*xfs_inode_zone;
 
 bool xfs_inode_verify_forks(struct xfs_inode *ip);
 
+int xfs_iunlink_init(struct xfs_perag *pag);
+void xfs_iunlink_destroy(struct xfs_perag *pag);
+
 #endif	/* __XFS_INODE_H__ */
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index b4d8c318be3c..fd63b0b1307c 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -149,6 +149,7 @@ xfs_free_perag(
 		spin_unlock(&mp->m_perag_lock);
 		ASSERT(pag);
 		ASSERT(atomic_read(&pag->pag_ref) == 0);
+		xfs_iunlink_destroy(pag);
 		xfs_buf_hash_destroy(pag);
 		mutex_destroy(&pag->pag_ici_reclaim_lock);
 		call_rcu(&pag->rcu_head, __xfs_free_perag);
@@ -227,6 +228,9 @@ xfs_initialize_perag(
 		/* first new pag is fully initialized */
 		if (first_initialised == NULLAGNUMBER)
 			first_initialised = index;
+		error = xfs_iunlink_init(pag);
+		if (error)
+			goto out_hash_destroy;
 	}
 
 	index = xfs_set_inode_alloc(mp, agcount);
@@ -249,6 +253,7 @@ xfs_initialize_perag(
 		if (!pag)
 			break;
 		xfs_buf_hash_destroy(pag);
+		xfs_iunlink_destroy(pag);
 		mutex_destroy(&pag->pag_ici_reclaim_lock);
 		kmem_free(pag);
 	}
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 7daafe064af8..a33f45077867 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -396,6 +396,13 @@ typedef struct xfs_perag {
 
 	/* reference count */
 	uint8_t			pagf_refcount_level;
+
+	/*
+	 * Unlinked inode information.  This incore information reflects
+	 * data stored in the AGI, so callers must hold the AGI buffer lock
+	 * or have some other means to control concurrency.
+	 */
+	struct rhashtable	pagi_unlinked_hash;
 } xfs_perag_t;
 
 static inline struct xfs_ag_resv *
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index a6e384a642b1..c83ce022a355 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -3447,6 +3447,7 @@ DEFINE_EVENT(xfs_ag_inode_class, name, \
 	TP_ARGS(ip))
 DEFINE_AGINODE_EVENT(xfs_iunlink);
 DEFINE_AGINODE_EVENT(xfs_iunlink_remove);
+DEFINE_AG_EVENT(xfs_iunlink_map_prev_fallback);
 
 #endif /* _TRACE_XFS_H */
 

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 6/9] xfs: refactor unlinked list search and mapping to a separate function
  2019-02-06 16:55 ` [PATCH 6/9] xfs: refactor unlinked list search and mapping to a separate function Darrick J. Wong
@ 2019-02-07 14:28   ` Brian Foster
  2019-02-07 16:19     ` Darrick J. Wong
  2019-02-08  7:57   ` Christoph Hellwig
  1 sibling, 1 reply; 22+ messages in thread
From: Brian Foster @ 2019-02-07 14:28 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Feb 06, 2019 at 08:55:17AM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> There's a loop that searches an unlinked bucket list to find the inode
> that points to a given inode.  Hoist this into a separate function;
> later we'll use our iunlink backref cache to bypass the slow list
> operation.  No functional changes.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/xfs_inode.c |  135 +++++++++++++++++++++++++++++++++++++---------------
>  1 file changed, 96 insertions(+), 39 deletions(-)
> 
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index eb51fa33f91a..161d8da459aa 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2082,6 +2082,96 @@ xfs_iunlink(
>  	return xfs_iunlink_update_bucket(tp, agno, agibp, bucket_index, agino);
>  }
>  
...
> +STATIC int
> +xfs_iunlink_map_prev(
> +	struct xfs_trans	*tp,
> +	xfs_agnumber_t		agno,
> +	xfs_agino_t		head_agino,
> +	xfs_agino_t		target_agino,
> +	xfs_agino_t		*agino,
> +	struct xfs_imap		*imap,
> +	struct xfs_dinode	**dipp,
> +	struct xfs_buf		**bpp)
> +{
> +	struct xfs_mount	*mp = tp->t_mountp;
> +	xfs_agino_t		next_agino;
> +	int			error;
> +
> +	ASSERT(head_agino != target_agino);
> +
> +	next_agino = head_agino;
> +	while (next_agino != target_agino) {
> +		xfs_agino_t	unlinked_agino;
> +
> +		if (*bpp)
> +			xfs_trans_brelse(tp, *bpp);

Just because we have this check here, it might be appropriate to set
*bpp to NULL at the start of this function. Otherwise looks fine:

Reviewed-by: Brian Foster <bfoster@redhat.com>

> +
> +		*agino = next_agino;
> +		error = xfs_iunlink_map_ino(tp, agno, next_agino, imap, dipp,
> +				bpp);
> +		if (error)
> +			return error;
> +
> +		unlinked_agino = be32_to_cpu((*dipp)->di_next_unlinked);
> +		/*
> +		 * Make sure this pointer is valid and isn't an obvious
> +		 * infinite loop.
> +		 */
> +		if (!xfs_verify_agino(mp, agno, unlinked_agino) ||
> +		    next_agino == unlinked_agino) {
> +			XFS_CORRUPTION_ERROR(__func__,
> +					XFS_ERRLEVEL_LOW, mp,
> +					*dipp, sizeof(**dipp));
> +			error = -EFSCORRUPTED;
> +			return error;
> +		}
> +		next_agino = unlinked_agino;
> +	}
> +
> +	return 0;
> +}
> +
>  /*
>   * Pull the on-disk inode from the AGI unlinked list.
>   */
> @@ -2093,9 +2183,8 @@ xfs_iunlink_remove(
>  	struct xfs_mount	*mp = tp->t_mountp;
>  	struct xfs_agi		*agi;
>  	struct xfs_buf		*agibp;
> -	struct xfs_buf		*last_ibp;
> +	struct xfs_buf		*last_ibp = NULL;
>  	struct xfs_dinode	*last_dip = NULL;
> -	xfs_ino_t		next_ino;
>  	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
>  	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
>  	xfs_agino_t		next_agino;
> @@ -2138,43 +2227,11 @@ xfs_iunlink_remove(
>  		struct xfs_imap	imap;
>  		xfs_agino_t	prev_agino;
>  
> -		/*
> -		 * We need to search the list for the inode being freed.
> -		 */
> -		last_ibp = NULL;
> -		while (next_agino != agino) {
> -			if (last_ibp)
> -				xfs_trans_brelse(tp, last_ibp);
> -
> -			imap.im_blkno = 0;
> -			next_ino = XFS_AGINO_TO_INO(mp, agno, next_agino);
> -
> -			error = xfs_imap(mp, tp, next_ino, &imap, 0);
> -			if (error) {
> -				xfs_warn(mp,
> -	"%s: xfs_imap returned error %d.",
> -					 __func__, error);
> -				return error;
> -			}
> -
> -			error = xfs_imap_to_bp(mp, tp, &imap, &last_dip,
> -					       &last_ibp, 0, 0);
> -			if (error) {
> -				xfs_warn(mp,
> -	"%s: xfs_imap_to_bp returned error %d.",
> -					__func__, error);
> -				return error;
> -			}
> -
> -			prev_agino = next_agino;
> -			next_agino = be32_to_cpu(last_dip->di_next_unlinked);
> -			if (!xfs_verify_agino(mp, agno, next_agino)) {
> -				XFS_CORRUPTION_ERROR(__func__,
> -						XFS_ERRLEVEL_LOW, mp,
> -						last_dip, sizeof(*last_dip));
> -				return -EFSCORRUPTED;
> -			}
> -		}
> +		/* We need to search the list for the inode being freed. */
> +		error = xfs_iunlink_map_prev(tp, agno, next_agino, agino,
> +				&prev_agino, &imap, &last_dip, &last_ibp);
> +		if (error)
> +			return error;
>  
>  		/*
>  		 * Now last_ibp points to the buffer previous to us on the
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 9/9] xfs: cache unlinked pointers in an rhashtable
  2019-02-06 16:55 ` [PATCH 9/9] xfs: cache unlinked pointers in an rhashtable Darrick J. Wong
@ 2019-02-07 14:31   ` Brian Foster
  2019-02-07 16:33     ` Darrick J. Wong
  2019-02-08  8:00     ` Christoph Hellwig
  2019-02-07 18:24   ` [PATCH v2 " Darrick J. Wong
  1 sibling, 2 replies; 22+ messages in thread
From: Brian Foster @ 2019-02-07 14:31 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Feb 06, 2019 at 08:55:36AM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Use a rhashtable to cache the unlinked list incore.  This should speed
> up unlinked processing considerably when there are a lot of inodes on
> the unlinked list because iunlink_remove no longer has to traverse an
> entire bucket list to find which inode points to the one being removed.
> 
> The incore list structure records "X.next_unlinked = Y" relations, with
> the rhashtable using Y to index the records.  This makes finding the
> inode X that points to a inode Y very quick.  If our cache fails to find
> anything we can always fall back on the old method.
> 
> FWIW this drastically reduces the amount of time it takes to remove
> inodes from the unlinked list.  I wrote a program to open a lot of
> O_TMPFILE files and then close them in the same order, which takes
> a very long time if we have to traverse the unlinked lists.  With the
> ptach, I see:
> 
...
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

A few minor comments below. With those addressed, this otherwise looks
pretty good to me:

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/libxfs/xfs_errortag.h |    4 -
>  fs/xfs/xfs_error.c           |    3 
>  fs/xfs/xfs_inode.c           |  269 +++++++++++++++++++++++++++++++++++++++++-
>  fs/xfs/xfs_inode.h           |    3 
>  fs/xfs/xfs_mount.c           |    5 +
>  fs/xfs/xfs_mount.h           |    7 +
>  fs/xfs/xfs_trace.h           |    1 
>  7 files changed, 285 insertions(+), 7 deletions(-)
> 
> 
...
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 53a9c8c26d18..fdfcb3a9bac7 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -1906,6 +1906,194 @@ xfs_inactive(
>  	xfs_qm_dqdetach(ip);
>  }
>  
...
> +/*
> + * Remember that @prev_agino.next_unlinked = @this_agino.  Fail loudly if there
> + * already was an entry, but absorb any other runtime errors.
> + */
> +int
> +xfs_iunlink_add_backref(
> +	struct xfs_perag	*pag,
> +	xfs_agino_t		prev_agino,
> +	xfs_agino_t		this_agino)
> +{
> +	struct xfs_iunlink	*iu;
> +	int			error;
> +
> +	if (XFS_TEST_ERROR(false, pag->pag_mount, XFS_ERRTAG_IUNLINK_FALLBACK))
> +		return 0;
> +
> +	iu = kmem_zalloc(sizeof(*iu), KM_SLEEP | KM_NOFS);
> +	iu->iu_agino = prev_agino;
> +	iu->iu_next_unlinked = this_agino;
> +
> +	error = rhashtable_insert_fast(&pag->pagi_unlinked_hash,
> +			&iu->iu_rhash_head, xfs_iunlink_hash_params);
> +	if (error == 0 || error == -EEXIST)
> +		return error;
> +	return 0;

The return val looks funny at a glance: return -EEXIST or otherwise
return 0 for success and all other errors (also the error == 0 looks
superfluous). I see the comment above describes what this does, but it
doesn't explain why. I'd move the comment to above the if check and
explain why it's there (i.e., "Absorb all runtime errors besides -EEXIST
because fallback blah blah. We care about -EEXIST because ...").

Also I assume we need to free the iu object on insert failure,
regardless of whether we ultimately return the error.

> +}
> +
> +/*
> + * Replace X.next_unlinked = @agino with X.next_unlinked = @next_unlinked.
> + * If @next_unlinked is NULLAGINO, we drop the backref and exit.  If there
> + * wasn't any such entry then we don't bother.
> + */
> +int
> +xfs_iunlink_change_backref(
> +	struct xfs_perag	*pag,
> +	xfs_agino_t		agino,
> +	xfs_agino_t		next_unlinked)
> +{
> +	struct xfs_iunlink	*iu;
> +	int			error;
> +
> +	/* Look up the old entry; if there wasn't one then exit. */
> +	iu = rhashtable_lookup_fast(&pag->pagi_unlinked_hash, &agino,
> +			xfs_iunlink_hash_params);
> +	if (!iu)
> +		return 0;
> +
> +	/*
> +	 * Remove the entry.  This shouldn't ever return an error, but if we
> +	 * couldn't remove the old entry we don't want to add it again to the
> +	 * hash table, and if the entry disappeared on us then someone's
> +	 * violated the locking rules and we need to fail loudly.
> +	 */
> +	error = rhashtable_remove_fast(&pag->pagi_unlinked_hash,
> +			&iu->iu_rhash_head, xfs_iunlink_hash_params);
> +	if (error)
> +		return error;

What's interesting about remove failure is that if we otherwise found an
iu for this inode, removing the inode from the unlinked list leaves a
stale/incorrect iu around in the in-core table. So it's one thing for
the in-core table to be a valid subset of the on-disk structures, but
another thing entirely to have incoherence between the two. It might be
worth pointing out that it's critical to return an error here so we
don't actually remove the inode (whereas the re-add below is less
strict).

> +
> +	/* If there is no new next entry just free our item and return. */
> +	if (next_unlinked == NULLAGINO) {
> +		kmem_free(iu);
> +		return 0;
> +	}
> +
> +	/*
> +	 * Update the hash table to the new entry, ignoring operational
> +	 * errors.
> +	 */
> +	iu->iu_next_unlinked = next_unlinked;
> +	error = rhashtable_insert_fast(&pag->pagi_unlinked_hash,
> +			&iu->iu_rhash_head, xfs_iunlink_hash_params);
> +	if (error == 0 || error == -EEXIST)
> +		return error;
> +	return 0;

Similar error == 0 thing and potential memory leak here.

> +}
> +
...
> @@ -2141,6 +2341,27 @@ xfs_iunlink_map_prev(
>  
>  	ASSERT(head_agino != target_agino);
>  
> +	/* See if our backref cache can find it faster. */
> +	*agino = xfs_iunlink_lookup_backref(pag, target_agino);
> +	if (*agino != NULLAGINO) {
> +		error = xfs_iunlink_map_ino(tp, agno, *agino, imap, dipp, bpp);
> +		if (error)
> +			return error;
> +
> +		if (be32_to_cpu((*dipp)->di_next_unlinked) == target_agino)
> +			return 0;
> +
> +		/*
> +		 * If we get here the cache contents were corrupt, so drop the
> +		 * buffer and fall back to walking the bucket list.
> +		 */
> +		xfs_trans_brelse(tp, *bpp);
> +		*bpp = NULL;

I like the logic to ensure we don't screw around with an inode that
doesn't actually point to our target, but I do wonder whether we should
do a little more than silently ignore the fact we found an incoherent
backref. Even just a WARN_ON[_ONCE]() or something here to help us
detect this case during testing might be sufficient..?

Brian

> +	}
> +
> +	trace_xfs_iunlink_map_prev_fallback(mp, agno);
> +
> +	/* Otherwise, walk the entire bucket until we find it. */
>  	next_agino = head_agino;
>  	while (next_agino != target_agino) {
>  		xfs_agino_t	unlinked_agino;
> @@ -2186,6 +2407,7 @@ xfs_iunlink_remove(
>  	struct xfs_buf		*agibp;
>  	struct xfs_buf		*last_ibp = NULL;
>  	struct xfs_dinode	*last_dip = NULL;
> +	struct xfs_perag	*pag = NULL;
>  	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
>  	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
>  	xfs_agino_t		next_agino;
> @@ -2221,27 +2443,62 @@ xfs_iunlink_remove(
>  	if (error)
>  		return error;
>  
> +	/*
> +	 * If there was a backref pointing from the next inode back to this
> +	 * one, remove it because we've removed this inode from the list.
> +	 *
> +	 * Later, if this inode was in the middle of the list we'll update
> +	 * this inode's backref to point from the next inode.
> +	 */
> +	if (next_agino != NULLAGINO) {
> +		pag = xfs_perag_get(mp, agno);
> +		error = xfs_iunlink_change_backref(pag, next_agino,
> +				NULLAGINO);
> +		if (error)
> +			goto out;
> +	}
> +
>  	if (head_agino == agino) {
>  		/* Point the head of the list to the next unlinked inode. */
>  		error = xfs_iunlink_update_bucket(tp, agno, agibp, bucket_index,
>  				next_agino);
>  		if (error)
> -			return error;
> +			goto out;
>  	} else {
>  		struct xfs_imap	imap;
>  		xfs_agino_t	prev_agino;
>  
> +		if (!pag)
> +			pag = xfs_perag_get(mp, agno);
> +
>  		/* We need to search the list for the inode being freed. */
>  		error = xfs_iunlink_map_prev(tp, agno, head_agino, agino,
> -				&prev_agino, &imap, &last_dip, &last_ibp);
> +				&prev_agino, &imap, &last_dip, &last_ibp,
> +				pag);
>  		if (error)
> -			return error;
> +			goto out;
>  
>  		/* Point the previous inode on the list to the next inode. */
>  		xfs_iunlink_update_dinode(tp, agno, prev_agino, last_ibp,
>  				last_dip, &imap, next_agino);
> +
> +		/*
> +		 * Now we deal with the backref for this inode.  If this inode
> +		 * pointed at a real inode, change the backref that pointed to
> +		 * us to point to our old next.  If this inode was the end of
> +		 * the list, delete the backref that pointed to us.  Note that
> +		 * change_backref takes care of deleting the backref if
> +		 * next_agino is NULLAGINO.
> +		 */
> +		error = xfs_iunlink_change_backref(pag, agino, next_agino);
> +		if (error)
> +			goto out;
>  	}
> -	return 0;
> +
> +out:
> +	if (pag)
> +		xfs_perag_put(pag);
> +	return error;
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index be2014520155..e62074a5257c 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -500,4 +500,7 @@ extern struct kmem_zone	*xfs_inode_zone;
>  
>  bool xfs_inode_verify_forks(struct xfs_inode *ip);
>  
> +int xfs_iunlink_init(struct xfs_perag *pag);
> +void xfs_iunlink_destroy(struct xfs_perag *pag);
> +
>  #endif	/* __XFS_INODE_H__ */
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index b4d8c318be3c..fd63b0b1307c 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -149,6 +149,7 @@ xfs_free_perag(
>  		spin_unlock(&mp->m_perag_lock);
>  		ASSERT(pag);
>  		ASSERT(atomic_read(&pag->pag_ref) == 0);
> +		xfs_iunlink_destroy(pag);
>  		xfs_buf_hash_destroy(pag);
>  		mutex_destroy(&pag->pag_ici_reclaim_lock);
>  		call_rcu(&pag->rcu_head, __xfs_free_perag);
> @@ -227,6 +228,9 @@ xfs_initialize_perag(
>  		/* first new pag is fully initialized */
>  		if (first_initialised == NULLAGNUMBER)
>  			first_initialised = index;
> +		error = xfs_iunlink_init(pag);
> +		if (error)
> +			goto out_hash_destroy;
>  	}
>  
>  	index = xfs_set_inode_alloc(mp, agcount);
> @@ -249,6 +253,7 @@ xfs_initialize_perag(
>  		if (!pag)
>  			break;
>  		xfs_buf_hash_destroy(pag);
> +		xfs_iunlink_destroy(pag);
>  		mutex_destroy(&pag->pag_ici_reclaim_lock);
>  		kmem_free(pag);
>  	}
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 7daafe064af8..a33f45077867 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -396,6 +396,13 @@ typedef struct xfs_perag {
>  
>  	/* reference count */
>  	uint8_t			pagf_refcount_level;
> +
> +	/*
> +	 * Unlinked inode information.  This incore information reflects
> +	 * data stored in the AGI, so callers must hold the AGI buffer lock
> +	 * or have some other means to control concurrency.
> +	 */
> +	struct rhashtable	pagi_unlinked_hash;
>  } xfs_perag_t;
>  
>  static inline struct xfs_ag_resv *
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index a6e384a642b1..c83ce022a355 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -3447,6 +3447,7 @@ DEFINE_EVENT(xfs_ag_inode_class, name, \
>  	TP_ARGS(ip))
>  DEFINE_AGINODE_EVENT(xfs_iunlink);
>  DEFINE_AGINODE_EVENT(xfs_iunlink_remove);
> +DEFINE_AG_EVENT(xfs_iunlink_map_prev_fallback);
>  
>  #endif /* _TRACE_XFS_H */
>  
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 6/9] xfs: refactor unlinked list search and mapping to a separate function
  2019-02-07 14:28   ` Brian Foster
@ 2019-02-07 16:19     ` Darrick J. Wong
  0 siblings, 0 replies; 22+ messages in thread
From: Darrick J. Wong @ 2019-02-07 16:19 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Feb 07, 2019 at 09:28:15AM -0500, Brian Foster wrote:
> On Wed, Feb 06, 2019 at 08:55:17AM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > There's a loop that searches an unlinked bucket list to find the inode
> > that points to a given inode.  Hoist this into a separate function;
> > later we'll use our iunlink backref cache to bypass the slow list
> > operation.  No functional changes.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/xfs_inode.c |  135 +++++++++++++++++++++++++++++++++++++---------------
> >  1 file changed, 96 insertions(+), 39 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index eb51fa33f91a..161d8da459aa 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -2082,6 +2082,96 @@ xfs_iunlink(
> >  	return xfs_iunlink_update_bucket(tp, agno, agibp, bucket_index, agino);
> >  }
> >  
> ...
> > +STATIC int
> > +xfs_iunlink_map_prev(
> > +	struct xfs_trans	*tp,
> > +	xfs_agnumber_t		agno,
> > +	xfs_agino_t		head_agino,
> > +	xfs_agino_t		target_agino,
> > +	xfs_agino_t		*agino,
> > +	struct xfs_imap		*imap,
> > +	struct xfs_dinode	**dipp,
> > +	struct xfs_buf		**bpp)
> > +{
> > +	struct xfs_mount	*mp = tp->t_mountp;
> > +	xfs_agino_t		next_agino;
> > +	int			error;
> > +
> > +	ASSERT(head_agino != target_agino);
> > +
> > +	next_agino = head_agino;
> > +	while (next_agino != target_agino) {
> > +		xfs_agino_t	unlinked_agino;
> > +
> > +		if (*bpp)
> > +			xfs_trans_brelse(tp, *bpp);
> 
> Just because we have this check here, it might be appropriate to set
> *bpp to NULL at the start of this function. Otherwise looks fine:

It's set to NULL in the caller but I can move it to the top of
_map_prev to make the connection easier to find.

--D

> Reviewed-by: Brian Foster <bfoster@redhat.com>
> 
> > +
> > +		*agino = next_agino;
> > +		error = xfs_iunlink_map_ino(tp, agno, next_agino, imap, dipp,
> > +				bpp);
> > +		if (error)
> > +			return error;
> > +
> > +		unlinked_agino = be32_to_cpu((*dipp)->di_next_unlinked);
> > +		/*
> > +		 * Make sure this pointer is valid and isn't an obvious
> > +		 * infinite loop.
> > +		 */
> > +		if (!xfs_verify_agino(mp, agno, unlinked_agino) ||
> > +		    next_agino == unlinked_agino) {
> > +			XFS_CORRUPTION_ERROR(__func__,
> > +					XFS_ERRLEVEL_LOW, mp,
> > +					*dipp, sizeof(**dipp));
> > +			error = -EFSCORRUPTED;
> > +			return error;
> > +		}
> > +		next_agino = unlinked_agino;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> >  /*
> >   * Pull the on-disk inode from the AGI unlinked list.
> >   */
> > @@ -2093,9 +2183,8 @@ xfs_iunlink_remove(
> >  	struct xfs_mount	*mp = tp->t_mountp;
> >  	struct xfs_agi		*agi;
> >  	struct xfs_buf		*agibp;
> > -	struct xfs_buf		*last_ibp;
> > +	struct xfs_buf		*last_ibp = NULL;
> >  	struct xfs_dinode	*last_dip = NULL;
> > -	xfs_ino_t		next_ino;
> >  	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
> >  	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
> >  	xfs_agino_t		next_agino;
> > @@ -2138,43 +2227,11 @@ xfs_iunlink_remove(
> >  		struct xfs_imap	imap;
> >  		xfs_agino_t	prev_agino;
> >  
> > -		/*
> > -		 * We need to search the list for the inode being freed.
> > -		 */
> > -		last_ibp = NULL;
> > -		while (next_agino != agino) {
> > -			if (last_ibp)
> > -				xfs_trans_brelse(tp, last_ibp);
> > -
> > -			imap.im_blkno = 0;
> > -			next_ino = XFS_AGINO_TO_INO(mp, agno, next_agino);
> > -
> > -			error = xfs_imap(mp, tp, next_ino, &imap, 0);
> > -			if (error) {
> > -				xfs_warn(mp,
> > -	"%s: xfs_imap returned error %d.",
> > -					 __func__, error);
> > -				return error;
> > -			}
> > -
> > -			error = xfs_imap_to_bp(mp, tp, &imap, &last_dip,
> > -					       &last_ibp, 0, 0);
> > -			if (error) {
> > -				xfs_warn(mp,
> > -	"%s: xfs_imap_to_bp returned error %d.",
> > -					__func__, error);
> > -				return error;
> > -			}
> > -
> > -			prev_agino = next_agino;
> > -			next_agino = be32_to_cpu(last_dip->di_next_unlinked);
> > -			if (!xfs_verify_agino(mp, agno, next_agino)) {
> > -				XFS_CORRUPTION_ERROR(__func__,
> > -						XFS_ERRLEVEL_LOW, mp,
> > -						last_dip, sizeof(*last_dip));
> > -				return -EFSCORRUPTED;
> > -			}
> > -		}
> > +		/* We need to search the list for the inode being freed. */
> > +		error = xfs_iunlink_map_prev(tp, agno, next_agino, agino,
> > +				&prev_agino, &imap, &last_dip, &last_ibp);
> > +		if (error)
> > +			return error;
> >  
> >  		/*
> >  		 * Now last_ibp points to the buffer previous to us on the
> > 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 9/9] xfs: cache unlinked pointers in an rhashtable
  2019-02-07 14:31   ` Brian Foster
@ 2019-02-07 16:33     ` Darrick J. Wong
  2019-02-08  8:00     ` Christoph Hellwig
  1 sibling, 0 replies; 22+ messages in thread
From: Darrick J. Wong @ 2019-02-07 16:33 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Feb 07, 2019 at 09:31:15AM -0500, Brian Foster wrote:
> On Wed, Feb 06, 2019 at 08:55:36AM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Use a rhashtable to cache the unlinked list incore.  This should speed
> > up unlinked processing considerably when there are a lot of inodes on
> > the unlinked list because iunlink_remove no longer has to traverse an
> > entire bucket list to find which inode points to the one being removed.
> > 
> > The incore list structure records "X.next_unlinked = Y" relations, with
> > the rhashtable using Y to index the records.  This makes finding the
> > inode X that points to a inode Y very quick.  If our cache fails to find
> > anything we can always fall back on the old method.
> > 
> > FWIW this drastically reduces the amount of time it takes to remove
> > inodes from the unlinked list.  I wrote a program to open a lot of
> > O_TMPFILE files and then close them in the same order, which takes
> > a very long time if we have to traverse the unlinked lists.  With the
> > ptach, I see:
> > 
> ...
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> 
> A few minor comments below. With those addressed, this otherwise looks
> pretty good to me:
> 
> Reviewed-by: Brian Foster <bfoster@redhat.com>
> 
> >  fs/xfs/libxfs/xfs_errortag.h |    4 -
> >  fs/xfs/xfs_error.c           |    3 
> >  fs/xfs/xfs_inode.c           |  269 +++++++++++++++++++++++++++++++++++++++++-
> >  fs/xfs/xfs_inode.h           |    3 
> >  fs/xfs/xfs_mount.c           |    5 +
> >  fs/xfs/xfs_mount.h           |    7 +
> >  fs/xfs/xfs_trace.h           |    1 
> >  7 files changed, 285 insertions(+), 7 deletions(-)
> > 
> > 
> ...
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index 53a9c8c26d18..fdfcb3a9bac7 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -1906,6 +1906,194 @@ xfs_inactive(
> >  	xfs_qm_dqdetach(ip);
> >  }
> >  
> ...
> > +/*
> > + * Remember that @prev_agino.next_unlinked = @this_agino.  Fail loudly if there
> > + * already was an entry, but absorb any other runtime errors.
> > + */
> > +int
> > +xfs_iunlink_add_backref(
> > +	struct xfs_perag	*pag,
> > +	xfs_agino_t		prev_agino,
> > +	xfs_agino_t		this_agino)
> > +{
> > +	struct xfs_iunlink	*iu;
> > +	int			error;
> > +
> > +	if (XFS_TEST_ERROR(false, pag->pag_mount, XFS_ERRTAG_IUNLINK_FALLBACK))
> > +		return 0;
> > +
> > +	iu = kmem_zalloc(sizeof(*iu), KM_SLEEP | KM_NOFS);
> > +	iu->iu_agino = prev_agino;
> > +	iu->iu_next_unlinked = this_agino;
> > +
> > +	error = rhashtable_insert_fast(&pag->pagi_unlinked_hash,
> > +			&iu->iu_rhash_head, xfs_iunlink_hash_params);
> > +	if (error == 0 || error == -EEXIST)
> > +		return error;
> > +	return 0;
> 
> The return val looks funny at a glance: return -EEXIST or otherwise
> return 0 for success and all other errors (also the error == 0 looks
> superfluous). I see the comment above describes what this does, but it
> doesn't explain why. I'd move the comment to above the if check and
> explain why it's there (i.e., "Absorb all runtime errors besides -EEXIST
> because fallback blah blah. We care about -EEXIST because ...").

Will do.

> Also I assume we need to free the iu object on insert failure,
> regardless of whether we ultimately return the error.

Oops, yes, good catch!

> > +}
> > +
> > +/*
> > + * Replace X.next_unlinked = @agino with X.next_unlinked = @next_unlinked.
> > + * If @next_unlinked is NULLAGINO, we drop the backref and exit.  If there
> > + * wasn't any such entry then we don't bother.
> > + */
> > +int
> > +xfs_iunlink_change_backref(
> > +	struct xfs_perag	*pag,
> > +	xfs_agino_t		agino,
> > +	xfs_agino_t		next_unlinked)
> > +{
> > +	struct xfs_iunlink	*iu;
> > +	int			error;
> > +
> > +	/* Look up the old entry; if there wasn't one then exit. */
> > +	iu = rhashtable_lookup_fast(&pag->pagi_unlinked_hash, &agino,
> > +			xfs_iunlink_hash_params);
> > +	if (!iu)
> > +		return 0;
> > +
> > +	/*
> > +	 * Remove the entry.  This shouldn't ever return an error, but if we
> > +	 * couldn't remove the old entry we don't want to add it again to the
> > +	 * hash table, and if the entry disappeared on us then someone's
> > +	 * violated the locking rules and we need to fail loudly.
> > +	 */
> > +	error = rhashtable_remove_fast(&pag->pagi_unlinked_hash,
> > +			&iu->iu_rhash_head, xfs_iunlink_hash_params);
> > +	if (error)
> > +		return error;
> 
> What's interesting about remove failure is that if we otherwise found an
> iu for this inode, removing the inode from the unlinked list leaves a
> stale/incorrect iu around in the in-core table. So it's one thing for
> the in-core table to be a valid subset of the on-disk structures, but
> another thing entirely to have incoherence between the two. It might be
> worth pointing out that it's critical to return an error here so we
> don't actually remove the inode (whereas the re-add below is less
> strict).

Ok.

> > +
> > +	/* If there is no new next entry just free our item and return. */
> > +	if (next_unlinked == NULLAGINO) {
> > +		kmem_free(iu);
> > +		return 0;
> > +	}
> > +
> > +	/*
> > +	 * Update the hash table to the new entry, ignoring operational
> > +	 * errors.
> > +	 */
> > +	iu->iu_next_unlinked = next_unlinked;
> > +	error = rhashtable_insert_fast(&pag->pagi_unlinked_hash,
> > +			&iu->iu_rhash_head, xfs_iunlink_hash_params);
> > +	if (error == 0 || error == -EEXIST)
> > +		return error;
> > +	return 0;
> 
> Similar error == 0 thing and potential memory leak here.

Refactored into a common helper to capture all the logic and
documentation.

> > +}
> > +
> ...
> > @@ -2141,6 +2341,27 @@ xfs_iunlink_map_prev(
> >  
> >  	ASSERT(head_agino != target_agino);
> >  
> > +	/* See if our backref cache can find it faster. */
> > +	*agino = xfs_iunlink_lookup_backref(pag, target_agino);
> > +	if (*agino != NULLAGINO) {
> > +		error = xfs_iunlink_map_ino(tp, agno, *agino, imap, dipp, bpp);
> > +		if (error)
> > +			return error;
> > +
> > +		if (be32_to_cpu((*dipp)->di_next_unlinked) == target_agino)
> > +			return 0;
> > +
> > +		/*
> > +		 * If we get here the cache contents were corrupt, so drop the
> > +		 * buffer and fall back to walking the bucket list.
> > +		 */
> > +		xfs_trans_brelse(tp, *bpp);
> > +		*bpp = NULL;
> 
> I like the logic to ensure we don't screw around with an inode that
> doesn't actually point to our target, but I do wonder whether we should
> do a little more than silently ignore the fact we found an incoherent
> backref. Even just a WARN_ON[_ONCE]() or something here to help us
> detect this case during testing might be sufficient..?

Yeah, that's a good idea.  The cache can miss, but it shouldn't ever be
corrupt.

--D

> 
> Brian
> 
> > +	}
> > +
> > +	trace_xfs_iunlink_map_prev_fallback(mp, agno);
> > +
> > +	/* Otherwise, walk the entire bucket until we find it. */
> >  	next_agino = head_agino;
> >  	while (next_agino != target_agino) {
> >  		xfs_agino_t	unlinked_agino;
> > @@ -2186,6 +2407,7 @@ xfs_iunlink_remove(
> >  	struct xfs_buf		*agibp;
> >  	struct xfs_buf		*last_ibp = NULL;
> >  	struct xfs_dinode	*last_dip = NULL;
> > +	struct xfs_perag	*pag = NULL;
> >  	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
> >  	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
> >  	xfs_agino_t		next_agino;
> > @@ -2221,27 +2443,62 @@ xfs_iunlink_remove(
> >  	if (error)
> >  		return error;
> >  
> > +	/*
> > +	 * If there was a backref pointing from the next inode back to this
> > +	 * one, remove it because we've removed this inode from the list.
> > +	 *
> > +	 * Later, if this inode was in the middle of the list we'll update
> > +	 * this inode's backref to point from the next inode.
> > +	 */
> > +	if (next_agino != NULLAGINO) {
> > +		pag = xfs_perag_get(mp, agno);
> > +		error = xfs_iunlink_change_backref(pag, next_agino,
> > +				NULLAGINO);
> > +		if (error)
> > +			goto out;
> > +	}
> > +
> >  	if (head_agino == agino) {
> >  		/* Point the head of the list to the next unlinked inode. */
> >  		error = xfs_iunlink_update_bucket(tp, agno, agibp, bucket_index,
> >  				next_agino);
> >  		if (error)
> > -			return error;
> > +			goto out;
> >  	} else {
> >  		struct xfs_imap	imap;
> >  		xfs_agino_t	prev_agino;
> >  
> > +		if (!pag)
> > +			pag = xfs_perag_get(mp, agno);
> > +
> >  		/* We need to search the list for the inode being freed. */
> >  		error = xfs_iunlink_map_prev(tp, agno, head_agino, agino,
> > -				&prev_agino, &imap, &last_dip, &last_ibp);
> > +				&prev_agino, &imap, &last_dip, &last_ibp,
> > +				pag);
> >  		if (error)
> > -			return error;
> > +			goto out;
> >  
> >  		/* Point the previous inode on the list to the next inode. */
> >  		xfs_iunlink_update_dinode(tp, agno, prev_agino, last_ibp,
> >  				last_dip, &imap, next_agino);
> > +
> > +		/*
> > +		 * Now we deal with the backref for this inode.  If this inode
> > +		 * pointed at a real inode, change the backref that pointed to
> > +		 * us to point to our old next.  If this inode was the end of
> > +		 * the list, delete the backref that pointed to us.  Note that
> > +		 * change_backref takes care of deleting the backref if
> > +		 * next_agino is NULLAGINO.
> > +		 */
> > +		error = xfs_iunlink_change_backref(pag, agino, next_agino);
> > +		if (error)
> > +			goto out;
> >  	}
> > -	return 0;
> > +
> > +out:
> > +	if (pag)
> > +		xfs_perag_put(pag);
> > +	return error;
> >  }
> >  
> >  /*
> > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> > index be2014520155..e62074a5257c 100644
> > --- a/fs/xfs/xfs_inode.h
> > +++ b/fs/xfs/xfs_inode.h
> > @@ -500,4 +500,7 @@ extern struct kmem_zone	*xfs_inode_zone;
> >  
> >  bool xfs_inode_verify_forks(struct xfs_inode *ip);
> >  
> > +int xfs_iunlink_init(struct xfs_perag *pag);
> > +void xfs_iunlink_destroy(struct xfs_perag *pag);
> > +
> >  #endif	/* __XFS_INODE_H__ */
> > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> > index b4d8c318be3c..fd63b0b1307c 100644
> > --- a/fs/xfs/xfs_mount.c
> > +++ b/fs/xfs/xfs_mount.c
> > @@ -149,6 +149,7 @@ xfs_free_perag(
> >  		spin_unlock(&mp->m_perag_lock);
> >  		ASSERT(pag);
> >  		ASSERT(atomic_read(&pag->pag_ref) == 0);
> > +		xfs_iunlink_destroy(pag);
> >  		xfs_buf_hash_destroy(pag);
> >  		mutex_destroy(&pag->pag_ici_reclaim_lock);
> >  		call_rcu(&pag->rcu_head, __xfs_free_perag);
> > @@ -227,6 +228,9 @@ xfs_initialize_perag(
> >  		/* first new pag is fully initialized */
> >  		if (first_initialised == NULLAGNUMBER)
> >  			first_initialised = index;
> > +		error = xfs_iunlink_init(pag);
> > +		if (error)
> > +			goto out_hash_destroy;
> >  	}
> >  
> >  	index = xfs_set_inode_alloc(mp, agcount);
> > @@ -249,6 +253,7 @@ xfs_initialize_perag(
> >  		if (!pag)
> >  			break;
> >  		xfs_buf_hash_destroy(pag);
> > +		xfs_iunlink_destroy(pag);
> >  		mutex_destroy(&pag->pag_ici_reclaim_lock);
> >  		kmem_free(pag);
> >  	}
> > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> > index 7daafe064af8..a33f45077867 100644
> > --- a/fs/xfs/xfs_mount.h
> > +++ b/fs/xfs/xfs_mount.h
> > @@ -396,6 +396,13 @@ typedef struct xfs_perag {
> >  
> >  	/* reference count */
> >  	uint8_t			pagf_refcount_level;
> > +
> > +	/*
> > +	 * Unlinked inode information.  This incore information reflects
> > +	 * data stored in the AGI, so callers must hold the AGI buffer lock
> > +	 * or have some other means to control concurrency.
> > +	 */
> > +	struct rhashtable	pagi_unlinked_hash;
> >  } xfs_perag_t;
> >  
> >  static inline struct xfs_ag_resv *
> > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > index a6e384a642b1..c83ce022a355 100644
> > --- a/fs/xfs/xfs_trace.h
> > +++ b/fs/xfs/xfs_trace.h
> > @@ -3447,6 +3447,7 @@ DEFINE_EVENT(xfs_ag_inode_class, name, \
> >  	TP_ARGS(ip))
> >  DEFINE_AGINODE_EVENT(xfs_iunlink);
> >  DEFINE_AGINODE_EVENT(xfs_iunlink_remove);
> > +DEFINE_AG_EVENT(xfs_iunlink_map_prev_fallback);
> >  
> >  #endif /* _TRACE_XFS_H */
> >  
> > 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v2 9/9] xfs: cache unlinked pointers in an rhashtable
  2019-02-06 16:55 ` [PATCH 9/9] xfs: cache unlinked pointers in an rhashtable Darrick J. Wong
  2019-02-07 14:31   ` Brian Foster
@ 2019-02-07 18:24   ` Darrick J. Wong
  1 sibling, 0 replies; 22+ messages in thread
From: Darrick J. Wong @ 2019-02-07 18:24 UTC (permalink / raw)
  To: linux-xfs; +Cc: Brian Foster

From: Darrick J. Wong <darrick.wong@oracle.com>

Use a rhashtable to cache the unlinked list incore.  This should speed
up unlinked processing considerably when there are a lot of inodes on
the unlinked list because iunlink_remove no longer has to traverse an
entire bucket list to find which inode points to the one being removed.

The incore list structure records "X.next_unlinked = Y" relations, with
the rhashtable using Y to index the records.  This makes finding the
inode X that points to a inode Y very quick.  If our cache fails to find
anything we can always fall back on the old method.

FWIW this drastically reduces the amount of time it takes to remove
inodes from the unlinked list.  I wrote a program to open a lot of
O_TMPFILE files and then close them in the same order, which takes
a very long time if we have to traverse the unlinked lists.  With the
ptach, I see:

+ /d/t/tmpfile/tmpfile
Opened 193531 files in 6.33s.
Closed 193531 files in 5.86s

real    0m12.192s
user    0m0.064s
sys     0m11.619s
+ cd /
+ umount /mnt

real    0m0.050s
user    0m0.004s
sys     0m0.030s

And without the patch:

+ /d/t/tmpfile/tmpfile
Opened 193588 files in 6.35s.
Closed 193588 files in 751.61s

real    12m38.853s
user    0m0.084s
sys     12m34.470s
+ cd /
+ umount /mnt

real    0m0.086s
user    0m0.000s
sys     0m0.060s

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
v2: rework comments and factor out hashtable insertion to a helper
---
 fs/xfs/libxfs/xfs_errortag.h |    4 -
 fs/xfs/xfs_error.c           |    3 
 fs/xfs/xfs_inode.c           |  284 +++++++++++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_inode.h           |    3 
 fs/xfs/xfs_mount.c           |    5 +
 fs/xfs/xfs_mount.h           |    7 +
 fs/xfs/xfs_trace.h           |    1 
 7 files changed, 300 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_errortag.h b/fs/xfs/libxfs/xfs_errortag.h
index 66077a105cbb..79e6c4fb1d8a 100644
--- a/fs/xfs/libxfs/xfs_errortag.h
+++ b/fs/xfs/libxfs/xfs_errortag.h
@@ -54,7 +54,8 @@
 #define XFS_ERRTAG_BUF_LRU_REF				31
 #define XFS_ERRTAG_FORCE_SCRUB_REPAIR			32
 #define XFS_ERRTAG_FORCE_SUMMARY_RECALC			33
-#define XFS_ERRTAG_MAX					34
+#define XFS_ERRTAG_IUNLINK_FALLBACK			34
+#define XFS_ERRTAG_MAX					35
 
 /*
  * Random factors for above tags, 1 means always, 2 means 1/2 time, etc.
@@ -93,5 +94,6 @@
 #define XFS_RANDOM_BUF_LRU_REF				2
 #define XFS_RANDOM_FORCE_SCRUB_REPAIR			1
 #define XFS_RANDOM_FORCE_SUMMARY_RECALC			1
+#define XFS_RANDOM_IUNLINK_FALLBACK			(XFS_RANDOM_DEFAULT/10)
 
 #endif /* __XFS_ERRORTAG_H_ */
diff --git a/fs/xfs/xfs_error.c b/fs/xfs/xfs_error.c
index 9866f542e77b..cc77c11bd0d4 100644
--- a/fs/xfs/xfs_error.c
+++ b/fs/xfs/xfs_error.c
@@ -51,6 +51,7 @@ static unsigned int xfs_errortag_random_default[] = {
 	XFS_RANDOM_BUF_LRU_REF,
 	XFS_RANDOM_FORCE_SCRUB_REPAIR,
 	XFS_RANDOM_FORCE_SUMMARY_RECALC,
+	XFS_RANDOM_IUNLINK_FALLBACK,
 };
 
 struct xfs_errortag_attr {
@@ -159,6 +160,7 @@ XFS_ERRORTAG_ATTR_RW(log_item_pin,	XFS_ERRTAG_LOG_ITEM_PIN);
 XFS_ERRORTAG_ATTR_RW(buf_lru_ref,	XFS_ERRTAG_BUF_LRU_REF);
 XFS_ERRORTAG_ATTR_RW(force_repair,	XFS_ERRTAG_FORCE_SCRUB_REPAIR);
 XFS_ERRORTAG_ATTR_RW(bad_summary,	XFS_ERRTAG_FORCE_SUMMARY_RECALC);
+XFS_ERRORTAG_ATTR_RW(iunlink_fallback,	XFS_ERRTAG_IUNLINK_FALLBACK);
 
 static struct attribute *xfs_errortag_attrs[] = {
 	XFS_ERRORTAG_ATTR_LIST(noerror),
@@ -195,6 +197,7 @@ static struct attribute *xfs_errortag_attrs[] = {
 	XFS_ERRORTAG_ATTR_LIST(buf_lru_ref),
 	XFS_ERRORTAG_ATTR_LIST(force_repair),
 	XFS_ERRORTAG_ATTR_LIST(bad_summary),
+	XFS_ERRORTAG_ATTR_LIST(iunlink_fallback),
 	NULL,
 };
 
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 206bb004c4f2..9e2e0e9e3473 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1906,6 +1906,208 @@ xfs_inactive(
 	xfs_qm_dqdetach(ip);
 }
 
+/*
+ * In-Core Unlinked List Lookups
+ * =============================
+ *
+ * Every inode is supposed to be reachable from some other piece of metadata
+ * with the exception of the root directory.  Inodes with a connection to a
+ * file descriptor but not linked from anywhere in the on-disk directory tree
+ * are collectively known as unlinked inodes, though the filesystem itself
+ * maintains links to these inodes so that on-disk metadata are consistent.
+ *
+ * XFS implements a per-AG on-disk hash table of unlinked inodes.  The AGI
+ * header contains a number of buckets that point to an inode, and each inode
+ * record has a pointer to the next inode in the hash chain.  This
+ * singly-linked list causes scaling problems in the iunlink remove function
+ * because we must walk that list to find the inode that points to the inode
+ * being removed from the unlinked hash bucket list.
+ *
+ * What if we modelled the unlinked list as a collection of records capturing
+ * "X.next_unlinked = Y" relations?  If we indexed those records on Y, we'd
+ * have a fast way to look up unlinked list predecessors, which avoids the
+ * slow list walk.  That's exactly what we do here (in-core) with a per-AG
+ * rhashtable.
+ *
+ * Because this is a backref cache, we ignore operational failures since the
+ * iunlink code can fall back to the slow bucket walk.  The only errors that
+ * should bubble out are for obviously incorrect situations.
+ *
+ * All users of the backref cache MUST hold the AGI buffer lock to serialize
+ * access or have otherwise provided for concurrency control.
+ */
+
+/* Capture a "X.next_unlinked = Y" relationship. */
+struct xfs_iunlink {
+	struct rhash_head	iu_rhash_head;
+	xfs_agino_t		iu_agino;		/* X */
+	xfs_agino_t		iu_next_unlinked;	/* Y */
+};
+
+/* Unlinked list predecessor lookup hashtable construction */
+static int
+xfs_iunlink_obj_cmpfn(
+	struct rhashtable_compare_arg	*arg,
+	const void			*obj)
+{
+	const xfs_agino_t		*key = arg->key;
+	const struct xfs_iunlink	*iu = obj;
+
+	if (iu->iu_next_unlinked != *key)
+		return 1;
+	return 0;
+}
+
+static const struct rhashtable_params xfs_iunlink_hash_params = {
+	.min_size		= XFS_AGI_UNLINKED_BUCKETS,
+	.key_len		= sizeof(xfs_agino_t),
+	.key_offset		= offsetof(struct xfs_iunlink,
+					   iu_next_unlinked),
+	.head_offset		= offsetof(struct xfs_iunlink, iu_rhash_head),
+	.automatic_shrinking	= true,
+	.obj_cmpfn		= xfs_iunlink_obj_cmpfn,
+};
+
+/*
+ * Return X, where X.next_unlinked == @agino.  Returns NULLAGINO if no such
+ * relation is found.
+ */
+static xfs_agino_t
+xfs_iunlink_lookup_backref(
+	struct xfs_perag	*pag,
+	xfs_agino_t		agino)
+{
+	struct xfs_iunlink	*iu;
+
+	iu = rhashtable_lookup_fast(&pag->pagi_unlinked_hash, &agino,
+			xfs_iunlink_hash_params);
+	return iu ? iu->iu_agino : NULLAGINO;
+}
+
+/*
+ * Take ownership of an iunlink cache entry and insert it into the hash table.
+ * If successful, the entry will be owned by the cache; if not it is freed.
+ * Either way the caller must not use @iu after this call.
+ */
+static int
+xfs_iunlink_insert_backref(
+	struct xfs_perag	*pag,
+	struct xfs_iunlink	*iu)
+{
+	int			error;
+
+	error = rhashtable_insert_fast(&pag->pagi_unlinked_hash,
+			&iu->iu_rhash_head, xfs_iunlink_hash_params);
+	/*
+	 * Fail loudly if there already was an entry because that's a sign of
+	 * corruption of in-memory data.  Absorb any other runtime errors
+	 * because this is a cache and we can always fall back to bucket list
+	 * scanning.
+	 */
+	if (error)
+		kmem_free(iu);
+	if (error != 0 && error != -EEXIST)
+		error = 0;
+	return error;
+}
+
+/* Remember that @prev_agino.next_unlinked = @this_agino. */
+static int
+xfs_iunlink_add_backref(
+	struct xfs_perag	*pag,
+	xfs_agino_t		prev_agino,
+	xfs_agino_t		this_agino)
+{
+	struct xfs_iunlink	*iu;
+
+	if (XFS_TEST_ERROR(false, pag->pag_mount, XFS_ERRTAG_IUNLINK_FALLBACK))
+		return 0;
+
+	iu = kmem_zalloc(sizeof(*iu), KM_SLEEP | KM_NOFS);
+	iu->iu_agino = prev_agino;
+	iu->iu_next_unlinked = this_agino;
+
+	return xfs_iunlink_insert_backref(pag, iu);
+}
+
+/*
+ * Replace X.next_unlinked = @agino with X.next_unlinked = @next_unlinked.
+ * If @next_unlinked is NULLAGINO, we drop the backref and exit.  If there
+ * wasn't any such entry then we don't bother.
+ */
+static int
+xfs_iunlink_change_backref(
+	struct xfs_perag	*pag,
+	xfs_agino_t		agino,
+	xfs_agino_t		next_unlinked)
+{
+	struct xfs_iunlink	*iu;
+	int			error;
+
+	/* Look up the old entry; if there wasn't one then exit. */
+	iu = rhashtable_lookup_fast(&pag->pagi_unlinked_hash, &agino,
+			xfs_iunlink_hash_params);
+	if (!iu)
+		return 0;
+
+	/*
+	 * Remove the entry.  This shouldn't ever return an error, but if we
+	 * couldn't remove the old entry we don't want to add it again to the
+	 * hash table, and if the entry disappeared on us then someone's
+	 * violated the locking rules and we need to fail loudly.  Either way
+	 * we cannot remove the inode because internal state is or would have
+	 * been corrupt.
+	 */
+	error = rhashtable_remove_fast(&pag->pagi_unlinked_hash,
+			&iu->iu_rhash_head, xfs_iunlink_hash_params);
+	if (error)
+		return error;
+
+	/* If there is no new next entry just free our item and return. */
+	if (next_unlinked == NULLAGINO) {
+		kmem_free(iu);
+		return 0;
+	}
+
+	/* Update the entry and re-add it to the hash table. */
+	iu->iu_next_unlinked = next_unlinked;
+	return xfs_iunlink_insert_backref(pag, iu);
+}
+
+/* Set up the in-core predecessor structures. */
+int
+xfs_iunlink_init(
+	struct xfs_perag	*pag)
+{
+	return rhashtable_init(&pag->pagi_unlinked_hash,
+			&xfs_iunlink_hash_params);
+}
+
+/* Free the in-core predecessor structures. */
+static void
+xfs_iunlink_free_item(
+	void			*ptr,
+	void			*arg)
+{
+	struct xfs_iunlink	*iu = ptr;
+	bool			*freed_anything = arg;
+
+	*freed_anything = true;
+	kmem_free(iu);
+}
+
+void
+xfs_iunlink_destroy(
+	struct xfs_perag	*pag)
+{
+	bool			freed_anything = false;
+
+	rhashtable_free_and_destroy(&pag->pagi_unlinked_hash,
+			xfs_iunlink_free_item, &freed_anything);
+
+	ASSERT(freed_anything == false || XFS_FORCED_SHUTDOWN(pag->pag_mount));
+}
+
 /*
  * Point the AGI unlinked bucket at an inode and log the results.  The caller
  * is responsible for validating the old value.
@@ -2066,7 +2268,8 @@ xfs_iunlink(
 		return -EFSCORRUPTED;
 
 	if (next_agino != NULLAGINO) {
-		xfs_agino_t	old_agino;
+		struct xfs_perag	*pag;
+		xfs_agino_t		old_agino;
 
 		/*
 		 * There is already another inode in the bucket, so point this
@@ -2077,6 +2280,16 @@ xfs_iunlink(
 		if (error)
 			return error;
 		ASSERT(old_agino == NULLAGINO);
+
+		/*
+		 * agino has been unlinked, add a backref from the next inode
+		 * back to agino.
+		 */
+		pag = xfs_perag_get(mp, agno);
+		error = xfs_iunlink_add_backref(pag, agino, next_agino);
+		xfs_perag_put(pag);
+		if (error)
+			return error;
 	}
 
 	/* Point the head of the list to point to this inode. */
@@ -2133,7 +2346,8 @@ xfs_iunlink_map_prev(
 	xfs_agino_t		*agino,
 	struct xfs_imap		*imap,
 	struct xfs_dinode	**dipp,
-	struct xfs_buf		**bpp)
+	struct xfs_buf		**bpp,
+	struct xfs_perag	*pag)
 {
 	struct xfs_mount	*mp = tp->t_mountp;
 	xfs_agino_t		next_agino;
@@ -2142,6 +2356,28 @@ xfs_iunlink_map_prev(
 	ASSERT(head_agino != target_agino);
 	*bpp = NULL;
 
+	/* See if our backref cache can find it faster. */
+	*agino = xfs_iunlink_lookup_backref(pag, target_agino);
+	if (*agino != NULLAGINO) {
+		error = xfs_iunlink_map_ino(tp, agno, *agino, imap, dipp, bpp);
+		if (error)
+			return error;
+
+		if (be32_to_cpu((*dipp)->di_next_unlinked) == target_agino)
+			return 0;
+
+		/*
+		 * If we get here the cache contents were corrupt, so drop the
+		 * buffer and fall back to walking the bucket list.
+		 */
+		xfs_trans_brelse(tp, *bpp);
+		*bpp = NULL;
+		WARN_ON_ONCE(1);
+	}
+
+	trace_xfs_iunlink_map_prev_fallback(mp, agno);
+
+	/* Otherwise, walk the entire bucket until we find it. */
 	next_agino = head_agino;
 	while (next_agino != target_agino) {
 		xfs_agino_t	unlinked_agino;
@@ -2187,6 +2423,7 @@ xfs_iunlink_remove(
 	struct xfs_buf		*agibp;
 	struct xfs_buf		*last_ibp;
 	struct xfs_dinode	*last_dip = NULL;
+	struct xfs_perag	*pag = NULL;
 	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
 	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
 	xfs_agino_t		next_agino;
@@ -2222,27 +2459,62 @@ xfs_iunlink_remove(
 	if (error)
 		return error;
 
+	/*
+	 * If there was a backref pointing from the next inode back to this
+	 * one, remove it because we've removed this inode from the list.
+	 *
+	 * Later, if this inode was in the middle of the list we'll update
+	 * this inode's backref to point from the next inode.
+	 */
+	if (next_agino != NULLAGINO) {
+		pag = xfs_perag_get(mp, agno);
+		error = xfs_iunlink_change_backref(pag, next_agino,
+				NULLAGINO);
+		if (error)
+			goto out;
+	}
+
 	if (head_agino == agino) {
 		/* Point the head of the list to the next unlinked inode. */
 		error = xfs_iunlink_update_bucket(tp, agno, agibp, bucket_index,
 				next_agino);
 		if (error)
-			return error;
+			goto out;
 	} else {
 		struct xfs_imap	imap;
 		xfs_agino_t	prev_agino;
 
+		if (!pag)
+			pag = xfs_perag_get(mp, agno);
+
 		/* We need to search the list for the inode being freed. */
 		error = xfs_iunlink_map_prev(tp, agno, head_agino, agino,
-				&prev_agino, &imap, &last_dip, &last_ibp);
+				&prev_agino, &imap, &last_dip, &last_ibp,
+				pag);
 		if (error)
-			return error;
+			goto out;
 
 		/* Point the previous inode on the list to the next inode. */
 		xfs_iunlink_update_dinode(tp, agno, prev_agino, last_ibp,
 				last_dip, &imap, next_agino);
+
+		/*
+		 * Now we deal with the backref for this inode.  If this inode
+		 * pointed at a real inode, change the backref that pointed to
+		 * us to point to our old next.  If this inode was the end of
+		 * the list, delete the backref that pointed to us.  Note that
+		 * change_backref takes care of deleting the backref if
+		 * next_agino is NULLAGINO.
+		 */
+		error = xfs_iunlink_change_backref(pag, agino, next_agino);
+		if (error)
+			goto out;
 	}
-	return 0;
+
+out:
+	if (pag)
+		xfs_perag_put(pag);
+	return error;
 }
 
 /*
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index be2014520155..e62074a5257c 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -500,4 +500,7 @@ extern struct kmem_zone	*xfs_inode_zone;
 
 bool xfs_inode_verify_forks(struct xfs_inode *ip);
 
+int xfs_iunlink_init(struct xfs_perag *pag);
+void xfs_iunlink_destroy(struct xfs_perag *pag);
+
 #endif	/* __XFS_INODE_H__ */
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index b4d8c318be3c..fd63b0b1307c 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -149,6 +149,7 @@ xfs_free_perag(
 		spin_unlock(&mp->m_perag_lock);
 		ASSERT(pag);
 		ASSERT(atomic_read(&pag->pag_ref) == 0);
+		xfs_iunlink_destroy(pag);
 		xfs_buf_hash_destroy(pag);
 		mutex_destroy(&pag->pag_ici_reclaim_lock);
 		call_rcu(&pag->rcu_head, __xfs_free_perag);
@@ -227,6 +228,9 @@ xfs_initialize_perag(
 		/* first new pag is fully initialized */
 		if (first_initialised == NULLAGNUMBER)
 			first_initialised = index;
+		error = xfs_iunlink_init(pag);
+		if (error)
+			goto out_hash_destroy;
 	}
 
 	index = xfs_set_inode_alloc(mp, agcount);
@@ -249,6 +253,7 @@ xfs_initialize_perag(
 		if (!pag)
 			break;
 		xfs_buf_hash_destroy(pag);
+		xfs_iunlink_destroy(pag);
 		mutex_destroy(&pag->pag_ici_reclaim_lock);
 		kmem_free(pag);
 	}
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 7daafe064af8..a33f45077867 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -396,6 +396,13 @@ typedef struct xfs_perag {
 
 	/* reference count */
 	uint8_t			pagf_refcount_level;
+
+	/*
+	 * Unlinked inode information.  This incore information reflects
+	 * data stored in the AGI, so callers must hold the AGI buffer lock
+	 * or have some other means to control concurrency.
+	 */
+	struct rhashtable	pagi_unlinked_hash;
 } xfs_perag_t;
 
 static inline struct xfs_ag_resv *
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index a6e384a642b1..c83ce022a355 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -3447,6 +3447,7 @@ DEFINE_EVENT(xfs_ag_inode_class, name, \
 	TP_ARGS(ip))
 DEFINE_AGINODE_EVENT(xfs_iunlink);
 DEFINE_AGINODE_EVENT(xfs_iunlink_remove);
+DEFINE_AG_EVENT(xfs_iunlink_map_prev_fallback);
 
 #endif /* _TRACE_XFS_H */
 

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 6/9] xfs: refactor unlinked list search and mapping to a separate function
  2019-02-06 16:55 ` [PATCH 6/9] xfs: refactor unlinked list search and mapping to a separate function Darrick J. Wong
  2019-02-07 14:28   ` Brian Foster
@ 2019-02-08  7:57   ` Christoph Hellwig
  1 sibling, 0 replies; 22+ messages in thread
From: Christoph Hellwig @ 2019-02-08  7:57 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Feb 06, 2019 at 08:55:17AM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> There's a loop that searches an unlinked bucket list to find the inode
> that points to a given inode.  Hoist this into a separate function;
> later we'll use our iunlink backref cache to bypass the slow list
> operation.  No functional changes.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 9/9] xfs: cache unlinked pointers in an rhashtable
  2019-02-07 14:31   ` Brian Foster
  2019-02-07 16:33     ` Darrick J. Wong
@ 2019-02-08  8:00     ` Christoph Hellwig
  2019-02-08 12:06       ` Brian Foster
  1 sibling, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2019-02-08  8:00 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, linux-xfs

> > +	if (XFS_TEST_ERROR(false, pag->pag_mount, XFS_ERRTAG_IUNLINK_FALLBACK))
> > +		return 0;
> > +
> > +	iu = kmem_zalloc(sizeof(*iu), KM_SLEEP | KM_NOFS);
> > +	iu->iu_agino = prev_agino;
> > +	iu->iu_next_unlinked = this_agino;
> > +
> > +	error = rhashtable_insert_fast(&pag->pagi_unlinked_hash,
> > +			&iu->iu_rhash_head, xfs_iunlink_hash_params);
> > +	if (error == 0 || error == -EEXIST)
> > +		return error;
> > +	return 0;
> 
> The return val looks funny at a glance: return -EEXIST or otherwise
> return 0 for success and all other errors (also the error == 0 looks
> superfluous). I see the comment above describes what this does, but it
> doesn't explain why. I'd move the comment to above the if check and
> explain why it's there (i.e., "Absorb all runtime errors besides -EEXIST
> because fallback blah blah. We care about -EEXIST because ...").
> 
> Also I assume we need to free the iu object on insert failure,
> regardless of whether we ultimately return the error.

But is this really the right thing to do?  Everything but ENOMEM
would be a bug in the rhashtable implementation, or the XFS use,
wouldn't it?

So why not drop the return value entirely and do a:

	error = rhashtable_insert_fast(&pag->pagi_unlinked_hash,
			&iu->iu_rhash_head, xfs_iunlink_hash_params);
	ASSERT(error == 0 || error == -ENOMEM);

> > +	error = rhashtable_remove_fast(&pag->pagi_unlinked_hash,
> > +			&iu->iu_rhash_head, xfs_iunlink_hash_params);
> > +	if (error)
> > +		return error;
> 
> What's interesting about remove failure is that if we otherwise found an
> iu for this inode, removing the inode from the unlinked list leaves a
> stale/incorrect iu around in the in-core table. So it's one thing for
> the in-core table to be a valid subset of the on-disk structures, but
> another thing entirely to have incoherence between the two. It might be
> worth pointing out that it's critical to return an error here so we
> don't actually remove the inode (whereas the re-add below is less
> strict).

Again, remove failure here sounds like a programming bug - so ASSERT
and or forced shutdown here.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 9/9] xfs: cache unlinked pointers in an rhashtable
  2019-02-08  8:00     ` Christoph Hellwig
@ 2019-02-08 12:06       ` Brian Foster
  2019-02-08 16:54         ` Darrick J. Wong
  2019-02-11  8:08         ` Christoph Hellwig
  0 siblings, 2 replies; 22+ messages in thread
From: Brian Foster @ 2019-02-08 12:06 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Darrick J. Wong, linux-xfs

On Fri, Feb 08, 2019 at 12:00:16AM -0800, Christoph Hellwig wrote:
> > > +	if (XFS_TEST_ERROR(false, pag->pag_mount, XFS_ERRTAG_IUNLINK_FALLBACK))
> > > +		return 0;
> > > +
> > > +	iu = kmem_zalloc(sizeof(*iu), KM_SLEEP | KM_NOFS);
> > > +	iu->iu_agino = prev_agino;
> > > +	iu->iu_next_unlinked = this_agino;
> > > +
> > > +	error = rhashtable_insert_fast(&pag->pagi_unlinked_hash,
> > > +			&iu->iu_rhash_head, xfs_iunlink_hash_params);
> > > +	if (error == 0 || error == -EEXIST)
> > > +		return error;
> > > +	return 0;
> > 
> > The return val looks funny at a glance: return -EEXIST or otherwise
> > return 0 for success and all other errors (also the error == 0 looks
> > superfluous). I see the comment above describes what this does, but it
> > doesn't explain why. I'd move the comment to above the if check and
> > explain why it's there (i.e., "Absorb all runtime errors besides -EEXIST
> > because fallback blah blah. We care about -EEXIST because ...").
> > 
> > Also I assume we need to free the iu object on insert failure,
> > regardless of whether we ultimately return the error.
> 
> But is this really the right thing to do?  Everything but ENOMEM
> would be a bug in the rhashtable implementation, or the XFS use,
> wouldn't it?
> 

I think anything beyond -ENOMEM would imply a bug somewhere, yes.

> So why not drop the return value entirely and do a:
> 
> 	error = rhashtable_insert_fast(&pag->pagi_unlinked_hash,
> 			&iu->iu_rhash_head, xfs_iunlink_hash_params);
> 	ASSERT(error == 0 || error == -ENOMEM);
> 

It's not clear to me whether you're suggesting we return 0, error or
nothing at all here. The assert otherwise seems fine to me as I don't
think we'd ever expect anything outside of success or -ENOMEM.

That said, I don't see any reason to ever leak an iu if we know it
didn't make it into the table. I could probably go either way on whether
we wanted to allow the filesystem to continue or not on unexpected
insert errors. The original comment was just that we probably shouldn't
explode on "expected" errors like -ENOSPC.

> > > +	error = rhashtable_remove_fast(&pag->pagi_unlinked_hash,
> > > +			&iu->iu_rhash_head, xfs_iunlink_hash_params);
> > > +	if (error)
> > > +		return error;
> > 
> > What's interesting about remove failure is that if we otherwise found an
> > iu for this inode, removing the inode from the unlinked list leaves a
> > stale/incorrect iu around in the in-core table. So it's one thing for
> > the in-core table to be a valid subset of the on-disk structures, but
> > another thing entirely to have incoherence between the two. It might be
> > worth pointing out that it's critical to return an error here so we
> > don't actually remove the inode (whereas the re-add below is less
> > strict).
> 
> Again, remove failure here sounds like a programming bug - so ASSERT
> and or forced shutdown here.

Remove failure already unconditionally returns the error (which in the
inactive path translates to a shutdown). My feedback above was just a
suggestion to explain why in a comment because the error handling is
intentionally different between table insert/remove/lookup.

Brian

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 9/9] xfs: cache unlinked pointers in an rhashtable
  2019-02-08 12:06       ` Brian Foster
@ 2019-02-08 16:54         ` Darrick J. Wong
  2019-02-11 12:21           ` Brian Foster
  2019-02-11  8:08         ` Christoph Hellwig
  1 sibling, 1 reply; 22+ messages in thread
From: Darrick J. Wong @ 2019-02-08 16:54 UTC (permalink / raw)
  To: Brian Foster; +Cc: Christoph Hellwig, linux-xfs

On Fri, Feb 08, 2019 at 07:06:24AM -0500, Brian Foster wrote:
> On Fri, Feb 08, 2019 at 12:00:16AM -0800, Christoph Hellwig wrote:
> > > > +	if (XFS_TEST_ERROR(false, pag->pag_mount, XFS_ERRTAG_IUNLINK_FALLBACK))
> > > > +		return 0;
> > > > +
> > > > +	iu = kmem_zalloc(sizeof(*iu), KM_SLEEP | KM_NOFS);
> > > > +	iu->iu_agino = prev_agino;
> > > > +	iu->iu_next_unlinked = this_agino;
> > > > +
> > > > +	error = rhashtable_insert_fast(&pag->pagi_unlinked_hash,
> > > > +			&iu->iu_rhash_head, xfs_iunlink_hash_params);
> > > > +	if (error == 0 || error == -EEXIST)
> > > > +		return error;
> > > > +	return 0;
> > > 
> > > The return val looks funny at a glance: return -EEXIST or otherwise
> > > return 0 for success and all other errors (also the error == 0 looks
> > > superfluous). I see the comment above describes what this does, but it
> > > doesn't explain why. I'd move the comment to above the if check and
> > > explain why it's there (i.e., "Absorb all runtime errors besides -EEXIST
> > > because fallback blah blah. We care about -EEXIST because ...").
> > > 
> > > Also I assume we need to free the iu object on insert failure,
> > > regardless of whether we ultimately return the error.
> > 
> > But is this really the right thing to do?  Everything but ENOMEM
> > would be a bug in the rhashtable implementation, or the XFS use,
> > wouldn't it?
> > 
> 
> I think anything beyond -ENOMEM would imply a bug somewhere, yes.

AFAICT the only other error we can receive from rhashtable is -E2BIG.
That only happens if we somehow end up with more than 2^31 inodes (or
leave a logic bomb for ourselves by setting max_size, I guess...)

> > So why not drop the return value entirely and do a:
> > 
> > 	error = rhashtable_insert_fast(&pag->pagi_unlinked_hash,
> > 			&iu->iu_rhash_head, xfs_iunlink_hash_params);
> > 	ASSERT(error == 0 || error == -ENOMEM);

But if someone hits -EEXIST on a non-debug filesystem we'll never know
that our code was actually buggy, whereas returning it shuts down the
filesystem and we'll start getting letters.

> It's not clear to me whether you're suggesting we return 0, error or
> nothing at all here. The assert otherwise seems fine to me as I don't
> think we'd ever expect anything outside of success or -ENOMEM.

As I mentioned above, we could theoretically receive -E2BIG due to a
programming bug, or at some point the rhashtable code could start
returning new errors it hasn't before, or we could be misreading the
code too. :)

> That said, I don't see any reason to ever leak an iu if we know it
> didn't make it into the table.

Agreed.

> I could probably go either way on whether we wanted to allow the
> filesystem to continue or not on unexpected insert errors. The
> original comment was just that we probably shouldn't explode on
> "expected" errors like -ENOSPC.

Since we /do/ have a fallback to handle cache misses, I think it's fine
to ignore hashtable insertion failures, though I'll put in a ASSERT if
we see any error other than ENOMEM.

> 
> > > > +	error = rhashtable_remove_fast(&pag->pagi_unlinked_hash,
> > > > +			&iu->iu_rhash_head, xfs_iunlink_hash_params);
> > > > +	if (error)
> > > > +		return error;
> > > 
> > > What's interesting about remove failure is that if we otherwise found an
> > > iu for this inode, removing the inode from the unlinked list leaves a
> > > stale/incorrect iu around in the in-core table. So it's one thing for
> > > the in-core table to be a valid subset of the on-disk structures, but
> > > another thing entirely to have incoherence between the two. It might be
> > > worth pointing out that it's critical to return an error here so we
> > > don't actually remove the inode (whereas the re-add below is less
> > > strict).
> > 
> > Again, remove failure here sounds like a programming bug - so ASSERT
> > and or forced shutdown here.
> 
> Remove failure already unconditionally returns the error (which in the
> inactive path translates to a shutdown). My feedback above was just a
> suggestion to explain why in a comment because the error handling is
> intentionally different between table insert/remove/lookup.

I assume you're ok with the comment that got added in the followup patch
I sent to this thread?

--D

> Brian

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 9/9] xfs: cache unlinked pointers in an rhashtable
  2019-02-08 12:06       ` Brian Foster
  2019-02-08 16:54         ` Darrick J. Wong
@ 2019-02-11  8:08         ` Christoph Hellwig
  2019-02-11 12:21           ` Brian Foster
  1 sibling, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2019-02-11  8:08 UTC (permalink / raw)
  To: Brian Foster; +Cc: Christoph Hellwig, Darrick J. Wong, linux-xfs

On Fri, Feb 08, 2019 at 07:06:24AM -0500, Brian Foster wrote:
> It's not clear to me whether you're suggesting we return 0, error or
> nothing at all here. The assert otherwise seems fine to me as I don't
> think we'd ever expect anything outside of success or -ENOMEM.

I'm arguing that we should return nothing.

> That said, I don't see any reason to ever leak an iu if we know it
> didn't make it into the table. I could probably go either way on whether
> we wanted to allow the filesystem to continue or not on unexpected
> insert errors. The original comment was just that we probably shouldn't
> explode on "expected" errors like -ENOSPC.

Well, IFF the only error case that should happen is either ENOMEM or
E2BIG we don't have an allocation in that case.  Everything else is
a programming bug where we should assert / shut the file system down,
in which case the tiny leak is the least of our problems.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 9/9] xfs: cache unlinked pointers in an rhashtable
  2019-02-08 16:54         ` Darrick J. Wong
@ 2019-02-11 12:21           ` Brian Foster
  0 siblings, 0 replies; 22+ messages in thread
From: Brian Foster @ 2019-02-11 12:21 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs

On Fri, Feb 08, 2019 at 08:54:14AM -0800, Darrick J. Wong wrote:
> On Fri, Feb 08, 2019 at 07:06:24AM -0500, Brian Foster wrote:
> > On Fri, Feb 08, 2019 at 12:00:16AM -0800, Christoph Hellwig wrote:
> > > > > +	if (XFS_TEST_ERROR(false, pag->pag_mount, XFS_ERRTAG_IUNLINK_FALLBACK))
> > > > > +		return 0;
> > > > > +
> > > > > +	iu = kmem_zalloc(sizeof(*iu), KM_SLEEP | KM_NOFS);
> > > > > +	iu->iu_agino = prev_agino;
> > > > > +	iu->iu_next_unlinked = this_agino;
> > > > > +
> > > > > +	error = rhashtable_insert_fast(&pag->pagi_unlinked_hash,
> > > > > +			&iu->iu_rhash_head, xfs_iunlink_hash_params);
> > > > > +	if (error == 0 || error == -EEXIST)
> > > > > +		return error;
> > > > > +	return 0;
> > > > 
> > > > The return val looks funny at a glance: return -EEXIST or otherwise
> > > > return 0 for success and all other errors (also the error == 0 looks
> > > > superfluous). I see the comment above describes what this does, but it
> > > > doesn't explain why. I'd move the comment to above the if check and
> > > > explain why it's there (i.e., "Absorb all runtime errors besides -EEXIST
> > > > because fallback blah blah. We care about -EEXIST because ...").
> > > > 
> > > > Also I assume we need to free the iu object on insert failure,
> > > > regardless of whether we ultimately return the error.
> > > 
> > > But is this really the right thing to do?  Everything but ENOMEM
> > > would be a bug in the rhashtable implementation, or the XFS use,
> > > wouldn't it?
> > > 
> > 
> > I think anything beyond -ENOMEM would imply a bug somewhere, yes.
> 
> AFAICT the only other error we can receive from rhashtable is -E2BIG.
> That only happens if we somehow end up with more than 2^31 inodes (or
> leave a logic bomb for ourselves by setting max_size, I guess...)
> 
> > > So why not drop the return value entirely and do a:
> > > 
> > > 	error = rhashtable_insert_fast(&pag->pagi_unlinked_hash,
> > > 			&iu->iu_rhash_head, xfs_iunlink_hash_params);
> > > 	ASSERT(error == 0 || error == -ENOMEM);
> 
> But if someone hits -EEXIST on a non-debug filesystem we'll never know
> that our code was actually buggy, whereas returning it shuts down the
> filesystem and we'll start getting letters.
> 
> > It's not clear to me whether you're suggesting we return 0, error or
> > nothing at all here. The assert otherwise seems fine to me as I don't
> > think we'd ever expect anything outside of success or -ENOMEM.
> 
> As I mentioned above, we could theoretically receive -E2BIG due to a
> programming bug, or at some point the rhashtable code could start
> returning new errors it hasn't before, or we could be misreading the
> code too. :)
> 

Good point. Despite all our reasoning about what errors we might or
might not expect, I think this comes down to the fact that the API
returns an error. It's probably not appropriate to ignore it entirely,
even with a fallback mechanism in place that allows us to not fail the
higher level operation.

> > That said, I don't see any reason to ever leak an iu if we know it
> > didn't make it into the table.
> 
> Agreed.
> 
> > I could probably go either way on whether we wanted to allow the
> > filesystem to continue or not on unexpected insert errors. The
> > original comment was just that we probably shouldn't explode on
> > "expected" errors like -ENOSPC.
> 
> Since we /do/ have a fallback to handle cache misses, I think it's fine
> to ignore hashtable insertion failures, though I'll put in a ASSERT if
> we see any error other than ENOMEM.
> 

Sounds good.

> > 
> > > > > +	error = rhashtable_remove_fast(&pag->pagi_unlinked_hash,
> > > > > +			&iu->iu_rhash_head, xfs_iunlink_hash_params);
> > > > > +	if (error)
> > > > > +		return error;
> > > > 
> > > > What's interesting about remove failure is that if we otherwise found an
> > > > iu for this inode, removing the inode from the unlinked list leaves a
> > > > stale/incorrect iu around in the in-core table. So it's one thing for
> > > > the in-core table to be a valid subset of the on-disk structures, but
> > > > another thing entirely to have incoherence between the two. It might be
> > > > worth pointing out that it's critical to return an error here so we
> > > > don't actually remove the inode (whereas the re-add below is less
> > > > strict).
> > > 
> > > Again, remove failure here sounds like a programming bug - so ASSERT
> > > and or forced shutdown here.
> > 
> > Remove failure already unconditionally returns the error (which in the
> > inactive path translates to a shutdown). My feedback above was just a
> > suggestion to explain why in a comment because the error handling is
> > intentionally different between table insert/remove/lookup.
> 
> I assume you're ok with the comment that got added in the followup patch
> I sent to this thread?
> 

Yep, thanks.

Brian

> --D
> 
> > Brian

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 9/9] xfs: cache unlinked pointers in an rhashtable
  2019-02-11  8:08         ` Christoph Hellwig
@ 2019-02-11 12:21           ` Brian Foster
  0 siblings, 0 replies; 22+ messages in thread
From: Brian Foster @ 2019-02-11 12:21 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Darrick J. Wong, linux-xfs

On Mon, Feb 11, 2019 at 12:08:31AM -0800, Christoph Hellwig wrote:
> On Fri, Feb 08, 2019 at 07:06:24AM -0500, Brian Foster wrote:
> > It's not clear to me whether you're suggesting we return 0, error or
> > nothing at all here. The assert otherwise seems fine to me as I don't
> > think we'd ever expect anything outside of success or -ENOMEM.
> 
> I'm arguing that we should return nothing.
> 
> > That said, I don't see any reason to ever leak an iu if we know it
> > didn't make it into the table. I could probably go either way on whether
> > we wanted to allow the filesystem to continue or not on unexpected
> > insert errors. The original comment was just that we probably shouldn't
> > explode on "expected" errors like -ENOSPC.
> 
> Well, IFF the only error case that should happen is either ENOMEM or
> E2BIG we don't have an allocation in that case.  Everything else is
> a programming bug where we should assert / shut the file system down,
> in which case the tiny leak is the least of our problems.

The caller allocated the object we're trying to insert. From the earlier
patch in this thread:

+       iu = kmem_zalloc(sizeof(*iu), KM_SLEEP | KM_NOFS);
+       iu->iu_agino = prev_agino;
+       iu->iu_next_unlinked = this_agino;
+
+       error = rhashtable_insert_fast(&pag->pagi_unlinked_hash,
+                       &iu->iu_rhash_head, xfs_iunlink_hash_params);

Brian

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2019-02-11 12:21 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-02-06 16:54 [PATCH v3 0/9] xfs: incore unlinked list Darrick J. Wong
2019-02-06 16:54 ` [PATCH 1/9] xfs: clean up iunlink functions Darrick J. Wong
2019-02-06 16:54 ` [PATCH 2/9] xfs: add xfs_verify_agino_or_null helper Darrick J. Wong
2019-02-06 16:54 ` [PATCH 3/9] xfs: refactor AGI unlinked bucket updates Darrick J. Wong
2019-02-06 16:55 ` [PATCH 4/9] xfs: strengthen AGI unlinked inode bucket pointer checks Darrick J. Wong
2019-02-06 16:55 ` [PATCH 5/9] xfs: refactor inode unlinked pointer update functions Darrick J. Wong
2019-02-06 16:55 ` [PATCH 6/9] xfs: refactor unlinked list search and mapping to a separate function Darrick J. Wong
2019-02-07 14:28   ` Brian Foster
2019-02-07 16:19     ` Darrick J. Wong
2019-02-08  7:57   ` Christoph Hellwig
2019-02-06 16:55 ` [PATCH 7/9] xfs: refactor inode update in iunlink_remove Darrick J. Wong
2019-02-06 16:55 ` [PATCH 8/9] xfs: add tracepoints for high level iunlink operations Darrick J. Wong
2019-02-06 16:55 ` [PATCH 9/9] xfs: cache unlinked pointers in an rhashtable Darrick J. Wong
2019-02-07 14:31   ` Brian Foster
2019-02-07 16:33     ` Darrick J. Wong
2019-02-08  8:00     ` Christoph Hellwig
2019-02-08 12:06       ` Brian Foster
2019-02-08 16:54         ` Darrick J. Wong
2019-02-11 12:21           ` Brian Foster
2019-02-11  8:08         ` Christoph Hellwig
2019-02-11 12:21           ` Brian Foster
2019-02-07 18:24   ` [PATCH v2 " Darrick J. Wong

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.