* [PATCH 0/6 v2] xfs: xfs_iflush_cluster vs xfs_reclaim_inode
@ 2016-04-07 23:37 Dave Chinner
  2016-04-07 23:37 ` [PATCH 1/6] xfs: fix inode validity check in xfs_iflush_cluster Dave Chinner
                   ` (7 more replies)
  0 siblings, 8 replies; 18+ messages in thread
From: Dave Chinner @ 2016-04-07 23:37 UTC (permalink / raw)
  To: xfs

Hi folks,

This is the second version of this patch set, first posted and
described here:

http://oss.sgi.com/archives/xfs/2016-04/msg00069.html

The only change from the first version is splitting up the first
patch into two as Christoph requested - one for the bug fix, the
other for the variable renaming.

Cheers,

Dave.


* [PATCH 1/6] xfs: fix inode validity check in xfs_iflush_cluster
  2016-04-07 23:37 [PATCH 0/6 v2] xfs: xfs_iflush_cluster vs xfs_reclaim_inode Dave Chinner
@ 2016-04-07 23:37 ` Dave Chinner
  2016-04-07 23:43   ` Christoph Hellwig
  2016-04-07 23:37 ` [PATCH 2/6] xfs: rename variables in xfs_iflush_cluster for clarity Dave Chinner
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2016-04-07 23:37 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Some careless idiot(*) wrote crap code in commit 1a3e8f3 ("xfs:
convert inode cache lookups to use RCU locking") back in late 2010,
and so xfs_iflush_cluster checks the wrong inode for whether it is
still valid under RCU protection. Fix it to lock and check the
correct inode.

(*) Careless-idiot: Dave Chinner <dchinner@redhat.com>
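
For reference, the corrected pattern - shown in full in the hunk below - is
to validate the candidate inode returned by the radix tree gang lookup under
*its own* i_flags_lock, not the caller's inode (which is known to be valid).
A minimal sketch, using the variable names from this function (iq is the
lookup candidate, ip the caller's inode):

        spin_lock(&iq->i_flags_lock);
        if (!iq->i_ino ||
            (XFS_INO_TO_AGINO(mp, iq->i_ino) & mask) != first_index) {
                /* freed, reallocated or outside this cluster - skip it */
                spin_unlock(&iq->i_flags_lock);
                continue;
        }
        spin_unlock(&iq->i_flags_lock);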

Discovered-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_inode.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index f79ea59..32f11a9 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3192,13 +3192,13 @@ xfs_iflush_cluster(
 		 * We need to check under the i_flags_lock for a valid inode
 		 * here. Skip it if it is not valid or the wrong inode.
 		 */
-		spin_lock(&ip->i_flags_lock);
-		if (!ip->i_ino ||
+		spin_lock(&iq->i_flags_lock);
+		if (!iq->i_ino ||
 		    (XFS_INO_TO_AGINO(mp, iq->i_ino) & mask) != first_index) {
-			spin_unlock(&ip->i_flags_lock);
+			spin_unlock(&iq->i_flags_lock);
 			continue;
 		}
-		spin_unlock(&ip->i_flags_lock);
+		spin_unlock(&iq->i_flags_lock);
 
 		/*
 		 * Do an un-protected check to see if the inode is dirty and
-- 
2.7.0


* [PATCH 2/6] xfs: rename variables in xfs_iflush_cluster for clarity
  2016-04-07 23:37 [PATCH 0/6 v2] xfs: xfs_iflush_cluster vs xfs_reclaim_inode Dave Chinner
  2016-04-07 23:37 ` [PATCH 1/6] xfs: fix inode validity check in xfs_iflush_cluster Dave Chinner
@ 2016-04-07 23:37 ` Dave Chinner
  2016-04-07 23:44   ` Christoph Hellwig
  2016-04-07 23:37 ` [PATCH 3/6] xfs: skip stale inodes in xfs_iflush_cluster Dave Chinner
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2016-04-07 23:37 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

The cluster inode variable uses unconventional naming - iq - which
makes it hard to distinguish from the inode passed into the
function - ip - and that is a vector for mistakes to be made.
Rename all the cluster inode variables to use more conventional
prefixes to reduce potential future confusion (cilist, cilist_size,
cip).

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_inode.c | 64 +++++++++++++++++++++++++++---------------------------
 1 file changed, 32 insertions(+), 32 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 32f11a9..2718d10 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3149,16 +3149,16 @@ out_release_wip:
 
 STATIC int
 xfs_iflush_cluster(
-	xfs_inode_t	*ip,
-	xfs_buf_t	*bp)
+	struct xfs_inode	*ip,
+	struct xfs_buf		*bp)
 {
-	xfs_mount_t		*mp = ip->i_mount;
+	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_perag	*pag;
 	unsigned long		first_index, mask;
 	unsigned long		inodes_per_cluster;
-	int			ilist_size;
-	xfs_inode_t		**ilist;
-	xfs_inode_t		*iq;
+	int			cilist_size;
+	struct xfs_inode	**cilist;
+	struct xfs_inode	*cip;
 	int			nr_found;
 	int			clcount = 0;
 	int			bufwasdelwri;
@@ -3167,23 +3167,23 @@ xfs_iflush_cluster(
 	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
 
 	inodes_per_cluster = mp->m_inode_cluster_size >> mp->m_sb.sb_inodelog;
-	ilist_size = inodes_per_cluster * sizeof(xfs_inode_t *);
-	ilist = kmem_alloc(ilist_size, KM_MAYFAIL|KM_NOFS);
-	if (!ilist)
+	cilist_size = inodes_per_cluster * sizeof(xfs_inode_t *);
+	cilist = kmem_alloc(cilist_size, KM_MAYFAIL|KM_NOFS);
+	if (!cilist)
 		goto out_put;
 
 	mask = ~(((mp->m_inode_cluster_size >> mp->m_sb.sb_inodelog)) - 1);
 	first_index = XFS_INO_TO_AGINO(mp, ip->i_ino) & mask;
 	rcu_read_lock();
 	/* really need a gang lookup range call here */
-	nr_found = radix_tree_gang_lookup(&pag->pag_ici_root, (void**)ilist,
+	nr_found = radix_tree_gang_lookup(&pag->pag_ici_root, (void**)cilist,
 					first_index, inodes_per_cluster);
 	if (nr_found == 0)
 		goto out_free;
 
 	for (i = 0; i < nr_found; i++) {
-		iq = ilist[i];
-		if (iq == ip)
+		cip = cilist[i];
+		if (cip == ip)
 			continue;
 
 		/*
@@ -3192,20 +3192,20 @@ xfs_iflush_cluster(
 		 * We need to check under the i_flags_lock for a valid inode
 		 * here. Skip it if it is not valid or the wrong inode.
 		 */
-		spin_lock(&iq->i_flags_lock);
-		if (!iq->i_ino ||
-		    (XFS_INO_TO_AGINO(mp, iq->i_ino) & mask) != first_index) {
-			spin_unlock(&iq->i_flags_lock);
+		spin_lock(&cip->i_flags_lock);
+		if (!cip->i_ino ||
+		    (XFS_INO_TO_AGINO(mp, cip->i_ino) & mask) != first_index) {
+			spin_unlock(&cip->i_flags_lock);
 			continue;
 		}
-		spin_unlock(&iq->i_flags_lock);
+		spin_unlock(&cip->i_flags_lock);
 
 		/*
 		 * Do an un-protected check to see if the inode is dirty and
 		 * is a candidate for flushing.  These checks will be repeated
 		 * later after the appropriate locks are acquired.
 		 */
-		if (xfs_inode_clean(iq) && xfs_ipincount(iq) == 0)
+		if (xfs_inode_clean(cip) && xfs_ipincount(cip) == 0)
 			continue;
 
 		/*
@@ -3213,15 +3213,15 @@ xfs_iflush_cluster(
 		 * then this inode cannot be flushed and is skipped.
 		 */
 
-		if (!xfs_ilock_nowait(iq, XFS_ILOCK_SHARED))
+		if (!xfs_ilock_nowait(cip, XFS_ILOCK_SHARED))
 			continue;
-		if (!xfs_iflock_nowait(iq)) {
-			xfs_iunlock(iq, XFS_ILOCK_SHARED);
+		if (!xfs_iflock_nowait(cip)) {
+			xfs_iunlock(cip, XFS_ILOCK_SHARED);
 			continue;
 		}
-		if (xfs_ipincount(iq)) {
-			xfs_ifunlock(iq);
-			xfs_iunlock(iq, XFS_ILOCK_SHARED);
+		if (xfs_ipincount(cip)) {
+			xfs_ifunlock(cip);
+			xfs_iunlock(cip, XFS_ILOCK_SHARED);
 			continue;
 		}
 
@@ -3229,18 +3229,18 @@ xfs_iflush_cluster(
 		 * arriving here means that this inode can be flushed.  First
 		 * re-check that it's dirty before flushing.
 		 */
-		if (!xfs_inode_clean(iq)) {
+		if (!xfs_inode_clean(cip)) {
 			int	error;
-			error = xfs_iflush_int(iq, bp);
+			error = xfs_iflush_int(cip, bp);
 			if (error) {
-				xfs_iunlock(iq, XFS_ILOCK_SHARED);
+				xfs_iunlock(cip, XFS_ILOCK_SHARED);
 				goto cluster_corrupt_out;
 			}
 			clcount++;
 		} else {
-			xfs_ifunlock(iq);
+			xfs_ifunlock(cip);
 		}
-		xfs_iunlock(iq, XFS_ILOCK_SHARED);
+		xfs_iunlock(cip, XFS_ILOCK_SHARED);
 	}
 
 	if (clcount) {
@@ -3250,7 +3250,7 @@ xfs_iflush_cluster(
 
 out_free:
 	rcu_read_unlock();
-	kmem_free(ilist);
+	kmem_free(cilist);
 out_put:
 	xfs_perag_put(pag);
 	return 0;
@@ -3293,8 +3293,8 @@ cluster_corrupt_out:
 	/*
 	 * Unlocks the flush lock
 	 */
-	xfs_iflush_abort(iq, false);
-	kmem_free(ilist);
+	xfs_iflush_abort(cip, false);
+	kmem_free(cilist);
 	xfs_perag_put(pag);
 	return -EFSCORRUPTED;
 }
-- 
2.7.0


* [PATCH 3/6] xfs: skip stale inodes in xfs_iflush_cluster
  2016-04-07 23:37 [PATCH 0/6 v2] xfs: xfs_iflush_cluster vs xfs_reclaim_inode Dave Chinner
  2016-04-07 23:37 ` [PATCH 1/6] xfs: fix inode validity check in xfs_iflush_cluster Dave Chinner
  2016-04-07 23:37 ` [PATCH 2/6] xfs: rename variables in xfs_iflush_cluster for clarity Dave Chinner
@ 2016-04-07 23:37 ` Dave Chinner
  2016-04-07 23:37 ` [PATCH 4/6] xfs: xfs_iflush_cluster has range issues Dave Chinner
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Dave Chinner @ 2016-04-07 23:37 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

We don't write back stale inodes so we should skip them in
xfs_iflush_cluster, too.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
---
 fs/xfs/xfs_inode.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 2718d10..6598104 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3190,10 +3190,11 @@ xfs_iflush_cluster(
 		 * because this is an RCU protected lookup, we could find a
 		 * recently freed or even reallocated inode during the lookup.
 		 * We need to check under the i_flags_lock for a valid inode
-		 * here. Skip it if it is not valid or the wrong inode.
+		 * here. Skip it if it is not valid, stale or the wrong inode.
 		 */
 		spin_lock(&cip->i_flags_lock);
 		if (!cip->i_ino ||
+		    __xfs_iflags_test(ip, XFS_ISTALE) ||
 		    (XFS_INO_TO_AGINO(mp, cip->i_ino) & mask) != first_index) {
 			spin_unlock(&cip->i_flags_lock);
 			continue;
-- 
2.7.0


* [PATCH 4/6] xfs: xfs_iflush_cluster has range issues
  2016-04-07 23:37 [PATCH 0/6 v2] xfs: xfs_iflush_cluster vs xfs_reclaim_inode Dave Chinner
                   ` (2 preceding siblings ...)
  2016-04-07 23:37 ` [PATCH 3/6] xfs: skip stale inodes in xfs_iflush_cluster Dave Chinner
@ 2016-04-07 23:37 ` Dave Chinner
  2016-04-07 23:37 ` [PATCH 5/6] xfs: xfs_inode_free() isn't RCU safe Dave Chinner
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Dave Chinner @ 2016-04-07 23:37 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

xfs_iflush_cluster() does a gang lookup on the radix tree, meaning
it can find inodes beyond the current cluster if there is sparse
cache population. Gang lookups return results in ascending index
order, so stop trying to cluster inodes once the first inode outside
the cluster mask is detected.
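
Because the lookup results come back in ascending order, a break (rather
than a continue) is safe once the first out-of-cluster inode number is seen.
A tiny standalone C sketch of that reasoning, using illustrative inode
numbers and a 32-inode cluster (the real code applies the mask to the
AG-relative inode number):

        #include <stdio.h>

        int main(void)
        {
                /* ascending inode numbers, as a gang lookup would return them */
                unsigned long found[] = { 128, 129, 130, 131, 200, 201 };
                unsigned long mask = ~31UL;     /* 32 inodes per cluster */
                unsigned long first_index = found[0] & mask;
                int i;

                for (i = 0; i < 6; i++) {
                        if ((found[i] & mask) != first_index)
                                break;  /* later entries are outside the cluster too */
                        printf("flush candidate %lu\n", found[i]);
                }
                return 0;
        }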

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
---
 fs/xfs/xfs_inode.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 6598104..b984be4 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3194,11 +3194,20 @@ xfs_iflush_cluster(
 		 */
 		spin_lock(&cip->i_flags_lock);
 		if (!cip->i_ino ||
-		    __xfs_iflags_test(ip, XFS_ISTALE) ||
-		    (XFS_INO_TO_AGINO(mp, cip->i_ino) & mask) != first_index) {
+		    __xfs_iflags_test(ip, XFS_ISTALE)) {
 			spin_unlock(&cip->i_flags_lock);
 			continue;
 		}
+
+		/*
+		 * Once we fall off the end of the cluster, no point checking
+		 * any more inodes in the list because they will also all be
+		 * outside the cluster.
+		 */
+		if ((XFS_INO_TO_AGINO(mp, cip->i_ino) & mask) != first_index) {
+			spin_unlock(&cip->i_flags_lock);
+			break;
+		}
 		spin_unlock(&cip->i_flags_lock);
 
 		/*
-- 
2.7.0


* [PATCH 5/6] xfs: xfs_inode_free() isn't RCU safe
  2016-04-07 23:37 [PATCH 0/6 v2] xfs: xfs_iflush_cluster vs xfs_reclaim_inode Dave Chinner
                   ` (3 preceding siblings ...)
  2016-04-07 23:37 ` [PATCH 4/6] xfs: xfs_iflush_cluster has range issues Dave Chinner
@ 2016-04-07 23:37 ` Dave Chinner
  2016-04-07 23:37 ` [PATCH 6/6] xfs: mark reclaimed inodes invalid earlier Dave Chinner
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Dave Chinner @ 2016-04-07 23:37 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

The xfs_inode freed in xfs_inode_free() has multiple allocated
structures attached to it. We free these in xfs_inode_free() before
we mark the inode as invalid, and before we run call_rcu() to queue
the structure for freeing.

Unfortunately, this freeing can race with other accesses in the
current RCU grace period that have found the inode in the radix
tree with a valid state.  This includes xfs_iflush_cluster(), which
calls xfs_inode_clean(), and that accesses the inode log item on the
xfs_inode.

The log item structure is freed in xfs_inode_free(), so there is the
possibility we can be accessing freed memory in xfs_iflush_cluster()
after validating the xfs_inode structure as being valid for this RCU
context. Hence we can get a spurious, incorrect clean state returned
from such checks. This can lead to us thinking the inode is dirty
when it is, in fact, clean, and so incorrectly attaching it to the
buffer for IO and completion processing.

This then leads to use-after-free situations on the xfs_inode itself
if the IO completes after the current RCU grace period expires. The
buffer callbacks will access the xfs_inode and try to do all sorts
of things it shouldn't with freed memory.

IOWs, xfs_iflush_cluster() only works correctly when racing with
inode reclaim if the inode log item is present and correctly stating
the inode is clean. If the inode is being freed, then reclaim has
already made sure the inode is clean, and hence xfs_iflush_cluster
can skip it. However, we are accessing the inode under RCU
read lock protection and so also must ensure that all dynamically
allocated memory we reference in this context is not freed until the
RCU grace period expires.

To fix this, move all the potential memory freeing into
xfs_inode_free_callback() so that we guarantee the RCU-protected
lookup code will always have the memory structures it needs
available during the RCU grace period in which lookup races can
occur.
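
The shape of the fix, condensed from the diff below (kernel context, not a
standalone program): nothing an RCU-protected lookup might dereference is
freed before the grace period expires; xfs_inode_free() only queues the
callback.

        static void
        xfs_inode_free_callback(
                struct rcu_head         *head)
        {
                struct inode            *inode = container_of(head, struct inode, i_rcu);
                struct xfs_inode        *ip = XFS_I(inode);

                /*
                 * The grace period has expired, so no RCU-protected lookup
                 * can still reference ip: it is now safe to free the forks,
                 * the inode log item and the inode itself.
                 */
                /* ... free attached structures ... */
                kmem_zone_free(xfs_inode_zone, ip);
        }

        void
        xfs_inode_free(
                struct xfs_inode        *ip)
        {
                /* the real function also marks the inode as invalid first */
                /* only queue the free; leave everything a lookup may touch intact */
                call_rcu(&VFS_I(ip)->i_rcu, xfs_inode_free_callback);
        }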

Discovered-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_icache.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index bf2d607..0c94cde 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -94,13 +94,6 @@ xfs_inode_free_callback(
 	struct inode		*inode = container_of(head, struct inode, i_rcu);
 	struct xfs_inode	*ip = XFS_I(inode);
 
-	kmem_zone_free(xfs_inode_zone, ip);
-}
-
-void
-xfs_inode_free(
-	struct xfs_inode	*ip)
-{
 	switch (VFS_I(ip)->i_mode & S_IFMT) {
 	case S_IFREG:
 	case S_IFDIR:
@@ -118,6 +111,13 @@ xfs_inode_free(
 		ip->i_itemp = NULL;
 	}
 
+	kmem_zone_free(xfs_inode_zone, ip);
+}
+
+void
+xfs_inode_free(
+	struct xfs_inode	*ip)
+{
 	/*
 	 * Because we use RCU freeing we need to ensure the inode always
 	 * appears to be reclaimed with an invalid inode number when in the
-- 
2.7.0


* [PATCH 6/6] xfs: mark reclaimed inodes invalid earlier
  2016-04-07 23:37 [PATCH 0/6 v2] xfs: xfs_iflush_cluster vs xfs_reclaim_inode Dave Chinner
                   ` (4 preceding siblings ...)
  2016-04-07 23:37 ` [PATCH 5/6] xfs: xfs_inode_free() isn't RCU safe Dave Chinner
@ 2016-04-07 23:37 ` Dave Chinner
  2016-04-07 23:46   ` Christoph Hellwig
  2016-04-08  3:28 ` [PATCH 0/6 v2] xfs: xfs_iflush_cluster vs xfs_reclaim_inode Eryu Guan
  2016-04-08 17:18 ` Brian Foster
  7 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2016-04-07 23:37 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

The last thing we do before using call_rcu() on an xfs_inode to be
freed is mark it as invalid. This means there is a window between
when we know for certain that the inode is going to be freed and
when we do actually mark it as "freed".

This is important in the context of RCU lookups - we can look up the
inode, find that it is valid, and then use it as such not realising
that it is in the final stages of being freed.

As such, mark the inode as being invalid the moment we know it is
going to be reclaimed. This can be done while we still hold the
XFS_ILOCK_EXCL and the flush lock in xfs_reclaim_inode, meaning that
it occurs well before we remove it from the radix tree, and that
the i_flags_lock, the XFS_ILOCK and the inode flush lock all act as
synchronisation points for detecting that an inode is about to go
away.

For defensive purposes, this allows us to add a further check to
xfs_iflush_cluster to ensure we skip inodes that are being freed
after we grab the XFS_ILOCK_SHARED and the flush lock - if the inode
number is still valid while we have these locks held, we know that it
has not progressed through reclaim to the point where it is clean and
about to be freed.
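
In outline, the two sides of the race then synchronise like this (condensed
from the hunks below):

        /* reclaim side: xfs_reclaim_inode(), ILOCK_EXCL + flush lock held */
        spin_lock(&ip->i_flags_lock);
        ip->i_flags = XFS_IRECLAIM;
        ip->i_ino = 0;                  /* the inode now looks freed to lookups */
        spin_unlock(&ip->i_flags_lock);

        /* lookup side: xfs_iflush_cluster(), after ILOCK_SHARED + flush lock */
        if (!cip->i_ino) {
                /* raced with reclaim - the inode is clean, skip it */
                xfs_ifunlock(cip);
                xfs_iunlock(cip, XFS_ILOCK_SHARED);
                continue;
        }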

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_icache.c | 40 ++++++++++++++++++++++++++++++++--------
 fs/xfs/xfs_inode.c  | 14 +++++++++++++-
 2 files changed, 45 insertions(+), 9 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 0c94cde..a60db43 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -114,6 +114,18 @@ xfs_inode_free_callback(
 	kmem_zone_free(xfs_inode_zone, ip);
 }
 
+static void
+__xfs_inode_free(
+	struct xfs_inode	*ip)
+{
+	/* asserts to verify all state is correct here */
+	ASSERT(atomic_read(&ip->i_pincount) == 0);
+	ASSERT(!xfs_isiflocked(ip));
+	XFS_STATS_DEC(ip->i_mount, vn_active);
+
+	call_rcu(&VFS_I(ip)->i_rcu, xfs_inode_free_callback);
+}
+
 void
 xfs_inode_free(
 	struct xfs_inode	*ip)
@@ -129,12 +141,7 @@ xfs_inode_free(
 	ip->i_ino = 0;
 	spin_unlock(&ip->i_flags_lock);
 
-	/* asserts to verify all state is correct here */
-	ASSERT(atomic_read(&ip->i_pincount) == 0);
-	ASSERT(!xfs_isiflocked(ip));
-	XFS_STATS_DEC(ip->i_mount, vn_active);
-
-	call_rcu(&VFS_I(ip)->i_rcu, xfs_inode_free_callback);
+	__xfs_inode_free(ip);
 }
 
 /*
@@ -929,6 +936,7 @@ xfs_reclaim_inode(
 	int			sync_mode)
 {
 	struct xfs_buf		*bp = NULL;
+	xfs_ino_t		ino = ip->i_ino; /* for radix_tree_delete */
 	int			error;
 
 restart:
@@ -993,6 +1001,22 @@ restart:
 
 	xfs_iflock(ip);
 reclaim:
+	/*
+	 * Because we use RCU freeing we need to ensure the inode always appears
+	 * to be reclaimed with an invalid inode number when in the free state.
+	 * We do this as early as possible under the ILOCK and flush lock so
+	 * that xfs_iflush_cluster() can be guaranteed to detect races with us
+	 * here. By doing this, we guarantee that once xfs_iflush_cluster has
+	 * locked both the XFS_ILOCK and the flush lock that it will see either
+	 * a valid, flushable inode that will serialise correctly against the
+	 * locks below, or it will see a clean (and invalid) inode that it can
+	 * skip.
+	 */
+	spin_lock(&ip->i_flags_lock);
+	ip->i_flags = XFS_IRECLAIM;
+	ip->i_ino = 0;
+	spin_unlock(&ip->i_flags_lock);
+
 	xfs_ifunlock(ip);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 
@@ -1006,7 +1030,7 @@ reclaim:
 	 */
 	spin_lock(&pag->pag_ici_lock);
 	if (!radix_tree_delete(&pag->pag_ici_root,
-				XFS_INO_TO_AGINO(ip->i_mount, ip->i_ino)))
+				XFS_INO_TO_AGINO(ip->i_mount, ino)))
 		ASSERT(0);
 	__xfs_inode_clear_reclaim(pag, ip);
 	spin_unlock(&pag->pag_ici_lock);
@@ -1023,7 +1047,7 @@ reclaim:
 	xfs_qm_dqdetach(ip);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 
-	xfs_inode_free(ip);
+	__xfs_inode_free(ip);
 	return error;
 
 out_ifunlock:
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index b984be4..5b84bbc 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3222,7 +3222,6 @@ xfs_iflush_cluster(
 		 * Try to get locks.  If any are unavailable or it is pinned,
 		 * then this inode cannot be flushed and is skipped.
 		 */
-
 		if (!xfs_ilock_nowait(cip, XFS_ILOCK_SHARED))
 			continue;
 		if (!xfs_iflock_nowait(cip)) {
@@ -3235,6 +3234,19 @@ xfs_iflush_cluster(
 			continue;
 		}
 
+
+		/*
+		 * Check the inode number again, just to be certain we are not
+		 * racing with freeing in xfs_reclaim_inode(). See the comments
+		 * in that function for more information as to why the initial
+		 * check is not sufficient.
+		 */
+		if (!cip->i_ino) {
+			xfs_ifunlock(cip);
+			xfs_iunlock(cip, XFS_ILOCK_SHARED);
+			continue;
+		}
+
 		/*
 		 * arriving here means that this inode can be flushed.  First
 		 * re-check that it's dirty before flushing.
-- 
2.7.0


* Re: [PATCH 1/6] xfs: fix inode validity check in xfs_iflush_cluster
  2016-04-07 23:37 ` [PATCH 1/6] xfs: fix inode validity check in xfs_iflush_cluster Dave Chinner
@ 2016-04-07 23:43   ` Christoph Hellwig
  0 siblings, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2016-04-07 23:43 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Fri, Apr 08, 2016 at 09:37:46AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Some careless idiot(*) wrote crap code in commit 1a3e8f3 ("xfs:
> convert inode cache lookups to use RCU locking") back in late 2010,
> and so xfs_iflush_cluster checks the wrong inode for whether it is
> still valid under RCU protection. Fix it to lock and check the
> correct inode.

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 2/6] xfs: rename variables in xfs_iflush_cluster for clarity
  2016-04-07 23:37 ` [PATCH 2/6] xfs: rename variables in xfs_iflush_cluster for clarity Dave Chinner
@ 2016-04-07 23:44   ` Christoph Hellwig
  0 siblings, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2016-04-07 23:44 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 6/6] xfs: mark reclaimed inodes invalid earlier
  2016-04-07 23:37 ` [PATCH 6/6] xfs: mark reclaimed inodes invalid earlier Dave Chinner
@ 2016-04-07 23:46   ` Christoph Hellwig
  0 siblings, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2016-04-07 23:46 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

> +	spin_lock(&ip->i_flags_lock);
> +	ip->i_flags = XFS_IRECLAIM;
> +	ip->i_ino = 0;
> +	spin_unlock(&ip->i_flags_lock);

It's just a tiny snippet, but still duplicated from xfs_inode_free().
Maybe add another helper for this?
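
Something along these lines, perhaps (helper name is purely illustrative,
not part of the posted series):

        /*
         * Hypothetical helper: make the inode look freed to RCU-protected
         * lookups. Callers hold whatever locks make the transition safe.
         */
        static void
        xfs_inode_mark_freed(
                struct xfs_inode        *ip)
        {
                spin_lock(&ip->i_flags_lock);
                ip->i_flags = XFS_IRECLAIM;
                ip->i_ino = 0;
                spin_unlock(&ip->i_flags_lock);
        }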

Otherwise looks fine:

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 0/6 v2] xfs: xfs_iflush_cluster vs xfs_reclaim_inode
  2016-04-07 23:37 [PATCH 0/6 v2] xfs: xfs_iflush_cluster vs xfs_reclaim_inode Dave Chinner
                   ` (5 preceding siblings ...)
  2016-04-07 23:37 ` [PATCH 6/6] xfs: mark reclaimed inodes invalid earlier Dave Chinner
@ 2016-04-08  3:28 ` Eryu Guan
  2016-04-08 11:37   ` Brian Foster
  2016-04-08 17:18 ` Brian Foster
  7 siblings, 1 reply; 18+ messages in thread
From: Eryu Guan @ 2016-04-08  3:28 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Fri, Apr 08, 2016 at 09:37:45AM +1000, Dave Chinner wrote:
> Hi folks,
> 
> This is the second version of this patch set, first posted and
> described here:
> 
> http://oss.sgi.com/archives/xfs/2016-04/msg00069.html

Just a quick note here, I'm testing the v1 patchset right now, v4.6-rc2
kernel + v1 patch, config file is based on rhel7 debug kernel config.

The test is the same as the original reproducer (long term fsstress run
on XFS, exported from NFS). The test on x86_64 host has been running for
two days and everything looks fine. Test on ppc64 host has been running
for a few hours and I noticed a lock issue and a few warnings, not sure
if it's related to the patches or even to XFS yet (I need to run the test on
a stock -rc2 kernel to be sure), but I'm posting the logs here for reference

[ 1911.626286] ======================================================
[ 1911.626291] [ INFO: possible circular locking dependency detected ]
[ 1911.626297] 4.6.0-rc2.debug+ #1 Not tainted
[ 1911.626301] -------------------------------------------------------
[ 1911.626306] nfsd/7402 is trying to acquire lock:
[ 1911.626311]  (&s->s_sync_lock){+.+.+.}, at: [<c0000000003585f0>] .sync_inodes_sb+0xe0/0x230
[ 1911.626327]
[ 1911.626327] but task is already holding lock:
[ 1911.626333]  (sb_internal){.+.+.+}, at: [<c00000000031a780>] .__sb_start_write+0x90/0x130
[ 1911.626346]
[ 1911.626346] which lock already depends on the new lock.
[ 1911.626346]
[ 1911.626353]
[ 1911.626353] the existing dependency chain (in reverse order) is:
[ 1911.626358]
-> #1 (sb_internal){.+.+.+}:
[ 1911.626368]        [<c00000000013947c>] .__lock_acquire+0x3fc/0xee0
[ 1911.626377]        [<c00000000013a998>] .lock_acquire+0xf8/0x360
[ 1911.626384]        [<c00000000013116c>] .percpu_down_read+0x5c/0x110
[ 1911.626392]        [<c00000000031a7f0>] .__sb_start_write+0x100/0x130
[ 1911.626399]        [<d0000000066e706c>] .xfs_trans_alloc+0x2c/0x70 [xfs]
[ 1911.626461]        [<d0000000066d9bfc>] .xfs_inactive_truncate+0x2c/0x1a0 [xfs]
[ 1911.626521]        [<d0000000066da618>] .xfs_inactive+0x218/0x260 [xfs]
[ 1911.626581]        [<d0000000066e25dc>] .xfs_fs_evict_inode+0x14c/0x370 [xfs]
[ 1911.626640]        [<c00000000033c84c>] .evict+0xdc/0x1f0
[ 1911.626647]        [<c000000000358698>] .sync_inodes_sb+0x188/0x230
[ 1911.626654]        [<c000000000362fcc>] .sync_inodes_one_sb+0x1c/0x30
[ 1911.626661]        [<c00000000031c09c>] .iterate_supers+0x17c/0x190
[ 1911.626668]        [<c000000000363498>] .sys_sync+0x48/0xd0
[ 1911.626675]        [<c000000000009210>] system_call+0x38/0xec
[ 1911.626682]
-> #0 (&s->s_sync_lock){+.+.+.}:
[ 1911.626692]        [<c000000000137b88>] .validate_chain.isra.35+0x918/0xc70
[ 1911.626699]        [<c00000000013947c>] .__lock_acquire+0x3fc/0xee0
[ 1911.626706]        [<c00000000013a998>] .lock_acquire+0xf8/0x360
[ 1911.626713]        [<c000000000922830>] .mutex_lock_nested+0xb0/0x5e0
[ 1911.626721]        [<c0000000003585f0>] .sync_inodes_sb+0xe0/0x230
[ 1911.626728]        [<d0000000066e48f8>] .xfs_flush_inodes+0x38/0x60 [xfs]
[ 1911.626787]        [<d0000000066d85c8>] .xfs_create+0x898/0xb30 [xfs]
[ 1911.626847]        [<d0000000066d0ac8>] .xfs_generic_create+0x208/0x3a0 [xfs]
[ 1911.626907]        [<c0000000003293fc>] .vfs_create+0x15c/0x1f0
[ 1911.626913]        [<d00000000a60d7e4>] .do_nfsd_create+0x444/0x580 [nfsd]
[ 1911.626942]        [<d00000000a61e0b8>] .nfsd4_open+0x388/0xa00 [nfsd]
[ 1911.626972]        [<d00000000a61eb5c>] .nfsd4_proc_compound+0x42c/0x720 [nfsd]
[ 1911.626998]        [<d00000000a603178>] .nfsd_dispatch+0x118/0x2d0 [nfsd]
[ 1911.627024]        [<d000000006dfde90>] .svc_process_common+0x550/0x8b0 [sunrpc]
[ 1911.627056]        [<d000000006dfe33c>] .svc_process+0x14c/0x610 [sunrpc]
[ 1911.627088]        [<d00000000a602268>] .nfsd+0x238/0x380 [nfsd]
[ 1911.627113]        [<c0000000000e9ac4>] .kthread+0x114/0x140
[ 1911.627120]        [<c000000000009578>] .ret_from_kernel_thread+0x58/0x60
[ 1911.627127]
[ 1911.627127] other info that might help us debug this:
[ 1911.627127]
[ 1911.627134]  Possible unsafe locking scenario:
[ 1911.627134]
[ 1911.627139]        CPU0                    CPU1
[ 1911.627143]        ----                    ----
[ 1911.627147]   lock(sb_internal);
[ 1911.627153]                                lock(&s->s_sync_lock);
[ 1911.627160]                                lock(sb_internal);
[ 1911.627166]   lock(&s->s_sync_lock);
[ 1911.627172]
[ 1911.627172]  *** DEADLOCK ***
[ 1911.627172]
[ 1911.627179] 4 locks held by nfsd/7402:
[ 1911.627183]  #0:  (sb_writers#8){.+.+.+}, at: [<c00000000031a7f0>] .__sb_start_write+0x100/0x130
[ 1911.627198]  #1:  (&type->i_mutex_dir_key#3/1){+.+.+.}, at: [<d00000000a60d5c4>] .do_nfsd_create+0x224/0x580 [nfsd]
[ 1911.627234]  #2:  (sb_internal){.+.+.+}, at: [<c00000000031a780>] .__sb_start_write+0x90/0x130
[ 1911.627248]  #3:  (&type->s_umount_key#44){++++++}, at: [<d0000000066e48e4>] .xfs_flush_inodes+0x24/0x60 [xfs]
[ 1911.627316]
[ 1911.627316] stack backtrace:
[ 1911.627323] CPU: 1 PID: 7402 Comm: nfsd Not tainted 4.6.0-rc2.debug+ #1
[ 1911.627328] Call Trace:
[ 1911.627333] [c0000005f2c36d10] [c000000000938df8] .dump_stack+0xe0/0x14c (unreliable)
[ 1911.627342] [c0000005f2c36da0] [c00000000092fd28] .print_circular_bug+0x354/0x388
[ 1911.627350] [c0000005f2c36e50] [c000000000137260] .check_prev_add+0x8d0/0x8e0
[ 1911.627357] [c0000005f2c36f60] [c000000000137b88] .validate_chain.isra.35+0x918/0xc70
[ 1911.627365] [c0000005f2c37050] [c00000000013947c] .__lock_acquire+0x3fc/0xee0
[ 1911.627372] [c0000005f2c37170] [c00000000013a998] .lock_acquire+0xf8/0x360
[ 1911.627379] [c0000005f2c37250] [c000000000922830] .mutex_lock_nested+0xb0/0x5e0
[ 1911.627386] [c0000005f2c37360] [c0000000003585f0] .sync_inodes_sb+0xe0/0x230
[ 1911.627445] [c0000005f2c37460] [d0000000066e48f8] .xfs_flush_inodes+0x38/0x60 [xfs]
[ 1911.627505] [c0000005f2c374e0] [d0000000066d85c8] .xfs_create+0x898/0xb30 [xfs]
[ 1911.627565] [c0000005f2c37600] [d0000000066d0ac8] .xfs_generic_create+0x208/0x3a0 [xfs]
[ 1911.627573] [c0000005f2c376d0] [c0000000003293fc] .vfs_create+0x15c/0x1f0
[ 1911.627599] [c0000005f2c37770] [d00000000a60d7e4] .do_nfsd_create+0x444/0x580 [nfsd]
[ 1911.627625] [c0000005f2c37850] [d00000000a61e0b8] .nfsd4_open+0x388/0xa00 [nfsd]
[ 1911.627651] [c0000005f2c37910] [d00000000a61eb5c] .nfsd4_proc_compound+0x42c/0x720 [nfsd]
[ 1911.627677] [c0000005f2c379f0] [d00000000a603178] .nfsd_dispatch+0x118/0x2d0 [nfsd]
[ 1911.627709] [c0000005f2c37a90] [d000000006dfde90] .svc_process_common+0x550/0x8b0 [sunrpc]
[ 1911.627741] [c0000005f2c37b70] [d000000006dfe33c] .svc_process+0x14c/0x610 [sunrpc]
[ 1911.627768] [c0000005f2c37c10] [d00000000a602268] .nfsd+0x238/0x380 [nfsd]
[ 1911.627775] [c0000005f2c37cb0] [c0000000000e9ac4] .kthread+0x114/0x140
[ 1911.627782] [c0000005f2c37e30] [c000000000009578] .ret_from_kernel_thread+0x58/0x60
[ 2046.852739] kworker/dying (399) used greatest stack depth: 4352 bytes left
[ 2854.687381] XFS: Assertion failed: buffer_mapped(bh), file: fs/xfs/xfs_aops.c, line: 780
[ 2854.687434] ------------[ cut here ]------------
[ 2854.687488] WARNING: CPU: 5 PID: 28924 at fs/xfs/xfs_message.c:105 .asswarn+0x2c/0x40 [xfs]
[ 2854.687493] Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache dm_mod loop sg pseries_rng nfsd auth_rpcgss nfs_acl lockd sunrpc grace ip_tables xfs libcrc32c sd_mod ibmvscsi ibmveth scsi_transport_srp
[ 2854.687530] CPU: 5 PID: 28924 Comm: kworker/u32:4 Not tainted 4.6.0-rc2.debug+ #1
[ 2854.687538] Workqueue: writeback .wb_workfn (flush-253:0)
[ 2854.687545] task: c0000001e6d28380 ti: c0000000fe3ac000 task.ti: c0000000fe3ac000
[ 2854.687549] NIP: d0000000066ddafc LR: d0000000066ddafc CTR: c0000000004dd880
[ 2854.687554] REGS: c0000000fe3aeeb0 TRAP: 0700   Not tainted  (4.6.0-rc2.debug+)
[ 2854.687558] MSR: 8000000100029032 <SF,EE,ME,IR,DR,RI,TM[E]>  CR: 28002048  XER: 0000000d
[ 2854.687577] CFAR: d0000000066dd890 SOFTE: 1
GPR00: d0000000066ddafc c0000000fe3af130 d000000006765850 ffffffffffffffea
GPR04: 000000000000000a c0000000fe3aef50 00000000000000d1 ffffffffffffffc0
GPR08: 0000000000000000 0000000000000021 00000000ffffffd1 d000000006741dc0
GPR12: c0000000004dd880 c00000000e822d00 c0000000fe3af4e0 0000000000000001
GPR16: c0000000fe3af6f0 0000000000000002 0000000000000000 0000000000000007
GPR20: 0000000000000003 c0000000fe3af4f0 0000000000000000 c0000000fe3af210
GPR24: 0000000000000004 0000000000001000 0000000000160000 f000000000f0e100
GPR28: c0000004055d18c8 c000000016cde430 0000000000157000 c0000004055d15f0
[ 2854.687694] NIP [d0000000066ddafc] .asswarn+0x2c/0x40 [xfs]
[ 2854.687742] LR [d0000000066ddafc] .asswarn+0x2c/0x40 [xfs]
[ 2854.687746] Call Trace:
[ 2854.687793] [c0000000fe3af130] [d0000000066ddafc] .asswarn+0x2c/0x40 [xfs] (unreliable)
[ 2854.687846] [c0000000fe3af1a0] [d0000000066a9104] .xfs_do_writepage+0x414/0x930 [xfs]
[ 2854.687854] [c0000000fe3af2b0] [c00000000025df6c] .write_cache_pages+0x5fc/0x820
[ 2854.687905] [c0000000fe3af470] [d0000000066a8a5c] .xfs_vm_writepages+0x8c/0xd0 [xfs]
[ 2854.687912] [c0000000fe3af540] [c00000000025f62c] .do_writepages+0x3c/0x70
[ 2854.687918] [c0000000fe3af5b0] [c00000000035b1ec] .__writeback_single_inode+0x5bc/0xd50
[ 2854.687924] [c0000000fe3af680] [c00000000035c3c0] .writeback_sb_inodes+0x380/0x730
[ 2854.687930] [c0000000fe3af7f0] [c00000000035ca44] .wb_writeback+0x194/0x920
[ 2854.687935] [c0000000fe3af960] [c00000000035ddcc] .wb_workfn+0x19c/0xa40
[ 2854.687943] [c0000000fe3afad0] [c0000000000dfc74] .process_one_work+0x264/0x8f0
[ 2854.687949] [c0000000fe3afbc0] [c0000000000e0388] .worker_thread+0x88/0x520
[ 2854.687955] [c0000000fe3afcb0] [c0000000000e9ac4] .kthread+0x114/0x140
[ 2854.687961] [c0000000fe3afe30] [c000000000009578] .ret_from_kernel_thread+0x58/0x60
[ 2854.687966] Instruction dump:
[ 2854.687970] 60420000 7c0802a6 3d420000 7c691b78 7c862378 e88abe38 7ca72b78 38600000
[ 2854.687983] 7d254b78 f8010010 f821ff91 4bfffcf9 <0fe00000> 38210070 e8010010 7c0803a6
[ 2854.687997] ---[ end trace 872ac2709186f780 ]---
[ 2854.688001] XFS: Assertion failed: buffer_mapped(bh), file: fs/xfs/xfs_aops.c, line: 780
[ 2854.688022] ------------[ cut here ]------------
[ 2854.688072] WARNING: CPU: 5 PID: 28924 at fs/xfs/xfs_message.c:105 .asswarn+0x2c/0x40 [xfs]
[ 2854.688076] Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache dm_mod loop sg pseries_rng nfsd auth_rpcgss nfs_acl lockd sunrpc grace ip_tables xfs libcrc32c sd_mod ibmvscsi ibmveth scsi_transport_srp
[ 2854.688110] CPU: 5 PID: 28924 Comm: kworker/u32:4 Tainted: G        W       4.6.0-rc2.debug+ #1
[ 2854.688116] Workqueue: writeback .wb_workfn (flush-253:0)
[ 2854.688121] task: c0000001e6d28380 ti: c0000000fe3ac000 task.ti: c0000000fe3ac000
[ 2854.688126] NIP: d0000000066ddafc LR: d0000000066ddafc CTR: c0000000004dd880
[ 2854.688131] REGS: c0000000fe3aeeb0 TRAP: 0700   Tainted: G        W        (4.6.0-rc2.debug+)
[ 2854.688135] MSR: 8000000100029032 <SF,EE,ME,IR,DR,RI,TM[E]>  CR: 48002048  XER: 0000000d
[ 2854.688153] CFAR: d0000000066dd890 SOFTE: 1
GPR00: d0000000066ddafc c0000000fe3af130 d000000006765850 ffffffffffffffea
GPR04: 000000000000000a c0000000fe3aef50 00000000000000d1 ffffffffffffffc0
GPR08: 0000000000000000 0000000000000021 00000000ffffffd1 d000000006741dc0
GPR12: c0000000004dd880 c00000000e822d00 c0000000fe3af4e0 0000000000000001
GPR16: c0000000fe3af6f0 0000000000000002 0000000000000000 0000000000000007
GPR20: 0000000000000003 c0000000fe3af4f0 0000000000000000 c0000000fe3af210
GPR24: 0000000000000004 0000000000001000 0000000000160000 f000000000f0e100
GPR28: c0000004055d18c8 c000000016cde430 0000000000158000 c0000004055d1588
[ 2854.688268] NIP [d0000000066ddafc] .asswarn+0x2c/0x40 [xfs]
[ 2854.688318] LR [d0000000066ddafc] .asswarn+0x2c/0x40 [xfs]
[ 2854.688321] Call Trace:
[ 2854.688369] [c0000000fe3af130] [d0000000066ddafc] .asswarn+0x2c/0x40 [xfs] (unreliable)
[ 2854.688423] [c0000000fe3af1a0] [d0000000066a9104] .xfs_do_writepage+0x414/0x930 [xfs]
[ 2854.688430] [c0000000fe3af2b0] [c00000000025df6c] .write_cache_pages+0x5fc/0x820
[ 2854.688481] [c0000000fe3af470] [d0000000066a8a5c] .xfs_vm_writepages+0x8c/0xd0 [xfs]
[ 2854.688487] [c0000000fe3af540] [c00000000025f62c] .do_writepages+0x3c/0x70
[ 2854.688493] [c0000000fe3af5b0] [c00000000035b1ec] .__writeback_single_inode+0x5bc/0xd50
[ 2854.688499] [c0000000fe3af680] [c00000000035c3c0] .writeback_sb_inodes+0x380/0x730
[ 2854.688505] [c0000000fe3af7f0] [c00000000035ca44] .wb_writeback+0x194/0x920
[ 2854.688510] [c0000000fe3af960] [c00000000035ddcc] .wb_workfn+0x19c/0xa40
[ 2854.688516] [c0000000fe3afad0] [c0000000000dfc74] .process_one_work+0x264/0x8f0
[ 2854.688522] [c0000000fe3afbc0] [c0000000000e0388] .worker_thread+0x88/0x520
[ 2854.688528] [c0000000fe3afcb0] [c0000000000e9ac4] .kthread+0x114/0x140
[ 2854.688534] [c0000000fe3afe30] [c000000000009578] .ret_from_kernel_thread+0x58/0x60
[ 2854.688539] Instruction dump:
[ 2854.688543] 60420000 7c0802a6 3d420000 7c691b78 7c862378 e88abe38 7ca72b78 38600000
[ 2854.688555] 7d254b78 f8010010 f821ff91 4bfffcf9 <0fe00000> 38210070 e8010010 7c0803a6
[ 2854.688568] ---[ end trace 872ac2709186f781 ]---

[then the XFS warning repeated a few times, triggered by different
pids]

Thanks,
Eryu


* Re: [PATCH 0/6 v2] xfs: xfs_iflush_cluster vs xfs_reclaim_inode
  2016-04-08  3:28 ` [PATCH 0/6 v2] xfs: xfs_iflush_cluster vs xfs_reclaim_inode Eryu Guan
@ 2016-04-08 11:37   ` Brian Foster
  2016-04-10  9:22     ` Eryu Guan
  0 siblings, 1 reply; 18+ messages in thread
From: Brian Foster @ 2016-04-08 11:37 UTC (permalink / raw)
  To: Eryu Guan; +Cc: xfs

On Fri, Apr 08, 2016 at 11:28:41AM +0800, Eryu Guan wrote:
> On Fri, Apr 08, 2016 at 09:37:45AM +1000, Dave Chinner wrote:
> > Hi folks,
> > 
> > This is the second version of this patch set, first posted and
> > described here:
> > 
> > http://oss.sgi.com/archives/xfs/2016-04/msg00069.html
> 
> Just a quick note here, I'm testing the v1 patchset right now, v4.6-rc2
> kernel + v1 patch, config file is based on rhel7 debug kernel config.
> 
> The test is the same as the original reproducer (long term fsstress run
> on XFS, exported from NFS). The test on x86_64 host has been running for
> two days and everything looks fine. Test on ppc64 host has been running
> for a few hours and I noticed a lock issue and a few warnings, not sure
> if it's related to the patches or even to XFS yet (I need to run the test on
> a stock -rc2 kernel to be sure), but I'm posting the logs here for reference
> 

Had the original problem ever been reproduced on an upstream kernel?

FWIW, my rhel kernel based test is still running well approaching ~48
hours. I've seen some lockdep messages (bad unlock balance), but IIRC
I've been seeing those from the start so I haven't been paying much
attention to it while digging into the core problem.

> [ 1911.626286] ======================================================
> [ 1911.626291] [ INFO: possible circular locking dependency detected ]
> [ 1911.626297] 4.6.0-rc2.debug+ #1 Not tainted
> [ 1911.626301] -------------------------------------------------------
> [ 1911.626306] nfsd/7402 is trying to acquire lock:
> [ 1911.626311]  (&s->s_sync_lock){+.+.+.}, at: [<c0000000003585f0>] .sync_inodes_sb+0xe0/0x230
> [ 1911.626327]
> [ 1911.626327] but task is already holding lock:
> [ 1911.626333]  (sb_internal){.+.+.+}, at: [<c00000000031a780>] .__sb_start_write+0x90/0x130
> [ 1911.626346]
> [ 1911.626346] which lock already depends on the new lock.
> [ 1911.626346]
> [ 1911.626353]
> [ 1911.626353] the existing dependency chain (in reverse order) is:
> [ 1911.626358]
...
> [ 1911.627134]  Possible unsafe locking scenario:
> [ 1911.627134]
> [ 1911.627139]        CPU0                    CPU1
> [ 1911.627143]        ----                    ----
> [ 1911.627147]   lock(sb_internal);
> [ 1911.627153]                                lock(&s->s_sync_lock);
> [ 1911.627160]                                lock(sb_internal);
> [ 1911.627166]   lock(&s->s_sync_lock);
> [ 1911.627172]
> [ 1911.627172]  *** DEADLOCK ***
> [ 1911.627172]
...

We actually have a report of this one on the list:

http://oss.sgi.com/archives/xfs/2016-04/msg00001.html

... so I don't think it's related to this series. I believe I've seen
this once or twice when testing something completely unrelated, as well.

> [ 2046.852739] kworker/dying (399) used greatest stack depth: 4352 bytes left
> [ 2854.687381] XFS: Assertion failed: buffer_mapped(bh), file: fs/xfs/xfs_aops.c, line: 780
> [ 2854.687434] ------------[ cut here ]------------
> [ 2854.687488] WARNING: CPU: 5 PID: 28924 at fs/xfs/xfs_message.c:105 .asswarn+0x2c/0x40 [xfs]
...
> [ 2854.687997] ---[ end trace 872ac2709186f780 ]---

These asserts look new to me, however. It would be interesting to see if
these reproduce independently.

Brian

> [ 2854.688001] XFS: Assertion failed: buffer_mapped(bh), file: fs/xfs/xfs_aops.c, line: 780
> [ 2854.688022] ------------[ cut here ]------------
> [ 2854.688072] WARNING: CPU: 5 PID: 28924 at fs/xfs/xfs_message.c:105 .asswarn+0x2c/0x40 [xfs]
> [ 2854.688076] Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache dm_mod loop sg pseries_rng nfsd auth_rpcgss nfs_acl lockd sunrpc grace ip_tables xfs libcrc32c sd_mod ibmvscsi ibmveth scsi_transport_srp
> [ 2854.688110] CPU: 5 PID: 28924 Comm: kworker/u32:4 Tainted: G        W       4.6.0-rc2.debug+ #1
> [ 2854.688116] Workqueue: writeback .wb_workfn (flush-253:0)
> [ 2854.688121] task: c0000001e6d28380 ti: c0000000fe3ac000 task.ti: c0000000fe3ac000
> [ 2854.688126] NIP: d0000000066ddafc LR: d0000000066ddafc CTR: c0000000004dd880
> [ 2854.688131] REGS: c0000000fe3aeeb0 TRAP: 0700   Tainted: G        W        (4.6.0-rc2.debug+)
> [ 2854.688135] MSR: 8000000100029032 <SF,EE,ME,IR,DR,RI,TM[E]>  CR: 48002048  XER: 0000000d
> [ 2854.688153] CFAR: d0000000066dd890 SOFTE: 1
> GPR00: d0000000066ddafc c0000000fe3af130 d000000006765850 ffffffffffffffea
> GPR04: 000000000000000a c0000000fe3aef50 00000000000000d1 ffffffffffffffc0
> GPR08: 0000000000000000 0000000000000021 00000000ffffffd1 d000000006741dc0
> GPR12: c0000000004dd880 c00000000e822d00 c0000000fe3af4e0 0000000000000001
> GPR16: c0000000fe3af6f0 0000000000000002 0000000000000000 0000000000000007
> GPR20: 0000000000000003 c0000000fe3af4f0 0000000000000000 c0000000fe3af210
> GPR24: 0000000000000004 0000000000001000 0000000000160000 f000000000f0e100
> GPR28: c0000004055d18c8 c000000016cde430 0000000000158000 c0000004055d1588
> [ 2854.688268] NIP [d0000000066ddafc] .asswarn+0x2c/0x40 [xfs]
> [ 2854.688318] LR [d0000000066ddafc] .asswarn+0x2c/0x40 [xfs]
> [ 2854.688321] Call Trace:
> [ 2854.688369] [c0000000fe3af130] [d0000000066ddafc] .asswarn+0x2c/0x40 [xfs] (unreliable)
> [ 2854.688423] [c0000000fe3af1a0] [d0000000066a9104] .xfs_do_writepage+0x414/0x930 [xfs]
> [ 2854.688430] [c0000000fe3af2b0] [c00000000025df6c] .write_cache_pages+0x5fc/0x820
> [ 2854.688481] [c0000000fe3af470] [d0000000066a8a5c] .xfs_vm_writepages+0x8c/0xd0 [xfs]
> [ 2854.688487] [c0000000fe3af540] [c00000000025f62c] .do_writepages+0x3c/0x70
> [ 2854.688493] [c0000000fe3af5b0] [c00000000035b1ec] .__writeback_single_inode+0x5bc/0xd50
> [ 2854.688499] [c0000000fe3af680] [c00000000035c3c0] .writeback_sb_inodes+0x380/0x730
> [ 2854.688505] [c0000000fe3af7f0] [c00000000035ca44] .wb_writeback+0x194/0x920
> [ 2854.688510] [c0000000fe3af960] [c00000000035ddcc] .wb_workfn+0x19c/0xa40
> [ 2854.688516] [c0000000fe3afad0] [c0000000000dfc74] .process_one_work+0x264/0x8f0
> [ 2854.688522] [c0000000fe3afbc0] [c0000000000e0388] .worker_thread+0x88/0x520
> [ 2854.688528] [c0000000fe3afcb0] [c0000000000e9ac4] .kthread+0x114/0x140
> [ 2854.688534] [c0000000fe3afe30] [c000000000009578] .ret_from_kernel_thread+0x58/0x60
> [ 2854.688539] Instruction dump:
> [ 2854.688543] 60420000 7c0802a6 3d420000 7c691b78 7c862378 e88abe38 7ca72b78 38600000
> [ 2854.688555] 7d254b78 f8010010 f821ff91 4bfffcf9 <0fe00000> 38210070 e8010010 7c0803a6
> [ 2854.688568] ---[ end trace 872ac2709186f781 ]---
> 
> [then the XFS warning repeated a few times, triggered by different
> pids]
> 
> Thanks,
> Eryu
> 

* Re: [PATCH 0/6 v2] xfs: xfs_iflush_cluster vs xfs_reclaim_inode
  2016-04-07 23:37 [PATCH 0/6 v2] xfs: xfs_iflush_cluster vs xfs_reclaim_inode Dave Chinner
                   ` (6 preceding siblings ...)
  2016-04-08  3:28 ` [PATCH 0/6 v2] xfs: xfs_iflush_cluster vs xfs_reclaim_inode Eryu Guan
@ 2016-04-08 17:18 ` Brian Foster
  2016-04-08 22:17   ` Dave Chinner
  7 siblings, 1 reply; 18+ messages in thread
From: Brian Foster @ 2016-04-08 17:18 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Fri, Apr 08, 2016 at 09:37:45AM +1000, Dave Chinner wrote:
> Hi folks,
> 
> This is the second version of this patch set, first posted and
> described here:
> 
> http://oss.sgi.com/archives/xfs/2016-04/msg00069.html
> 
> The only change from the first version is splitting up the first
> patch into two as Christoph requested - one for the bug fix, the
> other for the variable renaming.
> 

Did your xfstests testing for this series include generic/233? I'm
seeing a consistently reproducible test hang. The test is hanging on a
"xfs_quota -x -c off -ug /mnt/scratch" command. The stack is as follows:

[<ffffffffa0772306>] xfs_qm_dquot_walk.isra.8+0x196/0x1b0 [xfs]
[<ffffffffa0774a98>] xfs_qm_dqpurge_all+0x78/0x80 [xfs]
[<ffffffffa07713e8>] xfs_qm_scall_quotaoff+0x148/0x640 [xfs]
[<ffffffffa077733d>] xfs_quota_disable+0x3d/0x50 [xfs]
[<ffffffff812c27e3>] SyS_quotactl+0x3b3/0x8c0
[<ffffffff81003e17>] do_syscall_64+0x67/0x190
[<ffffffff81763f7f>] return_from_SYSCALL_64+0x0/0x7a
[<ffffffffffffffff>] 0xffffffffffffffff

... and it looks like the kernel is spinning somehow or another between
inode reclaim and xfsaild:

    ...
     kworker/1:2-210   [001] ...1   895.750591: xfs_perag_get_tag: dev 253:3 agno 1 refcount 1 caller xfs_reclaim_inodes_ag [xfs]
     kworker/1:2-210   [001] ...1   895.750609: xfs_perag_put: dev 253:3 agno 1 refcount 0 caller xfs_reclaim_inodes_ag [xfs]
     kworker/1:2-210   [001] ...1   895.750609: xfs_perag_get_tag: dev 253:3 agno 2 refcount 5 caller xfs_reclaim_inodes_ag [xfs]
     kworker/1:2-210   [001] ...1   895.750611: xfs_perag_put: dev 253:3 agno 2 refcount 4 caller xfs_reclaim_inodes_ag [xfs]
     kworker/1:2-210   [001] ...1   895.750612: xfs_perag_get_tag: dev 253:3 agno 3 refcount 1 caller xfs_reclaim_inodes_ag [xfs]
     kworker/1:2-210   [001] ...1   895.750613: xfs_perag_put: dev 253:3 agno 3 refcount 0 caller xfs_reclaim_inodes_ag [xfs]
    xfsaild/dm-3-12406 [003] ...2   895.760588: xfs_ail_locked: dev 253:3 lip 0xffff8801f8e65d80 lsn 2/5709 type XFS_LI_QUOTAOFF flags IN_AIL
    xfsaild/dm-3-12406 [003] ...2   895.810595: xfs_ail_locked: dev 253:3 lip 0xffff8801f8e65d80 lsn 2/5709 type XFS_LI_QUOTAOFF flags IN_AIL
    xfsaild/dm-3-12406 [003] ...2   895.860586: xfs_ail_locked: dev 253:3 lip 0xffff8801f8e65d80 lsn 2/5709 type XFS_LI_QUOTAOFF flags IN_AIL
    xfsaild/dm-3-12406 [003] ...2   895.910596: xfs_ail_locked: dev 253:3 lip 0xffff8801f8e65d80 lsn 2/5709 type XFS_LI_QUOTAOFF flags IN_AIL
    ...

FWIW, this only occurs with patch 6 applied. The test and scratch
devices are both 10GB lvm volumes formatted with mkfs defaults (v5).

Brian

> Cheers,
> 
> Dave.
> 

* Re: [PATCH 0/6 v2] xfs: xfs_iflush_cluster vs xfs_reclaim_inode
  2016-04-08 17:18 ` Brian Foster
@ 2016-04-08 22:17   ` Dave Chinner
  2016-04-11 13:37     ` Brian Foster
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2016-04-08 22:17 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Fri, Apr 08, 2016 at 01:18:44PM -0400, Brian Foster wrote:
> On Fri, Apr 08, 2016 at 09:37:45AM +1000, Dave Chinner wrote:
> > Hi folks,
> > 
> > This is the second version of this patch set, first posted and
> > described here:
> > 
> > http://oss.sgi.com/archives/xfs/2016-04/msg00069.html
> > 
> > The only change from the first version is splitting up the first
> > patch into two as Christoph requested - one for the bug fix, the
> > other for the variable renaming.
> > 
> 
> Did your xfstests testing for this series include generic/233? I'm
> seeing a consistently reproducible test hang. The test is hanging on a
> "xfs_quota -x -c off -ug /mnt/scratch" command. The stack is as follows:
> 
> [<ffffffffa0772306>] xfs_qm_dquot_walk.isra.8+0x196/0x1b0 [xfs]
> [<ffffffffa0774a98>] xfs_qm_dqpurge_all+0x78/0x80 [xfs]
> [<ffffffffa07713e8>] xfs_qm_scall_quotaoff+0x148/0x640 [xfs]
> [<ffffffffa077733d>] xfs_quota_disable+0x3d/0x50 [xfs]
> [<ffffffff812c27e3>] SyS_quotactl+0x3b3/0x8c0
> [<ffffffff81003e17>] do_syscall_64+0x67/0x190
> [<ffffffff81763f7f>] return_from_SYSCALL_64+0x0/0x7a
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> ... and it looks like the kernel is spinning somehow or another between
> inode reclaim and xfsaild:
> 
>     ...
>      kworker/1:2-210   [001] ...1   895.750591: xfs_perag_get_tag: dev 253:3 agno 1 refcount 1 caller xfs_reclaim_inodes_ag [xfs]
>      kworker/1:2-210   [001] ...1   895.750609: xfs_perag_put: dev 253:3 agno 1 refcount 0 caller xfs_reclaim_inodes_ag [xfs]
>      kworker/1:2-210   [001] ...1   895.750609: xfs_perag_get_tag: dev 253:3 agno 2 refcount 5 caller xfs_reclaim_inodes_ag [xfs]
>      kworker/1:2-210   [001] ...1   895.750611: xfs_perag_put: dev 253:3 agno 2 refcount 4 caller xfs_reclaim_inodes_ag [xfs]
>      kworker/1:2-210   [001] ...1   895.750612: xfs_perag_get_tag: dev 253:3 agno 3 refcount 1 caller xfs_reclaim_inodes_ag [xfs]
>      kworker/1:2-210   [001] ...1   895.750613: xfs_perag_put: dev 253:3 agno 3 refcount 0 caller xfs_reclaim_inodes_ag [xfs]
>     xfsaild/dm-3-12406 [003] ...2   895.760588: xfs_ail_locked: dev 253:3 lip 0xffff8801f8e65d80 lsn 2/5709 type XFS_LI_QUOTAOFF flags IN_AIL
>     xfsaild/dm-3-12406 [003] ...2   895.810595: xfs_ail_locked: dev 253:3 lip 0xffff8801f8e65d80 lsn 2/5709 type XFS_LI_QUOTAOFF flags IN_AIL
>     xfsaild/dm-3-12406 [003] ...2   895.860586: xfs_ail_locked: dev 253:3 lip 0xffff8801f8e65d80 lsn 2/5709 type XFS_LI_QUOTAOFF flags IN_AIL
>     xfsaild/dm-3-12406 [003] ...2   895.910596: xfs_ail_locked: dev 253:3 lip 0xffff8801f8e65d80 lsn 2/5709 type XFS_LI_QUOTAOFF flags IN_AIL
>     ...

No deadlock involving the AIL - it doesn't remove the
XFS_LI_QUOTAOFF from the AIL - the quota code committing the
quotaoff-end transactions is what removes that. IOWs, the dquot walk
has not completed, so quotaoff has not completed, so the
XFS_LI_QUOTAOFF is still in the AIL.

IOWs, this looks like xfs_qm_dquot_walk() is skipping dquots because
xfs_qm_dqpurge is hitting this:

        xfs_dqlock(dqp);
        if ((dqp->dq_flags & XFS_DQ_FREEING) || dqp->q_nrefs != 0) {
                xfs_dqunlock(dqp);
                return -EAGAIN;
        }

So that means we've got an inode that probably hasn't been
reclaimed, because the last thing that happens during reclaim is the
dquots are detached from the inode and hence the reference counts
are dropped.
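
(For reference, the dquot references only go away at the very tail of
reclaim - from the patch 6 hunk earlier in the thread:)

        xfs_qm_dqdetach(ip);            /* drops the inode's dquot references */
        xfs_iunlock(ip, XFS_ILOCK_EXCL);

        __xfs_inode_free(ip);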

> FWIW, this only occurs with patch 6 applied. The test and scratch
> devices are both 10GB lvm volumes formatted with mkfs defaults (v5).

I can't see how patch 6 would prevent an inode from being reclaimed,
as all the changes occur *after* the reclaim decision has been made.
More investigation needed, I guess...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 0/6 v2] xfs: xfs_iflush_cluster vs xfs_reclaim_inode
  2016-04-08 11:37   ` Brian Foster
@ 2016-04-10  9:22     ` Eryu Guan
  2016-04-11  6:25       ` Eryu Guan
  0 siblings, 1 reply; 18+ messages in thread
From: Eryu Guan @ 2016-04-10  9:22 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Fri, Apr 08, 2016 at 07:37:09AM -0400, Brian Foster wrote:
> On Fri, Apr 08, 2016 at 11:28:41AM +0800, Eryu Guan wrote:
> > On Fri, Apr 08, 2016 at 09:37:45AM +1000, Dave Chinner wrote:
> > > Hi folks,
> > > 
> > > This is the second version of this patch set, first posted and
> > > described here:
> > > 
> > > http://oss.sgi.com/archives/xfs/2016-04/msg00069.html
> > 
> > Just a quick note here, I'm testing the v1 patchset right now, v4.6-rc2
> > kernel + v1 patch, config file is based on rhel7 debug kernel config.
> > 
> > The test is the same as the original reproducer (long term fsstress run
> > on XFS, exported from NFS). The test on x86_64 host has been running for
> > two days and everything looks fine. Test on ppc64 host has been running
> > for a few hours and I noticed a lock issue and a few warnings, not sure
> > if it's related to the patches or even to XFS yet (I need to run the test on
> > a stock -rc2 kernel to be sure), but I'm posting the logs here for reference
> > 
> 
> Had the original problem ever been reproduced on an upstream kernel?

No, I've never seen the original problem in my upstream kernel testing.
Perhaps that's because I didn't run tests on debug kernels. But I didn't
see it in RHEL7 debug kernel testing either.

> 
> FWIW, my rhel kernel based test is still running well approaching ~48
> hours. I've seen some lockdep messages (bad unlock balance), but IIRC
> I've been seeing those from the start so I haven't been paying much
> attention to it while digging into the core problem.
> 
> > [ 1911.626286] ======================================================
> > [ 1911.626291] [ INFO: possible circular locking dependency detected ]
> > [ 1911.626297] 4.6.0-rc2.debug+ #1 Not tainted
> > [ 1911.626301] -------------------------------------------------------
> > [ 1911.626306] nfsd/7402 is trying to acquire lock:
> > [ 1911.626311]  (&s->s_sync_lock){+.+.+.}, at: [<c0000000003585f0>] .sync_inodes_sb+0xe0/0x230
> > [ 1911.626327]
> > [ 1911.626327] but task is already holding lock:
> > [ 1911.626333]  (sb_internal){.+.+.+}, at: [<c00000000031a780>] .__sb_start_write+0x90/0x130
> > [ 1911.626346]
> > [ 1911.626346] which lock already depends on the new lock.
> > [ 1911.626346]
> > [ 1911.626353]
> > [ 1911.626353] the existing dependency chain (in reverse order) is:
> > [ 1911.626358]
> ...
> > [ 1911.627134]  Possible unsafe locking scenario:
> > [ 1911.627134]
> > [ 1911.627139]        CPU0                    CPU1
> > [ 1911.627143]        ----                    ----
> > [ 1911.627147]   lock(sb_internal);
> > [ 1911.627153]                                lock(&s->s_sync_lock);
> > [ 1911.627160]                                lock(sb_internal);
> > [ 1911.627166]   lock(&s->s_sync_lock);
> > [ 1911.627172]
> > [ 1911.627172]  *** DEADLOCK ***
> > [ 1911.627172]
> ...
> 
> We actually have a report of this one on the list:
> 
> http://oss.sgi.com/archives/xfs/2016-04/msg00001.html
> 
> ... so I don't think it's related to this series. I believe I've seen
> this once or twice when testing something completely unrelated, as well.
> 
> > [ 2046.852739] kworker/dying (399) used greatest stack depth: 4352 bytes left
> > [ 2854.687381] XFS: Assertion failed: buffer_mapped(bh), file: fs/xfs/xfs_aops.c, line: 780
> > [ 2854.687434] ------------[ cut here ]------------
> > [ 2854.687488] WARNING: CPU: 5 PID: 28924 at fs/xfs/xfs_message.c:105 .asswarn+0x2c/0x40 [xfs]
> ...
> > [ 2854.687997] ---[ end trace 872ac2709186f780 ]---
> 
> These asserts look new to me, however. It would be interesting to see if
> these reproduce independently.

I've seen just the assert failures in the same fsstress testing on the
ppc64 host (no lockdep warnings at the beginning this time). I'll see if
they're still reproducible on a stock kernel.

Thanks,
Eryu

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/6 v2] xfs: xfs_iflush_cluster vs xfs_reclaim_inode
  2016-04-10  9:22     ` Eryu Guan
@ 2016-04-11  6:25       ` Eryu Guan
  0 siblings, 0 replies; 18+ messages in thread
From: Eryu Guan @ 2016-04-11  6:25 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Sun, Apr 10, 2016 at 05:22:35PM +0800, Eryu Guan wrote:
> On Fri, Apr 08, 2016 at 07:37:09AM -0400, Brian Foster wrote:
> > On Fri, Apr 08, 2016 at 11:28:41AM +0800, Eryu Guan wrote:
> > > On Fri, Apr 08, 2016 at 09:37:45AM +1000, Dave Chinner wrote:
> > > > Hi folks,
> > > > 
> > > > This is the second version of this patch set, first posted and
> > > > described here:
> > > > 
> > > > http://oss.sgi.com/archives/xfs/2016-04/msg00069.html
> > > 
> > > Just a quick note here, I'm testing the v1 patchset right now, v4.6-rc2
> > > kernel + v1 patch, config file is based on rhel7 debug kernel config.
> > > 
> > > The test is the same as the original reproducer (long term fsstress run
> > > on XFS, exported from NFS). The test on x86_64 host has been running for
> > > two days and everything looks fine. Test on ppc64 host has been running
> > > for a few hours and I noticed a lock issue and a few warnings, not sure
> > > if it's related to the patches or even to XFS yet(I need to run test on
> > > stock -rc2 kernel to be sure), but just post the logs here for reference
> > > 
> > 
> > Had the original problem ever been reproduced on an upstream kernel?
> 
> No, I've never seen the original problem in my upstream kernel testing.
> Perhaps that's because I didn't run tests on debug kernels. But I didn't
> see it in RHEL7 debug kernel testing either.
> 
> > 
> > FWIW, my rhel kernel based test is still running well approaching ~48
> > hours. I've seen some lockdep messages (bad unlock balance), but IIRC
> > I've been seeing those from the start so I haven't been paying much
> > attention to it while digging into the core problem.
> > 
> > > [ 1911.626286] ======================================================
> > > [ 1911.626291] [ INFO: possible circular locking dependency detected ]
> > > [ 1911.626297] 4.6.0-rc2.debug+ #1 Not tainted
> > > [ 1911.626301] -------------------------------------------------------
> > > [ 1911.626306] nfsd/7402 is trying to acquire lock:
> > > [ 1911.626311]  (&s->s_sync_lock){+.+.+.}, at: [<c0000000003585f0>] .sync_inodes_sb+0xe0/0x230
> > > [ 1911.626327]
> > > [ 1911.626327] but task is already holding lock:
> > > [ 1911.626333]  (sb_internal){.+.+.+}, at: [<c00000000031a780>] .__sb_start_write+0x90/0x130
> > > [ 1911.626346]
> > > [ 1911.626346] which lock already depends on the new lock.
> > > [ 1911.626346]
> > > [ 1911.626353]
> > > [ 1911.626353] the existing dependency chain (in reverse order) is:
> > > [ 1911.626358]
> > ...
> > > [ 1911.627134]  Possible unsafe locking scenario:
> > > [ 1911.627134]
> > > [ 1911.627139]        CPU0                    CPU1
> > > [ 1911.627143]        ----                    ----
> > > [ 1911.627147]   lock(sb_internal);
> > > [ 1911.627153]                                lock(&s->s_sync_lock);
> > > [ 1911.627160]                                lock(sb_internal);
> > > [ 1911.627166]   lock(&s->s_sync_lock);
> > > [ 1911.627172]
> > > [ 1911.627172]  *** DEADLOCK ***
> > > [ 1911.627172]
> > ...
> > 
> > We actually have a report of this one on the list:
> > 
> > http://oss.sgi.com/archives/xfs/2016-04/msg00001.html
> > 
> > ... so I don't think it's related to this series. I believe I've seen
> > this once or twice when testing something completely unrelated, as well.
> > 
> > > [ 2046.852739] kworker/dying (399) used greatest stack depth: 4352 bytes left
> > > [ 2854.687381] XFS: Assertion failed: buffer_mapped(bh), file: fs/xfs/xfs_aops.c, line: 780
> > > [ 2854.687434] ------------[ cut here ]------------
> > > [ 2854.687488] WARNING: CPU: 5 PID: 28924 at fs/xfs/xfs_message.c:105 .asswarn+0x2c/0x40 [xfs]
> > ...
> > > [ 2854.687997] ---[ end trace 872ac2709186f780 ]---
> > 
> > These asserts look new to me, however. It would be interesting to see if
> > these reproduce independently.
> 
> I've seen just the assert failures in the same fsstress testing on the
> ppc64 host (no lockdep warnings at the beginning this time). I'll see if
> they're still reproducible on a stock kernel.

I saw the assert failures on a stock kernel (v4.6-rc2) too, so at least
they're not something introduced by this patchset.

Thanks,
Eryu

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/6 v2] xfs: xfs_iflush_cluster vs xfs_reclaim_inode
  2016-04-08 22:17   ` Dave Chinner
@ 2016-04-11 13:37     ` Brian Foster
  2016-04-11 23:31       ` Dave Chinner
  0 siblings, 1 reply; 18+ messages in thread
From: Brian Foster @ 2016-04-11 13:37 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

[-- Attachment #1: Type: text/plain, Size: 4461 bytes --]

On Sat, Apr 09, 2016 at 08:17:06AM +1000, Dave Chinner wrote:
> On Fri, Apr 08, 2016 at 01:18:44PM -0400, Brian Foster wrote:
> > On Fri, Apr 08, 2016 at 09:37:45AM +1000, Dave Chinner wrote:
> > > Hi folks,
> > > 
> > > This is the second version of this patch set, first posted and
> > > described here:
> > > 
> > > http://oss.sgi.com/archives/xfs/2016-04/msg00069.html
> > > 
> > > The only change from the first version is splitting up the first
> > > patch into two as Christoph requested - one for the bug fix, the
> > > other for the variable renaming.
> > > 
> > 
> > Did your xfstests testing for this series include generic/233? I'm
> > seeing a consistently reproducible test hang. The test is hanging on a
> > "xfs_quota -x -c off -ug /mnt/scratch" command. The stack is as follows:
> > 
> > [<ffffffffa0772306>] xfs_qm_dquot_walk.isra.8+0x196/0x1b0 [xfs]
> > [<ffffffffa0774a98>] xfs_qm_dqpurge_all+0x78/0x80 [xfs]
> > [<ffffffffa07713e8>] xfs_qm_scall_quotaoff+0x148/0x640 [xfs]
> > [<ffffffffa077733d>] xfs_quota_disable+0x3d/0x50 [xfs]
> > [<ffffffff812c27e3>] SyS_quotactl+0x3b3/0x8c0
> > [<ffffffff81003e17>] do_syscall_64+0x67/0x190
> > [<ffffffff81763f7f>] return_from_SYSCALL_64+0x0/0x7a
> > [<ffffffffffffffff>] 0xffffffffffffffff
> > 
> > ... and it looks like the kernel is spinning somehow or another between
> > inode reclaim and xfsaild:
> > 
> >     ...
> >      kworker/1:2-210   [001] ...1   895.750591: xfs_perag_get_tag: dev 253:3 agno 1 refcount 1 caller xfs_reclaim_inodes_ag [xfs]
> >      kworker/1:2-210   [001] ...1   895.750609: xfs_perag_put: dev 253:3 agno 1 refcount 0 caller xfs_reclaim_inodes_ag [xfs]
> >      kworker/1:2-210   [001] ...1   895.750609: xfs_perag_get_tag: dev 253:3 agno 2 refcount 5 caller xfs_reclaim_inodes_ag [xfs]
> >      kworker/1:2-210   [001] ...1   895.750611: xfs_perag_put: dev 253:3 agno 2 refcount 4 caller xfs_reclaim_inodes_ag [xfs]
> >      kworker/1:2-210   [001] ...1   895.750612: xfs_perag_get_tag: dev 253:3 agno 3 refcount 1 caller xfs_reclaim_inodes_ag [xfs]
> >      kworker/1:2-210   [001] ...1   895.750613: xfs_perag_put: dev 253:3 agno 3 refcount 0 caller xfs_reclaim_inodes_ag [xfs]
> >     xfsaild/dm-3-12406 [003] ...2   895.760588: xfs_ail_locked: dev 253:3 lip 0xffff8801f8e65d80 lsn 2/5709 type XFS_LI_QUOTAOFF flags IN_AIL
> >     xfsaild/dm-3-12406 [003] ...2   895.810595: xfs_ail_locked: dev 253:3 lip 0xffff8801f8e65d80 lsn 2/5709 type XFS_LI_QUOTAOFF flags IN_AIL
> >     xfsaild/dm-3-12406 [003] ...2   895.860586: xfs_ail_locked: dev 253:3 lip 0xffff8801f8e65d80 lsn 2/5709 type XFS_LI_QUOTAOFF flags IN_AIL
> >     xfsaild/dm-3-12406 [003] ...2   895.910596: xfs_ail_locked: dev 253:3 lip 0xffff8801f8e65d80 lsn 2/5709 type XFS_LI_QUOTAOFF flags IN_AIL
> >     ...
> 
> No deadlock involving the AIL - it doesn't remove the
> XFS_LI_QUOTAOFF from the AIL - the quota code committing the
> quotaoff-end transactions is what removes that. IOWs, the dquot walk
> has not completed, so quotaoff has not completed, so the
> XFS_LI_QUOTAOFF is still in the AIL.
> 
> IOWs, this looks like xfs_qm_dquot_walk() is skipping dquots because
> xfs_qm_dqpurge is hitting this:
> 
>         xfs_dqlock(dqp);
>         if ((dqp->dq_flags & XFS_DQ_FREEING) || dqp->q_nrefs != 0) {
>                 xfs_dqunlock(dqp);
>                 return -EAGAIN;
>         }
> 
> So that means we've got an inode that probably hasn't been
> reclaimed, because the last thing that happens during reclaim is the
> dquots are detached from the inode and hence the reference counts
> are dropped.
> 
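(An illustrative aside on the failure mode described above: the sketch below
uses made-up names, not the real xfs_qm code, and just shows why a purge-all
walk that retries on -EAGAIN spins forever once the reference it is waiting
on can never be dropped.)

#include <errno.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for a dquot: the reference count only drops when
 * the owning inode is reclaimed and the dquot is detached from it.
 */
struct fake_dquot {
	int nrefs;
};

static int purge_one(struct fake_dquot *dq)
{
	if (dq->nrefs != 0)
		return -EAGAIN;	/* still referenced: skip, walker retries */
	return 0;		/* safe to free */
}

static void purge_all(struct fake_dquot *dq)
{
	/*
	 * If the inode holding the reference is never reclaimed, nrefs
	 * never reaches zero and this loop never terminates - which is
	 * the quotaoff hang seen in the generic/233 trace.
	 */
	while (purge_one(dq) == -EAGAIN)
		;
}

int main(void)
{
	struct fake_dquot dq = { .nrefs = 0 };	/* set to 1 to see the spin */

	purge_all(&dq);
	printf("purge completed\n");
	return 0;
}
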
> > FWIW, this only occurs with patch 6 applied. The test and scratch
> > devices are both 10GB lvm volumes formatted with mkfs defaults (v5).
> 
> I can't see how patch 6 would prevent an inode from being reclaimed,
> as all the changes occur *after* the reclaim decision has been made.
> More investigation needed, I guess...
> 

The attached diff addresses the problem for me. Feel free to fold it
into the original patch.

The regression test I had running failed with an OOM over the weekend.
I hadn't seen that before, but then again I haven't seen this test run
to completion on this system either due to the original problem. I'll
restart it today with this hunk included.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

[-- Attachment #2: __xfs_inode_clear_reclaim.diff --]
[-- Type: text/plain, Size: 1246 bytes --]

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index a60db43..749689c 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -818,14 +818,15 @@ xfs_inode_set_reclaim_tag(
 STATIC void
 __xfs_inode_clear_reclaim(
 	xfs_perag_t	*pag,
-	xfs_inode_t	*ip)
+	xfs_inode_t	*ip,
+	xfs_ino_t	ino)
 {
 	pag->pag_ici_reclaimable--;
 	if (!pag->pag_ici_reclaimable) {
 		/* clear the reclaim tag from the perag radix tree */
 		spin_lock(&ip->i_mount->m_perag_lock);
 		radix_tree_tag_clear(&ip->i_mount->m_perag_tree,
-				XFS_INO_TO_AGNO(ip->i_mount, ip->i_ino),
+				XFS_INO_TO_AGNO(ip->i_mount, ino),
 				XFS_ICI_RECLAIM_TAG);
 		spin_unlock(&ip->i_mount->m_perag_lock);
 		trace_xfs_perag_clear_reclaim(ip->i_mount, pag->pag_agno,
@@ -841,7 +842,7 @@ __xfs_inode_clear_reclaim_tag(
 {
 	radix_tree_tag_clear(&pag->pag_ici_root,
 			XFS_INO_TO_AGINO(mp, ip->i_ino), XFS_ICI_RECLAIM_TAG);
-	__xfs_inode_clear_reclaim(pag, ip);
+	__xfs_inode_clear_reclaim(pag, ip, ip->i_ino);
 }
 
 /*
@@ -1032,7 +1033,7 @@ reclaim:
 	if (!radix_tree_delete(&pag->pag_ici_root,
 				XFS_INO_TO_AGINO(ip->i_mount, ino)))
 		ASSERT(0);
-	__xfs_inode_clear_reclaim(pag, ip);
+	__xfs_inode_clear_reclaim(pag, ip, ino);
 	spin_unlock(&pag->pag_ici_lock);
 
 	/*
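
For what it's worth, here is a minimal, self-contained illustration of the
pattern the hunk applies (made-up names, not XFS code): once an object's
identifying field can be zeroed before the teardown bookkeeping runs -
presumably what happens here after patch 6 invalidates the inode earlier -
the identifier has to be captured up front and passed by value, which is
what threading the extra xfs_ino_t argument through
__xfs_inode_clear_reclaim() does.

#include <assert.h>

/* Hypothetical object whose id field is zeroed when it is invalidated. */
struct obj {
	unsigned long	id;
};

static unsigned long last_cleared;

static void clear_tag(unsigned long id)
{
	/* bookkeeping keyed on the saved identifier, not on obj->id */
	last_cleared = id;
}

static void teardown(struct obj *o)
{
	unsigned long id = o->id;	/* capture before invalidation */

	o->id = 0;			/* invalidation zeroes the field ... */
	clear_tag(id);			/* ... so use the captured value */
}

int main(void)
{
	struct obj o = { .id = 42 };

	teardown(&o);
	assert(last_cleared == 42);	/* not 0 */
	return 0;
}
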

[-- Attachment #3: Type: text/plain, Size: 121 bytes --]

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/6 v2] xfs: xfs_iflush_cluster vs xfs_reclaim_inode
  2016-04-11 13:37     ` Brian Foster
@ 2016-04-11 23:31       ` Dave Chinner
  0 siblings, 0 replies; 18+ messages in thread
From: Dave Chinner @ 2016-04-11 23:31 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Mon, Apr 11, 2016 at 09:37:18AM -0400, Brian Foster wrote:
> On Sat, Apr 09, 2016 at 08:17:06AM +1000, Dave Chinner wrote:
> > On Fri, Apr 08, 2016 at 01:18:44PM -0400, Brian Foster wrote:
> > > On Fri, Apr 08, 2016 at 09:37:45AM +1000, Dave Chinner wrote:
> > > > Hi folks,
> > > > 
> > > > This is the second version of this patch set, first posted and
> > > > described here:
> > > > 
> > > > http://oss.sgi.com/archives/xfs/2016-04/msg00069.html
> > > > 
> > > > The only change from the first version is splitting up the first
> > > > patch into two as Christoph requested - one for the bug fix, the
> > > > other for the variable renaming.
> > > > 
> > > 
> > > Did your xfstests testing for this series include generic/233? I'm
> > > seeing a consistently reproducible test hang. The test is hanging on a
> > > "xfs_quota -x -c off -ug /mnt/scratch" command. The stack is as follows:
> > > 
> > > [<ffffffffa0772306>] xfs_qm_dquot_walk.isra.8+0x196/0x1b0 [xfs]
> > > [<ffffffffa0774a98>] xfs_qm_dqpurge_all+0x78/0x80 [xfs]
> > > [<ffffffffa07713e8>] xfs_qm_scall_quotaoff+0x148/0x640 [xfs]
> > > [<ffffffffa077733d>] xfs_quota_disable+0x3d/0x50 [xfs]
> > > [<ffffffff812c27e3>] SyS_quotactl+0x3b3/0x8c0
> > > [<ffffffff81003e17>] do_syscall_64+0x67/0x190
> > > [<ffffffff81763f7f>] return_from_SYSCALL_64+0x0/0x7a
> > > [<ffffffffffffffff>] 0xffffffffffffffff
....
> > IOWs, this looks like xfs_qm_dquot_walk() is skipping dquots because
> > xfs_qm_dqpurge is hitting this:
> > 
> >         xfs_dqlock(dqp);
> >         if ((dqp->dq_flags & XFS_DQ_FREEING) || dqp->q_nrefs != 0) {
> >                 xfs_dqunlock(dqp);
> >                 return -EAGAIN;
> >         }
> > 
> > So that means we've got an inode that probably hasn't been
> > reclaimed, because the last thing that happens during reclaim is the
> > dquots are detached from the inode and hence the reference counts
> > are dropped.
> > 
> > > FWIW, this only occurs with patch 6 applied. The test and scratch
> > > devices are both 10GB lvm volumes formatted with mkfs defaults (v5).
> > 
> > I can't see how patch 6 would prevent an inode from being reclaimed,
> > as all the changes occur *after* the reclaim decision has been made.
> > More investigation needed, I guess...
> > 
> 
> The attached diff addresses the problem for me. Feel free to fold it
> into the original patch.
....
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index a60db43..749689c 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -818,14 +818,15 @@ xfs_inode_set_reclaim_tag(
>  STATIC void
>  __xfs_inode_clear_reclaim(
>  	xfs_perag_t	*pag,
> -	xfs_inode_t	*ip)
> +	xfs_inode_t	*ip,
> +	xfs_ino_t	ino)
>  {
>  	pag->pag_ici_reclaimable--;
>  	if (!pag->pag_ici_reclaimable) {
>  		/* clear the reclaim tag from the perag radix tree */
>  		spin_lock(&ip->i_mount->m_perag_lock);
>  		radix_tree_tag_clear(&ip->i_mount->m_perag_tree,
> -				XFS_INO_TO_AGNO(ip->i_mount, ip->i_ino),
> +				XFS_INO_TO_AGNO(ip->i_mount, ino),
>  				XFS_ICI_RECLAIM_TAG);
>  		spin_unlock(&ip->i_mount->m_perag_lock);
>  		trace_xfs_perag_clear_reclaim(ip->i_mount, pag->pag_agno,

Yeah, that'll do it. Though I think the fix should be something
different - why do we need ip->i_ino to find the agno or the
xfs_mount when we've already got pag->pag_agno and pag->pag_mount?

I'll clean up all these per-ag tag functions to only take an xfs_perag
and an xfs_ino_t where needed, and repost once I've tested it.
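
For reference, a rough guess at the shape of that cleanup, built only from
the fields visible in the hunk above plus pag->pag_mount (tracepoint call
omitted; the reposted patch may well look different):

STATIC void
__xfs_inode_clear_reclaim(
	xfs_perag_t		*pag)
{
	struct xfs_mount	*mp = pag->pag_mount;

	pag->pag_ici_reclaimable--;
	if (!pag->pag_ici_reclaimable) {
		/* clear the reclaim tag from the per-mount perag tree */
		spin_lock(&mp->m_perag_lock);
		radix_tree_tag_clear(&mp->m_perag_tree, pag->pag_agno,
				     XFS_ICI_RECLAIM_TAG);
		spin_unlock(&mp->m_perag_lock);
	}
}
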

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2016-04-11 23:32 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-04-07 23:37 [PATCH 0/6 v2] xfs: xfs_iflush_cluster vs xfs_reclaim_inode Dave Chinner
2016-04-07 23:37 ` [PATCH 1/6] xfs: fix inode validity check in xfs_iflush_cluster Dave Chinner
2016-04-07 23:43   ` Christoph Hellwig
2016-04-07 23:37 ` [PATCH 2/6] xfs: rename variables in xfs_iflush_cluster for clarity Dave Chinner
2016-04-07 23:44   ` Christoph Hellwig
2016-04-07 23:37 ` [PATCH 3/6] xfs: skip stale inodes in xfs_iflush_cluster Dave Chinner
2016-04-07 23:37 ` [PATCH 4/6] xfs: xfs_iflush_cluster has range issues Dave Chinner
2016-04-07 23:37 ` [PATCH 5/6] xfs: xfs_inode_free() isn't RCU safe Dave Chinner
2016-04-07 23:37 ` [PATCH 6/6] xfs: mark reclaimed inodes invalid earlier Dave Chinner
2016-04-07 23:46   ` Christoph Hellwig
2016-04-08  3:28 ` [PATCH 0/6 v2] xfs: xfs_iflush_cluster vs xfs_reclaim_inode Eryu Guan
2016-04-08 11:37   ` Brian Foster
2016-04-10  9:22     ` Eryu Guan
2016-04-11  6:25       ` Eryu Guan
2016-04-08 17:18 ` Brian Foster
2016-04-08 22:17   ` Dave Chinner
2016-04-11 13:37     ` Brian Foster
2016-04-11 23:31       ` Dave Chinner
