* [PATCHSET v2 0/3] xfs: preserve inode health reports for longer
From: Darrick J. Wong @ 2021-06-03  3:12 UTC
To: djwong; +Cc: linux-xfs, david, bfoster

Hi all,

This is a quick series to make sure that inode sickness reports stick
around in memory for some amount of time.

v2: rebase to 5.13-rc4

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fix-inode-health-reports-5.14
---
 fs/xfs/xfs_health.c |    5 +++++
 fs/xfs/xfs_icache.c |   29 +++++++++++++++++++++--------
 2 files changed, 26 insertions(+), 8 deletions(-)
* [PATCH 1/3] xfs: only reset incore inode health state flags when reclaiming an inode
From: Darrick J. Wong @ 2021-06-03  3:12 UTC
To: djwong; +Cc: linux-xfs, david, bfoster

From: Darrick J. Wong <djwong@kernel.org>

While running some fuzz tests on inode metadata, I noticed that the
filesystem health report (as provided by xfs_spaceman) failed to report
the file corruption even when spaceman was run immediately after running
xfs_scrub to detect the corruption.  That isn't the intended behavior;
one ought to be able to run scrub to detect errors in the ondisk
metadata and be able to access those reports for some time after the
scrub.

After running the same sequence through an instrumented kernel, I
discovered the reason why -- scrub igets the file, scans it, marks it
sick, and ireleases the inode.  When the VFS lets go of the incore
inode, it moves to RECLAIMABLE state.  If spaceman igets the incore
inode before it moves to RECLAIM state, iget reinitializes the VFS
state, clears the sick and checked masks, and hands back the inode.  At
this point, the caller has the exact same incore inode, but with all the
health state erased.

In other words, we're erasing the incore inode's health state flags when
we've decided NOT to sever the link between the incore inode and the
ondisk inode.  This is wrong, so we need to remove the lines that zero
the fields from xfs_iget_cache_hit.

As a precaution, we add the same lines into xfs_reclaim_inode just after
we sever the link between incore and ondisk inode.  Strictly speaking
this isn't necessary because once an inode has gone through reclaim it
must go through xfs_inode_alloc (which also zeroes the state) and
xfs_iget is careful to check for mismatches between the inode it pulls
out of the radix tree and the one it wants.

Fixes: 6772c1f11206 ("xfs: track metadata health status")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_icache.c |    5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 396cc54ca03f..c3f912a9231b 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -523,9 +523,6 @@ xfs_iget_cache_hit(
 			XFS_INO_TO_AGINO(pag->pag_mount, ino),
 			XFS_ICI_RECLAIM_TAG);
 		inode->i_state = I_NEW;
-		ip->i_sick = 0;
-		ip->i_checked = 0;
-
 		spin_unlock(&ip->i_flags_lock);
 		spin_unlock(&pag->pag_ici_lock);
 	} else {
@@ -979,6 +976,8 @@ xfs_reclaim_inode(
 	spin_lock(&ip->i_flags_lock);
 	ip->i_flags = XFS_IRECLAIM;
 	ip->i_ino = 0;
+	ip->i_sick = 0;
+	ip->i_checked = 0;
 	spin_unlock(&ip->i_flags_lock);
 
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
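The state flow this patch describes can be sketched as a toy model. This is not kernel code: `model_inode` and the two helpers are hypothetical stand-ins for `struct xfs_inode`, `xfs_iget_cache_hit`, and `xfs_reclaim_inode`, kept only to show why the zeroing moves from the cache-hit path (where the incore/ondisk link survives) to the reclaim path (where it is severed):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical miniature of the incore inode's health fields; the real
 * struct xfs_inode carries many more members and real locking. */
struct model_inode {
	unsigned int	i_sick;		/* sickness flags set by scrub */
	unsigned int	i_checked;	/* which checks have been run */
	bool		linked;		/* incore <-> ondisk link intact */
};

/* Cache hit: reinitialize VFS state, but (after the fix) leave the
 * health state alone, because the ondisk link is not being severed. */
static void model_iget_cache_hit(struct model_inode *ip)
{
	ip->linked = true;
	/* i_sick and i_checked deliberately preserved */
}

/* Reclaim: the incore/ondisk link is severed, so the health state no
 * longer describes any reachable inode and is zeroed here instead. */
static void model_reclaim_inode(struct model_inode *ip)
{
	ip->linked = false;
	ip->i_sick = 0;
	ip->i_checked = 0;
}
```

Under this sketch, a second iget between RECLAIMABLE and RECLAIM hands back the same incore inode with its sickness report intact, which is exactly the spaceman-after-scrub case from the commit message.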
* Re: [PATCH 1/3] xfs: only reset incore inode health state flags when reclaiming an inode
From: Dave Chinner @ 2021-06-03  4:21 UTC
To: Darrick J. Wong; +Cc: linux-xfs, bfoster

On Wed, Jun 02, 2021 at 08:12:41PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> While running some fuzz tests on inode metadata, I noticed that the
> filesystem health report (as provided by xfs_spaceman) failed to report
> the file corruption even when spaceman was run immediately after running
> xfs_scrub to detect the corruption.  That isn't the intended behavior;
> one ought to be able to run scrub to detect errors in the ondisk
> metadata and be able to access to those reports for some time after the
> scrub.
>
> After running the same sequence through an instrumented kernel, I
> discovered the reason why -- scrub igets the file, scans it, marks it
> sick, and ireleases the inode.  When the VFS lets go of the incore
> inode, it moves to RECLAIMABLE state.  If spaceman igets the incore
> inode before it moves to RECLAIM state, iget reinitializes the VFS
> state, clears the sick and checked masks, and hands back the inode.  At
> this point, the caller has the exact same incore inode, but with all the
> health state erased.
>
> In other words, we're erasing the incore inode's health state flags when
> we've decided NOT to sever the link between the incore inode and the
> ondisk inode.  This is wrong, so we need to remove the lines that zero
> the fields from xfs_iget_cache_hit.
>
> As a precaution, we add the same lines into xfs_reclaim_inode just after
> we sever the link between incore and ondisk inode.  Strictly speaking
> this isn't necessary because once an inode has gone through reclaim it
> must go through xfs_inode_alloc (which also zeroes the state) and
> xfs_iget is careful to check for mismatches between the inode it pulls
> out of the radix tree and the one it wants.
>
> Fixes: 6772c1f11206 ("xfs: track metadata health status")
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/xfs_icache.c |    5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)

Looks fine.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

Though I do wonder how long such state will hang around, because once
the inode is IRECLAIMABLE and clean it is only a matter of seconds
before the background inode reclaimer will free it...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: [PATCH 1/3] xfs: only reset incore inode health state flags when reclaiming an inode
From: Darrick J. Wong @ 2021-06-03 20:41 UTC
To: Dave Chinner; +Cc: linux-xfs, bfoster

On Thu, Jun 03, 2021 at 02:21:07PM +1000, Dave Chinner wrote:
> On Wed, Jun 02, 2021 at 08:12:41PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > While running some fuzz tests on inode metadata, I noticed that the
> > filesystem health report (as provided by xfs_spaceman) failed to report
> > the file corruption even when spaceman was run immediately after running
> > xfs_scrub to detect the corruption.  That isn't the intended behavior;
> > one ought to be able to run scrub to detect errors in the ondisk
> > metadata and be able to access to those reports for some time after the
> > scrub.
> >
> > After running the same sequence through an instrumented kernel, I
> > discovered the reason why -- scrub igets the file, scans it, marks it
> > sick, and ireleases the inode.  When the VFS lets go of the incore
> > inode, it moves to RECLAIMABLE state.  If spaceman igets the incore
> > inode before it moves to RECLAIM state, iget reinitializes the VFS
> > state, clears the sick and checked masks, and hands back the inode.  At
> > this point, the caller has the exact same incore inode, but with all the
> > health state erased.
> >
> > In other words, we're erasing the incore inode's health state flags when
> > we've decided NOT to sever the link between the incore inode and the
> > ondisk inode.  This is wrong, so we need to remove the lines that zero
> > the fields from xfs_iget_cache_hit.
> >
> > As a precaution, we add the same lines into xfs_reclaim_inode just after
> > we sever the link between incore and ondisk inode.  Strictly speaking
> > this isn't necessary because once an inode has gone through reclaim it
> > must go through xfs_inode_alloc (which also zeroes the state) and
> > xfs_iget is careful to check for mismatches between the inode it pulls
> > out of the radix tree and the one it wants.
> >
> > Fixes: 6772c1f11206 ("xfs: track metadata health status")
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/xfs/xfs_icache.c |    5 ++---
> >  1 file changed, 2 insertions(+), 3 deletions(-)
>
> Looks fine.
>
> Reviewed-by: Dave Chinner <dchinner@redhat.com>
>
> Though I do wonder how long such state will hang around, because
> once the inode is IRECLAIMABLE and clean it is only a matter of
> seconds before the background inode reclaimer will free it...

A future patchset will add the ability to make the per-AG health status
remember that we were forced to reclaim a sick inode:

https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=indirect-health-reporting

This series used to be the "fixes first" part of that code, but since it
directly intersects with the deferred inactivation changes, I moved it
up to try to fix the problem sooner than later.

Anyway, thanks for the review.

--D

> Cheers,
>
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
* Re: [PATCH 1/3] xfs: only reset incore inode health state flags when reclaiming an inode
From: Brian Foster @ 2021-06-03 12:22 UTC
To: Darrick J. Wong; +Cc: linux-xfs, david

On Wed, Jun 02, 2021 at 08:12:41PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> While running some fuzz tests on inode metadata, I noticed that the
> filesystem health report (as provided by xfs_spaceman) failed to report
> the file corruption even when spaceman was run immediately after running
> xfs_scrub to detect the corruption.  That isn't the intended behavior;
> one ought to be able to run scrub to detect errors in the ondisk
> metadata and be able to access to those reports for some time after the
> scrub.
>
> After running the same sequence through an instrumented kernel, I
> discovered the reason why -- scrub igets the file, scans it, marks it
> sick, and ireleases the inode.  When the VFS lets go of the incore
> inode, it moves to RECLAIMABLE state.  If spaceman igets the incore
> inode before it moves to RECLAIM state, iget reinitializes the VFS
> state, clears the sick and checked masks, and hands back the inode.  At
> this point, the caller has the exact same incore inode, but with all the
> health state erased.
>
> In other words, we're erasing the incore inode's health state flags when
> we've decided NOT to sever the link between the incore inode and the
> ondisk inode.  This is wrong, so we need to remove the lines that zero
> the fields from xfs_iget_cache_hit.
>
> As a precaution, we add the same lines into xfs_reclaim_inode just after
> we sever the link between incore and ondisk inode.  Strictly speaking
> this isn't necessary because once an inode has gone through reclaim it
> must go through xfs_inode_alloc (which also zeroes the state) and
> xfs_iget is careful to check for mismatches between the inode it pulls
> out of the radix tree and the one it wants.
>
> Fixes: 6772c1f11206 ("xfs: track metadata health status")
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---

I think I reviewed this the last time around..

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_icache.c |    5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 396cc54ca03f..c3f912a9231b 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -523,9 +523,6 @@ xfs_iget_cache_hit(
>  			XFS_INO_TO_AGINO(pag->pag_mount, ino),
>  			XFS_ICI_RECLAIM_TAG);
>  		inode->i_state = I_NEW;
> -		ip->i_sick = 0;
> -		ip->i_checked = 0;
> -
>  		spin_unlock(&ip->i_flags_lock);
>  		spin_unlock(&pag->pag_ici_lock);
>  	} else {
> @@ -979,6 +976,8 @@ xfs_reclaim_inode(
>  	spin_lock(&ip->i_flags_lock);
>  	ip->i_flags = XFS_IRECLAIM;
>  	ip->i_ino = 0;
> +	ip->i_sick = 0;
> +	ip->i_checked = 0;
>  	spin_unlock(&ip->i_flags_lock);
>
>  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
* Re: [PATCH 1/3] xfs: only reset incore inode health state flags when reclaiming an inode
From: Darrick J. Wong @ 2021-06-03 20:41 UTC
To: Brian Foster; +Cc: linux-xfs, david

On Thu, Jun 03, 2021 at 08:22:10AM -0400, Brian Foster wrote:
> On Wed, Jun 02, 2021 at 08:12:41PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > [commit message snipped for this reply; see the quoted patch upthread]
> >
> > Fixes: 6772c1f11206 ("xfs: track metadata health status")
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
>
> I think I reviewed this the last time around..

Oops, yes, my bad. :(

--D

> Reviewed-by: Brian Foster <bfoster@redhat.com>
>
> >  fs/xfs/xfs_icache.c |    5 ++---
> >  1 file changed, 2 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index 396cc54ca03f..c3f912a9231b 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> > @@ -523,9 +523,6 @@ xfs_iget_cache_hit(
> >  			XFS_INO_TO_AGINO(pag->pag_mount, ino),
> >  			XFS_ICI_RECLAIM_TAG);
> >  		inode->i_state = I_NEW;
> > -		ip->i_sick = 0;
> > -		ip->i_checked = 0;
> > -
> >  		spin_unlock(&ip->i_flags_lock);
> >  		spin_unlock(&pag->pag_ici_lock);
> >  	} else {
> > @@ -979,6 +976,8 @@ xfs_reclaim_inode(
> >  	spin_lock(&ip->i_flags_lock);
> >  	ip->i_flags = XFS_IRECLAIM;
> >  	ip->i_ino = 0;
> > +	ip->i_sick = 0;
> > +	ip->i_checked = 0;
> >  	spin_unlock(&ip->i_flags_lock);
> >
> >  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
* [PATCH 2/3] xfs: drop IDONTCACHE on inodes when we mark them sick
From: Darrick J. Wong @ 2021-06-03  3:12 UTC
To: djwong; +Cc: linux-xfs, david, bfoster

From: Darrick J. Wong <djwong@kernel.org>

When we decide to mark an inode sick, clear the DONTCACHE flag so that
the incore inode will be kept around until memory pressure forces it
out of memory.  This increases the chances that the sick status will be
caught by someone compiling a health report later on.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_health.c |    5 +++++
 fs/xfs/xfs_icache.c |    3 ++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_health.c b/fs/xfs/xfs_health.c
index 8e0cb05a7142..824e0b781290 100644
--- a/fs/xfs/xfs_health.c
+++ b/fs/xfs/xfs_health.c
@@ -231,6 +231,11 @@ xfs_inode_mark_sick(
 	ip->i_sick |= mask;
 	ip->i_checked |= mask;
 	spin_unlock(&ip->i_flags_lock);
+
+	/* Keep this inode around so we don't lose the sickness report. */
+	spin_lock(&VFS_I(ip)->i_lock);
+	VFS_I(ip)->i_state &= ~I_DONTCACHE;
+	spin_unlock(&VFS_I(ip)->i_lock);
 }
 
 /* Mark parts of an inode healed. */
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index c3f912a9231b..0e2b6c05e604 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -23,6 +23,7 @@
 #include "xfs_dquot.h"
 #include "xfs_reflink.h"
 #include "xfs_ialloc.h"
+#include "xfs_health.h"
 
 #include <linux/iversion.h>
 
@@ -648,7 +649,7 @@ xfs_iget_cache_miss(
 	 * time.
 	 */
 	iflags = XFS_INEW;
-	if (flags & XFS_IGET_DONTCACHE)
+	if ((flags & XFS_IGET_DONTCACHE) && xfs_inode_is_healthy(ip))
 		d_mark_dontcache(VFS_I(ip));
 	ip->i_udquot = NULL;
 	ip->i_gdquot = NULL;
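The xfs_health.c hunk above is the core of the patch: marking an inode sick also drops the VFS's "evict as soon as possible" hint. A toy model of that coupling (hypothetical names; `I_DONTCACHE`'s value here is illustrative, not the kernel's, and the real code takes `i_flags_lock`/`i_lock` around these updates) might look like:

```c
#include <assert.h>

#define MODEL_I_DONTCACHE	0x100	/* illustrative flag value */

/* Hypothetical miniatures of struct inode and struct xfs_inode. */
struct model_vfs_inode {
	unsigned long	i_state;
};

struct model_inode {
	unsigned int		i_sick;
	unsigned int		i_checked;
	struct model_vfs_inode	vfs;
};

/* Marking an inode sick records the sickness and clears DONTCACHE so
 * the incore inode (and its report) survives until memory pressure. */
static void model_mark_sick(struct model_inode *ip, unsigned int mask)
{
	ip->i_sick |= mask;
	ip->i_checked |= mask;
	ip->vfs.i_state &= ~MODEL_I_DONTCACHE;
}
```

As the review below this patch works out, the companion `xfs_iget_cache_miss` hunk is redundant (a freshly allocated inode cannot already be sick), so only the mark-sick side of this model reflects what was kept.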
* Re: [PATCH 2/3] xfs: drop IDONTCACHE on inodes when we mark them sick
From: Dave Chinner @ 2021-06-03  4:34 UTC
To: Darrick J. Wong; +Cc: linux-xfs, bfoster

On Wed, Jun 02, 2021 at 08:12:46PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> When we decide to mark an inode sick, clear the DONTCACHE flag so that
> the incore inode will be kept around until memory pressure forces it out
> of memory.  This increases the chances that the sick status will be
> caught by someone compiling a health report later on.
>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/xfs_health.c |    5 +++++
>  fs/xfs/xfs_icache.c |    3 ++-
>  2 files changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/fs/xfs/xfs_health.c b/fs/xfs/xfs_health.c
> index 8e0cb05a7142..824e0b781290 100644
> --- a/fs/xfs/xfs_health.c
> +++ b/fs/xfs/xfs_health.c
> @@ -231,6 +231,11 @@ xfs_inode_mark_sick(
>  	ip->i_sick |= mask;
>  	ip->i_checked |= mask;
>  	spin_unlock(&ip->i_flags_lock);
> +
> +	/* Keep this inode around so we don't lose the sickness report. */
> +	spin_lock(&VFS_I(ip)->i_lock);
> +	VFS_I(ip)->i_state &= ~I_DONTCACHE;
> +	spin_unlock(&VFS_I(ip)->i_lock);
>  }

Dentries will still be reclaimed, but the VFS will at least hold on
to the inode in this case.

>  /* Mark parts of an inode healed. */
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index c3f912a9231b..0e2b6c05e604 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -23,6 +23,7 @@
>  #include "xfs_dquot.h"
>  #include "xfs_reflink.h"
>  #include "xfs_ialloc.h"
> +#include "xfs_health.h"
>
>  #include <linux/iversion.h>
>
> @@ -648,7 +649,7 @@ xfs_iget_cache_miss(
>  	 * time.
>  	 */
>  	iflags = XFS_INEW;
> -	if (flags & XFS_IGET_DONTCACHE)
> +	if ((flags & XFS_IGET_DONTCACHE) && xfs_inode_is_healthy(ip))

Hmmmm. xfs_inode_is_healthy() is kind of heavyweight for just
checking that ip->i_sick == 0.  At this point, nobody else can be
accessing the inode, so we don't need masks nor a spinlock for
checking the sick field.

So why not:

	if ((flags & XFS_IGET_DONTCACHE) && !READ_ONCE(ip->i_sick))

Or maybe still use xfs_inode_is_healthy() but convert it to the
simpler, lockless sick check?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: [PATCH 2/3] xfs: drop IDONTCACHE on inodes when we mark them sick
From: Darrick J. Wong @ 2021-06-03 20:49 UTC
To: Dave Chinner; +Cc: linux-xfs, bfoster

On Thu, Jun 03, 2021 at 02:34:46PM +1000, Dave Chinner wrote:
> On Wed, Jun 02, 2021 at 08:12:46PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > When we decide to mark an inode sick, clear the DONTCACHE flag so that
> > the incore inode will be kept around until memory pressure forces it out
> > of memory.  This increases the chances that the sick status will be
> > caught by someone compiling a health report later on.
> >
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/xfs/xfs_health.c |    5 +++++
> >  fs/xfs/xfs_icache.c |    3 ++-
> >  2 files changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/xfs/xfs_health.c b/fs/xfs/xfs_health.c
> > index 8e0cb05a7142..824e0b781290 100644
> > --- a/fs/xfs/xfs_health.c
> > +++ b/fs/xfs/xfs_health.c
> > @@ -231,6 +231,11 @@ xfs_inode_mark_sick(
> >  	ip->i_sick |= mask;
> >  	ip->i_checked |= mask;
> >  	spin_unlock(&ip->i_flags_lock);
> > +
> > +	/* Keep this inode around so we don't lose the sickness report. */
> > +	spin_lock(&VFS_I(ip)->i_lock);
> > +	VFS_I(ip)->i_state &= ~I_DONTCACHE;
> > +	spin_unlock(&VFS_I(ip)->i_lock);
> >  }
>
> Dentries will still be reclaimed, but the VFS will at least hold on
> to the inode in this case.

Right.

> >  /* Mark parts of an inode healed. */
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index c3f912a9231b..0e2b6c05e604 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> > @@ -23,6 +23,7 @@
> >  #include "xfs_dquot.h"
> >  #include "xfs_reflink.h"
> >  #include "xfs_ialloc.h"
> > +#include "xfs_health.h"
> >
> >  #include <linux/iversion.h>
> >
> > @@ -648,7 +649,7 @@ xfs_iget_cache_miss(
> >  	 * time.
> >  	 */
> >  	iflags = XFS_INEW;
> > -	if (flags & XFS_IGET_DONTCACHE)
> > +	if ((flags & XFS_IGET_DONTCACHE) && xfs_inode_is_healthy(ip))
>
> Hmmmm. xfs_inode_is_healthy() is kind of heavyweight for just
> checking that ip->i_sick == 0.  At this point, nobody else can be
> accessing the inode, so we don't need masks nor a spinlock for
> checking the sick field.
>
> So why not:
>
> 	if ((flags & XFS_IGET_DONTCACHE) && !READ_ONCE(ip->i_sick))
>
> Or maybe still use xfs_inode_is_healthy() but convert it to the
> simpler, lockless sick check?

As Brian points out, it's totally unnecessary.

--D

> Cheers,
>
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
* Re: [PATCH 2/3] xfs: drop IDONTCACHE on inodes when we mark them sick
From: Brian Foster @ 2021-06-03 12:23 UTC
To: Darrick J. Wong; +Cc: linux-xfs, david

On Wed, Jun 02, 2021 at 08:12:46PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> When we decide to mark an inode sick, clear the DONTCACHE flag so that
> the incore inode will be kept around until memory pressure forces it out
> of memory.  This increases the chances that the sick status will be
> caught by someone compiling a health report later on.
>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/xfs_health.c |    5 +++++
>  fs/xfs/xfs_icache.c |    3 ++-
>  2 files changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/fs/xfs/xfs_health.c b/fs/xfs/xfs_health.c
> index 8e0cb05a7142..824e0b781290 100644
> --- a/fs/xfs/xfs_health.c
> +++ b/fs/xfs/xfs_health.c
> @@ -231,6 +231,11 @@ xfs_inode_mark_sick(
>  	ip->i_sick |= mask;
>  	ip->i_checked |= mask;
>  	spin_unlock(&ip->i_flags_lock);
> +
> +	/* Keep this inode around so we don't lose the sickness report. */
> +	spin_lock(&VFS_I(ip)->i_lock);
> +	VFS_I(ip)->i_state &= ~I_DONTCACHE;
> +	spin_unlock(&VFS_I(ip)->i_lock);

If I follow the scrub code correctly, it will grab a dontcache reference
on the inode, so presumably the intent here is to clear that status once
we've identified some problem to keep the inode around.  Seems
reasonable.

>  }
>
>  /* Mark parts of an inode healed. */
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index c3f912a9231b..0e2b6c05e604 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -23,6 +23,7 @@
>  #include "xfs_dquot.h"
>  #include "xfs_reflink.h"
>  #include "xfs_ialloc.h"
> +#include "xfs_health.h"
>
>  #include <linux/iversion.h>
>
> @@ -648,7 +649,7 @@ xfs_iget_cache_miss(
>  	 * time.
>  	 */
>  	iflags = XFS_INEW;
> -	if (flags & XFS_IGET_DONTCACHE)
> +	if ((flags & XFS_IGET_DONTCACHE) && xfs_inode_is_healthy(ip))
>  		d_mark_dontcache(VFS_I(ip));

This one I'm less clear on.. we've just allocated ip above and haven't
made it accessible yet.  What's the use case for finding an unhealthy
inode here?

Brian

>  	ip->i_udquot = NULL;
>  	ip->i_gdquot = NULL;
* Re: [PATCH 2/3] xfs: drop IDONTCACHE on inodes when we mark them sick
From: Darrick J. Wong @ 2021-06-03 20:48 UTC
To: Brian Foster; +Cc: linux-xfs, david

On Thu, Jun 03, 2021 at 08:23:21AM -0400, Brian Foster wrote:
> On Wed, Jun 02, 2021 at 08:12:46PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > When we decide to mark an inode sick, clear the DONTCACHE flag so that
> > the incore inode will be kept around until memory pressure forces it out
> > of memory.  This increases the chances that the sick status will be
> > caught by someone compiling a health report later on.
> >
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/xfs/xfs_health.c |    5 +++++
> >  fs/xfs/xfs_icache.c |    3 ++-
> >  2 files changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/xfs/xfs_health.c b/fs/xfs/xfs_health.c
> > index 8e0cb05a7142..824e0b781290 100644
> > --- a/fs/xfs/xfs_health.c
> > +++ b/fs/xfs/xfs_health.c
> > @@ -231,6 +231,11 @@ xfs_inode_mark_sick(
> >  	ip->i_sick |= mask;
> >  	ip->i_checked |= mask;
> >  	spin_unlock(&ip->i_flags_lock);
> > +
> > +	/* Keep this inode around so we don't lose the sickness report. */
> > +	spin_lock(&VFS_I(ip)->i_lock);
> > +	VFS_I(ip)->i_state &= ~I_DONTCACHE;
> > +	spin_unlock(&VFS_I(ip)->i_lock);
>
> If I follow the scrub code correctly, it will grab a dontcache reference
> on the inode, so presumably the intent here is to clear that status once
> we've identified some problem to keep the inode around.  Seems
> reasonable.

<nod> I'll expand the comment:

	/*
	 * Keep this inode around so we don't lose the sickness report.
	 * Scrub grabs inodes with DONTCACHE assuming that most inode
	 * are ok, which is not the case here.
	 */

> >  }
> >
> >  /* Mark parts of an inode healed. */
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index c3f912a9231b..0e2b6c05e604 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> > @@ -23,6 +23,7 @@
> >  #include "xfs_dquot.h"
> >  #include "xfs_reflink.h"
> >  #include "xfs_ialloc.h"
> > +#include "xfs_health.h"
> >
> >  #include <linux/iversion.h>
> >
> > @@ -648,7 +649,7 @@ xfs_iget_cache_miss(
> >  	 * time.
> >  	 */
> >  	iflags = XFS_INEW;
> > -	if (flags & XFS_IGET_DONTCACHE)
> > +	if ((flags & XFS_IGET_DONTCACHE) && xfs_inode_is_healthy(ip))
> >  		d_mark_dontcache(VFS_I(ip));
>
> This one I'm less clear on.. we've just allocated ip above and haven't
> made it accessible yet.  What's the use case for finding an unhealthy
> inode here?

Hm.  I think I went overboard looking for DONTCACHE here, and it
doesn't make any sense to make this change.  Ok, dropped.

--D

> Brian
>
> >  	ip->i_udquot = NULL;
> >  	ip->i_gdquot = NULL;
* [PATCH 3/3] xfs: don't let background reclaim forget sick inodes
From: Darrick J. Wong @ 2021-06-03  3:12 UTC
To: djwong; +Cc: linux-xfs, david, bfoster

From: Darrick J. Wong <djwong@kernel.org>

It's important that the filesystem retain its memory of sick inodes for
a little while after problems are found so that reports can be
collected about what was wrong.  Don't let background inode reclamation
free sick inodes unless we're under memory pressure.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_icache.c |   21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 0e2b6c05e604..54285d1ad574 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -911,7 +911,8 @@ xfs_dqrele_all_inodes(
  */
 static bool
 xfs_reclaim_igrab(
-	struct xfs_inode	*ip)
+	struct xfs_inode	*ip,
+	struct xfs_eofblocks	*eofb)
 {
 	ASSERT(rcu_read_lock_held());
 
@@ -922,6 +923,17 @@ xfs_reclaim_igrab(
 		spin_unlock(&ip->i_flags_lock);
 		return false;
 	}
+
+	/*
+	 * Don't reclaim a sick inode unless we're under memory pressure or
+	 * the filesystem is unmounting.
+	 */
+	if (ip->i_sick && eofb == NULL &&
+	    !(ip->i_mount->m_flags & XFS_MOUNT_UNMOUNTING)) {
+		spin_unlock(&ip->i_flags_lock);
+		return false;
+	}
+
 	__xfs_iflags_set(ip, XFS_IRECLAIM);
 	spin_unlock(&ip->i_flags_lock);
 	return true;
@@ -1606,7 +1618,8 @@ xfs_blockgc_free_quota(
 static inline bool
 xfs_icwalk_igrab(
 	enum xfs_icwalk_goal	goal,
-	struct xfs_inode	*ip)
+	struct xfs_inode	*ip,
+	struct xfs_eofblocks	*eofb)
 {
 	switch (goal) {
 	case XFS_ICWALK_DQRELE:
@@ -1614,7 +1627,7 @@ xfs_icwalk_igrab(
 	case XFS_ICWALK_BLOCKGC:
 		return xfs_blockgc_igrab(ip);
 	case XFS_ICWALK_RECLAIM:
-		return xfs_reclaim_igrab(ip);
+		return xfs_reclaim_igrab(ip, eofb);
 	default:
 		return false;
 	}
@@ -1703,7 +1716,7 @@ xfs_icwalk_ag(
 	for (i = 0; i < nr_found; i++) {
 		struct xfs_inode	*ip = batch[i];
 
-		if (done || !xfs_icwalk_igrab(goal, ip))
+		if (done || !xfs_icwalk_igrab(goal, ip, eofb))
 			batch[i] = NULL;
 
 		/*
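The gating logic this patch adds to `xfs_reclaim_igrab` can be sketched as a small predicate. This is a toy model under stated assumptions: the patch uses a non-NULL `eofb` as a proxy for "reclaim was invoked under memory pressure" (a choice the review below questions), which here becomes an explicit boolean, and the unmount check stands in for `XFS_MOUNT_UNMOUNTING`:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical miniature of struct xfs_inode for this sketch. */
struct model_inode {
	unsigned int	i_sick;		/* nonzero if scrub found damage */
};

/* Background reclaim skips sick inodes so their health reports stay
 * readable; memory-pressure reclaim and unmount still take everything. */
static bool model_reclaim_igrab(const struct model_inode *ip,
				bool under_pressure, bool unmounting)
{
	if (ip->i_sick && !under_pressure && !unmounting)
		return false;	/* leave the sickness report in memory */
	return true;		/* safe to reclaim this inode */
}
```

Dave's objection below amounts to dropping the `under_pressure` escape hatch entirely: sick inodes would survive until shutdown or unmount, since a filesystem with many of them needs repair rather than reclaim.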
* Re: [PATCH 3/3] xfs: don't let background reclaim forget sick inodes
  2021-06-03  3:12 ` [PATCH 3/3] xfs: don't let background reclaim forget sick inodes Darrick J. Wong
@ 2021-06-03  4:42 ` Dave Chinner
  2021-06-03 12:31   ` Brian Foster
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2021-06-03 4:42 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs, bfoster

On Wed, Jun 02, 2021 at 08:12:52PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> It's important that the filesystem retain its memory of sick inodes for
> a little while after problems are found so that reports can be collected
> about what was wrong. Don't let background inode reclamation free sick
> inodes unless we're under memory pressure.
>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/xfs_icache.c |   21 +++++++++++++++++----
>  1 file changed, 17 insertions(+), 4 deletions(-)
>
>
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 0e2b6c05e604..54285d1ad574 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -911,7 +911,8 @@ xfs_dqrele_all_inodes(
>   */
>  static bool
>  xfs_reclaim_igrab(
> -	struct xfs_inode	*ip)
> +	struct xfs_inode	*ip,
> +	struct xfs_eofblocks	*eofb)
>  {
>  	ASSERT(rcu_read_lock_held());
>
> @@ -922,6 +923,17 @@ xfs_reclaim_igrab(
>  		spin_unlock(&ip->i_flags_lock);
>  		return false;
>  	}
> +
> +	/*
> +	 * Don't reclaim a sick inode unless we're under memory pressure or the
> +	 * filesystem is unmounting.
> +	 */
> +	if (ip->i_sick && eofb == NULL &&
> +	    !(ip->i_mount->m_flags & XFS_MOUNT_UNMOUNTING)) {
> +		spin_unlock(&ip->i_flags_lock);
> +		return false;
> +	}

Using the "eofb == NULL" as a proxy for being under memory pressure
is ... a bit obtuse. If we've got a handful of sick inodes, then
there is no problem with just leaving them in memory regardless of
memory pressure. If we've got lots of sick inodes, we're likely to
end up in a shutdown state or be unmounted for checking real soon.

I'd just leave sick inodes around until unmount or shutdown occurs;
lots of sick inodes means repair is necessary right now, so
shutdown+unmount is the right solution here, not memory reclaim....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: [PATCH 3/3] xfs: don't let background reclaim forget sick inodes
  2021-06-03  4:42 ` Dave Chinner
@ 2021-06-03 12:31   ` Brian Foster
  2021-06-03 21:30     ` Darrick J. Wong
  0 siblings, 1 reply; 17+ messages in thread
From: Brian Foster @ 2021-06-03 12:31 UTC (permalink / raw)
To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs

On Thu, Jun 03, 2021 at 02:42:42PM +1000, Dave Chinner wrote:
> On Wed, Jun 02, 2021 at 08:12:52PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > It's important that the filesystem retain its memory of sick inodes for
> > a little while after problems are found so that reports can be collected
> > about what was wrong. Don't let background inode reclamation free sick
> > inodes unless we're under memory pressure.
> >
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/xfs/xfs_icache.c |   21 +++++++++++++++++----
> >  1 file changed, 17 insertions(+), 4 deletions(-)
> >
> >
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index 0e2b6c05e604..54285d1ad574 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> > @@ -911,7 +911,8 @@ xfs_dqrele_all_inodes(
> >   */
> >  static bool
> >  xfs_reclaim_igrab(
> > -	struct xfs_inode	*ip)
> > +	struct xfs_inode	*ip,
> > +	struct xfs_eofblocks	*eofb)
> >  {
> >  	ASSERT(rcu_read_lock_held());
> >
> > @@ -922,6 +923,17 @@ xfs_reclaim_igrab(
> >  		spin_unlock(&ip->i_flags_lock);
> >  		return false;
> >  	}
> > +
> > +	/*
> > +	 * Don't reclaim a sick inode unless we're under memory pressure or the
> > +	 * filesystem is unmounting.
> > +	 */
> > +	if (ip->i_sick && eofb == NULL &&
> > +	    !(ip->i_mount->m_flags & XFS_MOUNT_UNMOUNTING)) {
> > +		spin_unlock(&ip->i_flags_lock);
> > +		return false;
> > +	}
>
> Using the "eofb == NULL" as a proxy for being under memory pressure
> is ... a bit obtuse. If we've got a handful of sick inodes, then
> there is no problem with just leaving them in memory regardless of
> memory pressure. If we've got lots of sick inodes, we're likely to
> end up in a shutdown state or be unmounted for checking real soon.
>

Agreed.. it would be nice to see more explicit logic here. Using the
existence or not of an optional parameter meant to provide various
controls is quite fragile.

> I'd just leave sick inodes around until unmount or shutdown occurs;
> lots of sick inodes means repair is necessary right now, so
> shutdown+unmount is the right solution here, not memory reclaim....
>

That seems like a dependency on a loose correlation and rather
dangerous.. we're either assuming action on behalf of a user before the
built up state becomes a broader problem for the system or that somehow
a cascade of in-core inode problems is going to lead to a shutdown. I
don't think that is a guarantee, or even necessarily likely. I think if
we were to do something like pin sick inodes in memory indefinitely, as
you've pointed out in the past for other such things, we should at least
consider breakdown conditions and potential for unbound behavior.

IOW, if scrub decides it wants to pin sick inodes until shutdown, it
should probably implement some kind of worst case threshold where it
actually initiates shutdown based on broad health state. If we can't
reasonably define something like that, then to me that is a pretty clear
indication that an indefinite pinning strategy is probably too fragile.

OTOH, perhaps scrub has enough knowledge to implement some kind of
policy where a sick object is pinned until we know the state has been
queried at least once, then reclaim can have it? I guess we still may
want to be careful about things like how many sick objects a single
scrub scan can produce before there's an opportunity for userspace to
query status; it's not clear to me how much of an issue that might be..

In any event, this all seems moderately more involved to get right vs
what the current patch proposes. I think this patch is a reasonable step
if we can clean up the logic a bit. Perhaps define a flag that contexts
can use to explicitly reclaim or skip unhealthy inodes?

Brian

> Cheers,
>
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
* Re: [PATCH 3/3] xfs: don't let background reclaim forget sick inodes
  2021-06-03 12:31 ` Brian Foster
@ 2021-06-03 21:30   ` Darrick J. Wong
  0 siblings, 0 replies; 17+ messages in thread
From: Darrick J. Wong @ 2021-06-03 21:30 UTC (permalink / raw)
To: Brian Foster; +Cc: Dave Chinner, linux-xfs

On Thu, Jun 03, 2021 at 08:31:49AM -0400, Brian Foster wrote:
> On Thu, Jun 03, 2021 at 02:42:42PM +1000, Dave Chinner wrote:
> > On Wed, Jun 02, 2021 at 08:12:52PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > >
> > > It's important that the filesystem retain its memory of sick inodes for
> > > a little while after problems are found so that reports can be collected
> > > about what was wrong. Don't let background inode reclamation free sick
> > > inodes unless we're under memory pressure.
> > >
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > ---
> > >  fs/xfs/xfs_icache.c |   21 +++++++++++++++++----
> > >  1 file changed, 17 insertions(+), 4 deletions(-)
> > >
> > >
> > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > index 0e2b6c05e604..54285d1ad574 100644
> > > --- a/fs/xfs/xfs_icache.c
> > > +++ b/fs/xfs/xfs_icache.c
> > > @@ -911,7 +911,8 @@ xfs_dqrele_all_inodes(
> > >   */
> > >  static bool
> > >  xfs_reclaim_igrab(
> > > -	struct xfs_inode	*ip)
> > > +	struct xfs_inode	*ip,
> > > +	struct xfs_eofblocks	*eofb)
> > >  {
> > >  	ASSERT(rcu_read_lock_held());
> > >
> > > @@ -922,6 +923,17 @@ xfs_reclaim_igrab(
> > >  		spin_unlock(&ip->i_flags_lock);
> > >  		return false;
> > >  	}
> > > +
> > > +	/*
> > > +	 * Don't reclaim a sick inode unless we're under memory pressure or the
> > > +	 * filesystem is unmounting.
> > > +	 */
> > > +	if (ip->i_sick && eofb == NULL &&
> > > +	    !(ip->i_mount->m_flags & XFS_MOUNT_UNMOUNTING)) {
> > > +		spin_unlock(&ip->i_flags_lock);
> > > +		return false;
> > > +	}
> >
> > Using the "eofb == NULL" as a proxy for being under memory pressure
> > is ... a bit obtuse. If we've got a handful of sick inodes, then
> > there is no problem with just leaving them in memory regardless of
> > memory pressure. If we've got lots of sick inodes, we're likely to
> > end up in a shutdown state or be unmounted for checking real soon.
> >
>
> Agreed.. it would be nice to see more explicit logic here. Using the
> existence or not of an optional parameter meant to provide various
> controls is quite fragile.

Ok, I'll add a new private icwalk flag for reclaim callers to indicate
that it's ok to reclaim sick inodes:

	/* Don't reclaim a sick inode unless the caller asked for it. */
	if (ip->i_sick &&
	    (!icw || !(icw->icw_flags & XFS_ICWALK_FLAG_RECLAIM_SICK))) {
		spin_unlock(&ip->i_flags_lock);
		return false;
	}

And then xfs_reclaim_inodes becomes:

void
xfs_reclaim_inodes(
	struct xfs_mount	*mp)
{
	struct xfs_icwalk	icw = {
		.icw_flags	= 0,
	};

	if (xfs_want_reclaim_sick(mp))
		icw.icw_flags |= XFS_ICWALK_FLAG_RECLAIM_SICK;

	while (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_RECLAIM_TAG)) {
		xfs_ail_push_all_sync(mp->m_ail);
		xfs_icwalk(mp, XFS_ICWALK_RECLAIM, &icw);
	}
}

Similar changes apply to xfs_reclaim_inodes_nr.

> > I'd just leave sick inodes around until unmount or shutdown occurs;
> > lots of sick inodes means repair is necessary right now, so
> > shutdown+unmount is the right solution here, not memory reclaim....
> >
>
> That seems like a dependency on a loose correlation and rather
> dangerous.. we're either assuming action on behalf of a user before the
> built up state becomes a broader problem for the system or that somehow
> a cascade of in-core inode problems is going to lead to a shutdown. I
> don't think that is a guarantee, or even necessarily likely. I think if
> we were to do something like pin sick inodes in memory indefinitely, as
> you've pointed out in the past for other such things, we should at least
> consider breakdown conditions and potential for unbound behavior.

Yes. The subsequent health reporting patchset that I linked a few
responses ago is intended to help with the pinning behavior. With it,
we'll be able to save the fact that inodes within a given AG were sick
even if we're forced to give back the memory. At a later time, the
sysadmin can download the health report and initiate a scan that will
recover the specific sick info and schedule downtime or perform repairs.

https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=indirect-health-reporting

The trouble is, this is a user ABI change, so I'm trying to keep it out
of the landing path of deferred inactivation.

> IOW, if scrub decides it wants to pin sick inodes until shutdown, it
> should probably implement some kind of worst case threshold where it
> actually initiates shutdown based on broad health state.

It already can. As an example, "xfs_scrub -a 1000 -e shutdown" will
shut down the filesystem after 1,000 errors.

> If we can't
> reasonably define something like that, then to me that is a pretty clear
> indication that an indefinite pinning strategy is probably too fragile.

This might be the case anyway. Remember how I was working on some code
to set sick flags any time we saw a corruption anywhere in the
filesystem? If that ever gets fully implemented, we could very well end
up pinning tons of memory that will cause the system (probably) to swap
itself to death because we won't let go of the bad inodes.

> OTOH, perhaps scrub has enough knowledge to implement some kind of
> policy where a sick object is pinned until we know the state has been
> queried at least once, then reclaim can have it? I guess we still may
> want to be careful about things like how many sick objects a single
> scrub scan can produce before there's an opportunity for userspace to
> query status; it's not clear to me how much of an issue that might be..
>
> In any event, this all seems moderately more involved to get right vs
> what the current patch proposes. I think this patch is a reasonable step
> if we can clean up the logic a bit. Perhaps define a flag that contexts
> can use to explicitly reclaim or skip unhealthy inodes?

Done.

--D

> 
> Brian
> 
> > Cheers,
> >
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> >
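[Editorial aside: the flag-based policy proposed in this reply reduces to a single predicate, modeled below as a hypothetical userspace sketch — MODEL_ICWALK_FLAG_RECLAIM_SICK and the `model_` names are illustrative stand-ins, not the kernel definitions. Note the predicate skips a sick inode unless the caller explicitly set the reclaim-sick flag, which is what the comment in the reply asks for.]

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the proposed XFS_ICWALK_FLAG_RECLAIM_SICK. */
#define MODEL_ICWALK_FLAG_RECLAIM_SICK	(1u << 0)

struct model_icwalk {
	unsigned int	icw_flags;
};

/*
 * Don't reclaim a sick inode unless the caller asked for it: return
 * true (grab for reclaim) only when the inode is healthy or the walk
 * control structure carries the reclaim-sick flag.
 */
bool
model_reclaim_igrab(unsigned int i_sick, const struct model_icwalk *icw)
{
	if (i_sick &&
	    (icw == NULL || !(icw->icw_flags & MODEL_ICWALK_FLAG_RECLAIM_SICK)))
		return false;	/* keep the sick inode in memory */
	return true;
}
```

The advantage over the eofb==NULL test is that the policy is stated by the caller, so an unmount or shutdown path can opt in explicitly instead of being inferred from an unrelated parameter.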
* [PATCHSET v3 0/3] xfs: preserve inode health reports for longer
@ 2021-06-06 17:54 Darrick J. Wong
  2021-06-06 17:54 ` [PATCH 1/3] xfs: only reset incore inode health state flags when reclaiming an inode Darrick J. Wong
  0 siblings, 1 reply; 17+ messages in thread
From: Darrick J. Wong @ 2021-06-06 17:54 UTC (permalink / raw)
To: djwong; +Cc: Brian Foster, Dave Chinner, linux-xfs, david, bfoster

Hi all,

This is a quick series to make sure that inode sickness reports stick
around in memory for some amount of time.

v2: rebase to 5.13-rc4
v3: require explicit request to reclaim sick inodes, drop weird icache
    miss interaction with DONTCACHE

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything. Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fix-inode-health-reports-5.14
---
 fs/xfs/xfs_health.c |    9 +++++++++
 fs/xfs/xfs_icache.c |   50 +++++++++++++++++++++++++++++++++++++++++---------
 2 files changed, 50 insertions(+), 9 deletions(-)
* [PATCH 1/3] xfs: only reset incore inode health state flags when reclaiming an inode
  2021-06-06 17:54 [PATCHSET v3 0/3] xfs: preserve inode health reports for longer Darrick J. Wong
@ 2021-06-06 17:54 ` Darrick J. Wong
  2021-06-08 14:55   ` Carlos Maiolino
  0 siblings, 1 reply; 17+ messages in thread
From: Darrick J. Wong @ 2021-06-06 17:54 UTC (permalink / raw)
To: djwong; +Cc: Brian Foster, Dave Chinner, linux-xfs, david, bfoster

From: Darrick J. Wong <djwong@kernel.org>

While running some fuzz tests on inode metadata, I noticed that the
filesystem health report (as provided by xfs_spaceman) failed to report
the file corruption even when spaceman was run immediately after running
xfs_scrub to detect the corruption. That isn't the intended behavior;
one ought to be able to run scrub to detect errors in the ondisk
metadata and be able to access those reports for some time after the
scrub.

After running the same sequence through an instrumented kernel, I
discovered the reason why -- scrub igets the file, scans it, marks it
sick, and ireleases the inode. When the VFS lets go of the incore
inode, it moves to RECLAIMABLE state. If spaceman igets the incore
inode before it moves to RECLAIM state, iget reinitializes the VFS
state, clears the sick and checked masks, and hands back the inode. At
this point, the caller has the exact same incore inode, but with all the
health state erased.

In other words, we're erasing the incore inode's health state flags when
we've decided NOT to sever the link between the incore inode and the
ondisk inode. This is wrong, so we need to remove the lines that zero
the fields from xfs_iget_cache_hit.

As a precaution, we add the same lines into xfs_reclaim_inode just after
we sever the link between incore and ondisk inode. Strictly speaking
this isn't necessary because once an inode has gone through reclaim it
must go through xfs_inode_alloc (which also zeroes the state) and
xfs_iget is careful to check for mismatches between the inode it pulls
out of the radix tree and the one it wants.

Fixes: 6772c1f11206 ("xfs: track metadata health status")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_icache.c |    5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 396cc54ca03f..c3f912a9231b 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -523,9 +523,6 @@ xfs_iget_cache_hit(
 			XFS_INO_TO_AGINO(pag->pag_mount, ino),
 			XFS_ICI_RECLAIM_TAG);
 		inode->i_state = I_NEW;
-		ip->i_sick = 0;
-		ip->i_checked = 0;
-
 		spin_unlock(&ip->i_flags_lock);
 		spin_unlock(&pag->pag_ici_lock);
 	} else {
@@ -979,6 +976,8 @@ xfs_reclaim_inode(
 	spin_lock(&ip->i_flags_lock);
 	ip->i_flags = XFS_IRECLAIM;
 	ip->i_ino = 0;
+	ip->i_sick = 0;
+	ip->i_checked = 0;
 	spin_unlock(&ip->i_flags_lock);

 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
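[Editorial aside: the lifecycle the commit message describes can be sketched as a hypothetical userspace model — the `model_` names are illustrative, not kernel code. The point of the fix is visible in which function zeroes the health state: a cache hit on a reclaimable inode revives it without touching i_sick/i_checked, and only reclaim, which severs the incore/ondisk link, clears them.]

```c
#include <stdbool.h>

struct model_inode {
	unsigned int	i_sick;		/* health problems recorded by scrub */
	unsigned int	i_checked;	/* metadata types scrub has examined */
	bool		reclaimable;	/* VFS has let go of the inode */
};

/*
 * iget cache hit on a reclaimable inode: revive it for the new caller.
 * After the fix, the health state deliberately survives this path.
 */
void
model_iget_cache_hit(struct model_inode *ip)
{
	ip->reclaimable = false;
	/* no zeroing of i_sick / i_checked here -- that was the bug */
}

/*
 * Reclaim severs the link between the incore and ondisk inode, so this
 * is where the health state is zeroed.
 */
void
model_reclaim_inode(struct model_inode *ip)
{
	ip->i_sick = 0;
	ip->i_checked = 0;
	ip->reclaimable = false;
}
```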
* Re: [PATCH 1/3] xfs: only reset incore inode health state flags when reclaiming an inode
  2021-06-06 17:54 ` [PATCH 1/3] xfs: only reset incore inode health state flags when reclaiming an inode Darrick J. Wong
@ 2021-06-08 14:55   ` Carlos Maiolino
  0 siblings, 0 replies; 17+ messages in thread
From: Carlos Maiolino @ 2021-06-08 14:55 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Brian Foster, Dave Chinner, linux-xfs, david

On Sun, Jun 06, 2021 at 10:54:11AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> While running some fuzz tests on inode metadata, I noticed that the
> filesystem health report (as provided by xfs_spaceman) failed to report
> the file corruption even when spaceman was run immediately after running
> xfs_scrub to detect the corruption. That isn't the intended behavior;
> one ought to be able to run scrub to detect errors in the ondisk
> metadata and be able to access those reports for some time after the
> scrub.
>
> After running the same sequence through an instrumented kernel, I
> discovered the reason why -- scrub igets the file, scans it, marks it
> sick, and ireleases the inode. When the VFS lets go of the incore
> inode, it moves to RECLAIMABLE state. If spaceman igets the incore
> inode before it moves to RECLAIM state, iget reinitializes the VFS
> state, clears the sick and checked masks, and hands back the inode. At
> this point, the caller has the exact same incore inode, but with all the
> health state erased.
>
> In other words, we're erasing the incore inode's health state flags when
> we've decided NOT to sever the link between the incore inode and the
> ondisk inode. This is wrong, so we need to remove the lines that zero
> the fields from xfs_iget_cache_hit.
>
> As a precaution, we add the same lines into xfs_reclaim_inode just after
> we sever the link between incore and ondisk inode. Strictly speaking
> this isn't necessary because once an inode has gone through reclaim it
> must go through xfs_inode_alloc (which also zeroes the state) and
> xfs_iget is careful to check for mismatches between the inode it pulls
> out of the radix tree and the one it wants.
>
> ---

Looks good.
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>

>  fs/xfs/xfs_icache.c |    5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
>
>
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 396cc54ca03f..c3f912a9231b 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -523,9 +523,6 @@ xfs_iget_cache_hit(
> 			XFS_INO_TO_AGINO(pag->pag_mount, ino),
> 			XFS_ICI_RECLAIM_TAG);
> 		inode->i_state = I_NEW;
> -		ip->i_sick = 0;
> -		ip->i_checked = 0;
> -
> 		spin_unlock(&ip->i_flags_lock);
> 		spin_unlock(&pag->pag_ici_lock);
> 	} else {
> @@ -979,6 +976,8 @@ xfs_reclaim_inode(
> 	spin_lock(&ip->i_flags_lock);
> 	ip->i_flags = XFS_IRECLAIM;
> 	ip->i_ino = 0;
> +	ip->i_sick = 0;
> +	ip->i_checked = 0;
> 	spin_unlock(&ip->i_flags_lock);
>
> 	xfs_iunlock(ip, XFS_ILOCK_EXCL);

-- 
Carlos
end of thread, other threads:[~2021-06-08 14:55 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-03  3:12 [PATCHSET v2 0/3] xfs: preserve inode health reports for longer Darrick J. Wong
2021-06-03  3:12 ` [PATCH 1/3] xfs: only reset incore inode health state flags when reclaiming an inode Darrick J. Wong
2021-06-03  4:21   ` Dave Chinner
2021-06-03 20:41     ` Darrick J. Wong
2021-06-03 12:22   ` Brian Foster
2021-06-03 20:41     ` Darrick J. Wong
2021-06-03  3:12 ` [PATCH 2/3] xfs: drop IDONTCACHE on inodes when we mark them sick Darrick J. Wong
2021-06-03  4:34   ` Dave Chinner
2021-06-03 20:49     ` Darrick J. Wong
2021-06-03 12:23   ` Brian Foster
2021-06-03 20:48     ` Darrick J. Wong
2021-06-03  3:12 ` [PATCH 3/3] xfs: don't let background reclaim forget sick inodes Darrick J. Wong
2021-06-03  4:42   ` Dave Chinner
2021-06-03 12:31     ` Brian Foster
2021-06-03 21:30       ` Darrick J. Wong
2021-06-06 17:54 [PATCHSET v3 0/3] xfs: preserve inode health reports for longer Darrick J. Wong
2021-06-06 17:54 ` [PATCH 1/3] xfs: only reset incore inode health state flags when reclaiming an inode Darrick J. Wong
2021-06-08 14:55   ` Carlos Maiolino