* [PATCHSET v2 0/3] xfs: preserve inode health reports for longer
From: Darrick J. Wong @ 2021-06-03  3:12 UTC
To: djwong; +Cc: linux-xfs, david, bfoster

Hi all,

This is a quick series to make sure that inode sickness reports stick
around in memory for some amount of time.

v2: rebase to 5.13-rc4

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fix-inode-health-reports-5.14
---
 fs/xfs/xfs_health.c |    5 +++++
 fs/xfs/xfs_icache.c |   29 +++++++++++++++++++++--------
 2 files changed, 26 insertions(+), 8 deletions(-)
* [PATCH 1/3] xfs: only reset incore inode health state flags when reclaiming an inode
From: Darrick J. Wong @ 2021-06-03  3:12 UTC
To: djwong; +Cc: linux-xfs, david, bfoster

From: Darrick J. Wong <djwong@kernel.org>

While running some fuzz tests on inode metadata, I noticed that the
filesystem health report (as provided by xfs_spaceman) failed to report
the file corruption even when spaceman was run immediately after running
xfs_scrub to detect the corruption.  That isn't the intended behavior;
one ought to be able to run scrub to detect errors in the ondisk
metadata and be able to access those reports for some time after the
scrub.

After running the same sequence through an instrumented kernel, I
discovered the reason why -- scrub igets the file, scans it, marks it
sick, and ireleases the inode.  When the VFS lets go of the incore
inode, it moves to RECLAIMABLE state.  If spaceman igets the incore
inode before it moves to RECLAIM state, iget reinitializes the VFS
state, clears the sick and checked masks, and hands back the inode.  At
this point, the caller has the exact same incore inode, but with all the
health state erased.

In other words, we're erasing the incore inode's health state flags when
we've decided NOT to sever the link between the incore inode and the
ondisk inode.  This is wrong, so we need to remove the lines that zero
the fields from xfs_iget_cache_hit.

As a precaution, we add the same lines into xfs_reclaim_inode just after
we sever the link between incore and ondisk inode.  Strictly speaking
this isn't necessary because once an inode has gone through reclaim it
must go through xfs_inode_alloc (which also zeroes the state) and
xfs_iget is careful to check for mismatches between the inode it pulls
out of the radix tree and the one it wants.

Fixes: 6772c1f11206 ("xfs: track metadata health status")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_icache.c |    5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 396cc54ca03f..c3f912a9231b 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -523,9 +523,6 @@ xfs_iget_cache_hit(
 			XFS_INO_TO_AGINO(pag->pag_mount, ino),
 			XFS_ICI_RECLAIM_TAG);
 		inode->i_state = I_NEW;
-		ip->i_sick = 0;
-		ip->i_checked = 0;
-
 		spin_unlock(&ip->i_flags_lock);
 		spin_unlock(&pag->pag_ici_lock);
 	} else {
@@ -979,6 +976,8 @@ xfs_reclaim_inode(
 	spin_lock(&ip->i_flags_lock);
 	ip->i_flags = XFS_IRECLAIM;
 	ip->i_ino = 0;
+	ip->i_sick = 0;
+	ip->i_checked = 0;
 	spin_unlock(&ip->i_flags_lock);
 
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
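The state flow this patch describes can be sketched as a toy model. This is not kernel code: `model_inode` and the two helpers are hypothetical stand-ins for `struct xfs_inode`, `xfs_iget_cache_hit`, and `xfs_reclaim_inode`, kept only to show why the zeroing moves from the cache-hit path (where the incore/ondisk link survives) to the reclaim path (where it is severed):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical miniature of the incore inode's health fields; the real
 * struct xfs_inode carries many more members and real locking. */
struct model_inode {
	unsigned int	i_sick;		/* sickness flags set by scrub */
	unsigned int	i_checked;	/* which checks have been run */
	bool		linked;		/* incore <-> ondisk link intact */
};

/* Cache hit: reinitialize VFS state, but (after the fix) leave the
 * health state alone, because the ondisk link is not being severed. */
static void model_iget_cache_hit(struct model_inode *ip)
{
	ip->linked = true;
	/* i_sick and i_checked deliberately preserved */
}

/* Reclaim: the incore/ondisk link is severed, so the health state no
 * longer describes any reachable inode and is zeroed here instead. */
static void model_reclaim_inode(struct model_inode *ip)
{
	ip->linked = false;
	ip->i_sick = 0;
	ip->i_checked = 0;
}
```

Under this sketch, a second iget between RECLAIMABLE and RECLAIM hands back the same incore inode with its sickness report intact, which is exactly the spaceman-after-scrub case from the commit message.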
* Re: [PATCH 1/3] xfs: only reset incore inode health state flags when reclaiming an inode
From: Dave Chinner @ 2021-06-03  4:21 UTC
To: Darrick J. Wong; +Cc: linux-xfs, bfoster

On Wed, Jun 02, 2021 at 08:12:41PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> While running some fuzz tests on inode metadata, I noticed that the
> filesystem health report (as provided by xfs_spaceman) failed to report
> the file corruption even when spaceman was run immediately after running
> xfs_scrub to detect the corruption.  That isn't the intended behavior;
> one ought to be able to run scrub to detect errors in the ondisk
> metadata and be able to access to those reports for some time after the
> scrub.
>
> After running the same sequence through an instrumented kernel, I
> discovered the reason why -- scrub igets the file, scans it, marks it
> sick, and ireleases the inode.  When the VFS lets go of the incore
> inode, it moves to RECLAIMABLE state.  If spaceman igets the incore
> inode before it moves to RECLAIM state, iget reinitializes the VFS
> state, clears the sick and checked masks, and hands back the inode.  At
> this point, the caller has the exact same incore inode, but with all the
> health state erased.
>
> In other words, we're erasing the incore inode's health state flags when
> we've decided NOT to sever the link between the incore inode and the
> ondisk inode.  This is wrong, so we need to remove the lines that zero
> the fields from xfs_iget_cache_hit.
>
> As a precaution, we add the same lines into xfs_reclaim_inode just after
> we sever the link between incore and ondisk inode.  Strictly speaking
> this isn't necessary because once an inode has gone through reclaim it
> must go through xfs_inode_alloc (which also zeroes the state) and
> xfs_iget is careful to check for mismatches between the inode it pulls
> out of the radix tree and the one it wants.
>
> Fixes: 6772c1f11206 ("xfs: track metadata health status")
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/xfs_icache.c |    5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)

Looks fine.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

Though I do wonder how long such state will hang around, because once
the inode is IRECLAIMABLE and clean it is only a matter of seconds
before the background inode reclaimer will free it...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: [PATCH 1/3] xfs: only reset incore inode health state flags when reclaiming an inode
From: Darrick J. Wong @ 2021-06-03 20:41 UTC
To: Dave Chinner; +Cc: linux-xfs, bfoster

On Thu, Jun 03, 2021 at 02:21:07PM +1000, Dave Chinner wrote:
> On Wed, Jun 02, 2021 at 08:12:41PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > While running some fuzz tests on inode metadata, I noticed that the
> > filesystem health report (as provided by xfs_spaceman) failed to report
> > the file corruption even when spaceman was run immediately after running
> > xfs_scrub to detect the corruption.  That isn't the intended behavior;
> > one ought to be able to run scrub to detect errors in the ondisk
> > metadata and be able to access to those reports for some time after the
> > scrub.
> >
> > After running the same sequence through an instrumented kernel, I
> > discovered the reason why -- scrub igets the file, scans it, marks it
> > sick, and ireleases the inode.  When the VFS lets go of the incore
> > inode, it moves to RECLAIMABLE state.  If spaceman igets the incore
> > inode before it moves to RECLAIM state, iget reinitializes the VFS
> > state, clears the sick and checked masks, and hands back the inode.  At
> > this point, the caller has the exact same incore inode, but with all the
> > health state erased.
> >
> > In other words, we're erasing the incore inode's health state flags when
> > we've decided NOT to sever the link between the incore inode and the
> > ondisk inode.  This is wrong, so we need to remove the lines that zero
> > the fields from xfs_iget_cache_hit.
> >
> > As a precaution, we add the same lines into xfs_reclaim_inode just after
> > we sever the link between incore and ondisk inode.  Strictly speaking
> > this isn't necessary because once an inode has gone through reclaim it
> > must go through xfs_inode_alloc (which also zeroes the state) and
> > xfs_iget is careful to check for mismatches between the inode it pulls
> > out of the radix tree and the one it wants.
> >
> > Fixes: 6772c1f11206 ("xfs: track metadata health status")
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/xfs/xfs_icache.c |    5 ++---
> >  1 file changed, 2 insertions(+), 3 deletions(-)
>
> Looks fine.
>
> Reviewed-by: Dave Chinner <dchinner@redhat.com>
>
> Though I do wonder how long such state will hang around, because
> once the inode is IRECLAIMABLE and clean it is only a matter of
> seconds before the background inode reclaimer will free it...

A future patchset will add the ability to make the per-AG health status
remember that we were forced to reclaim a sick inode:

https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=indirect-health-reporting

This series used to be the "fixes first" part of that code, but since it
directly intersects with the deferred inactivation changes, I moved it
up to try to fix the problem sooner than later.

Anyway, thanks for the review.

--D

> Cheers,
>
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
* Re: [PATCH 1/3] xfs: only reset incore inode health state flags when reclaiming an inode
From: Brian Foster @ 2021-06-03 12:22 UTC
To: Darrick J. Wong; +Cc: linux-xfs, david

On Wed, Jun 02, 2021 at 08:12:41PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> While running some fuzz tests on inode metadata, I noticed that the
> filesystem health report (as provided by xfs_spaceman) failed to report
> the file corruption even when spaceman was run immediately after running
> xfs_scrub to detect the corruption.  That isn't the intended behavior;
> one ought to be able to run scrub to detect errors in the ondisk
> metadata and be able to access to those reports for some time after the
> scrub.
>
> After running the same sequence through an instrumented kernel, I
> discovered the reason why -- scrub igets the file, scans it, marks it
> sick, and ireleases the inode.  When the VFS lets go of the incore
> inode, it moves to RECLAIMABLE state.  If spaceman igets the incore
> inode before it moves to RECLAIM state, iget reinitializes the VFS
> state, clears the sick and checked masks, and hands back the inode.  At
> this point, the caller has the exact same incore inode, but with all the
> health state erased.
>
> In other words, we're erasing the incore inode's health state flags when
> we've decided NOT to sever the link between the incore inode and the
> ondisk inode.  This is wrong, so we need to remove the lines that zero
> the fields from xfs_iget_cache_hit.
>
> As a precaution, we add the same lines into xfs_reclaim_inode just after
> we sever the link between incore and ondisk inode.  Strictly speaking
> this isn't necessary because once an inode has gone through reclaim it
> must go through xfs_inode_alloc (which also zeroes the state) and
> xfs_iget is careful to check for mismatches between the inode it pulls
> out of the radix tree and the one it wants.
>
> Fixes: 6772c1f11206 ("xfs: track metadata health status")
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---

I think I reviewed this the last time around..

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_icache.c |    5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 396cc54ca03f..c3f912a9231b 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -523,9 +523,6 @@ xfs_iget_cache_hit(
>  			XFS_INO_TO_AGINO(pag->pag_mount, ino),
>  			XFS_ICI_RECLAIM_TAG);
>  		inode->i_state = I_NEW;
> -		ip->i_sick = 0;
> -		ip->i_checked = 0;
> -
>  		spin_unlock(&ip->i_flags_lock);
>  		spin_unlock(&pag->pag_ici_lock);
>  	} else {
> @@ -979,6 +976,8 @@ xfs_reclaim_inode(
>  	spin_lock(&ip->i_flags_lock);
>  	ip->i_flags = XFS_IRECLAIM;
>  	ip->i_ino = 0;
> +	ip->i_sick = 0;
> +	ip->i_checked = 0;
>  	spin_unlock(&ip->i_flags_lock);
>
>  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
* Re: [PATCH 1/3] xfs: only reset incore inode health state flags when reclaiming an inode
From: Darrick J. Wong @ 2021-06-03 20:41 UTC
To: Brian Foster; +Cc: linux-xfs, david

On Thu, Jun 03, 2021 at 08:22:10AM -0400, Brian Foster wrote:
> On Wed, Jun 02, 2021 at 08:12:41PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > [commit message snipped for this reply; see the quoted patch upthread]
> >
> > Fixes: 6772c1f11206 ("xfs: track metadata health status")
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
>
> I think I reviewed this the last time around..

Oops, yes, my bad. :(

--D

> Reviewed-by: Brian Foster <bfoster@redhat.com>
>
> >  fs/xfs/xfs_icache.c |    5 ++---
> >  1 file changed, 2 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index 396cc54ca03f..c3f912a9231b 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> > @@ -523,9 +523,6 @@ xfs_iget_cache_hit(
> >  			XFS_INO_TO_AGINO(pag->pag_mount, ino),
> >  			XFS_ICI_RECLAIM_TAG);
> >  		inode->i_state = I_NEW;
> > -		ip->i_sick = 0;
> > -		ip->i_checked = 0;
> > -
> >  		spin_unlock(&ip->i_flags_lock);
> >  		spin_unlock(&pag->pag_ici_lock);
> >  	} else {
> > @@ -979,6 +976,8 @@ xfs_reclaim_inode(
> >  	spin_lock(&ip->i_flags_lock);
> >  	ip->i_flags = XFS_IRECLAIM;
> >  	ip->i_ino = 0;
> > +	ip->i_sick = 0;
> > +	ip->i_checked = 0;
> >  	spin_unlock(&ip->i_flags_lock);
> >
> >  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
* [PATCH 2/3] xfs: drop IDONTCACHE on inodes when we mark them sick
From: Darrick J. Wong @ 2021-06-03  3:12 UTC
To: djwong; +Cc: linux-xfs, david, bfoster

From: Darrick J. Wong <djwong@kernel.org>

When we decide to mark an inode sick, clear the DONTCACHE flag so that
the incore inode will be kept around until memory pressure forces it
out of memory.  This increases the chances that the sick status will be
caught by someone compiling a health report later on.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_health.c |    5 +++++
 fs/xfs/xfs_icache.c |    3 ++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_health.c b/fs/xfs/xfs_health.c
index 8e0cb05a7142..824e0b781290 100644
--- a/fs/xfs/xfs_health.c
+++ b/fs/xfs/xfs_health.c
@@ -231,6 +231,11 @@ xfs_inode_mark_sick(
 	ip->i_sick |= mask;
 	ip->i_checked |= mask;
 	spin_unlock(&ip->i_flags_lock);
+
+	/* Keep this inode around so we don't lose the sickness report. */
+	spin_lock(&VFS_I(ip)->i_lock);
+	VFS_I(ip)->i_state &= ~I_DONTCACHE;
+	spin_unlock(&VFS_I(ip)->i_lock);
 }
 
 /* Mark parts of an inode healed. */
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index c3f912a9231b..0e2b6c05e604 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -23,6 +23,7 @@
 #include "xfs_dquot.h"
 #include "xfs_reflink.h"
 #include "xfs_ialloc.h"
+#include "xfs_health.h"
 
 #include <linux/iversion.h>
 
@@ -648,7 +649,7 @@ xfs_iget_cache_miss(
 	 * time.
 	 */
 	iflags = XFS_INEW;
-	if (flags & XFS_IGET_DONTCACHE)
+	if ((flags & XFS_IGET_DONTCACHE) && xfs_inode_is_healthy(ip))
 		d_mark_dontcache(VFS_I(ip));
 	ip->i_udquot = NULL;
 	ip->i_gdquot = NULL;
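The xfs_health.c hunk above is the core of the patch: marking an inode sick also drops the VFS's "evict as soon as possible" hint. A toy model of that coupling (hypothetical names; `I_DONTCACHE`'s value here is illustrative, not the kernel's, and the real code takes `i_flags_lock`/`i_lock` around these updates) might look like:

```c
#include <assert.h>

#define MODEL_I_DONTCACHE	0x100	/* illustrative flag value */

/* Hypothetical miniatures of struct inode and struct xfs_inode. */
struct model_vfs_inode {
	unsigned long	i_state;
};

struct model_inode {
	unsigned int		i_sick;
	unsigned int		i_checked;
	struct model_vfs_inode	vfs;
};

/* Marking an inode sick records the sickness and clears DONTCACHE so
 * the incore inode (and its report) survives until memory pressure. */
static void model_mark_sick(struct model_inode *ip, unsigned int mask)
{
	ip->i_sick |= mask;
	ip->i_checked |= mask;
	ip->vfs.i_state &= ~MODEL_I_DONTCACHE;
}
```

As the review below this patch works out, the companion `xfs_iget_cache_miss` hunk is redundant (a freshly allocated inode cannot already be sick), so only the mark-sick side of this model reflects what was kept.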
* Re: [PATCH 2/3] xfs: drop IDONTCACHE on inodes when we mark them sick
From: Dave Chinner @ 2021-06-03  4:34 UTC
To: Darrick J. Wong; +Cc: linux-xfs, bfoster

On Wed, Jun 02, 2021 at 08:12:46PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> When we decide to mark an inode sick, clear the DONTCACHE flag so that
> the incore inode will be kept around until memory pressure forces it out
> of memory.  This increases the chances that the sick status will be
> caught by someone compiling a health report later on.
>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/xfs_health.c |    5 +++++
>  fs/xfs/xfs_icache.c |    3 ++-
>  2 files changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/fs/xfs/xfs_health.c b/fs/xfs/xfs_health.c
> index 8e0cb05a7142..824e0b781290 100644
> --- a/fs/xfs/xfs_health.c
> +++ b/fs/xfs/xfs_health.c
> @@ -231,6 +231,11 @@ xfs_inode_mark_sick(
>  	ip->i_sick |= mask;
>  	ip->i_checked |= mask;
>  	spin_unlock(&ip->i_flags_lock);
> +
> +	/* Keep this inode around so we don't lose the sickness report. */
> +	spin_lock(&VFS_I(ip)->i_lock);
> +	VFS_I(ip)->i_state &= ~I_DONTCACHE;
> +	spin_unlock(&VFS_I(ip)->i_lock);
>  }

Dentries will still be reclaimed, but the VFS will at least hold on
to the inode in this case.

>  /* Mark parts of an inode healed. */
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index c3f912a9231b..0e2b6c05e604 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -23,6 +23,7 @@
>  #include "xfs_dquot.h"
>  #include "xfs_reflink.h"
>  #include "xfs_ialloc.h"
> +#include "xfs_health.h"
>
>  #include <linux/iversion.h>
>
> @@ -648,7 +649,7 @@ xfs_iget_cache_miss(
>  	 * time.
>  	 */
>  	iflags = XFS_INEW;
> -	if (flags & XFS_IGET_DONTCACHE)
> +	if ((flags & XFS_IGET_DONTCACHE) && xfs_inode_is_healthy(ip))

Hmmmm. xfs_inode_is_healthy() is kind of heavyweight for just
checking that ip->i_sick == 0.  At this point, nobody else can be
accessing the inode, so we don't need masks nor a spinlock for
checking the sick field.

So why not:

	if ((flags & XFS_IGET_DONTCACHE) && !READ_ONCE(ip->i_sick))

Or maybe still use xfs_inode_is_healthy() but convert it to the
simpler, lockless sick check?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: [PATCH 2/3] xfs: drop IDONTCACHE on inodes when we mark them sick
From: Darrick J. Wong @ 2021-06-03 20:49 UTC
To: Dave Chinner; +Cc: linux-xfs, bfoster

On Thu, Jun 03, 2021 at 02:34:46PM +1000, Dave Chinner wrote:
> On Wed, Jun 02, 2021 at 08:12:46PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > When we decide to mark an inode sick, clear the DONTCACHE flag so that
> > the incore inode will be kept around until memory pressure forces it out
> > of memory.  This increases the chances that the sick status will be
> > caught by someone compiling a health report later on.
> >
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/xfs/xfs_health.c |    5 +++++
> >  fs/xfs/xfs_icache.c |    3 ++-
> >  2 files changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/xfs/xfs_health.c b/fs/xfs/xfs_health.c
> > index 8e0cb05a7142..824e0b781290 100644
> > --- a/fs/xfs/xfs_health.c
> > +++ b/fs/xfs/xfs_health.c
> > @@ -231,6 +231,11 @@ xfs_inode_mark_sick(
> >  	ip->i_sick |= mask;
> >  	ip->i_checked |= mask;
> >  	spin_unlock(&ip->i_flags_lock);
> > +
> > +	/* Keep this inode around so we don't lose the sickness report. */
> > +	spin_lock(&VFS_I(ip)->i_lock);
> > +	VFS_I(ip)->i_state &= ~I_DONTCACHE;
> > +	spin_unlock(&VFS_I(ip)->i_lock);
> >  }
>
> Dentries will still be reclaimed, but the VFS will at least hold on
> to the inode in this case.

Right.

> >  /* Mark parts of an inode healed. */
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index c3f912a9231b..0e2b6c05e604 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> > @@ -23,6 +23,7 @@
> >  #include "xfs_dquot.h"
> >  #include "xfs_reflink.h"
> >  #include "xfs_ialloc.h"
> > +#include "xfs_health.h"
> >
> >  #include <linux/iversion.h>
> >
> > @@ -648,7 +649,7 @@ xfs_iget_cache_miss(
> >  	 * time.
> >  	 */
> >  	iflags = XFS_INEW;
> > -	if (flags & XFS_IGET_DONTCACHE)
> > +	if ((flags & XFS_IGET_DONTCACHE) && xfs_inode_is_healthy(ip))
>
> Hmmmm. xfs_inode_is_healthy() is kind of heavyweight for just
> checking that ip->i_sick == 0.  At this point, nobody else can be
> accessing the inode, so we don't need masks nor a spinlock for
> checking the sick field.
>
> So why not:
>
> 	if ((flags & XFS_IGET_DONTCACHE) && !READ_ONCE(ip->i_sick))
>
> Or maybe still use xfs_inode_is_healthy() but convert it to the
> simpler, lockless sick check?

As Brian points out, it's totally unnecessary.

--D

> Cheers,
>
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
* Re: [PATCH 2/3] xfs: drop IDONTCACHE on inodes when we mark them sick
From: Brian Foster @ 2021-06-03 12:23 UTC
To: Darrick J. Wong; +Cc: linux-xfs, david

On Wed, Jun 02, 2021 at 08:12:46PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> When we decide to mark an inode sick, clear the DONTCACHE flag so that
> the incore inode will be kept around until memory pressure forces it out
> of memory.  This increases the chances that the sick status will be
> caught by someone compiling a health report later on.
>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/xfs_health.c |    5 +++++
>  fs/xfs/xfs_icache.c |    3 ++-
>  2 files changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/fs/xfs/xfs_health.c b/fs/xfs/xfs_health.c
> index 8e0cb05a7142..824e0b781290 100644
> --- a/fs/xfs/xfs_health.c
> +++ b/fs/xfs/xfs_health.c
> @@ -231,6 +231,11 @@ xfs_inode_mark_sick(
>  	ip->i_sick |= mask;
>  	ip->i_checked |= mask;
>  	spin_unlock(&ip->i_flags_lock);
> +
> +	/* Keep this inode around so we don't lose the sickness report. */
> +	spin_lock(&VFS_I(ip)->i_lock);
> +	VFS_I(ip)->i_state &= ~I_DONTCACHE;
> +	spin_unlock(&VFS_I(ip)->i_lock);

If I follow the scrub code correctly, it will grab a dontcache reference
on the inode, so presumably the intent here is to clear that status once
we've identified some problem to keep the inode around.  Seems
reasonable.

>  }
>
>  /* Mark parts of an inode healed. */
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index c3f912a9231b..0e2b6c05e604 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -23,6 +23,7 @@
>  #include "xfs_dquot.h"
>  #include "xfs_reflink.h"
>  #include "xfs_ialloc.h"
> +#include "xfs_health.h"
>
>  #include <linux/iversion.h>
>
> @@ -648,7 +649,7 @@ xfs_iget_cache_miss(
>  	 * time.
>  	 */
>  	iflags = XFS_INEW;
> -	if (flags & XFS_IGET_DONTCACHE)
> +	if ((flags & XFS_IGET_DONTCACHE) && xfs_inode_is_healthy(ip))
>  		d_mark_dontcache(VFS_I(ip));

This one I'm less clear on.. we've just allocated ip above and haven't
made it accessible yet.  What's the use case for finding an unhealthy
inode here?

Brian

>  	ip->i_udquot = NULL;
>  	ip->i_gdquot = NULL;
* Re: [PATCH 2/3] xfs: drop IDONTCACHE on inodes when we mark them sick
From: Darrick J. Wong @ 2021-06-03 20:48 UTC
To: Brian Foster; +Cc: linux-xfs, david

On Thu, Jun 03, 2021 at 08:23:21AM -0400, Brian Foster wrote:
> On Wed, Jun 02, 2021 at 08:12:46PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > When we decide to mark an inode sick, clear the DONTCACHE flag so that
> > the incore inode will be kept around until memory pressure forces it out
> > of memory.  This increases the chances that the sick status will be
> > caught by someone compiling a health report later on.
> >
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/xfs/xfs_health.c |    5 +++++
> >  fs/xfs/xfs_icache.c |    3 ++-
> >  2 files changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/xfs/xfs_health.c b/fs/xfs/xfs_health.c
> > index 8e0cb05a7142..824e0b781290 100644
> > --- a/fs/xfs/xfs_health.c
> > +++ b/fs/xfs/xfs_health.c
> > @@ -231,6 +231,11 @@ xfs_inode_mark_sick(
> >  	ip->i_sick |= mask;
> >  	ip->i_checked |= mask;
> >  	spin_unlock(&ip->i_flags_lock);
> > +
> > +	/* Keep this inode around so we don't lose the sickness report. */
> > +	spin_lock(&VFS_I(ip)->i_lock);
> > +	VFS_I(ip)->i_state &= ~I_DONTCACHE;
> > +	spin_unlock(&VFS_I(ip)->i_lock);
>
> If I follow the scrub code correctly, it will grab a dontcache reference
> on the inode, so presumably the intent here is to clear that status once
> we've identified some problem to keep the inode around.  Seems
> reasonable.

<nod> I'll expand the comment:

	/*
	 * Keep this inode around so we don't lose the sickness report.
	 * Scrub grabs inodes with DONTCACHE assuming that most inode
	 * are ok, which is not the case here.
	 */

> >  }
> >
> >  /* Mark parts of an inode healed. */
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index c3f912a9231b..0e2b6c05e604 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> > @@ -23,6 +23,7 @@
> >  #include "xfs_dquot.h"
> >  #include "xfs_reflink.h"
> >  #include "xfs_ialloc.h"
> > +#include "xfs_health.h"
> >
> >  #include <linux/iversion.h>
> >
> > @@ -648,7 +649,7 @@ xfs_iget_cache_miss(
> >  	 * time.
> >  	 */
> >  	iflags = XFS_INEW;
> > -	if (flags & XFS_IGET_DONTCACHE)
> > +	if ((flags & XFS_IGET_DONTCACHE) && xfs_inode_is_healthy(ip))
> >  		d_mark_dontcache(VFS_I(ip));
>
> This one I'm less clear on.. we've just allocated ip above and haven't
> made it accessible yet.  What's the use case for finding an unhealthy
> inode here?

Hm.  I think I went overboard looking for DONTCACHE here, and it
doesn't make any sense to make this change.  Ok, dropped.

--D

> Brian
>
> >  	ip->i_udquot = NULL;
> >  	ip->i_gdquot = NULL;
* [PATCH 3/3] xfs: don't let background reclaim forget sick inodes
From: Darrick J. Wong @ 2021-06-03  3:12 UTC
To: djwong; +Cc: linux-xfs, david, bfoster

From: Darrick J. Wong <djwong@kernel.org>

It's important that the filesystem retain its memory of sick inodes for
a little while after problems are found so that reports can be
collected about what was wrong.  Don't let background inode reclamation
free sick inodes unless we're under memory pressure.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_icache.c |   21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 0e2b6c05e604..54285d1ad574 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -911,7 +911,8 @@ xfs_dqrele_all_inodes(
  */
 static bool
 xfs_reclaim_igrab(
-	struct xfs_inode	*ip)
+	struct xfs_inode	*ip,
+	struct xfs_eofblocks	*eofb)
 {
 	ASSERT(rcu_read_lock_held());
 
@@ -922,6 +923,17 @@ xfs_reclaim_igrab(
 		spin_unlock(&ip->i_flags_lock);
 		return false;
 	}
+
+	/*
+	 * Don't reclaim a sick inode unless we're under memory pressure or
+	 * the filesystem is unmounting.
+	 */
+	if (ip->i_sick && eofb == NULL &&
+	    !(ip->i_mount->m_flags & XFS_MOUNT_UNMOUNTING)) {
+		spin_unlock(&ip->i_flags_lock);
+		return false;
+	}
+
 	__xfs_iflags_set(ip, XFS_IRECLAIM);
 	spin_unlock(&ip->i_flags_lock);
 	return true;
@@ -1606,7 +1618,8 @@ xfs_blockgc_free_quota(
 static inline bool
 xfs_icwalk_igrab(
 	enum xfs_icwalk_goal	goal,
-	struct xfs_inode	*ip)
+	struct xfs_inode	*ip,
+	struct xfs_eofblocks	*eofb)
 {
 	switch (goal) {
 	case XFS_ICWALK_DQRELE:
@@ -1614,7 +1627,7 @@ xfs_icwalk_igrab(
 	case XFS_ICWALK_BLOCKGC:
 		return xfs_blockgc_igrab(ip);
 	case XFS_ICWALK_RECLAIM:
-		return xfs_reclaim_igrab(ip);
+		return xfs_reclaim_igrab(ip, eofb);
 	default:
 		return false;
 	}
@@ -1703,7 +1716,7 @@ xfs_icwalk_ag(
 	for (i = 0; i < nr_found; i++) {
 		struct xfs_inode	*ip = batch[i];
 
-		if (done || !xfs_icwalk_igrab(goal, ip))
+		if (done || !xfs_icwalk_igrab(goal, ip, eofb))
 			batch[i] = NULL;
 
 		/*
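The gating logic this patch adds to `xfs_reclaim_igrab` can be sketched as a small predicate. This is a toy model under stated assumptions: the patch uses a non-NULL `eofb` as a proxy for "reclaim was invoked under memory pressure" (a choice the review below questions), which here becomes an explicit boolean, and the unmount check stands in for `XFS_MOUNT_UNMOUNTING`:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical miniature of struct xfs_inode for this sketch. */
struct model_inode {
	unsigned int	i_sick;		/* nonzero if scrub found damage */
};

/* Background reclaim skips sick inodes so their health reports stay
 * readable; memory-pressure reclaim and unmount still take everything. */
static bool model_reclaim_igrab(const struct model_inode *ip,
				bool under_pressure, bool unmounting)
{
	if (ip->i_sick && !under_pressure && !unmounting)
		return false;	/* leave the sickness report in memory */
	return true;		/* safe to reclaim this inode */
}
```

Dave's objection below amounts to dropping the `under_pressure` escape hatch entirely: sick inodes would survive until shutdown or unmount, since a filesystem with many of them needs repair rather than reclaim.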
* Re: [PATCH 3/3] xfs: don't let background reclaim forget sick inodes
  2021-06-03  3:12 ` [PATCH 3/3] xfs: don't let background reclaim forget sick inodes Darrick J. Wong
@ 2021-06-03  4:42 ` Dave Chinner
  2021-06-03 12:31   ` Brian Foster
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2021-06-03 4:42 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs, bfoster

On Wed, Jun 02, 2021 at 08:12:52PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> It's important that the filesystem retain its memory of sick inodes for
> a little while after problems are found so that reports can be collected
> about what was wrong. Don't let background inode reclamation free sick
> inodes unless we're under memory pressure.
>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/xfs_icache.c |   21 +++++++++++++++++----
>  1 file changed, 17 insertions(+), 4 deletions(-)
>
>
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 0e2b6c05e604..54285d1ad574 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -911,7 +911,8 @@ xfs_dqrele_all_inodes(
>   */
>  static bool
>  xfs_reclaim_igrab(
> -	struct xfs_inode	*ip)
> +	struct xfs_inode	*ip,
> +	struct xfs_eofblocks	*eofb)
>  {
>  	ASSERT(rcu_read_lock_held());
>
> @@ -922,6 +923,17 @@ xfs_reclaim_igrab(
>  		spin_unlock(&ip->i_flags_lock);
>  		return false;
>  	}
> +
> +	/*
> +	 * Don't reclaim a sick inode unless we're under memory pressure or the
> +	 * filesystem is unmounting.
> +	 */
> +	if (ip->i_sick && eofb == NULL &&
> +	    !(ip->i_mount->m_flags & XFS_MOUNT_UNMOUNTING)) {
> +		spin_unlock(&ip->i_flags_lock);
> +		return false;
> +	}

Using the "eofb == NULL" as a proxy for being under memory pressure
is ... a bit obtuse. If we've got a handful of sick inodes, then
there is no problem with just leaving them in memory regardless of
memory pressure. If we've got lots of sick inodes, we're likely to
end up in a shutdown state or be unmounted for checking real soon.

I'd just leave sick inodes around until unmount or shutdown occurs;
lots of sick inodes means repair is necessary right now, so
shutdown+unmount is the right solution here, not memory reclaim....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: [PATCH 3/3] xfs: don't let background reclaim forget sick inodes
  2021-06-03  4:42 ` Dave Chinner
@ 2021-06-03 12:31   ` Brian Foster
  2021-06-03 21:30     ` Darrick J. Wong
  0 siblings, 1 reply; 17+ messages in thread
From: Brian Foster @ 2021-06-03 12:31 UTC (permalink / raw)
To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs

On Thu, Jun 03, 2021 at 02:42:42PM +1000, Dave Chinner wrote:
> On Wed, Jun 02, 2021 at 08:12:52PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > It's important that the filesystem retain its memory of sick inodes for
> > a little while after problems are found so that reports can be collected
> > about what was wrong. Don't let background inode reclamation free sick
> > inodes unless we're under memory pressure.
> >
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/xfs/xfs_icache.c |   21 +++++++++++++++++----
> >  1 file changed, 17 insertions(+), 4 deletions(-)
> >
> >
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index 0e2b6c05e604..54285d1ad574 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> > @@ -911,7 +911,8 @@ xfs_dqrele_all_inodes(
> >   */
> >  static bool
> >  xfs_reclaim_igrab(
> > -	struct xfs_inode	*ip)
> > +	struct xfs_inode	*ip,
> > +	struct xfs_eofblocks	*eofb)
> >  {
> >  	ASSERT(rcu_read_lock_held());
> >
> > @@ -922,6 +923,17 @@ xfs_reclaim_igrab(
> >  		spin_unlock(&ip->i_flags_lock);
> >  		return false;
> >  	}
> > +
> > +	/*
> > +	 * Don't reclaim a sick inode unless we're under memory pressure or the
> > +	 * filesystem is unmounting.
> > +	 */
> > +	if (ip->i_sick && eofb == NULL &&
> > +	    !(ip->i_mount->m_flags & XFS_MOUNT_UNMOUNTING)) {
> > +		spin_unlock(&ip->i_flags_lock);
> > +		return false;
> > +	}
>
> Using the "eofb == NULL" as a proxy for being under memory pressure
> is ... a bit obtuse. If we've got a handful of sick inodes, then
> there is no problem with just leaving them in memory regardless of
> memory pressure. If we've got lots of sick inodes, we're likely to
> end up in a shutdown state or be unmounted for checking real soon.
>

Agreed.. it would be nice to see more explicit logic here. Using the
existence or not of an optional parameter meant to provide various
controls is quite fragile.

> I'd just leave sick inodes around until unmount or shutdown occurs;
> lots of sick inodes means repair is necessary right now, so
> shutdown+unmount is the right solution here, not memory reclaim....
>

That seems like a dependency on a loose correlation and rather
dangerous.. we're either assuming action on behalf of a user before the
built up state becomes a broader problem for the system or that somehow
a cascade of in-core inode problems is going to lead to a shutdown. I
don't think that is a guarantee, or even necessarily likely. I think if
we were to do something like pin sick inodes in memory indefinitely, as
you've pointed out in the past for other such things, we should at least
consider breakdown conditions and potential for unbound behavior.

IOW, if scrub decides it wants to pin sick inodes until shutdown, it
should probably implement some kind of worst case threshold where it
actually initiates shutdown based on broad health state. If we can't
reasonably define something like that, then to me that is a pretty clear
indication that an indefinite pinning strategy is probably too fragile.

OTOH, perhaps scrub has enough knowledge to implement some kind of
policy where a sick object is pinned until we know the state has been
queried at least once, then reclaim can have it? I guess we still may
want to be careful about things like how many sick objects a single
scrub scan can produce before there's an opportunity for userspace to
query status; it's not clear to me how much of an issue that might be..

In any event, this all seems moderately more involved to get right vs
what the current patch proposes. I think this patch is a reasonable step
if we can clean up the logic a bit. Perhaps define a flag that contexts
can use to explicitly reclaim or skip unhealthy inodes?

Brian

> Cheers,
>
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
* Re: [PATCH 3/3] xfs: don't let background reclaim forget sick inodes
  2021-06-03 12:31 ` Brian Foster
@ 2021-06-03 21:30   ` Darrick J. Wong
  0 siblings, 0 replies; 17+ messages in thread
From: Darrick J. Wong @ 2021-06-03 21:30 UTC (permalink / raw)
To: Brian Foster; +Cc: Dave Chinner, linux-xfs

On Thu, Jun 03, 2021 at 08:31:49AM -0400, Brian Foster wrote:
> On Thu, Jun 03, 2021 at 02:42:42PM +1000, Dave Chinner wrote:
> > On Wed, Jun 02, 2021 at 08:12:52PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > >
> > > It's important that the filesystem retain its memory of sick inodes for
> > > a little while after problems are found so that reports can be collected
> > > about what was wrong. Don't let background inode reclamation free sick
> > > inodes unless we're under memory pressure.
> > >
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > ---
> > >  fs/xfs/xfs_icache.c |   21 +++++++++++++++++----
> > >  1 file changed, 17 insertions(+), 4 deletions(-)
> > >
> > >
> > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > index 0e2b6c05e604..54285d1ad574 100644
> > > --- a/fs/xfs/xfs_icache.c
> > > +++ b/fs/xfs/xfs_icache.c
> > > @@ -911,7 +911,8 @@ xfs_dqrele_all_inodes(
> > >   */
> > >  static bool
> > >  xfs_reclaim_igrab(
> > > -	struct xfs_inode	*ip)
> > > +	struct xfs_inode	*ip,
> > > +	struct xfs_eofblocks	*eofb)
> > >  {
> > >  	ASSERT(rcu_read_lock_held());
> > >
> > > @@ -922,6 +923,17 @@ xfs_reclaim_igrab(
> > >  		spin_unlock(&ip->i_flags_lock);
> > >  		return false;
> > >  	}
> > > +
> > > +	/*
> > > +	 * Don't reclaim a sick inode unless we're under memory pressure or the
> > > +	 * filesystem is unmounting.
> > > +	 */
> > > +	if (ip->i_sick && eofb == NULL &&
> > > +	    !(ip->i_mount->m_flags & XFS_MOUNT_UNMOUNTING)) {
> > > +		spin_unlock(&ip->i_flags_lock);
> > > +		return false;
> > > +	}
> >
> > Using the "eofb == NULL" as a proxy for being under memory pressure
> > is ... a bit obtuse. If we've got a handful of sick inodes, then
> > there is no problem with just leaving them in memory regardless of
> > memory pressure. If we've got lots of sick inodes, we're likely to
> > end up in a shutdown state or be unmounted for checking real soon.
> >
>
> Agreed.. it would be nice to see more explicit logic here. Using the
> existence or not of an optional parameter meant to provide various
> controls is quite fragile.

Ok, I'll add a new private icwalk flag for reclaim callers to indicate
that it's ok to reclaim sick inodes:

	/* Don't reclaim a sick inode unless the caller asked for it. */
	if (ip->i_sick &&
	    (!icw || !(icw->icw_flags & XFS_ICWALK_FLAG_RECLAIM_SICK))) {
		spin_unlock(&ip->i_flags_lock);
		return false;
	}

And then xfs_reclaim_inodes becomes:

void
xfs_reclaim_inodes(
	struct xfs_mount	*mp)
{
	struct xfs_icwalk	icw = {
		.icw_flags	= 0,
	};

	if (xfs_want_reclaim_sick(mp))
		icw.icw_flags |= XFS_ICWALK_FLAG_RECLAIM_SICK;

	while (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_RECLAIM_TAG)) {
		xfs_ail_push_all_sync(mp->m_ail);
		xfs_icwalk(mp, XFS_ICWALK_RECLAIM, &icw);
	}
}

Similar changes apply to xfs_reclaim_inodes_nr.

> > I'd just leave sick inodes around until unmount or shutdown occurs;
> > lots of sick inodes means repair is necessary right now, so
> > shutdown+unmount is the right solution here, not memory reclaim....
> >
>
> That seems like a dependency on a loose correlation and rather
> dangerous.. we're either assuming action on behalf of a user before the
> built up state becomes a broader problem for the system or that somehow
> a cascade of in-core inode problems is going to lead to a shutdown. I
> don't think that is a guarantee, or even necessarily likely. I think if
> we were to do something like pin sick inodes in memory indefinitely, as
> you've pointed out in the past for other such things, we should at least
> consider breakdown conditions and potential for unbound behavior.

Yes. The subsequent health reporting patchset that I linked a few
responses ago is intended to help with the pinning behavior. With it,
we'll be able to save the fact that inodes within a given AG were sick
even if we're forced to give back the memory. At a later time, the
sysadmin can download the health report and initiate a scan that will
recover the specific sick info and schedule downtime or perform repairs.

https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=indirect-health-reporting

The trouble is, this is a user ABI change, so I'm trying to keep it out
of the landing path of deferred inactivation.

> IOW, if scrub decides it wants to pin sick inodes until shutdown, it
> should probably implement some kind of worst case threshold where it
> actually initiates shutdown based on broad health state.

It already can. As an example, "xfs_scrub -a 1000 -e shutdown" will
shut down the filesystem after 1,000 errors.

> If we can't
> reasonably define something like that, then to me that is a pretty clear
> indication that an indefinite pinning strategy is probably too fragile.

This might be the case anyway. Remember how I was working on some code
to set sick flags any time we saw a corruption anywhere in the
filesystem? If that ever gets fully implemented, we could very well end
up pinning tons of memory that will cause the system (probably) to swap
itself to death because we won't let go of the bad inodes.

> OTOH, perhaps scrub has enough knowledge to implement some kind of
> policy where a sick object is pinned until we know the state has been
> queried at least once, then reclaim can have it? I guess we still may
> want to be careful about things like how many sick objects a single
> scrub scan can produce before there's an opportunity for userspace to
> query status; it's not clear to me how much of an issue that might be..
>
> In any event, this all seems moderately more involved to get right vs
> what the current patch proposes. I think this patch is a reasonable step
> if we can clean up the logic a bit. Perhaps define a flag that contexts
> can use to explicitly reclaim or skip unhealthy inodes?

Done.

--D

> 
> Brian
> 
> > Cheers,
> >
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> >
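[Editorial aside: the flag-based policy proposed in this reply reduces to a single predicate, modeled below as a hypothetical userspace sketch — MODEL_ICWALK_FLAG_RECLAIM_SICK and the `model_` names are illustrative stand-ins, not the kernel definitions. Note the predicate skips a sick inode unless the caller explicitly set the reclaim-sick flag, which is what the comment in the reply asks for.]

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the proposed XFS_ICWALK_FLAG_RECLAIM_SICK. */
#define MODEL_ICWALK_FLAG_RECLAIM_SICK	(1u << 0)

struct model_icwalk {
	unsigned int	icw_flags;
};

/*
 * Don't reclaim a sick inode unless the caller asked for it: return
 * true (grab for reclaim) only when the inode is healthy or the walk
 * control structure carries the reclaim-sick flag.
 */
bool
model_reclaim_igrab(unsigned int i_sick, const struct model_icwalk *icw)
{
	if (i_sick &&
	    (icw == NULL || !(icw->icw_flags & MODEL_ICWALK_FLAG_RECLAIM_SICK)))
		return false;	/* keep the sick inode in memory */
	return true;
}
```

The advantage over the eofb==NULL test is that the policy is stated by the caller, so an unmount or shutdown path can opt in explicitly instead of being inferred from an unrelated parameter.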
* [PATCHSET v3 0/3] xfs: preserve inode health reports for longer
@ 2021-06-06 17:54 Darrick J. Wong
  2021-06-06 17:54 ` [PATCH 1/3] xfs: only reset incore inode health state flags when reclaiming an inode Darrick J. Wong
  0 siblings, 1 reply; 17+ messages in thread
From: Darrick J. Wong @ 2021-06-06 17:54 UTC (permalink / raw)
To: djwong; +Cc: Brian Foster, Dave Chinner, linux-xfs, david, bfoster

Hi all,

This is a quick series to make sure that inode sickness reports stick
around in memory for some amount of time.

v2: rebase to 5.13-rc4
v3: require explicit request to reclaim sick inodes, drop weird icache
    miss interaction with DONTCACHE

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything. Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fix-inode-health-reports-5.14
---
 fs/xfs/xfs_health.c |    9 +++++++++
 fs/xfs/xfs_icache.c |   50 +++++++++++++++++++++++++++++++++++++++++---------
 2 files changed, 50 insertions(+), 9 deletions(-)
* [PATCH 1/3] xfs: only reset incore inode health state flags when reclaiming an inode
  2021-06-06 17:54 [PATCHSET v3 0/3] xfs: preserve inode health reports for longer Darrick J. Wong
@ 2021-06-06 17:54 ` Darrick J. Wong
  2021-06-08 14:55   ` Carlos Maiolino
  0 siblings, 1 reply; 17+ messages in thread
From: Darrick J. Wong @ 2021-06-06 17:54 UTC (permalink / raw)
To: djwong; +Cc: Brian Foster, Dave Chinner, linux-xfs, david, bfoster

From: Darrick J. Wong <djwong@kernel.org>

While running some fuzz tests on inode metadata, I noticed that the
filesystem health report (as provided by xfs_spaceman) failed to report
the file corruption even when spaceman was run immediately after running
xfs_scrub to detect the corruption. That isn't the intended behavior;
one ought to be able to run scrub to detect errors in the ondisk
metadata and be able to access those reports for some time after the
scrub.

After running the same sequence through an instrumented kernel, I
discovered the reason why -- scrub igets the file, scans it, marks it
sick, and ireleases the inode. When the VFS lets go of the incore
inode, it moves to RECLAIMABLE state. If spaceman igets the incore
inode before it moves to RECLAIM state, iget reinitializes the VFS
state, clears the sick and checked masks, and hands back the inode. At
this point, the caller has the exact same incore inode, but with all the
health state erased.

In other words, we're erasing the incore inode's health state flags when
we've decided NOT to sever the link between the incore inode and the
ondisk inode. This is wrong, so we need to remove the lines that zero
the fields from xfs_iget_cache_hit.

As a precaution, we add the same lines into xfs_reclaim_inode just after
we sever the link between incore and ondisk inode. Strictly speaking
this isn't necessary because once an inode has gone through reclaim it
must go through xfs_inode_alloc (which also zeroes the state) and
xfs_iget is careful to check for mismatches between the inode it pulls
out of the radix tree and the one it wants.

Fixes: 6772c1f11206 ("xfs: track metadata health status")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_icache.c |    5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 396cc54ca03f..c3f912a9231b 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -523,9 +523,6 @@ xfs_iget_cache_hit(
 			XFS_INO_TO_AGINO(pag->pag_mount, ino),
 			XFS_ICI_RECLAIM_TAG);
 		inode->i_state = I_NEW;
-		ip->i_sick = 0;
-		ip->i_checked = 0;
-
 		spin_unlock(&ip->i_flags_lock);
 		spin_unlock(&pag->pag_ici_lock);
 	} else {
@@ -979,6 +976,8 @@ xfs_reclaim_inode(
 	spin_lock(&ip->i_flags_lock);
 	ip->i_flags = XFS_IRECLAIM;
 	ip->i_ino = 0;
+	ip->i_sick = 0;
+	ip->i_checked = 0;
 	spin_unlock(&ip->i_flags_lock);

 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
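[Editorial aside: the lifecycle the commit message describes can be sketched as a hypothetical userspace model — the `model_` names are illustrative, not kernel code. The point of the fix is visible in which function zeroes the health state: a cache hit on a reclaimable inode revives it without touching i_sick/i_checked, and only reclaim, which severs the incore/ondisk link, clears them.]

```c
#include <stdbool.h>

struct model_inode {
	unsigned int	i_sick;		/* health problems recorded by scrub */
	unsigned int	i_checked;	/* metadata types scrub has examined */
	bool		reclaimable;	/* VFS has let go of the inode */
};

/*
 * iget cache hit on a reclaimable inode: revive it for the new caller.
 * After the fix, the health state deliberately survives this path.
 */
void
model_iget_cache_hit(struct model_inode *ip)
{
	ip->reclaimable = false;
	/* no zeroing of i_sick / i_checked here -- that was the bug */
}

/*
 * Reclaim severs the link between the incore and ondisk inode, so this
 * is where the health state is zeroed.
 */
void
model_reclaim_inode(struct model_inode *ip)
{
	ip->i_sick = 0;
	ip->i_checked = 0;
	ip->reclaimable = false;
}
```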
* Re: [PATCH 1/3] xfs: only reset incore inode health state flags when reclaiming an inode
  2021-06-06 17:54 ` [PATCH 1/3] xfs: only reset incore inode health state flags when reclaiming an inode Darrick J. Wong
@ 2021-06-08 14:55   ` Carlos Maiolino
  0 siblings, 0 replies; 17+ messages in thread
From: Carlos Maiolino @ 2021-06-08 14:55 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Brian Foster, Dave Chinner, linux-xfs, david

On Sun, Jun 06, 2021 at 10:54:11AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> While running some fuzz tests on inode metadata, I noticed that the
> filesystem health report (as provided by xfs_spaceman) failed to report
> the file corruption even when spaceman was run immediately after running
> xfs_scrub to detect the corruption. That isn't the intended behavior;
> one ought to be able to run scrub to detect errors in the ondisk
> metadata and be able to access those reports for some time after the
> scrub.
>
> After running the same sequence through an instrumented kernel, I
> discovered the reason why -- scrub igets the file, scans it, marks it
> sick, and ireleases the inode. When the VFS lets go of the incore
> inode, it moves to RECLAIMABLE state. If spaceman igets the incore
> inode before it moves to RECLAIM state, iget reinitializes the VFS
> state, clears the sick and checked masks, and hands back the inode. At
> this point, the caller has the exact same incore inode, but with all the
> health state erased.
>
> In other words, we're erasing the incore inode's health state flags when
> we've decided NOT to sever the link between the incore inode and the
> ondisk inode. This is wrong, so we need to remove the lines that zero
> the fields from xfs_iget_cache_hit.
>
> As a precaution, we add the same lines into xfs_reclaim_inode just after
> we sever the link between incore and ondisk inode. Strictly speaking
> this isn't necessary because once an inode has gone through reclaim it
> must go through xfs_inode_alloc (which also zeroes the state) and
> xfs_iget is careful to check for mismatches between the inode it pulls
> out of the radix tree and the one it wants.
>
> ---

Looks good.
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>

>  fs/xfs/xfs_icache.c |    5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
>
>
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 396cc54ca03f..c3f912a9231b 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -523,9 +523,6 @@ xfs_iget_cache_hit(
> 			XFS_INO_TO_AGINO(pag->pag_mount, ino),
> 			XFS_ICI_RECLAIM_TAG);
> 		inode->i_state = I_NEW;
> -		ip->i_sick = 0;
> -		ip->i_checked = 0;
> -
> 		spin_unlock(&ip->i_flags_lock);
> 		spin_unlock(&pag->pag_ici_lock);
> 	} else {
> @@ -979,6 +976,8 @@ xfs_reclaim_inode(
> 	spin_lock(&ip->i_flags_lock);
> 	ip->i_flags = XFS_IRECLAIM;
> 	ip->i_ino = 0;
> +	ip->i_sick = 0;
> +	ip->i_checked = 0;
> 	spin_unlock(&ip->i_flags_lock);
>
> 	xfs_iunlock(ip, XFS_ILOCK_EXCL);

-- 
Carlos
end of thread, other threads:[~2021-06-08 14:55 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-03  3:12 [PATCHSET v2 0/3] xfs: preserve inode health reports for longer Darrick J. Wong
2021-06-03  3:12 ` [PATCH 1/3] xfs: only reset incore inode health state flags when reclaiming an inode Darrick J. Wong
2021-06-03  4:21   ` Dave Chinner
2021-06-03 20:41     ` Darrick J. Wong
2021-06-03 12:22   ` Brian Foster
2021-06-03 20:41     ` Darrick J. Wong
2021-06-03  3:12 ` [PATCH 2/3] xfs: drop IDONTCACHE on inodes when we mark them sick Darrick J. Wong
2021-06-03  4:34   ` Dave Chinner
2021-06-03 20:49     ` Darrick J. Wong
2021-06-03 12:23   ` Brian Foster
2021-06-03 20:48     ` Darrick J. Wong
2021-06-03  3:12 ` [PATCH 3/3] xfs: don't let background reclaim forget sick inodes Darrick J. Wong
2021-06-03  4:42   ` Dave Chinner
2021-06-03 12:31     ` Brian Foster
2021-06-03 21:30       ` Darrick J. Wong
2021-06-06 17:54 [PATCHSET v3 0/3] xfs: preserve inode health reports for longer Darrick J. Wong
2021-06-06 17:54 ` [PATCH 1/3] xfs: only reset incore inode health state flags when reclaiming an inode Darrick J. Wong
2021-06-08 14:55   ` Carlos Maiolino