Re: [PATCH 1/6] xfs: use s_inodes in xfs_qm_dqrele_all_inodes

From: Dave Chinner <david@fromorbit.com>
To: "Darrick J. Wong" <djwong@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>,
	linux-xfs@vger.kernel.org, hch@infradead.org
Subject: Re: [PATCH 1/6] xfs: use s_inodes in xfs_qm_dqrele_all_inodes
Date: Tue, 30 Mar 2021 14:07:47 +1100	[thread overview]
Message-ID: <20210330030747.GT63242@dread.disaster.area> (raw)
In-Reply-To: <20210330023656.GK4090233@magnolia>

On Mon, Mar 29, 2021 at 07:36:56PM -0700, Darrick J. Wong wrote:
> On Tue, Mar 30, 2021 at 11:44:07AM +1100, Dave Chinner wrote:
> > On Thu, Mar 25, 2021 at 05:21:18PM -0700, Darrick J. Wong wrote:
> > > From: Christoph Hellwig <hch@lst.de>
> > > 
> > > Using xfs_inode_walk in xfs_qm_dqrele_all_inodes is complete overkill,
> > > given that function simplify wants to iterate all live inodes known
> > > to the VFS.  Just iterate over the s_inodes list.
> > 
> > I'm not sure that assertion is true. We attach dquots during inode
> > inactivation after the VFS has removed the inode from the s_inodes
> > list and evicted the inode. Hence there is a window between the
> > inode being removed from the sb->s_inodes lists and it being marked
> > XFS_IRECLAIMABLE where we can attach dquots to the inode.
> > 
> > Indeed, an inode marked XFS_IRECLAIMABLE that has gone through
> > evict -> destroy -> inactive -> nlink != 0 -> xfs_free_ eofblocks()
> > can have referenced dquots attached to it and require dqrele() to be
> > called to release them.
> 
> Why do the dquots need to remain attached after destroy_inode?

They don't. But that's not the problem here.

> We can
> easily reattach them during inactivation (v3 did this), and I don't know
> why an inode needs dquots once we're through making metadata updates.

Yes, they get re-attached for truncation, attr removal, EOF block
freeing, etc. Only on the unlinked inode path in inactivation do
they get removed once all the work tha tmodifies the dquots is done.

But many of the paths don't detach them again because they are
multi-use.  e.g xfs_free_eofblocks() will attach dquots, but doesn't
detatch them because it's called from more placed than than the
inactivation path.

I'm sure this can all be cleaned up, but I *really* don't like the
idea of a "walk all XFS inodes" scan that actually only walks the
inodes with VFS references and not -all XFS inodes-.

And there's other problems with doing sb->s_inodes list walks -
namely the global lock. While we are doing this walk (might be tens
of millions of inodes!) we can hold the s_inode_list_lock for a long
time and we cannot instantiate new inodes or evict inodes to/from
the cache while that lock is held. The XFS inode walk is lockless
and we don't hold off anything to do wiht cache instantiation and
freeing, so it has less impact on the running system.

If everything is clean and don't block on locks anywhere, the
s_inodes list walk needs a cond_resched() in it. Again, tens
(hundreds) of millions of inodes can be on that list mean it can
hold the CPU for a long time.

Next, igrab() takes a reference to the inode which will mark them
referenced. THis walk grabs every inode in the filesysetm cache,
so marks them all referenced and makes it harder to reclaim them
under memory pressure. This perturbs working set behaviour.

inode list walks and igrab/iput don't come for free - they perturb
the working set, LRU orders, cause lock contention, long tail
latencies, etc. The XFS inode cache walk might not be the prettiest
thing, but it doesn't have any of these nasty side effects.

So, in general, I don't think we should be adding new inode list
walks anywhere, not even deep in XFS where nobody else might care...

> > Hence I think that xfs_qm_dqrele_all_inodes() is broken if all it is
> > doing is walking vfs referenced inodes, because it doesn't actually
> > release the dquots attached to reclaimable inodes. If this did
> > actually release all dquots, then there wouldn't be a need for the
> > xfs_qm_dqdetach() call in xfs_reclaim_inode() just before it's
> > handed to RCU to be freed....
> 
> Why does it work now, then?  The current code /also/ leaves the dquots
> attached to reclaimable inodes, and the quotaoff scan ignores
> IRECLAIMABLE inodes.

Luck, I think.

> Has it simply been the case that the dqpurge spins
> until reclaim runs, and reclaim gets run quickly enough (or quotaoff runs
> infrequently enough) that nobody's complained?

Yup, that's my assumption - quotaoff is rare, inode reclaim runs
every 5s - and so we haven't noticed it because nobody has looked
closely at how dquots vs inode reclaim works recently...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com