Re: [PATCH 4/4] vfs: keep inodes with page cache off the inode shrinker LRU

From: Johannes Weiner <hannes@cmpxchg.org>
To: Dave Chinner <david@fromorbit.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Roman Gushchin <guro@fb.com>, Tejun Heo <tj@kernel.org>,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH 4/4] vfs: keep inodes with page cache off the inode shrinker LRU
Date: Wed, 16 Jun 2021 00:54:15 -0400	[thread overview]
Message-ID: <YMmD9xhBm9wGqYhf@cmpxchg.org> (raw)
In-Reply-To: <20210616012008.GE2419729@dread.disaster.area>

On Wed, Jun 16, 2021 at 11:20:08AM +1000, Dave Chinner wrote:
> On Tue, Jun 15, 2021 at 02:50:09PM -0400, Johannes Weiner wrote:
> > On Tue, Jun 15, 2021 at 04:26:40PM +1000, Dave Chinner wrote:
> > > On Mon, Jun 14, 2021 at 05:19:04PM -0400, Johannes Weiner wrote:
> > > > @@ -1123,6 +1125,9 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
> > > >  			shadow = workingset_eviction(page, target_memcg);
> > > >  		__delete_from_page_cache(page, shadow);
> > > >  		xa_unlock_irq(&mapping->i_pages);
> > > > +		if (mapping_shrinkable(mapping))
> > > > +			inode_add_lru(mapping->host);
> > > > +		spin_unlock(&mapping->host->i_lock);
> > > >  
> > > 
> > > No. Inode locks have absolutely no place serialising core vmscan
> > > algorithms.
> > 
> > What if, and hear me out on this one, core vmscan algorithms change
> > the state of the inode?
> 
> Then the core vmscan algorithm has a layering violation.

You're just playing a word game here.

If we want the inode reclaimability to be dependent on the state of
the page cache, then reclaim and truncation change the state of the
inode. It's really that simple.

We can recognize that fact in the code, or we can be obtuse about it.

> > The alternative to propagating this change from where it occurs is
> > simply that the outer layer now has to do stupid things to infer it -
> > essentially needing to busy poll. See below.
> > 
> > > Really, all this complexity because:
> > > 
> > > | The issue is mostly that shrinker pools attract pressure based on
> > > | their size, and when objects get skipped the shrinkers remember this
> > > | as deferred reclaim work.
> > > 
> > > And so you want inodes with reclaimable mappings to avoid being
> > > considered deferred work for the inode shrinker. That's what is
> > > occuring because we are currently returning LRU_RETRY to them, or
> > > would also happen if we returned LRU_SKIP or LRU_ROTATE just to
> > > ignore them.
> > > 
> > > However, it's trivial to change inode_lru_isolate() to do something
> > > like:
> > > 
> > > 	if (inode_has_buffers(inode) || inode->i_data.nrpages) {
> > > 		if (!IS_ENABLED(CONFIG_HIGHMEM)) {
> > > 			spin_unlock(&inode->i_lock);
> > > 			return LRU_ROTATE_NODEFER;
> > > 		}
> > > 		.....
> > > 	}
> > 
> > Yeah, that's busy polling. It's simple and inefficient.
> 
> That's how the dentry and inode cache shrinkers work - the LRUs are
> lazily maintained to keep LRU list lock contention out of the fast
> paths. The trade off for that is some level of inefficiency in
> managing inodes and dentries that shouldn't be or aren't ideally
> placed on the LRU. IOWs, we push the decision on whether inodes
> should be on the LRU or whether they should be reclaimed into the
> shrinker scan logic, not the fast path lookup logic where inodes are
> referenced or dirtied or cleaned.

That's just not true.

The way the inode shrinker works is that all indefinite pins that make
an inode unreclaimable for reasons other than its age are kept off the
LRU. Yes, we're lazy about removing them if pins are established after
the object is already queued. But we don't queue objects with that
state, and sites that clear this pin update the LRU.

That's true when the inode is pinned by a refcount or by the dirty
state. I'm proposing to handle the page cache state in exactly the
same fashion.

It's you who is arguing we should deal with those particular pins
differently than how we deal with all others, and I find it difficult
to follow your reasoning for that.

> IOWs, the biggest problem with increasing the rate at which we move
> inodes on and off the LRU list is lock contention. It's already a
> major problem and hence the return values from the LRU code to say
> whether the inode was added/removed or not to the LRU.
> 
> By increasing the number inode LRU addition sites by an order of
> magnitude throughout the mm/ subsystem, we greatly increase the
> difficulty of managing the inode LRU. We no longer have a nice,
> predictable set of locations where the inodes are added to and
> removed from the LRUs, and so now we've got a huge increase in the
> number of different, unpredictable vectors that can lead to lock
> contention on the LRUs...

Rotating the same unreclaimable inodes over and over is BETTER for
lock/LRU contention than adding them once at the end of their life?

> > We can rotate the busy inodes to the end of the list when we see them,
> > but all *new* inodes get placed ahead of them and push them out the
> > back.  If you have a large set of populated inodes paired with an
> > ongoing stream of metadata operations creating one-off inodes, you end
> > up continuously shoveling through hundreds of thousand of the same
> > pinned inodes to get to the few reclaimable ones mixed in.
> 
> IDGI.
> 
> What has a huge stream of metadata dirty inodes going through the
> cache have to do with deciding when to reclaim a data inode with a
> non-zero page cache residency?

It's simply about the work it needs to perform in order to find
reclaimable nodes in the presence of a large number of permanently
unreclaimable ones.

> I mean, this whole patchset is intended to prevent the inode from
> being reclaimed until the page cache on the inode has been reclaimed
> by memory pressure, so what does it matter how often those data inodes
> are rotated through the LRU? They can't be reclaimed until the page
> cache frees the pages, so the presence or absence of other inodes on
> the list is entirely irrelevant.
> 
> For any workload that invloves streaming inodes through the inode
> cache, the inode shrinker is going to be scanning through the
> hundreds of thousands of inodes anyway. Once put in that context,
> the cost rotating on data inodes with page cache attached is going
> to be noise, regardless of whether it is inefficient or not.

How do you figure that?

If you have 1,000 inodes pinned by page cache, and then start
streaming, you have to rotate 1,000 inodes before you get to the first
reclaimable one. It depends on memory pressure how many streaming
inodes you were able to add before the inode pool is capped and
reclaim runs again. At least in theory, the worst case is having to
rotate every unreclaimable inode for every one you can reclaim.

Whether this is likely to happen in practice or not, I don't
understand why we should special case page cache pins from existing
pins and build a pathological cornercase into the shrinker behavior.

> And, really, if the inode cache has so many unreferenced inode with
> attached page cache attached to them that the size of the inode
> cache becomes an issue, then the page reclaim algorithms are totally
> failing to reclaim page cache quickly enough to maintian the desired
> system cache balance. i.e. We are supposed to be keeping a balance
> between the inode cache size and the page cache size, and if we
> can't keep that balance because inode reclaim can't make progress
> because page cache pinning inodes, then we have a page reclaim
> problem, not an inode reclaim problem.

We may never reclaim that page cache if we have no reason to do so.

Say you have a git tree fully cached, and now somebody runs grep on a
large subdirectory with lots of files. We don't kick out the git tree
cache for stuff that is only touched once - reclaim goes directly for
the cache coming off the grep. But the shrinker needs to first rotate
through the entire git tree to get to the grep inodes. Hopefully by
the time it gets to them, their page cache will have gone. If not,
we'll rotate them and go through the entire git tree inodes again.

What's the point in doing this when we know the exact point in time
where it makes sense to start scanning the git tree inodes?

Numbers talk, I get it, and I'll try to get some benchmarking done
next week. But do you really need numbers to understand how poorly
this algorithm performs?

The strongest argument you can make then is that the algorithm simply
never really matters. And that seems like a stretch.

> THere's also the problem of lack of visibility. RIght now, we know
> that all the unreferenced VFS inodes in the system are on the LRU
> and that they are generally all accounted for either by active
> references or walking the LRU. This change now creates a situation
> when inodes with page cache attached are not actively references but
> now also can't be found by walking the LRU list. i.e. we have inodes
> that are unaccounted for by various statistics. How do we know how
> many inodes are pinned by the page cache? We can't see them via
> inode shrinker count tracing, we can't see them via looking up all
> the active references to the inode (because there are none), and so
> on.

Can you elaborate on which statistics you are referring to?

I can see nr_inodes and nr_unused, but I don't see the delta being
broken down further into open and dirty - and if dirty whether it's
the inode that's dirty, or the page cache. So it appears we already
don't know how many inodes can be pinned today by the page cache.

From what I can see there seems to be a general lack of insight into
the state of the icache that may need a more general solution that is
outside the scope of my patch here.

> Essentially we have nothing tracking these inodes at all at this
> point, so if we ever miss a call in the mm/ to add the inode to the
> LRU that you've sprinked throughout the mm/ code, we essentially
> leak the inode. It's hard enough tracking down bugs that result in
> "VFS: busy inodes after unmount" errors without actively adding
> infrastructure that intentionally makes inodes disappear from
> tracking.

Well yes, but that's no different than missing an iput(). And there is
a lot more of those than there are page cache deletion sides.

> > Meanwhile, the MM is standing by, going "You know, I could tell you
> > exactly when those actually become reclaimable..."
> > 
> > If you think this is the more elegant implementation, simply because
> > it follows a made-up information hierarchy that doesn't actually map
> > to the real world, then I don't know what to tell you.
> > 
> > We take i_count and dirty inodes off the list because it's silly to
> > rotate them over and over, when we could just re-add them the moment
> > the pin goes away. It's not a stretch to do the same for populated
> > inodes, which usually make up a much bigger share of the pool.
> 
> We don't take dirty inodes off the LRU. We only take referenced
> inodes off the LRU in the shrinker isolation function because we can
> guarantee that when the inode reference is dropped via iput() the
> inode will be added back to the LRU if it needs to be placed there.
> 
> This is all all internal to the inode cache functionality
> (fs/inode.c) and that's the way it should remain because anything
> else will rapidly become unmaintainable.

You don't seem to understand how this code actually works - this is
patently false. I think this conversation could be more productive if
you spent some time catching up on that first.