Re: [PATCH 1/2] mm: use slab size in the slab shrinking ratio calculation

From: Josef Bacik <josef@toxicpanda.com>
To: Minchan Kim <minchan@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>,
	hannes@cmpxchg.org, riel@redhat.com, akpm@linux-foundation.org,
	linux-mm@kvack.org, kernel-team@fb.com,
	Josef Bacik <jbacik@fb.com>,
	mhocko@kernel.org, cl@linux.com, david@fromorbit.com
Subject: Re: [PATCH 1/2] mm: use slab size in the slab shrinking ratio calculation
Date: Tue, 4 Jul 2017 09:21:37 -0400	[thread overview]
Message-ID: <20170704132136.GB6807@destiny> (raw)
In-Reply-To: <20170704030100.GA16432@bbox>

On Tue, Jul 04, 2017 at 12:01:00PM +0900, Minchan Kim wrote:
> On Mon, Jul 03, 2017 at 09:50:07AM -0400, Josef Bacik wrote:
> > On Mon, Jul 03, 2017 at 10:33:03AM +0900, Minchan Kim wrote:
> > > Hello,
> > > 
> > > On Fri, Jun 30, 2017 at 11:03:24AM -0400, Josef Bacik wrote:
> > > > On Fri, Jun 30, 2017 at 11:17:13AM +0900, Minchan Kim wrote:
> > > > 
> > > > <snip>
> > > > 
> > > > > > 
> > > > > > Because this static step down wastes cycles.  Why loop 10 times when you could
> > > > > > set the target at actual usage and try to get everything in one go?  Most
> > > > > > shrinkable slabs adhere to this default of in use first model, which means that
> > > > > > we have to hit an object in the lru twice before it is freed.  So in order to
> > > > > 
> > > > > I didn't know that.
> > > > > 
> > > > > > reclaim anything we have to scan a slab cache's entire lru at least once before
> > > > > > any reclaim starts happening.  If we're doing this static step down thing we
> > > > > 
> > > > > If it's really true, I think that shrinker should be fixed first.
> > > > > 
> > > > 
> > > > Easier said than done.  I've fixed this for the super shrinkers, but like I said
> > > > below, all it takes is some asshole doing find / -exec stat {} \; twice to put
> > > > us back in the same situation again.  There's no aging mechanism other than
> > > > memory reclaim, so we get into this shitty situation of aging+reclaiming at the
> > > > same time.
> > > 
> > > What's different with normal page cache problem?
> > > 
> > 
> > What's different is reclaiming a page from pagecache gives you a page,
> > reclaiming 10k objects from slab may only give you one page if you are super
> > unlucky.  I'm nothing in life if I'm not unlucky.
> > 
> > > It has the same problem you mentioned so need to peek what VM does to
> > > address it.
> > > 
> > > It has two LRU list, active and inactive and maintain the size ratio 1:1.
> > > New page is on inactive and if they are two-touched, the page will be
> > > promoted into active list which is same problem.
> > > However, once reclaim is triggered, VM will can move quickly them from
> > > active to inactive with remove referenced flag untill the ratio is matched.
> > > So, VM can reclaim pages from inactive list, easily.
> > > 
> > > Can we apply similar mechanism into the problematical slab?
> > > How about adding shrink_slab(xxx, ACTIVE|INACTIVE) in somewhere of VM
> > > for demotion of objects from active list and adding the logic to move
> > > inactive object to active list when the cache hit happens to the FS?
> > > 
> > 
> > I did this too!  This worked out ok, but was a bit complex and the problem was
> > solved just as well by dropping the INUSE first approach.  I think that Dave's
> > approach to having a separate aging mechanism is a good compliment to these
> > patches.
> 
> There are two problems you are try to address.
> 
> 1. slab *page* reclaim
> 
> Your claim is that it's hard to reclaim a page by slab fragmentation so need to
> reclaim objects more aggressively.
> 
> Basically, aggressive scanning doesn't guarantee to reclaim a page but it just
> increases the possibility. Even, if we think slab works with merging feature(i.e.,
> it mixes same size several type objects in a slab), the possibility will be huge
> dropped if you try to bail out on a certain shrinker. So for working well,
> we should increase aggressiveness too much to sweep every objects from all shrinker.
> I guess that's why your patch makes the logic very aggressive.
> In here, my concern with that aggressive is to reclaim all objects too early
> and it ends up making void caching scheme. I'm not sure it's gain in the end.
>

Well the fact is what we have doesn't work, and I've been staring at this
problem for a few months and I don't have a better solution.

And keep in mind we're talking about a purely slab workload, something that
isn't likely to be a common case.  And even if our scan target is 2x, we aren't
going to reclaim the entire cache before we bail out.  We only scan in
'batch_size' chunks, which generally is 1024.  In the worst case that we have
one in use object on every slab page we own then yes we're fucked, but we're
still fucked with the current code, only with the current code it'll take us 20
minutes of looping in the vm vs. seconds scanning the whole list twice.

> 2. stream-workload
> 
> Your claim is that every objects can have INUSE flag in that workload so they
> need to scan full-cycle with removing the flag and finally, next cycle,
> objects can be reclaimed. On the situation, static incremental scanning would
> make deep prorioty drop which causes unncessary CPU cycle waste.
> 
> Actually, there isn't nice solution for that at the moment. Page cache try
> to solve it with multi-level LRU and as you said, it would solve the
> problem. However, it would be too complicated so you could be okay with
> Dave's suggestion which periodic aging(i.e., LFU) but it's not free so that
> it could increase runtime latency.
> 
> The point is that such workload is hard to solve in general and just
> general agreessive scanning is not a good solution because it can sweep
> other shrinkers which don't have such problem so I hope it should be
> solved by a specific shrinker itself rather than general VM level.

The only problem I see here is our shrinker list is just a list, there's no
order or anything and we just walk through one at a time.  We could mitigate
this problem by ordering the list based on objects, but this isn't necessarily a
good indication of overall size.  Consider xfs_buf, where each slab object is
also hiding 1 page, so for every slab object we free we also free 1 page.  This
may appear to be a smaller slab by object measures, but may actually be larger.
We could definitely make this aspect of the shrinker smarter, but these patches
here need to still be in place in general to solve the problem of us not being
aggressive enough currently.  Thanks,

Josef

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>