linux-kernel.vger.kernel.org archive mirror
* [rfc] superblock shrinker accumulating excessive deferred counts
@ 2017-07-12 20:42 David Rientjes
  2017-07-17  5:06 ` Dave Chinner
  0 siblings, 1 reply; 6+ messages in thread
From: David Rientjes @ 2017-07-12 20:42 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Greg Thelen, Andrew Morton, Johannes Weiner, Vladimir Davydov,
	Dave Chinner, linux-kernel

Hi Al and everyone,

We're encountering an issue where the per-shrinker per-node deferred 
counts grow excessively large for the superblock shrinker.  This appears 
to be long-standing behavior, so I'm reaching out to see if there are any 
subtleties being overlooked since there is a reference to memory pressure 
and GFP_NOFS allocations growing total_scan purposefully.

This is a side effect of super_cache_count() returning the appropriate 
count but super_cache_scan() refusing to do anything about it and 
immediately terminating with SHRINK_STOP, mostly for GFP_NOFS allocations.

An unlucky thread will grab the per-node shrinker->nr_deferred[nid] count 
and increase it by

	(2 * nr_scanned * super_cache_count()) / (nr_eligible + 1)

While total_scan is capped to a sane limit, and restricts the amount of 
scanning that this thread actually does, if super_cache_scan() immediately 
responds with SHRINK_STOP because of GFP_NOFS, the end result of doing any 
of this is that nr_deferred just increased.  If we have a burst of 
GFP_NOFS allocations, this grows it potentially very largely, which we 
have seen in practice, and no matter how much __GFP_FS scanning is done 
capped by total_scan, we can never fully get down to batch_count == 1024.
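
To make the arithmetic concrete, here is a rough userspace simulation of 
the deferral math as I read it out of do_shrink_slab(), with made-up 
numbers and assuming shrinker->seeks == DEFAULT_SEEKS so that the delta 
reduces to the expression above:

	#include <stdio.h>

	int main(void)
	{
		long freeable = 100000;	/* what super_cache_count() returns */
		long nr_scanned = 1000, nr_eligible = 32768;
		long nr_deferred = 0;
		int i;

		/*
		 * A burst of GFP_NOFS reclaim attempts: each one computes a
		 * delta, but super_cache_scan() answers SHRINK_STOP, so
		 * nothing is scanned and the whole delta is deferred.
		 */
		for (i = 0; i < 10000; i++) {
			long delta = (2 * nr_scanned * freeable) /
				     (nr_eligible + 1);
			nr_deferred += delta;
		}

		printf("nr_deferred after NOFS burst: %ld (freeable: %ld)\n",
		       nr_deferred, freeable);
		return 0;
	}

With these made-up inputs nr_deferred ends up around 61 million against 
only 100,000 freeable objects, which is the shape of what we observe.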

This seems troublesome to me and my first inclination was to avoid 
counting *any* objects at all for GFP_NOFS but then I notice the comment 
in do_shrink_slab():

	/*
	 * We need to avoid excessive windup on filesystem shrinkers
	 * due to large numbers of GFP_NOFS allocations causing the
	 * shrinkers to return -1 all the time. This results in a large
	 * nr being built up so when a shrink that can do some work
	 * comes along it empties the entire cache due to nr >>>
	 * freeable. This is bad for sustaining a working set in
	 * memory.
	 *
	 * Hence only allow the shrinker to scan the entire cache when
	 * a large delta change is calculated directly.
	 */

I assume the comment is referring to "excessive windup" only in terms of 
total_scan, although it doesn't impact next_deferred at all.  The problem 
here seems to be next_deferred always grows extremely large.

I'd like to do this, but am checking for anything subtle that this relies 
on wrt memory pressure or implicit intended behavior.

Thanks for looking at this!
---
 fs/super.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/super.c b/fs/super.c
--- a/fs/super.c
+++ b/fs/super.c
@@ -65,13 +65,6 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
 
 	sb = container_of(shrink, struct super_block, s_shrink);
 
-	/*
-	 * Deadlock avoidance.  We may hold various FS locks, and we don't want
-	 * to recurse into the FS that called us in clear_inode() and friends..
-	 */
-	if (!(sc->gfp_mask & __GFP_FS))
-		return SHRINK_STOP;
-
 	if (!trylock_super(sb))
 		return SHRINK_STOP;
 
@@ -116,6 +109,13 @@ static unsigned long super_cache_count(struct shrinker *shrink,
 	struct super_block *sb;
 	long	total_objects = 0;
 
+	/*
+	 * Deadlock avoidance.  We may hold various FS locks, and we don't want
+	 * to recurse into the FS that called us in clear_inode() and friends..
+	 */
+	if (!(sc->gfp_mask & __GFP_FS))
+		return 0;
+
 	sb = container_of(shrink, struct super_block, s_shrink);
 
 	/*

* Re: [rfc] superblock shrinker accumulating excessive deferred counts
  2017-07-12 20:42 [rfc] superblock shrinker accumulating excessive deferred counts David Rientjes
@ 2017-07-17  5:06 ` Dave Chinner
  2017-07-17 20:37   ` David Rientjes
  0 siblings, 1 reply; 6+ messages in thread
From: Dave Chinner @ 2017-07-17  5:06 UTC (permalink / raw)
  To: David Rientjes
  Cc: Alexander Viro, Greg Thelen, Andrew Morton, Johannes Weiner,
	Vladimir Davydov, linux-kernel

On Wed, Jul 12, 2017 at 01:42:35PM -0700, David Rientjes wrote:
> Hi Al and everyone,
> 
> We're encountering an issue where the per-shrinker per-node deferred 
> counts grow excessively large for the superblock shrinker.  This appears 
> to be long-standing behavior, so I'm reaching out to see if there are any 
> subtleties being overlooked since there is a reference to memory pressure 
> and GFP_NOFS allocations growing total_scan purposefully.

There are plenty of land mines^W^Wsubtleties in this code....

> This is a side effect of super_cache_count() returning the appropriate 
> count but super_cache_scan() refusing to do anything about it and 
> immediately terminating with SHRINK_STOP, mostly for GFP_NOFS allocations.

Yup. Happens during things like memory allocations in filesystem
transaction context. e.g. when your memory pressure is generated by
GFP_NOFS allocations within transactions whilst doing directory
traversals (say 'chown -R' across an entire filesystem), then we
can't do direct reclaim on the caches that are generating the memory
pressure and so have to defer all the work to either kswapd or the
next GFP_KERNEL allocation context that triggers reclaim.

> An unlucky thread will grab the per-node shrinker->nr_deferred[nid] count 
> and increase it by
> 
> 	(2 * nr_scanned * super_cache_count()) / (nr_eligible + 1)
> 
> While total_scan is capped to a sane limit, and restricts the amount of 
> scanning that this thread actually does, if super_cache_scan() immediately 
> responds with SHRINK_STOP because of GFP_NOFS, the end result of doing any 
> of this is that nr_deferred just increased. 

Yes, by design.

> If we have a burst of 
> GFP_NOFS allocations, this grows it potentially very largely, which we 
> have seen in practice,

Yes, by design.

> and no matter how much __GFP_FS scanning is done 
> capped by total_scan, we can never fully get down to batch_count == 1024.

I don't see a batch_count variable in the shrinker code anywhere,
so I'm not sure what you mean by this.

Can you post a shrinker trace that shows the deferred count wind
up and then display the problem you're trying to describe?

> This seems troublesome to me and my first inclination was to avoid 
> counting *any* objects at all for GFP_NOFS but then I notice the comment 
> in do_shrink_slab():
> 
> 	/*
> 	 * We need to avoid excessive windup on filesystem shrinkers
> 	 * due to large numbers of GFP_NOFS allocations causing the
> 	 * shrinkers to return -1 all the time. This results in a large
> 	 * nr being built up so when a shrink that can do some work
> 	 * comes along it empties the entire cache due to nr >>>
> 	 * freeable. This is bad for sustaining a working set in
> 	 * memory.
> 	 *
> 	 * Hence only allow the shrinker to scan the entire cache when
> 	 * a large delta change is calculated directly.
> 	 */
> 
> I assume the comment is referring to "excessive windup" only in terms of 
> total_scan, although it doesn't impact next_deferred at all.  The problem 
> here seems to be next_deferred always grows extremely large.

"excessive windup" means the deferred count kept growing without
bound and so when work was finally able to be done, the amount of
work deferred would trash the entire cache in one go. Think of a
spring - you can use it to smooth peaks and troughs in steady state
conditions, but transient conditions can wind the spring up so tight
that it can't be controlled when it is released. That's the
"excessive windup" part of the description above.

How do we control springs? By adding a damper to reduce the
speed at which it can react to large step changes, hence making it
harder to step outside the bounds of controlled behaviour. In this
case, the damper is the delta based clamping of total_scan.

i.e. light memory pressure generates small deltas, but we can have
so much GFP_NOFS allocation that we can still defer large amounts of
work. Under light memory pressure, we want to release this spring
more quickly than the current memory pressure indicates, but not so
fast that we create a great big explosion of work and unbalance the
system; it is more important to maintain the working set in light
memory pressure conditions than it is to free lots of memory.

However, if we have heavy memory pressure (e.g. priority has wound
up) then the delta scan will cross the trigger threshold of "do lots
of work now, we need the memory" and we'll dump the entire deferred
work count into this execution of the shrinker, because memory is
needed right now....
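
Roughly, the damper and the trigger threshold look like this in
do_shrink_slab() (paraphrasing from memory, not the exact code):

	total_scan = nr_deferred + delta;

	/* the damper: a small delta can't release the whole spring at once */
	if (delta < freeable / 4)
		total_scan = min(total_scan, freeable / 2);

	/* never try to free more than twice the freeable estimate */
	if (total_scan > freeable * 2)
		total_scan = freeable * 2;

So under light pressure the deferred work bleeds out at no more than
freeable / 2 per call, and only a large delta lets total_scan grow
towards the 2 * freeable cap.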

> I'd like to do this, but am checking for anything subtle that this relies 
> on wrt memory pressure or implicit intended behavior.

If we *don't* count and defer the work that we should have done
under GFP_NOFS reclaim contexts, we end up with caches that memory
reclaim will not shrink until GFP_NOFS generated memory pressure
stops completely. This is, generally speaking, bad for application
performance because applications get blocked waiting for memory to be
freed from caches that memory reclaim can't put any significant
pressure on...

OTOH, if we don't damp down the deferred count scanning on small
deltas, then we end up with filesystem caches being trashed in light
memory pressure conditions. This is, generally speaking, bad for
workloads that rely on filesystem caches for performance (e.g. git,
NFS servers, etc).

What we have now is effectively a brute force solution that finds
a decent middle ground most of the time. It's not perfect, but I'm
yet to find a better solution....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [rfc] superblock shrinker accumulating excessive deferred counts
  2017-07-17  5:06 ` Dave Chinner
@ 2017-07-17 20:37   ` David Rientjes
  2017-07-17 21:50     ` Dave Chinner
  0 siblings, 1 reply; 6+ messages in thread
From: David Rientjes @ 2017-07-17 20:37 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Alexander Viro, Greg Thelen, Andrew Morton, Johannes Weiner,
	Vladimir Davydov, Hugh Dickins, linux-kernel

On Mon, 17 Jul 2017, Dave Chinner wrote:

> > This is a side effect of super_cache_count() returning the appropriate 
> > count but super_cache_scan() refusing to do anything about it and 
> > immediately terminating with SHRINK_STOP, mostly for GFP_NOFS allocations.
> 
> Yup. Happens during things like memory allocations in filesystem
> transaction context. e.g. when your memory pressure is generated by
> GFP_NOFS allocations within transactions whilst doing directory
> traversals (say 'chown -R' across an entire filesystem), then we
> can't do direct reclaim on the caches that are generating the memory
> pressure and so have to defer all the work to either kswapd or the
> next GFP_KERNEL allocation context that triggers reclaim.
> 

Thanks for looking into this, Dave!

The number of GFP_NOFS allocations that build up the deferred counts can 
be unbounded, however, so this can become excessive, and the oom killer 
will not kill any processes in this context.  Although the motivation to 
do additional reclaim because of past GFP_NOFS reclaim attempts is 
worthwhile, I think it should be limited because currently it only 
increases until something is able to start draining these excess counts.  
Having 10,000 GFP_NOFS reclaim attempts store up 
(2 * nr_scanned * freeable) / (nr_eligible + 1) objects 10,000 times 
such that it exceeds freeable by many magnitudes doesn't seem like a 
particularly useful thing.  For reference, we have seen nr_deferred for a 
single node to be > 10,000,000,000 in practice.  total_scan is limited to 
2 * freeable for each call to do_shrink_slab(), but such an excessive 
deferred count will guarantee it retries 2 * freeable each time instead of 
the proportion of lru scanned as intended.
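
To put rough numbers on that (the freeable value is just illustrative): 
with freeable around 100,000 objects and nr_deferred at 10,000,000,000, 
each __GFP_FS call that does get to scan is capped at

	2 * freeable = 200,000 objects, so
	10,000,000,000 / 200,000 = 50,000 calls

are needed just to drain the deferred count, and every one of those 
calls scans the entire cache rather than a proportion of the lru.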

What breaks if we limit the nr_deferred counts to freeable * 4, for 
example?

> > and no matter how much __GFP_FS scanning is done 
> > capped by total_scan, we can never fully get down to batch_count == 1024.
> 
> I don't see a batch_count variable in the shrinker code anywhere,
> so I'm not sure what you mean by this.
> 

batch_size == 1024, sorry.

> Can you post a shrinker trace that shows the deferred count wind
> up and then display the problem you're trying to describe?
> 

All threads contending on the list_lru's nlru->lock because they are all 
stuck in super_cache_count() while one thread is iterating through an 
excessive number of deferred objects in super_cache_scan(), contending for 
the same locks and nr_deferred never substantially goes down.

The problem with the superblock shrinker, which is why I emailed Al 
originally, is also that it is SHRINKER_MEMCG_AWARE.  Our 
list_lru_shrink_count() is only representative for the list_lru of 
sc->memcg, which is used in both super_cache_count() and 
super_cache_scan() for various math.  The nr_deferred counts from the 
do_shrink_slab() logic, however, are per-nid and, as such, various memcgs 
get penalized with excessive counts that they do not have freeable to 
begin with.
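
Roughly, the mismatch looks like this (simplified, not the exact code):

	/* super_cache_count() counts only sc->memcg's objects on sc->nid */
	total_objects += list_lru_shrink_count(&sb->s_dentry_lru, sc);
	total_objects += list_lru_shrink_count(&sb->s_inode_lru, sc);

	/* ...but do_shrink_slab() defers per-nid only, shared by all memcgs */
	nr = atomic_long_xchg(&shrinker->nr_deferred[sc->nid], 0);
	/* on SHRINK_STOP the work is pushed back into the shared counter */
	atomic_long_add(next_deferred, &shrinker->nr_deferred[sc->nid]);

So a memcg with almost nothing freeable can still inherit an enormous 
nr_deferred built up by other memcgs on the same node.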

* Re: [rfc] superblock shrinker accumulating excessive deferred counts
  2017-07-17 20:37   ` David Rientjes
@ 2017-07-17 21:50     ` Dave Chinner
  2017-07-19  0:28       ` David Rientjes
  0 siblings, 1 reply; 6+ messages in thread
From: Dave Chinner @ 2017-07-17 21:50 UTC (permalink / raw)
  To: David Rientjes
  Cc: Alexander Viro, Greg Thelen, Andrew Morton, Johannes Weiner,
	Vladimir Davydov, Hugh Dickins, linux-kernel

On Mon, Jul 17, 2017 at 01:37:35PM -0700, David Rientjes wrote:
> On Mon, 17 Jul 2017, Dave Chinner wrote:
> 
> > > This is a side effect of super_cache_count() returning the appropriate 
> > > count but super_cache_scan() refusing to do anything about it and 
> > > immediately terminating with SHRINK_STOP, mostly for GFP_NOFS allocations.
> > 
> > Yup. Happens during things like memory allocations in filesystem
> > transaction context. e.g. when your memory pressure is generated by
> > GFP_NOFS allocations within transactions whilst doing directory
> > traversals (say 'chown -R' across an entire filesystem), then we
> > can't do direct reclaim on the caches that are generating the memory
> > pressure and so have to defer all the work to either kswapd or the
> > next GFP_KERNEL allocation context that triggers reclaim.
> > 
> 
> Thanks for looking into this, Dave!
> 
> The number of GFP_NOFS allocations that build up the deferred counts can 
> be unbounded, however, so this can become excessive, and the oom killer 
> will not kill any processes in this context.  Although the motivation to 
> do additional reclaim because of past GFP_NOFS reclaim attempts is 
> worthwhile, I think it should be limited because currently it only 
> increases until something is able to start draining these excess counts.  

Usually kswapd is kicked in by this point and starts doing work. Why
isn't kswapd doing the shrinker work in the background?

> Having 10,000 GFP_NOFS reclaim attempts store up 
> (2 * nr_scanned * freeable) / (nr_eligible + 1) objects 10,000 times 
> such that it exceeds freeable by many magnitudes doesn't seem like a 
> particularly useful thing.  For reference, we have seen nr_deferred for a 
> single node to be > 10,000,000,000 in practice.

What is the workload, and where is that much GFP_NOFS allocation
coming from?

> total_scan is limited to 
> 2 * freeable for each call to do_shrink_slab(), but such an excessive 
> deferred count will guarantee it retries 2 * freeable each time instead of 
> the proportion of lru scanned as intended.
> 
> What breaks if we limit the nr_deferred counts to freeable * 4, for 
> example?

No solutions are viable until the cause of the windup is known and
understood....

> > Can you post a shrinker trace that shows the deferred count wind
> > up and then display the problem you're trying to describe?
> > 
> 
> All threads contending on the list_lru's nlru->lock because they are all 
> stuck in super_cache_count() while one thread is iterating through an 
> excessive number of deferred objects in super_cache_scan(), contending for 
> the same locks and nr_deferred never substantially goes down.

Ugh. The per-node lru list count was designed to run unlocked and so
avoid this sort of (known) scalability problem.

Ah, see the difference between list_lru_count_node() and
list_lru_count_one(). list_lru_count_one() should only take locks
for memcg lookups if it is trying to shrink a memcg. That needs to
be fixed before anything else and, if possible, the memcg lookup be
made lockless....
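
From memory, the memcg-aware count path looks something like this (so
don't quote me on the exact details):

	unsigned long list_lru_count_one(struct list_lru *lru, int nid,
					 struct mem_cgroup *memcg)
	{
		struct list_lru_node *nlru = &lru->node[nid];
		struct list_lru_one *l;
		unsigned long count;

		/* the lock only protects the memcg list lookup */
		spin_lock(&nlru->lock);
		l = list_lru_from_memcg_idx(nlru, memcg_cache_id(memcg));
		count = l->nr_items;
		spin_unlock(&nlru->lock);

		return count;
	}

Every concurrent super_cache_count() call serialises on that nlru->lock,
which is exactly the contention you're describing.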

IIRC, the memcg shrinkers all set sc->nid = 0, as the memcg LRUs are
not per-node lists - they are each just a single linked list and so
there are other scalability problems with memcgs, too.

> The problem with the superblock shrinker, which is why I emailed Al 
> originally, is also that it is SHRINKER_MEMCG_AWARE.  Our 
> list_lru_shrink_count() is only representative for the list_lru of 
> sc->memcg, which is used in both super_cache_count() and 
> super_cache_scan() for various math.  The nr_deferred counts from the 
> do_shrink_slab() logic, however, are per-nid and, as such, various memcgs 
> get penalized with excessive counts that they do not have freeable to 
> begin with.

Yup, the memcg shrinking was shoe-horned into the per-node LRU
infrastructure, and the high level accounting is completely unaware
of the fact that memcgs have their own private LRUs. We left the
windup in place because slab caches are shared, and it's possible
that memory can't be freed because pages have objects from different
memcgs pinning them. Hence we need to bleed at least some of that
"we can't make progress" count back into the global "deferred
reclaim" pool to get other contexts to do some reclaim.

Perhaps that's the source of the problem - memcgs have nasty
behaviours when they have very few reclaimable objects (look at
all the "we need to be able to reclaim every single object" fixes),
so I would not be surprised if it's a single memcg under extreme
memory pressure that is causing windups. Still, I think the lock
contention problems should be sorted first - removing the shrinker
serialisation will change behaviour significantly in these
situations.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [rfc] superblock shrinker accumulating excessive deferred counts
  2017-07-17 21:50     ` Dave Chinner
@ 2017-07-19  0:28       ` David Rientjes
  2017-07-19  1:33         ` Dave Chinner
  0 siblings, 1 reply; 6+ messages in thread
From: David Rientjes @ 2017-07-19  0:28 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Alexander Viro, Greg Thelen, Andrew Morton, Johannes Weiner,
	Vladimir Davydov, Hugh Dickins, linux-kernel

On Tue, 18 Jul 2017, Dave Chinner wrote:

> > Thanks for looking into this, Dave!
> > 
> > The number of GFP_NOFS allocations that build up the deferred counts can 
> > be unbounded, however, so this can become excessive, and the oom killer 
> > will not kill any processes in this context.  Although the motivation to 
> > do additional reclaim because of past GFP_NOFS reclaim attempts is 
> > worthwhile, I think it should be limited because currently it only 
> > increases until something is able to start draining these excess counts.  
> 
> Usually kswapd is kicked in by this point and starts doing work. Why
> isn't kswapd doing the shrinker work in the background?
> 

It is, and often gets preempted itself while in lru scanning or 
shrink_slab(), most often super_cache_count() itself.  The issue is that 
it gets preempted by networking packets being sent in irq context which 
ends up eating up GFP_ATOMIC memory.  One of the key traits of this is 
that per-zone free memory is far below the min watermarks so not only is 
there insufficient memory for GFP_NOFS, but also insufficient memory for 
GFP_ATOMIC.  Kswapd will only slab shrink a proportion of the lru scanned 
if it is not lucky enough to grab the excess nr_deferred.  And meanwhile 
other threads end up increasing it.

It's various workloads and I can't show a specific example of GFP_NOFS 
allocations in flight because we have made changes to prevent this, 
specifically ignoring nr_deferred counts for SHRINKER_MEMCG_AWARE 
shrinkers since they are largely erroneous.  This can also occur if we 
cannot grab the trylock on the superblock itself.
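
For reference, the short-term hack we're carrying is along these lines 
in do_shrink_slab(), illustrative only and not something I'm proposing:

	/* don't read or accumulate deferred work for memcg-aware shrinkers */
	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
		nr = 0;
	else
		nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);

with a matching hunk so that next_deferred is not added back for those 
shrinkers either.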

> Ugh. The per-node lru list count was designed to run unlocked and so
> avoid this sort of (known) scalability problem.
> 
> Ah, see the difference between list_lru_count_node() and
> list_lru_count_one(). list_lru_count_one() should only take locks
> for memcg lookups if it is trying to shrink a memcg. That needs to
> be fixed before anything else and, if possible, the memcg lookup be
> made lockless....
> 

We've done that as part of this fix, actually, by avoiding doing resizing 
of these list_lru's when the number of memcg cache ids increase.  We just 
preallocate the max amount, MEMCG_CACHES_MAX_SIZE, to do lockless reads 
since the lock there is only needed to prevent concurrent remapping.
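
The shape of it is roughly this (simplified, not the actual list_lru 
internals):

	/* allocate the per-node memcg array at its maximum size, once */
	nlru->memcg_lrus = kzalloc(sizeof(struct list_lru_memcg) +
			MEMCG_CACHES_MAX_SIZE * sizeof(struct list_lru_one *),
			GFP_KERNEL);

	/* readers can no longer race with a resize, so no nlru->lock needed */
	l = nlru->memcg_lrus->lru[memcg_cache_id(memcg)];
	count = l->nr_items;

It costs memory up front, but the count path no longer serialises on 
the lock.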

> Yup, the memcg shrinking was shoe-horned into the per-node LRU
> infrastructure, and the high level accounting is completely unaware
> of the fact that memcgs have their own private LRUs. We left the
> windup in place because slab caches are shared, and it's possible
> that memory can't be freed because pages have objects from different
> memcgs pinning them. Hence we need to bleed at least some of that
> "we can't make progress" count back into the global "deferred
> reclaim" pool to get other contexts to do some reclaim.
> 

Right, now we've patched our kernel to avoid looking at the nr_deferred 
count for SHRINKER_MEMCG_AWARE but that's obviously a short-term solution, 
and I'm not sure that we can spare the tax to get per-memcg per-node 
deferred counts.  It seems that some other metadata would be needed in 
this case to indicate excessive windup for slab shrinking that cannot 
actually do any scanning in super_cache_scan().  

Vladimir, do you have a suggestion, or is there someone else that is 
working on this?

* Re: [rfc] superblock shrinker accumulating excessive deferred counts
  2017-07-19  0:28       ` David Rientjes
@ 2017-07-19  1:33         ` Dave Chinner
  0 siblings, 0 replies; 6+ messages in thread
From: Dave Chinner @ 2017-07-19  1:33 UTC (permalink / raw)
  To: David Rientjes
  Cc: Alexander Viro, Greg Thelen, Andrew Morton, Johannes Weiner,
	Vladimir Davydov, Hugh Dickins, linux-kernel

On Tue, Jul 18, 2017 at 05:28:14PM -0700, David Rientjes wrote:
> On Tue, 18 Jul 2017, Dave Chinner wrote:
> 
> > > Thanks for looking into this, Dave!
> > > 
> > > The number of GFP_NOFS allocations that build up the deferred counts can 
> > > be unbounded, however, so this can become excessive, and the oom killer 
> > > will not kill any processes in this context.  Although the motivation to 
> > > do additional reclaim because of past GFP_NOFS reclaim attempts is 
> > > worthwhile, I think it should be limited because currently it only 
> > > increases until something is able to start draining these excess counts.  
> > 
> > Usually kswapd is kicked in by this point and starts doing work. Why
> > isn't kswapd doing the shrinker work in the background?
> > 
> 
> It is, and often gets preempted itself while in lru scanning or 
> shrink_slab(), most often super_cache_count() itself.  The issue is that 
> it gets preempted by networking packets being sent in irq context which 
> ends up eating up GFP_ATOMIC memory. 

That seems like a separate architectural problem - memory allocation
threads preempting the memory reclaim threads they depend on for
progress seems like a more general priority inversion problem to me,
not a shrinker problem. It's almost impossible to work around this
sort of "supply can't keep up with demand because demand has higher
priority and starves supply" problem by hacking around in the supply
context...

> One of the key traits of this is 
> that per-zone free memory is far below the min watermarks so not only is 
> there insufficient memory for GFP_NOFS, but also insufficient memory for 
> GFP_ATOMIC.  Kswapd will only slab shrink a proportion of the lru scanned 
> if it is not lucky enough to grab the excess nr_deferred.  And meanwhile 
> other threads end up increasing it.

It sounds very much like GFP_KERNEL kswapd reclaim context needs to
run with higher priority than the network driver ISR threads. Or, if
the drivers actually do large amounts of memory allocation in IRQ
context, then that work needs to be moved into ISRs that can be
scheduled appropriately to prevent starvation of memory reclaim.
i.e. the network drivers should be dropping packets because they
can't get memory, not pre-empting reclaim infrastructure in an
attempt to get more memory allocated because packets are incoming...

> It's various workloads and I can't show a specific example of GFP_NOFS 
> allocations in flight because we have made changes to prevent this, 
> specifically ignoring nr_deferred counts for SHRINKER_MEMCG_AWARE 
> shrinkers since they are largely erroneous.  This can also occur if we 
> cannot grab the trylock on the superblock itself.

Which should be pretty rare.

> 
> > Ugh. The per-node lru list count was designed to run unlocked and so
> > avoid this sort of (known) scalability problem.
> > 
> > Ah, see the difference between list_lru_count_node() and
> > list_lru_count_one(). list_lru_count_one() should only take locks
> > for memcg lookups if it is trying to shrink a memcg. That needs to
> > be fixed before anything else and, if possible, the memcg lookup be
> > made lockless....
> > 
> 
> We've done that as part of this fix, actually, by avoiding doing resizing 
> of these list_lru's when the number of memcg cache ids increase.  We just 
> preallocate the max amount, MEMCG_CACHES_MAX_SIZE, to do lockless reads 
> since the lock there is only needed to prevent concurrent remapping.

And if you've fixed this, why is the system getting stuck counting
the number of objects on the LRU? Or does that just move the
serialisation to the scan call itself?

If so, I suspect this is going to be another case of direct reclaim
trying to drive unbound parallelism through the shrinkers which
don't have any parallelism at all because the caches being shrunk
only have a single list in memcg contexts. There's nothing quite
like having a thundering herd of allocations all trying to run
direct reclaim at the same time and getting backed up in the same
shrinker context because the shrinker effectively serialises access
to the cache....

> > Yup, the memcg shrinking was shoe-horned into the per-node LRU
> > infrastructure, and the high level accounting is completely unaware
> > of the fact that memcgs have their own private LRUs. We left the
> > windup in place because slab caches are shared, and it's possible
> > that memory can't be freed because pages have objects from different
> > memcgs pinning them. Hence we need to bleed at least some of that
> > "we can't make progress" count back into the global "deferred
> > reclaim" pool to get other contexts to do some reclaim.
> > 
> 
> Right, now we've patched our kernel to avoid looking at the nr_deferred 
> count for SHRINKER_MEMCG_AWARE but that's obviously a short-term solution, 
> and I'm not sure that we can spare the tax to get per-memcg per-node 
> deferred counts.

I very much doubt it - it was too expensive to even consider a few
years ago and the cost hasn't gone down at all...

But, really, what I'm hearing at the moment is that the shrinker
issues are only a symptom of a deeper architectural problem and not
the cause. It sounds to me like it's simply a case of severe
demand-driven breakdown because the GFP_KERNEL memory reclaim
mechanisms are being starved of CPU time by allocation contexts that
can't do direct reclaim. That problem needs to be solved first, then
we can look at what happens when GFP_KERNEL reclaim contexts are
given the CPU time they need to keep up with interrupt context
GFP_ATOMIC allocation demand....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
