* [LSF/MM TOPIC] dying memory cgroups and slab reclaim issues
@ 2019-02-19  7:13 ` Roman Gushchin
  0 siblings, 0 replies; 33+ messages in thread
From: Roman Gushchin @ 2019-02-19  7:13 UTC (permalink / raw)
  To: lsf-pc
  Cc: linux-fsdevel, linux-mm, riel, dchinner, guroan, Kernel Team, hannes

Sorry, once more, now with fsdevel@ in cc, as asked by Dave.
--

Recent reverts of memcg leak fixes [1, 2] reintroduced the problem
with the accumulation of dying memory cgroups. This is a serious problem:
on most of our machines we've seen thousands of dying cgroups, and
the corresponding memory footprint was measured in hundreds of megabytes.
The problem was also independently discovered by other companies.

The fixes were reverted due to an xfs regression investigated by Dave Chinner.
Simultaneously we've seen a very small (0.18%) cpu regression on some hosts,
which prompted Rik van Riel to propose a patch [3] aimed at fixing the
regression. The idea is to accumulate small amounts of memory pressure and
apply them periodically, so that we don't overscan small shrinker lists.
According to Jan Kara's data [4], Rik's patch partially fixed the regression,
but not entirely.

The path forward isn't entirely clear now, and the status quo isn't acceptable
due to the memcg leak bug. Dave and Michal's position is to focus on the dying
memory cgroup case and apply some artificial memory pressure on the
corresponding slabs (probably during the cgroup deletion process). This
approach could theoretically be less harmful to the subtle scanning balance,
and not cause any regressions.

In my opinion, that's not necessarily true. Slab objects can be shared between
cgroups, and often can't be reclaimed on cgroup removal without an impact on the
rest of the system. Applying constant artificial memory pressure precisely only
to objects accounted to dying cgroups is challenging and will likely
cause quite significant overhead. Also, by "forgetting" some slab objects
under light or even moderate memory pressure, we're wasting memory which could
be used for something useful. Dying cgroups just make this problem more
obvious because of their size.

So, using "natural" memory pressure in a way that ensures all slab objects are
scanned periodically seems to me to be the best solution. The devil is in the
details, and how to do it without causing any regressions is an open question
now.

Also, completely re-parenting slabs to the parent cgroup (not only shrinker
lists) is a potential option to consider.

It would be nice to discuss the problem at LSF/MM, agree on a general path,
and put together a list of benchmarks which can be used to validate the
solution.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a9a238e83fbb0df31c3b9b67003f8f9d1d1b6c96
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=69056ee6a8a3d576ed31e38b3b14c70d6c74edcc
[3] https://lkml.org/lkml/2019/1/28/1865
[4] https://lkml.org/lkml/2019/2/8/336

* [LSF/MM ATTEND] MM track: dying memory cgroups and slab reclaim issue, memcg, THP
       [not found] ` <20190219092323.GH4525@dhcp22.suse.cz>
@ 2019-02-19 16:21   ` Roman Gushchin
  0 siblings, 0 replies; 33+ messages in thread
From: Roman Gushchin @ 2019-02-19 16:21 UTC (permalink / raw)
  To: Michal Hocko; +Cc: lsf-pc, linux-mm

On Tue, Feb 19, 2019 at 10:23:23AM +0100, Michal Hocko wrote:
> Hi Roman,
> you were not explicit here, but is this meant to be also an ATTEND
> request as well? MM track presumably?

Yes, please.

I'd be interested in discussing the problem described above, as well
as any other memcg- and THP/hugepages-related topics.

Thank you!

Roman


* Re: [LSF/MM TOPIC] dying memory cgroups and slab reclaim issues
  2019-02-19  7:13 ` Roman Gushchin
@ 2019-02-20  2:47   ` Dave Chinner
  -1 siblings, 0 replies; 33+ messages in thread
From: Dave Chinner @ 2019-02-20  2:47 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: lsf-pc, linux-fsdevel, linux-mm, riel, dchinner, guroan,
	Kernel Team, hannes

On Tue, Feb 19, 2019 at 07:13:33AM +0000, Roman Gushchin wrote:
> Sorry, once more, now with fsdevel@ in cc, asked by Dave.
> --
> 
> Recent reverts of memcg leak fixes [1, 2] reintroduced the problem
> with accumulating of dying memory cgroups. This is a serious problem:
> on most of our machines we've seen thousands on dying cgroups, and
> the corresponding memory footprint was measured in hundreds of megabytes.
> The problem was also independently discovered by other companies.
> 
> The fixes were reverted due to xfs regression investigated by Dave Chinner.

Context: it wasn't one regression that I investigated. We had
multiple bug reports with different regressions, and I saw evidence
on my own machines that something wasn't right because of the change
in the IO patterns in certain benchmarks. Some of the problems were
caused by the first patch, some were caused by the second patch.

This also affects ext4 (i.e. it's a general problem, not an XFS
problem) as has been reported a couple of times, including this one
overnight:

https://lore.kernel.org/lkml/4113759.4IQ3NfHFaI@stwm.de/

> Simultaneously we've seen a very small (0.18%) cpu regression on some hosts,
> which caused Rik van Riel to propose a patch [3], which aimed to fix the
> regression. The idea is to accumulate small memory pressure and apply it
> periodically, so that we don't overscan small shrinker lists. According
> to Jan Kara's data [4], Rik's patch partially fixed the regression,
> but not entirely.

Rik's patch was buggy and made the invalid assumption that a
cache with a small number of freeable objects is a "small cache", so
any comparisons made with it are essentially worthless.

More details about the problems with the patch and approach here:

https://lore.kernel.org/stable/20190131224905.GN31397@rh/

> The path forward isn't entirely clear now, and the status quo isn't acceptable
> due to memcg leak bug. Dave and Michal's position is to focus on dying memory
> cgroup case and apply some artificial memory pressure on corresponding slabs
> (probably, during cgroup deletion process). This approach can theoretically
> be less harmful for the subtle scanning balance, and not cause any regressions.

I outlined the dying memcg problem in patch[0] of the revert series:

https://lore.kernel.org/linux-mm/20190130041707.27750-1-david@fromorbit.com/

It basically documents the solution I proposed for dying memcg
cleanup:

dgc> e.g. add a garbage collector via a background workqueue that sits on
dgc> the dying memcg calling something like:
dgc> 
dgc> void drop_slab_memcg(struct mem_cgroup *dying_memcg)
dgc> {
dgc>         unsigned long freed;
dgc> 
dgc>         do {
dgc>                 struct mem_cgroup *memcg = NULL;
dgc> 
dgc>                 freed = 0;
dgc>                 memcg = mem_cgroup_iter(dying_memcg, NULL, NULL);
dgc>                 do {
dgc>                         freed += shrink_slab_memcg(GFP_KERNEL, 0, memcg, 0);
dgc>                 } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
dgc>         } while (freed > 0);
dgc> }

This is a pretty trivial piece of code and doesn't require
changing the core memory reclaim code at all.

> In my opinion, it's not necessarily true. Slab objects can be shared between
> cgroups, and often can't be reclaimed on cgroup removal without an impact on the
> rest of the system.

I've already pointed out that it is preferable for shared objects
to stay in cache, not face expedited reclaim:

https://lore.kernel.org/linux-mm/20190131221904.GL4205@dastard/

dgc> However, the memcg reaper *doesn't need to be perfect* to solve the
dgc> "takes too long to clean up dying memcgs" problem. Even if it leaves
dgc> shared objects behind (which we want to do!), it still trims those
dgc> memcgs down to /just the shared objects still in use/.  And given that
dgc> objects shared by memcgs are in the minority (according to past
dgc> discussions about the difficulties of accounting them correctly) I
dgc> think this is just fine.
dgc> 
dgc> Besides, those remaining shared objects are the ones we want to
dgc> naturally age out under memory pressure, but otherwise the memcgs
dgc> will have been shaken clean of all other objects accounted to them.
dgc> i.e. the "dying memcg" memory footprint goes down massively and the
dgc> "long term buildup" of dying memcgs basically goes away.

This all seems like pretty desirable cross-memcg working set
maintenance behaviour to me...

> Applying constant artificial memory pressure precisely only
> on objects accounted to dying cgroups is challenging and will likely
> cause a quite significant overhead.

I don't know where you got that from - the above example is clearly
a once-off cleanup.

And executing it via a workqueue in the async memcg cleanup path
(which already runs through multiple workqueues to run and wait for
different stages of cleanup) is not complex or challenging. Nor is it
likely to add overhead, because it means we will avoid the long-term
shrinker scanning overhead that cleanup currently requires.

> Also, by "forgetting" of some slab objects
> under light or even moderate memory pressure, we're wasting memory, which can be
> used for something useful.

Cached memory is not "forgotten" or "wasted memory". If the scan is
too small and not used, it is deferred to the next shrinker
invocation. This batching behaviour is intentionally done for scan
efficiency purposes. Don't take my word for it, read the discussion
that went along with commit 0b1fb40a3b12 ("mm: vmscan: shrink all
slab objects if tight on memory")

https://lore.kernel.org/lkml/20140115012541.ad302526.akpm@linux-foundation.org/

From Andrew:

akpm> Actually, the intent of batching is to limit the number of calls to
akpm> ->scan().  At least, that was the intent when I wrote it!  This is a
akpm> good principle and we should keep doing it.  If we're going to send the
akpm> CPU away to tread on a pile of cold cachelines, we should make sure
akpm> that it does a good amount of work while it's there.

IOWs, the "small scan" proposals defeat existing shrinker efficiency
optimisations. This change in behaviour is where the CPU usage
regressions in "small cache" scanning come from.  As Andrew said:
scan batching is a good principle and we should keep doing it.

> Dying cgroups are just making this problem more
> obvious because of their size.

Dying cgroups see this as a problem only because they have extremely
poor life cycle management. Expediting dying memcg cache cleanup is
the way to fix this, and it does not require changing global memory
reclaim behaviour.

> So, using "natural" memory pressure in a way, that all slabs objects are scanned
> periodically, seems to me as the best solution. The devil is in details, and how
> to do it without causing any regressions, is an open question now.
> 
> Also, completely re-parenting slabs to parent cgroup (not only shrinker lists)
> is a potential option to consider.

That should be done once the memcg gc thread has shrunk the caches
down to just the shared objects (which we want to keep in cache!)
that reference the dying memcg. That will get rid of all the
remaining references and allow the memcg to be reclaimed completely.

> It will be nice to discuss the problem on LSF/MM, agree on general path and
> make a potential list of benchmarks, which can be used to prove the solution.

In reality, it comes down to this - should we:

	a) add a small amount of code into the subsystem to perform
	expedited reaping of subsystem owned objects and test against
	the known, specific reproducing workload; or

	b) change global memory reclaim algorithms in a way that
	affects every Linux machine and workload in some way,
	resulting in us having to revalidate and rebalance memory
	reclaim for a large number of common workloads across all
	filesystems and subsystems that use shrinkers, on a wide
	range of different storage hardware and on both headless and
	desktop machines.

And when we look at it this way, if we end up with option b) as the
preferred solution then we've well and truly jumped the shark.  The
validation effort required for option b) is way out of proportion
with the small niche of machines and environments affected by the
dying memcg problem and the risk of regressions for users outside
these memcg-heavy environments is extremely high (as has already
been proven).

Cheers,

Dave
-- 
Dave Chinner
david@fromorbit.com

* Re: [LSF/MM TOPIC] dying memory cgroups and slab reclaim issues
  2019-02-20  2:47   ` Dave Chinner
@ 2019-02-20  5:50     ` Dave Chinner
  -1 siblings, 0 replies; 33+ messages in thread
From: Dave Chinner @ 2019-02-20  5:50 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: lsf-pc, linux-fsdevel, linux-mm, riel, dchinner, guroan,
	Kernel Team, hannes

On Wed, Feb 20, 2019 at 01:47:23PM +1100, Dave Chinner wrote:
> On Tue, Feb 19, 2019 at 07:13:33AM +0000, Roman Gushchin wrote:
> > Sorry, once more, now with fsdevel@ in cc, asked by Dave.
> > --
> > 
> > Recent reverts of memcg leak fixes [1, 2] reintroduced the problem
> > with accumulating of dying memory cgroups. This is a serious problem:
> > on most of our machines we've seen thousands on dying cgroups, and
> > the corresponding memory footprint was measured in hundreds of megabytes.
> > The problem was also independently discovered by other companies.
> > 
> > The fixes were reverted due to xfs regression investigated by Dave Chinner.
> 
> Context: it wasn't one regression that I investigated. We had
> multiple bug reports with different regressions, and I saw evidence
> on my own machines that something wasn't right because of the change
> in the IO patterns in certain benchmarks. Some of the problems were
> caused by the first patch, some were caused by the second patch.
> 
> This also affects ext4 (i.e. it's a general problem, not an XFS
> problem) as has been reported a couple of times, including this one
> overnight:
> 
> https://lore.kernel.org/lkml/4113759.4IQ3NfHFaI@stwm.de/
> 
> > Simultaneously we've seen a very small (0.18%) cpu regression on some hosts,
> > which caused Rik van Riel to propose a patch [3], which aimed to fix the
> > regression. The idea is to accumulate small memory pressure and apply it
> > periodically, so that we don't overscan small shrinker lists. According
> > to Jan Kara's data [4], Rik's patch partially fixed the regression,
> > but not entirely.
> 
> Rik's patch was buggy and made an invalid assumptions about how a
> cache with a small number of freeable objects is a "small cache", so
> any comaprisons made with it are essentially worthless.
> 
> More details about the problems with the patch and approach here:
> 
> https://lore.kernel.org/stable/20190131224905.GN31397@rh/

So, long story short: the dying memcg problem is actually a
regression caused by previous shrinker changes, the change in
4.18-rc1 was an attempt to fix that regression (which caused even more
widespread problems), and Rik's patch is another, different attempt to
fix the original regression.


The original regression broke the small scan accumulation algorithm
in the shrinker, but I don't think anyone actually understood
how it was supposed to work, and so the attempts to fix the
regression haven't actually restored the original behaviour. The
problematic commit is:

9092c71bb724 ("mm: use sc->priority for slab shrink targets")

which was included in 4.16-rc1.

This changed the delta calculation so that any cache with fewer than
4096 freeable objects would now end up with a zero delta count.
This means caches with few freeable objects had no scan pressure at
all, and nothing would get accumulated for later scanning. Prior to
this change, such scans would result in single-digit scan counts,
which would get deferred and accumulated until the overall delta +
deferred count went over the batch size and it would scan the cache.

IOWs, the above commit prevented accumulation of light pressure on
caches, so they'd only get scanned when extreme memory pressure
occurred.

The fix that went into 4.18-rc1 changed this to make the minimum scan
pressure the batch size, so instead of zero pressure, small caches were
put under extreme pressure. What wasn't used in a scan got
deferred, and so the shrinker would wind up and keep heavy pressure
on the cache even when there was only light memory pressure. IOWs,
instead of having a scan count in the single digits under light
memory pressure, those caches now had continual scan counts 1-2
orders of magnitude larger, i.e. way more aggressive than in 4.15 and
older kernels. Hence it introduced a different, more severe set of
regressions than the one it was trying to fix.

IOWs, the dying memcg issue is irrelevant here. The real problem
that needs fixing is a shrinker regression that occurred in 4.16-rc1,
not 4.18-rc1.

I'm just going to fix the original regression in the shrinker
algorithm by restoring the gradual accumulation behaviour, and this
whole series of problems can be put to bed.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [LSF/MM TOPIC] dying memory cgroups and slab reclaim issues
@ 2019-02-20  5:50     ` Dave Chinner
  0 siblings, 0 replies; 33+ messages in thread
From: Dave Chinner @ 2019-02-20  5:50 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: lsf-pc, linux-fsdevel, linux-mm, riel, dchinner, guroan,
	Kernel Team, hannes

On Wed, Feb 20, 2019 at 01:47:23PM +1100, Dave Chinner wrote:
> On Tue, Feb 19, 2019 at 07:13:33AM +0000, Roman Gushchin wrote:
> > Sorry, once more, now with fsdevel@ in cc, asked by Dave.
> > --
> > 
> > Recent reverts of memcg leak fixes [1, 2] reintroduced the problem
> > with accumulating of dying memory cgroups. This is a serious problem:
> > on most of our machines we've seen thousands on dying cgroups, and
> > the corresponding memory footprint was measured in hundreds of megabytes.
> > The problem was also independently discovered by other companies.
> > 
> > The fixes were reverted due to xfs regression investigated by Dave Chinner.
> 
> Context: it wasn't one regression that I investigated. We had
> multiple bug reports with different regressions, and I saw evidence
> on my own machines that something wasn't right because of the change
> in the IO patterns in certain benchmarks. Some of the problems were
> caused by the first patch, some were caused by the second patch.
> 
> This also affects ext4 (i.e. it's a general problem, not an XFS
> problem) as has been reported a couple of times, including this one
> overnight:
> 
> https://lore.kernel.org/lkml/4113759.4IQ3NfHFaI@stwm.de/
> 
> > Simultaneously we've seen a very small (0.18%) cpu regression on some hosts,
> > which caused Rik van Riel to propose a patch [3], which aimed to fix the
> > regression. The idea is to accumulate small memory pressure and apply it
> > periodically, so that we don't overscan small shrinker lists. According
> > to Jan Kara's data [4], Rik's patch partially fixed the regression,
> > but not entirely.
> 
> Rik's patch was buggy and made an invalid assumptions about how a
> cache with a small number of freeable objects is a "small cache", so
> any comaprisons made with it are essentially worthless.
> 
> More details about the problems with the patch and approach here:
> 
> https://lore.kernel.org/stable/20190131224905.GN31397@rh/

So, long story short, the dying memcg problem is actually a
regression caused by previous shrinker changes, and the change in
4.18-rc1 was an attempt to fix the regression (which caused evenmore
widespread problems) and Rik's patch is another different attempt to
fix the original regression.


The original regression broke the small scan accumulation algorithm
in the shrinker, but I don't think that anyone actually understood
how this was supposed to work, and so the attempts to fix the
regression haven't actually restored the original behaviour. The
problematic commit:

9092c71bb724 ("mm: use sc->priority for slab shrink targets")

which was included in 4.16-rc1.

This changed the delta calculation so that any cache with fewer than
4096 freeable objects would now end up with a zero delta count.
This means caches with few freeable objects had no scan pressure at
all and nothing would get accumulated for later scanning. Prior to
this change, such scans would result in single digit scan counts,
which would get deferred and accumulated until the overall delta +
deferred count went over the batch size and it would scan the cache.

IOWs, the above commit prevented accumulation of light pressure on
caches and they'd only get scanned when extreme memory pressure
occurs.

The fix that went into 4.18-rc1 changed this to make the minimum scan
pressure the batch size, so instead of zero pressure, it put
extreme pressure on small caches. What wasn't used in a scan got
deferred, and so the shrinker would wind up and keep heavy pressure
on the cache even when there was only light memory pressure. IOWs,
instead of having a scan count in the single digits under light
memory pressure, those caches now had continual scan counts 1-2
orders of magnitude larger, i.e. way more aggressive than in 4.15 and
older kernels. Hence it introduced a different, more severe set of
regressions than the one it was trying to fix.

IOWs, the dying memcg issue is irrelevant here. The real problem
that needs fixing is a shrinker regression that occurred in 4.16-rc1,
not 4.18-rc1.

I'm just going to fix the original regression in the shrinker
algorithm by restoring the gradual accumulation behaviour, and this
whole series of problems can be put to bed.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [LSF/MM TOPIC] dying memory cgroups and slab reclaim issues
  2019-02-20  5:50     ` Dave Chinner
@ 2019-02-20  7:27       ` Dave Chinner
  -1 siblings, 0 replies; 33+ messages in thread
From: Dave Chinner @ 2019-02-20  7:27 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: lsf-pc, linux-fsdevel, linux-mm, riel, dchinner, guroan,
	Kernel Team, hannes

On Wed, Feb 20, 2019 at 04:50:31PM +1100, Dave Chinner wrote:
> I'm just going to fix the original regression in the shrinker
> algorithm by restoring the gradual accumulation behaviour, and this
> whole series of problems can be put to bed.

Something like this lightly smoke-tested patch below. It may be
slightly more aggressive than the original code for really small
freeable values (i.e. < 100) but otherwise should be roughly
equivalent to the historic accumulation behaviour.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

mm: fix shrinker scan accumulation regression

From: Dave Chinner <dchinner@redhat.com>

Commit 9092c71bb724 ("mm: use sc->priority for slab shrink targets")
in 4.16-rc1 broke the shrinker scan accumulation algorithm for small
freeable caches. The algorithm kicks in when there isn't enough work
to run a full batch scan - the shrinker is supposed to defer that
work until a future shrinker call. The deferred work is then fed
back into the work to do on the next call, and if the total is
larger than a batch the scan will run. This is an efficiency
mechanism that prevents repeated small scans of caches from
consuming too much CPU.

It also has the effect of ensuring that caches with small numbers of
freeable objects are slowly scanned. While an individual shrinker
scan may not result in work to do, if the cache is queried enough
times then the work will accumulate and the cache will be scanned
and freed. This protects small and otherwise in-use caches from
excessive scanning under light memory pressure, but keeps cross-cache
reclaim amounts fairly balanced over time.

The change in the above commit broke all this with the way it
calculates the delta value. Instead of calculating the delta to keep
the freeable:scanned shrinker count in the same ratio as the previous
page cache freeable:scanned pass, it calculates the delta from the
reclaim priority based on a logarithmic scale and applies this to
the freeable count before anything else is done.

This means that the resolution of the delta calculation is (1 <<
priority), and so for low priority reclaim the calculated delta does
not go above zero unless there are at least 4096 freeable objects.
This completely defeats the accumulation of work for caches with few
freeable objects.

Old code (ignoring seeks scaling):

	delta ~= (pages_scanned * freeable) / pages_freeable

	Accumulation resolution: pages_scanned / pages_freeable

4.16 code:

	delta ~= freeable >> priority

	Accumulation resolution: (1 << priority)

IOWs, the old code would almost always result in delta being
non-zero when freeable was non-zero, and hence it would always
accumulate scan counts even on the smallest of freeable caches,
regardless of the reclaim pressure being applied. The new code won't
accumulate or scan the smallest of freeable caches until it reaches
priority 1. This is extreme memory pressure, just before the OOM
killer is invoked.

We want to retain the priority mechanism to scale the work the
shrinker does, but we also want to ensure it accumulates
appropriately, too. In this case, offset the delta by
ilog2(freeable) so that there is a slow accumulation of work. Use
this regardless of the delta calculated so that we don't decrease
the amount of work as the priority increases past the point where
delta is non-zero.

New code:

	delta ~= ilog2(freeable) + (freeable >> priority)

	Accumulation resolution: ilog2(freeable)

Typical delta calculations from different code (ignoring seek
scaling), keeping in mind that batch size is 128 by default and 1024
for superblock shrinkers.

freeable = 1

ratio	4.15	priority	4.16	4.18		new
1:100	  1	   12		0	batch		1
1:32	  1	    9		0	batch		1
1:12	  1	    6		0	batch		1
1:6	  1	    3		0	batch		1
1:1	  1	    1		1	batch		1

freeable = 10

ratio	4.15	priority	4.16	4.18		new
1:100	  1	   12		0	batch		3
1:32	  1	    9		0	batch		3
1:12	  1	    6		0	batch		3
1:6	  2	    3		0	batch		3
1:1	 10	    1		10	batch		10

freeable = 100

ratio	4.15	priority	4.16	4.18		new
1:100	  1	   12		0	batch		6
1:32	  3	    9		0	batch		6
1:12	  6	    6		1	batch		7
1:6	 16	    3		12	batch		18
1:1	100	    1		100	batch		100

freeable = 1000

ratio	4.15	priority	4.16	4.18		new
1:100	 10	   12		0	batch		9
1:32	 32	    9		1	batch		10
1:12	 60	    6		16	batch		26
1:6	160	    3		120	batch		130
1:1	1000	    1		1000	max(1000,batch)	1000

freeable = 10000

ratio	4.15	priority	4.16	4.18		new
1:100	 100	   12		2	batch		16
1:32	 320	    9		19	batch		35
1:12	 600	    6		160	max(160,batch)	175
1:6	1600	    3		1250	1250		1265
1:1	10000	    1		10000	10000		10000

It's pretty clear why the 4.18 algorithm caused such a problem - it
massively changed the balance of reclaim when all that was actually
required was a small tweak to always accumulate a small delta for
caches with very small freeable counts.

Fixes: 9092c71bb724 ("mm: use sc->priority for slab shrink targets")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 mm/vmscan.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index e979705bbf32..9cc58e9f1f54 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -479,7 +479,16 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 
 	total_scan = nr;
 	if (shrinker->seeks) {
-		delta = freeable >> priority;
+		/*
+		 * Use a small non-zero offset for delta so that if the scan
+		 * priority is low we always accumulate some pressure on caches
+		 * that have few freeable objects in them. This allows light
+		 * memory pressure to turn over caches with few freeable objects
+		 * slowly without the need for memory pressure priority to wind
+		 * up to the point where (freeable >> priority) is non-zero.
+		 */
+		delta = ilog2(freeable);
+		delta += freeable >> priority;
 		delta *= 4;
 		do_div(delta, shrinker->seeks);
 	} else {

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [LSF/MM TOPIC] dying memory cgroups and slab reclaim issues
  2019-02-20  7:27       ` Dave Chinner
@ 2019-02-20 16:20         ` Johannes Weiner
  -1 siblings, 0 replies; 33+ messages in thread
From: Johannes Weiner @ 2019-02-20 16:20 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Roman Gushchin, lsf-pc, linux-fsdevel, linux-mm, riel, dchinner,
	guroan, Kernel Team

On Wed, Feb 20, 2019 at 06:27:07PM +1100, Dave Chinner wrote:
> freeable = 1
> 
> ratio	4.15	priority	4.16	4.18		new
> 1:100	  1	   12		0	batch		1
> 1.32	  1	    9		0	batch		1
> 1:12	  1	    6		0	batch		1
> 1:6	  1	    3		0	batch		1
> 1:1	  1	    1		1	batch		1

> @@ -479,7 +479,16 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  
>  	total_scan = nr;
>  	if (shrinker->seeks) {
> -		delta = freeable >> priority;
> +		/*
> +		 * Use a small non-zero offset for delta so that if the scan
> +		 * priority is low we always accumulate some pressure on caches
> +		 * that have few freeable objects in them. This allows light
> +		 * memory pressure to turn over caches with few freeable objects
> +		 * slowly without the need for memory pressure priority to wind
> +		 * up to the point where (freeable >> priority) is non-zero.
> +		 */
> +		delta = ilog2(freeable);

The idea makes sense to me, but log2 fails us when freeable is
1. fls() should work, though.

> +		delta += freeable >> priority;
>  		delta *= 4;
>  		do_div(delta, shrinker->seeks);
>  	} else {

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [LSF/MM TOPIC] dying memory cgroups and slab reclaim issues
  2019-02-20  7:27       ` Dave Chinner
@ 2019-02-21 22:46         ` Roman Gushchin
  -1 siblings, 0 replies; 33+ messages in thread
From: Roman Gushchin @ 2019-02-21 22:46 UTC (permalink / raw)
  To: Dave Chinner
  Cc: lsf-pc, linux-fsdevel, linux-mm, riel, dchinner, guroan,
	Kernel Team, hannes

On Wed, Feb 20, 2019 at 06:27:07PM +1100, Dave Chinner wrote:
> On Wed, Feb 20, 2019 at 04:50:31PM +1100, Dave Chinner wrote:
> > I'm just going to fix the original regression in the shrinker
> > algorithm by restoring the gradual accumulation behaviour, and this
> > whole series of problems can be put to bed.
> 
> Something like this lightly smoke-tested patch below. It may be
> slightly more aggressive than the original code for really small
> freeable values (i.e. < 100) but otherwise should be roughly
> equivalent to the historic accumulation behaviour.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> mm: fix shrinker scan accumulation regression
> 
> From: Dave Chinner <dchinner@redhat.com>

JFYI: I'm testing this patch in our environment for fixing
the memcg memory leak.

It will take a couple of days to get reliable results.

Thanks!

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [LSF/MM TOPIC] dying memory cgroups and slab reclaim issues
  2019-02-21 22:46         ` Roman Gushchin
@ 2019-02-22  1:48           ` Rik van Riel
  -1 siblings, 0 replies; 33+ messages in thread
From: Rik van Riel @ 2019-02-22  1:48 UTC (permalink / raw)
  To: Roman Gushchin, Dave Chinner
  Cc: lsf-pc, linux-fsdevel, linux-mm, dchinner, guroan, Kernel Team, hannes

[-- Attachment #1: Type: text/plain, Size: 1156 bytes --]

On Thu, 2019-02-21 at 17:46 -0500, Roman Gushchin wrote:
> On Wed, Feb 20, 2019 at 06:27:07PM +1100, Dave Chinner wrote:
> > On Wed, Feb 20, 2019 at 04:50:31PM +1100, Dave Chinner wrote:
> > > I'm just going to fix the original regression in the shrinker
> > > algorithm by restoring the gradual accumulation behaviour, and
> > > this
> > > whole series of problems can be put to bed.
> > 
> > Something like this lightly smoke-tested patch below. It may be
> > slightly more aggressive than the original code for really small
> > freeable values (i.e. < 100) but otherwise should be roughly
> > equivalent to the historic accumulation behaviour.
> > 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> > 
> > mm: fix shrinker scan accumulation regression
> > 
> > From: Dave Chinner <dchinner@redhat.com>
> 
> JFYI: I'm testing this patch in our environment for fixing
> the memcg memory leak.
> 
> It will take a couple of days to get reliable results.

Just to clarify, is this test with fls instead of ilog2,
so the last item in a slab cache can get reclaimed as
well?

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [LSF/MM TOPIC] dying memory cgroups and slab reclaim issues
  2019-02-22  1:48           ` Rik van Riel
@ 2019-02-22  1:57             ` Roman Gushchin
  -1 siblings, 0 replies; 33+ messages in thread
From: Roman Gushchin @ 2019-02-22  1:57 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Dave Chinner, lsf-pc, linux-fsdevel, linux-mm, dchinner, guroan,
	Kernel Team, hannes

On Thu, Feb 21, 2019 at 08:48:27PM -0500, Rik van Riel wrote:
> On Thu, 2019-02-21 at 17:46 -0500, Roman Gushchin wrote:
> > On Wed, Feb 20, 2019 at 06:27:07PM +1100, Dave Chinner wrote:
> > > On Wed, Feb 20, 2019 at 04:50:31PM +1100, Dave Chinner wrote:
> > > > I'm just going to fix the original regression in the shrinker
> > > > algorithm by restoring the gradual accumulation behaviour, and
> > > > this
> > > > whole series of problems can be put to bed.
> > > 
> > > Something like this lightly smoke-tested patch below. It may be
> > > slightly more aggressive than the original code for really small
> > > freeable values (i.e. < 100) but otherwise should be roughly
> > > equivalent to the historic accumulation behaviour.
> > > 
> > > Cheers,
> > > 
> > > Dave.
> > > -- 
> > > Dave Chinner
> > > david@fromorbit.com
> > > 
> > > mm: fix shrinker scan accumulation regression
> > > 
> > > From: Dave Chinner <dchinner@redhat.com>
> > 
> > JFYI: I'm testing this patch in our environment for fixing
> > the memcg memory leak.
> > 
> > It will take a couple of days to get reliable results.
> 
> Just to clarify, is this test with fls instead of ilog2,
> so the last item in a slab cache can get reclaimed as
> well?

I'm testing both versions.

Thanks!

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [LSF/MM TOPIC] dying memory cgroups and slab reclaim issues
  2019-02-21 22:46         ` Roman Gushchin
@ 2019-02-28 20:30           ` Roman Gushchin
  -1 siblings, 0 replies; 33+ messages in thread
From: Roman Gushchin @ 2019-02-28 20:30 UTC (permalink / raw)
  To: Dave Chinner
  Cc: lsf-pc, linux-fsdevel, linux-mm, riel, dchinner, guroan,
	Kernel Team, hannes

On Thu, Feb 21, 2019 at 02:46:17PM -0800, Roman Gushchin wrote:
> On Wed, Feb 20, 2019 at 06:27:07PM +1100, Dave Chinner wrote:
> > On Wed, Feb 20, 2019 at 04:50:31PM +1100, Dave Chinner wrote:
> > > I'm just going to fix the original regression in the shrinker
> > > algorithm by restoring the gradual accumulation behaviour, and this
> > > whole series of problems can be put to bed.
> > 
> > Something like this lightly smoke-tested patch below. It may be
> > slightly more aggressive than the original code for really small
> > freeable values (i.e. < 100) but otherwise should be roughly
> > equivalent to the historic accumulation behaviour.
> > 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> > 
> > mm: fix shrinker scan accumulation regression
> > 
> > From: Dave Chinner <dchinner@redhat.com>
> 
> JFYI: I'm testing this patch in our environment for fixing
> the memcg memory leak.
> 
> It will take a couple of days to get reliable results.
> 

So unfortunately the proposed patch is not solving the dying memcg reclaim
issue. I've tested it as is, with s/ilog2()/fls()/ as suggested by Johannes,
and also with a more aggressive zero-seek slab reclaim (always scanning
at least SHRINK_BATCH for zero-seek shrinkers). In all cases the number
of outstanding memory cgroups grew almost linearly with time and didn't show
any signs of plateauing.

Thanks!

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [LSF/MM TOPIC] dying memory cgroups and slab reclaim issues
  2019-02-28 20:30           ` Roman Gushchin
@ 2019-02-28 21:30             ` Dave Chinner
  -1 siblings, 0 replies; 33+ messages in thread
From: Dave Chinner @ 2019-02-28 21:30 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: lsf-pc, linux-fsdevel, linux-mm, riel, dchinner, guroan,
	Kernel Team, hannes

On Thu, Feb 28, 2019 at 08:30:49PM +0000, Roman Gushchin wrote:
> On Thu, Feb 21, 2019 at 02:46:17PM -0800, Roman Gushchin wrote:
> > On Wed, Feb 20, 2019 at 06:27:07PM +1100, Dave Chinner wrote:
> > > On Wed, Feb 20, 2019 at 04:50:31PM +1100, Dave Chinner wrote:
> > > > I'm just going to fix the original regression in the shrinker
> > > > algorithm by restoring the gradual accumulation behaviour, and this
> > > > whole series of problems can be put to bed.
> > > 
> > > Something like this lightly smoke-tested patch below. It may be
> > > slightly more aggressive than the original code for really small
> > > freeable values (i.e. < 100) but otherwise should be roughly
> > > equivalent to the historic accumulation behaviour.
> > > 
> > > Cheers,
> > > 
> > > Dave.
> > > -- 
> > > Dave Chinner
> > > david@fromorbit.com
> > > 
> > > mm: fix shrinker scan accumulation regression
> > > 
> > > From: Dave Chinner <dchinner@redhat.com>
> > 
> > JFYI: I'm testing this patch in our environment for fixing
> > the memcg memory leak.
> > 
> > It will take a couple of days to get reliable results.
> > 
> 
> So unfortunately the proposed patch is not solving the dying memcg reclaim
> issue. I've tested it as is, with s/ilog2()/fls()/ as suggested by Johannes,
> and also with a more aggressive zero-seek slab reclaim (always scanning
> at least SHRINK_BATCH for zero-seek shrinkers).

Which makes sense if it's inodes and/or dentries shared across
multiple memcgs and actively referenced by non-owner memcgs that
prevent dying memcg reclaim. i.e. the shrinkers will not reclaim
frequently referenced objects unless there is extreme memory
pressure put on them.

> In all cases the number
> of outstanding memory cgroups grew almost linearly with time and didn't show
> any signs of plateauing.

What happened to the amount of memory pinned by those dying memcgs?
Did that change in any way? Did the rate of reclaim of objects
referencing dying memcgs improve? What type of objects are still
pinning those dying memcgs? Did you run any traces to see how big
those pinned caches were and how much deferral and scanning work was
actually being done on them?

i.e. if all you measured is the number of memcgs over time, then we
don't have any information that tells us whether this patch has had
any effect on the reclaimable memory footprint of those dying memcgs
or what is actually pinning them in memory.

IOWs, we need to know if this patch reduces the dying memcg
references down to just the objects that non-owner memcgs are
keeping active in cache and hence preventing the dying memcgs from
being freed. If this patch does that, then the shrinkers are doing
exactly what they should be doing, and the remaining problem to
solve is reparenting actively referenced objects pinning the dying
memcgs...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [LSF/MM TOPIC] dying memory cgroups and slab reclaim issues
  2019-02-28 21:30             ` Dave Chinner
@ 2019-02-28 22:29               ` Roman Gushchin
  -1 siblings, 0 replies; 33+ messages in thread
From: Roman Gushchin @ 2019-02-28 22:29 UTC (permalink / raw)
  To: Dave Chinner
  Cc: lsf-pc, linux-fsdevel, linux-mm, riel, dchinner, guroan,
	Kernel Team, hannes

On Fri, Mar 01, 2019 at 08:30:32AM +1100, Dave Chinner wrote:
> On Thu, Feb 28, 2019 at 08:30:49PM +0000, Roman Gushchin wrote:
> > On Thu, Feb 21, 2019 at 02:46:17PM -0800, Roman Gushchin wrote:
> > > On Wed, Feb 20, 2019 at 06:27:07PM +1100, Dave Chinner wrote:
> > > > On Wed, Feb 20, 2019 at 04:50:31PM +1100, Dave Chinner wrote:
> > > > > I'm just going to fix the original regression in the shrinker
> > > > > algorithm by restoring the gradual accumulation behaviour, and this
> > > > > whole series of problems can be put to bed.
> > > > 
> > > > Something like this lightly smoke tested patch below. It may be
> > > > slightly more agressive than the original code for really small
> > > > freeable values (i.e. < 100) but otherwise should be roughly
> > > > equivalent to historic accumulation behaviour.
> > > > 
> > > > Cheers,
> > > > 
> > > > Dave.
> > > > -- 
> > > > Dave Chinner
> > > > david@fromorbit.com
> > > > 
> > > > mm: fix shrinker scan accumulation regression
> > > > 
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > JFYI: I'm testing this patch in our environment for fixing
> > > the memcg memory leak.
> > > 
> > > It will take a couple of days to get reliable results.
> > > 
> > 
> > So unfortunately the proposed patch is not solving the dying memcg reclaim
> > issue. I've tested it as is, with s/ilog2()/fls()/, as suggested by Johannes,
> > and also with a more aggressive zero-seek slab reclaim (always scanning
> > at least SHRINK_BATCH for zero-seek shrinkers).
> 
> Which makes sense if it's inodes and/or dentries shared across
> multiple memcgs and actively referenced by non-owner memcgs that
> prevent dying memcg reclaim. i.e. the shrinkers will not reclaim
> frequently referenced objects unless there is extreme memory
> pressure put on them.
> 
> > In all cases the number
> > of outstanding memory cgroups grew almost linearly with time and didn't show
> > any signs of plateauing.
> 
> What happened to the amount of memory pinned by those dying memcgs?
> Did that change in any way? Did the rate of reclaim of objects
> referencing dying memcgs improve? What type of objects are still
> pinning those dying memcgs? Did you run any traces to see how big
> those pinned caches were and how much deferral and scanning work was
> actually being done on them?

The amount of pinned memory is approximately proportional to the number
of dying cgroups; in other words, it also grows almost linearly.
The rate of reclaim is better than without any patches, and it's
approximately on par with a version with Rik's patches.

> 
> i.e. if all you measured is the number of memcgs over time, then we
> don't have any information that tells us whether this patch has had
> any effect on the reclaimable memory footprint of those dying memcgs
> or what is actually pinning them in memory.

I'm not saying that the patch is bad, I'm saying it's not sufficient
in our environment.

> 
> IOWs, we need to know if this patch reduces the dying memcg
> references down to just the objects that non-owner memcgs are
> keeping active in cache and hence preventing the dying memcgs from
> being freed. If this patch does that, then the shrinkers are doing
> exactly what they should be doing, and the remaining problem to
> solve is reparenting actively referenced objects pinning the dying
> memcgs...

Yes, I agree. I'll take a look.

Thanks!

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [LSF/MM TOPIC] dying memory cgroups and slab reclaim issues
  2019-02-20  4:33         ` Dave Chinner
  2019-02-20  5:31           ` Roman Gushchin
@ 2019-02-20 17:00           ` Rik van Riel
  1 sibling, 0 replies; 33+ messages in thread
From: Rik van Riel @ 2019-02-20 17:00 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Roman Gushchin, lsf-pc, linux-mm, mhocko, guroan, Kernel Team, hannes

[-- Attachment #1: Type: text/plain, Size: 1604 bytes --]

On Wed, 2019-02-20 at 15:33 +1100, Dave Chinner wrote:
> On Tue, Feb 19, 2019 at 09:06:07PM -0500, Rik van Riel wrote:
> > 
> > You are overlooking the fact that an inode loaded
> > into memory by one cgroup (which is getting torn
> > down) may be in active use by processes in other
> > cgroups.
> 
> No I am not. I am fully aware of this problem (have been since memcg
> day one because of the list_lru tracking issues Glauba and I had to
> sort out when we first realised shared inodes could occur). Sharing
> inodes across cgroups also causes "complexity" in things like cgroup
> writeback control (which cgroup dirty list tracks and does writeback
> of shared inodes?) and so on. Shared inodes across cgroups are
> considered the exception rather than the rule, and they are treated
> in many places with algorithms that assert "this is rare, if it's
> common we're going to be in trouble"....

It is extremely common to have files used from
multiple cgroups. For example:
- The main workload generates a log file, which
  is parsed from a (lower priority) system cgroup.
- A backup program reads files that are also accessed
  by the main workload.
- Systemd restarts a program that runs in a cgroup,
  into a new cgroup. This ends up touching many/most of 
  the same files that were in use in the old instance 
  of the program, running in the old cgroup.

With cgroup use being largely automated, instead of
set up manually, it is becoming more and more common
to see systems where dozens of cgroups are created
and torn down daily.

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [LSF/MM TOPIC] dying memory cgroups and slab reclaim issues
  2019-02-20  4:33         ` Dave Chinner
@ 2019-02-20  5:31           ` Roman Gushchin
  2019-02-20 17:00           ` Rik van Riel
  1 sibling, 0 replies; 33+ messages in thread
From: Roman Gushchin @ 2019-02-20  5:31 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Rik van Riel, lsf-pc, linux-mm, mhocko, guroan, Kernel Team, hannes

On Wed, Feb 20, 2019 at 03:33:32PM +1100, Dave Chinner wrote:
> On Tue, Feb 19, 2019 at 09:06:07PM -0500, Rik van Riel wrote:
> > On Wed, 2019-02-20 at 10:26 +1100, Dave Chinner wrote:
> > > On Tue, Feb 19, 2019 at 12:31:10PM -0500, Rik van Riel wrote:
> > > > On Tue, 2019-02-19 at 13:04 +1100, Dave Chinner wrote:
> > > > > On Tue, Feb 19, 2019 at 12:31:45AM +0000, Roman Gushchin wrote:
> > > > > > Sorry, resending with the fixed to/cc list. Please, ignore the
> > > > > > first letter.
> > > > > 
> > > > > Please resend again with linux-fsdevel on the cc list, because
> > > > > this
> > > > > isn't a MM topic given the regressions from the shrinker patches
> > > > > have all been on the filesystem side of the shrinkers....
> > > > 
> > > > It looks like there are two separate things going on here.
> > > > 
> > > > The first is an MM issue, one of potentially leaking memory
> > > > by not scanning slabs with few items on them,
> > > 
> > > We don't leak memory. Slabs with very few freeable items on them
> > > just don't get scanned when there is only light memory pressure.
> > > That's /by design/ and it is behaviour we've tried hard over many
> > > years to preserve. Once memory pressure ramps up, they'll be
> > > scanned just like all the other slabs.
> > 
> > That may have been fine before cgroups, but when
> > a system can have (tens of) thousands of slab
> > caches, we DO want to scan slab caches with few
> > freeable items in them.
> > 
> > The threshold for "few items" is 4096, not some
> > actually tiny number. That can add up to a lot
> > of memory if a system has hundreds of cgroups.
> 
> That doesn't sound right. The threshold is supposed to be low single
> digits based on the amount of pressure on the page cache, and it's
> accumulated by deferral until the batch threshold (128) is exceeded.
> 
> Ohhhhh. The penny just dropped - this whole sorry saga has been
> triggered because people are chasing a regression nobody has
> recognised as a regression because they don't actually understand
> how the shrinker algorithms are /supposed/ to work.
> 
> And I'm betting that it's been caused by some other recent FB
> shrinker change.....
> 
> Yup, there it is:
> 
> commit 9092c71bb724dba2ecba849eae69e5c9d39bd3d2
> Author: Josef Bacik <jbacik@fb.com>
> Date:   Wed Jan 31 16:16:26 2018 -0800
> 
>     mm: use sc->priority for slab shrink targets
> 
> ....
>     We don't need to know exactly how many pages each shrinker represents,
>     it's objects are all the information we need.  Making this change allows
>     us to place an appropriate amount of pressure on the shrinker pools for
>     their relative size.
> ....
> 
> -       delta = (4 * nr_scanned) / shrinker->seeks;
> -       delta *= freeable;
> -       do_div(delta, nr_eligible + 1);
> +       delta = freeable >> priority;
> +       delta *= 4;
> +       do_div(delta, shrinker->seeks);
> 
> 
> So, prior to this change:
> 
> 	delta ~= (4 * nr_scanned * freeable) / nr_eligible
> 
> IOWs, the ratio of nr_scanned:nr_eligible determined the resolution
> of scan, and that meant delta could (and did!) have values in the
> single digit range.
> 
> The current code introduced by the above patch does:
> 
> 	delta ~= (freeable >> priority) * 4
> 
> Which, as you state, has a threshold of freeable > 4096 to trigger
> scanning under low memory pressure.
> 
> So, that's the original regression that people are trying to fix
> (root cause analysis FTW).  It was introduced in 4.16-rc1. The
> attempts to fix this regression (i.e. the lack of low free object
> shrinker scanning) were introduced into 4.18-rc1, which caused even
> worse regressions and led us directly to this point.
> 
> Ok, now I see where the real problem people are chasing is, I'll go
> write a patch to fix it.

Sounds good, I'll check if it can prevent the memcg leak.
If it works, we're fine.

> 
> > Roman's patch, which reclaimed small slabs extra
> > aggressively, introduced issues, but reclaiming
> > small slabs at the same pressure/object as large
> > slabs seems like the desired behavior.
> 
> It's still broken. Both of your patches do the wrong thing because
> they don't address the resolution and accumulation regression and
> instead add another layer of heuristics over the top of the delta
> calculation to hide the lack of resolution.
> 
> > > That's a cgroup referencing and teardown problem, not a memory
> > > reclaim algorithm problem. To treat it as a memory reclaim problem
> > > smears memcg internal implementation bogosities all over the
> > > independent reclaim infrastructure. It violates the concepts of
> > > isolation, modularity, independence, abstraction layering, etc.
> > 
> > You are overlooking the fact that an inode loaded
> > into memory by one cgroup (which is getting torn
> > down) may be in active use by processes in other
> > cgroups.
> 
> No I am not. I am fully aware of this problem (have been since memcg
> day one because of the list_lru tracking issues Glauber and I had to
> sort out when we first realised shared inodes could occur). Sharing
> inodes across cgroups also causes "complexity" in things like cgroup
> writeback control (which cgroup dirty list tracks and does writeback
> of shared inodes?) and so on. Shared inodes across cgroups are
> considered the exception rather than the rule, and they are treated
> in many places with algorithms that assert "this is rare, if it's
> common we're going to be in trouble"....

No, even if sharing inodes can be advertised as a bad practice and
may lead to some sub-optimal results, it shouldn't trigger obvious
kernel issues like memory leaks. Otherwise it becomes a security concern.

Also, in practice, it's common to have a main workload and a couple
of supplementary processes (e.g. monitoring) in sibling cgroups,
which are sharing some inodes (e.g. logs).

Thanks!


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [LSF/MM TOPIC] dying memory cgroups and slab reclaim issues
  2019-02-20  2:06       ` Rik van Riel
@ 2019-02-20  4:33         ` Dave Chinner
  2019-02-20  5:31           ` Roman Gushchin
  2019-02-20 17:00           ` Rik van Riel
  0 siblings, 2 replies; 33+ messages in thread
From: Dave Chinner @ 2019-02-20  4:33 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Roman Gushchin, lsf-pc, linux-mm, mhocko, guroan, Kernel Team, hannes

On Tue, Feb 19, 2019 at 09:06:07PM -0500, Rik van Riel wrote:
> On Wed, 2019-02-20 at 10:26 +1100, Dave Chinner wrote:
> > On Tue, Feb 19, 2019 at 12:31:10PM -0500, Rik van Riel wrote:
> > > On Tue, 2019-02-19 at 13:04 +1100, Dave Chinner wrote:
> > > > On Tue, Feb 19, 2019 at 12:31:45AM +0000, Roman Gushchin wrote:
> > > > > Sorry, resending with the fixed to/cc list. Please, ignore the
> > > > > first letter.
> > > > 
> > > > Please resend again with linux-fsdevel on the cc list, because
> > > > this
> > > > isn't a MM topic given the regressions from the shrinker patches
> > > > have all been on the filesystem side of the shrinkers....
> > > 
> > > It looks like there are two separate things going on here.
> > > 
> > > The first is an MM issue, one of potentially leaking memory
> > > by not scanning slabs with few items on them,
> > 
> > We don't leak memory. Slabs with very few freeable items on them
> > just don't get scanned when there is only light memory pressure.
> > That's /by design/ and it is behaviour we've tried hard over many
> > years to preserve. Once memory pressure ramps up, they'll be
> > scanned just like all the other slabs.
> 
> That may have been fine before cgroups, but when
> a system can have (tens of) thousands of slab
> caches, we DO want to scan slab caches with few
> freeable items in them.
> 
> The threshold for "few items" is 4096, not some
> actually tiny number. That can add up to a lot
> of memory if a system has hundreds of cgroups.

That doesn't sound right. The threshold is supposed to be low single
digits based on the amount of pressure on the page cache, and it's
accumulated by deferral until the batch threshold (128) is exceeded.

Ohhhhh. The penny just dropped - this whole sorry saga has been
triggered because people are chasing a regression nobody has
recognised as a regression because they don't actually understand
how the shrinker algorithms are /supposed/ to work.

And I'm betting that it's been caused by some other recent FB
shrinker change.....

Yup, there it is:

commit 9092c71bb724dba2ecba849eae69e5c9d39bd3d2
Author: Josef Bacik <jbacik@fb.com>
Date:   Wed Jan 31 16:16:26 2018 -0800

    mm: use sc->priority for slab shrink targets

....
    We don't need to know exactly how many pages each shrinker represents,
    it's objects are all the information we need.  Making this change allows
    us to place an appropriate amount of pressure on the shrinker pools for
    their relative size.
....

-       delta = (4 * nr_scanned) / shrinker->seeks;
-       delta *= freeable;
-       do_div(delta, nr_eligible + 1);
+       delta = freeable >> priority;
+       delta *= 4;
+       do_div(delta, shrinker->seeks);


So, prior to this change:

	delta ~= (4 * nr_scanned * freeable) / nr_eligible

IOWs, the ratio of nr_scanned:nr_eligible determined the resolution
of scan, and that meant delta could (and did!) have values in the
single digit range.

The current code introduced by the above patch does:

	delta ~= (freeable >> priority) * 4

Which, as you state, has a threshold of freeable > 4096 to trigger
scanning under low memory pressure.

So, that's the original regression that people are trying to fix
(root cause analysis FTW).  It was introduced in 4.16-rc1. The
attempts to fix this regression (i.e. the lack of low free object
shrinker scanning) were introduced into 4.18-rc1, which caused even
worse regressions and led us directly to this point.

Ok, now I see where the real problem people are chasing is, I'll go
write a patch to fix it.

> Roman's patch, which reclaimed small slabs extra
> aggressively, introduced issues, but reclaiming
> small slabs at the same pressure/object as large
> slabs seems like the desired behavior.

It's still broken. Both of your patches do the wrong thing because
they don't address the resolution and accumulation regression and
instead add another layer of heuristics over the top of the delta
calculation to hide the lack of resolution.

> > That's a cgroup referencing and teardown problem, not a memory
> > reclaim algorithm problem. To treat it as a memory reclaim problem
> > smears memcg internal implementation bogosities all over the
> > independent reclaim infrastructure. It violates the concepts of
> > isolation, modularity, independence, abstraction layering, etc.
> 
> You are overlooking the fact that an inode loaded
> into memory by one cgroup (which is getting torn
> down) may be in active use by processes in other
> cgroups.

No I am not. I am fully aware of this problem (have been since memcg
day one because of the list_lru tracking issues Glauber and I had to
sort out when we first realised shared inodes could occur). Sharing
inodes across cgroups also causes "complexity" in things like cgroup
writeback control (which cgroup dirty list tracks and does writeback
of shared inodes?) and so on. Shared inodes across cgroups are
considered the exception rather than the rule, and they are treated
in many places with algorithms that assert "this is rare, if it's
common we're going to be in trouble"....

> > > The second is the filesystem (and maybe other) shrinker
> > > functions' behavior being somewhat fragile and depending
> > > on closely on current MM behavior, potentially up to
> > > and including MM bugs.
> > > 
> > > The lack of a contract between the MM and the shrinker
> > > callbacks is a recurring issue, and something we may
> > > want to discuss in a joint session.
> > > 
> > > Some reflections on the shrinker/MM interaction:
> > > - Since all memory (in a zone) could potentially be in
> > >   shrinker pools, shrinkers MUST eventually free some
> > >   memory.
> > 
> > Which they cannot guarantee because all the objects they track may
> > be in use. As such, shrinkers have never been asked to guarantee
> > that they can free memory - they've only ever been asked to scan a
> > number of objects and attempt to free those it can during the scan.
> 
> Shrinkers may not be able to free memory NOW, and that
> is ok, but shrinkers need to guarantee that they can
> free memory eventually.

If the memory the shrinker tracks is in use, they can't free
anything. Hence there is no guarantee a shrinker can free anything
from its cache now or in the future. i.e. it can return freeable =
0 as much as it wants, and the memory reclaim infrastructure just
has to deal with the fact it can't free any memory.

This is where page reclaim would trigger the OOM killer, but that
still won't guarantee a shrinker can free anything.......

> > > - The MM should be able to deal with shrinkers doing
> > >   nothing at this call, but having some work pending 
> > >   (eg. waiting on IO completion), without getting a false
> > >   OOM kill. How can we do this best?
> > 
> > By integrating shrinkers into the same feedback loops as page
> > reclaim. i.e. to allow individual shrinker instance state to be
> > visible to the backoff/congestion decisions that the main page
> > reclaim loops make.
> > 
> > i.e. the problem here is that the shrinkers' only feedback to the main
> > loop is "how many pages were freed" as a whole. They aren't seen as
> > individual reclaim instances like zones for page reclaim, they are
> > just a huge amorphous blob that "frees some pages". i.e. They sit off
> > to
> > the side and run their own game between main loop scans and have no
> > capability to run individual backoffs, schedule kswapd to do future
> > work, don't have watermarks to provide reclaim goals, can't
> > communicate progress to the main control algorithm, etc.
> > 
> > IOWs, the first step we need to take here is to get rid of
> > the shrink_slab() abstraction and make shrinkers a first class
> > reclaim citizen....
> 
> I completely agree with that. The main reclaim loop
> should be able to make decisions like "there is plenty
> of IO in flight already, I should wait for some to
> complete instead of starting more", which requires the
> kind of visibility you have outlined.
> 
> I guess we should find some whiteboard time at LSF/MM
> to work out the details, after we have a general discussion
> on this in one of the sessions.

I won't be at LSFMM. The location is absolutely awful in terms of
travel - ~6 days travel time for a 3 day conference is just not
worthwhile.

> Given the need for things like lockless data structures
> in some subsystems, I imagine we would want to do a lot
> of the work here with callbacks, rather than standardized
> data structures.

Just another ops structure.... :P

> > > - Related to the above: stalling in the shrinker code is
> > >   unpredictable, and can take an arbitrarily long amount
> > >   of time. Is there a better way we can make reclaimers
> > >   wait for in-flight work to be completed?
> > 
> > Look at it this way: what do you need to do to implement the main
> > zone reclaim loops as individual shrinker instances? Complex
> > shrinker implementations have to deal with all the same issues as
> > the page reclaim loops (including managing cross-cache dependencies
> > and balancing). If we can't answer this question, then we can't
> > answer the questions that are being asked.
> > 
> > So, at this point, I have to ask: if we need the same functionality
> > for both page reclaim and shrinkers, then why shouldn't the goal be
> > to make page reclaim just another set of opaque shrinker
> > implementations?
> 
> I suspect each LRU could be implemented as a shrinker
> today, with some combination of function pointers and
> data pointers (in case of LRUs, to the lruvec) as control
> data structures.
.....
> The logic of which cgroups we should reclaim memory from
> right now, and which we should skip for now, is already
> handled outside of the code that calls both the LRU and
> the slab shrinking code.
> 
> In short, I see no real obstacle to unifying the two.

Neither do I, except that it's a huge amount of work and there's no
guarantee we'll be able to make anything better than what we have now....

Cheers,

Dave.
-- 
Dave Chinner
dchinner@redhat.com


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [LSF/MM TOPIC] dying memory cgroups and slab reclaim issues
  2019-02-19 23:26     ` Dave Chinner
@ 2019-02-20  2:06       ` Rik van Riel
  2019-02-20  4:33         ` Dave Chinner
  0 siblings, 1 reply; 33+ messages in thread
From: Rik van Riel @ 2019-02-20  2:06 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Roman Gushchin, lsf-pc, linux-mm, mhocko, guroan, Kernel Team, hannes

[-- Attachment #1: Type: text/plain, Size: 9172 bytes --]

On Wed, 2019-02-20 at 10:26 +1100, Dave Chinner wrote:
> On Tue, Feb 19, 2019 at 12:31:10PM -0500, Rik van Riel wrote:
> > On Tue, 2019-02-19 at 13:04 +1100, Dave Chinner wrote:
> > > On Tue, Feb 19, 2019 at 12:31:45AM +0000, Roman Gushchin wrote:
> > > > Sorry, resending with the fixed to/cc list. Please, ignore the
> > > > first letter.
> > > 
> > > Please resend again with linux-fsdevel on the cc list, because
> > > this
> > isn't an MM topic given the regressions from the shrinker patches
> > > have all been on the filesystem side of the shrinkers....
> > 
> > It looks like there are two separate things going on here.
> > 
> > The first is an MM issue: one of potentially leaking memory
> > by not scanning slabs with few items on them,
> 
> We don't leak memory. Slabs with very few freeable items on them
> just don't get scanned when there is only light memory pressure.
> That's /by design/ and it is behaviour we've tried hard over many
> years to preserve. Once memory pressure ramps up, they'll be
> scanned just like all the other slabs.

That may have been fine before cgroups, but when
a system can have (tens of) thousands of slab
caches, we DO want to scan slab caches with few
freeable items in them.

The threshold for "few items" is 4096, not some
actually tiny number. That can add up to a lot
of memory if a system has hundreds of cgroups.

Roman's patch, which reclaimed small slabs extra
aggressively, introduced issues, but reclaiming
small slabs at the same pressure/object as large
slabs seems like the desired behavior.

Waiting until "memory pressure ramps up" is very
much the wrong thing to do, since reclaim priority
is not likely to drop to a small number until the
system is under so much memory pressure that the
workloads on the system suffer noticeable slowdowns.

> > and having
> > such slabs stay around forever after the cgroup they were
> > created for has disappeared,
> 
> That's a cgroup referencing and teardown problem, not a memory
> reclaim algorithm problem. To treat it as a memory reclaim problem
> smears memcg internal implementation bogosities all over the
> independent reclaim infrastructure. It violates the concepts of
> isolation, modularity, independence, abstraction layering, etc.

You are overlooking the fact that an inode loaded
into memory by one cgroup (which is getting torn
down) may be in active use by processes in other
cgroups.

That may prevent us from tearing down all of a
cgroup's slab cache memory at cgroup destruction
time, which turns it into a reclaim problem.

> This all comes back to the fact that modifying the shrinker
> algorithms requires understanding what the shrinker implementations
> do and the constraints they operate under. It is not a "purely mm"
> discussion, and treating it as such results in regressions like the
> ones we've recently seen.

That's fair, maybe both topics need to be discussed
in a shared MM/FS session, or even a plenary session.

> > The second is the filesystem (and maybe other) shrinker
> > functions' behavior being somewhat fragile and depending
> > on closely on current MM behavior, potentially up to
> > and including MM bugs.
> > 
> > The lack of a contract between the MM and the shrinker
> > callbacks is a recurring issue, and something we may
> > want to discuss in a joint session.
> > 
> > Some reflections on the shrinker/MM interaction:
> > - Since all memory (in a zone) could potentially be in
> >   shrinker pools, shrinkers MUST eventually free some
> >   memory.
> 
> Which they cannot guarantee because all the objects they track may
> be in use. As such, shrinkers have never been asked to guarantee
> that they can free memory - they've only ever been asked to scan a
> number of objects and attempt to free those it can during the scan.

Shrinkers may not be able to free memory NOW, and that
is ok, but shrinkers need to guarantee that they can
free memory eventually.

Without that guarantee, it will be unsafe to ever place
a majority of system memory under the control of shrinker
functions, if only because the subsystems with those shrinker
functions tend to rely on the VM being able to free pages
when the pageout code is called.

> > - Shrinkers should not block kswapd from making progress.
> >   If kswapd got stuck in NFS inode writeback, and ended up
> >   not being able to free clean pages to receive network
> >   packets, that might cause a deadlock.
> 
> Same can happen if kswapd got stuck on dirty page writeback from
> pageout(). i.e. pageout() can only run from kswapd and it issues IO,
> which can then block in the IO submission path waiting for IO to
> make progress, which may require substantial amounts of memory
> allocation.
> 
> Yes, we can try to not block kswapd as much as possible just like
> page reclaim does, but the fact is kswapd is the only context where
> it is safe to do certain blocking operations to ensure memory
> reclaim can actually make progress.
> 
> i.e. the rules for blocking kswapd need to be consistent across both
> page reclaim and shrinker reclaim, and right now page reclaim can
> and does block kswapd when it is necessary for forwards progress....

Agreed, the rules should be the same for both.

It would be good to come to some sort of agreement,
or even a wish list, on what they should be.

> > - The MM should be able to deal with shrinkers doing
> >   nothing at this call, but having some work pending 
> >   (eg. waiting on IO completion), without getting a false
> >   OOM kill. How can we do this best?
> 
> By integrating shrinkers into the same feedback loops as page
> reclaim. i.e. to allow individual shrinker instance state to be
> visible to the backoff/congestion decisions that the main page
> reclaim loops make.
> 
> i.e. the problem here is that shrinkers' only feedback to the main
> loop is "how many pages were freed" as a whole. They aren't seen as
> individual reclaim instances like zones for page reclaim, they are
> just a huge amorphous blob that "frees some pages". i.e. They sit off to
> the side and run their own game between main loop scans and have no
> capability to run individual backoffs, schedule kswapd to do future
> work, don't have watermarks to provide reclaim goals, can't
> communicate progress to the main control algorithm, etc.
> 
> IOWs, the first step we need to take here is to get rid of
> the shrink_slab() abstraction and make shrinkers a first class
> reclaim citizen....

I completely agree with that. The main reclaim loop
should be able to make decisions like "there is plenty
of IO in flight already, I should wait for some to
complete instead of starting more", which requires the
kind of visibility you have outlined.

I guess we should find some whiteboard time at LSF/MM
to work out the details, after we have a general discussion
on this in one of the sessions.

Given the need for things like lockless data structures
in some subsystems, I imagine we would want to do a lot
of the work here with callbacks, rather than standardized
data structures.

> > - Related to the above: stalling in the shrinker code is
> >   unpredictable, and can take an arbitrarily long amount
> >   of time. Is there a better way we can make reclaimers
> >   wait for in-flight work to be completed?
> 
> Look at it this way: what do you need to do to implement the main
> zone reclaim loops as individual shrinker instances? Complex
> shrinker implementations have to deal with all the same issues as
> the page reclaim loops (including managing cross-cache dependencies
> and balancing). If we can't answer this question, then we can't
> answer the questions that are being asked.
> 
> So, at this point, I have to ask: if we need the same functionality
> for both page reclaim and shrinkers, then why shouldn't the goal be
> to make page reclaim just another set of opaque shrinker
> implementations?

I suspect each LRU could be implemented as a shrinker
today, with some combination of function pointers and
data pointers (in case of LRUs, to the lruvec) as control
data structures.

Each shrinker would need some callbacks for things like
"lots of work is in flight already, wait instead of starting
more".

The magic of zone balancing could easily be hidden inside
the shrinker function for lruvecs. If a pgdat is balanced,
the shrinkers for each lruvec inside that pgdat could return
that no work is needed, while if work in only one or two
memory zones is needed, the shrinkers for those lruvecs would
do work, while the shrinkers would return "no work needed"
for the other lruvecs in the same pgdat.

The scan_control and shrink_control structs would probably
need to be merged, which is no obstacle at all.

The logic of which cgroups we should reclaim memory from
right now, and which we should skip for now, is already
handled outside of the code that calls both the LRU and
the slab shrinking code.

In short, I see no real obstacle to unifying the two.

-- 
All Rights Reversed.



* Re: [LSF/MM TOPIC] dying memory cgroups and slab reclaim issues
  2019-02-19 17:31   ` Rik van Riel
  2019-02-19 17:38     ` Michal Hocko
@ 2019-02-19 23:26     ` Dave Chinner
  2019-02-20  2:06       ` Rik van Riel
  1 sibling, 1 reply; 33+ messages in thread
From: Dave Chinner @ 2019-02-19 23:26 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Roman Gushchin, lsf-pc, linux-mm, mhocko, guroan, Kernel Team, hannes

On Tue, Feb 19, 2019 at 12:31:10PM -0500, Rik van Riel wrote:
> On Tue, 2019-02-19 at 13:04 +1100, Dave Chinner wrote:
> > On Tue, Feb 19, 2019 at 12:31:45AM +0000, Roman Gushchin wrote:
> > > Sorry, resending with the fixed to/cc list. Please, ignore the
> > > first letter.
> > 
> > Please resend again with linux-fsdevel on the cc list, because this
> > isn't an MM topic given the regressions from the shrinker patches
> > have all been on the filesystem side of the shrinkers....
> 
> It looks like there are two separate things going on here.
> 
> The first is an MM issue: one of potentially leaking memory
> by not scanning slabs with few items on them,

We don't leak memory. Slabs with very few freeable items on them
just don't get scanned when there is only light memory pressure.
That's /by design/ and it is behaviour we've tried hard over many
years to preserve. Once memory pressure ramps up, they'll be
scanned just like all the other slabs.

e.g. commit 0b1fb40a3b12 ("mm: vmscan: shrink all slab objects if
tight on memory") makes this commentary:

    [....] That said, this
    patch shouldn't change the vmscan behaviour if the memory pressure is
    low, but if we are tight on memory, we will do our best by trying to
    reclaim all available objects, which sounds reasonable.

Which is essentially how we've tried to implement shrinker reclaim
for a long, long time (bugs notwithstanding).

> and having
> such slabs stay around forever after the cgroup they were
> created for has disappeared,

That's a cgroup referencing and teardown problem, not a memory
reclaim algorithm problem. To treat it as a memory reclaim problem
smears memcg internal implementation bogosities all over the
independent reclaim infrastructure. It violates the concepts of
isolation, modularity, independence, abstraction layering, etc.

> and the other of various other
> bugs with shrinker invocation behavior (like the nr_deferred
> fixes you posted a patch for). I believe these are MM topics.

Except they interact directly with external shrinker behaviour. The
conditions of deferral and the problems it is solving are a direct
response to shrinker implementation constraints (e.g. GFP_NOFS
deadlock avoidance for filesystems). i.e. we can't talk about the
deferral algorithm without considering why work is deferred, how much
work should be deferred, when it may be safe/best to execute the
deferred work, etc.

This all comes back to the fact that modifying the shrinker
algorithms requires understanding what the shrinker implementations
do and the constraints they operate under. It is not a "purely mm"
discussion, and treating it as such results in regressions like the
ones we've recently seen.

> The second is the filesystem (and maybe other) shrinker
> functions' behavior being somewhat fragile and depending
> closely on current MM behavior, potentially up to
> and including MM bugs.
> 
> The lack of a contract between the MM and the shrinker
> callbacks is a recurring issue, and something we may
> want to discuss in a joint session.
> 
> Some reflections on the shrinker/MM interaction:
> - Since all memory (in a zone) could potentially be in
>   shrinker pools, shrinkers MUST eventually free some
>   memory.

Which they cannot guarantee because all the objects they track may
be in use. As such, shrinkers have never been asked to guarantee
that they can free memory - they've only ever been asked to scan a
number of objects and attempt to free those it can during the scan.

> - Shrinkers should not block kswapd from making progress.
>   If kswapd got stuck in NFS inode writeback, and ended up
>   not being able to free clean pages to receive network
>   packets, that might cause a deadlock.

Same can happen if kswapd got stuck on dirty page writeback from
pageout(). i.e. pageout() can only run from kswapd and it issues IO,
which can then block in the IO submission path waiting for IO to
make progress, which may require substantial amounts of memory
allocation.

Yes, we can try to not block kswapd as much as possible just like
page reclaim does, but the fact is kswapd is the only context where
it is safe to do certain blocking operations to ensure memory
reclaim can actually make progress.

i.e. the rules for blocking kswapd need to be consistent across both
page reclaim and shrinker reclaim, and right now page reclaim can
and does block kswapd when it is necessary for forwards progress....

> - The MM should be able to deal with shrinkers doing
>   nothing at this call, but having some work pending 
>   (eg. waiting on IO completion), without getting a false
>   OOM kill. How can we do this best?

By integrating shrinkers into the same feedback loops as page
reclaim. i.e. to allow individual shrinker instance state to be
visible to the backoff/congestion decisions that the main page
reclaim loops make.

i.e. the problem here is that shrinkers' only feedback to the main
loop is "how many pages were freed" as a whole. They aren't seen as
individual reclaim instances like zones for page reclaim, they are
just a huge amorphous blob that "frees some pages". i.e. They sit off to
the side and run their own game between main loop scans and have no
capability to run individual backoffs, schedule kswapd to do future
work, don't have watermarks to provide reclaim goals, can't
communicate progress to the main control algorithm, etc.

IOWs, the first step we need to take here is to get rid of
the shrink_slab() abstraction and make shrinkers a first class
reclaim citizen....

> - Related to the above: stalling in the shrinker code is
>   unpredictable, and can take an arbitrarily long amount
>   of time. Is there a better way we can make reclaimers
>   wait for in-flight work to be completed?

Look at it this way: what do you need to do to implement the main
zone reclaim loops as individual shrinker instances? Complex
shrinker implementations have to deal with all the same issues as
the page reclaim loops (including managing cross-cache dependencies
and balancing). If we can't answer this question, then we can't
answer the questions that are being asked.

So, at this point, I have to ask: if we need the same functionality
for both page reclaim and shrinkers, then why shouldn't the goal be
to make page reclaim just another set of opaque shrinker
implementations?

Cheers,

Dave.
-- 
Dave Chinner
dchinner@redhat.com



* Re: [LSF/MM TOPIC] dying memory cgroups and slab reclaim issues
  2019-02-19 17:31   ` Rik van Riel
@ 2019-02-19 17:38     ` Michal Hocko
  2019-02-19 23:26     ` Dave Chinner
  1 sibling, 0 replies; 33+ messages in thread
From: Michal Hocko @ 2019-02-19 17:38 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Dave Chinner, Roman Gushchin, lsf-pc, linux-mm, guroan,
	Kernel Team, hannes

On Tue 19-02-19 12:31:10, Rik van Riel wrote:
> On Tue, 2019-02-19 at 13:04 +1100, Dave Chinner wrote:
> > On Tue, Feb 19, 2019 at 12:31:45AM +0000, Roman Gushchin wrote:
> > > Sorry, resending with the fixed to/cc list. Please, ignore the
> > > first letter.
> > 
> > Please resend again with linux-fsdevel on the cc list, because this
> > isn't an MM topic given the regressions from the shrinker patches
> > have all been on the filesystem side of the shrinkers....
> 
> It looks like there are two separate things going on here.
> 
> The first is an MM issue: one of potentially leaking memory
> by not scanning slabs with few items on them, and having
> such slabs stay around forever after the cgroup they were
> created for has disappeared, and the other of various other
> bugs with shrinker invocation behavior (like the nr_deferred
> fixes you posted a patch for). I believe these are MM topics.
> 
> 
> The second is the filesystem (and maybe other) shrinker
> functions' behavior being somewhat fragile and depending
> closely on current MM behavior, potentially up to
> and including MM bugs.

I do agree and we should separate the two topics.

-- 
Michal Hocko
SUSE Labs



* Re: [LSF/MM TOPIC] dying memory cgroups and slab reclaim issues
  2019-02-19  2:04 ` Dave Chinner
@ 2019-02-19 17:31   ` Rik van Riel
  2019-02-19 17:38     ` Michal Hocko
  2019-02-19 23:26     ` Dave Chinner
  0 siblings, 2 replies; 33+ messages in thread
From: Rik van Riel @ 2019-02-19 17:31 UTC (permalink / raw)
  To: Dave Chinner, Roman Gushchin
  Cc: lsf-pc, linux-mm, mhocko, guroan, Kernel Team, hannes


On Tue, 2019-02-19 at 13:04 +1100, Dave Chinner wrote:
> On Tue, Feb 19, 2019 at 12:31:45AM +0000, Roman Gushchin wrote:
> > Sorry, resending with the fixed to/cc list. Please, ignore the
> > first letter.
> 
> Please resend again with linux-fsdevel on the cc list, because this
> isn't an MM topic given the regressions from the shrinker patches
> have all been on the filesystem side of the shrinkers....

It looks like there are two separate things going on here.

The first is an MM issue: one of potentially leaking memory
by not scanning slabs with few items on them, and having
such slabs stay around forever after the cgroup they were
created for has disappeared, and the other of various other
bugs with shrinker invocation behavior (like the nr_deferred
fixes you posted a patch for). I believe these are MM topics.


The second is the filesystem (and maybe other) shrinker
functions' behavior being somewhat fragile and depending
closely on current MM behavior, potentially up to
and including MM bugs.

The lack of a contract between the MM and the shrinker
callbacks is a recurring issue, and something we may
want to discuss in a joint session.

Some reflections on the shrinker/MM interaction:
- Since all memory (in a zone) could potentially be in
  shrinker pools, shrinkers MUST eventually free some
  memory.
- Shrinkers should not block kswapd from making progress.
  If kswapd got stuck in NFS inode writeback, and ended up
  not being able to free clean pages to receive network
  packets, that might cause a deadlock.
- The MM should be able to deal with shrinkers doing
  nothing at this call, but having some work pending 
  (eg. waiting on IO completion), without getting a false
  OOM kill. How can we do this best?
- Related to the above: stalling in the shrinker code is
  unpredictable, and can take an arbitrarily long amount
  of time. Is there a better way we can make reclaimers
  wait for in-flight work to be completed?

-- 
All Rights Reversed.



* Re: [LSF/MM TOPIC] dying memory cgroups and slab reclaim issues
  2019-02-19  0:31 Roman Gushchin
@ 2019-02-19  2:04 ` Dave Chinner
  2019-02-19 17:31   ` Rik van Riel
  0 siblings, 1 reply; 33+ messages in thread
From: Dave Chinner @ 2019-02-19  2:04 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: lsf-pc, linux-mm, mhocko, riel, guroan, Kernel Team, hannes

On Tue, Feb 19, 2019 at 12:31:45AM +0000, Roman Gushchin wrote:
> Sorry, resending with the fixed to/cc list. Please, ignore the first letter.

Please resend again with linux-fsdevel on the cc list, because this
isn't an MM topic given the regressions from the shrinker patches
have all been on the filesystem side of the shrinkers....

-Dave.

> --
> 
> Recent reverts of memcg leak fixes [1, 2] reintroduced the problem
> with accumulating of dying memory cgroups. This is a serious problem:
> on most of our machines we've seen thousands of dying cgroups, and
> the corresponding memory footprint was measured in hundreds of megabytes.
> The problem was also independently discovered by other companies.
> 
> The fixes were reverted due to xfs regression investigated by Dave Chinner.
> Simultaneously we've seen a very small (0.18%) cpu regression on some hosts,
> which caused Rik van Riel to propose a patch [3], which aimed to fix the
> regression. The idea is to accumulate small memory pressure and apply it
> periodically, so that we don't overscan small shrinker lists. According
> to Jan Kara's data [4], Rik's patch partially fixed the regression,
> but not entirely.
> 
> The path forward isn't entirely clear now, and the status quo isn't acceptable
> due to the memcg leak bug. Dave and Michal's position is to focus on dying memory
> cgroup case and apply some artificial memory pressure on corresponding slabs
> (probably, during cgroup deletion process). This approach can theoretically
> be less harmful for the subtle scanning balance, and not cause any regressions.
> 
> In my opinion, it's not necessarily true. Slab objects can be shared between
> cgroups, and often can't be reclaimed on cgroup removal without an impact on the
> rest of the system. Applying constant artificial memory pressure precisely only
> on objects accounted to dying cgroups is challenging and will likely
> cause a quite significant overhead. Also, by "forgetting" of some slab objects
> under light or even moderate memory pressure, we're wasting memory, which can be
> used for something useful. Dying cgroups are just making this problem more
> obvious because of their size.
> 
> So, using "natural" memory pressure in a way that all slab objects are scanned
> periodically, seems to me the best solution. The devil is in the details, and how
> to do it without causing any regressions, is an open question now.
> 
> Also, completely re-parenting slabs to the parent cgroup (not only shrinker lists)
> is a potential option to consider.
> 
> It would be nice to discuss the problem at LSF/MM, agree on a general path, and
> make a list of potential benchmarks that can be used to validate the solution.
> 
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a9a238e83fbb0df31c3b9b67003f8f9d1d1b6c96
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=69056ee6a8a3d576ed31e38b3b14c70d6c74edcc
> [3] https://lkml.org/lkml/2019/1/28/1865
> [4] https://lkml.org/lkml/2019/2/8/336
> 

-- 
Dave Chinner
dchinner@redhat.com



* [LSF/MM TOPIC] dying memory cgroups and slab reclaim issues
@ 2019-02-19  0:31 Roman Gushchin
  2019-02-19  2:04 ` Dave Chinner
  0 siblings, 1 reply; 33+ messages in thread
From: Roman Gushchin @ 2019-02-19  0:31 UTC (permalink / raw)
  To: lsf-pc; +Cc: linux-mm, mhocko, riel, dchinner, guroan, Kernel Team, hannes

Sorry, resending with the fixed to/cc list. Please, ignore the first letter.
--

Recent reverts of memcg leak fixes [1, 2] reintroduced the problem
with accumulating of dying memory cgroups. This is a serious problem:
on most of our machines we've seen thousands of dying cgroups, and
the corresponding memory footprint was measured in hundreds of megabytes.
The problem was also independently discovered by other companies.

The fixes were reverted due to xfs regression investigated by Dave Chinner.
Simultaneously we've seen a very small (0.18%) cpu regression on some hosts,
which caused Rik van Riel to propose a patch [3], which aimed to fix the
regression. The idea is to accumulate small memory pressure and apply it
periodically, so that we don't overscan small shrinker lists. According
to Jan Kara's data [4], Rik's patch partially fixed the regression,
but not entirely.

The path forward isn't entirely clear now, and the status quo isn't acceptable
due to the memcg leak bug. Dave and Michal's position is to focus on dying memory
cgroup case and apply some artificial memory pressure on corresponding slabs
(probably, during cgroup deletion process). This approach can theoretically
be less harmful for the subtle scanning balance, and not cause any regressions.

In my opinion, it's not necessarily true. Slab objects can be shared between
cgroups, and often can't be reclaimed on cgroup removal without an impact on the
rest of the system. Applying constant artificial memory pressure precisely only
on objects accounted to dying cgroups is challenging and will likely
cause a quite significant overhead. Also, by "forgetting" of some slab objects
under light or even moderate memory pressure, we're wasting memory, which can be
used for something useful. Dying cgroups are just making this problem more
obvious because of their size.

So, using "natural" memory pressure in a way that all slab objects are scanned
periodically, seems to me the best solution. The devil is in the details, and how
to do it without causing any regressions, is an open question now.

Also, completely re-parenting slabs to the parent cgroup (not only shrinker lists)
is a potential option to consider.

It would be nice to discuss the problem at LSF/MM, agree on a general path, and
make a list of potential benchmarks that can be used to validate the solution.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a9a238e83fbb0df31c3b9b67003f8f9d1d1b6c96
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=69056ee6a8a3d576ed31e38b3b14c70d6c74edcc
[3] https://lkml.org/lkml/2019/1/28/1865
[4] https://lkml.org/lkml/2019/2/8/336



* [LSF/MM TOPIC] dying memory cgroups and slab reclaim issues
@ 2019-02-18 23:53 Roman Gushchin
  0 siblings, 0 replies; 33+ messages in thread
From: Roman Gushchin @ 2019-02-18 23:53 UTC (permalink / raw)
  To: sf-pc; +Cc: linux-mm, mhocko, riel, dchinner, dairinin, akpm

Recent reverts of memcg leak fixes [1, 2] reintroduced the problem
with accumulating of dying memory cgroups. This is a serious problem:
on most of our machines we've seen thousands of dying cgroups, and
the corresponding memory footprint was measured in hundreds of megabytes.
The problem was also independently discovered by other companies.

The fixes were reverted due to xfs regression investigated by Dave Chinner.
Simultaneously we've seen a very small (0.18%) cpu regression on some hosts,
which caused Rik van Riel to propose a patch [3], which aimed to fix the
regression. The idea is to accumulate small memory pressure and apply it
periodically, so that we don't overscan small shrinker lists. According
to Jan Kara's data [4], Rik's patch partially fixed the regression,
but not entirely.

The path forward isn't entirely clear now, and the status quo isn't acceptable
due to the memcg leak bug. Dave and Michal's position is to focus on dying memory
cgroup case and apply some artificial memory pressure on corresponding slabs
(probably, during cgroup deletion process). This approach can theoretically
be less harmful for the subtle scanning balance, and not cause any regressions.

In my opinion, it's not necessarily true. Slab objects can be shared between
cgroups, and often can't be reclaimed on cgroup removal without an impact on the
rest of the system. Applying constant artificial memory pressure precisely only
on objects accounted to dying cgroups is challenging and will likely
cause a quite significant overhead. Also, by "forgetting" of some slab objects
under light or even moderate memory pressure, we're wasting memory, which can be
used for something useful. Dying cgroups are just making this problem more
obvious because of their size.

So, using "natural" memory pressure in a way that all slab objects are scanned
periodically, seems to me the best solution. The devil is in the details, and how
to do it without causing any regressions, is an open question now.

Also, completely re-parenting slabs to the parent cgroup (not only shrinker lists)
is a potential option to consider.

It would be nice to discuss the problem at LSF/MM, agree on a general path, and
make a list of potential benchmarks that can be used to validate the solution.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a9a238e83fbb0df31c3b9b67003f8f9d1d1b6c96
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=69056ee6a8a3d576ed31e38b3b14c70d6c74edcc
[3] https://lkml.org/lkml/2019/1/28/1865
[4] https://lkml.org/lkml/2019/2/8/336



end of thread, other threads:[~2019-02-28 22:30 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
2019-02-19  7:13 [LSF/MM TOPIC] dying memory cgroups and slab reclaim issues Roman Gushchin
2019-02-19  7:13 ` Roman Gushchin
     [not found] ` <20190219092323.GH4525@dhcp22.suse.cz>
2019-02-19 16:21   ` [LSF/MM ATTEND] MM track: dying memory cgroups and slab reclaim issue, memcg, THP Roman Gushchin
2019-02-20  2:47 ` [LSF/MM TOPIC] dying memory cgroups and slab reclaim issues Dave Chinner
2019-02-20  2:47   ` Dave Chinner
2019-02-20  5:50   ` Dave Chinner
2019-02-20  5:50     ` Dave Chinner
2019-02-20  7:27     ` Dave Chinner
2019-02-20  7:27       ` Dave Chinner
2019-02-20 16:20       ` Johannes Weiner
2019-02-20 16:20         ` Johannes Weiner
2019-02-21 22:46       ` Roman Gushchin
2019-02-21 22:46         ` Roman Gushchin
2019-02-22  1:48         ` Rik van Riel
2019-02-22  1:48           ` Rik van Riel
2019-02-22  1:57           ` Roman Gushchin
2019-02-22  1:57             ` Roman Gushchin
2019-02-28 20:30         ` Roman Gushchin
2019-02-28 20:30           ` Roman Gushchin
2019-02-28 21:30           ` Dave Chinner
2019-02-28 21:30             ` Dave Chinner
2019-02-28 22:29             ` Roman Gushchin
2019-02-28 22:29               ` Roman Gushchin
  -- strict thread matches above, loose matches on Subject: below --
2019-02-19  0:31 Roman Gushchin
2019-02-19  2:04 ` Dave Chinner
2019-02-19 17:31   ` Rik van Riel
2019-02-19 17:38     ` Michal Hocko
2019-02-19 23:26     ` Dave Chinner
2019-02-20  2:06       ` Rik van Riel
2019-02-20  4:33         ` Dave Chinner
2019-02-20  5:31           ` Roman Gushchin
2019-02-20 17:00           ` Rik van Riel
2019-02-18 23:53 Roman Gushchin
