linux-fsdevel.vger.kernel.org archive mirror
* [PATCH 0/2] [REGRESSION v4.19-20] mm: shrinkers are now way too aggressive
@ 2019-01-30  4:17 Dave Chinner
  2019-01-30  4:17 ` [PATCH 1/2] Revert "mm: don't reclaim inodes with many attached pages" Dave Chinner
                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Dave Chinner @ 2019-01-30  4:17 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-fsdevel, linux-xfs
  Cc: guro, akpm, mhocko, vdavydov.dev

Hi mm-folks,

TL;DR: these two commits badly break system-wide VFS cache memory
reclaim balance, cause severe performance regressions in stable
kernels, and need to be reverted ASAP.

For background, let's start with the bug reports that have come from
desktop users on 4.19 stable kernels. First this one:

https://bugzilla.kernel.org/show_bug.cgi?id=202349

Whereby copying a large amount of data to files on an XFS filesystem
would cause the desktop to freeze for multiple seconds and,
apparently, occasionally hang completely. Basically, GPU-based
GFP_KERNEL allocations were getting stuck in shrinkers under relatively
light memory loads, killing desktop interactivity. This was on kernel 4.19.16.

The second:

https://bugzilla.kernel.org/show_bug.cgi?id=202441

Whereby copying a large data set across NFS filesystems at the same
time as running a kernel compile on a local XFS filesystem results
in the kernel compile going from 3m30s to over an hour and file copy
performance tanking.

We ran an abbreviated bisect from 4.18 through to 4.19.18, and found
two things:

	1: there was a major change in page cache reclaim behaviour
	introduced in 4.19-rc5. Basically the page cache would get
	trashed periodically for no apparent reason, the
	characteristic being a sawtooth cache usage pattern.

	2: in 4.19.3, kernel compile performance turned to crap.

The kernel compile regression is essentially caused by memory
reclaim driving the XFS inode shrinker hard into reclaiming dirty
inodes and getting blocked, slowing reclaim down to the
rate at which a slow SATA drive could write back inodes. There were
also indications of similar GPU-based GFP_KERNEL allocation
stalls, but most of the testing was done from the CLI with no X, so
that could be discounted.

It was reported that less severe slowdowns also occurred on ext2,
ext3, ext4 and jfs, so XFS is really just the messenger here - it is
most definitely not the cause of the problem being seen, so stop and
think before you go and blame XFS.

Looking at the change history of the mm/ subsystem after the first
bug report, I noticed and red-flagged this commit for deeper
analysis:

172b06c32b94 ("mm: slowly shrink slabs with a relatively small number of objects")

That "simple" change ran a read flag because it makes shrinker
reclaim far, far more agressive at initial priority reclaims (ie..
reclaim priority = 12). And it also means that small caches that
don't need reclaim (because they are small) will be agressively
scanned and reclaimed when there is very little memory pressure,
too. It also means tha tlarge caches are reclaimed very agressively
under light memory pressure - pressure that would have resulted in
single digit scan count now gets run out to batch size, which for
filesystems is 1024 objects. i.e. we increase reclaim filesystem
superblock shrinker pressure by an order of 100x at light reclaim.
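
To put rough numbers on that, here's a quick userspace model of the
do_shrink_slab() scan count arithmetic (a sketch only: it assumes
seeks = DEFAULT_SEEKS = 2 and the superblock shrinker's batch size of
1024, and it ignores the deferred count and clamping):

#include <stdio.h>
#include <stdint.h>

#define DEFAULT_SEEKS	2
#define SB_BATCH	1024	/* batch size set by the superblock shrinker */

#define MIN(a, b)	((a) < (b) ? (a) : (b))
#define MAX(a, b)	((a) > (b) ? (a) : (b))

/* before 172b06c32b94: scan count scales with cache size and priority */
static uint64_t scan_before(uint64_t freeable, int priority)
{
	return (freeable >> priority) * 4 / DEFAULT_SEEKS;
}

/* after 172b06c32b94: never less than min(freeable, batch size) */
static uint64_t scan_after(uint64_t freeable, int priority)
{
	return MAX(scan_before(freeable, priority),
		   MIN(freeable, (uint64_t)SB_BATCH));
}

int main(void)
{
	uint64_t sizes[] = { 2000, 10000, 100000, 1000000 };
	int priority = 12;	/* initial priority, i.e. light memory pressure */

	for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
		printf("freeable %7llu: before %4llu, after %4llu\n",
		       (unsigned long long)sizes[i],
		       (unsigned long long)scan_before(sizes[i], priority),
		       (unsigned long long)scan_after(sizes[i], priority));
	return 0;
}

At priority 12, a cache with 10,000 freeable objects goes from a scan
count of 4 to 1024 under exactly the same light memory pressure.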

That's a *bad thing* because it means we can't retain working sets
of small caches even under light memory pressure - they get
excessively reclaimed in comparison to large caches instead of in
proportion to the rest of the system caches.

So, yeah, red flag. Big one. And the patch never got sent to
linux-fsdevel, so we filesystem people didn't have any idea that
there were changes to VFS cache balances coming down the line. Hence
our users reporting problems are the first sign we get of a
regression...

So when Roger reported that the page cache behaviour changed
massively in 4.19-rc5, and I found that commit was between -rc4 and
-rc5? Yeah, that kinda proved my theory that it changed the
fundamental cache balance of the system and the red flag is real...

So, the second, performance killing change? Well, I think you all
know what's coming:

a76cf1a474d7 ("mm: don't reclaim inodes with many attached pages")

[ Yup, a "MM" tagged patch that changed code in fs/inode.c and wasn't
cc'd to any filesystem list. There's a pattern emerging here. Did
anyone think to cc the guy who originally designed the NUMA-aware
shrinker infrastructure and helped design the memcg shrinker
infrastructure on fundamental changes? ]

So, that commit was an attempt to fix the shitty behaviour
introduced by 172b06c32b94 - it's a bandaid over a symptom rather
than something that attempts to correct the actual bug that was
introduced. i.e. the increased inode cache reclaim pressure was now
reclaiming inodes faster than the page cache reclaim was reclaiming
pages on the inode, and so inode cache reclaim is trashing the
working set of the page cache.

Reclaiming an inode's cached pages along with the inode is actually
necessary behaviour - when you have lots of temporary inodes and are
turning the inode cache over really fast (think recursive grep), we
want the inode cache to immediately reclaim the cached pages on the
inode because it's typically a
single use file. Why wait for the page cache to detect it's single
use when we already know it's not part of the working file set?

And what's a kernel compile? It's a recursive read of a large number
of files, intermixed with the creation of a bunch of temporary
files.  What happens when you have a mixed large file workload
(background file copy) and lots of small files being created and
removed (kernel compile)?

Yup, we end up in a situation where inode reclaim can no longer
reclaim clean inodes because they have cached pages, yet page reclaim
doesn't keep up reclaiming pages because it hasn't realised they
are single use pages yet, and hence they don't get reclaimed. And
because the page cache pressure is relatively light, we are
putting a huge amount of scanning pressure on the shrinkers.

The result is the shrinkers are driven into corners where they try
*really hard* to free objects because there's nothing left that is
easy to reclaim. e.g. it drives the XFS inode cache shrinker into
"need to clean dirty reclaimable inodes" land on workloads where the
working set of cached inodes should never, ever get anywhere near
that threshold because there are huge amounts of clean pages and
inodes that should have been reclaimed first.

IOWs, the fix to prevent inode cache reclaim from reclaiming inodes
with pages attached to them essentially breaks a key page cache
memory reclaim interlock that our systems have implicitly depended
on for ages.

And, in reality, changing fundamental memory reclaim balance is not
the way to fix a "dying memcg" memory leak. Trying to solve a "we've
got referenced memory we need to clean up" by faking memory
pressure and winding up shrinker based reclaim so dying memcg's are
reclaimed fast is, well, just insane. It's a nasty hack at best.

e.g. add a garbage collector via a background workqueue that sits on
the dying memcg and calls something like:

void drop_slab_memcg(struct mem_cgroup *dying_memcg)
{
        unsigned long freed;

        /* keep shrinking until nothing more can be freed */
        do {
                struct mem_cgroup *memcg = NULL;

                freed = 0;
                /* walk the dying memcg and all of its children */
                memcg = mem_cgroup_iter(dying_memcg, NULL, NULL);
                do {
                        /* priority 0: scan the whole cache */
                        freed += shrink_slab_memcg(GFP_KERNEL, 0, memcg, 0);
                } while ((memcg = mem_cgroup_iter(dying_memcg, memcg, NULL)) != NULL);
        } while (freed > 0);
}

(or whatever the NUMA aware, rate limited variant should really be)

so that it kills off all the slab objects accounted to the memcg
as quickly as possible? The memcg teardown code is full of these
"put it on a work queue until something goes away and calls the next
teardown function" operations, so it makes no sense to me to be
relying on system wide memory pressure to drive this reclaim faster.
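
A rough sketch of how that garbage collector might be wired up from
the offline path (the reap_work member of struct mem_cgroup and the
hook point are assumptions for illustration; drop_slab_memcg() is the
sketch above):

/* work function: runs the reaper sketched above for one dying memcg */
static void memcg_reap_slab_workfn(struct work_struct *work)
{
        struct mem_cgroup *memcg = container_of(work, struct mem_cgroup,
                                                reap_work);

        drop_slab_memcg(memcg);
        css_put(&memcg->css);           /* drop the reference taken below */
}

/* queued from the css_offline path once the memcg starts dying */
static void memcg_schedule_slab_reap(struct mem_cgroup *memcg)
{
        css_get(&memcg->css);           /* pin the memcg until the work runs */
        INIT_WORK(&memcg->reap_work, memcg_reap_slab_workfn);
        queue_work(system_unbound_wq, &memcg->reap_work);
}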

Sure, it won't get rid of all of the dying memcgs all of the time,
but it's a hell of a lot better than changing memory reclaim behaviour
and cache balance for everyone to fix what is, at its core, a memcg
lifecycle problem, not a memory reclaim problem.

So, revert these broken, misguided commits ASAP, please.

-Dave.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 1/2] Revert "mm: don't reclaim inodes with many attached pages"
  2019-01-30  4:17 [PATCH 0/2] [REGRESSION v4.19-20] mm: shrinkers are now way too aggressive Dave Chinner
@ 2019-01-30  4:17 ` Dave Chinner
  2019-01-30 12:21   ` Chris Mason
  2019-01-30  4:17 ` [PATCH 2/2] Revert "mm: slowly shrink slabs with a relatively small number of objects" Dave Chinner
  2019-01-30  5:48 ` [PATCH 0/2] [REGRESSION v4.19-20] mm: shrinkers are now way too aggressive Roman Gushchin
  2 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2019-01-30  4:17 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-fsdevel, linux-xfs
  Cc: guro, akpm, mhocko, vdavydov.dev

From: Dave Chinner <dchinner@redhat.com>

This reverts commit a76cf1a474d7dbcd9336b5f5afb0162baa142cf0.

This change causes serious changes to page cache and inode cache
behaviour and balance, resulting in major performance regressions
when combining workloads such as large file copies and kernel
compiles.

https://bugzilla.kernel.org/show_bug.cgi?id=202441

This change is a hack to work around the problems introduced by
changing how aggressive shrinkers are on small caches in commit
172b06c32b94 ("mm: slowly shrink slabs with a relatively small
number of objects"). It creates more problems than it solves, wasn't
adequately reviewed or tested, so it needs to be reverted.

cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/inode.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 0cd47fe0dbe5..73432e64f874 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -730,11 +730,8 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
 		return LRU_REMOVED;
 	}
 
-	/*
-	 * Recently referenced inodes and inodes with many attached pages
-	 * get one more pass.
-	 */
-	if (inode->i_state & I_REFERENCED || inode->i_data.nrpages > 1) {
+	/* recently referenced inodes get one more pass */
+	if (inode->i_state & I_REFERENCED) {
 		inode->i_state &= ~I_REFERENCED;
 		spin_unlock(&inode->i_lock);
 		return LRU_ROTATE;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 2/2] Revert "mm: slowly shrink slabs with a relatively small number of objects"
  2019-01-30  4:17 [PATCH 0/2] [REGRESSION v4.19-20] mm: shrinkers are now way too aggressive Dave Chinner
  2019-01-30  4:17 ` [PATCH 1/2] Revert "mm: don't reclaim inodes with many attached pages" Dave Chinner
@ 2019-01-30  4:17 ` Dave Chinner
  2019-01-30  5:48 ` [PATCH 0/2] [REGRESSION v4.19-20] mm: shrinkers are now way too aggressive Roman Gushchin
  2 siblings, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2019-01-30  4:17 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-fsdevel, linux-xfs
  Cc: guro, akpm, mhocko, vdavydov.dev

From: Dave Chinner <dchinner@redhat.com>

This reverts commit 172b06c32b949759fe6313abec514bc4f15014f4.

This change alters the aggressiveness of shrinker reclaim, greatly
increasing scanning pressure on small caches under low priority
reclaim. As a result, light memory pressure has a disproportionate
effect on small caches, and causes large caches to be reclaimed much
faster than previously.

In turn, this greatly perturbs the delicate balance of the VFS
caches (dentry/inode vs file page cache) such that the inode/dentry
caches are reclaimed much, much faster than the page cache and this
drives us into several other caching imbalance related problems.

As such, this is a bad change and needs to be reverted.

[ Needs some massaging to retain the later seekless shrinker
modifications. ]

cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 mm/vmscan.c | 10 ----------
 1 file changed, 10 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a714c4f800e9..e979705bbf32 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -491,16 +491,6 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 		delta = freeable / 2;
 	}
 
-	/*
-	 * Make sure we apply some minimal pressure on default priority
-	 * even on small cgroups. Stale objects are not only consuming memory
-	 * by themselves, but can also hold a reference to a dying cgroup,
-	 * preventing it from being reclaimed. A dying cgroup with all
-	 * corresponding structures like per-cpu stats and kmem caches
-	 * can be really big, so it may lead to a significant waste of memory.
-	 */
-	delta = max_t(unsigned long long, delta, min(freeable, batch_size));
-
 	total_scan += delta;
 	if (total_scan < 0) {
 		pr_err("shrink_slab: %pF negative objects to delete nr=%ld\n",
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/2] [REGRESSION v4.19-20] mm: shrinkers are now way too aggressive
  2019-01-30  4:17 [PATCH 0/2] [REGRESSION v4.19-20] mm: shrinkers are now way too aggressive Dave Chinner
  2019-01-30  4:17 ` [PATCH 1/2] Revert "mm: don't reclaim inodes with many attached pages" Dave Chinner
  2019-01-30  4:17 ` [PATCH 2/2] Revert "mm: slowly shrink slabs with a relatively small number of objects" Dave Chinner
@ 2019-01-30  5:48 ` Roman Gushchin
  2 siblings, 0 replies; 20+ messages in thread
From: Roman Gushchin @ 2019-01-30  5:48 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-mm, linux-kernel, linux-fsdevel, linux-xfs, akpm, mhocko,
	vdavydov.dev

Hi, Dave!

Instead of reverting (which will bring back the memcg memory leak),
can you, please, try Rik's patch: https://lkml.org/lkml/2019/1/28/1865 ?

It should protect small cgroups from being scanned too hard under memory
pressure, while keeping the pressure high enough to avoid memory leaks.

Thanks!

On Wed, Jan 30, 2019 at 03:17:05PM +1100, Dave Chinner wrote:
> Hi mm-folks,
> 
> TL;DR: these two commits badly break system-wide VFS cache memory
> reclaim balance, cause severe performance regressions in stable
> kernels, and need to be reverted ASAP.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] Revert "mm: don't reclaim inodes with many attached pages"
  2019-01-30  4:17 ` [PATCH 1/2] Revert "mm: don't reclaim inodes with many attached pages" Dave Chinner
@ 2019-01-30 12:21   ` Chris Mason
  2019-01-31  1:34     ` Dave Chinner
  0 siblings, 1 reply; 20+ messages in thread
From: Chris Mason @ 2019-01-30 12:21 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-mm, linux-kernel, linux-fsdevel, linux-xfs, Roman Gushchin,
	akpm, mhocko, vdavydov.dev



On 29 Jan 2019, at 23:17, Dave Chinner wrote:

> From: Dave Chinner <dchinner@redhat.com>
>
> This reverts commit a76cf1a474d7dbcd9336b5f5afb0162baa142cf0.
>
> This change causes serious changes to page cache and inode cache
> behaviour and balance, resulting in major performance regressions
> when combining workloads such as large file copies and kernel
> compiles.
>
> https://bugzilla.kernel.org/show_bug.cgi?id=202441

I'm a little confused by the latest comment in the bz:

https://bugzilla.kernel.org/show_bug.cgi?id=202441#c24

Are these reverts sufficient?

Roman beat me to suggesting Rik's followup.  We hit a different problem 
in prod with small slabs, and have a lot of instrumentation on Rik's 
code helping.

-chris

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] Revert "mm: don't reclaim inodes with many attached pages"
  2019-01-30 12:21   ` Chris Mason
@ 2019-01-31  1:34     ` Dave Chinner
  2019-01-31  9:10       ` Michal Hocko
  2019-01-31 15:48       ` Chris Mason
  0 siblings, 2 replies; 20+ messages in thread
From: Dave Chinner @ 2019-01-31  1:34 UTC (permalink / raw)
  To: Chris Mason
  Cc: linux-mm, linux-kernel, linux-fsdevel, linux-xfs, Roman Gushchin,
	akpm, mhocko, vdavydov.dev

On Wed, Jan 30, 2019 at 12:21:07PM +0000, Chris Mason wrote:
> 
> 
> On 29 Jan 2019, at 23:17, Dave Chinner wrote:
> 
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > This reverts commit a76cf1a474d7dbcd9336b5f5afb0162baa142cf0.
> >
> > This change causes serious changes to page cache and inode cache
> > behaviour and balance, resulting in major performance regressions
> > when combining workloads such as large file copies and kernel
> > compiles.
> >
> > https://bugzilla.kernel.org/show_bug.cgi?id=202441
> 
> I'm a little confused by the latest comment in the bz:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=202441#c24

Which says the first patch that changed the shrinker behaviour is
the underlying cause of the regression.

> Are these reverts sufficient?

I think so.

> Roman beat me to suggesting Rik's followup.  We hit a different problem 
> in prod with small slabs, and have a lot of instrumentation on Rik's 
> code helping.

I think that's just another nasty, expedient hack that doesn't solve
the underlying problem. Solving the underlying problem does not
require changing core reclaim algorithms and upsetting a page
reclaim/shrinker balance that has been stable and worked well for
just about everyone for years.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] Revert "mm: don't reclaim inodes with many attached pages"
  2019-01-31  1:34     ` Dave Chinner
@ 2019-01-31  9:10       ` Michal Hocko
  2019-01-31 18:57         ` Roman Gushchin
  2019-01-31 15:48       ` Chris Mason
  1 sibling, 1 reply; 20+ messages in thread
From: Michal Hocko @ 2019-01-31  9:10 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Chris Mason, linux-mm, linux-kernel, linux-fsdevel, linux-xfs,
	Roman Gushchin, akpm, vdavydov.dev

On Thu 31-01-19 12:34:03, Dave Chinner wrote:
> On Wed, Jan 30, 2019 at 12:21:07PM +0000, Chris Mason wrote:
> > 
> > 
> > On 29 Jan 2019, at 23:17, Dave Chinner wrote:
> > 
> > > From: Dave Chinner <dchinner@redhat.com>
> > >
> > > This reverts commit a76cf1a474d7dbcd9336b5f5afb0162baa142cf0.
> > >
> > > This change causes serious changes to page cache and inode cache
> > > behaviour and balance, resulting in major performance regressions
> > > when combining workloads such as large file copies and kernel
> > > compiles.
> > >
> > > https://bugzilla.kernel.org/show_bug.cgi?id=202441
> > 
> > I'm a little confused by the latest comment in the bz:
> > 
> > https://bugzilla.kernel.org/show_bug.cgi?id=202441#c24
> 
> Which says the first patch that changed the shrinker behaviour is
> the underlying cause of the regression.
> 
> > Are these reverts sufficient?
> 
> I think so.
> 
> > Roman beat me to suggesting Rik's followup.  We hit a different problem 
> > in prod with small slabs, and have a lot of instrumentation on Rik's 
> > code helping.
> 
> I think that's just another nasty, expedient hack that doesn't solve
> the underlying problem. Solving the underlying problem does not
> require changing core reclaim algorithms and upsetting a page
> reclaim/shrinker balance that has been stable and worked well for
> just about everyone for years.

I tend to agree with Dave here. Slab pressure balancing is quite subtle
and easy to get wrong. If we want to plug the problem with offline
memcgs then the fix should be targeted at that problem. So maybe we want
to emulate high pressure on offline memcgs only. There might be other
issues to resolve for small caches but let's start with something more
targeted first please.
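
Something along these lines, perhaps (an illustrative sketch only: the
shrink_one() wrapper is made up, while sc->memcg, mem_cgroup_online()
and do_shrink_slab() are the existing pieces):

/* apply extra pressure to offline memcgs only; leave everyone else alone */
static unsigned long shrink_one(struct shrink_control *sc,
				struct shrinker *shrinker, int priority)
{
	/* dying memcg: scan as if memory pressure were at its highest */
	if (sc->memcg && !mem_cgroup_online(sc->memcg))
		priority = 0;

	return do_shrink_slab(sc, shrinker, priority);
}
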
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] Revert "mm: don't reclaim inodes with many attached pages"
  2019-01-31  1:34     ` Dave Chinner
  2019-01-31  9:10       ` Michal Hocko
@ 2019-01-31 15:48       ` Chris Mason
  2019-02-01 23:39         ` Dave Chinner
  1 sibling, 1 reply; 20+ messages in thread
From: Chris Mason @ 2019-01-31 15:48 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-mm, linux-kernel, linux-fsdevel, linux-xfs, Roman Gushchin,
	akpm, mhocko, vdavydov.dev

On 30 Jan 2019, at 20:34, Dave Chinner wrote:

> On Wed, Jan 30, 2019 at 12:21:07PM +0000, Chris Mason wrote:
>>
>>
>> On 29 Jan 2019, at 23:17, Dave Chinner wrote:
>>
>>> From: Dave Chinner <dchinner@redhat.com>
>>>
>>> This reverts commit a76cf1a474d7dbcd9336b5f5afb0162baa142cf0.
>>>
>>> This change causes serious changes to page cache and inode cache
>>> behaviour and balance, resulting in major performance regressions
>>> when combining workloads such as large file copies and kernel
>>> compiles.
>>>
>>> https://bugzilla.kernel.org/show_bug.cgi?id=202441
>>
>> I'm a little confused by the latest comment in the bz:
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=202441#c24
>
> Which says the first patch that changed the shrinker behaviour is
> the underlying cause of the regression.
>
>> Are these reverts sufficient?
>
> I think so.

Based on the latest comment:

"If I had been less strict in my testing I probably would have 
discovered that the problem was present earlier than 4.19.3. Mr Gushins 
commit made it more visible.
I'm going back to work after two days off, so I might not be able to 
respond inside your working hours, but I'll keep checking in on this as 
I get a chance."

I don't think the reverts are sufficient.

>
>> Roman beat me to suggesting Rik's followup.  We hit a different 
>> problem
>> in prod with small slabs, and have a lot of instrumentation on Rik's
>> code helping.
>
> I think that's just another nasty, expedient hack that doesn't solve
> the underlying problem. Solving the underlying problem does not
> require changing core reclaim algorithms and upsetting a page
> reclaim/shrinker balance that has been stable and worked well for
> just about everyone for years.
>

Things are definitely breaking down in non-specialized workloads, and 
have been for a long time.

-chris

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] Revert "mm: don't reclaim inodes with many attached pages"
  2019-01-31  9:10       ` Michal Hocko
@ 2019-01-31 18:57         ` Roman Gushchin
  2019-01-31 22:19           ` Dave Chinner
  0 siblings, 1 reply; 20+ messages in thread
From: Roman Gushchin @ 2019-01-31 18:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Dave Chinner, Chris Mason, linux-mm, linux-kernel, linux-fsdevel,
	linux-xfs, akpm, vdavydov.dev

On Thu, Jan 31, 2019 at 10:10:11AM +0100, Michal Hocko wrote:
> On Thu 31-01-19 12:34:03, Dave Chinner wrote:
> > On Wed, Jan 30, 2019 at 12:21:07PM +0000, Chris Mason wrote:
> > > 
> > > 
> > > On 29 Jan 2019, at 23:17, Dave Chinner wrote:
> > > 
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > >
> > > > This reverts commit a76cf1a474d7dbcd9336b5f5afb0162baa142cf0.
> > > >
> > > > This change causes serious changes to page cache and inode cache
> > > > behaviour and balance, resulting in major performance regressions
> > > > when combining workloads such as large file copies and kernel
> > > > compiles.
> > > >
> > > > https://bugzilla.kernel.org/show_bug.cgi?id=202441
> > > 
> > > I'm a little confused by the latest comment in the bz:
> > > 
> > > https://bugzilla.kernel.org/show_bug.cgi?id=202441#c24
> > 
> > Which says the first patch that changed the shrinker behaviour is
> > the underlying cause of the regression.
> > 
> > > Are these reverts sufficient?
> > 
> > I think so.
> > 
> > > Roman beat me to suggesting Rik's followup.  We hit a different problem 
> > > in prod with small slabs, and have a lot of instrumentation on Rik's 
> > > code helping.
> > 
> > I think that's just another nasty, expedient hack that doesn't solve
> > the underlying problem. Solving the underlying problem does not
> > require changing core reclaim algorithms and upsetting a page
> > reclaim/shrinker balance that has been stable and worked well for
> > just about everyone for years.
> 
> I tend to agree with Dave here. Slab pressure balancing is quite subtle
> and easy to get wrong. If we want to plug the problem with offline
> memcgs then the fix should be targeted at that problem. So maybe we want
> to emulate high pressure on offline memcgs only. There might be other
> issues to resolve for small caches but let's start with something more
> targeted first please.

First, the path proposed by Dave is not regression-safe either. A slab object
can be used by other cgroups as well, so creating artificial pressure on
the dying cgroup may well affect the rest of the system. We do reparent
slab lists on offlining, so there is not even an easy way to iterate over them.
Also, creating artificial pressure will create unnecessary CPU load.

So I'd really prefer to have the "natural" memory pressure applied
in a way that doesn't leave any stale objects behind.

Second, the code around slab pressure has not simply "worked well for years":
as far as I can see, the latest major change was made about a year ago by
Josef Bacik (9092c71bb724 "mm: use sc->priority for slab shrink targets").

The existing balance, even if it works perfectly for some cases, isn't something
set in stone. We're really under-scanning small cgroups, and I strongly believe
that what Rik is proposing is the right thing to do. If we don't scan objects
in small cgroups unless we have really strong memory pressure, we're basically
wasting memory.

And it really makes no sense to reclaim inodes with tons of attached pagecache
as easily as "empty" inodes. In the end, all we need is to free some memory, and
treating a sub-page object the same as an object with many thousands of attached
pages is just strange. If it's simply wrong and I'm missing something, please
explain. Maybe we need something more complicated than my patch, but saying that
the existing code is just perfect and can't be touched at all makes no sense to me.

So, assuming all this, can we, please, first check if Rik's patch is addressing
the regression?

Thanks!

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] Revert "mm: don't reclaim inodes with many attached pages"
  2019-01-31 18:57         ` Roman Gushchin
@ 2019-01-31 22:19           ` Dave Chinner
  2019-02-04 21:47             ` Dave Chinner
  2019-02-07 10:27             ` Jan Kara
  0 siblings, 2 replies; 20+ messages in thread
From: Dave Chinner @ 2019-01-31 22:19 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Michal Hocko, Chris Mason, linux-mm, linux-kernel, linux-fsdevel,
	linux-xfs, akpm, vdavydov.dev

On Thu, Jan 31, 2019 at 06:57:10PM +0000, Roman Gushchin wrote:
> On Thu, Jan 31, 2019 at 10:10:11AM +0100, Michal Hocko wrote:
> > On Thu 31-01-19 12:34:03, Dave Chinner wrote:
> > > On Wed, Jan 30, 2019 at 12:21:07PM +0000, Chris Mason wrote:
> > > > 
> > > > 
> > > > On 29 Jan 2019, at 23:17, Dave Chinner wrote:
> > > > 
> > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > >
> > > > > This reverts commit a76cf1a474d7dbcd9336b5f5afb0162baa142cf0.
> > > > >
> > > > > This change causes serious changes to page cache and inode cache
> > > > > behaviour and balance, resulting in major performance regressions
> > > > > when combining workloads such as large file copies and kernel
> > > > > compiles.
> > > > >
> > > > > https://bugzilla.kernel.org/show_bug.cgi?id=202441
> > > > 
> > > > I'm a little confused by the latest comment in the bz:
> > > > 
> > > > https://bugzilla.kernel.org/show_bug.cgi?id=202441#c24
> > > 
> > > Which says the first patch that changed the shrinker behaviour is
> > > the underlying cause of the regression.
> > > 
> > > > Are these reverts sufficient?
> > > 
> > > I think so.
> > > 
> > > > Roman beat me to suggesting Rik's followup.  We hit a different problem 
> > > > in prod with small slabs, and have a lot of instrumentation on Rik's 
> > > > code helping.
> > > 
> > > I think that's just another nasty, expedient hack that doesn't solve
> > > the underlying problem. Solving the underlying problem does not
> > > require changing core reclaim algorithms and upsetting a page
> > > reclaim/shrinker balance that has been stable and worked well for
> > > just about everyone for years.
> > 
> > I tend to agree with Dave here. Slab pressure balancing is quite subtle
> > and easy to get wrong. If we want to plug the problem with offline
> > memcgs then the fix should be targeted at that problem. So maybe we want
> > to emulate high pressure on offline memcgs only. There might be other
> > issues to resolve for small caches but let's start with something more
> > targeted first please.
> 
> First, the path proposed by Dave is not regression-safe either. A slab object
> can be used by other cgroups as well, so creating artificial pressure on
> the dying cgroup may well affect the rest of the system.

Isolating regressions to the memcg dying path is far better than the
current changes you've made, which have already been *proven* to
affect the rest of the system.

> We do reparent
> slab lists on offlining, so there is not even an easy way to iterate over them.
> Also, creating artificial pressure will create unnecessary CPU load.

Yes, so do your changes - they increase small slab scanning by 100x
which means reclaim CPU has most definitely increased.

However, the memcg reaper *doesn't need to be perfect* to solve the
"takes too long to clean up dying memcgs" problem. Even if it leaves shared
objects behind (which we want to do!), it still trims those memcgs
down to /just the shared objects still in use/.  And given that
objects shared by memcgs are in the minority (according to past
discussions about the difficulties of accounting them correctly) I
think this is just fine.

Besides, those remaining shared objects are the ones we want to
naturally age out under memory pressure, but otherwise the memcgs
will have been shaken clean of all other objects accounted to them.
i.e. the "dying memcg" memory footprint goes down massively and the
"long term buildup" of dying memcgs basically goes away.

> So I'd really prefer to have the "natural" memory pressure applied
> in a way that doesn't leave any stale objects behind.

Except that "natural" memory pressure isn't enough to do this, so
you've artificially inflated that "natural" pressure and screwed up
the reclaim for everyone else.

> Second, the code around slab pressure has not simply "worked well for years":
> as far as I can see, the latest major change was made about a year ago by
> Josef Bacik (9092c71bb724 "mm: use sc->priority for slab shrink targets").

Strawman. False equivalence.

The commit you mention was effectively a minor tweak - instead of
using the ratio of pages scanned to pages reclaimed to define the
amount of slab work, (a second order measure of reclaim urgency), it
was changed to use the primary reclaim urgency directive that the
page scanning uses - the reclaim priority.

This *isn't a major change* in algorithm - it conveys essentially
exactly the same information, and does not change the overall page
cache vs shrinker reclaim balance. It just behaves more correctly in
corner cases where there is very little page cache/lots of slab and
vice versa.  It barely even changed the amount or balance of
reclaim being done under normal circumstances.

> The existing balance, even if it works perfectly for some cases, isn't something
> set in stone. We're really under-scanning small cgroups, and I strongly believe
> that what Rik is proposing is the right thing to do.

You keep saying "small cgroups". That is incorrect.

The shrinker knows nothing about the size of the cache. All it knows
is how many /freeable objects/ are in the cache. A large cache can
have zero freeable objects, and to the shrinker look just like a
small cache with just a few freeable objects. You're trying to
derive cache characterisations from accounting data that carries no
such information.

IOWs, adding yet another broken, racy, unpredictable "deferred scan
count" that triggers when there are few freeable objects in the cache
is not an improvement. The code is completely buggy, because
shrinkers can run in parallel and so shrinker->small_scan can be
modified by multiple shrinker instances running at the same time.
That's the problem the shrinker->nr_deferred[] array and the atomic
exchanges solve (which has other problems we really need to fix
before we even start thinking about changing shrinker balances at
all).
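
For reference, that hand-off works roughly like this (a stripped-down
userspace model of the nr_deferred[] handling, not the kernel code):

#include <stdatomic.h>
#include <stdio.h>

/*
 * Model of shrinker->nr_deferred[nid]: each shrinker instance atomically
 * claims *all* of the currently deferred work, does what it can, then
 * atomically hands the remainder back. Two instances running in parallel
 * can never account the same deferred work twice - which is exactly the
 * protection an unsynchronised counter like small_scan doesn't have.
 */
static atomic_long nr_deferred;

static void shrinker_instance(long delta, long max_scan)
{
	long deferred = atomic_exchange(&nr_deferred, 0);  /* claim it all */
	long total_scan = deferred + delta;
	long scanned = total_scan < max_scan ? total_scan : max_scan;

	/* ... free 'scanned' objects here ... */

	/* hand back what we couldn't do for the next instance to pick up */
	atomic_fetch_add(&nr_deferred, total_scan - scanned);
	printf("claimed %ld deferred, scanned %ld, re-deferred %ld\n",
	       deferred, scanned, total_scan - scanned);
}

int main(void)
{
	shrinker_instance(800, 0);	/* GFP_NOFS context: defers everything */
	shrinker_instance(800, 1024);	/* next GFP_KERNEL context inherits it */
	return 0;
}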

> If we don't scan objects
> in small cgroups unless we have really strong memory pressure, we're basically
> wasting memory.

Maybe for memcgs, but that's exactly the opposite of what we want to
do for global caches (e.g. filesystem metadata caches). We need to
make sure that a single, heavily pressured cache doesn't evict small
caches that are under lower pressure but are equally important for
performance.

e.g. I've noticed recently a significant increase in RMW cycles in
XFS inode cache writeback during various benchmarks. It hasn't
affected performance because the machine has IO and CPU to burn, but
on slower machines and storage, it will have a major impact.

The reason these RMW cycles occur is that the dirty inode in memory
needs to be flushed to the backing buffer before it can be written.
That backing buffer is in a metadata slab cache controlled by a
shrinker (not the superblock shrinker which reclaims inodes).  It's
*much smaller* than the inode cache, but it also needs to be turned
over much more slowly than the inode cache. If it gets turned over
too quickly because of inode cache pressure, then flushing dirty
inodes (in the superblock inode cache shrinker!) needs to read the
backing buffer back in to memory before it can write the inode and
reclaim it.

And when you change the shrinker infrastructure to reclaim small
caches more aggressively? It changes the balance of cache reclaim
within the filesystem, and so opens up these "should have been in
the cache but wasn't" performance problems. Where are all the
problems that users have reported? In the superblock shrinker,
reclaiming XFS inodes, waiting on IO to be done on really slow IO
subsystems. Basically, we're stuck because the shrinker changes have
screwed up the intra-cache reclaim balance.

IOWs, there are complex dependencies between shrinkers and memory
reclaim, and changing the balance of large vs small caches can have
very adverse affects when memory pressure occurs. For XFS,
increasing small cache pressure vs large cache pressure has the
effect of *increasing the memory demand* of inode reclaim because
writeback has to allocate buffers, and that's one of the reasons
it's the canary for naive memory reclaim algorithm changes.

Just because something "works for your memcg heavy workload" doesn't
mean it works for everyone, or is even a desirable algorithmic
change.

> And it really makes no sense to reclaim inodes with tons of attached pagecache
> as easily as "empty" inodes.

You didn't even know this happened until recently. Have you looked
up its history?  e.g. from the initial git commit in 2.6.12, almost
15 years ago now, there's this comment on prune_icache():

 * Any inodes which are pinned purely because of attached pagecache have their
 * pagecache removed.  We expect the final iput() on that inode to add it to
 * the front of the inode_unused list.  So look for it there and if the
 * inode is still freeable, proceed.  The right inode is found 99.9% of the
 * time in testing on a 4-way.

Indeed, did you know this inode cache based page reclaim is what
PGINODESTEAL and KSWAPD_INODESTEAL /proc/vmstat record for us? e.g.
looking at my workstation:

$ grep inodesteal /proc/vmstat
pginodesteal 14617964
kswapd_inodesteal 6674859
$

Page cache reclaim from clean, unreferenced, aged-out inodes
actually happens an awful lot in real life.

> So, assuming all this, can we, please, first check if Rik's patch is addressing
> the regression?

Nope, it's broken and buggy, and reintroduces problems with racing
deferred counts that were fixed years ago when I originally
wrote the NUMA-aware shrinker infrastructure.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] Revert "mm: don't reclaim inodes with many attached pages"
  2019-01-31 15:48       ` Chris Mason
@ 2019-02-01 23:39         ` Dave Chinner
  0 siblings, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2019-02-01 23:39 UTC (permalink / raw)
  To: Chris Mason
  Cc: linux-mm, linux-kernel, linux-fsdevel, linux-xfs, Roman Gushchin,
	akpm, mhocko, vdavydov.dev

On Thu, Jan 31, 2019 at 03:48:11PM +0000, Chris Mason wrote:
> On 30 Jan 2019, at 20:34, Dave Chinner wrote:
> 
> > On Wed, Jan 30, 2019 at 12:21:07PM +0000, Chris Mason wrote:
> >>
> >>
> >> On 29 Jan 2019, at 23:17, Dave Chinner wrote:
> >>
> >>> From: Dave Chinner <dchinner@redhat.com>
> >>>
> >>> This reverts commit a76cf1a474d7dbcd9336b5f5afb0162baa142cf0.
> >>>
> >>> This change causes serious changes to page cache and inode cache
> >>> behaviour and balance, resulting in major performance regressions
> >>> when combining workloads such as large file copies and kernel
> >>> compiles.
> >>>
> >>> https://bugzilla.kernel.org/show_bug.cgi?id=202441
> >>
> >> I'm a little confused by the latest comment in the bz:
> >>
> >> https://bugzilla.kernel.org/show_bug.cgi?id=202441#c24
> >
> > Which says the first patch that changed the shrinker behaviour is
> > the underlying cause of the regression.
> >
> >> Are these reverts sufficient?
> >
> > I think so.
> 
> Based on the latest comment:
> 
> "If I had been less strict in my testing I probably would have 
> discovered that the problem was present earlier than 4.19.3. Mr Gushins 
> commit made it more visible.
> I'm going back to work after two days off, so I might not be able to 
> respond inside your working hours, but I'll keep checking in on this as 
> I get a chance."
> 
> I don't think the reverts are sufficient.

Roger has tested the two reverts more heavily against 5.0.0-rc3.
Without the reverts, the machine locks up hard. With the two reverts
applied, it runs along smoothly under extremely heavy load.

https://bugzilla.kernel.org/show_bug.cgi?id=202441#c26

So, yes, these changes need to be reverted.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] Revert "mm: don't reclaim inodes with many attached pages"
  2019-01-31 22:19           ` Dave Chinner
@ 2019-02-04 21:47             ` Dave Chinner
  2019-02-07 10:27             ` Jan Kara
  1 sibling, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2019-02-04 21:47 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Michal Hocko, Chris Mason, linux-mm, linux-kernel, linux-fsdevel,
	linux-xfs, akpm, vdavydov.dev

On Fri, Feb 01, 2019 at 09:19:04AM +1100, Dave Chinner wrote:
> > So, assuming all this, can we, please, first check if Rik's patch is addressing
> > the regression?
> 
> Nope, it's broken and buggy, and reintroduces problems with racing
> deferred counts that were fixed years ago when I originally
> wrote the numa aware shrinker infrastructure.

So, the first thing to do here is fix the non-deterministic deferred
reclaim behaviour of the shrinker. Any deferred reclaim because of
things like GFP_NOFS contexts gets landed on the very next
shrinker instance that runs, be it kswapd or direct reclaim,
regardless of whether it can even perform reclaim or not (e.g. it
might be a GFP_NOFS context itself).

As a result, there is no predicting when a shrinker instance might
get landed with a huge amount of work that isn't its own. We can't
even guarantee that kswapd reclaim context sees this deferred work,
because if there is another reclaimer on that node running at the
same time, it will have scooped away the deferred work and kswapd
will only do a small amount of work.

How small? Well a node with 1.8m freeable items, reclaim priority
12 (i.e. lowest priority), and seeks = 2 will result in a scan count
of:

	delta = 1,800,000 >> 12 = 439
	delta *= 4 = 1756
	delta /= 2 = 878

the shrinker will only scan 878 objects when there is light memory
pressure. That's only 0.05% of the cache, which is insignificant.
That's fine for direct reclaim, but for kswapd we need it to do more
work when there is deferred work.

So, say we have another 1m objects of deferred work (very common
on filesystem (and therefore GFP_NOFS) heavy workloads), we'll do:

	total_scan = deferred_objects;
	....
	total_scan += delta;

(ignoring clamping)

which means that we'll actually try to scan 1,000,878 objects from
the cache on the next pass. This may be clamped down to
(freeable_objects / 2), but that's still 900,000 objects in this
case.
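
In code form, that's roughly (a simplified userspace model of the
do_shrink_slab() arithmetic, ignoring the batch loop):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t freeable = 1800000;	/* freeable objects on this node */
	uint64_t deferred = 1000000;	/* wound up by GFP_NOFS reclaimers */
	int priority = 12;		/* light memory pressure */
	int seeks = 2;

	uint64_t delta = (freeable >> priority) * 4 / seeks;
	uint64_t total_scan = deferred + delta;

	/* clamp as described above: light pressure caps at freeable / 2 */
	if (total_scan > freeable / 2)
		total_scan = freeable / 2;

	printf("delta %llu, total_scan %llu (~%llux the undeferred work)\n",
	       (unsigned long long)delta, (unsigned long long)total_scan,
	       (unsigned long long)(total_scan / delta));
	return 0;
}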

IOWs, we just increased the reclaim work of this shrinker instance
by a factor of 1000x.  That's where all the long tail shrinker
reclaim latencies are coming from, and where a large amount of the
"XFS is doing inode IO from the shrinker" come from as they drive it
straight through all the clean inodes and into dirty reclaimable
inodes. With the additional "small cache" pressure being added and
then deferred since 4.18-rc5, this is much more likely to happen
with direct reclaim because the deferred count from GFP_NOFS
allocations are wound up faster.

So, if we want to prevent direct reclaim from this obvious long tail
latency problem, we have to stop direct reclaim from ever doing
deferred work. i.e. we need to move deferred work to kswapd and, for
XFS, we then have to ensure that kswapd will not block on dirty
inodes so it can do all this work as quickly as possible. And then
for kswapd, we need to limit the amount of deferred work so that it
doesn't spend all its time emptying a single cache at low
priorities, but will attempt to perform all the deferred work if
reclaim priority winds up far enough.

This gives us a solid, predictable "deferred work" infrastructure
for the shrinkers. It gets rid of the nasty behaviour, and paves the
way for adding different sorts of deferred work (like Rik's "small
cache" pressure increase) to kswapd rather than in random reclaim
contexts. It also allows us to use a different "how much work should
we do" calculation for kswapd. i.e. one that is appropriate for
background, non-blocking scanning rather than being tailored to
limiting the work that any one direct reclaim context must do.

So, if I just set the XFS inode cache shrinker to skip inode
writeback (as everyone seems to want to do), fsmark is OOM-killed at
27M inodes, right when the log runs out of space, the tail-pushing
thread goes near to being CPU bound, and inode writeback from
reclaim is necessary to retire the dirty inode from the log before
it can be reclaimed. It's easily reproducible, and sometimes the
oom-killer chooses daemons rather than fsmark and it has killed the
system on occasion. Reverting the two patches in this thread makes
the OOM kill problem go away - it just turns it back into a
performance issue.

So, on top of the reverts, the patch below - which moves the deferred
shrinker work to kswapd - is the first "XFS inode shrinker doesn't
block kswapd" patch I've been able to get through my benchmarks and
memory stress workloads without triggering OOM-kills randomly or
seeing substantial performance regressions. Indeed, it appears to
behave better than the existing code (fsmark inode create is
marginally faster, and the simoops long tails have completely gone(*)).

This indicates to me that we really should be considering fixing the
deferred work problems before adding new types of deferred work into
the shrinker infrastructure (for whatever reason). Get the
infrastructure reliable, predictable and somewhat deterministic,
then we can start trialling pressure/balance changes knowing exactly
where we are directing that extra work....

Cheers,

Dave.

(*) Chris, FYI, the last output before simoops died because of "too
many open files" - p99 latency is nearly identical to p50 latency:

Run time: 10873 seconds
Read latency (p50: 3,084,288) (p95: 3,158,016) (p99: 3,256,320)
Write latency (p50: 7,479,296) (p95: 8,101,888) (p99: 8,437,760)
Allocation latency (p50: 479,744) (p95: 744,448) (p99: 1,016,832)
work rate = 8.63/sec (avg 9.05/sec) (p50: 9.19) (p95: 9.91) (p99: 11.42)
alloc stall rate = 0.02/sec (avg: 0.01) (p50: 0.01) (p95: 0.01) (p99: 0.01)

This is the same machine that I originally ran simoops on back in
~4.9 when you first proposed async kswapd behaviour for XFS. Typical
long tail latencies back then were:

[https://www.spinics.net/lists/linux-xfs/msg02235.html]

Run time: 1140 seconds
Read latency (p50: 3,035,136) (p95: 4,415,488) (p99: 6,119,424)
Write latency (p50: 27,557,888) (p95: 31,490,048) (p99: 45,285,376)
Allocation latency (p50: 247,552) (p95: 1,497,088) (p99: 19,496,960)
work rate = 3.68/sec (avg 3.71/sec) (p50: 3.71) (p95: 4.04) (p99: 4.04)
alloc stall rate = 1.65/sec (avg: 0.12) (p50: 0.00) (p95: 0.12) (p99: 0.12)

-- 
Dave Chinner
david@fromorbit.com

mm: shrinker deferral cleanup

From: Dave Chinner <dchinner@redhat.com>

Shrinker deferral lands on a random GFP_KERNEL reclaim context, which
means some poor direct reclaimer could be loaded with huge amounts of
work just because it's the first reclaimer to run in a while.

This can be seen from the shrinker trace point output, where
random reclaim contexts take all the deferred scan count and try to
run it, then put it all back in the global pool when they are done.
Racing shrinkers see none of that deferred work, to the point where
kswapd may never see any load on the shrinker at all because it's
always being held by a direct reclaimer.

So, first things first: only do deferred work in kswapd context. We
know that this is GFP_KERNEL context, so work deferred from GFP_NOFS
contexts will always be able to run from kswapd.

This also gets rid of the need to specifically avoid windup because
we only have one thread that will process the deferred work, and it
will be capped in what it can do in a single pass by the reclaim priority
it operates under. i.e. reclaim priority prevents deferred work from
being done all at once under light memory pressure. If we have really
heavy pressure, then we're aiming to kill as much cache as we can,
so at that point windup no longer matters.

Next, factor out the calculation of the amount of work to do from the
rest of the code. This makes it easier to see what is work
calculation and what are constraints, clamping and behavioural
restrictions. Rename the variables to be more meaningful, too,
and convert everything to uint64_t, because all the hoops we jump
through to keep things in 32 bits for 32 bit systems make this all
just a mess.

Next, allow the shrinker "freeable object count" callout to tell the
shrinker it won't be able to do any work. e.g. GFP_NOFS context on a
filesystem shrinker. This means it can simply calculate the work to
defer to kswapd and move on. Fast, and doesn't require calling into
the scan code to find out that we can't actually do any work.

Next, clean up the tracing to be less ... obtuse. We care about the
work being done, the amount of work that was done, and how much we
still have deferred to do. The rest of it is mostly useless.

Finally, remove the blocking SYNC_WAIT from kswapd context in the
XFS inode shrinker. Still block direct reclaim, but allow kswapd to
scan primarily for clean, immediately reclaimable inodes without
regard to any other reclaim that is on-going. This means kswapd
won't get stuck behind blocked direct reclaim, nor will it issue IO
unless there

Further experiments:
- kick kswapd when deferred gets too big
- store deferred priority rather than a count? Windup always ends up
  with more deferred work than there are freeable items, so a
  do_div(freeable, deferred_priority) setup might make sense.
- get kswapd reclaim priority wound up if the shrinker
  is not making enough progress on deferred work.
- factor out deferral code

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/super.c                    |   8 +++
 fs/xfs/xfs_icache.c           |   7 +-
 include/linux/shrinker.h      |   2 +
 include/trace/events/vmscan.h |  69 +++++++++----------
 mm/vmscan.c                   | 156 +++++++++++++++++++++++++-----------------
 5 files changed, 141 insertions(+), 101 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 48e25eba8465..59bfb285a856 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -139,6 +139,14 @@ static unsigned long super_cache_count(struct shrinker *shrink,
 		return 0;
 	smp_rmb();
 
+	/*
+	 * If we know we can't reclaim, let the shrinker know so it can account
+	 * for deferred reclaim that kswapd must do, but doesn't have to call
+	 * super_cache_count to find this out.
+	 */
+	if (!(sc->gfp_mask & __GFP_FS))
+		sc->will_defer = true;
+
 	if (sb->s_op && sb->s_op->nr_cached_objects)
 		total_objects = sb->s_op->nr_cached_objects(sb, sc);
 
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 245483cc282b..60723ae79ec2 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1373,11 +1373,16 @@ xfs_reclaim_inodes_nr(
 	struct xfs_mount	*mp,
 	int			nr_to_scan)
 {
+	int			flags = SYNC_TRYLOCK;
+
 	/* kick background reclaimer and push the AIL */
 	xfs_reclaim_work_queue(mp);
 	xfs_ail_push_all(mp->m_ail);
 
-	return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr_to_scan);
+	if (!current_is_kswapd())
+		flags |= SYNC_WAIT;
+
+	return xfs_reclaim_inodes_ag(mp, flags, &nr_to_scan);
 }
 
 /*
diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 9443cafd1969..a4216dcdd59e 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -31,6 +31,8 @@ struct shrink_control {
 
 	/* current memcg being shrunk (for memcg aware shrinkers) */
 	struct mem_cgroup *memcg;
+
+	bool will_defer;
 };
 
 #define SHRINK_STOP (~0UL)
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index a1cb91342231..a4f34cde779a 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -195,84 +195,81 @@ DEFINE_EVENT(mm_vmscan_direct_reclaim_end_template, mm_vmscan_memcg_softlimit_re
 
 TRACE_EVENT(mm_shrink_slab_start,
 	TP_PROTO(struct shrinker *shr, struct shrink_control *sc,
-		long nr_objects_to_shrink, unsigned long cache_items,
-		unsigned long long delta, unsigned long total_scan,
-		int priority),
+		int64_t deferred_count, int64_t freeable_objects,
+		int64_t scan_count, int priority),
 
-	TP_ARGS(shr, sc, nr_objects_to_shrink, cache_items, delta, total_scan,
+	TP_ARGS(shr, sc, deferred_count, freeable_objects, scan_count,
 		priority),
 
 	TP_STRUCT__entry(
 		__field(struct shrinker *, shr)
 		__field(void *, shrink)
 		__field(int, nid)
-		__field(long, nr_objects_to_shrink)
-		__field(gfp_t, gfp_flags)
-		__field(unsigned long, cache_items)
-		__field(unsigned long long, delta)
-		__field(unsigned long, total_scan)
+		__field(int64_t, deferred_count)
+		__field(int64_t, freeable_objects)
+		__field(int64_t, scan_count)
 		__field(int, priority)
+		__field(gfp_t, gfp_flags)
 	),
 
 	TP_fast_assign(
 		__entry->shr = shr;
 		__entry->shrink = shr->scan_objects;
 		__entry->nid = sc->nid;
-		__entry->nr_objects_to_shrink = nr_objects_to_shrink;
-		__entry->gfp_flags = sc->gfp_mask;
-		__entry->cache_items = cache_items;
-		__entry->delta = delta;
-		__entry->total_scan = total_scan;
+		__entry->deferred_count = deferred_count;
+		__entry->freeable_objects = freeable_objects;
+		__entry->scan_count = scan_count;
 		__entry->priority = priority;
+		__entry->gfp_flags = sc->gfp_mask;
 	),
 
-	TP_printk("%pF %p: nid: %d objects to shrink %ld gfp_flags %s cache items %ld delta %lld total_scan %ld priority %d",
+	TP_printk("%pF %p: nid: %d objects to scan %lld freeable items %lld deferred count %lld priority %d gfp_flags %s",
 		__entry->shrink,
 		__entry->shr,
 		__entry->nid,
-		__entry->nr_objects_to_shrink,
-		show_gfp_flags(__entry->gfp_flags),
-		__entry->cache_items,
-		__entry->delta,
-		__entry->total_scan,
-		__entry->priority)
+		__entry->scan_count,
+		__entry->freeable_objects,
+		__entry->deferred_count,
+		__entry->priority,
+		show_gfp_flags(__entry->gfp_flags))
 );
 
 TRACE_EVENT(mm_shrink_slab_end,
-	TP_PROTO(struct shrinker *shr, int nid, int shrinker_retval,
-		long unused_scan_cnt, long new_scan_cnt, long total_scan),
+	TP_PROTO(struct shrinker *shr, int nid, int64_t freed_objects,
+		int64_t unused_scan_cnt, int64_t new_deferred_count,
+		int64_t old_deferred_count),
 
-	TP_ARGS(shr, nid, shrinker_retval, unused_scan_cnt, new_scan_cnt,
-		total_scan),
+	TP_ARGS(shr, nid, freed_objects, unused_scan_cnt, new_deferred_count,
+		old_deferred_count),
 
 	TP_STRUCT__entry(
 		__field(struct shrinker *, shr)
 		__field(int, nid)
 		__field(void *, shrink)
-		__field(long, unused_scan)
-		__field(long, new_scan)
-		__field(int, retval)
-		__field(long, total_scan)
+		__field(long long, unused_scan)
+		__field(long long, new_deferred_count)
+		__field(long long, freed_objects)
+		__field(long long, old_deferred_count)
 	),
 
 	TP_fast_assign(
 		__entry->shr = shr;
 		__entry->nid = nid;
 		__entry->shrink = shr->scan_objects;
+		__entry->freed_objects = freed_objects;
 		__entry->unused_scan = unused_scan_cnt;
-		__entry->new_scan = new_scan_cnt;
-		__entry->retval = shrinker_retval;
-		__entry->total_scan = total_scan;
+		__entry->new_deferred_count = new_deferred_count;
+		__entry->old_deferred_count = old_deferred_count;
 	),
 
-	TP_printk("%pF %p: nid: %d unused scan count %ld new scan count %ld total_scan %ld last shrinker return val %d",
+	TP_printk("%pF %p: nid: %d freed objects %lld unused scan count %lld new deferred count %lld old deferred count %lld",
 		__entry->shrink,
 		__entry->shr,
 		__entry->nid,
+		__entry->freed_objects,
 		__entry->unused_scan,
-		__entry->new_scan,
-		__entry->total_scan,
-		__entry->retval)
+		__entry->new_deferred_count,
+		__entry->old_deferred_count)
 );
 
 TRACE_EVENT(mm_vmscan_lru_isolate,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e979705bbf32..7db6d8242613 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -447,37 +447,21 @@ void unregister_shrinker(struct shrinker *shrinker)
 }
 EXPORT_SYMBOL(unregister_shrinker);
 
-#define SHRINK_BATCH 128
-
-static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
-				    struct shrinker *shrinker, int priority)
+/*
+ * calculate the number of new objects to scan this time around
+ */
+static int64_t shrink_scan_count(struct shrink_control *shrinkctl,
+			    struct shrinker *shrinker, int priority,
+			    uint64_t *freeable_objects)
 {
-	unsigned long freed = 0;
-	unsigned long long delta;
-	long total_scan;
-	long freeable;
-	long nr;
-	long new_nr;
 	int nid = shrinkctl->nid;
-	long batch_size = shrinker->batch ? shrinker->batch
-					  : SHRINK_BATCH;
-	long scanned = 0, next_deferred;
-
-	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
-		nid = 0;
+	uint64_t delta;
+	uint64_t freeable;
 
 	freeable = shrinker->count_objects(shrinker, shrinkctl);
 	if (freeable == 0 || freeable == SHRINK_EMPTY)
 		return freeable;
 
-	/*
-	 * copy the current shrinker scan count into a local variable
-	 * and zero it so that other concurrent shrinker invocations
-	 * don't also do this scanning work.
-	 */
-	nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
-
-	total_scan = nr;
 	if (shrinker->seeks) {
 		delta = freeable >> priority;
 		delta *= 4;
@@ -491,40 +475,81 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 		delta = freeable / 2;
 	}
 
-	total_scan += delta;
-	if (total_scan < 0) {
-		pr_err("shrink_slab: %pF negative objects to delete nr=%ld\n",
-		       shrinker->scan_objects, total_scan);
-		total_scan = freeable;
-		next_deferred = nr;
-	} else
-		next_deferred = total_scan;
+	*freeable_objects = freeable;
+	return delta > 0 ? delta : 0;
+}
 
-	/*
-	 * We need to avoid excessive windup on filesystem shrinkers
-	 * due to large numbers of GFP_NOFS allocations causing the
-	 * shrinkers to return -1 all the time. This results in a large
-	 * nr being built up so when a shrink that can do some work
-	 * comes along it empties the entire cache due to nr >>>
-	 * freeable. This is bad for sustaining a working set in
-	 * memory.
-	 *
-	 * Hence only allow the shrinker to scan the entire cache when
-	 * a large delta change is calculated directly.
-	 */
-	if (delta < freeable / 4)
-		total_scan = min(total_scan, freeable / 2);
+#define SHRINK_BATCH 128
+
+static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
+				    struct shrinker *shrinker, int priority)
+{
+	int nid = shrinkctl->nid;
+	int batch_size = shrinker->batch ? shrinker->batch
+					  : SHRINK_BATCH;
+	int64_t scan_count;
+	int64_t freeable_objects = 0;
+	int64_t scanned_objects = 0;
+	int64_t next_deferred = 0;
+	int64_t new_dcount;
+	int64_t freed = 0;
+	int64_t deferred_count = 0;
+
+	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
+		nid = 0;
+
+	shrinkctl->will_defer = false;
+	scan_count = shrink_scan_count(shrinkctl, shrinker, priority,
+					&freeable_objects);
+	if (scan_count == 0 || scan_count == SHRINK_EMPTY)
+		return scan_count;
+
+/*
+ * If kswapd, we take all the deferred work and do it here. We don't let direct
+ * reclaim do this, because then it means some poor sod is going to have to do
+ * somebody else's GFP_NOFS reclaim, and it hides the real amount of reclaim
+ * work from concurrent kswapd operations. Hence we do the work in the wrong
+ * place, at the wrong time, and it's largely unpredictable.
+ *
+ * By doing the deferred work only in kswapd, we can schedule the work
+ * according to the reclaim priority - low priority reclaim will do less
+ * deferred work, hence we'll do more of the deferred work the more desperate
+ * we become for free memory. This avoids the need to specifically avoid
+ * deferred work windup, as light memory pressure won't excessively trim
+ * caches anymore.
+ */
+	if (current_is_kswapd()) {
+		int64_t	deferred_scan;
+
+		deferred_count = atomic64_xchg(&shrinker->nr_deferred[nid], 0);
+
+		/* we want to scan 5-10% of the deferred work here at minimum */
+		deferred_scan = deferred_count;
+		if (priority)
+			do_div(deferred_scan, priority);
+
+		scan_count += deferred_scan;
+	}
 
 	/*
-	 * Avoid risking looping forever due to too large nr value:
+	 * Avoid risking looping forever due to too much deferred work:
 	 * never try to free more than twice the estimate number of
 	 * freeable entries.
 	 */
-	if (total_scan > freeable * 2)
-		total_scan = freeable * 2;
+	if (scan_count > freeable_objects * 2)
+		scan_count = freeable_objects * 2;
+
 
-	trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
-				   freeable, delta, total_scan, priority);
+	trace_mm_shrink_slab_start(shrinker, shrinkctl, deferred_count,
+				   freeable_objects, scan_count, priority);
+
+	/*
+	 * If the shrinker can't run (e.g. due to gfp_mask constraints), then
+	 * defer the work to kswapd. kswapd runs under GFP_KERNEL, so should
+	 * never have shrinker work deferred in that context.
+	 */
+	if (shrinkctl->will_defer)
+		goto done;
 
 	/*
 	 * Normally, we should not scan less than batch_size objects in one
@@ -541,10 +566,10 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 	 * scanning at high prio and therefore should try to reclaim as much as
 	 * possible.
 	 */
-	while (total_scan >= batch_size ||
-	       total_scan >= freeable) {
-		unsigned long ret;
-		unsigned long nr_to_scan = min(batch_size, total_scan);
+	while (scan_count >= batch_size ||
+	       scan_count >= freeable_objects) {
+		int64_t ret;
+		int64_t nr_to_scan = min_t(int64_t, batch_size, scan_count);
 
 		shrinkctl->nr_to_scan = nr_to_scan;
 		shrinkctl->nr_scanned = nr_to_scan;
@@ -554,28 +579,31 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 		freed += ret;
 
 		count_vm_events(SLABS_SCANNED, shrinkctl->nr_scanned);
-		total_scan -= shrinkctl->nr_scanned;
-		scanned += shrinkctl->nr_scanned;
+		scan_count -= shrinkctl->nr_scanned;
+		scanned_objects += shrinkctl->nr_scanned;
 
 		cond_resched();
 	}
 
-	if (next_deferred >= scanned)
-		next_deferred -= scanned;
-	else
-		next_deferred = 0;
+done:
+	if (deferred_count)
+		next_deferred = deferred_count - scanned_objects;
+	else if (scan_count > 0)
+		next_deferred = scan_count;
+
 	/*
 	 * move the unused scan count back into the shrinker in a
 	 * manner that handles concurrent updates. If we exhausted the
 	 * scan, there is no need to do an update.
 	 */
 	if (next_deferred > 0)
-		new_nr = atomic_long_add_return(next_deferred,
+		new_dcount = atomic64_add_return(next_deferred,
 						&shrinker->nr_deferred[nid]);
 	else
-		new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);
+		new_dcount = atomic64_read(&shrinker->nr_deferred[nid]);
 
-	trace_mm_shrink_slab_end(shrinker, nid, freed, nr, new_nr, total_scan);
+	trace_mm_shrink_slab_end(shrinker, nid, freed, scan_count, new_dcount,
+				deferred_count);
 	return freed;
 }
 

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] Revert "mm: don't reclaim inodes with many attached pages"
  2019-01-31 22:19           ` Dave Chinner
  2019-02-04 21:47             ` Dave Chinner
@ 2019-02-07 10:27             ` Jan Kara
  2019-02-08  5:37               ` Andrew Morton
  2019-02-11 15:34               ` Wolfgang Walter
  1 sibling, 2 replies; 20+ messages in thread
From: Jan Kara @ 2019-02-07 10:27 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Roman Gushchin, Michal Hocko, Chris Mason, linux-mm,
	linux-kernel, linux-fsdevel, linux-xfs, akpm, vdavydov.dev

On Fri 01-02-19 09:19:04, Dave Chinner wrote:
> Maybe for memcgs, but that's exactly the oppose of what we want to
> do for global caches (e.g. filesystem metadata caches). We need to
> make sure that a single, heavily pressured cache doesn't evict small
> caches that lower pressure but are equally important for
> performance.
> 
> e.g. I've noticed recently a significant increase in RMW cycles in
> XFS inode cache writeback during various benchmarks. It hasn't
> affected performance because the machine has IO and CPU to burn, but
> on slower machines and storage, it will have a major impact.

Just as a data point, our performance testing infrastructure has bisected
down to the commits discussed in this thread as the cause of about 40%
regression in XFS file delete performance in bonnie++ benchmark.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] Revert "mm: don't reclaim inodes with many attached pages"
  2019-02-07 10:27             ` Jan Kara
@ 2019-02-08  5:37               ` Andrew Morton
  2019-02-08  9:55                 ` Jan Kara
  2019-02-08 21:25                 ` Dave Chinner
  2019-02-11 15:34               ` Wolfgang Walter
  1 sibling, 2 replies; 20+ messages in thread
From: Andrew Morton @ 2019-02-08  5:37 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dave Chinner, Roman Gushchin, Michal Hocko, Chris Mason,
	linux-mm, linux-kernel, linux-fsdevel, linux-xfs, vdavydov.dev

On Thu, 7 Feb 2019 11:27:50 +0100 Jan Kara <jack@suse.cz> wrote:

> On Fri 01-02-19 09:19:04, Dave Chinner wrote:
> > Maybe for memcgs, but that's exactly the oppose of what we want to
> > do for global caches (e.g. filesystem metadata caches). We need to
> > make sure that a single, heavily pressured cache doesn't evict small
> > caches that lower pressure but are equally important for
> > performance.
> > 
> > e.g. I've noticed recently a significant increase in RMW cycles in
> > XFS inode cache writeback during various benchmarks. It hasn't
> > affected performance because the machine has IO and CPU to burn, but
> > on slower machines and storage, it will have a major impact.
> 
> Just as a data point, our performance testing infrastructure has bisected
> down to the commits discussed in this thread as the cause of about 40%
> regression in XFS file delete performance in bonnie++ benchmark.
> 

Has anyone done significant testing with Rik's maybe-fix?



From: Rik van Riel <riel@surriel.com>
Subject: mm, slab, vmscan: accumulate gradual pressure on small slabs

There are a few issues with the way the number of slab objects to scan is
calculated in do_shrink_slab.  First, for zero-seek slabs, we could leave
the last object around forever.  That could result in pinning a dying
cgroup into memory, instead of reclaiming it.  The fix for that is
trivial.

Secondly, small slabs receive much more pressure, relative to their size,
than larger slabs, due to "rounding up" the minimum number of scanned
objects to batch_size.

We can keep the pressure on all slabs equal relative to their size by
accumulating the scan pressure on small slabs over time, resulting in
sometimes scanning an object, instead of always scanning several.

This results in lower system CPU use, and a lower major fault rate, as
actively used entries from smaller caches get reclaimed less aggressively,
and need to be reloaded/recreated less often.

[akpm@linux-foundation.org: whitespace fixes, per Roman]
[riel@surriel.com: couple of fixes]
  Link: http://lkml.kernel.org/r/20190129142831.6a373403@imladris.surriel.com
Link: http://lkml.kernel.org/r/20190128143535.7767c397@imladris.surriel.com
Fixes: 4b85afbdacd2 ("mm: zero-seek shrinkers")
Fixes: 172b06c32b94 ("mm: slowly shrink slabs with a relatively small number of objects")
Signed-off-by: Rik van Riel <riel@surriel.com>
Tested-by: Chris Mason <clm@fb.com>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Jonathan Lemon <bsd@fb.com>
Cc: Jan Kara <jack@suse.cz>
Cc: <stable@vger.kernel.org>

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---


--- a/include/linux/shrinker.h~mmslabvmscan-accumulate-gradual-pressure-on-small-slabs
+++ a/include/linux/shrinker.h
@@ -65,6 +65,7 @@ struct shrinker {
 
 	long batch;	/* reclaim batch size, 0 = default */
 	int seeks;	/* seeks to recreate an obj */
+	int small_scan;	/* accumulate pressure on slabs with few objects */
 	unsigned flags;
 
 	/* These are for internal use */
--- a/mm/vmscan.c~mmslabvmscan-accumulate-gradual-pressure-on-small-slabs
+++ a/mm/vmscan.c
@@ -488,18 +488,30 @@ static unsigned long do_shrink_slab(stru
 		 * them aggressively under memory pressure to keep
 		 * them from causing refetches in the IO caches.
 		 */
-		delta = freeable / 2;
+		delta = (freeable + 1) / 2;
 	}
 
 	/*
 	 * Make sure we apply some minimal pressure on default priority
-	 * even on small cgroups. Stale objects are not only consuming memory
+	 * even on small cgroups, by accumulating pressure across multiple
+	 * slab shrinker runs. Stale objects are not only consuming memory
 	 * by themselves, but can also hold a reference to a dying cgroup,
 	 * preventing it from being reclaimed. A dying cgroup with all
 	 * corresponding structures like per-cpu stats and kmem caches
 	 * can be really big, so it may lead to a significant waste of memory.
 	 */
-	delta = max_t(unsigned long long, delta, min(freeable, batch_size));
+	if (!delta && shrinker->seeks) {
+		unsigned long nr_considered;
+
+		shrinker->small_scan += freeable;
+		nr_considered = shrinker->small_scan >> priority;
+
+		delta = 4 * nr_considered;
+		do_div(delta, shrinker->seeks);
+
+		if (delta)
+			shrinker->small_scan -= nr_considered << priority;
+	}
 
 	total_scan += delta;
 	if (total_scan < 0) {
_
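
For a concrete feel of how the accumulation behaves, here is a
stand-alone user-space sketch with made-up numbers - a 50 object cache,
default reclaim priority 12 and seeks of 2 - rather than kernel code:

#include <stdio.h>

int main(void)
{
	unsigned long small_scan = 0, freeable = 50;
	int priority = 12, seeks = 2, call;

	for (call = 1; call <= 200; call++) {
		unsigned long delta = (freeable >> priority) * 4 / seeks;

		if (!delta) {		/* small slab: accumulate pressure */
			unsigned long nr_considered;

			small_scan += freeable;
			nr_considered = small_scan >> priority;
			delta = 4 * nr_considered / seeks;
			if (delta)
				small_scan -= nr_considered << priority;
		}
		if (delta)
			printf("call %3d: scan %lu objects\n", call, delta);
	}
	return 0;
}

With these numbers the cache is only asked to scan a couple of objects
every ~80 shrinker invocations, instead of every invocation being rounded
up to min(freeable, batch_size) as before.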


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] Revert "mm: don't reclaim inodes with many attached pages"
  2019-02-08  5:37               ` Andrew Morton
@ 2019-02-08  9:55                 ` Jan Kara
  2019-02-08 12:50                   ` Jan Kara
  2019-02-08 21:25                 ` Dave Chinner
  1 sibling, 1 reply; 20+ messages in thread
From: Jan Kara @ 2019-02-08  9:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Dave Chinner, Roman Gushchin, Michal Hocko,
	Chris Mason, linux-mm, linux-kernel, linux-fsdevel, linux-xfs,
	vdavydov.dev

On Thu 07-02-19 21:37:27, Andrew Morton wrote:
> On Thu, 7 Feb 2019 11:27:50 +0100 Jan Kara <jack@suse.cz> wrote:
> 
> > On Fri 01-02-19 09:19:04, Dave Chinner wrote:
> > > Maybe for memcgs, but that's exactly the oppose of what we want to
> > > do for global caches (e.g. filesystem metadata caches). We need to
> > > make sure that a single, heavily pressured cache doesn't evict small
> > > caches that lower pressure but are equally important for
> > > performance.
> > > 
> > > e.g. I've noticed recently a significant increase in RMW cycles in
> > > XFS inode cache writeback during various benchmarks. It hasn't
> > > affected performance because the machine has IO and CPU to burn, but
> > > on slower machines and storage, it will have a major impact.
> > 
> > Just as a data point, our performance testing infrastructure has bisected
> > down to the commits discussed in this thread as the cause of about 40%
> > regression in XFS file delete performance in bonnie++ benchmark.
> > 
> 
> Has anyone done significant testing with Rik's maybe-fix?

I will give it a spin with bonnie++ today. We'll see what comes out.

								Honza

> 
> 
> 
> From: Rik van Riel <riel@surriel.com>
> Subject: mm, slab, vmscan: accumulate gradual pressure on small slabs
> 
> There are a few issues with the way the number of slab objects to scan is
> calculated in do_shrink_slab.  First, for zero-seek slabs, we could leave
> the last object around forever.  That could result in pinning a dying
> cgroup into memory, instead of reclaiming it.  The fix for that is
> trivial.
> 
> Secondly, small slabs receive much more pressure, relative to their size,
> than larger slabs, due to "rounding up" the minimum number of scanned
> objects to batch_size.
> 
> We can keep the pressure on all slabs equal relative to their size by
> accumulating the scan pressure on small slabs over time, resulting in
> sometimes scanning an object, instead of always scanning several.
> 
> This results in lower system CPU use, and a lower major fault rate, as
> actively used entries from smaller caches get reclaimed less aggressively,
> and need to be reloaded/recreated less often.
> 
> [akpm@linux-foundation.org: whitespace fixes, per Roman]
> [riel@surriel.com: couple of fixes]
>   Link: http://lkml.kernel.org/r/20190129142831.6a373403@imladris.surriel.com
> Link: http://lkml.kernel.org/r/20190128143535.7767c397@imladris.surriel.com
> Fixes: 4b85afbdacd2 ("mm: zero-seek shrinkers")
> Fixes: 172b06c32b94 ("mm: slowly shrink slabs with a relatively small number of objects")
> Signed-off-by: Rik van Riel <riel@surriel.com>
> Tested-by: Chris Mason <clm@fb.com>
> Acked-by: Roman Gushchin <guro@fb.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Dave Chinner <dchinner@redhat.com>
> Cc: Jonathan Lemon <bsd@fb.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: <stable@vger.kernel.org>
> 
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
> 
> --- a/include/linux/shrinker.h~mmslabvmscan-accumulate-gradual-pressure-on-small-slabs
> +++ a/include/linux/shrinker.h
> @@ -65,6 +65,7 @@ struct shrinker {
>  
>  	long batch;	/* reclaim batch size, 0 = default */
>  	int seeks;	/* seeks to recreate an obj */
> +	int small_scan;	/* accumulate pressure on slabs with few objects */
>  	unsigned flags;
>  
>  	/* These are for internal use */
> --- a/mm/vmscan.c~mmslabvmscan-accumulate-gradual-pressure-on-small-slabs
> +++ a/mm/vmscan.c
> @@ -488,18 +488,30 @@ static unsigned long do_shrink_slab(stru
>  		 * them aggressively under memory pressure to keep
>  		 * them from causing refetches in the IO caches.
>  		 */
> -		delta = freeable / 2;
> +		delta = (freeable + 1) / 2;
>  	}
>  
>  	/*
>  	 * Make sure we apply some minimal pressure on default priority
> -	 * even on small cgroups. Stale objects are not only consuming memory
> +	 * even on small cgroups, by accumulating pressure across multiple
> +	 * slab shrinker runs. Stale objects are not only consuming memory
>  	 * by themselves, but can also hold a reference to a dying cgroup,
>  	 * preventing it from being reclaimed. A dying cgroup with all
>  	 * corresponding structures like per-cpu stats and kmem caches
>  	 * can be really big, so it may lead to a significant waste of memory.
>  	 */
> -	delta = max_t(unsigned long long, delta, min(freeable, batch_size));
> +	if (!delta && shrinker->seeks) {
> +		unsigned long nr_considered;
> +
> +		shrinker->small_scan += freeable;
> +		nr_considered = shrinker->small_scan >> priority;
> +
> +		delta = 4 * nr_considered;
> +		do_div(delta, shrinker->seeks);
> +
> +		if (delta)
> +			shrinker->small_scan -= nr_considered << priority;
> +	}
>  
>  	total_scan += delta;
>  	if (total_scan < 0) {
> _
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] Revert "mm: don't reclaim inodes with many attached pages"
  2019-02-08  9:55                 ` Jan Kara
@ 2019-02-08 12:50                   ` Jan Kara
  2019-02-08 22:49                     ` Andrew Morton
  0 siblings, 1 reply; 20+ messages in thread
From: Jan Kara @ 2019-02-08 12:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Dave Chinner, Roman Gushchin, Michal Hocko,
	Chris Mason, linux-mm, linux-kernel, linux-fsdevel, linux-xfs,
	vdavydov.dev

On Fri 08-02-19 10:55:07, Jan Kara wrote:
> On Thu 07-02-19 21:37:27, Andrew Morton wrote:
> > On Thu, 7 Feb 2019 11:27:50 +0100 Jan Kara <jack@suse.cz> wrote:
> > 
> > > On Fri 01-02-19 09:19:04, Dave Chinner wrote:
> > > > Maybe for memcgs, but that's exactly the oppose of what we want to
> > > > do for global caches (e.g. filesystem metadata caches). We need to
> > > > make sure that a single, heavily pressured cache doesn't evict small
> > > > caches that lower pressure but are equally important for
> > > > performance.
> > > > 
> > > > e.g. I've noticed recently a significant increase in RMW cycles in
> > > > XFS inode cache writeback during various benchmarks. It hasn't
> > > > affected performance because the machine has IO and CPU to burn, but
> > > > on slower machines and storage, it will have a major impact.
> > > 
> > > Just as a data point, our performance testing infrastructure has bisected
> > > down to the commits discussed in this thread as the cause of about 40%
> > > regression in XFS file delete performance in bonnie++ benchmark.
> > > 
> > 
> > Has anyone done significant testing with Rik's maybe-fix?
> 
> I will give it a spin with bonnie++ today. We'll see what comes out.

OK, I did a bonnie++ run with Rik's patch (on top of 4.20 to rule out other
differences). This machine does not show so big differences in bonnie++
numbers but the difference is still clearly visible. The results are
(averages of 5 runs):

		 Revert			Base			Rik
SeqCreate del    78.04 (   0.00%)	98.18 ( -25.81%)	90.90 ( -16.48%)
RandCreate del   87.68 (   0.00%)	95.01 (  -8.36%)	87.66 (   0.03%)

'Revert' is 4.20 with "mm: don't reclaim inodes with many attached pages"
and "mm: slowly shrink slabs with a relatively small number of objects"
reverted. 'Base' is the kernel without any reverts. 'Rik' is a 4.20 with
Rik's patch applied.

The numbers are time to do a batch of deletes so lower is better. You can see
that the patch did help somewhat but it was not enough to close the gap
when files are deleted in 'readdir' order.

								Honza

> > From: Rik van Riel <riel@surriel.com>
> > Subject: mm, slab, vmscan: accumulate gradual pressure on small slabs
> > 
> > There are a few issues with the way the number of slab objects to scan is
> > calculated in do_shrink_slab.  First, for zero-seek slabs, we could leave
> > the last object around forever.  That could result in pinning a dying
> > cgroup into memory, instead of reclaiming it.  The fix for that is
> > trivial.
> > 
> > Secondly, small slabs receive much more pressure, relative to their size,
> > than larger slabs, due to "rounding up" the minimum number of scanned
> > objects to batch_size.
> > 
> > We can keep the pressure on all slabs equal relative to their size by
> > accumulating the scan pressure on small slabs over time, resulting in
> > sometimes scanning an object, instead of always scanning several.
> > 
> > This results in lower system CPU use, and a lower major fault rate, as
> > actively used entries from smaller caches get reclaimed less aggressively,
> > and need to be reloaded/recreated less often.
> > 
> > [akpm@linux-foundation.org: whitespace fixes, per Roman]
> > [riel@surriel.com: couple of fixes]
> >   Link: http://lkml.kernel.org/r/20190129142831.6a373403@imladris.surriel.com
> > Link: http://lkml.kernel.org/r/20190128143535.7767c397@imladris.surriel.com
> > Fixes: 4b85afbdacd2 ("mm: zero-seek shrinkers")
> > Fixes: 172b06c32b94 ("mm: slowly shrink slabs with a relatively small number of objects")
> > Signed-off-by: Rik van Riel <riel@surriel.com>
> > Tested-by: Chris Mason <clm@fb.com>
> > Acked-by: Roman Gushchin <guro@fb.com>
> > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Dave Chinner <dchinner@redhat.com>
> > Cc: Jonathan Lemon <bsd@fb.com>
> > Cc: Jan Kara <jack@suse.cz>
> > Cc: <stable@vger.kernel.org>
> > 
> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> > ---
> > 
> > 
> > --- a/include/linux/shrinker.h~mmslabvmscan-accumulate-gradual-pressure-on-small-slabs
> > +++ a/include/linux/shrinker.h
> > @@ -65,6 +65,7 @@ struct shrinker {
> >  
> >  	long batch;	/* reclaim batch size, 0 = default */
> >  	int seeks;	/* seeks to recreate an obj */
> > +	int small_scan;	/* accumulate pressure on slabs with few objects */
> >  	unsigned flags;
> >  
> >  	/* These are for internal use */
> > --- a/mm/vmscan.c~mmslabvmscan-accumulate-gradual-pressure-on-small-slabs
> > +++ a/mm/vmscan.c
> > @@ -488,18 +488,30 @@ static unsigned long do_shrink_slab(stru
> >  		 * them aggressively under memory pressure to keep
> >  		 * them from causing refetches in the IO caches.
> >  		 */
> > -		delta = freeable / 2;
> > +		delta = (freeable + 1) / 2;
> >  	}
> >  
> >  	/*
> >  	 * Make sure we apply some minimal pressure on default priority
> > -	 * even on small cgroups. Stale objects are not only consuming memory
> > +	 * even on small cgroups, by accumulating pressure across multiple
> > +	 * slab shrinker runs. Stale objects are not only consuming memory
> >  	 * by themselves, but can also hold a reference to a dying cgroup,
> >  	 * preventing it from being reclaimed. A dying cgroup with all
> >  	 * corresponding structures like per-cpu stats and kmem caches
> >  	 * can be really big, so it may lead to a significant waste of memory.
> >  	 */
> > -	delta = max_t(unsigned long long, delta, min(freeable, batch_size));
> > +	if (!delta && shrinker->seeks) {
> > +		unsigned long nr_considered;
> > +
> > +		shrinker->small_scan += freeable;
> > +		nr_considered = shrinker->small_scan >> priority;
> > +
> > +		delta = 4 * nr_considered;
> > +		do_div(delta, shrinker->seeks);
> > +
> > +		if (delta)
> > +			shrinker->small_scan -= nr_considered << priority;
> > +	}
> >  
> >  	total_scan += delta;
> >  	if (total_scan < 0) {
> > _
> > 
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] Revert "mm: don't reclaim inodes with many attached pages"
  2019-02-08  5:37               ` Andrew Morton
  2019-02-08  9:55                 ` Jan Kara
@ 2019-02-08 21:25                 ` Dave Chinner
  1 sibling, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2019-02-08 21:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Roman Gushchin, Michal Hocko, Chris Mason, linux-mm,
	linux-kernel, linux-fsdevel, linux-xfs, vdavydov.dev

On Thu, Feb 07, 2019 at 09:37:27PM -0800, Andrew Morton wrote:
> On Thu, 7 Feb 2019 11:27:50 +0100 Jan Kara <jack@suse.cz> wrote:
> 
> > On Fri 01-02-19 09:19:04, Dave Chinner wrote:
> > > Maybe for memcgs, but that's exactly the oppose of what we want to
> > > do for global caches (e.g. filesystem metadata caches). We need to
> > > make sure that a single, heavily pressured cache doesn't evict small
> > > caches that lower pressure but are equally important for
> > > performance.
> > > 
> > > e.g. I've noticed recently a significant increase in RMW cycles in
> > > XFS inode cache writeback during various benchmarks. It hasn't
> > > affected performance because the machine has IO and CPU to burn, but
> > > on slower machines and storage, it will have a major impact.
> > 
> > Just as a data point, our performance testing infrastructure has bisected
> > down to the commits discussed in this thread as the cause of about 40%
> > regression in XFS file delete performance in bonnie++ benchmark.
> > 
> 
> Has anyone done significant testing with Rik's maybe-fix?

Apart from pointing out all the bugs and incorrect algorithmic
assumptions it makes, no.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] Revert "mm: don't reclaim inodes with many attached pages"
  2019-02-08 12:50                   ` Jan Kara
@ 2019-02-08 22:49                     ` Andrew Morton
  2019-02-09  3:42                       ` Roman Gushchin
  0 siblings, 1 reply; 20+ messages in thread
From: Andrew Morton @ 2019-02-08 22:49 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dave Chinner, Roman Gushchin, Michal Hocko, Chris Mason,
	linux-mm, linux-kernel, linux-fsdevel, linux-xfs, vdavydov.dev

On Fri, 8 Feb 2019 13:50:49 +0100 Jan Kara <jack@suse.cz> wrote:

> > > Has anyone done significant testing with Rik's maybe-fix?
> > 
> > I will give it a spin with bonnie++ today. We'll see what comes out.
> 
> OK, I did a bonnie++ run with Rik's patch (on top of 4.20 to rule out other
> differences). This machine does not show so big differences in bonnie++
> numbers but the difference is still clearly visible. The results are
> (averages of 5 runs):
> 
> 		 Revert			Base			Rik
> SeqCreate del    78.04 (   0.00%)	98.18 ( -25.81%)	90.90 ( -16.48%)
> RandCreate del   87.68 (   0.00%)	95.01 (  -8.36%)	87.66 (   0.03%)
> 
> 'Revert' is 4.20 with "mm: don't reclaim inodes with many attached pages"
> and "mm: slowly shrink slabs with a relatively small number of objects"
> reverted. 'Base' is the kernel without any reverts. 'Rik' is a 4.20 with
> Rik's patch applied.
> 
> The numbers are time to do a batch of deletes so lower is better. You can see
> that the patch did help somewhat but it was not enough to close the gap
> when files are deleted in 'readdir' order.

OK, thanks.

I guess we need a rethink on Roman's fixes.   I'll queue the reverts.


BTW, one thing I don't think has been discussed (or noticed) is the
effect of "mm: don't reclaim inodes with many attached pages" on 32-bit
highmem machines.  Look why someone added that code in the first place:

: commit f9a316fa9099053a299851762aedbf12881cff42
: Author: Andrew Morton <akpm@digeo.com>
: Date:   Thu Oct 31 04:09:37 2002 -0800
: 
:     [PATCH] strip pagecache from to-be-reaped inodes
:     
:     With large highmem machines and many small cached files it is possible
:     to encounter ZONE_NORMAL allocation failures.  This can be demonstrated
:     with a large number of one-byte files on a 7G machine.
:     
:     All lowmem is filled with icache and all those inodes have a small
:     amount of highmem pagecache which makes them unfreeable.
:     
:     The patch strips the pagecache from inodes as they come off the tail of
:     the inode_unused list.
:     
:     I play tricks in there peeking at the head of the inode_unused list to
:     pick up the inode again after running iput().  The alternatives seemed
:     to involve more widespread changes.
:     
:     Or running invalidate_inode_pages() under inode_lock which would be a
:     bad thing from a scheduling latency and lock contention point of view.

I guess I should have added a comment.  Doh.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] Revert "mm: don't reclaim inodes with many attached pages"
  2019-02-08 22:49                     ` Andrew Morton
@ 2019-02-09  3:42                       ` Roman Gushchin
  0 siblings, 0 replies; 20+ messages in thread
From: Roman Gushchin @ 2019-02-09  3:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Dave Chinner, Michal Hocko, Chris Mason, linux-mm,
	linux-kernel, linux-fsdevel, linux-xfs, vdavydov.dev

On Fri, Feb 08, 2019 at 02:49:44PM -0800, Andrew Morton wrote:
> On Fri, 8 Feb 2019 13:50:49 +0100 Jan Kara <jack@suse.cz> wrote:
> 
> > > > Has anyone done significant testing with Rik's maybe-fix?
> > > 
> > > I will give it a spin with bonnie++ today. We'll see what comes out.
> > 
> > OK, I did a bonnie++ run with Rik's patch (on top of 4.20 to rule out other
> > differences). This machine does not show so big differences in bonnie++
> > numbers but the difference is still clearly visible. The results are
> > (averages of 5 runs):
> > 
> > 		 Revert			Base			Rik
> > SeqCreate del    78.04 (   0.00%)	98.18 ( -25.81%)	90.90 ( -16.48%)
> > RandCreate del   87.68 (   0.00%)	95.01 (  -8.36%)	87.66 (   0.03%)
> > 
> > 'Revert' is 4.20 with "mm: don't reclaim inodes with many attached pages"
> > and "mm: slowly shrink slabs with a relatively small number of objects"
> > reverted. 'Base' is the kernel without any reverts. 'Rik' is a 4.20 with
> > Rik's patch applied.
> > 
> > The numbers are time to do a batch of deletes so lower is better. You can see
> > that the patch did help somewhat but it was not enough to close the gap
> > when files are deleted in 'readdir' order.
> 
> OK, thanks.
> 
> I guess we need a rethink on Roman's fixes.   I'll queue the reverts.

Agree.

I still believe that machine-wide memory pressure should clean up any
remains of dead cgroups, and Rik's patch is a step in the right
direction. But we need to run some experiments and probably make some
code changes here to guarantee that we don't regress on performance.

> 
> 
> BTW, one thing I don't think has been discussed (or noticed) is the
> effect of "mm: don't reclaim inodes with many attached pages" on 32-bit
> highmem machines.  Look why someone added that code in the first place:
> 
> : commit f9a316fa9099053a299851762aedbf12881cff42
> : Author: Andrew Morton <akpm@digeo.com>
> : Date:   Thu Oct 31 04:09:37 2002 -0800
> : 
> :     [PATCH] strip pagecache from to-be-reaped inodes
> :     
> :     With large highmem machines and many small cached files it is possible
> :     to encounter ZONE_NORMAL allocation failures.  This can be demonstrated
> :     with a large number of one-byte files on a 7G machine.
> :     
> :     All lowmem is filled with icache and all those inodes have a small
> :     amount of highmem pagecache which makes them unfreeable.
> :     
> :     The patch strips the pagecache from inodes as they come off the tail of
> :     the inode_unused list.
> :     
> :     I play tricks in there peeking at the head of the inode_unused list to
> :     pick up the inode again after running iput().  The alternatives seemed
> :     to involve more widespread changes.
> :     
> :     Or running invalidate_inode_pages() under inode_lock which would be a
> :     bad thing from a scheduling latency and lock contention point of view.
> 
> I guess I should have added a comment.  Doh.
> 

It's a very useful link.

Thanks!

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] Revert "mm: don't reclaim inodes with many attached pages"
  2019-02-07 10:27             ` Jan Kara
  2019-02-08  5:37               ` Andrew Morton
@ 2019-02-11 15:34               ` Wolfgang Walter
  1 sibling, 0 replies; 20+ messages in thread
From: Wolfgang Walter @ 2019-02-11 15:34 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dave Chinner, Roman Gushchin, Michal Hocko, Chris Mason,
	linux-mm, linux-kernel, linux-fsdevel, linux-xfs, akpm,
	vdavydov.dev

On Thursday, 7 February 2019, 11:27:50, Jan Kara wrote:
> On Fri 01-02-19 09:19:04, Dave Chinner wrote:
> > Maybe for memcgs, but that's exactly the oppose of what we want to
> > do for global caches (e.g. filesystem metadata caches). We need to
> > make sure that a single, heavily pressured cache doesn't evict small
> > caches that lower pressure but are equally important for
> > performance.
> > 
> > e.g. I've noticed recently a significant increase in RMW cycles in
> > XFS inode cache writeback during various benchmarks. It hasn't
> > affected performance because the machine has IO and CPU to burn, but
> > on slower machines and storage, it will have a major impact.
> 
> Just as a data point, our performance testing infrastructure has bisected
> down to the commits discussed in this thread as the cause of about 40%
> regression in XFS file delete performance in bonnie++ benchmark.

We also bisected our big IO-performance problem on an imap server (starting
with 4.19.3) down to

	mm: don't reclaim inodes with many attached pages
	commit a76cf1a474d7dbcd9336b5f5afb0162baa142cf0 upstream.

On other servers the filesystems sometimes seem to hang for 10 seconds or
more.

We also see a performance regression compared to 4.14 even with this patch
reverted, but it is much less dramatic.

Now I saw this thread and I'll try to revert

172b06c32b949759fe6313abec514bc4f15014f4

and see if this helps.

Regards,
-- 
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2019-02-11 15:41 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-30  4:17 [PATCH 0/2] [REGRESSION v4.19-20] mm: shrinkers are now way too aggressive Dave Chinner
2019-01-30  4:17 ` [PATCH 1/2] Revert "mm: don't reclaim inodes with many attached pages" Dave Chinner
2019-01-30 12:21   ` Chris Mason
2019-01-31  1:34     ` Dave Chinner
2019-01-31  9:10       ` Michal Hocko
2019-01-31 18:57         ` Roman Gushchin
2019-01-31 22:19           ` Dave Chinner
2019-02-04 21:47             ` Dave Chinner
2019-02-07 10:27             ` Jan Kara
2019-02-08  5:37               ` Andrew Morton
2019-02-08  9:55                 ` Jan Kara
2019-02-08 12:50                   ` Jan Kara
2019-02-08 22:49                     ` Andrew Morton
2019-02-09  3:42                       ` Roman Gushchin
2019-02-08 21:25                 ` Dave Chinner
2019-02-11 15:34               ` Wolfgang Walter
2019-01-31 15:48       ` Chris Mason
2019-02-01 23:39         ` Dave Chinner
2019-01-30  4:17 ` [PATCH 2/2] Revert "mm: slowly shrink slabs with a relatively small number of objects" Dave Chinner
2019-01-30  5:48 ` [PATCH 0/2] [REGRESSION v4.19-20] mm: shrinkers are now way too aggressive Roman Gushchin
