From: Michal Hocko <mhocko@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-mm@kvack.org, linux-xfs@vger.kernel.org,
Mel Gorman <mgorman@suse.de>, Vlastimil Babka <vbabka@suse.cz>
Subject: Re: [PATCH] [Regression, v5.0] mm: boosted kswapd reclaim b0rks system cache balance
Date: Wed, 7 Aug 2019 11:30:56 +0200
Message-ID: <20190807093056.GS11812@dhcp22.suse.cz>
In-Reply-To: <20190807091858.2857-1-david@fromorbit.com>
[Cc Mel and Vlastimil as it seems like fallout from 1c30844d2dfe2]
On Wed 07-08-19 19:18:58, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> When running a simple, steady state 4kB file creation test to
> simulate extracting tarballs larger than memory full of small files
> into the filesystem, I noticed that once memory fills up the cache
> balance goes to hell.
>
> The workload is creating one dirty cached inode for every dirty
> page, both of which should require a single IO each to clean and
> reclaim, and creation of inodes is throttled by the rate at which
> dirty writeback runs at (via balance dirty pages). Hence the ingest
> rate of new cached inodes and page cache pages is identical and
> steady. As a result, memory reclaim should quickly find a steady
> balance between page cache and inode caches.
>
> It doesn't.
>
> The moment memory fills, the page cache is reclaimed at a much
> faster rate than the inode cache, and evidence suggests that the
> inode cache shrinker is not being called when large batches of pages
> are being reclaimed. In roughly the same time period that it takes
> to fill memory with 50% pages and 50% slab caches, memory reclaim
> reduces the page cache down to just dirty pages and slab caches fill
> the entirety of memory.
>
> At the point where the page cache is reduced to just the dirty
> pages, there is a clear change in write IO patterns. Up to this
> point it has been running at a steady 1500 write IOPS for ~200MB/s
> of write throughput (data, journal and metadata). Once the page
> cache is trimmed to just dirty pages, the write IOPS immediately
> start spiking to 5-10,000 IOPS and there is a noticeable change in
> IO sizes and completion times. The SSD is fast enough to soak these
> up, so the measured performance is only slightly affected (numbers
> below). It results in > ~50% throughput slowdowns on a spinning
> disk with an NVRAM RAID cache in front of it, though. I didn't
> capture the numbers at the time, and it takes far too long for me to
> care to run it again and get them.
>
> SSD perf degradation as the LRU empties to just dirty pages:
>
> FSUse% Count Size Files/sec App Overhead
> ......
> 0 4320000 4096 51349.6 1370049
> 0 4480000 4096 48492.9 1362823
> 0 4640000 4096 48473.8 1435623
> 0 4800000 4096 46846.6 1370959
> 0 4960000 4096 47086.6 1567401
> 0 5120000 4096 46288.8 1368056
> 0 5280000 4096 46003.2 1391604
> 0 5440000 4096 46033.4 1458440
> 0 5600000 4096 45035.1 1484132
> 0 5760000 4096 43757.6 1492594
> 0 5920000 4096 40739.4 1552104
> 0 6080000 4096 37309.4 1627355
> 0 6240000 4096 42003.3 1517357
> .....
> real 3m28.093s
> user 0m57.852s
> sys 14m28.193s
>
> Average rate: 51724.6+/-2.4e+04 files/sec.
>
> At first memory full point:
>
> MemFree: 432676 kB
> Active(file): 89820 kB
> Inactive(file): 7330892 kB
> Dirty: 1603576 kB
> Writeback: 2908 kB
> Slab: 6579384 kB
> SReclaimable: 3727612 kB
> SUnreclaim: 2851772 kB
>
> A few seconds later at about half the page cache reclaimed:
>
> MemFree: 1880588 kB
> Active(file): 89948 kB
> Inactive(file): 3021796 kB
> Dirty: 1097072 kB
> Writeback: 2600 kB
> Slab: 8900912 kB
> SReclaimable: 5060104 kB
> SUnreclaim: 3840808 kB
>
> And at about the 6080000 count point in the results above, right to
> the end of the test:
>
> MemFree: 574900 kB
> Active(file): 89856 kB
> Inactive(file): 483120 kB
> Dirty: 372436 kB
> Writeback: 324 kB
> KReclaimable: 6506496 kB
> Slab: 11898956 kB
> SReclaimable: 6506496 kB
> SUnreclaim: 5392460 kB
>
>
> So the LRU is largely full of dirty pages, and we're getting spikes
> of random writeback from memory reclaim so it's all going to shit.
> Behaviour never recovers, the page cache remains pinned at just dirty
> pages, and nothing I could tune would make any difference.
> vfs_cache_pressure makes no difference - I wound it up so high it
> should trim the entire inode cache in a single pass, yet it didn't
> do anything. It was clear from tracing and live telemetry that the
> shrinkers were pretty much not running except when there was
> absolutely no memory free at all, and then they did the minimum
> necessary to free memory to make progress.
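
For context on why vfs_cache_pressure cannot help here: the sysctl only
scales the object count the superblock shrinker reports back to
shrink_slab(), so if shrink_slab() is never invoked, no value of the
knob matters. Roughly, from my reading of fs/super.c and
include/linux/dcache.h around v5.0 (abridged, details may differ):

static inline unsigned long vfs_pressure_ratio(unsigned long val)
{
	/* scale the reported count by vfs_cache_pressure/100 */
	return mult_frac(val, sysctl_vfs_cache_pressure, 100);
}

static unsigned long super_cache_count(struct shrinker *shrink,
				       struct shrink_control *sc)
{
	struct super_block *sb = container_of(shrink,
					struct super_block, s_shrink);
	long total_objects;

	/* dentries and inodes on this superblock's LRU lists */
	total_objects  = list_lru_shrink_count(&sb->s_dentry_lru, sc);
	total_objects += list_lru_shrink_count(&sb->s_inode_lru, sc);

	/* vfs_cache_pressure only biases this reported number */
	return vfs_pressure_ratio(total_objects);
}

This count is only ever consulted from the shrinker core, which boosted
reclaim skips entirely.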
>
> So I went looking at the code, trying to find places where pages got
> reclaimed and the shrinkers weren't called. There's only one -
> kswapd doing boosted reclaim as per commit 1c30844d2dfe ("mm: reclaim
> small amounts of memory when an external fragmentation event
> occurs"). I'm not even using THP or allocating huge pages, so this
> code should not be active or have any effect on memory reclaim at
> all, yet the majority of reclaim is being done with "boost" and so
> it's not reclaiming slab caches at all. It will only free clean
> pages from the LRU.
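
For reference, this is roughly the gating 1c30844d2dfe added to
balance_pgdat() in mm/vmscan.c (abridged from my reading of v5.0, exact
context may differ slightly): while any zone has a boosted watermark,
kswapd neither writes back, swaps, nor runs the shrinkers, which matches
what you are seeing.

	/* sum up the boost that fragmentation events armed on each zone */
	nr_boost_reclaim = 0;
	for (i = 0; i <= classzone_idx; i++) {
		zone = pgdat->node_zones + i;
		if (!managed_zone(zone))
			continue;

		nr_boost_reclaim += zone->watermark_boost;
		zone_boosts[i] = zone->watermark_boost;
	}
	boosted = nr_boost_reclaim;
	...
		/*
		 * While boost reclaim is active, avoid sub-optimal IO and
		 * slab work from reclaim context: only clean page cache
		 * gets reclaimed.
		 */
		sc.may_writepage = !laptop_mode && !nr_boost_reclaim;
		sc.may_swap = !nr_boost_reclaim;
		sc.may_shrinkslab = !nr_boost_reclaim;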
>
> And so when we do run out of memory, it switches to normal reclaim,
> which hits dirty pages on the LRU and does some shrinker work, too,
> but then appears to switch back to boosted reclaim once watermarks
> are reached.
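
That oscillation follows from where the boost is armed. It is not tied
to THP at all; any allocation that has to steal a pageblock from another
migratetype counts as an "external fragmentation event", and with slab
growing as fast as page cache in your workload, unmovable allocations
will keep falling back into MOVABLE pageblocks and keep re-arming the
boost. A sketch of the arming side, abridged from my reading of
mm/page_alloc.c as added by 1c30844d2dfe (details from memory):

/* Called from steal_suitable_fallback() on a migratetype fallback;
 * kswapd is subsequently woken to reclaim up to the raised watermark. */
static inline void boost_watermark(struct zone *zone)
{
	unsigned long max_boost;

	if (!watermark_boost_factor)
		return;

	max_boost = mult_frac(zone->_watermark[WMARK_HIGH],
			      watermark_boost_factor, 10000);
	max_boost = max(pageblock_nr_pages, max_boost);

	/* bump the boost one pageblock per event, capped at max_boost */
	zone->watermark_boost = min(zone->watermark_boost + pageblock_nr_pages,
				    max_boost);
}

Once kswapd has consumed the boost it drops back to the normal
watermarks, and the next fallback arms it again, so the system can
ping-pong between the two reclaim modes indefinitely.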
>
> The patch below restores page cache vs inode cache balance for this
> steady state workload. It balances out at about 40% page cache, 60%
> slab cache, and sustained performance is 10-15% higher than without
> this patch because the IO patterns remain in control of dirty
> writeback and the filesystem, not kswapd.
>
> Performance with boosted reclaim also running shrinkers over the
> same steady state portion of the test as above:
>
> FSUse% Count Size Files/sec App Overhead
> ......
> 0 4320000 4096 51341.9 1409983
> 0 4480000 4096 51157.5 1486421
> 0 4640000 4096 52041.5 1421837
> 0 4800000 4096 52124.2 1442169
> 0 4960000 4096 56266.6 1388865
> 0 5120000 4096 52892.2 1357650
> 0 5280000 4096 51879.5 1326590
> 0 5440000 4096 52178.7 1362889
> 0 5600000 4096 53863.0 1345956
> 0 5760000 4096 52538.7 1321422
> 0 5920000 4096 53144.5 1336966
> 0 6080000 4096 53647.7 1372146
> 0 6240000 4096 52434.7 1362534
>
> .....
> real 3m11.543s
> user 0m57.506s
> sys 14m20.913s
>
> Average rate: 57634.2+/-2.8e+04 files/sec.
>
> So it completed ~10% faster (both wall time and create rate) and had
> far better IO patterns, so the differences would be even more
> pronounced on slow storage.
>
> This patch is not a fix, just a demonstration of the fact that the
> heuristics this "boosted reclaim for compaction" is based on are
> flawed, will have nasty side effects for users that don't use THP,
> and so need revisiting.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
> mm/vmscan.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 9034570febd9..702e6523f8ad 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2748,10 +2748,10 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> shrink_node_memcg(pgdat, memcg, sc, &lru_pages);
> node_lru_pages += lru_pages;
>
> - if (sc->may_shrinkslab) {
> + //if (sc->may_shrinkslab) {
> shrink_slab(sc->gfp_mask, pgdat->node_id,
> memcg, sc->priority);
> - }
> + //}
>
> /* Record the group's reclaim efficiency */
> vmpressure(sc->gfp_mask, memcg, false,
> --
> 2.23.0.rc1
--
Michal Hocko
SUSE Labs