From: Yang Shi <shy828301@gmail.com>
To: Roman Gushchin <guro@fb.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>,
	Shakeel Butt <shakeelb@google.com>,
	Dave Chinner <david@fromorbit.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@suse.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linux MM <linux-mm@kvack.org>,
	Linux FS-devel Mailing List <linux-fsdevel@vger.kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: [RFC PATCH 0/9] Make shrinker's nr_deferred memcg aware
Date: Thu, 3 Dec 2020 09:52:00 -0800
Message-ID: <CAHbLzkpUYsWLrA10ewyaeb2NiH56ZUTK2oBmb0X-_Om0M4B75w@mail.gmail.com>
In-Reply-To: <20201203025234.GD1375014@carbon.DHCP.thefacebook.com>

On Wed, Dec 2, 2020 at 6:52 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Wed, Dec 02, 2020 at 10:27:16AM -0800, Yang Shi wrote:
> >
> > Recently a huge one-off slab drop was seen on some vfs metadata heavy workloads.
> > It turned out that a huge number of accumulated nr_deferred objects was seen by the
> > shrinker.
> >
> > On our production machine, I saw an absurd number of nr_deferred objects, as shown in the
> > tracing result below:
> >
> > <...>-48776 [032] .... 27970562.458916: mm_shrink_slab_start:
> > super_cache_scan+0x0/0x1a0 ffff9a83046f3458: nid: 0 objects to shrink
> > 2531805877005 gfp_flags GFP_HIGHUSER_MOVABLE pgs_scanned 32 lru_pgs
> > 9300 cache items 1667 delta 11 total_scan 833
> >
> > There are 2.5 trillion deferred objects on one node. Assuming all of them
> > are dentries (192 bytes per object), the total size of deferred objects on
> > one node is ~480TB. It is definitely ridiculous.
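
The size estimate above can be rechecked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the object count is copied from the trace, the 192-byte dentry size is the assumption stated in the text, and the variable names are made up for illustration.

```python
# Back-of-the-envelope check of the deferred-object footprint.
NR_DEFERRED = 2_531_805_877_005   # "objects to shrink" from the trace above
DENTRY_SIZE = 192                 # bytes per dentry, as assumed in the text

total_bytes = NR_DEFERRED * DENTRY_SIZE
total_tb = total_bytes / 10**12   # decimal terabytes
total_tib = total_bytes / 2**40   # binary tebibytes

print(f"{total_bytes} bytes = {total_tb:.0f} TB = {total_tib:.0f} TiB")
```

The result is roughly 486 TB (about 442 TiB), which is in line with the ~480TB quoted above.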
> >
> > I managed to reproduce this problem with a kernel build workload plus a negative dentry
> > generator.
> >
> > First step, run the below kernel build test script:
> >
> > NR_CPUS=`cat /proc/cpuinfo | grep -e processor | wc -l`
> >
> > cd /root/Buildarea/linux-stable
> >
> > for i in `seq 1500`; do
> >         cgcreate -g memory:kern_build
> >         echo 4G > /sys/fs/cgroup/memory/kern_build/memory.limit_in_bytes
> >
> >         echo 3 > /proc/sys/vm/drop_caches
> >         cgexec -g memory:kern_build make clean > /dev/null 2>&1
> >         cgexec -g memory:kern_build make -j$NR_CPUS > /dev/null 2>&1
> >
> >         cgdelete -g memory:kern_build
> > done
> >
> > Then run the below negative dentry generator script:
> >
> > NR_CPUS=`cat /proc/cpuinfo | grep -e processor | wc -l`
> >
> > mkdir /sys/fs/cgroup/memory/test
> > echo $$ > /sys/fs/cgroup/memory/test/tasks
> >
> > for i in `seq $NR_CPUS`; do
> >         while true; do
> >                 FILE=`head /dev/urandom | tr -dc A-Za-z0-9 | head -c 64`
> >                 cat $FILE 2>/dev/null
> >         done &
> > done
> >
> > Then kswapd will shrink half of the dentry cache in just one loop, as the tracing result below
> > shows:
> >
> >       kswapd0-475   [028] .... 305968.252561: mm_shrink_slab_start: super_cache_scan+0x0/0x190 0000000024acf00c: nid: 0
> > objects to shrink 4994376020 gfp_flags GFP_KERNEL cache items 93689873 delta 45746 total_scan 46844936 priority 12
> >       kswapd0-475   [021] .... 306013.099399: mm_shrink_slab_end: super_cache_scan+0x0/0x190 0000000024acf00c: nid: 0 unused
> > scan count 4994376020 new scan count 4947576838 total_scan 8 last shrinker return val 46844928
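
The numbers in this trace can be reproduced with a small model of the scan-target arithmetic. The sketch below is loosely modelled on do_shrink_slab() in mm/vmscan.c of kernels from that period and is not the kernel code itself. The seeks value of 2 is an assumption (it does not appear in the trace), and the function names are hypothetical.

```python
# Toy model of the scan-target arithmetic; NOT the kernel code itself.
DEFAULT_SEEKS = 2   # assumed "seeks" value of the shrinker (not in the trace)
BATCH = 1024        # scan batch for vfs metadata, per the cover letter


def scan_target(nr_deferred, freeable, priority, seeks=DEFAULT_SEEKS):
    """Return (delta, total_scan) for one shrinker invocation."""
    delta = ((freeable >> priority) * 4) // seeks
    total_scan = nr_deferred + delta
    # When the new work is small, the backlog is clamped to half the cache.
    if delta < freeable // 4:
        total_scan = min(total_scan, freeable // 2)
    # Never ask for more than twice the cache.
    total_scan = min(total_scan, freeable * 2)
    return delta, total_scan


def scan_in_batches(total_scan, batch=BATCH):
    """Return (scanned, leftover) when only whole batches are scanned."""
    scanned = (total_scan // batch) * batch
    return scanned, total_scan - scanned


nr = 4_994_376_020          # "objects to shrink" in mm_shrink_slab_start
cache_items = 93_689_873    # "cache items" in mm_shrink_slab_start

delta, total_scan = scan_target(nr, cache_items, priority=12)
scanned, leftover = scan_in_batches(total_scan)
new_nr = nr + delta - scanned   # backlog carried over to the next call

print(delta, total_scan, scanned, leftover, new_nr)
```

Under these assumptions the model yields the same delta, total_scan, leftover total_scan and new scan count as the two trace lines, so the "half of the dentry cache" figure is simply the freeable / 2 clamp applied to a backlog of about 5 billion objects.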
> >
> > There was a huge number of deferred objects before the shrinker was called. The behavior
> > does match the code, but it might not be desirable from the user's standpoint.
> >
> > The excessive amount of nr_deferred might be accumulated due to various reasons, for example:
> >     * GFP_NOFS allocation
> >     * A significant number of small scans (< scan_batch, 1024 for vfs metadata)
> >
> > However, the LRUs of slabs are per memcg (memcg-aware shrinkers) but the deferred objects
> > are per shrinker. This may have some bad effects:
> >     * Poor isolation among memcgs. Some memcgs which happen to have frequent limit
> >       reclaim may get nr_deferred accumulated to a huge number, then other innocent
> >       memcgs may take the fall. In our case the main workload was hit.
> >     * Unbounded deferred objects. There is no cap for deferred objects, so they can grow
> >       ridiculously, as the tracing result showed.
> >     * Easy to get out of control. Although shrinkers take deferred objects into account,
> >       they can go out of control easily. One misconfigured memcg could incur an absurd
> >       amount of deferred objects in a period of time.
> >     * Various reclaim problems, e.g. over-reclaim, long reclaim latency, etc. There may be
> >       hundreds of GB of slab caches for vfs metadata heavy workloads; shrinking half of them may take
> >       minutes. We observed latency spikes due to the prolonged reclaim.
> >
> > These issues also have been discussed in https://lore.kernel.org/linux-mm/20200916185823.5347-1-shy828301@gmail.com/.
> > The patchset is the outcome of that discussion.
> >
> > So this patchset makes nr_deferred per-memcg to tackle the problem. It does:
> >     * Have memcg_shrinker_deferred per memcg per node, just like what shrinker_map
> >       does. However, it is an atomic_long_t array in which each element represents one shrinker,
> >       even if the shrinker is not memcg aware; this simplifies the implementation.
> >       For memcg aware shrinkers, the deferred objects are just accumulated to its own
> >       memcg. The shrinkers just see nr_deferred from its own memcg. Non memcg aware
> >       shrinkers still use global nr_deferred from struct shrinker.
> >     * Once the memcg is offlined, its nr_deferred will be reparented to its parent along
> >       with LRUs.
> >     * The root memcg has memcg_shrinker_deferred array too. It simplifies the handling of
> >       reparenting to root memcg.
> >     * Cap nr_deferred to 2x of the length of lru. The idea is borrowed from Dave Chinner's
> >       series (https://lore.kernel.org/linux-xfs/20191031234618.15403-1-david@fromorbit.com/)
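
The design points in this list can be pictured with a small toy model. It is only a sketch under stated assumptions and not the patchset's code: the Memcg class, add_deferred() and offline() are hypothetical names, a single node is modelled, and applying the 2x cap at accumulation time is an assumption about where the cap sits.

```python
from collections import defaultdict


class Memcg:
    """Hypothetical stand-in for a memory cgroup on a single node."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        # shrinker id -> deferred count (an atomic_long_t array in the patchset)
        self.nr_deferred = defaultdict(int)


def add_deferred(memcg, shrinker_id, nr, lru_len):
    """Accumulate deferred work for one memcg, capped at 2x the LRU length."""
    cap = 2 * lru_len
    memcg.nr_deferred[shrinker_id] = min(memcg.nr_deferred[shrinker_id] + nr, cap)
    return memcg.nr_deferred[shrinker_id]


def offline(memcg):
    """Move all deferred counts to the parent when the memcg goes offline."""
    if memcg.parent is None:
        return  # the root memcg has nowhere to reparent to
    for shrinker_id, nr in memcg.nr_deferred.items():
        memcg.parent.nr_deferred[shrinker_id] += nr
    memcg.nr_deferred.clear()


root = Memcg("root")
noisy = Memcg("noisy", parent=root)
quiet = Memcg("quiet", parent=root)

# A memcg under constant limit reclaim tries to defer a billion objects.
capped = add_deferred(noisy, shrinker_id=0, nr=10**9, lru_len=1000)
print(capped, quiet.nr_deferred[0])

offline(noisy)
print(root.nr_deferred[0], noisy.nr_deferred[0])
```

In this model the noisy memcg's backlog is bounded by its own LRU length, the quiet memcg never sees it, and the backlog moves to the parent only when the noisy memcg goes offline.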
> >
> > The downside is each memcg has to allocate extra memory to store the nr_deferred array.
> > On our production environment, there are typically around 40 shrinkers, so each memcg
> > needs ~320 bytes. 10K memcgs would need ~3.2MB memory. It seems fine.
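
The overhead estimate above works out as follows. This is a rough sketch: the 8-byte element size assumes a 64-bit kernel, the helper name is made up, and since the array is described as per memcg per node, the nr_nodes parameter is an assumption about how the figure scales on multi-node machines.

```python
SIZEOF_ATOMIC_LONG = 8  # bytes; assumes a 64-bit kernel


def deferred_array_bytes(nr_shrinkers, nr_memcgs=1, nr_nodes=1):
    """Memory needed for the nr_deferred arrays, in bytes."""
    return nr_shrinkers * SIZEOF_ATOMIC_LONG * nr_memcgs * nr_nodes


per_memcg = deferred_array_bytes(40)
fleet = deferred_array_bytes(40, nr_memcgs=10_000)
print(per_memcg, "bytes per memcg;", fleet / 10**6, "MB for 10K memcgs")
```

With 40 shrinkers this gives 320 bytes per memcg and 3.2 MB for 10K memcgs on one node, matching the figures quoted above.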
> >
> > We have been running the patched kernel on some hosts of our fleet (test and production) for
> > months, and it works very well. The monitoring data shows the working set is sustained as expected.
>
> Hello Yang!
>
> The rationale is very well described and makes perfect sense to me.
> I fully support the idea to make nr_deferred per-memcg.
> Thank you for such a detailed description!
>
> More comments in individual patches.

Thank you very much.

>
> Thanks!
>
> >
> > Yang Shi (9):
> >       mm: vmscan: simplify nr_deferred update code
> >       mm: vmscan: use nid from shrink_control for tracepoint
> >       mm: memcontrol: rename memcg_shrinker_map_mutex to memcg_shrinker_mutex
> >       mm: vmscan: use a new flag to indicate shrinker is registered
> >       mm: memcontrol: add per memcg shrinker nr_deferred
> >       mm: vmscan: use per memcg nr_deferred of shrinker
> >       mm: vmscan: don't need allocate shrinker->nr_deferred for memcg aware shrinkers
> >       mm: memcontrol: reparent nr_deferred when memcg offline
> >       mm: vmscan: shrink deferred objects proportional to priority
> >
> >  include/linux/memcontrol.h |   9 +++++
> >  include/linux/shrinker.h   |   8 ++++
> >  mm/memcontrol.c            | 148 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----
> >  mm/vmscan.c                | 183 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------
> >  4 files changed, 274 insertions(+), 74 deletions(-)
> >

Thread overview: 41+ messages
2020-12-02 18:27 [RFC PATCH 0/9] Make shrinker's nr_deferred memcg aware Yang Shi
2020-12-02 18:27 ` [PATCH 1/9] mm: vmscan: simplify nr_deferred update code Yang Shi
2020-12-03  2:56   ` Roman Gushchin
2020-12-02 18:27 ` [PATCH 2/9] mm: vmscan: use nid from shrink_control for tracepoint Yang Shi
2020-12-03  3:13   ` Xiaqing (A)
2020-12-11 19:20     ` Yang Shi
2020-12-02 18:27 ` [PATCH 3/9] mm: memcontrol: rename memcg_shrinker_map_mutex to memcg_shrinker_mutex Yang Shi
2020-12-02 18:27 ` [PATCH 4/9] mm: vmscan: use a new flag to indicate shrinker is registered Yang Shi
2020-12-03  3:01   ` Roman Gushchin
2020-12-03  4:59     ` Yang Shi
2020-12-03 20:08       ` Roman Gushchin
2020-12-03 22:25         ` Yang Shi
2020-12-04 18:52           ` Johannes Weiner
2020-12-04 21:24             ` Yang Shi
2020-12-02 18:27 ` [PATCH 5/9] mm: memcontrol: add per memcg shrinker nr_deferred Yang Shi
2020-12-03  3:06   ` Roman Gushchin
2020-12-03  4:54     ` Yang Shi
2020-12-03 18:03       ` Yang Shi
2020-12-03 20:07         ` Roman Gushchin
2020-12-03 22:49           ` Yang Shi
2020-12-03 23:30             ` Roman Gushchin
2020-12-04  0:22               ` Yang Shi
2020-12-10 15:33   ` Johannes Weiner
2020-12-10 19:12     ` Yang Shi
2020-12-11 17:52       ` Yang Shi
2020-12-10 21:59     ` Yang Shi
2020-12-02 18:27 ` [PATCH 6/9] mm: vmscan: use per memcg nr_deferred of shrinker Yang Shi
2020-12-03  3:08   ` Roman Gushchin
2020-12-03  5:01     ` Yang Shi
2020-12-03 11:40   ` Kirill Tkhai
2020-12-08 17:13     ` Yang Shi
2020-12-09 15:41       ` Kirill Tkhai
2020-12-09 17:32         ` Yang Shi
2020-12-10 15:13           ` Johannes Weiner
2020-12-10 15:17             ` Kirill Tkhai
2020-12-15 16:44               ` Johannes Weiner
2020-12-02 18:27 ` [PATCH 7/9] mm: vmscan: don't need allocate shrinker->nr_deferred for memcg aware shrinkers Yang Shi
2020-12-02 18:27 ` [PATCH 8/9] mm: memcontrol: reparent nr_deferred when memcg offline Yang Shi
2020-12-02 18:27 ` [PATCH 9/9] mm: vmscan: shrink deferred objects proportional to priority Yang Shi
2020-12-03  2:52 ` [RFC PATCH 0/9] Make shrinker's nr_deferred memcg aware Roman Gushchin
2020-12-03 17:52   ` Yang Shi [this message]
