From: Yang Shi <shy828301@gmail.com>
To: Roman Gushchin <guro@fb.com>, Kirill Tkhai <ktkhai@virtuozzo.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	Shakeel Butt <shakeelb@google.com>,
	Dave Chinner <david@fromorbit.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@suse.com>,
	Andrew Morton <akpm@linux-foundation.org>
Cc: Linux MM <linux-mm@kvack.org>,
	Linux FS-devel Mailing List <linux-fsdevel@vger.kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: [v8 PATCH 00/13] Make shrinker's nr_deferred memcg aware
Date: Thu, 25 Feb 2021 09:00:16 -0800
Message-ID: <CAHbLzkrEfeoofwJjncFDepcOxEKzqiAo8T7mowX2jJVCz5ikEA@mail.gmail.com>
In-Reply-To: <20210217001322.2226796-1-shy828301@gmail.com>

Hi Andrew,

Just checking whether this series is on your radar. Patches 1/13
through 12/13 have been reviewed and acked. Vlastimil had some
comments on patch 13/13; I'm not sure whether he is going to continue
reviewing that one. I hope the last patch can get into the -mm tree
along with the others so that it gets broader testing. What do you
think?

Thanks,
Yang

On Tue, Feb 16, 2021 at 4:13 PM Yang Shi <shy828301@gmail.com> wrote:
>
>
> Changelog
> v7 --> v8:
>     * Added lockdep assert in expand_shrinker_info() per Roman.
>     * Added patch 05/13 to use kvfree_rcu() instead of call_rcu() per Roman
>       and Kirill.
>     * Moved rwsem acquire/release out of unregister_memcg_shrinker() per Roman.
>     * Renamed count_nr_deferred_{memcg} to xchg_nr_deferred_{memcg} per Roman.
>     * Fixed the next_deferred logic per Vlastimil.
>     * Misc minor code cleanup, refactor and spelling correction per Roman
>       and Shakeel.
>     * Collected more ack and review tags from Roman, Shakeel and Vlastimil.
> v6 --> v7:
>     * Expanded shrinker_info in a batch of BITS_PER_LONG per Kirill.
>     * Added patch 06/12 to introduce a helper for dereferencing shrinker_info
>       per Kirill.
>     * Renamed set_nr_deferred_memcg to add_nr_deferred_memcg per Kirill.
>     * Collected Acked-by from Kirill.
> v5 --> v6:
>     * Rebased on top of https://lore.kernel.org/linux-mm/1611216029-34397-1-git-send-email-abaci-bugfix@linux.alibaba.com/
>       per Kirill.
>     * Don't register shrinker idr with NULL and remove idr_replace() per Vlastimil.
>     * Move nr_deferred before map to guarantee the alignment per Vlastimil.
>     * Misc minor code cleanup and refactor per Kirill and Vlastimil.
>     * Added Acked-by from Vlastimil for patch #1, #2, #3, #5, #9 and #10.
> v4 --> v5:
>     * Incorporated the comments from Kirill.
>     * Rebased to v5.11-rc5.
> v3 --> v4:
>     * Removed "memcg_" prefix for shrinker_maps related functions per Roman.
>     * Use write lock instead of read lock per Kirill. Also removed Johannes's ack
>       since write lock is used.
>     * Incorporated the comments from Kirill.
>     * Removed RFC.
>     * Rebased to v5.11-rc4.
> v2 --> v3:
>     * Moved shrinker_maps related code to vmscan.c per Dave.
>     * Removed memcg_shrinker_map_size. Calculated the size of map via shrinker_nr_max
>       per Johannes.
>     * Consolidated shrinker_deferred with shrinker_maps into one struct per Dave.
>     * Simplified the nr_deferred related code.
>     * Dropped the memory barrier from v2.
>     * Moved nr_deferred reparent code to vmscan.c per Dave.
>     * Added test coverage information in patch #11. Dave is concerned about
>       potential regressions. I didn't notice any regression in my tests, but suggestions
>       for more test coverage are definitely welcome. Having this patch in the -mm tree
>       and then the linux-next tree may help spot regressions, so I keep it in this version.
>     * The code cleanup and consolidation resulted in the series growing to 11 patches.
>     * Rebased onto 5.11-rc2.
> v1 --> v2:
>     * Use shrinker->flags to store the new SHRINKER_REGISTERED flag per Roman.
>     * Folded patch #1 into patch #6 per Roman.
>     * Added memory barrier to prevent shrink_slab_memcg from seeing NULL shrinker_maps/
>       shrinker_deferred per Kirill.
>     * Removed memcg_shrinker_map_mutex. Protected shrinker_map/shrinker_deferred
>       allocations from expand with shrinker_rwsem per Johannes.
>
> Recently a huge one-off slab drop was seen on some vfs metadata heavy workloads.
> It turned out that the shrinker was seeing a huge number of accumulated
> nr_deferred objects.
>
> On our production machine, I saw an absurd nr_deferred value, as shown in the
> tracing result below:
>
> <...>-48776 [032] .... 27970562.458916: mm_shrink_slab_start:
> super_cache_scan+0x0/0x1a0 ffff9a83046f3458: nid: 0 objects to shrink
> 2531805877005 gfp_flags GFP_HIGHUSER_MOVABLE pgs_scanned 32 lru_pgs
> 9300 cache items 1667 delta 11 total_scan 833
>
> There are 2.5 trillion deferred objects on one node. Assuming all of them
> are dentries (192 bytes per object), the total size of the deferred objects
> on one node is ~480TB. It is definitely ridiculous.
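
As a quick sanity check of the ~480TB figure, the multiplication can be done directly. This is only a back-of-the-envelope sketch: the object count is the "objects to shrink" value from the trace, the 192-byte dentry size is the one assumed in the text, and the helper name deferred_bytes is made up for illustration.

```python
# Back-of-the-envelope check of the deferred-object footprint quoted above.
NR_DEFERRED = 2_531_805_877_005   # "objects to shrink" from the trace
DENTRY_SIZE = 192                 # bytes per dentry, as assumed in the text


def deferred_bytes(nr_objects: int, object_size: int) -> int:
    """Bytes the deferred objects would occupy if they all really existed."""
    return nr_objects * object_size


total = deferred_bytes(NR_DEFERRED, DENTRY_SIZE)
# Report in decimal terabytes, the unit the estimate in the text is closest to.
print(f"{total} bytes, about {total / 10**12:.0f} TB")
```

The product comes out a little above 480 decimal terabytes, so the rounded figure in the text holds.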
>
> I managed to reproduce this problem with a kernel build workload plus a negative
> dentry generator.
>
> First, run the kernel build test script below:
>
> NR_CPUS=`cat /proc/cpuinfo | grep -e processor | wc -l`
>
> cd /root/Buildarea/linux-stable
>
> for i in `seq 1500`; do
>         cgcreate -g memory:kern_build
>         echo 4G > /sys/fs/cgroup/memory/kern_build/memory.limit_in_bytes
>
>         echo 3 > /proc/sys/vm/drop_caches
>         cgexec -g memory:kern_build make clean > /dev/null 2>&1
>         cgexec -g memory:kern_build make -j$NR_CPUS > /dev/null 2>&1
>
>         cgdelete -g memory:kern_build
> done
>
> Then run the negative dentry generator script below:
>
> NR_CPUS=`cat /proc/cpuinfo | grep -e processor | wc -l`
>
> mkdir /sys/fs/cgroup/memory/test
> echo $$ > /sys/fs/cgroup/memory/test/tasks
>
> for i in `seq $NR_CPUS`; do
>         while true; do
>                 FILE=`head /dev/urandom | tr -dc A-Za-z0-9 | head -c 64`
>                 cat $FILE 2>/dev/null
>         done &
> done
>
> Then kswapd shrinks half of the dentry cache in just one loop, as the tracing result
> below shows:
>
>         kswapd0-475   [028] .... 305968.252561: mm_shrink_slab_start: super_cache_scan+0x0/0x190 0000000024acf00c: nid: 0
> objects to shrink 4994376020 gfp_flags GFP_KERNEL cache items 93689873 delta 45746 total_scan 46844936 priority 12
>         kswapd0-475   [021] .... 306013.099399: mm_shrink_slab_end: super_cache_scan+0x0/0x190 0000000024acf00c: nid: 0 unused
> scan count 4994376020 new scan count 4947576838 total_scan 8 last shrinker return val 46844928
>
> There was a huge number of deferred objects before the shrinker was called. The behavior
> does match the code, but it might not be desirable from the user's point of view.
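
To see why the behavior "does match the code", here is a minimal user-space model of the scan-target arithmetic. It is a sketch under assumptions, not the kernel implementation: it assumes seeks is 2, that delta is derived from freeable >> priority, that the scan is limited to half of the cache when delta is small, and that the shrinker fully scans every 1024-object batch. The function names scan_target and run_once are made up for illustration.

```python
# User-space model of the scan-target arithmetic; not kernel code.
DEFAULT_SEEKS = 2   # assumption: 'seeks' value of the shrinker in the trace
BATCH_SIZE = 1024   # scan batch for vfs metadata, as stated in the text


def scan_target(nr_deferred, freeable, priority, seeks=DEFAULT_SEEKS):
    """Return (delta, total_scan, carried) for one shrinker invocation."""
    delta = ((freeable >> priority) * 4) // seeks
    carried = nr_deferred + delta          # uncapped work, basis for re-deferral
    total_scan = carried
    if delta < freeable // 4:
        # Small delta: never scan more than half of the cache in one go.
        total_scan = min(total_scan, freeable // 2)
    total_scan = min(total_scan, freeable * 2)
    return delta, total_scan, carried


def run_once(nr_deferred, freeable, priority):
    """Scan whole batches only and return (scanned, new_nr_deferred)."""
    _, total_scan, carried = scan_target(nr_deferred, freeable, priority)
    scanned = (total_scan // BATCH_SIZE) * BATCH_SIZE
    return scanned, max(carried - scanned, 0)


# Inputs taken from the mm_shrink_slab_start line of the kswapd trace.
delta, total_scan, _ = scan_target(4994376020, 93689873, 12)
scanned, new_deferred = run_once(4994376020, 93689873, 12)
print(delta, total_scan, scanned, new_deferred)
```

With the inputs from the trace, this model gives delta 45746, total_scan 46844936, a scanned count of 46844928 and a new deferred count of 4947576838, the same numbers the two trace lines report. The deferred count dwarfs delta, so the half-of-cache limit is what decides how much gets scanned.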
>
> The excessive amount of nr_deferred might accumulate for various reasons, for example:
>     * GFP_NOFS allocations
>     * A large number of small scans (< scan_batch, 1024 for vfs metadata)
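
The GFP_NOFS case can be illustrated with a small model. It is a sketch under stated assumptions: a filesystem shrinker called without __GFP_FS returns without scanning anything, the whole scan target is carried over to the next call, and the cache size stays constant. The function name deferred_after_nofs_calls is hypothetical.

```python
# Model of nr_deferred growth when every reclaim attempt is GFP_NOFS.
def deferred_after_nofs_calls(freeable, priority, calls, seeks=2, nr_deferred=0):
    """Return nr_deferred after 'calls' reclaim attempts that scan nothing."""
    for _ in range(calls):
        delta = ((freeable >> priority) * 4) // seeks
        carried = nr_deferred + delta
        scanned = 0  # assumption: the shrinker bails out under GFP_NOFS
        nr_deferred = max(carried - scanned, 0)
    return nr_deferred


# Cache size and priority borrowed from the kswapd trace above.
print(deferred_after_nofs_calls(93689873, 12, calls=1000))
```

In this model each call adds one delta and nothing is ever subtracted, so the count grows linearly with the number of NOFS reclaim attempts.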
>
> However, the LRUs of slabs are per memcg (memcg-aware shrinkers) while the deferred objects
> are per shrinker. This may have some bad effects:
>     * Poor isolation among memcgs. Some memcgs which happen to have frequent limit
>       reclaim may accumulate a huge nr_deferred, then other innocent
>       memcgs may take the fall. In our case the main workload was hit.
>     * Unbounded deferred objects. There is no cap on deferred objects, so the count can
>       grow ridiculously large, as the tracing result showed.
>     * Easy to get out of control. Although shrinkers take deferred objects into account,
>       the count can go out of control easily. One misconfigured memcg could incur an absurd
>       amount of deferred objects in a period of time.
>     * Various reclaim problems, e.g. over-reclaim, long reclaim latency, etc. There may be
>       hundreds of GB of slab caches for a vfs metadata heavy workload, and shrinking half of
>       them may take minutes. We observed latency spikes due to the prolonged reclaim.
>
> These issues also have been discussed in https://lore.kernel.org/linux-mm/20200916185823.5347-1-shy828301@gmail.com/.
> The patchset is the outcome of that discussion.
>
> So this patchset makes nr_deferred per-memcg to tackle the problem. It does the following:
>     * Have memcg_shrinker_deferred per memcg per node, just like shrinker_map.
>       Unlike shrinker_map, it is an atomic_long_t array in which each element represents
>       one shrinker, even if the shrinker is not memcg aware; this simplifies the implementation.
>       For memcg aware shrinkers, the deferred objects are just accumulated to their own
>       memcg. The shrinkers only see nr_deferred from their own memcg. Non memcg aware
>       shrinkers still use the global nr_deferred from struct shrinker.
>     * Once the memcg is offlined, its nr_deferred is reparented to its parent along
>       with the LRUs.
>     * The root memcg has a memcg_shrinker_deferred array too. It simplifies the handling of
>       reparenting to the root memcg.
>     * Cap nr_deferred to 2x the length of the lru. The idea is borrowed from Dave Chinner's
>       series (https://lore.kernel.org/linux-xfs/20191031234618.15403-1-david@fromorbit.com/)
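
The bookkeeping described in the list above can be modelled roughly as follows. This is a user-space sketch, not the code in the series: the class and function names are hypothetical (they only echo helper names mentioned in the changelog), plain integers stand in for atomic_long_t, and applying the 2x cap when deferred work is added is a simplification chosen for the example.

```python
# Toy model of per-memcg, per-node nr_deferred arrays with reparenting.
class Memcg:
    """Hypothetical stand-in for a memcg holding per-node nr_deferred arrays."""

    def __init__(self, name, nr_shrinkers, nr_nodes=1, parent=None):
        self.name = name
        self.parent = parent
        # nr_deferred[nid][shrinker_id]; one slot per registered shrinker
        self.nr_deferred = [[0] * nr_shrinkers for _ in range(nr_nodes)]


def add_nr_deferred(memcg, nid, shrinker_id, nr, lru_len):
    """Add deferred work, capped at 2x the LRU length (cap placement assumed)."""
    capped = min(memcg.nr_deferred[nid][shrinker_id] + nr, 2 * lru_len)
    memcg.nr_deferred[nid][shrinker_id] = capped
    return capped


def xchg_nr_deferred(memcg, nid, shrinker_id):
    """Take the deferred count for this memcg and reset it to zero."""
    old = memcg.nr_deferred[nid][shrinker_id]
    memcg.nr_deferred[nid][shrinker_id] = 0
    return old


def reparent_nr_deferred(memcg):
    """On offline, fold every counter into the parent memcg."""
    for nid, per_node in enumerate(memcg.nr_deferred):
        for sid, nr in enumerate(per_node):
            memcg.parent.nr_deferred[nid][sid] += nr
            per_node[sid] = 0


root = Memcg("root", nr_shrinkers=4)
noisy = Memcg("noisy", nr_shrinkers=4, parent=root)
quiet = Memcg("quiet", nr_shrinkers=4, parent=root)

add_nr_deferred(noisy, 0, 1, 10_000, lru_len=1_000)   # capped at 2 * 1000
print(noisy.nr_deferred[0][1], quiet.nr_deferred[0][1])
reparent_nr_deferred(noisy)                           # "noisy" goes offline
print(root.nr_deferred[0][1], noisy.nr_deferred[0][1])
```

In the model the deferred work of the noisy memcg never shows up in the quiet sibling, and it moves to the root only when the noisy memcg is offlined.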
>
> The downside is that each memcg has to allocate extra memory to store the nr_deferred array.
> In our production environment, there are typically around 40 shrinkers, so each memcg
> needs ~320 bytes. 10K memcgs would need ~3.2MB of memory. It seems fine.
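
The overhead estimate can be reproduced with a few lines. This sketch assumes an 8-byte atomic_long_t (a 64-bit kernel) and a single node by default; the nr_nodes parameter and the function name nr_deferred_overhead are additions for illustration only.

```python
# Rough memory overhead of the per-memcg nr_deferred arrays.
ATOMIC_LONG_BYTES = 8  # assumption: atomic_long_t on a 64-bit kernel


def nr_deferred_overhead(nr_shrinkers, nr_memcgs, nr_nodes=1):
    """Bytes needed for one counter per shrinker, per node, per memcg."""
    return nr_shrinkers * ATOMIC_LONG_BYTES * nr_nodes * nr_memcgs


per_memcg = nr_deferred_overhead(40, 1)
fleet = nr_deferred_overhead(40, 10_000)
print(per_memcg, "bytes per memcg;", fleet / 10**6, "MB for 10K memcgs")
```

With 40 shrinkers this gives 320 bytes per memcg and 3.2MB for 10K memcgs, matching the figures in the text. Since the array is per node, passing a larger nr_nodes scales the estimate accordingly.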
>
> We have been running the patched kernel on some hosts of our fleet (test and production) for
> months, and it works very well. The monitoring data shows the working set is sustained as expected.
>
> Yang Shi (13):
>       mm: vmscan: use nid from shrink_control for tracepoint
>       mm: vmscan: consolidate shrinker_maps handling code
>       mm: vmscan: use shrinker_rwsem to protect shrinker_maps allocation
>       mm: vmscan: remove memcg_shrinker_map_size
>       mm: vmscan: use kvfree_rcu instead of call_rcu
>       mm: memcontrol: rename shrinker_map to shrinker_info
>       mm: vmscan: add shrinker_info_protected() helper
>       mm: vmscan: use a new flag to indicate shrinker is registered
>       mm: vmscan: add per memcg shrinker nr_deferred
>       mm: vmscan: use per memcg nr_deferred of shrinker
>       mm: vmscan: don't need allocate shrinker->nr_deferred for memcg aware shrinkers
>       mm: memcontrol: reparent nr_deferred when memcg offline
>       mm: vmscan: shrink deferred objects proportional to priority
>
>  include/linux/memcontrol.h |  23 +++---
>  include/linux/shrinker.h   |   7 +-
>  mm/huge_memory.c           |   4 +-
>  mm/list_lru.c              |   6 +-
>  mm/memcontrol.c            | 130 +------------------------------
>  mm/vmscan.c                | 394 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------------------------
>  6 files changed, 319 insertions(+), 245 deletions(-)
>
