From: Yang Shi
Date: Thu, 25 Feb 2021 09:00:16 -0800
Subject: Re: [v8 PATCH 00/13] Make shrinker's nr_deferred memcg aware
To: Roman Gushchin, Kirill Tkhai, Vlastimil Babka, Shakeel Butt, Dave Chinner, Johannes Weiner, Michal Hocko, Andrew Morton
Cc: Linux MM, Linux FS-devel Mailing List, Linux Kernel Mailing List
References: <20210217001322.2226796-1-shy828301@gmail.com>
In-Reply-To: <20210217001322.2226796-1-shy828301@gmail.com>
X-Mailing-List: linux-fsdevel@vger.kernel.org

Hi Andrew,

Just checking whether this series is on your radar. Patches 1/13 through
12/13 have been reviewed and acked. Vlastimil had some comments on patch
13/13, and I'm not sure whether he is going to continue reviewing that one.
I hope the last patch can get into the -mm tree along with the others so
that it gets broader testing. What do you think?

Thanks,
Yang

On Tue, Feb 16, 2021 at 4:13 PM Yang Shi wrote:
>
>
> Changelog
> v7 --> v8:
>     * Added lockdep assert in expand_shrinker_info() per Roman.
>     * Added patch 05/13 to use kvfree_rcu() instead of call_rcu() per Roman
>       and Kirill.
>     * Moved rwsem acquire/release out of unregister_memcg_shrinker() per Roman.
>     * Renamed count_nr_deferred_{memcg} to xchg_nr_deferred_{memcg} per Roman.
>     * Fixed the next_deferred logic per Vlastimil.
>     * Misc minor code cleanup, refactoring and spelling corrections per Roman
>       and Shakeel.
>     * Collected more ack and review tags from Roman, Shakeel and Vlastimil.
> v6 --> v7:
>     * Expanded shrinker_info in a batch of BITS_PER_LONG per Kirill.
>     * Added patch 06/12 to introduce a helper for dereferencing shrinker_info
>       per Kirill.
>     * Renamed set_nr_deferred_memcg to add_nr_deferred_memcg per Kirill.
>     * Collected Acked-by from Kirill.
> v5 --> v6:
>     * Rebased on top of https://lore.kernel.org/linux-mm/1611216029-34397-1-git-send-email-abaci-bugfix@linux.alibaba.com/
>       per Kirill.
>     * Don't register shrinker idr with NULL and remove idr_replace() per Vlastimil.
>     * Moved nr_deferred before map to guarantee the alignment per Vlastimil.
>     * Misc minor code cleanup and refactoring per Kirill and Vlastimil.
>     * Added Acked-by from Vlastimil for patch #1, #2, #3, #5, #9 and #10.
> v4 --> v5:
>     * Incorporated the comments from Kirill.
>     * Rebased to v5.11-rc5.
> v3 --> v4:
>     * Removed "memcg_" prefix for shrinker_maps related functions per Roman.
>     * Use write lock instead of read lock per Kirill. Also removed Johannes's ack
>       since write lock is used.
>     * Incorporated the comments from Kirill.
>     * Removed RFC.
>     * Rebased to v5.11-rc4.
> v2 --> v3:
>     * Moved shrinker_maps related code to vmscan.c per Dave.
>     * Removed memcg_shrinker_map_size. Calculated the size of map via shrinker_nr_max
>       per Johannes.
>     * Consolidated shrinker_deferred with shrinker_maps into one struct per Dave.
>     * Simplified the nr_deferred related code.
>     * Dropped the memory barrier from v2.
>     * Moved nr_deferred reparent code to vmscan.c per Dave.
>     * Added test coverage information in patch #11. Dave is concerned about the
>       potential regression. I didn't notice any regression in my tests, but
>       suggestions for more test coverage are definitely welcome. Having this patch
>       in the -mm tree and then linux-next may also help spot regressions, so I
>       kept it in this version.
>     * The code cleanup and consolidation resulted in the series growing to 11 patches.
>     * Rebased onto 5.11-rc2.
> v1 --> v2:
>     * Use shrinker->flags to store the new SHRINKER_REGISTERED flag per Roman.
>     * Folded patch #1 into patch #6 per Roman.
>     * Added memory barrier to prevent shrink_slab_memcg from seeing NULL shrinker_maps/
>       shrinker_deferred per Kirill.
>     * Removed memcg_shrinker_map_mutex. Protected shrinker_map/shrinker_deferred
>       allocations from expand with shrinker_rwsem per Johannes.
>
> Recently a huge one-off slab drop was seen on some vfs metadata heavy workloads.
> It turned out that the shrinker saw a huge number of accumulated nr_deferred
> objects.
>
> On our production machine, I saw an absurd number of nr_deferred objects, as
> shown in the tracing result below:
>
> <...>-48776 [032] .... 27970562.458916: mm_shrink_slab_start:
> super_cache_scan+0x0/0x1a0 ffff9a83046f3458: nid: 0 objects to shrink
> 2531805877005 gfp_flags GFP_HIGHUSER_MOVABLE pgs_scanned 32 lru_pgs
> 9300 cache items 1667 delta 11 total_scan 833
>
> There are 2.5 trillion deferred objects on one node. Assuming all of them
> are dentries (192 bytes per object), the total size of the deferred objects
> on one node is ~480TB, which is clearly ridiculous.
>
> I managed to reproduce this problem with a kernel build workload plus a
> negative dentry generator.
>
> First step, run the kernel build test script below:
>
> NR_CPUS=`cat /proc/cpuinfo | grep -e processor | wc -l`
>
> cd /root/Buildarea/linux-stable
>
> for i in `seq 1500`; do
>         cgcreate -g memory:kern_build
>         echo 4G > /sys/fs/cgroup/memory/kern_build/memory.limit_in_bytes
>
>         echo 3 > /proc/sys/vm/drop_caches
>         cgexec -g memory:kern_build make clean > /dev/null 2>&1
>         cgexec -g memory:kern_build make -j$NR_CPUS > /dev/null 2>&1
>
>         cgdelete -g memory:kern_build
> done
>
> Then run the negative dentry generator script below:
>
> NR_CPUS=`cat /proc/cpuinfo | grep -e processor | wc -l`
>
> mkdir /sys/fs/cgroup/memory/test
> echo $$ > /sys/fs/cgroup/memory/test/tasks
>
> for i in `seq $NR_CPUS`; do
>         while true; do
>                 FILE=`head /dev/urandom | tr -dc A-Za-z0-9 | head -c 64`
>                 cat $FILE 2>/dev/null
>         done &
> done
>
> Then kswapd will shrink half of the dentry cache in just one loop, as the
> tracing result below shows:
>
> kswapd0-475 [028] .... 305968.252561: mm_shrink_slab_start: super_cache_scan+0x0/0x190 0000000024acf00c: nid: 0
> objects to shrink 4994376020 gfp_flags GFP_KERNEL cache items 93689873 delta 45746 total_scan 46844936 priority 12
> kswapd0-475 [021] .... 306013.099399: mm_shrink_slab_end: super_cache_scan+0x0/0x190 0000000024acf00c: nid: 0 unused
> scan count 4994376020 new scan count 4947576838 total_scan 8 last shrinker return val 46844928
>
> There was a huge number of deferred objects before the shrinker was called.
> The behavior does match the code, but it might not be desirable from the
> user's standpoint.
>
> An excessive amount of nr_deferred might accumulate for various reasons, for example:
>     * GFP_NOFS allocation
>     * A significant number of small scans (< scan_batch, 1024 for vfs metadata)
>
> However, the LRUs of slabs are per memcg (for memcg-aware shrinkers) while the
> deferred objects are per shrinker. This may have some bad effects:
>     * Poor isolation among memcgs.
>       Some memcgs which happen to have frequent limit reclaim may get
>       nr_deferred accumulated to a huge number, then other innocent memcgs
>       may take the fall. In our case the main workload was hit.
>     * Unbounded deferred objects. There is no cap for deferred objects, so they
>       can grow ridiculously large, as the tracing result showed.
>     * Easy to get out of control. Although shrinkers take deferred objects into
>       account, the count can easily get out of control. One misconfigured memcg
>       could incur an absurd amount of deferred objects in a period of time.
>     * Various reclaim problems, e.g. over-reclaim, long reclaim latency, etc.
>       There may be hundred-GB slab caches for vfs metadata heavy workloads, and
>       shrinking half of them may take minutes. We observed latency spikes due to
>       the prolonged reclaim.
>
> These issues have also been discussed in https://lore.kernel.org/linux-mm/20200916185823.5347-1-shy828301@gmail.com/.
> The patchset is the outcome of that discussion.
>
> So this patchset makes nr_deferred per-memcg to tackle the problem. It does the following:
>     * Have memcg_shrinker_deferred per memcg per node, just like shrinker_map.
>       Unlike shrinker_map, it is an atomic_long_t array in which each element
>       represents one shrinker, even if the shrinker is not memcg aware; this
>       simplifies the implementation. For memcg aware shrinkers, the deferred
>       objects are just accumulated to their own memcg. The shrinkers just see
>       nr_deferred from their own memcg. Non memcg aware shrinkers still use the
>       global nr_deferred from struct shrinker.
>     * Once the memcg is offlined, its nr_deferred will be reparented to its parent
>       along with LRUs.
>     * The root memcg has a memcg_shrinker_deferred array too. It simplifies the
>       handling of reparenting to the root memcg.
>     * Cap nr_deferred to 2x the length of the lru. The idea is borrowed from Dave Chinner's
>       series (https://lore.kernel.org/linux-xfs/20191031234618.15403-1-david@fromorbit.com/)
>
> The downside is that each memcg has to allocate extra memory to store the nr_deferred array.
> In our production environment, there are typically around 40 shrinkers, so each memcg
> needs ~320 bytes. 10K memcgs would need ~3.2MB memory. It seems fine.
>
> We have been running the patched kernel on some hosts of our fleet (test and production) for
> months, and it works very well. The monitoring data shows the working set is sustained as expected.
>
> Yang Shi (13):
>       mm: vmscan: use nid from shrink_control for tracepoint
>       mm: vmscan: consolidate shrinker_maps handling code
>       mm: vmscan: use shrinker_rwsem to protect shrinker_maps allocation
>       mm: vmscan: remove memcg_shrinker_map_size
>       mm: vmscan: use kvfree_rcu instead of call_rcu
>       mm: memcontrol: rename shrinker_map to shrinker_info
>       mm: vmscan: add shrinker_info_protected() helper
>       mm: vmscan: use a new flag to indicate shrinker is registered
>       mm: vmscan: add per memcg shrinker nr_deferred
>       mm: vmscan: use per memcg nr_deferred of shrinker
>       mm: vmscan: don't need allocate shrinker->nr_deferred for memcg aware shrinkers
>       mm: memcontrol: reparent nr_deferred when memcg offline
>       mm: vmscan: shrink deferred objects proportional to priority
>
>  include/linux/memcontrol.h |  23 +++---
>  include/linux/shrinker.h   |   7 +-
>  mm/huge_memory.c           |   4 +-
>  mm/list_lru.c              |   6 +-
>  mm/memcontrol.c            | 130 +------------------------------
>  mm/vmscan.c                | 394 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------------------------
>  6 files changed, 319 insertions(+), 245 deletions(-)
>
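
For readers following the design points in the quoted cover letter (one
deferred counter per shrinker per memcg, a cap at 2x the LRU length,
reparenting on offline, and the ~320 bytes per memcg estimate), here is a
rough user-space model. It is only a sketch under assumptions and is not the
code from the series: the names memcg_deferred_model, cap_deferred(),
defer_work(), reparent_deferred() and deferred_overhead_bytes() are
hypothetical, plain longs stand in for atomic_long_t, there is no locking or
RCU, the per-node dimension is left out, and where exactly the series applies
the cap is not shown in this letter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model; NOT the kernel's data structures or API. */
#define MODEL_NR_SHRINKERS 40	/* "around 40 shrinkers" per the cover letter */

/*
 * One deferred counter per shrinker, kept per memcg. The real series keeps
 * this per memcg per node and uses atomic_long_t; plain longs are an
 * assumption made to keep the model simple.
 */
struct memcg_deferred_model {
	long nr_deferred[MODEL_NR_SHRINKERS];
};

/* Clamp deferred work to twice the current LRU length. */
static long cap_deferred(long deferred, long lru_len)
{
	long cap = 2 * lru_len;

	return deferred > cap ? cap : deferred;
}

/*
 * Add newly deferred work to one memcg's counter for one shrinker, apply the
 * cap, and return the stored value.
 */
static long defer_work(struct memcg_deferred_model *m, size_t shrinker_id,
		       long delta, long lru_len)
{
	assert(shrinker_id < MODEL_NR_SHRINKERS);
	m->nr_deferred[shrinker_id] =
		cap_deferred(m->nr_deferred[shrinker_id] + delta, lru_len);
	return m->nr_deferred[shrinker_id];
}

/* Fold a child's counters into its parent when the child goes offline. */
static void reparent_deferred(struct memcg_deferred_model *parent,
			      struct memcg_deferred_model *child)
{
	for (size_t i = 0; i < MODEL_NR_SHRINKERS; i++) {
		parent->nr_deferred[i] += child->nr_deferred[i];
		child->nr_deferred[i] = 0;
	}
}

/* Extra memory used by the counters alone for nr_memcgs memcgs. */
static size_t deferred_overhead_bytes(size_t nr_memcgs)
{
	return nr_memcgs * sizeof(struct memcg_deferred_model);
}
```

In this model, with the 1667 cache items from the first trace, cap_deferred()
would bound the carried-over work at 3334 objects no matter how much had
accumulated. On a build where long is 8 bytes, the struct is 320 bytes, so
deferred_overhead_bytes(10000) gives 3,200,000 bytes, which lines up with the
~3.2MB estimate above.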