From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE, SPF_PASS,URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id E36AFC433FE for ; Thu, 3 Dec 2020 17:52:16 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 340D3206F0 for ; Thu, 3 Dec 2020 17:52:16 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 340D3206F0 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=gmail.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 9EE436B006C; Thu, 3 Dec 2020 12:52:15 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 99F958D0007; Thu, 3 Dec 2020 12:52:15 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 840E18D0005; Thu, 3 Dec 2020 12:52:15 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10]) by kanga.kvack.org (Postfix) with ESMTP id 6D5E76B006C for ; Thu, 3 Dec 2020 12:52:15 -0500 (EST) Received: from smtpin13.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay01.hostedemail.com (Postfix) with ESMTP id 369CE180AD80F for ; Thu, 3 Dec 2020 17:52:15 +0000 (UTC) X-FDA: 77552715030.13.money94_0e037f6273bd Received: from filter.hostedemail.com (10.5.16.251.rfc1918.com [10.5.16.251]) by smtpin13.hostedemail.com (Postfix) with ESMTP id 1606A18140B60 for ; Thu, 3 Dec 2020 17:52:15 +0000 (UTC) X-HE-Tag: money94_0e037f6273bd X-Filterd-Recvd-Size: 10080 Received: from mail-ej1-f68.google.com (mail-ej1-f68.google.com [209.85.218.68]) by imf06.hostedemail.com (Postfix) with ESMTP for ; Thu, 3 Dec 2020 17:52:14 +0000 (UTC) Received: by mail-ej1-f68.google.com with SMTP id d17so4733384ejy.9 for ; Thu, 03 Dec 2020 09:52:14 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=O01jwosGNB3gmCv8Rr4mZTzNti9b8qJdxUNXInpIzQk=; b=o+XVpz4oieeV5yxvjK6g2STKFxWlsM8j8+16U3yUt6+LDuW85hsojm2r9JhatXQeNj /OrTUlJ/lwy33pq3BKCiWPYj76iolnf8yaUiVDS9dnh0+4fZ6VOI8UOMRS3Xl4G6SxHj pEgGTDCF0r/40bQQ9MUbrfHAABPexyRIeHL8zm9hh7XicUR3vOzWtQjICjAZdhoWogJu f1XKqOrBw2Em2ELxu7eU25wK+iEJsVzFZST7DqVvxqKQbZHkHPzZ4nVmInLAtKaOrNfY +wwIQddOQY2YyXJ03mbkJx0684jgb5PDgZEjDS6kaTuuvqt/5r/v5Qiet8YmY1UrLqvr rN4Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=O01jwosGNB3gmCv8Rr4mZTzNti9b8qJdxUNXInpIzQk=; b=IbFysPzBsC4rVsLjUEY2ECpmrIQbkS2uJj8FOBdI07iv/9PN89hESUe1iGD9lz3tWw aXHDafb/iMzP0kZSuEMFO9UXcKO16JUkzys7nmbDYeRo1zG+xrc0B8GmxTgO2Up7/Sh+ orKHWpZ2J36kTgIxyPA6SWHvYvtXbGqGkR4EB57KbIApC9SHN6xtK5i0o2vNjCY2iBHz rObCtXHUluK/rYYMM8V1rKFUhcxUthdhH72vonoGJ/c1jkStamTLpyv3CqgN4fNo4CTI rgdXHJ0iy6/ljz4X9rQYskALoENmhBJCr63+VfWstXZ2jYn64qOYovCdWqqna+Si/vY+ O7wA== X-Gm-Message-State: AOAM533fM98WvHnNWqdENW/qBXUJ7Mw811C2S7eDvcZHntQF9Ie2toZd 6CiW/Ss/FMpjGNOYuxkQ6w/vrjiZfXeBHhqtEHg= X-Google-Smtp-Source: ABdhPJyCGdGio02I1W/cxNZb8Z3pGKLpbYNaJ+56I4KPMZANly0H42flWsO1JEb9ma9Ng2SsVq6zrhItIm/aNIXLJEE= X-Received: by 2002:a17:906:6a45:: with SMTP id n5mr3644014ejs.514.1607017933270; Thu, 03 Dec 2020 09:52:13 -0800 (PST) MIME-Version: 1.0 References: <20201202182725.265020-1-shy828301@gmail.com> <20201203025234.GD1375014@carbon.DHCP.thefacebook.com> In-Reply-To: <20201203025234.GD1375014@carbon.DHCP.thefacebook.com> From: Yang Shi Date: Thu, 3 Dec 2020 09:52:00 -0800 Message-ID: Subject: Re: [RFC PATCH 0/9] Make shrinker's nr_deferred memcg aware To: Roman Gushchin Cc: Kirill Tkhai , Shakeel Butt , Dave Chinner , Johannes Weiner , Michal Hocko , Andrew Morton , Linux MM , Linux FS-devel Mailing List , Linux Kernel Mailing List Content-Type: text/plain; charset="UTF-8" X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Wed, Dec 2, 2020 at 6:52 PM Roman Gushchin wrote: > > On Wed, Dec 02, 2020 at 10:27:16AM -0800, Yang Shi wrote: > > > > Recently huge amount one-off slab drop was seen on some vfs metadata heavy workloads, > > it turned out there were huge amount accumulated nr_deferred objects seen by the > > shrinker. > > > > On our production machine, I saw absurd number of nr_deferred shown as the below > > tracing result: > > > > <...>-48776 [032] .... 27970562.458916: mm_shrink_slab_start: > > super_cache_scan+0x0/0x1a0 ffff9a83046f3458: nid: 0 objects to shrink > > 2531805877005 gfp_flags GFP_HIGHUSER_MOVABLE pgs_scanned 32 lru_pgs > > 9300 cache items 1667 delta 11 total_scan 833 > > > > There are 2.5 trillion deferred objects on one node, assuming all of them > > are dentry (192 bytes per object), so the total size of deferred on > > one node is ~480TB. It is definitely ridiculous. > > > > I managed to reproduce this problem with kernel build workload plus negative dentry > > generator. > > > > First step, run the below kernel build test script: > > > > NR_CPUS=`cat /proc/cpuinfo | grep -e processor | wc -l` > > > > cd /root/Buildarea/linux-stable > > > > for i in `seq 1500`; do > > cgcreate -g memory:kern_build > > echo 4G > /sys/fs/cgroup/memory/kern_build/memory.limit_in_bytes > > > > echo 3 > /proc/sys/vm/drop_caches > > cgexec -g memory:kern_build make clean > /dev/null 2>&1 > > cgexec -g memory:kern_build make -j$NR_CPUS > /dev/null 2>&1 > > > > cgdelete -g memory:kern_build > > done > > > > Then run the below negative dentry generator script: > > > > NR_CPUS=`cat /proc/cpuinfo | grep -e processor | wc -l` > > > > mkdir /sys/fs/cgroup/memory/test > > echo $$ > /sys/fs/cgroup/memory/test/tasks > > > > for i in `seq $NR_CPUS`; do > > while true; do > > FILE=`head /dev/urandom | tr -dc A-Za-z0-9 | head -c 64` > > cat $FILE 2>/dev/null > > done & > > done > > > > Then kswapd will shrink half of dentry cache in just one loop as the below tracing result > > showed: > > > > kswapd0-475 [028] .... 305968.252561: mm_shrink_slab_start: super_cache_scan+0x0/0x190 0000000024acf00c: nid: 0 > > objects to shrink 4994376020 gfp_flags GFP_KERNEL cache items 93689873 delta 45746 total_scan 46844936 priority 12 > > kswapd0-475 [021] .... 306013.099399: mm_shrink_slab_end: super_cache_scan+0x0/0x190 0000000024acf00c: nid: 0 unused > > scan count 4994376020 new scan count 4947576838 total_scan 8 last shrinker return val 46844928 > > > > There were huge number of deferred objects before the shrinker was called, the behavior > > does match the code but it might be not desirable from the user's stand of point. > > > > The excessive amount of nr_deferred might be accumulated due to various reasons, for example: > > * GFP_NOFS allocation > > * Significant times of small amount scan (< scan_batch, 1024 for vfs metadata) > > > > However the LRUs of slabs are per memcg (memcg-aware shrinkers) but the deferred objects > > is per shrinker, this may have some bad effects: > > * Poor isolation among memcgs. Some memcgs which happen to have frequent limit > > reclaim may get nr_deferred accumulated to a huge number, then other innocent > > memcgs may take the fall. In our case the main workload was hit. > > * Unbounded deferred objects. There is no cap for deferred objects, it can outgrow > > ridiculously as the tracing result showed. > > * Easy to get out of control. Although shrinkers take into account deferred objects, > > but it can go out of control easily. One misconfigured memcg could incur absurd > > amount of deferred objects in a period of time. > > * Sort of reclaim problems, i.e. over reclaim, long reclaim latency, etc. There may be > > hundred GB slab caches for vfe metadata heavy workload, shrink half of them may take > > minutes. We observed latency spike due to the prolonged reclaim. > > > > These issues also have been discussed in https://lore.kernel.org/linux-mm/20200916185823.5347-1-shy828301@gmail.com/. > > The patchset is the outcome of that discussion. > > > > So this patchset makes nr_deferred per-memcg to tackle the problem. It does: > > * Have memcg_shrinker_deferred per memcg per node, just like what shrinker_map > > does. Instead it is an atomic_long_t array, each element represent one shrinker > > even though the shrinker is not memcg aware, this simplifies the implementation. > > For memcg aware shrinkers, the deferred objects are just accumulated to its own > > memcg. The shrinkers just see nr_deferred from its own memcg. Non memcg aware > > shrinkers still use global nr_deferred from struct shrinker. > > * Once the memcg is offlined, its nr_deferred will be reparented to its parent along > > with LRUs. > > * The root memcg has memcg_shrinker_deferred array too. It simplifies the handling of > > reparenting to root memcg. > > * Cap nr_deferred to 2x of the length of lru. The idea is borrowed from Dave Chinner's > > series (https://lore.kernel.org/linux-xfs/20191031234618.15403-1-david@fromorbit.com/) > > > > The downside is each memcg has to allocate extra memory to store the nr_deferred array. > > On our production environment, there are typically around 40 shrinkers, so each memcg > > needs ~320 bytes. 10K memcgs would need ~3.2MB memory. It seems fine. > > > > We have been running the patched kernel on some hosts of our fleet (test and production) for > > months, it works very well. The monitor data shows the working set is sustained as expected. > > Hello Yang! > > The rationale is very well described and makes perfect sense to me. > I fully support the idea to make nr_deferred per-memcg. > Thank you for such a detailed description! > > More comments in individual patches. Thank you very much. > > Thanks! > > > > > Yang Shi (9): > > mm: vmscan: simplify nr_deferred update code > > mm: vmscan: use nid from shrink_control for tracepoint > > mm: memcontrol: rename memcg_shrinker_map_mutex to memcg_shrinker_mutex > > mm: vmscan: use a new flag to indicate shrinker is registered > > mm: memcontrol: add per memcg shrinker nr_deferred > > mm: vmscan: use per memcg nr_deferred of shrinker > > mm: vmscan: don't need allocate shrinker->nr_deferred for memcg aware shrinkers > > mm: memcontrol: reparent nr_deferred when memcg offline > > mm: vmscan: shrink deferred objects proportional to priority > > > > include/linux/memcontrol.h | 9 +++++ > > include/linux/shrinker.h | 8 ++++ > > mm/memcontrol.c | 148 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---- > > mm/vmscan.c | 183 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--------------------------------- > > 4 files changed, 274 insertions(+), 74 deletions(-) > >