From: Yang Shi
To: guro@fb.com, ktkhai@virtuozzo.com, shakeelb@google.com, david@fromorbit.com, hannes@cmpxchg.org, mhocko@suse.com, akpm@linux-foundation.org
Cc: shy828301@gmail.com, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [RFC v2 PATCH 0/9] Make shrinker's nr_deferred memcg aware
Date: Mon, 14 Dec 2020 14:37:13 -0800
Message-Id: <20201214223722.232537-1-shy828301@gmail.com>

Changelog
v1 --> v2:
    * Use shrinker->flags to store the new SHRINKER_REGISTERED flag per Roman.
    * Folded patch #1 into patch #6 per Roman.
    * Added memory barrier to prevent shrink_slab_memcg from seeing NULL
      shrinker_maps/shrinker_deferred per Kirill.
    * Removed memcg_shrinker_map_mutex. Protected shrinker_map/shrinker_deferred
      allocations from expand with shrinker_rwsem per Johannes.

Recently a huge one-off slab drop was seen on some vfs metadata heavy workloads.
It turned out that the shrinker was seeing a huge number of accumulated
nr_deferred objects.

On our production machine, I saw an absurd nr_deferred value, as shown in the
tracing result below:

<...>-48776 [032] .... 27970562.458916: mm_shrink_slab_start:
super_cache_scan+0x0/0x1a0 ffff9a83046f3458: nid: 0 objects to shrink
2531805877005 gfp_flags GFP_HIGHUSER_MOVABLE pgs_scanned 32 lru_pgs
9300 cache items 1667 delta 11 total_scan 833

There are 2.5 trillion deferred objects on one node. Assuming all of them
are dentries (192 bytes per object), the total size of deferred objects on
one node is ~480TB. It is definitely ridiculous.
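As a rough check of the estimate above, the arithmetic can be reproduced in a
few lines of Python. This is only a sketch: the 192-byte object size is the
figure assumed in the text, and the constant and function names are made up
for illustration.

```python
# Back-of-the-envelope check of the size implied by the traced nr_deferred.
# Assumption (taken from the text): every deferred object is a 192-byte dentry.

NR_DEFERRED = 2_531_805_877_005   # "objects to shrink" value from the trace
DENTRY_SIZE = 192                 # bytes per object, as assumed in the text


def deferred_bytes(nr_objects: int, object_size: int) -> int:
    """Total bytes represented by nr_objects objects of object_size bytes each."""
    return nr_objects * object_size


total = deferred_bytes(NR_DEFERRED, DENTRY_SIZE)
print(f"{total} bytes")
print(f"{total / 10**12:.0f} TB (decimal), {total / 2**40:.0f} TiB (binary)")
```

The product lands in the same range as the ~480TB quoted above, which is far
more than any single node could hold in slab caches.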
I managed to reproduce this problem with a kernel build workload plus a
negative dentry generator.

First step, run the kernel build test script below:

    NR_CPUS=`cat /proc/cpuinfo | grep -e processor | wc -l`

    cd /root/Buildarea/linux-stable

    # Build the kernel repeatedly inside a 4G memory cgroup
    for i in `seq 1500`; do
            cgcreate -g memory:kern_build
            echo 4G > /sys/fs/cgroup/memory/kern_build/memory.limit_in_bytes

            echo 3 > /proc/sys/vm/drop_caches
            cgexec -g memory:kern_build make clean > /dev/null 2>&1
            cgexec -g memory:kern_build make -j$NR_CPUS > /dev/null 2>&1

            cgdelete -g memory:kern_build
    done

Then run the negative dentry generator script below:

    NR_CPUS=`cat /proc/cpuinfo | grep -e processor | wc -l`

    mkdir /sys/fs/cgroup/memory/test
    echo $$ > /sys/fs/cgroup/memory/test/tasks

    # One background loop per CPU, each looking up random nonexistent files
    for i in `seq $NR_CPUS`; do
            while true; do
                    FILE=`head /dev/urandom | tr -dc A-Za-z0-9 | head -c 64`
                    cat $FILE 2>/dev/null
            done &
    done

Then kswapd will shrink half of the dentry cache in just one loop, as the
tracing result below shows:

kswapd0-475 [028] .... 305968.252561: mm_shrink_slab_start: super_cache_scan+0x0/0x190 0000000024acf00c: nid: 0
objects to shrink 4994376020 gfp_flags GFP_KERNEL cache items 93689873 delta 45746 total_scan 46844936 priority 12
kswapd0-475 [021] .... 306013.099399: mm_shrink_slab_end: super_cache_scan+0x0/0x190 0000000024acf00c: nid: 0 unused
scan count 4994376020 new scan count 4947576838 total_scan 8 last shrinker return val 46844928

There was a huge number of deferred objects before the shrinker was called.
The behavior does match the code, but it might not be desirable from the
user's standpoint.

An excessive amount of nr_deferred might accumulate for various reasons,
for example:
    * GFP_NOFS allocation
    * A large number of small scans (< scan_batch, 1024 for vfs metadata)

However, the LRUs of slabs are per memcg (memcg-aware shrinkers) but the
deferred objects are per shrinker. This may have some bad effects:
    * Poor isolation among memcgs.
      Some memcgs which happen to have frequent limit reclaim may get
      nr_deferred accumulated to a huge number, then other innocent memcgs
      may take the fall. In our case the main workload was hit.
    * Unbounded deferred objects. There is no cap for deferred objects, so
      the count can grow ridiculously large, as the tracing result showed.
    * Easy to get out of control. Although shrinkers take deferred objects
      into account, the count can go out of control easily. One
      misconfigured memcg could incur an absurd amount of deferred objects
      in a period of time.
    * Various reclaim problems, e.g. over reclaim, long reclaim latency,
      etc. There may be hundreds of GB of slab caches for a vfs metadata
      heavy workload, and shrinking half of them may take minutes. We
      observed latency spikes due to the prolonged reclaim.

These issues have also been discussed in
https://lore.kernel.org/linux-mm/20200916185823.5347-1-shy828301@gmail.com/.
The patchset is the outcome of that discussion.

So this patchset makes nr_deferred per-memcg to tackle the problem. It does:
    * Have memcg_shrinker_deferred per memcg per node, just like what
      shrinker_map does. Instead it is an atomic_long_t array, in which each
      element represents one shrinker even if that shrinker is not memcg
      aware. This simplifies the implementation. For memcg aware shrinkers,
      the deferred objects are just accumulated to their own memcg. The
      shrinkers just see nr_deferred from their own memcg. Non memcg aware
      shrinkers still use the global nr_deferred from struct shrinker.
    * Once the memcg is offlined, its nr_deferred will be reparented to its
      parent along with LRUs.
    * The root memcg has a memcg_shrinker_deferred array too. It simplifies
      the handling of reparenting to root memcg.
    * Cap nr_deferred to 2x of the length of lru. The idea is borrowed from
      Dave Chinner's series
      (https://lore.kernel.org/linux-xfs/20191031234618.15403-1-david@fromorbit.com/)

The downside is each memcg has to allocate extra memory to store the
nr_deferred array.
On our production environment, there are typically around 40 shrinkers, so
each memcg needs ~320 bytes (40 x 8 bytes, with an 8-byte atomic_long_t on
64-bit). 10K memcgs would need ~3.2MB memory. It seems fine.

We have been running the patched kernel on some hosts of our fleet (test and
production) for months, and it works very well. The monitoring data shows the
working set is sustained as expected.

Yang Shi (9):
      mm: vmscan: use nid from shrink_control for tracepoint
      mm: memcontrol: use shrinker_rwsem to protect shrinker_maps allocation
      mm: vmscan: guarantee shrinker_slab_memcg() sees valid shrinker_maps for online memcg
      mm: vmscan: use a new flag to indicate shrinker is registered
      mm: memcontrol: add per memcg shrinker nr_deferred
      mm: vmscan: use per memcg nr_deferred of shrinker
      mm: vmscan: don't need allocate shrinker->nr_deferred for memcg aware shrinkers
      mm: memcontrol: reparent nr_deferred when memcg offline
      mm: vmscan: shrink deferred objects proportional to priority

 include/linux/memcontrol.h |   9 +++++
 include/linux/shrinker.h   |  11 ++++--
 mm/internal.h              |   1 +
 mm/memcontrol.c            | 156 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------
 mm/vmscan.c                | 193 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------
 5 files changed, 285 insertions(+), 85 deletions(-)
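
The per-memcg accounting described in this cover letter can be pictured with
a small userspace model. This is a hedged sketch, not the kernel code: the
Memcg class and the add_deferred() and offline() helpers are hypothetical
names invented here, a single node is assumed, plain integers stand in for
atomic_long_t, and applying the 2x cap at accumulation time is an assumption
of the sketch rather than a statement about where the patches apply it.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Memcg:
    """Toy stand-in for a memory cgroup on one node (hypothetical type)."""
    name: str
    parent: Optional["Memcg"] = None
    # shrinker id -> deferred object count; models one array slot per shrinker
    nr_deferred: Dict[int, int] = field(default_factory=dict)


def add_deferred(memcg: Memcg, shrinker_id: int, nr: int, lru_len: int) -> int:
    """Add deferred work to the memcg's own slot, capped at 2x the LRU length."""
    total = memcg.nr_deferred.get(shrinker_id, 0) + nr
    capped = min(total, 2 * lru_len)
    memcg.nr_deferred[shrinker_id] = capped
    return capped


def offline(memcg: Memcg) -> None:
    """Move every deferred count to the parent when the memcg goes offline."""
    if memcg.parent is None:
        raise ValueError("the root memcg cannot be offlined in this model")
    parent = memcg.parent
    for shrinker_id, nr in memcg.nr_deferred.items():
        parent.nr_deferred[shrinker_id] = parent.nr_deferred.get(shrinker_id, 0) + nr
    memcg.nr_deferred.clear()


root = Memcg("root")
noisy = Memcg("noisy", parent=root)
quiet = Memcg("quiet", parent=root)

# The noisy memcg defers far more than its LRU holds; the cap bounds it.
add_deferred(noisy, shrinker_id=0, nr=10_000, lru_len=1_000)
# The quiet memcg only ever sees its own, much smaller count.
add_deferred(quiet, shrinker_id=0, nr=10, lru_len=1_000)

# Offlining the noisy memcg hands its count to the root, not to its sibling.
offline(noisy)
print(root.nr_deferred, noisy.nr_deferred, quiet.nr_deferred)
```

In this model the sibling memcg's count is unaffected by the noisy one, which
is the isolation property the series is after. Giving the root its own slots
is what lets offline() treat every parent the same way.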